In Week 7 we derived the approximate bias of the 2SLS estimator:
$$ \text{E}(\hat{\beta}_{\text{2SLS}} - \beta) \;\approx\; \frac{L \cdot \rho}{F} $$where $L$ is the number of instruments, $\rho$ is the degree of endogeneity $\text{Cov}(v_i, u_i)$, and $F$ is the population first-stage F-statistic. The key message: the larger the F, the smaller the bias. This motivates today's central question: how large does F need to be for instruments to be considered "strong"?
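To get a feel for the magnitudes, here is a tiny standalone sketch that evaluates the bias approximation for the F values used later in this notebook (the choice of $L = 1$ and $\rho = 0.5$ is purely illustrative):

```julia
# Approximate 2SLS bias L·ρ/F for one instrument (L = 1) and ρ = 0.5
approx_bias(L, rho, F) = L * rho / F

for F in (2.3, 10.0, 73.75)
    println("F = $F  =>  bias ≈ ", round(approx_bias(1, 0.5, F), digits=3))
end
# F = 2.3    =>  bias ≈ 0.217
# F = 10.0   =>  bias ≈ 0.05
# F = 73.75  =>  bias ≈ 0.007
```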
Staiger & Stock (1997) proposed the Rule of Thumb that a first-stage (sample) F above 10 qualifies instruments as strong. Stock & Yogo (2005) refined this into a set of critical values that depend on the model configuration. By the end of this notebook you will understand exactly where these numbers come from.
We use the toy model from Keane & Neal "A Practical Guide to Weak Instruments" (Annual Review of Economics, 2024), p. 193:
$$ \begin{align*} Y_i &= \beta X_i + u_i\\ X_i &= \pi Z_i + v_i\\ v_i &= \rho\, u_i + \sqrt{1-\rho^2}\, \eta_i \end{align*} $$where
$u_i \sim N(0,1)$
$\eta_i \sim N(0,1)$
$Z_i \sim N(0,1)$
$\beta = 0$
notice: $\sigma_v^2 = \operatorname{Var}(v_i) = 1$
Parameter $\rho$ controls the OLS bias (aka the degree of endogeneity):
$$ \operatorname{Cov}(u_i, v_i) = \operatorname{E}(u_i v_i) = \operatorname{E}(\rho u_i^2) + \operatorname{E}(\sqrt{1-\rho^2}\, u_i \eta_i) = \rho \operatorname{E}(u_i^2) = \rho $$For small values of $\pi$, the value of $\rho$ pins down the OLS bias:
$$ \hat{\beta}_\text{OLS} - \beta = \frac{\sum X_i u_i}{\sum X_i^2} = \frac{\sum (\pi Z_i + v_i) u_i}{\sum (\pi Z_i + v_i)^2} \to_p \frac{\operatorname{E}(u_i v_i)}{\pi^2 + 1} = \frac{\rho}{\pi^2 + 1} \approx \rho, $$when $\pi$ is small.
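A quick standalone simulation (it does not use the helper functions defined below; the values of $n$, $\rho$, and $\pi$ are chosen purely for illustration) confirms that the OLS bias is close to $\rho$ when $\pi$ is small:

```julia
using Random, Statistics

Random.seed!(1)
n, rho, pi_ = 100_000, 0.5, 0.05       # small pi => very weak instrument
u   = randn(n)
eta = randn(n)
z   = randn(n)
v   = rho .* u .+ sqrt(1 - rho^2) .* eta
x   = pi_ .* z .+ v                    # with beta = 0 the outcome is y = u
bias_ols = sum(x .* u) / sum(x .^ 2)   # OLS estimate of beta minus true beta
# bias_ols should be close to rho = 0.5
```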
Parameter $\pi$ governs instrument strength and is set indirectly via the population F-statistic. Recall from Week 7 (and equation 11 of Neal & Keane):
$$ F = N \cdot \frac{R^2}{1-R^2} = N \cdot \frac{\text{ESS}}{\text{RSS}} = N \cdot \frac{\text{Var}(Z\pi)}{\sigma_v^2} = N \cdot \frac{\pi^2 \text{Var}(Z)}{\sigma_v^2} = N \cdot \pi^2 $$(using $\operatorname{Var}(Z)=1$ and $\sigma_v^2=1$). Therefore $\pi = \sqrt{F/N}$, which enables us to set $\pi$ in the above DGP by changing the value of $F$.
The cells below define the simulation machinery used throughout this lecture. Each function is self-contained with its own docstring. Feel free to skim them now and refer back as needed.
using Distributions, Random, Statistics
function dgp_keane_neal(; b=0, n=1000, F, rho)
"""
Generates one sample of size n following the DGP of Keane & Neal (2024) p. 193.
### Input
- `b` -- structural coefficient β (default 0)
- `n` -- sample size (default 1000)
- `F` -- population first-stage F-statistic
- `rho` -- degree of endogeneity ρ
### Output (named tuple)
- `x`, `y`, `z` -- (n×1) vectors for regressor, outcome, and instrument
"""
π = sqrt(F / n)
u = randn(n)
eta = randn(n)
z = randn(n)
v = rho * u + sqrt(1 - rho^2) * eta
x = π * z .+ v
y = b * x .+ u
return (; x, y, z)
end
function ols_estimator(x, y)
"""
OLS estimator for the simple linear model y = βx + u.
### Input
- `x` -- (n×1) regressor vector
- `y` -- (n×1) outcome vector
### Output (named tuple)
- `bhat` -- OLS estimate of β
- `se` -- standard error
- `t` -- t-statistic
"""
bhat = x \ y
uhat = y - x * bhat
s = (uhat' * uhat) / length(y)
se = sqrt(s / (x' * x))
t = bhat / se
return (; bhat, se, t)
end
function iv_estimator(x, y, z)
"""
Just-identified IV estimator with one endogenous variable and one instrument.
### Input
- `x` -- (n×1) endogenous regressor
- `y` -- (n×1) outcome vector
- `z` -- (n×1) instrument vector
### Output (named tuple)
- `bhat` -- IV estimate of β
- `se` -- standard error (Keane & Neal 2024, p. 190)
- `t` -- t-statistic
### Notes
The SE uses the first-stage ESS = N·π̂²·Var(z) as the relevance measure.
"""
bhat = (z' * y) / (z' * x) # β̂_IV = (z'y)/(z'x)
n = length(y)
pihat = z \ x # first-stage coefficient π̂
ESS = n * pihat^2 * var(z) # first-stage explained sum of squares
uhat = y - x * bhat
s = (uhat' * uhat) / n
se = sqrt(s / ESS)
t = bhat / se
return (; bhat, se, t)
end
function simulate_distribution(; b=0, F, rho, n=1000, rep=10000)
"""
Monte Carlo simulation of OLS and IV estimator distributions.
Creates `rep` independent datasets from dgp_keane_neal and collects
estimates, standard errors, and t-statistics for OLS, IV, and the
Anderson-Rubin (AR) test.
### Input
- `b` -- true structural coefficient β (default 0)
- `F` -- population first-stage F-statistic
- `rho` -- degree of endogeneity ρ
- `n` -- sample size (default 1000)
- `rep` -- number of Monte Carlo replications (default 10,000)
### Output (named tuple of rep-length vectors)
- `bols_dst`, `sols_dst`, `tols_dst` -- OLS estimate, SE, t-statistic
- `biv_dst`, `siv_dst`, `tiv_dst` -- IV estimate, SE, t-statistic
- `ar_dst` -- AR t-statistic (OLS of Y on Z)
"""
bols_dst = Vector{Float64}(undef, rep)
sols_dst = Vector{Float64}(undef, rep)
tols_dst = Vector{Float64}(undef, rep)
biv_dst = Vector{Float64}(undef, rep)
siv_dst = Vector{Float64}(undef, rep)
tiv_dst = Vector{Float64}(undef, rep)
ar_dst = Vector{Float64}(undef, rep)
for i in 1:rep
x, y, z = dgp_keane_neal(b=b, F=F, rho=rho, n=n)
bols_dst[i], sols_dst[i], tols_dst[i] = ols_estimator(x, y)
biv_dst[i], siv_dst[i], tiv_dst[i] = iv_estimator(x, y, z)
ar_dst[i] = ols_estimator(z, y).t # AR: regress Y on Z directly
end
return (; bols_dst, sols_dst, tols_dst, biv_dst, siv_dst, tiv_dst, ar_dst)
end
We explore nine parameter combinations that vary instrument strength and the degree of endogeneity:
| DGP | Population F | $\rho$ | Description |
|---|---|---|---|
| 1 | 1.82 | 0.10 | weak IV, low endogeneity |
| 2 | 2.30 | 0.10 | rule of thumb IV, low endogeneity |
| 3 | 73.75 | 0.10 | strong IV, low endogeneity (benchmark) |
| 4 | 1.82 | 0.50 | weak IV, moderate endogeneity |
| 5 | 2.30 | 0.50 | rule of thumb IV, moderate endogeneity |
| 6 | 73.75 | 0.50 | strong IV, moderate endogeneity |
| 7 | 1.82 | 0.90 | weak IV, high endogeneity |
| 8 | 2.30 | 0.90 | rule of thumb IV, high endogeneity |
| 9 | 73.75 | 0.90 | strong IV, high endogeneity |
Why these specific F values? The three F values correspond to Stock & Yogo critical values we will derive later in this notebook:
| Population F | Sample F (95th pct) | Worst-case empirical size |
|---|---|---|
| 1.82 | ≈ 8.96 | 15% |
| 2.30 | = 10.00 | 13.5% (Rule of Thumb) |
| 73.75 | ≈ 104.70 | ≈ 5% (nominal level) |
DGPs 3, 6, 9 serve as our benchmark: they represent how things should look when instruments are strong.
For each of the nine DGPs we generate 10,000 samples of size $n = 1{,}000$.
Random.seed!(1234) # set seed for reproducibility
parms_rho = (0.10, 0.50, 0.90)
parms_F = (1.82, 2.30, 73.75)
dgps = [simulate_distribution(rho=rho, F=F) for rho in parms_rho, F in parms_F];
We plot the empirical distribution of the IV t-statistic across all nine DGPs.
Under standard regularity conditions, the IV t-statistic $t_{\text{IV}} = \hat{\beta}_{\text{IV}} / \text{se}_{IV}$ converges to $N(0,1)$. The solid curve below shows that $N(0,1)$ density as a reference. The histograms are normalised to density so that they are comparable to this reference curve.
Question to keep in mind: Do all nine distributions look approximately normal?
using Plots, LaTeXStrings
using Plots.PlotMeasures: mm
Plots.theme(:wong2)
gr(fmt=:png)
# ── Global defaults (apply to all subsequent plots) ──
default(
fontfamily = "Computer Modern",
titlefontsize = 13,
guidefontsize = 11,
tickfontsize = 9,
legendfontsize = 9,
left_margin = 12mm,
bottom_margin = 10mm,
gridalpha = 0.15,
framestyle = :box,
lw = 2,
size = (900, 500)
)
xgrid = range(-4, 4, length=200)
plt = plot(layout=(length(parms_rho), length(parms_F)),
size=(1000, 700),
plot_title="Empirical Distribution of IV t-statistic for Different DGPs",
plot_titlefontsize=14)
for (i, rho) in enumerate(parms_rho)
for (j, F) in enumerate(parms_F)
k = length(parms_F) * (i-1) + j
histogram!(plt,
dgps[i,j].tiv_dst,
normalize=true,
subplot=k,
bins=range(-4, 4, length=51),
color="#6C9BC2",
fillalpha=0.5,
linecolor=:white,
linewidth=0.5,
legend=false,
title=L"\rho=%$(rho),\; F=%$(F)",
titlefontsize=10)
plot!(plt, xgrid, pdf.(Normal(0,1), xgrid),
subplot=k, lc="#0072B2", lw=2, label=false)
end
end
display(plt)
The solid line is the $N(0,1)$ density — the ideal target.
Top row ($\rho = 0.10$): The top-right panel (strong IV, $F = 73.75$) is our benchmark. It tracks the reference curve closely. The top-left and top-centre panels deviate modestly.
Middle and bottom rows ($\rho = 0.50$ and $\rho = 0.90$): As endogeneity increases, departures from normality worsen. The bottom-left panel shows heavy tails.
Why does this matter? When we report significance at the "5% level", we compare $|t_{\text{IV}}|$ to 1.96, which is the 97.5th percentile of $N(0,1)$. If the actual distribution of $t_{\text{IV}}$ is not $N(0,1)$, this comparison is unreliable — we might reject the null far more (or less) often than we intend.
Empirical size is the actual probability of rejecting a true null hypothesis:
$$ \text{empirical size} := \Pr\!\left(|t_{\text{IV}}| > 1.96 \;\Big|\; \beta = 0\right) $$In our DGP we set $\beta = 0$, so every rejection is a false rejection (Type I error). The nominal size is 5%. If the empirical size exceeds 5%, our test rejects too often — we are over-confident in declaring effects "significant".
We can estimate empirical size directly from our simulations: it is simply the fraction of the 10,000 replications for which $|t_{\text{IV}}| > 1.96$.
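For example, reusing `simulate_distribution` from above (with a reduced replication count for speed; the particular $F$ and $\rho$ are just an illustration), the empirical size of one configuration takes two lines:

```julia
# Fraction of false rejections across replications: the empirical size
sim = simulate_distribution(F=73.75, rho=0.9, rep=2000)
mean(abs.(sim.tiv_dst) .> 1.96)   # close to the 5% nominal level for this strong IV
```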
Stock and Yogo (2005) formalise when instruments are "strong enough" by asking how large the population F must be so that, even in the worst case $\rho = 1$, the empirical size of the IV t-test stays below a chosen tolerance.
By conditioning on the worst case $\rho = 1$, the threshold is conservative: if your instruments pass this test, they pass under any smaller $\rho$ too.
Let's run the simulation:
Random.seed!(2024)
F_range = 0:.25:20
emp_size = [mean(abs.(simulate_distribution(F=F, rho=1).tiv_dst) .> 1.96) for F in F_range]
plot(F_range, emp_size,
legend=false,
xlabel=L"Population $F$",
ylabel="Empirical Size",
xticks=0:2:20,
yticks=0:0.05:1,
size=(900, 500),
lw=2.5,
lc="#0072B2")
hline!([0.15], linestyle=:dash, lc="#009E73", lw=1.5)
hline!([0.05], linestyle=:dash, lc="#009E73", lw=1.5)
ylims!(0, 1)
title!(L"Empirical Size of IV $t$-test vs Population $F$ (worst case: $\rho$ = 1)")
When population F is large (above ≈ 20), the empirical size approaches the 5% nominal level even under the extreme assumption $\rho = 1$.
For weaker instruments, Stock and Yogo accept that the empirical size cannot be held at 5%. Instead, they ask:
*"What is the minimum population F that keeps the worst-case empirical size below a tolerance $\alpha^*$?"*
Common choices for $\alpha^*$: 10% or 15%.
For practitioners, the question is more naturally framed in terms of the sample F (the quantity actually reported in software) rather than the population F. We now relate these two to each other.
Recall: population $F$ is given by $F = N \pi^2$ (see equation 11 in Keane & Neal).
In applications, the first stage $F$ is estimated via $\widehat{F} := t^2 = \left( \frac{\widehat{\pi}}{\text{se}(\widehat{\pi})} \right)^2 = \frac{\widehat{\pi}^2}{\widehat{\operatorname{Var}}(\widehat{\pi})} = \frac{\widehat{\pi}^2}{\widehat{\sigma}^2_v/(N \cdot \widehat{\sigma}^2_Z)} =N \widehat{\pi}^2 \frac{\widehat{\sigma}^2_Z}{\widehat{\sigma}^2_v}$. This is the sample F.
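As a sanity check, the two expressions for the sample F (the squared first-stage t-statistic and the explicit formula) agree numerically. The snippet below is standalone, with arbitrary illustrative parameter values:

```julia
using Random, Statistics

Random.seed!(11)
n = 1000
z = randn(n); v = randn(n)
x = sqrt(10 / n) .* z .+ v              # population F = 10
pihat = z \ x                           # first-stage coefficient
uhat  = x .- pihat .* z
se_pi = sqrt((sum(uhat .^ 2) / n) / sum(z .^ 2))
F_t2   = (pihat / se_pi)^2                                   # squared t-statistic
F_form = n * pihat^2 * (sum(z .^ 2) / n) / (sum(uhat .^ 2) / n)
F_t2 ≈ F_form   # true: the two definitions coincide
```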
It can be shown that the distribution of $\widehat{F}$ relates to $F$ as follows: $\widehat{F} \to_d \chi_1^2(F)$, the so-called noncentral $\chi^2$ distribution with one degree of freedom and noncentrality parameter $F$. Here is an explanation:
If $Z \sim N(0, 1)$, then
$$Z^2 \sim \chi^2_1$$This has mean $1$ and variance $2$. It describes the distribution of a squared standard normal — pure noise centered at zero, then squared.
Now suppose $Z$ has a nonzero mean: $Z \sim N(\mu, 1)$. Then
$$Z^2 \sim \chi^2_1(\lambda), \qquad \lambda = \mu^2$$where $\lambda$ is the noncentrality parameter. It measures how far the mean is from zero.
Key properties: the mean is $1 + \lambda$ and the variance is $2(1 + 2\lambda)$; larger $\lambda$ shifts the distribution to the right.
| | Mean zero ($\mu = 0$) | Mean nonzero ($\mu \neq 0$) |
|---|---|---|
| $Z \sim N(\mu, 1)$ | Centered at 0 | Shifted right |
| $Z^2 \sim \;?$ | $\chi^2_1$ (central) | $\chi^2_1(\mu^2)$ (noncentral) |
Squaring a shifted normal gives you a "shifted" $\chi^2$. That's the entire idea.
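A Monte Carlo sketch of this claim: simulate the first stage many times, compute the sample F each time, and compare its distribution with $\chi^2_1(F)$ (the sample size and replication count are illustrative):

```julia
using Distributions, Random, Statistics

Random.seed!(7)
n, F = 1000, 2.30
pi_ = sqrt(F / n)
Fhat = map(1:20_000) do _
    z = randn(n); v = randn(n)
    x = pi_ .* z .+ v
    ph = z \ x                               # first-stage estimate
    uh = x .- ph .* z
    n * ph^2 * sum(z .^ 2) / sum(uh .^ 2)    # sample F
end
mean(Fhat)             # close to 1 + F = 3.30, the mean of the noncentral chi-square
quantile(Fhat, 0.95)   # close to quantile(NoncentralChisq(1, F), 0.95)
```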
# Under the DGP, sample F follows a noncentral χ²(1, μ) distribution
# with noncentrality parameter μ = population F.
# We plot the CDFs to read off the 95th percentile of sample F for each population F.
popF_values = [1.82, 2.30, 5.78, 10.00, 29.44, 73.75]
linewidths = [1.5, 2.5, 1.5, 1.5, 1.5, 1.5]
plt_nc = plot(
size=(900, 500),
legend=:bottomright,
xticks=0:10:120,
title=L"Noncentral $\chi_1^2(\mu)$: CDF of $\widehat{F}$ for Various Population $F$ Values",
xlabel=L"Sample $\widehat{F}$",
ylabel="CDF")
for (μ, lw) in zip(popF_values, linewidths)
plot!(plt_nc, x -> cdf(NoncentralChisq(1, μ), x), 0, 120,
label=L"$F = %$(μ)$", lw=lw)
end
hline!([0.95], linestyle=:dash, lc=:gray50, lw=1.2, label=false)
vline!([10], linestyle=:dash, lc=:gray50, lw=1.2, label=false)
annotate!(65, 0.91, text("95th percentile", 9, :gray40))
annotate!(11.5, 0.15, text(L"$\widehat{F} = 10$", 9, :gray40))
display(plt_nc)
How do we read the above picture?
- Focus on the slightly thicker line, for which the population $F$ equals 2.30.
- That line is the CDF of the noncentral $\chi^2$ distribution with noncentrality parameter 2.30.
- It crosses the dashed horizontal line at 0.95 at (approximately) a sample $F$ of 10.
- In words: if the population $F$ is 2.30, then 95% of sample $F$ values fall below 10.
- Loose translation: if the population $F$ is 2.30, then it is unlikely that the sample $F$ exceeds 10.
- Contrapositive: if the sample $F$ exceeds 10, then the associated population $F$ is unlikely to be as low as 2.30 (it is probably higher).
This is a simple intuitive construction of a confidence interval for population $F$:
A sample $F$ of 10 implies the one-sided confidence set $F > 2.30$ for the population $F$, which in turn bounds the worst-case empirical size at 13.5%.
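This correspondence can be verified directly in Distributions.jl:

```julia
using Distributions

# 95th percentile of the sample-F distribution when the population F is 2.30
quantile(NoncentralChisq(1, 2.30), 0.95)   # ≈ 10: the Rule-of-Thumb cutoff
```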
Mapping from population F to sample F:
| Case | population F | sample F | Worst case empirical size (when $\rho=1$) |
|---|---|---|---|
| 1 | 1.82 | 8.96 | 15% |
| 2 | 2.30 | 10.00 | 13.5% |
| 3 | 5.78 | 16.38 | 10% |
| 4 | 10.00 | 23.10 | 8.6% |
| 5 | 29.44 | 50.00 | 6.4% |
| 6 | 73.75 | 104.70 | 5.0% |
Bottom line for practitioners:
If we want to be confident that the worst-case empirical size will not exceed 15%, we should check that the first-stage (sample) F exceeds 8.96.
(That is the origin of the Rule of Thumb that F exceed 10: 8.96 rounded up to a convenient whole number.)
The table also shows that if you strive for an empirical size of 5%, you need a sample F of about 104.7.
(Source: Table 1 of Keane & Neal 2024, which tabulates Stock & Yogo critical values.)
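The mapping from sample F back to population F can be reproduced by inverting the 95th percentile of the noncentral $\chi^2_1$ distribution. Here is a minimal sketch using bisection (the function name is ours, not from any package; it relies on the 95th percentile being increasing in $F$):

```julia
using Distributions

# Find the population F whose sample-F distribution has 95th percentile
# equal to `target`, by bisection over the noncentrality parameter
function popF_from_sampleF(target; lo=1e-6, hi=300.0, tol=1e-6)
    p95(F) = quantile(NoncentralChisq(1, F), 0.95)
    while hi - lo > tol
        mid = (lo + hi) / 2
        p95(mid) < target ? (lo = mid) : (hi = mid)
    end
    (lo + hi) / 2
end

popF_from_sampleF(10.0)     # ≈ 2.30
popF_from_sampleF(104.70)   # ≈ 73.75
```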
Practical implications:
- Always report the first-stage sample $F$ alongside IV estimates.
- A sample $F$ just above 10 bounds the worst-case size at roughly 13.5%, not at the nominal 5%.
- Guaranteeing a worst-case size near 5% requires a much larger sample $F$ (about 105 in this design).
Stock & Yogo fix $\rho = 1$ to derive conservative thresholds. How realistic is this assumption?
Keane & Neal report a back-of-the-envelope calculation, using US Panel Study of Income Dynamics data on returns to schooling, which suggests that plausible values of $\rho$ in that application are well below 1. Under this reasoning, $\rho = 1$ is an extreme and unrealistic scenario for most applications.
Does this mean Stock & Yogo are needlessly conservative?
Let's check directly: holding $F = 1.82$ fixed (the weak-instrument case that maps to sample F ≈ 8.96), how does empirical size vary as $\rho$ changes?
Random.seed!(5678)
rho_range = 0:0.05:1.0
emp_size = [mean(abs.(simulate_distribution(F=1.82, rho=rho).tiv_dst) .> 1.96) for rho in rho_range]
plot(rho_range, emp_size,
lw=2.5,
lc="#0072B2",
legend=false,
xlabel=L"Degree of Endogeneity $\rho$",
ylabel="Empirical Size",
size=(900, 500))
hline!([0.05], linestyle=:dash, lc=:black, lw=1.2)
annotate!(0.12, 0.058, text("5% nominal size", 9, :gray40))
vline!([0.45], linestyle=:dot, lc="#009E73", lw=2)
annotate!(0.51, 0.21, text(L"\rho = 0.45", 10, "#009E73"))
ylims!(0, 0.22)
title!(L"Empirical Size vs $\rho$ (population $F$ = 1.82, sample $\widehat{F}$ 95th pct $\approx$ 8.96)")
The plot reveals a nuanced picture: for $\rho$ below roughly 0.45 the empirical size stays at or below the 5% nominal level, even though the instrument is weak; as $\rho$ rises toward 1, the size distortion grows toward the worst case.
Bottom line: Stock & Yogo's $\rho = 1$ benchmark is deliberately conservative. In many empirical settings the true degree of endogeneity is moderate, and the size distortion is less severe than the worst case suggests. This does not mean we can ignore instrument strength — a sample F just above 10 still warrants caution — but it does mean the picture is more nuanced than a single threshold implies.
| Concept | Key Result |
|---|---|
| 2SLS bias (Week 7) | $\propto L\rho/F$ — large $F$ suppresses bias |
| Empirical size | Inflated above 5% when $F$ is small; worst case at $\rho=1$ |
| Rule of Thumb | Sample $F > 10$ $\Leftrightarrow$ pop. $F > 2.30$ $\Leftrightarrow$ worst-case size ≈ 13.5% |
| Worst-case realism | $\rho=1$ is conservative; realistic $\rho$ values give better size properties |
What's missing from this analysis? We have only studied size — what happens when the null $H_0: \beta = 0$ is true. We have not studied power — the probability of correctly rejecting a false null when the true effect is non-zero. In Week 9 we will see that: