Lecture 8: When Are Instruments Strong? Understanding Stock & Yogo (2005)¶

Recap of Week 7¶

In Week 7 we derived the approximate bias of the 2SLS estimator:

$$ \text{E}(\hat{\beta}_{\text{2SLS}} - \beta) \;\approx\; \frac{L \cdot \rho}{F} $$

where $L$ is the number of instruments, $\rho$ is the degree of endogeneity $\text{Cov}(v_i, u_i)$, and $F$ is the population first-stage F-statistic. The key message: the larger the F, the smaller the bias. This motivates today's central question: how large does F need to be for instruments to be considered "strong"?
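To get a feel for the magnitudes this formula implies, here is a minimal numeric illustration (the parameter values below are hypothetical, chosen only to show how the bias scales with $F$):

# Hypothetical values of L, rho, and F, purely to illustrate the scale of the approximation
for (L, rho, F) in [(1, 0.5, 2.0), (1, 0.5, 10.0), (1, 0.5, 100.0)]
    println("L = ", L, ", rho = ", rho, ", F = ", F,
            "  =>  approximate 2SLS bias = ", round(L * rho / F, digits = 3))
end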

Staiger & Stock (1997) proposed the Rule of Thumb that a first-stage (sample) F above 10 qualifies instruments as strong. Stock & Yogo (2005) refined this into a set of critical values that depend on the model configuration. By the end of this notebook you will understand exactly where these numbers come from.

Road Map¶

  1. Specify a simulation DGP in which instrument strength is controlled by a population F parameter
  2. Simulate the distribution of the IV t-statistic under different DGPs and visually assess normality
  3. Formalise the problem via empirical size — the actual probability of a false rejection
  4. Use the noncentral $\chi^2$ distribution to convert population-F thresholds into practitioner-relevant sample-F critical values
  5. Evaluate whether Stock & Yogo's worst-case scenario ($\rho = 1$) is realistic in practice

Data Generating Process (DGP)¶

We use the toy model from Keane & Neal "A Practical Guide to Weak Instruments" (Annual Review of Economics, 2024), p. 193:

$$ \begin{align*} Y_i &= \beta X_i + u_i\\ X_i &= \pi Z_i + v_i\\ v_i &= \rho\, u_i + \sqrt{1-\rho^2}\, \eta_i \end{align*} $$

where

  • $u_i \sim N(0,1)$

  • $\eta_i \sim N(0,1)$

  • $Z_i \sim N(0,1)$

  • $\beta = 0$

  • notice: $\sigma_v^2 = \operatorname{Var}(v_i) = 1$

Parameter $\rho$ controls the OLS bias (aka the degree of endogeneity):

$$ \operatorname{Cov}(u_i, v_i) = \operatorname{E}(u_i v_i) = \operatorname{E}(\rho u_i^2) + \operatorname{E}(\sqrt{1-\rho^2}\, u_i \eta_i) = \rho \operatorname{E}(u_i^2) = \rho $$

For small values of $\pi$, the value of $\rho$ pins down the OLS bias:

$$ \hat{\beta}_\text{OLS} - \beta = \frac{\sum X_i u_i}{\sum X_i^2} = \frac{\sum (\pi Z_i + v_i) u_i}{\sum (\pi Z_i + v_i)^2} \;\to_p\; \frac{\operatorname{E}(u_i v_i)}{\pi^2 \operatorname{Var}(Z_i) + \operatorname{Var}(v_i)} = \frac{\rho}{\pi^2 + 1} \approx \rho, $$

when $\pi$ is small.
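A self-contained Monte Carlo sketch (with an illustrative small $\pi$ and moderate $\rho$) confirms this limit:

using Random, Statistics

Random.seed!(1)
n, piv, rho = 10^6, 0.05, 0.5                  # illustrative values; piv plays the role of a small pi
u   = randn(n); eta = randn(n); z = randn(n)
v   = rho .* u .+ sqrt(1 - rho^2) .* eta
x   = piv .* z .+ v
y   = u                                        # beta = 0, so y = u

bols = sum(x .* y) / sum(x .^ 2)               # OLS slope in the no-intercept regression of y on x
println("OLS slope        = ", round(bols, digits = 3))
println("rho / (pi^2 + 1) = ", round(rho / (piv^2 + 1), digits = 3))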

Parameter $\pi$ governs instrument strength and is set indirectly via the population F-statistic. Recall from Week 7 (and equation 11 of Keane & Neal):

$$ F = N \cdot \frac{R^2}{1-R^2} = N \cdot \frac{\text{ESS}}{\text{RSS}} = N \cdot \frac{\text{Var}(Z\pi)}{\sigma_v^2} = N \cdot \frac{\pi^2 \text{Var}(Z)}{\sigma_v^2} = N \cdot \pi^2 $$

(using $\operatorname{Var}(Z)=1$ and $\sigma_v^2=1$). Therefore $\pi = \sqrt{F/N}$, which enables us to set $\pi$ in the above DGP by changing the value of $F$.
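As a quick check of this mapping, the three population-F values used later in this lecture translate into the following first-stage coefficients at $n = 1{,}000$:

n = 1000
for F in (1.82, 2.30, 73.75)
    piv = sqrt(F / n)
    println("F = ", F, "  =>  pi = ", round(piv, digits = 4),
            "   (check: N * pi^2 = ", round(n * piv^2, digits = 2), ")")
end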

Julia Functions¶

The cells below define the simulation machinery used throughout this lecture. Each function is self-contained with its own docstring. Feel free to skim them now and refer back as needed.

In [1]:
using Distributions, Random, Statistics

function dgp_keane_neal(; b=0, n=1000, F, rho)

    """
    Generates one sample of size n following the DGP of Keane & Neal (2024) p. 193.

    ### Input
    - `b`   -- structural coefficient β (default 0)
    - `n`   -- sample size (default 1000)
    - `F`   -- population first-stage F-statistic
    - `rho` -- degree of endogeneity ρ

    ### Output (named tuple)
    - `x`, `y`, `z` -- (n×1) vectors for regressor, outcome, and instrument
    """

    π   = sqrt(F / n)
    u   = randn(n)
    eta = randn(n)
    z   = randn(n)
    v   = rho * u + sqrt(1 - rho^2) * eta
    x   = π * z .+ v
    y   = b * x .+ u

    return (; x, y, z)

end
Out[1]:
dgp_keane_neal (generic function with 1 method)
In [2]:
function ols_estimator(x, y)

    """
    OLS estimator for the simple linear model y = βx + u.

    ### Input
    - `x` -- (n×1) regressor vector
    - `y` -- (n×1) outcome vector

    ### Output (named tuple)
    - `bhat` -- OLS estimate of β
    - `se`   -- standard error
    - `t`    -- t-statistic
    """

    bhat = x \ y
    uhat = y - x * bhat
    s    = (uhat' * uhat) / length(y)
    se   = sqrt(s / (x' * x))
    t    = bhat / se

    return (; bhat, se, t)

end
Out[2]:
ols_estimator (generic function with 1 method)
In [3]:
function iv_estimator(x, y, z)

    """
    Just-identified IV estimator with one endogenous variable and one instrument.

    ### Input
    - `x` -- (n×1) endogenous regressor
    - `y` -- (n×1) outcome vector
    - `z` -- (n×1) instrument vector

    ### Output (named tuple)
    - `bhat` -- IV estimate of β
    - `se`   -- standard error (Keane & Neal 2024, p. 190)
    - `t`    -- t-statistic

    ### Notes
    The SE uses the first-stage ESS = N·π̂²·Var(z) as the relevance measure.
    """

    bhat  = (z' * y) / (z' * x)   # β̂_IV = (z'y)/(z'x)

    n     = length(y)
    pihat = z \ x                  # first-stage coefficient π̂
    ESS   = n * pihat^2 * var(z)   # first-stage explained sum of squares
    uhat  = y - x * bhat
    s     = (uhat' * uhat) / n
    se    = sqrt(s / ESS)

    t = bhat / se

    return (; bhat, se, t)

end
Out[3]:
iv_estimator (generic function with 1 method)
In [4]:
function simulate_distribution(; b=0, F, rho, n=1000, rep=10000)

    """
    Monte Carlo simulation of OLS and IV estimator distributions.

    Creates `rep` independent datasets from dgp_keane_neal and collects
    estimates, standard errors, and t-statistics for OLS, IV, and the
    Anderson-Rubin (AR) test.

    ### Input
    - `b`   -- true structural coefficient β (default 0)
    - `F`   -- population first-stage F-statistic
    - `rho` -- degree of endogeneity ρ
    - `n`   -- sample size (default 1000)
    - `rep` -- number of Monte Carlo replications (default 10,000)

    ### Output (named tuple of rep-length vectors)
    - `bols_dst`, `sols_dst`, `tols_dst` -- OLS estimate, SE, t-statistic
    - `biv_dst`,  `siv_dst`,  `tiv_dst`  -- IV  estimate, SE, t-statistic
    - `ar_dst`                            -- AR t-statistic (OLS of Y on Z)
    """

    bols_dst = Vector{Float64}(undef, rep)
    sols_dst = Vector{Float64}(undef, rep)
    tols_dst = Vector{Float64}(undef, rep)
    biv_dst  = Vector{Float64}(undef, rep)
    siv_dst  = Vector{Float64}(undef, rep)
    tiv_dst  = Vector{Float64}(undef, rep)
    ar_dst   = Vector{Float64}(undef, rep)

    for i in 1:rep
        x, y, z = dgp_keane_neal(b=b, F=F, rho=rho, n=n)

        bols_dst[i], sols_dst[i], tols_dst[i] = ols_estimator(x, y)
        biv_dst[i],  siv_dst[i],  tiv_dst[i]  = iv_estimator(x, y, z)
        ar_dst[i] = ols_estimator(z, y).t   # AR: regress Y on Z directly
    end

    return (; bols_dst, sols_dst, tols_dst, biv_dst, siv_dst, tiv_dst, ar_dst)

end
Out[4]:
simulate_distribution (generic function with 1 method)
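Before running the full Monte Carlo, here is a quick sanity check of the machinery. It draws one strong-instrument sample and applies the estimators defined above (the seed is illustrative and separate from the seeds used below):

Random.seed!(42)                                  # illustrative seed
x, y, z = dgp_keane_neal(F = 73.75, rho = 0.5)    # strong-instrument case

ols = ols_estimator(x, y)                         # should be biased: roughly rho/(pi^2 + 1) ≈ 0.47 here
iv  = iv_estimator(x, y, z)                       # should be close to the true beta = 0
fs  = ols_estimator(z, x)                         # first stage: regression of x on z

println("OLS estimate = ", round(ols.bhat, digits = 3))
println("IV  estimate = ", round(iv.bhat, digits = 3))
println("sample F     = ", round(fs.t^2, digits = 1), "  (should be in the vicinity of 73.75)")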

Creating Simulated Distributions¶

We explore nine parameter combinations that vary instrument strength and the degree of endogeneity:

| DGP | Population F | $\rho$ | Description |
|-----|--------------|--------|-------------|
| 1 | 1.82 | 0.10 | weak IV, low endogeneity |
| 2 | 2.30 | 0.10 | rule of thumb IV, low endogeneity |
| 3 | 73.75 | 0.10 | strong IV, low endogeneity (benchmark) |
| 4 | 1.82 | 0.50 | weak IV, moderate endogeneity |
| 5 | 2.30 | 0.50 | rule of thumb IV, moderate endogeneity |
| 6 | 73.75 | 0.50 | strong IV, moderate endogeneity |
| 7 | 1.82 | 0.90 | weak IV, high endogeneity |
| 8 | 2.30 | 0.90 | rule of thumb IV, high endogeneity |
| 9 | 73.75 | 0.90 | strong IV, high endogeneity |

Why these specific F values? The three F values correspond to Stock & Yogo critical values we will derive later in this notebook:

| Population F | Sample F (95th pct) | Worst-case empirical size |
|--------------|---------------------|---------------------------|
| 1.82 | ≈ 8.96 | 15% |
| 2.30 | 10.00 | 13.5% (Rule of Thumb) |
| 73.75 | ≈ 104.70 | ≈ 5% (nominal level) |

DGPs 3, 6, 9 serve as our benchmark: they represent how things should look when instruments are strong.

For each of the nine DGPs we generate 10,000 samples of size $n = 1{,}000$.

In [5]:
Random.seed!(1234)   # set seed for reproducibility

parms_rho = (0.10, 0.50, 0.90)
parms_F   = (1.82, 2.30, 73.75)

dgps = [simulate_distribution(rho=rho, F=F) for rho in parms_rho, F in parms_F];

Distribution of the IV t-Statistic¶

We plot the empirical distribution of the IV t-statistic across all nine DGPs.

Under standard regularity conditions, the IV t-statistic $t_{\text{IV}} = \hat{\beta}_{\text{IV}} / \text{se}_{\text{IV}}$ converges to $N(0,1)$. The solid curve in each panel below shows that $N(0,1)$ density as a reference. The histograms are normalised to density so that they are comparable to this curve.

Question to keep in mind: Do all nine distributions look approximately normal?

In [6]:
using Plots, LaTeXStrings
using Plots.PlotMeasures: mm
Plots.theme(:wong2)
gr(fmt=:png)

# ── Global defaults (apply to all subsequent plots) ──
default(
    fontfamily    = "Computer Modern",
    titlefontsize = 13,
    guidefontsize = 11,
    tickfontsize  = 9,
    legendfontsize = 9,
    left_margin   = 12mm,
    bottom_margin = 10mm,
    gridalpha     = 0.15,
    framestyle    = :box,
    lw            = 2,
    size          = (900, 500)
)

xgrid = range(-4, 4, length=200)

plt = plot(layout=(length(parms_rho), length(parms_F)),
    size=(1000, 700),
    plot_title="Empirical Distribution of IV t-statistic for Different DGPs",
    plot_titlefontsize=14)

for (i, rho) in enumerate(parms_rho)
    for (j, F) in enumerate(parms_F)
        k = length(parms_F) * (i-1) + j
        histogram!(plt,
            dgps[i,j].tiv_dst,
            normalize=true,
            subplot=k,
            bins=range(-4, 4, length=51),
            color="#6C9BC2",
            fillalpha=0.5,
            linecolor=:white,
            linewidth=0.5,
            legend=false,
            title=L"\rho=%$(rho),\; F=%$(F)",
            titlefontsize=10)
        plot!(plt, xgrid, pdf.(Normal(0,1), xgrid),
            subplot=k, lc="#0072B2", lw=2, label=false)
    end
end

display(plt)

Reading the Plots¶

The solid line is the $N(0,1)$ density — the ideal target.

Top row ($\rho = 0.10$): The top-right panel (strong IV, $F = 73.75$) is our benchmark. It tracks the $N(0,1)$ density closely. The top-left and top-centre panels deviate modestly.

Middle and bottom rows ($\rho = 0.50$ and $\rho = 0.90$): As endogeneity increases, departures from normality worsen. The bottom-left panels show heavy tails.

Why does this matter? When we report significance at the "5% level", we compare $|t_{\text{IV}}|$ to 1.96, which is the 97.5th percentile of $N(0,1)$. If the actual distribution of $t_{\text{IV}}$ is not $N(0,1)$, this comparison is unreliable — we might reject the null far more (or less) often than we intend.

Empirical Size and the Stock & Yogo (2005) Approach¶

Formalising the Problem¶

Empirical size is the actual probability of rejecting a true null hypothesis:

$$ \text{empirical size} := \Pr\!\left(|t_{\text{IV}}| > 1.96 \;\Big|\; \beta = 0\right) $$

In our DGP we set $\beta = 0$, so every rejection is a false rejection (Type I error). The nominal size is 5%. If the empirical size exceeds 5%, our test rejects too often — we are over-confident in declaring effects "significant".

We can estimate empirical size directly from our simulations: it is simply the fraction of the 10,000 replications for which $|t_{\text{IV}}| > 1.96$.
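A short loop (a sketch reusing the dgps array and parameter tuples created above) tabulates this fraction for every DGP:

# Empirical size of the IV t-test for each DGP (rows: rho; columns: population F = 1.82, 2.30, 73.75)
for (i, rho) in enumerate(parms_rho)
    sizes = [mean(abs.(dgps[i, j].tiv_dst) .> 1.96) for j in 1:length(parms_F)]
    println("rho = ", rho, ":  ", join(round.(sizes, digits = 3), "   "))
end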

Stock & Yogo's Strategy¶

Stock and Yogo (2005) propose the following approach to formalise when instruments are "strong enough":

  1. Fix $\rho = 1$ — the worst case for endogeneity (OLS bias equals its maximum possible value).
  2. Vary the population F over a grid from 0 to 20.
  3. For each F, simulate the empirical size.
  4. Declare instruments "strong" if F exceeds the threshold at which empirical size stays below a chosen tolerance $\alpha^*$ (e.g. 15% or 10%).

By conditioning on the worst case $\rho = 1$, the threshold is conservative: if your instruments pass this test, they pass under any smaller $\rho$ too.

Let's run the simulation:

In [7]:
Random.seed!(2024)

F_range  = 0:.25:20
emp_size = [mean(abs.(simulate_distribution(F=F, rho=1).tiv_dst) .> 1.96) for F in F_range]

plot(F_range, emp_size,
    legend=false,
    xlabel=L"Population $F$",
    ylabel="Empirical Size",
    xticks=0:2:20,
    yticks=0:0.05:1,
    size=(900, 500),
    lw=2.5,
    lc="#0072B2")
hline!([0.15], linestyle=:dash, lc="#009E73", lw=1.5)
hline!([0.05], linestyle=:dash, lc="#009E73",  lw=1.5)
ylims!(0, 1)
title!(L"Empirical Size of IV $t$-test vs Population $F$ (worst case: $\rho$ = 1)")
Out[7]:

Reading the Size Plot¶

When population F is large (above ≈ 20), the empirical size approaches the 5% nominal level even under the extreme assumption $\rho = 1$.

For weaker instruments:

  • At $F \approx 2.30$, the empirical size is roughly 13.5% — you falsely reject the null about 2.7 times as often as you should.
  • At $F \approx 1.82$, it reaches roughly 15% — a three-fold inflation.

Stock & Yogo's Threshold Rule¶

Stock and Yogo accept that with weak instruments the empirical size cannot be held at 5%. Instead, they ask:

"What is the minimum population F that keeps the worst-case empirical size below a tolerance $\alpha^$?"*

Common choices for $\alpha^*$: 10% or 15%.

Digression: Population F vs Sample F¶

For practitioners, the question is more naturally framed in terms of the sample F (the quantity actually reported in software) rather than the population F. We now relate these two to each other.

Recall: population $F$ is given by $F = N \pi^2$ (see equation 11 in Keane & Neal).

In applications, the first stage $F$ is estimated via $\widehat{F} := t^2 = \left( \frac{\widehat{\pi}}{\text{se}(\widehat{\pi})} \right)^2 = \frac{\widehat{\pi}^2}{\widehat{\operatorname{Var}}(\widehat{\pi})} = \frac{\widehat{\pi}^2}{\widehat{\sigma}^2_v/(N \cdot \widehat{\sigma}^2_Z)} =N \widehat{\pi}^2 \frac{\widehat{\sigma}^2_Z}{\widehat{\sigma}^2_v}$. This is the sample F.
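A one-sample check, reusing dgp_keane_neal and ols_estimator from above with illustrative parameter values, confirms that the squared first-stage t-statistic coincides with $N \widehat{\pi}^2 \widehat{\sigma}^2_Z / \widehat{\sigma}^2_v$:

Random.seed!(7)                                    # illustrative seed
x, y, z = dgp_keane_neal(F = 10.0, rho = 0.5)      # rho plays no role in the first stage

fs    = ols_estimator(z, x)                        # first-stage regression of x on z
sigz2 = mean(z .^ 2)                               # uncentred, matching the no-intercept regressions above
sigv2 = mean((x .- fs.bhat .* z) .^ 2)             # variance of the first-stage residuals

println("t^2 from the first stage    = ", round(fs.t^2, digits = 3))
println("N * pihat^2 * sigz2 / sigv2 = ", round(length(x) * fs.bhat^2 * sigz2 / sigv2, digits = 3))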

It can be shown that the distribution of $\widehat{F}$ relates to $F$ as follows: $\widehat{F} \to_d \chi_1^2(F)$, which is the so-called noncentral $\chi^2$ distribution with noncentrality parameter $F$. Here is an explanation:

Primer: Central $\chi^2_1$¶

If $Z \sim N(0, 1)$, then

$$Z^2 \sim \chi^2_1$$

This has mean $1$ and variance $2$. It describes the distribution of a squared standard normal — pure noise centered at zero, then squared.

Key properties:

  • Supported on $[0, \infty)$
  • Heavily right-skewed
  • If $Q_1 \sim \chi^2_1$ and $Q_2 \sim \chi^2_1$ are independent, then $Q_1 + Q_2 \sim \chi^2_2$ (this is how the general $\chi^2_k$ is built up)

Primer: Noncentral $\chi^2_1$¶

Now suppose $Z$ has a nonzero mean: $Z \sim N(\mu, 1)$. Then

$$Z^2 \sim \chi^2_1(\lambda), \qquad \lambda = \mu^2$$

where $\lambda$ is the noncentrality parameter. It measures how far the mean is from zero.

Key properties:

  • Mean is $1 + \lambda$ (shifted right by $\lambda$ compared to the central case)
  • Variance is $2(1 + 2\lambda)$
  • When $\lambda = 0$, this reduces to the ordinary $\chi^2_1$
  • Larger $\lambda$ shifts the entire distribution to the right

Quick visual intuition¶

|  | Mean zero ($\mu = 0$) | Mean nonzero ($\mu \neq 0$) |
|---|---|---|
| $Z \sim N(\mu, 1)$ | Centered at 0 | Shifted right |
| $Z^2 \sim \;?$ | $\chi^2_1$ (central) | $\chi^2_1(\mu^2)$ (noncentral) |

Squaring a shifted normal gives you a "shifted" $\chi^2$. That's the entire idea.
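To connect the primer back to the claim that $\widehat{F} \to_d \chi^2_1(F)$, here is a quick Monte Carlo sketch (population F, ρ, and seed are illustrative) comparing simulated sample-F draws with the corresponding NoncentralChisq(1, F) distribution from Distributions.jl:

Random.seed!(99)                                     # illustrative seed
F_pop, rep = 2.30, 10_000

# one draw of the sample F: squared t-statistic of the first-stage regression of x on z
sample_F(F) = (s = dgp_keane_neal(F = F, rho = 0.5); ols_estimator(s.z, s.x).t^2)

Fhat = [sample_F(F_pop) for _ in 1:rep]              # rho does not affect the first stage
d    = NoncentralChisq(1, F_pop)

println("mean:     simulated = ", round(mean(Fhat), digits = 2),
        "   theoretical 1 + F = ", 1 + F_pop)
println("95th pct: simulated = ", round(quantile(Fhat, 0.95), digits = 2),
        "   theoretical = ", round(quantile(d, 0.95), digits = 2))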

In [8]:
# Under the DGP, sample F follows a noncentral χ²(1, μ) distribution
# with noncentrality parameter μ = population F.
# We plot the CDFs to read off the 95th percentile of sample F for each population F.

popF_values = [1.82, 2.30, 5.78, 10.00, 29.44, 73.75]
linewidths  = [1.5,  2.5,  1.5,  1.5,   1.5,   1.5]

plt_nc = plot(
    size=(900, 500),
    legend=:bottomright,
    xticks=0:10:120,
    title=L"Noncentral $\chi_1^2(\mu)$: CDF of $\widehat{F}$ for Various Population $F$ Values",
    xlabel=L"Sample $\widehat{F}$",
    ylabel="CDF")

for (μ, lw) in zip(popF_values, linewidths)
    plot!(plt_nc, x -> cdf(NoncentralChisq(1, μ), x), 0, 120,
        label=L"$F = %$(μ)$", lw=lw)
end

hline!([0.95], linestyle=:dash, lc=:gray50, lw=1.2, label=false)
vline!([10],   linestyle=:dash, lc=:gray50, lw=1.2, label=false)
annotate!(65, 0.91, text("95th percentile", 9, :gray40))
annotate!(11.5, 0.15, text(L"$\widehat{F} = 10$", 9, :gray40))

display(plt_nc)

Reading the Noncentral $\chi^2$ Plot¶

How do we read the above picture?

Let's focus on the slightly thicker line for which population F is equal to 2.30.

That line gives the CDF of the noncentral $\chi^2$ distribution with noncentrality parameter 2.30. It crosses the dashed horizontal line at 0.95 at roughly $\widehat{F} = 10$.

In words: if the population $F$ is 2.30, then 95% of sample $F$ values should fall below 10.

Loose translation: if the population $F$ is 2.30, then it is unlikely that the sample $F$ exceeds 10.

Contrapositive: if the sample $F$ exceeds 10, then the associated population $F$ is probably not 2.30 but something larger.

This gives a simple, intuitive construction of a confidence interval for the population $F$:

A sample $F$ of 10 implies the one-sided confidence interval $F > 2.30$, which in turn implies a worst-case empirical size of at most 13.5%.

Mapping from population F to sample F:

| Case | Population F | Sample F (95th pct) | Worst-case empirical size (when $\rho=1$) |
|------|--------------|---------------------|-------------------------------------------|
| 1 | 1.82 | 8.96 | 15% |
| 2 | 2.30 | 10.00 | 13.5% |
| 3 | 5.78 | 16.38 | 10% |
| 4 | 10.00 | 23.10 | 8.6% |
| 5 | 29.44 | 50.00 | 6.4% |
| 6 | 73.75 | 104.70 | 5.0% |

Bottom line for practitioners:

If we want to be confident that the worst-case empirical size will not exceed 15%, check that the first-stage (sample) F exceeds 8.96.

(This is the origin of the Rule of Thumb that the sample F should exceed 10.)

The table also shows that if you strive for an empirical size of 5%, you need a sample F of about 104.7.

(Source: Table 1 of Keane & Neal 2024, which tabulates Stock & Yogo critical values.)
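The sample-F column of the table can be reproduced directly from the quantile function of the noncentral $\chi^2$ distribution (a sketch; the numbers should agree with the table up to rounding):

# Reproduce the "Sample F (95th pct)" column from the noncentral chi-squared quantile function
for F in (1.82, 2.30, 5.78, 10.00, 29.44, 73.75)
    println("population F = ", rpad(F, 6),
            "  =>  95th pct of sample F ≈ ", round(quantile(NoncentralChisq(1, F), 0.95), digits = 2))
end

# The reverse mapping behind the confidence-interval argument: the smallest population F
# whose 95th percentile of sample F reaches 10 (coarse grid search, illustrative)
println("population F with 95th pct = 10:  ",
        round(first(F for F in 0:0.01:5 if quantile(NoncentralChisq(1, F), 0.95) >= 10), digits = 2))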

Practical implications:

  • The Staiger & Stock Rule of Thumb (sample F > 10) corresponds to a worst-case empirical size of ≈ 13.5%, not 5%. It is a lower bound, not a gold standard.
  • Achieving a worst-case size close to the 5% nominal level requires sample F above 104.7 — a much more demanding criterion that many published studies cannot meet.
  • Keane & Neal (2024) recommend always reporting the first-stage sample F and being transparent about what size distortion it implies.

Is the Worst Case too Pessimistic?¶

Stock & Yogo fix $\rho = 1$ to derive conservative thresholds. How realistic is this assumption?

A back-of-the-envelope calculation from Keane & Neal, using US Panel Study of Income Dynamics data on returns to schooling:

  • The sample correlation between log earnings and years of schooling is about 0.45.
  • This correlation has two components: the causal effect of schooling on earnings, and the confounding effect of unobserved ability.
  • In the extreme case in which the causal effect is zero, the entire 0.45 correlation would reflect ability bias — so $|\rho| \leq 0.45$.
  • In practice, the causal effect is positive, so $|\rho|$ must be well below 0.45.

Under this reasoning, $\rho = 1$ is an extreme and unrealistic scenario for most applications.

Does this mean Stock & Yogo are needlessly conservative?
Let's check directly: holding $F = 1.82$ fixed (the weak-instrument case that maps to sample F ≈ 8.96), how does empirical size vary as $\rho$ changes?

In [9]:
Random.seed!(5678)

rho_range = 0:0.05:1.0
emp_size  = [mean(abs.(simulate_distribution(F=1.82, rho=rho).tiv_dst) .> 1.96) for rho in rho_range]

plot(rho_range, emp_size,
    lw=2.5,
    lc="#0072B2",
    legend=false,
    xlabel=L"Degree of Endogeneity $\rho$",
    ylabel="Empirical Size",
    size=(900, 500))
hline!([0.05], linestyle=:dash, lc=:black, lw=1.2)
annotate!(0.12, 0.058, text("5% nominal size", 9, :gray40))
vline!([0.45], linestyle=:dot, lc="#009E73", lw=2)
annotate!(0.51, 0.21, text(L"\rho = 0.45", 10, "#009E73"))
ylims!(0, 0.22)
title!(L"Empirical Size vs $\rho$ (population $F$ = 1.82, sample $\widehat{F}$ 95th pct $\approx$ 8.96)")
Out[9]:

What Does This Tell Us?¶

The plot reveals a nuanced picture:

  • At $\rho = 1$ (worst case): empirical size peaks around 15% — consistent with the Stock & Yogo table entry for $F = 1.82$.
  • At $\rho \approx 0.45$ (realistic maximum in returns-to-schooling): empirical size is only around 5–6%, close to the nominal level.
  • At very low $\rho$: empirical size can fall below 5%. This is not entirely reassuring — a test that under-rejects is conservative but hints that the IV estimator behaves oddly even when endogeneity is mild.

Bottom line: Stock & Yogo's $\rho = 1$ benchmark is deliberately conservative. In many empirical settings the true degree of endogeneity is moderate, and the size distortion is less severe than the worst case suggests. This does not mean we can ignore instrument strength — a sample F just above 10 still warrants caution — but it does mean the picture is more nuanced than a single threshold implies.

Summary and Preview of Week 9¶

| Concept | Key Result |
|---------|------------|
| 2SLS bias (Week 7) | $\propto L\rho/F$ — large $F$ suppresses bias |
| Empirical size | Inflated above 5% when $F$ is small; worst case at $\rho=1$ |
| Rule of Thumb | Sample $F > 10$ $\Leftrightarrow$ pop. $F > 2.30$ $\Leftrightarrow$ worst-case size ≈ 13.5% |
| Worst-case realism | $\rho=1$ is conservative; realistic $\rho$ values give better size properties |

What's missing from this analysis? We have only studied size — what happens when the null $H_0: \beta = 0$ is true. We have not studied power — the probability of correctly rejecting a false null when the true effect is non-zero. In Week 9 we will see that:

  • Weak instruments also dramatically reduce statistical power, so even when size is acceptable, you may miss genuine effects.
  • There is an additional problem: the standard IV t-test suffers from power asymmetry — it is easier to find "significant" results in one direction than the other when instruments are weak.
  • The Anderson–Rubin (AR) test provides a simple, robust alternative that Keane & Neal (2024) recommend over the standard IV t-test.