Lecture 9: Power, Bias, and the AR Test — A Deeper Look at Weak Instruments¶

Summary of Week 8¶

Last week, following Stock & Yogo (2005), we answered the question: how large does the first-stage F-statistic need to be for instruments to be considered "strong"?

Quick recap:

  • Instrument strength is captured by the first-stage F-statistic (regression of $X$ on $Z$). A larger F implies a stronger instrument.
  • Stock & Yogo fix $\rho = 1$ (worst-case endogeneity) and vary population $F$ to study the empirical size of the IV t-test — the actual probability of a false rejection.
  • Rule of Thumb: A first-stage sample $F$ above 10 keeps the worst-case empirical size below ≈ 13.5%. Achieving 5% nominal size requires sample $F$ above 104.

The mapping between population $F$ and sample $F$ (via the noncentral $\chi^2_1$ distribution):

| Population F | Sample F (95th pct) | Worst-case size |
|---:|---:|---:|
| 1.82 | 8.96 | 15% |
| 2.30 | 10.00 | 13.5% |
| 5.78 | 16.38 | 10% |
| 10.00 | 23.10 | 8.6% |
| 29.44 | 50.00 | 6.4% |
| 73.75 | 104.70 | 5.0% |

(Source: Table 1 of Keane & Neal 2024.)
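As a sanity check on this mapping: with a single instrument, the sample first-stage F-statistic follows (approximately) a noncentral $\chi^2_1$ distribution whose noncentrality parameter is the population $F$. A short sketch using Distributions.jl reproduces the "Sample F (95th pct)" column:

```julia
using Distributions

# With one instrument, sample F ~ noncentral χ²₁ with noncentrality
# parameter equal to the population F; its 95th percentile gives the
# "Sample F" column of the table.
pop_F       = [1.82, 2.30, 5.78, 10.00, 29.44, 73.75]
sample_F_95 = [quantile(NoncentralChisq(1, λ), 0.95) for λ in pop_F]
```

The computed quantiles match the tabulated values closely.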

Road Map for Week 9¶

Week 8 focused entirely on size — what happens when $H_0:\beta=0$ is true. This week we extend the analysis:

  1. Standard errors under weak IV: Are the IV standard errors reliable?
  2. The funnel plot: A diagnostic scatter plot of IV estimates against their standard errors that reveals a structural problem with weak instruments.
  3. Power: What is the probability of correctly rejecting a false null?
  4. Power asymmetry: A subtle distortion that makes the IV t-test reject more easily in one direction than the other.
  5. The Anderson–Rubin (AR) test: A simple, robust alternative recommended by Keane & Neal (2024).

Keane and Neal (2024)¶

Keane and Neal point out that the story gets more complicated once we pay attention to the power function.

While it is important to have a test with good size properties, we also want the test to be powerful — able to detect a real effect when one exists.

Recall:

Size is the probability of rejecting the null hypothesis when it is true.

Power is the probability of rejecting the null hypothesis when it is false.

Practical Lessons from Week 8¶

We learned that values for $\rho$ of practical relevance fall between zero and 0.50. For our simulations we therefore focus on:

  • 0.00 (no endogeneity)
  • 0.10
  • 0.30
  • 0.50

Data Generating Process (DGP)¶

We continue with the toy model of Keane & Neal (2024) p. 193, the same as last week:

$$ \begin{align*} Y_i &= \beta X_i + u_i\\ X_i &= \pi Z_i + v_i\\ v_i &= \rho\, u_i + \sqrt{1-\rho^2}\, \eta_i \end{align*} $$

where

  • $u_i \sim N(0,1)$, $\;\eta_i \sim N(0,1)$, $\;Z_i \sim N(0,1)$
  • $\beta = 0$ in the size simulations (the power simulations later vary the true $\beta$ over a grid)
  • $\operatorname{Var}(v_i) = 1$

We set $\pi = \sqrt{F/N}$ to control instrument strength via the population $F$-statistic: with $\operatorname{Var}(Z_i) = \operatorname{Var}(v_i) = 1$, the population first-stage $F$ equals $N\pi^2$.

  • Bias of OLS $\approx \rho$ (provided $\pi$ is small)
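The bias approximation follows in one line from the DGP:

$$ \operatorname{plim}\,\hat{\beta}_{\text{OLS}} - \beta = \frac{\operatorname{Cov}(X_i, u_i)}{\operatorname{Var}(X_i)} = \frac{\rho}{\pi^2 + 1} \approx \rho \quad \text{for small } \pi, $$

since $\operatorname{Cov}(X_i, u_i) = \operatorname{Cov}(v_i, u_i) = \rho$ and $\operatorname{Var}(X_i) = \pi^2 \operatorname{Var}(Z_i) + \operatorname{Var}(v_i) = \pi^2 + 1$.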

Julia Functions¶

The cells below define the simulation machinery for this lecture. The functions dgp_keane_neal, ols_estimator, iv_estimator, and simulate_distribution are identical to those in Week 8. The power_function is new: it computes empirical power across a grid of true $\beta$ values. Feel free to skim them now and refer back as needed.

In [1]:
using Distributions, Random, Statistics
using Plots, LaTeXStrings
using Plots.PlotMeasures: mm
In [2]:
Plots.theme(:wong2)
gr(fmt=:png)

default(
    fontfamily     = "Computer Modern",
    titlefontsize  = 12,
    guidefontsize  = 11,
    tickfontsize   = 9,
    legendfontsize = 9,
    left_margin    = 12mm,
    bottom_margin  = 10mm,
    gridalpha      = 0.15,
    framestyle     = :box,
    lw             = 2,
    size           = (900, 500)
)
In [3]:
function dgp_keane_neal(; b=0, n=1000, F, rho)

    """
    Generates one sample of size n following the DGP of Keane & Neal (2024) p. 193.

    ### Input
    - `b`   -- structural coefficient β (default 0)
    - `n`   -- sample size (default 1000)
    - `F`   -- population first-stage F-statistic
    - `rho` -- degree of endogeneity ρ

    ### Output (named tuple)
    - `x`, `y`, `z` -- (n×1) vectors for regressor, outcome, and instrument
    """

    π   = sqrt(F / n)
    u   = randn(n)
    eta = randn(n)
    z   = randn(n)
    v   = rho * u + sqrt(1 - rho^2) * eta
    x   = π * z .+ v
    y   = b * x .+ u

    return (; x, y, z)

end
Out[3]:
dgp_keane_neal (generic function with 1 method)
In [4]:
function ols_estimator(x, y)

    """
    OLS estimator for the simple linear model y = βx + u.

    ### Input
    - `x` -- (n×1) regressor vector
    - `y` -- (n×1) outcome vector

    ### Output (named tuple)
    - `bhat` -- OLS estimate of β
    - `se`   -- standard error
    - `t`    -- t-statistic
    """

    bhat = x \ y
    uhat = y - x * bhat
    s    = (uhat' * uhat) / length(y)
    se   = sqrt(s / (x' * x))
    t    = bhat / se

    return (; bhat, se, t)

end
Out[4]:
ols_estimator (generic function with 1 method)
In [5]:
function iv_estimator(x, y, z)

    """
    Just-identified IV estimator with one endogenous variable and one instrument.

    ### Input
    - `x` -- (n×1) endogenous regressor
    - `y` -- (n×1) outcome vector
    - `z` -- (n×1) instrument vector

    ### Output (named tuple)
    - `bhat` -- IV estimate of β
    - `se`   -- standard error (Keane & Neal 2024, p. 190)
    - `t`    -- t-statistic

    ### Notes
    The SE uses the first-stage ESS = N·π̂²·Var(z) as the relevance measure.
    """

    bhat  = (z' * y) / (z' * x)   # β̂_IV = (z'y)/(z'x)

    n     = length(y)
    pihat = z \ x                  # first-stage coefficient π̂
    ESS   = n * pihat^2 * var(z)   # first-stage explained sum of squares
    uhat  = y - x * bhat
    s     = (uhat' * uhat) / n
    se    = sqrt(s / ESS)

    t = bhat / se

    return (; bhat, se, t)

end
Out[5]:
iv_estimator (generic function with 1 method)
In [6]:
function simulate_distribution(; b=0, F, rho, n=1000, rep=10000)

    """
    Monte Carlo simulation of OLS and IV estimator distributions.

    Creates `rep` independent datasets from dgp_keane_neal and collects
    estimates, standard errors, and t-statistics for OLS, IV, and the
    Anderson-Rubin (AR) test statistic.

    ### Input
    - `b`   -- true structural coefficient β (default 0)
    - `F`   -- population first-stage F-statistic
    - `rho` -- degree of endogeneity ρ
    - `n`   -- sample size (default 1000)
    - `rep` -- number of Monte Carlo replications (default 10,000)

    ### Output (named tuple of rep-length vectors)
    - `bols_dst`, `sols_dst`, `tols_dst` -- OLS estimate, SE, t-statistic
    - `biv_dst`,  `siv_dst`,  `tiv_dst`  -- IV  estimate, SE, t-statistic
    - `ar_dst`                            -- AR t-statistic (OLS of Y on Z)
    """

    bols_dst = Vector{Float64}(undef, rep)
    sols_dst = Vector{Float64}(undef, rep)
    tols_dst = Vector{Float64}(undef, rep)
    biv_dst  = Vector{Float64}(undef, rep)
    siv_dst  = Vector{Float64}(undef, rep)
    tiv_dst  = Vector{Float64}(undef, rep)
    ar_dst   = Vector{Float64}(undef, rep)

    for i in 1:rep
        x, y, z = dgp_keane_neal(b=b, F=F, rho=rho, n=n)

        bols_dst[i], sols_dst[i], tols_dst[i] = ols_estimator(x, y)
        biv_dst[i],  siv_dst[i],  tiv_dst[i]  = iv_estimator(x, y, z)
        ar_dst[i] = ols_estimator(z, y).t   # AR: regress Y on Z directly
    end

    return (; bols_dst, sols_dst, tols_dst, biv_dst, siv_dst, tiv_dst, ar_dst)

end
Out[6]:
simulate_distribution (generic function with 1 method)
In [7]:
function power_function(; brange=-1.00:0.10:1.00, F, rho, n=1000, rep=10000)

    """
    Computes empirical power of the OLS t-test, IV t-test, and AR test.

    Power = Pr(reject H₀: β = 0 | true β).

    ### Input
    - `brange` -- range of true β values (default -1.0:0.1:1.0)
    - `F`      -- population first-stage F-statistic
    - `rho`    -- degree of endogeneity ρ
    - `n`      -- sample size (default 1000)
    - `rep`    -- Monte Carlo replications per β value (default 10,000)

    ### Output
    - `brange`    -- the β grid (passed through)
    - `power_ols` -- empirical power of OLS t-test at each β
    - `power_iv`  -- empirical power of IV  t-test at each β
    - `power_ar`  -- empirical power of AR  test  at each β
    """

    power_ols = similar(brange)
    power_iv  = similar(brange)
    power_ar  = similar(brange)

    for (i, b) in enumerate(brange)
        sim         = simulate_distribution(b=b, F=F, rho=rho, n=n, rep=rep)
        power_ols[i] = mean(abs.(sim.tols_dst) .> 1.96)
        power_iv[i]  = mean(abs.(sim.tiv_dst)  .> 1.96)
        power_ar[i]  = mean(abs.(sim.ar_dst)   .> 1.96)
    end

    return brange, power_ols, power_iv, power_ar

end
Out[7]:
power_function (generic function with 1 method)

Creating DGPs¶

I'm creating two containers that store my simulated distributions:

  • dgp_zero: 10,000 samples from a DGP in which $\rho=0$ and $F=73.75$ (no endogeneity, strong IV). This serves as the ideal reference case for OLS.

  • dgps: a $3 \times 5$ array of simulations, one for each combination of $\rho \in \{0.10, 0.30, 0.50\}$ and $F \in \{1.82, 2.30, 10, 29.44, 73.75\}$.

The five $F$ values span the range from the weakest case (population $F = 1.82$, sample $F \approx 9$) to a very strong instrument (population $F = 73.75$, sample $F \approx 105$).

In [8]:
Random.seed!(1234)   # set seed for reproducibility

# Reference case: no endogeneity, strong IV
dgp_zero = simulate_distribution(rho=0, F=73.75)

# Grid of DGPs
parms_rho = (0.10, 0.30, 0.50)
parms_F   = (1.82, 2.30, 10.00, 29.44, 73.75)

dgps = [simulate_distribution(rho=rho, F=F) for rho in parms_rho, F in parms_F];

Standard Errors¶

Let's start by looking at histograms of IV standard errors under different parameter combinations.

Reading the grid: Each panel corresponds to one $(\rho, F)$ pair. The top-right panel ($\rho=0.10$, $F=73.75$) is our benchmark — strong IV with low endogeneity. We truncate each histogram at the 97th percentile to make the shape visible; heavy-tailed distributions under weak IV would extend much further to the right.

In [9]:
plt = plot(
    layout      = (length(parms_rho), length(parms_F)),
    size        = (1800, 700),
    plot_title  = "Empirical Distribution of IV Standard Errors (truncated at 97th percentile)",
    plot_titlefontsize = 13)

for (i, rho) in enumerate(parms_rho), (j, F) in enumerate(parms_F)
    k = length(parms_F) * (i-1) + j
    histogram!(plt,
        dgps[i,j].siv_dst,
        normalize  = true,
        subplot    = k,
        bins       = range(0, quantile(dgps[i,j].siv_dst, 0.97), length=51),
        color      = "#6C9BC2",
        fillalpha  = 0.5,
        linecolor  = :white,
        linewidth  = 0.5,
        legend     = false,
        title      = L"\rho = %$(rho),\; F = %$(F)",
        titlefontsize = 10)
end

display(plt)

Reading the Histograms¶

The top-right panel is the benchmark: a compact, right-skewed distribution of standard errors, as expected for a well-behaved IV estimator.

Moving left (weaker $F$) or down (higher $\rho$), the tails grow dramatically. Under weak instruments the standard errors are sometimes enormous — and those enormous values are not even visible because we truncated the histogram at the 97th percentile!

Why does this matter? Occasionally the standard errors can be very small, which would lead the IV t-test to incorrectly flag the estimate as precise and significant. The next section makes this concrete.

Plotting IV Estimates vs Their Standard Errors¶

Keane & Neal had the clever idea to scatter-plot each IV estimate $\hat{\beta}_{\text{IV}}$ against its standard error. This "funnel plot" reveals a structural flaw of the IV estimator under weak instruments.

We first study the best possible scenario: OLS under $\rho=0$ (no endogeneity) with a strong IV. I use the container dgp_zero for this exercise.

In [10]:
plot(dgp_zero.bols_dst, dgp_zero.sols_dst,
    seriestype  = :scatter,
    markersize  = 1.5,
    markeralpha = 0.3,
    markerstrokewidth = 0,
    legend      = false,
    size        = (800, 600),
    title       = "$(length(dgp_zero.bols_dst)) OLS estimates vs their standard errors\n(ρ = 0, popF = 73.75)",
    xlabel      = L"OLS estimate $\hat{\beta}_{\mathrm{OLS}}$",
    ylabel      = "Standard error")
Out[10]:

The cloud of estimates is roughly circular — there is no apparent relationship between the OLS estimate and its standard error.

Let's now add colour to highlight which estimates lead to a rejection of $H_0:\beta=0$ at the 5% level (i.e. $|t_{\text{OLS}}| > 1.96$), and overlay the approximate rejection boundaries $\text{se} = |\hat{\beta}|/1.96 \approx |\hat{\beta}|/2$.

In [11]:
rejected_ols = abs.(dgp_zero.tols_dst) .> 1.96

plt_ols = plot(dgp_zero.bols_dst, dgp_zero.sols_dst,
    seriestype        = :scatter,
    markersize        = 1.5,
    markeralpha       = 0.3,
    markerstrokewidth = 0,
    mc                = "#0072B2",
    legend            = false,
    size              = (800, 600),
    title             = "$(length(dgp_zero.bols_dst)) OLS estimates vs their standard errors\n(ρ = 0, popF = 73.75)",
    xlabel            = L"OLS estimate $\hat{\beta}_{\mathrm{OLS}}$",
    ylabel            = "Standard error")

plot!(plt_ols, dgp_zero.bols_dst[rejected_ols], dgp_zero.sols_dst[rejected_ols],
    seriestype        = :scatter,
    markersize        = 2.5,
    markeralpha       = 0.6,
    markerstrokewidth = 0,
    mc                = "#D55E00")

plot!(plt_ols, [0, 4], [0,  2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
plot!(plt_ols, [0, 4], [0, -2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
Out[11]:

Interpretation¶

These are 10,000 pairs of $\hat{\beta}_{\text{OLS}}$ and its standard error. In every replication the true $\beta$ was zero.

  • The OLS estimates are centred near zero with a small spread — exactly as expected.
  • The dashed blue lines mark the approximate rejection boundary $\text{se} = |\hat{\beta}|/2$. Points below the lines have small enough standard errors relative to the estimate to trigger rejection.
  • The orange points are estimates for which the OLS t-test (incorrectly) rejects $H_0:\beta=0$.
  • Key observation: there is no clear association between OLS estimates and their standard errors. The rejected estimates are simply those that happen to be far from zero.

Let's verify this by computing the correlation:

In [12]:
println("Correlation between OLS estimates and their SEs: ",
        round(cor(dgp_zero.bols_dst, dgp_zero.sols_dst), digits=4))
Correlation between OLS estimates and their SEs: 0.0021

A near-zero correlation is exactly what we want: standard errors should be uninformative about the direction or magnitude of the estimate.

Now let's do the same for the IV estimator under different parameter combinations. We start with the best case: low endogeneity ($\rho=0.10$) and a strong instrument ($F=73.75$).

In [13]:
rejected_iv_good = abs.(dgps[1,5].tiv_dst) .> 1.96

plt_iv_good = plot(dgps[1,5].biv_dst, dgps[1,5].siv_dst,
    seriestype        = :scatter,
    markersize        = 1.5,
    markeralpha       = 0.3,
    markerstrokewidth = 0,
    mc                = "#0072B2",
    legend            = false,
    size              = (800, 600),
    title             = "$(length(dgps[1,5].biv_dst)) IV estimates vs their standard errors\n(ρ = 0.10, popF = 73.75)",
    xlabel            = L"IV estimate $\hat{\beta}_{\mathrm{IV}}$",
    ylabel            = "Standard error")

plot!(plt_iv_good, dgps[1,5].biv_dst[rejected_iv_good], dgps[1,5].siv_dst[rejected_iv_good],
    seriestype        = :scatter,
    markersize        = 2.5,
    markeralpha       = 0.6,
    markerstrokewidth = 0,
    mc                = "#D55E00")

plot!(plt_iv_good, [0,4], [0, 2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
plot!(plt_iv_good, [0,4], [0,-2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
Out[13]:

The plot looks similar to the OLS reference case, though everything is spread out much more (wider ranges for both IV estimates and standard errors). Reassuringly, the correlation between estimates and standard errors is still near zero.

Now let's look at the worst case: $\rho = 0.50$ and $F=1.82$ (very weak instrument, moderate endogeneity).

In [14]:
rejected_iv_weak = abs.(dgps[3,1].tiv_dst) .> 1.96

plt_iv_weak = plot(dgps[3,1].biv_dst, dgps[3,1].siv_dst,
    seriestype        = :scatter,
    markersize        = 1.5,
    markeralpha       = 0.3,
    markerstrokewidth = 0,
    mc                = "#0072B2",
    legend            = false,
    size              = (800, 600),
    title             = "$(length(dgps[3,1].biv_dst)) IV estimates vs their standard errors\n(ρ = 0.50, popF = 1.82)",
    xlabel            = L"IV estimate $\hat{\beta}_{\mathrm{IV}}$",
    ylabel            = "Standard error")

plot!(plt_iv_weak, dgps[3,1].biv_dst[rejected_iv_weak], dgps[3,1].siv_dst[rejected_iv_weak],
    seriestype        = :scatter,
    markersize        = 2.5,
    markeralpha       = 0.6,
    markerstrokewidth = 0,
    mc                = "#D55E00")

plot!(plt_iv_weak, [0,4], [0, 2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
plot!(plt_iv_weak, [0,4], [0,-2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
Out[14]:

The extreme outliers make the plot unreadable. Let's trim the axes to the 99th percentile of the standard error and restrict the x-axis to $[-4, 4]$:

In [15]:
biv_w  = dgps[3,1].biv_dst
siv_w  = dgps[3,1].siv_dst
tiv_w  = dgps[3,1].tiv_dst
se_cap = quantile(siv_w, 0.99)

rejected_iv_trim = abs.(tiv_w) .> 1.96

plt_iv_trim = plot(biv_w, siv_w,
    seriestype        = :scatter,
    markersize        = 1.5,
    markeralpha       = 0.3,
    markerstrokewidth = 0,
    mc                = "#0072B2",
    xlims             = (-4, 4),
    ylims             = (0, min(4, se_cap)),
    legend            = false,
    size              = (800, 600),
    title             = "$(length(biv_w)) IV estimates vs their standard errors (outliers removed)\n(ρ = 0.50, popF = 1.82)",
    xlabel            = L"IV estimate $\hat{\beta}_{\mathrm{IV}}$",
    ylabel            = "Standard error")

plot!(plt_iv_trim, biv_w[rejected_iv_trim], siv_w[rejected_iv_trim],
    seriestype        = :scatter,
    markersize        = 2.5,
    markeralpha       = 0.6,
    markerstrokewidth = 0,
    mc                = "#D55E00")

plot!(plt_iv_trim, [0,4], [0, 2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
plot!(plt_iv_trim, [0,4], [0,-2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
vline!(plt_iv_trim, [0.50], lw=2, lc="#D55E00", linestyle=:dot)
Out[15]:

A Disturbing Pattern¶

Even after removing outliers, something is clearly wrong. Unlike the OLS or strong-IV cases, there is now a visible negative association between IV estimates and their standard errors — large positive estimates tend to have small standard errors, while large negative estimates tend to have large standard errors.

The orange dotted vertical line marks the OLS bias at $\rho = 0.50$: the position toward which IV estimates are pulled when instruments are weak.

Notice also that the orange (rejected) points cluster on the positive side only. This is power asymmetry: the test rejects for positive estimates far more often than for negative ones of the same absolute magnitude.

Let's quantify the correlation (restricting to the non-outlier region):

In [16]:
keep = (abs.(biv_w) .<= 4) .& (siv_w .<= se_cap)
println("Correlation between IV estimates and their SEs (outliers removed): ",
        round(cor(biv_w[keep], siv_w[keep]), digits=4))
Correlation between IV estimates and their SEs (outliers removed): -0.1934

Why Are Large IV Estimates Associated with Small Standard Errors?¶

Keane & Neal explain this asymmetry on page 190 (their "Problem 2" and "Problem 4"). The argument is worth spelling out.

Recall the DGP (everything is scalar): $$ \begin{align*} Y_i &= \beta X_i + u_i\\ X_i &= \pi Z_i + v_i \end{align*} $$

The IV estimator is $$ \hat{\beta}_{\text{IV}} = \frac{s_{ZY}}{s_{XZ}} = \beta + \frac{s_{Zu}}{s_{XZ}} $$

Using the decomposition $v_i = \rho u_i + \sqrt{1-\rho^2}\,\eta_i$ from the DGP, we can write the denominator as: $$ s_{XZ} = \pi s_Z^2 + s_{Zv} = \pi s_Z^2 + \rho\, s_{Zu} + \sqrt{1-\rho^2}\, s_{Z\eta} \approx \sigma_{XZ} + \rho\, s_{Zu} $$ where the approximation replaces $\pi s_Z^2$ by its population counterpart $\sigma_{XZ}$ and drops $\sqrt{1-\rho^2}\, s_{Z\eta}$, which (unlike $s_{Zu}$) is independent of the numerator and so plays no systematic role in the bias mechanism.

On page 189, Keane & Neal give a heuristic definition of a strong instrument:

An instrument is strong when $s_{XZ}$ and $\sigma_{XZ}$ have the same sign — that is, when the random component $\rho\, s_{Zu}$ is not large enough to flip the sign of the denominator.

In the strong IV case ($\sigma_{XZ}$ dominates, so $s_{XZ}$ keeps the sign of $\sigma_{XZ} > 0$), the sign of the sampling error $\hat{\beta}_{\text{IV}} - \beta = s_{Zu}/s_{XZ}$ is determined entirely by the sign of $s_{Zu}$. Because $s_{Zu}$ is symmetric around zero, positive and negative deviations from $\beta$ are equally likely:

In the strong IV case, the median of $\hat{\beta}_{\text{IV}}$ is zero (= the true $\beta$). ✓

In the weak IV case ($\sigma_{XZ}$ is small relative to $\rho s_{Zu}$), something different happens:

  • A large positive $s_{Zu}$ makes $\hat{\beta}_{\text{IV}} > \beta$. This is the usual bias direction.
  • A large negative $s_{Zu}$ (i.e. $|s_{Zu}|$ large) flips the sign of $s_{XZ}$ (since $\sigma_{XZ} + \rho s_{Zu} < 0$ when $\rho > 0$). Then both numerator ($s_{Zu} < 0$) and denominator ($s_{XZ} < 0$) are negative, so $\hat{\beta}_{\text{IV}} > \beta$ again!

In other words, both large positive and large negative realisations of $s_{Zu}$ lead to $\hat{\beta}_{\text{IV}} > \beta$:

In the weak IV case, the median of $\hat{\beta}_{\text{IV}}$ is biased in the direction of $\operatorname{sign}(\rho)$.
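This median-bias claim is straightforward to check by simulation. The sketch below (my own check, not from the paper) sets $\beta = 0$ and $\rho = 0.5$ and compares the median of $\hat{\beta}_{\text{IV}}$ under a very weak instrument ($F = 1.82$) and a strong one ($F = 73.75$):

```julia
using Random, Statistics

# Median of β̂_IV when the true β = 0 and ρ = 0.5, for a very weak and a
# strong instrument (5,000 replications each).
Random.seed!(42)

function median_biv(F; n=1000, rho=0.5, rep=5000)
    piv = sqrt(F / n)
    bhats = map(1:rep) do _
        u, eta, z = randn(n), randn(n), randn(n)
        x = piv .* z .+ rho .* u .+ sqrt(1 - rho^2) .* eta
        y = u                        # true β = 0, so Y = u
        (z' * y) / (z' * x)          # just-identified IV estimate
    end
    return median(bhats)
end

m_weak   = median_biv(1.82)     # clearly positive: pulled toward sign(ρ)
m_strong = median_biv(73.75)    # essentially zero: the true β
```

Under the weak instrument the median sits visibly above the true value of zero, while under the strong instrument it is centred at zero: exactly the claim above.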

The Compounding Effect on Standard Errors¶

Recall from the Week 7 lecture notes: $$ \operatorname{se}(\hat{\beta}_{\text{IV}}) = \frac{s_u \cdot s_Z}{\sqrt{N} \cdot |s_{XZ}|} \approx \frac{s_u \cdot s_Z}{\sqrt{N} \cdot |\sigma_{XZ} + \rho\, s_{Zu}|} $$

Since $\rho > 0$, when $|s_{Zu}|$ is large the denominator $|\sigma_{XZ} + \rho\, s_{Zu}|$ is also large (regardless of sign), and so the standard error is small.

In conclusion, two things happen simultaneously when $|s_{Zu}|$ is large and $\rho > 0$:

  1. The IV estimator is biased toward OLS (positive direction for $\rho > 0$)
  2. Its standard error is spuriously small

This means that the underlying t-test rejects more often — but only in the positive direction. That is precisely the power asymmetry visible in the funnel plot above.
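A back-of-the-envelope illustration of the mechanism, with purely illustrative values $\sigma_{XZ} = 0.1$, $\rho = 0.5$, $s_u = s_Z = 1$, $N = 1000$, and true $\beta = 0$ (these numbers are chosen for readability, not estimated from anything):

```julia
# Illustrative values (not from data): σ_XZ = 0.1, ρ = 0.5, s_u = s_Z = 1,
# N = 1000, true β = 0.
sigma_XZ, rho, N = 0.1, 0.5, 1000

results = map((-1.0, 0.0, 1.0)) do s_Zu
    s_XZ = sigma_XZ + rho * s_Zu          # (approximate) denominator of β̂_IV
    bhat = s_Zu / s_XZ                    # β̂_IV − β = s_Zu / s_XZ
    se   = 1 / (sqrt(N) * abs(s_XZ))      # se ≈ s_u·s_Z / (√N·|s_XZ|)
    (; s_Zu, bhat, se)
end
```

Both extreme draws of $s_{Zu}$ push $\hat{\beta}_{\text{IV}}$ well above the true value of zero with a comparatively small standard error, while the moderate draw produces a large one: exactly the negative association visible in the funnel plot.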

In [17]:
plt_grid = plot(
    layout     = (length(parms_rho), length(parms_F)),
    size       = (1800, 800),
    plot_title = "IV Estimate vs Standard Error for Different DGPs",
    plot_titlefontsize = 13)

for (i, rho) in enumerate(parms_rho)
    for (j, F) in enumerate(parms_F)
        k      = length(parms_F) * (i-1) + j
        b_vec  = dgps[i,j].biv_dst
        s_vec  = dgps[i,j].siv_dst
        t_vec  = dgps[i,j].tiv_dst
        rejected = abs.(t_vec) .> 1.96
        ylim   = min(4, quantile(s_vec, 0.99))

        plot!(plt_grid, b_vec, s_vec,
            seriestype        = :scatter,
            markersize        = 1,
            markeralpha       = 0.2,
            markerstrokewidth = 0,
            mc                = "#0072B2",
            subplot           = k,
            xlims             = (-4, 4),
            ylims             = (0, ylim),
            legend            = false,
            title             = L"\rho = %$(rho),\; F = %$(F)",
            titlefontsize     = 10)

        plot!(plt_grid, b_vec[rejected], s_vec[rejected],
            seriestype        = :scatter,
            markersize        = 1.5,
            markeralpha       = 0.4,
            markerstrokewidth = 0,
            mc                = "#D55E00",
            subplot           = k)

        plot!(plt_grid, [0,4], [0, 2], seriestype=:straightline, subplot=k,
            lc="#0072B2", linestyle=:dash, lw=1)
        plot!(plt_grid, [0,4], [0,-2], seriestype=:straightline, subplot=k,
            lc="#0072B2", linestyle=:dash, lw=1)
        vline!(plt_grid, [rho], subplot=k, lw=1.5, lc="#D55E00", linestyle=:dot)
    end
end

display(plt_grid)

Power Functions¶

The funnel plots above show that weak instruments distort the standard errors in a systematic way. But there is another consequence that is just as important: low power.

The funnel plots suggested that weak instruments produce spurious rejections in one direction; it stands to reason that there are correspondingly fewer rejections in the other direction. Power is being "borrowed" from one side of the distribution and "spent" on the other. Let's make this precise.

How to compute power:

Fix $\rho$ and $F$. Then:

  1. Set the true $\beta = -1.0$
  2. Generate 10,000 samples, compute 10,000 t-statistics
  3. Record the fraction of samples for which $|t| > 1.96$ — this is the empirical power at $\beta = -1.0$
  4. Repeat for $\beta = -0.9, -0.8, \ldots, 1.0$

The resulting power curve shows how likely the test is to detect a non-zero $\beta$ at each point in the parameter space. When $\beta = 0$ the power equals the size.

In [18]:
# Uses a fine beta grid for the benchmark case (this may take a minute)
brange_fine = -1:0.025:1
brange, pow_ols, pow_iv, pow_ar = power_function(brange=brange_fine, F=73.75, rho=0)

plot(brange, [pow_ols, pow_iv],
    xticks   = -1:0.1:1,
    label    = ["OLS" "IV"],
    linewidth = 2.5,
    linestyle = [:solid :dash],
    lc        = ["#0072B2" "#E69F00"],
    legend    = :bottomright,
    size      = (900, 500),
    title     = L"Empirical Power: OLS vs IV ($\rho = 0$, popF = 73.75)",
    xlabel    = L"True $\beta$",
    ylabel    = "Power")
hline!([0.05], linestyle=:dot, lc=:gray50, lw=1.2, label="5% nominal size")
ylims!(0, 1)
Out[18]:

Reading the Power Curve¶

  • The blue solid line is the OLS power curve. It shows the probability of rejecting $H_0:\beta=0$ for each true value of $\beta$. When $\beta=0$, power = size ≈ 5%.
  • The orange dashed line is the IV power curve. It lies below OLS throughout.
  • The horizontal dotted line marks the 5% nominal size.

For example: when the true $\beta = 0.10$, OLS rejects with probability ≈ 90%, while IV rejects with only ≈ 10% probability. This illustrates the fundamental precision loss from using IV when OLS is valid.

Lesson: Always use OLS when there is no endogeneity. IV sacrifices power, and when $\rho=0$ there is no benefit to justify that cost.

Digression: Effect Size¶

Keane & Neal note that in their DGP, a value of $\beta = 0.20$ constitutes a large effect ("This is a large effect in typical empirical applications"). Why?

In our normalised DGP, a one-standard-deviation increase in $X$ raises $Y$ by $\beta$ standard deviations. A value of $\beta = 0.20$ is a substantial effect by the standards of the economics literature.

Implications at $\beta = 0.20$:

  • OLS power: essentially 100% — you would almost certainly detect such an effect.
  • IV power: only around 40% — you have less than a coin-flip chance of detecting it.

This illustrates why using IV "just to be safe" has a real cost even when the instruments are strong.
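The standard-deviation-units claim can be checked with one large sample from the DGP (benchmark parameters $\rho = 0.10$, population $F = 73.75$, $\beta = 0.20$; the sample size is inflated here only to reduce simulation noise):

```julia
using Random, Statistics

# One large sample from the DGP: both X and Y have (approximately) unit
# standard deviation, so β = 0.20 really is an effect in sd units.
Random.seed!(1)
n, F, rho, beta = 1_000_000, 73.75, 0.10, 0.20
piv = sqrt(F / n)

u, eta, z = randn(n), randn(n), randn(n)
x = piv .* z .+ rho .* u .+ sqrt(1 - rho^2) .* eta
y = beta .* x .+ u

sx, sy = std(x), std(y)   # ≈ 1.00 and ≈ 1.04
```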

Now let's examine how the power curves look under endogeneity ($\rho > 0$) and for the weaker instruments.

In [19]:
# This cell computes 15 power curves (takes a few minutes)
plt_pow = plot(
    layout     = (length(parms_rho), length(parms_F)),
    size       = (1800, 800),
    plot_title = "Empirical Power: OLS (solid) vs IV (dashed)",
    plot_titlefontsize = 13)

for (i, rho) in enumerate(parms_rho)
    for (j, F) in enumerate(parms_F)
        k = length(parms_F) * (i-1) + j
        brange, pow_ols, pow_iv, pow_ar = power_function(F=F, rho=rho)
        emp_size = round(100 * pow_iv[brange .== 0.0][], digits=2)

        plot!(plt_pow, brange, [pow_ols, pow_iv],
            label     = ["OLS" "IV"],
            linestyle = [:solid :dash],
            lc        = ["#0072B2" "#E69F00"],
            legend    = false,
            subplot   = k,
            margin    = 5mm,
            title     = L"\rho = %$(rho),\; F = %$(F)" * "\nSize = $(emp_size)%",
            titlefontsize = 9)

        hline!([0.05], linestyle=:dot, lc=:gray50, lw=1, subplot=k, legend=false)
        ylims!(0, 1)
        xlabel!(L"True $\beta$")
    end
end

display(plt_pow)

Reading the Power Grid¶

The OLS power curve (blue solid):

  • Under endogeneity ($\rho > 0$), the OLS curve shifts to the left. For example, when $\rho = 0.3$, OLS rejects $H_0:\beta=0$ with near-100% probability for any $\beta > 0$, even small ones. This looks like high power but is an artifact of bias: OLS is detecting the endogeneity bias, not the true $\beta$.
  • The cost: OLS has very low power for negative $\beta$, because the bias pushes estimates toward positive values. If the true effect is negative and small, OLS will miss it.

The IV power curve (orange dashed):

  • Even at population $F = 10$ (which maps to sample $F \approx 23$, well above the rule of thumb), IV power is very low.
  • The IV power curve is roughly symmetric around $\beta = 0$ only when $F$ is large. For weak instruments, power asymmetry makes the curve lopsided.
  • The rule of thumb ($F > 2.30$, sample $F > 10$) is clearly insufficient for good power.

The empirical size (power at $\beta=0$) is shown in each panel title. For weak instruments it can exceed 5% considerably, consistent with Week 8 findings.

Bottom line: The rule of thumb that sample $F > 10$ is not sufficient for good power. Even $F = 10$ can produce very low power and distorted tests. Keane & Neal's recommendation is that sample $F$ should be much larger than 10 — and even then, one should report the IV standard error and effect size transparently.

The AR Test¶

Keane & Neal propose a simple remedy to the power asymmetry problem.

Anderson-Rubin (AR) test: Instead of running the IV regression and testing $H_0:\beta=0$ using the IV t-statistic, simply regress $Y$ directly on $Z$ (the reduced form) and test whether the coefficient is zero. The AR t-statistic is just the OLS t-statistic from this reduced-form regression.

Why does this work? Under the null $H_0:\beta=0$, the structural equation becomes $Y_i = u_i$, which is uncorrelated with $Z_i$ (by the exclusion restriction). So the AR statistic is a valid test of the null. Under the alternative, a non-zero $\beta$ means $Z$ does predict $Y$ through $X$, so the test has power.

Crucially, the AR test bypasses the problematic denominator $s_{XZ}$ that causes the standard IV t-test to be unreliable under weak instruments.
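The size claim is cheap to verify: under $H_0:\beta=0$ we have $Y_i = u_i$, so the AR statistic is an ordinary OLS t-statistic computed on data that genuinely satisfy the null, whatever the strength of the instrument. A minimal Monte Carlo sketch:

```julia
using Random, Statistics

# Rejection rate of the AR test at the 5% level when H₀: β = 0 is true.
# Under the null Y = u, so the first stage (and hence F and ρ) is irrelevant.
Random.seed!(2024)
n = 1000

rej = mean(1:4000) do _
    z, u = randn(n), randn(n)
    y = u                                    # H₀ holds: β = 0
    g = z \ y                                # reduced-form coefficient
    r = y - z * g
    t = g / sqrt((r' * r) / n / (z' * z))    # AR t-statistic
    abs(t) > 1.96
end

rej   # ≈ 0.05
```

The rejection rate stays at the nominal 5% level; contrast this with the IV t-test sizes reported in the power-grid panel titles above.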

In [20]:
# This cell re-uses the power_function results (re-computes to match earlier)
plt_ar = plot(
    layout     = (length(parms_rho), length(parms_F)),
    size       = (1800, 800),
    plot_title = "Empirical Power: AR (solid) vs IV (dashed)",
    plot_titlefontsize = 13)

for (i, rho) in enumerate(parms_rho)
    for (j, F) in enumerate(parms_F)
        k = length(parms_F) * (i-1) + j
        brange, pow_ols, pow_iv, pow_ar = power_function(F=F, rho=rho)
        emp_size = round(100 * pow_ar[brange .== 0.0][], digits=2)

        plot!(plt_ar, brange, [pow_ar, pow_iv],
            label     = ["AR" "IV"],
            linestyle = [:solid :dash],
            lc        = ["#009E73" "#E69F00"],
            legend    = false,
            subplot   = k,
            margin    = 5mm,
            title     = L"\rho = %$(rho),\; F = %$(F)" * "\nSize = $(emp_size)%",
            titlefontsize = 9)

        hline!([0.05], linestyle=:dot, lc=:gray50, lw=1, subplot=k, legend=false)
        ylims!(0, 1)
        xlabel!(L"True $\beta$")
    end
end

display(plt_ar)

Discussion and Takeaways¶

What the AR vs IV Comparison Shows¶

  • The AR power curve (green) is symmetric around $\beta=0$: it is equally likely to reject for positive and negative effects of equal magnitude. This is the correct behaviour for a well-calibrated test.
  • The IV power curve (orange) is asymmetric under weak instruments and endogeneity: it rejects too easily for estimates near the OLS bias and too rarely for estimates in the opposite direction.
  • For strong instruments ($F = 73.75$) the AR and IV curves are nearly identical — as they should be.

Practical Recommendations (Keane & Neal)¶

| Recommendation | Rationale |
|---|---|
| Do not use the standard IV t-test for inference when instruments might be weak | Power asymmetry leads to misleading rejections in one direction |
| Use the AR test instead | Symmetric power, valid size even under weak instruments |
| Sample F should be much larger than 10 | Even $\widehat{F} = 23$ (popF = 10) produces poor power |
| Report the first-stage F transparently | Readers can judge the severity of the weak-IV problem |

How Large Should F Be?¶

Looking at the AR power curves, even a population $F$ of 29.44 (sample $F \approx 50$) delivers only modest power against an effect as large as $\beta = 0.20$, and power at that effect size remains well below 80% even at population $F = 73.75$ (sample $F \approx 105$). The familiar rule of thumb (sample $F > 10$) is nowhere near sufficient.

Connection to Week 7¶

Recall from Week 7 that the approximate 2SLS bias is $\approx L\rho/F$. This bias formula already hinted that $F$ must be large relative to $L$ and $\rho$. The current analysis makes the same point from the perspective of hypothesis testing: a small $F$ not only biases the estimate but also distorts both size and power in ways that make standard inference unreliable.