Last week, following Stock & Yogo (2005), we answered the question: how large does the first-stage F-statistic need to be for instruments to be considered "strong"?
Quick recap:
The mapping between population $F$ and sample $F$ (via the noncentral $\chi^2_1$ distribution):
| Population F | Sample F (95th pct) | Worst-case size |
|---|---|---|
| 1.82 | 8.96 | 15% |
| 2.30 | 10.00 | 13.5% |
| 5.78 | 16.38 | 10% |
| 10.00 | 23.10 | 8.6% |
| 29.44 | 50.00 | 6.4% |
| 73.75 | 104.70 | 5.0% |
(Source: Table 1 of Keane & Neal 2024.)
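The "Sample F (95th pct)" column can be reproduced directly: under the toy DGP used in this lecture, the sample first-stage F-statistic is approximately distributed as noncentral $\chi^2_1$ with noncentrality equal to the population $F$, so each cutoff is just a quantile of that distribution. A minimal check using Distributions.jl (loaded below); the helper name `sample_F_cutoff` is mine:

```julia
using Distributions

# 95th percentile of the sample F implied by a given population F:
# sample F  ≈  noncentral χ²₁ with noncentrality λ = population F.
sample_F_cutoff(popF) = quantile(NoncentralChisq(1, popF), 0.95)

for popF in (1.82, 2.30, 5.78, 10.00, 29.44, 73.75)
    println("population F = ", popF,
            "  →  sample F (95th pct) ≈ ", round(sample_F_cutoff(popF), digits=2))
end
```

The printed values match the table above up to rounding.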
Week 8 focused entirely on size — what happens when $H_0:\beta=0$ is true. This week we extend the analysis:
Keane and Neal point out that the story gets more complicated once we pay attention to the power function.
While it is important to have a test with good size properties, we also want the test to be powerful — able to detect a real effect when one exists.
Recall:
Size is the probability of rejecting the null when it is true.
Power is the probability of rejecting the null when it is false.
We learned that values for $\rho$ of practical relevance fall between zero and 0.50. For our simulations we therefore focus on
| Values for $\rho$ of practical relevance |
|---|
| 0.00 (no endogeneity) |
| 0.10 |
| 0.30 |
| 0.50 |
We continue with the toy model of Keane & Neal (2024) p. 193, the same as last week:
$$ \begin{align*} Y_i &= \beta X_i + u_i\\ X_i &= \pi Z_i + v_i\\ v_i &= \rho\, u_i + \sqrt{1-\rho^2}\, \eta_i \end{align*} $$where $u_i$, $\eta_i$, and $Z_i$ are i.i.d. standard normal, so that $\operatorname{Var}(v_i) = 1$ and $\rho = \operatorname{Corr}(u_i, v_i)$ measures the degree of endogeneity.
We set $\pi = \sqrt{F/N}$ to control instrument strength via the population $F$-statistic.
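This normalisation works because every shock has unit variance. The population first-stage F-statistic is then

$$ F_{\text{pop}} = \frac{N\,\pi^2\,\operatorname{Var}(Z)}{\operatorname{Var}(v)} = N \cdot \frac{F}{N} \cdot \frac{1}{1} = F, $$

so choosing $\pi = \sqrt{F/N}$ pins the population $F$ at exactly the value we want, independent of the sample size.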
The cells below define the simulation machinery for this lecture. The functions dgp_keane_neal, ols_estimator, iv_estimator, and simulate_distribution are identical to those in Week 8. The power_function is new: it computes empirical power across a grid of true $\beta$ values. Feel free to skim them now and refer back as needed.
using Distributions, Random, Statistics
using Plots, LaTeXStrings
using Plots.PlotMeasures: mm
Plots.theme(:wong2)
gr(fmt=:png)
default(
fontfamily = "Computer Modern",
titlefontsize = 12,
guidefontsize = 11,
tickfontsize = 9,
legendfontsize = 9,
left_margin = 12mm,
bottom_margin = 10mm,
gridalpha = 0.15,
framestyle = :box,
lw = 2,
size = (900, 500)
)
function dgp_keane_neal(; b=0, n=1000, F, rho)
"""
Generates one sample of size n following the DGP of Keane & Neal (2024) p. 193.
### Input
- `b` -- structural coefficient β (default 0)
- `n` -- sample size (default 1000)
- `F` -- population first-stage F-statistic
- `rho` -- degree of endogeneity ρ
### Output (named tuple)
- `x`, `y`, `z` -- (n×1) vectors for regressor, outcome, and instrument
"""
π = sqrt(F / n)
u = randn(n)
eta = randn(n)
z = randn(n)
v = rho * u + sqrt(1 - rho^2) * eta
x = π * z .+ v
y = b * x .+ u
return (; x, y, z)
end
dgp_keane_neal (generic function with 1 method)
function ols_estimator(x, y)
"""
OLS estimator for the simple linear model y = βx + u.
### Input
- `x` -- (n×1) regressor vector
- `y` -- (n×1) outcome vector
### Output (named tuple)
- `bhat` -- OLS estimate of β
- `se` -- standard error
- `t` -- t-statistic
"""
bhat = x \ y
uhat = y - x * bhat
s = (uhat' * uhat) / length(y)
se = sqrt(s / (x' * x))
t = bhat / se
return (; bhat, se, t)
end
ols_estimator (generic function with 1 method)
function iv_estimator(x, y, z)
"""
Just-identified IV estimator with one endogenous variable and one instrument.
### Input
- `x` -- (n×1) endogenous regressor
- `y` -- (n×1) outcome vector
- `z` -- (n×1) instrument vector
### Output (named tuple)
- `bhat` -- IV estimate of β
- `se` -- standard error (Keane & Neal 2024, p. 190)
- `t` -- t-statistic
### Notes
The SE uses the first-stage ESS = N·π̂²·Var(z) as the relevance measure.
"""
bhat = (z' * y) / (z' * x) # β̂_IV = (z'y)/(z'x)
n = length(y)
pihat = z \ x # first-stage coefficient π̂
ESS = n * pihat^2 * var(z) # first-stage explained sum of squares
uhat = y - x * bhat
s = (uhat' * uhat) / n
se = sqrt(s / ESS)
t = bhat / se
return (; bhat, se, t)
end
iv_estimator (generic function with 1 method)
function simulate_distribution(; b=0, F, rho, n=1000, rep=10000)
"""
Monte Carlo simulation of OLS and IV estimator distributions.
Creates `rep` independent datasets from dgp_keane_neal and collects
estimates, standard errors, and t-statistics for OLS, IV, and the
Anderson-Rubin (AR) test statistic.
### Input
- `b` -- true structural coefficient β (default 0)
- `F` -- population first-stage F-statistic
- `rho` -- degree of endogeneity ρ
- `n` -- sample size (default 1000)
- `rep` -- number of Monte Carlo replications (default 10,000)
### Output (named tuple of rep-length vectors)
- `bols_dst`, `sols_dst`, `tols_dst` -- OLS estimate, SE, t-statistic
- `biv_dst`, `siv_dst`, `tiv_dst` -- IV estimate, SE, t-statistic
- `ar_dst` -- AR t-statistic (OLS of Y on Z)
"""
bols_dst = Vector{Float64}(undef, rep)
sols_dst = Vector{Float64}(undef, rep)
tols_dst = Vector{Float64}(undef, rep)
biv_dst = Vector{Float64}(undef, rep)
siv_dst = Vector{Float64}(undef, rep)
tiv_dst = Vector{Float64}(undef, rep)
ar_dst = Vector{Float64}(undef, rep)
for i in 1:rep
x, y, z = dgp_keane_neal(b=b, F=F, rho=rho, n=n)
bols_dst[i], sols_dst[i], tols_dst[i] = ols_estimator(x, y)
biv_dst[i], siv_dst[i], tiv_dst[i] = iv_estimator(x, y, z)
ar_dst[i] = ols_estimator(z, y).t # AR: regress Y on Z directly
end
return (; bols_dst, sols_dst, tols_dst, biv_dst, siv_dst, tiv_dst, ar_dst)
end
simulate_distribution (generic function with 1 method)
function power_function(; brange=-1.00:0.10:1.00, F, rho, n=1000, rep=10000)
"""
Computes empirical power of the OLS t-test, IV t-test, and AR test.
Power = Pr(reject H₀: β = 0 | true β).
### Input
- `brange` -- range of true β values (default -1.0:0.1:1.0)
- `F` -- population first-stage F-statistic
- `rho` -- degree of endogeneity ρ
- `n` -- sample size (default 1000)
- `rep` -- Monte Carlo replications per β value (default 10,000)
### Output
- `brange` -- the β grid (passed through)
- `power_ols` -- empirical power of OLS t-test at each β
- `power_iv` -- empirical power of IV t-test at each β
- `power_ar` -- empirical power of AR test at each β
"""
power_ols = similar(brange)
power_iv = similar(brange)
power_ar = similar(brange)
for (i, b) in enumerate(brange)
sim = simulate_distribution(b=b, F=F, rho=rho, n=n, rep=rep)
power_ols[i] = mean(abs.(sim.tols_dst) .> 1.96)
power_iv[i] = mean(abs.(sim.tiv_dst) .> 1.96)
power_ar[i] = mean(abs.(sim.ar_dst) .> 1.96)
end
return brange, power_ols, power_iv, power_ar
end
power_function (generic function with 1 method)
I'm creating two containers that store my simulated distributions:
dgp_zero: 10,000 samples from a DGP in which $\rho=0$ and $F=73.75$ (no endogeneity, strong IV). This serves as the ideal reference case for OLS.
dgps: a $3 \times 5$ array of simulations, one for each combination of $\rho \in \{0.10, 0.30, 0.50\}$ and $F \in \{1.82, 2.30, 10, 29.44, 73.75\}$.
The five $F$ values span the range from the weakest case (population $F = 1.82$, sample $F \approx 9$) to a very strong instrument (population $F = 73.75$, sample $F \approx 105$).
Random.seed!(1234) # set seed for reproducibility
# Reference case: no endogeneity, strong IV
dgp_zero = simulate_distribution(rho=0, F=73.75)
# Grid of DGPs
parms_rho = (0.10, 0.30, 0.50)
parms_F = (1.82, 2.30, 10.00, 29.44, 73.75)
dgps = [simulate_distribution(rho=rho, F=F) for rho in parms_rho, F in parms_F];
Let's start by looking at histograms of IV standard errors under different parameter combinations.
Reading the grid: Each panel corresponds to one $(\rho, F)$ pair. The top-right panel ($\rho=0.10$, $F=73.75$) is our benchmark — strong IV with low endogeneity. We truncate each histogram at the 97th percentile to make the shape visible; heavy-tailed distributions under weak IV would extend much further to the right.
plt = plot(
layout = (length(parms_rho), length(parms_F)),
size = (1800, 700),
plot_title = "Empirical Distribution of IV Standard Errors (truncated at 97th percentile)",
plot_titlefontsize = 13)
for (i, rho) in enumerate(parms_rho), (j, F) in enumerate(parms_F)
k = length(parms_F) * (i-1) + j
histogram!(plt,
dgps[i,j].siv_dst,
normalize = true,
subplot = k,
bins = range(0, quantile(dgps[i,j].siv_dst, 0.97), length=51),
color = "#6C9BC2",
fillalpha = 0.5,
linecolor = :white,
linewidth = 0.5,
legend = false,
title = L"\rho = %$(rho),\; F = %$(F)",
titlefontsize = 10)
end
display(plt)
The top-right panel is the benchmark: a compact, right-skewed distribution of standard errors, as expected for a well-behaved IV estimator.
Moving left (weaker $F$) or down (higher $\rho$), the tails grow dramatically. Under weak instruments the standard errors are sometimes enormous — and those enormous values are not even visible because we truncated the histogram at the 97th percentile!
Why does this matter? Occasionally the standard errors can be very small, which would lead the IV t-test to incorrectly flag the estimate as precise and significant. The next section makes this concrete.
Keane & Neal had the clever idea to scatter-plot each IV estimate $\hat{\beta}_{\text{IV}}$ against its standard error. This "funnel plot" reveals a structural flaw of the IV estimator under weak instruments.
We first study the best possible scenario: OLS under $\rho=0$ (no endogeneity) with a strong IV. I use the container dgp_zero for this exercise.
plot(dgp_zero.bols_dst, dgp_zero.sols_dst,
seriestype = :scatter,
markersize = 1.5,
markeralpha = 0.3,
markerstrokewidth = 0,
legend = false,
size = (800, 600),
title = "$(length(dgp_zero.bols_dst)) OLS estimates vs their standard errors\n(ρ = 0, popF = 73.75)",
xlabel = L"OLS estimate $\hat{\beta}_{\mathrm{OLS}}$",
ylabel = "Standard error")
The cloud of estimates is roughly circular — there is no apparent relationship between the OLS estimate and its standard error.
Let's now add colour to highlight which estimates lead to a rejection of $H_0:\beta=0$ at the 5% level (i.e. $|t_{\text{OLS}}| > 1.96$), and overlay the approximate rejection boundaries $\text{se} = |\hat{\beta}|/1.96 \approx |\hat{\beta}|/2$.
rejected_ols = abs.(dgp_zero.tols_dst) .> 1.96
plt_ols = plot(dgp_zero.bols_dst, dgp_zero.sols_dst,
seriestype = :scatter,
markersize = 1.5,
markeralpha = 0.3,
markerstrokewidth = 0,
mc = "#0072B2",
legend = false,
size = (800, 600),
title = "$(length(dgp_zero.bols_dst)) OLS estimates vs their standard errors\n(ρ = 0, popF = 73.75)",
xlabel = L"OLS estimate $\hat{\beta}_{\mathrm{OLS}}$",
ylabel = "Standard error")
plot!(plt_ols, dgp_zero.bols_dst[rejected_ols], dgp_zero.sols_dst[rejected_ols],
seriestype = :scatter,
markersize = 2.5,
markeralpha = 0.6,
markerstrokewidth = 0,
mc = "#D55E00")
plot!(plt_ols, [0, 4], [0, 2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
plot!(plt_ols, [0, 4], [0, -2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
These are 10,000 combinations of $\hat{\beta}_{\text{OLS}}$ and its standard error. In every replication the true $\beta$ was zero.
Let's verify this by computing the correlation:
println("Correlation between OLS estimates and their SEs: ",
round(cor(dgp_zero.bols_dst, dgp_zero.sols_dst), digits=4))
Correlation between OLS estimates and their SEs: 0.0021
A near-zero correlation is exactly what we want: standard errors should be uninformative about the direction or magnitude of the estimate.
Now let's do the same for the IV estimator under different parameter combinations. We start with the best case: low endogeneity ($\rho=0.10$) and a strong instrument ($F=73.75$).
rejected_iv_good = abs.(dgps[1,5].tiv_dst) .> 1.96
plt_iv_good = plot(dgps[1,5].biv_dst, dgps[1,5].siv_dst,
seriestype = :scatter,
markersize = 1.5,
markeralpha = 0.3,
markerstrokewidth = 0,
mc = "#0072B2",
legend = false,
size = (800, 600),
title = "$(length(dgps[1,5].biv_dst)) IV estimates vs their standard errors\n(ρ = 0.10, popF = 73.75)",
xlabel = L"IV estimate $\hat{\beta}_{\mathrm{IV}}$",
ylabel = "Standard error")
plot!(plt_iv_good, dgps[1,5].biv_dst[rejected_iv_good], dgps[1,5].siv_dst[rejected_iv_good],
seriestype = :scatter,
markersize = 2.5,
markeralpha = 0.6,
markerstrokewidth = 0,
mc = "#D55E00")
plot!(plt_iv_good, [0,4], [0, 2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
plot!(plt_iv_good, [0,4], [0,-2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
The plot looks similar to the OLS reference case, though everything is spread out much more (wider ranges for both IV estimates and standard errors). Reassuringly, the correlation between estimates and standard errors is still near zero.
Now let's look at the worst case: $\rho = 0.50$ and $F=1.82$ (very weak instrument, substantial endogeneity).
rejected_iv_weak = abs.(dgps[3,1].tiv_dst) .> 1.96
plt_iv_weak = plot(dgps[3,1].biv_dst, dgps[3,1].siv_dst,
seriestype = :scatter,
markersize = 1.5,
markeralpha = 0.3,
markerstrokewidth = 0,
mc = "#0072B2",
legend = false,
size = (800, 600),
title = "$(length(dgps[3,1].biv_dst)) IV estimates vs their standard errors\n(ρ = 0.50, popF = 1.82)",
xlabel = L"IV estimate $\hat{\beta}_{\mathrm{IV}}$",
ylabel = "Standard error")
plot!(plt_iv_weak, dgps[3,1].biv_dst[rejected_iv_weak], dgps[3,1].siv_dst[rejected_iv_weak],
seriestype = :scatter,
markersize = 2.5,
markeralpha = 0.6,
markerstrokewidth = 0,
mc = "#D55E00")
plot!(plt_iv_weak, [0,4], [0, 2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
plot!(plt_iv_weak, [0,4], [0,-2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
The extreme outliers make the plot unreadable. Let's trim the axes to the 99th percentile of the standard error and restrict the x-axis to $[-4, 4]$:
biv_w = dgps[3,1].biv_dst
siv_w = dgps[3,1].siv_dst
tiv_w = dgps[3,1].tiv_dst
se_cap = quantile(siv_w, 0.99)
rejected_iv_trim = abs.(tiv_w) .> 1.96
plt_iv_trim = plot(biv_w, siv_w,
seriestype = :scatter,
markersize = 1.5,
markeralpha = 0.3,
markerstrokewidth = 0,
mc = "#0072B2",
xlims = (-4, 4),
ylims = (0, min(4, se_cap)),
legend = false,
size = (800, 600),
title = "$(length(biv_w)) IV estimates vs their standard errors (outliers removed)\n(ρ = 0.50, popF = 1.82)",
xlabel = L"IV estimate $\hat{\beta}_{\mathrm{IV}}$",
ylabel = "Standard error")
plot!(plt_iv_trim, biv_w[rejected_iv_trim], siv_w[rejected_iv_trim],
seriestype = :scatter,
markersize = 2.5,
markeralpha = 0.6,
markerstrokewidth = 0,
mc = "#D55E00")
plot!(plt_iv_trim, [0,4], [0, 2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
plot!(plt_iv_trim, [0,4], [0,-2], seriestype=:straightline, lc="#0072B2", linestyle=:dash, lw=1.5)
vline!(plt_iv_trim, [0.50], lw=2, lc="#D55E00", linestyle=:dot)
Even after removing outliers, something is clearly wrong. Unlike the OLS or strong-IV cases, there is now a visible negative association between IV estimates and their standard errors — large positive estimates tend to have small standard errors, while large negative estimates tend to have large standard errors.
The orange dotted vertical line marks the OLS bias at $\rho = 0.50$: the value toward which IV estimates are pulled when instruments are weak.
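Where does the value $\rho$ come from? In this DGP, $\operatorname{Cov}(X_i, u_i) = \operatorname{Cov}(v_i, u_i) = \rho$ and $\operatorname{Var}(X_i) = \pi^2 + 1$, so

$$ \operatorname{plim}\, \hat{\beta}_{\text{OLS}} = \beta + \frac{\operatorname{Cov}(X, u)}{\operatorname{Var}(X)} = \beta + \frac{\rho}{\pi^2 + 1} \approx \beta + \rho, $$

since $\pi^2 = F/N$ is tiny at the sample sizes used here. With $\beta = 0$ and $\rho = 0.50$, OLS concentrates around 0.50.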
Notice also that the orange (rejected) points cluster on the positive side only. This is power asymmetry: the test rejects for positive estimates far more often than for negative ones of the same absolute magnitude.
Let's quantify the correlation (restricting to the non-outlier region):
keep = (abs.(biv_w) .<= 4) .& (siv_w .<= se_cap)
println("Correlation between IV estimates and their SEs (outliers removed): ",
round(cor(biv_w[keep], siv_w[keep]), digits=4))
Correlation between IV estimates and their SEs (outliers removed): -0.1934
Keane & Neal explain this asymmetry on page 190 (their "Problem 2" and "Problem 4"). The argument is worth spelling out.
Recall the DGP (everything is scalar): $$ \begin{align*} Y_i &= \beta X_i + u_i\\ X_i &= \pi Z_i + v_i \end{align*} $$
The IV estimator is $$ \hat{\beta}_{\text{IV}} = \frac{s_{ZY}}{s_{XZ}} = \beta + \frac{s_{Zu}}{s_{XZ}} $$
Using the DGP's decomposition $v_i = \rho\, u_i + \sqrt{1-\rho^2}\, \eta_i$, we can decompose the denominator as: $$ s_{XZ} = \pi s_Z^2 + s_{Zv} = \pi s_Z^2 + \rho\, s_{Zu} + \sqrt{1-\rho^2}\, s_{Z\eta} \approx \sigma_{XZ} + \rho\, s_{Zu} $$ where the last step replaces $\pi s_Z^2$ by its population counterpart $\sigma_{XZ}$ and drops the $s_{Z\eta}$ term, which plays no systematic role.
On page 189, Keane & Neal give a heuristic definition of a strong instrument:
An instrument is strong when $s_{XZ}$ and $\sigma_{XZ}$ have the same sign — that is, when the random component $\rho\, s_{Zu}$ is not large enough to flip the sign of the denominator.
In the strong IV case ($\sigma_{XZ}$ dominates), the sign of the estimation error $\hat{\beta}_{\text{IV}} - \beta = s_{Zu}/s_{XZ}$ is determined by the sign of $s_{Zu}$. Because $s_{Zu}$ is symmetric around zero, positive and negative errors are equally likely:
In the strong IV case, the median of $\hat{\beta}_{\text{IV}}$ is zero (= the true $\beta$). ✓
In the weak IV case ($\sigma_{XZ}$ is small relative to $\rho s_{Zu}$), something different happens: for $\rho > 0$, the denominator $s_{XZ} \approx \sigma_{XZ} + \rho\, s_{Zu}$ takes the sign of $s_{Zu}$, so the ratio $s_{Zu}/s_{XZ}$ is positive regardless of the sign of $s_{Zu}$.
In other words, both large positive and large negative realisations of $s_{Zu}$ lead to $\hat{\beta}_{\text{IV}} > \beta$:
In the weak IV case, the median of $\hat{\beta}_{\text{IV}}$ is biased in the direction of $\operatorname{sign}(\rho)$.
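The mechanics can be made explicit by writing the estimation error as a function of $s_{Zu}$ alone:

$$ \hat{\beta}_{\text{IV}} - \beta = \frac{s_{Zu}}{\sigma_{XZ} + \rho\, s_{Zu}} \;\longrightarrow\; \frac{1}{\rho} \quad \text{as } |s_{Zu}| \to \infty. $$

Whether $s_{Zu}$ is hugely positive or hugely negative, the ratio approaches the same positive limit $1/\rho$ (for $\rho > 0$), which is why the weak-IV sampling distribution piles up on one side of the truth.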
Recall from the Week 7 lecture notes: $$ \operatorname{se}(\hat{\beta}_{\text{IV}}) = \frac{s_u \cdot s_Z}{\sqrt{N} \cdot |s_{XZ}|} \approx \frac{s_u \cdot s_Z}{\sqrt{N} \cdot |\sigma_{XZ} + \rho\, s_{Zu}|} $$
Since $\rho > 0$, when $|s_{Zu}|$ is large the denominator $|\sigma_{XZ} + \rho\, s_{Zu}|$ is also large (regardless of sign), and so the standard error is small.
In conclusion, two things happen simultaneously when $|s_{Zu}|$ is large and $\rho > 0$:
1. the IV estimate lands above the true $\beta$ (the median bias in the direction of $\operatorname{sign}(\rho)$), and
2. the standard error is small, because the denominator $|\sigma_{XZ} + \rho\, s_{Zu}|$ is large.
This means that the underlying t-test rejects more often — but only in the positive direction. That is precisely the power asymmetry visible in the funnel plot above.
plt_grid = plot(
layout = (length(parms_rho), length(parms_F)),
size = (1800, 800),
plot_title = "IV Estimate vs Standard Error for Different DGPs",
plot_titlefontsize = 13)
for (i, rho) in enumerate(parms_rho)
for (j, F) in enumerate(parms_F)
k = length(parms_F) * (i-1) + j
b_vec = dgps[i,j].biv_dst
s_vec = dgps[i,j].siv_dst
t_vec = dgps[i,j].tiv_dst
rejected = abs.(t_vec) .> 1.96
ylim = min(4, quantile(s_vec, 0.99))
plot!(plt_grid, b_vec, s_vec,
seriestype = :scatter,
markersize = 1,
markeralpha = 0.2,
markerstrokewidth = 0,
mc = "#0072B2",
subplot = k,
xlims = (-4, 4),
ylims = (0, ylim),
legend = false,
title = L"\rho = %$(rho),\; F = %$(F)",
titlefontsize = 10)
plot!(plt_grid, b_vec[rejected], s_vec[rejected],
seriestype = :scatter,
markersize = 1.5,
markeralpha = 0.4,
markerstrokewidth = 0,
mc = "#D55E00",
subplot = k)
plot!(plt_grid, [0,4], [0, 2], seriestype=:straightline, subplot=k,
lc="#0072B2", linestyle=:dash, lw=1)
plot!(plt_grid, [0,4], [0,-2], seriestype=:straightline, subplot=k,
lc="#0072B2", linestyle=:dash, lw=1)
vline!(plt_grid, [rho], subplot=k, lw=1.5, lc="#D55E00", linestyle=:dot)
end
end
display(plt_grid)
The funnel plots above show that weak instruments distort the standard errors in a systematic way. But there is another consequence that is just as important: low power.
Recall: if weak instruments produce spuriously large rejections in one direction, it stands to reason that there are fewer rejections in the other direction. Power is being "borrowed" from one side of the distribution and "spent" on the other. Let's make this precise.
How to compute power:
Fix $\rho$ and $F$. Then, for each candidate true $\beta$ on a grid: simulate `rep` datasets from the DGP with that $\beta$; run the OLS, IV, and AR tests of $H_0:\beta=0$ on each dataset; and record the share of replications with $|t| > 1.96$.
The resulting power curve shows how likely the test is to detect a non-zero $\beta$ at each point in the parameter space. When $\beta = 0$ the power equals the size.
# Uses a fine beta grid for the benchmark case (this may take a minute)
brange_fine = -1:0.025:1
brange, pow_ols, pow_iv, pow_ar = power_function(brange=brange_fine, F=73.75, rho=0)
plot(brange, [pow_ols, pow_iv],
xticks = -1:0.1:1,
label = ["OLS" "IV"],
linewidth = 2.5,
linestyle = [:solid :dash],
lc = ["#0072B2" "#E69F00"],
legend = :bottomright,
size = (900, 500),
title = L"Empirical Power: OLS vs IV ($\rho = 0$, popF = 73.75)",
xlabel = L"True $\beta$",
ylabel = "Power")
hline!([0.05], linestyle=:dot, lc=:gray50, lw=1.2, label="5% nominal size")
ylims!(0, 1)
For example: when the true $\beta = 0.10$, OLS rejects with probability ≈ 90%, while IV rejects with only ≈ 10% probability. This illustrates the fundamental precision loss from using IV when OLS is valid.
Lesson: Always use OLS when there is no endogeneity. IV sacrifices power, and when $\rho=0$ there is no benefit to justify that cost.
Keane & Neal note that in their DGP, a value of $\beta = 0.20$ constitutes a large effect ("This is a large effect in typical empirical applications"). Why?
In our normalised DGP, a one-standard-deviation increase in $X$ raises $Y$ by $\beta$ standard deviations. A value of $\beta = 0.20$ is a substantial effect by the standards of the economics literature.
Implications at $\beta = 0.20$: even in this best case ($\rho = 0$, strong IV), OLS detects the effect essentially every time, while the IV t-test misses it in a substantial fraction of samples.
This illustrates why using IV "just to be safe" has a real cost even when the instruments are strong.
Now let's examine how the power curves look under endogeneity ($\rho > 0$) and for the weaker instruments.
# This cell computes 15 power curves (takes a few minutes)
plt_pow = plot(
layout = (length(parms_rho), length(parms_F)),
size = (1800, 800),
plot_title = "Empirical Power: OLS (solid) vs IV (dashed)",
plot_titlefontsize = 13)
for (i, rho) in enumerate(parms_rho)
for (j, F) in enumerate(parms_F)
k = length(parms_F) * (i-1) + j
brange, pow_ols, pow_iv, pow_ar = power_function(F=F, rho=rho)
emp_size = round(100 * pow_iv[brange .== 0.0][], digits=2)
plot!(plt_pow, brange, [pow_ols, pow_iv],
label = ["OLS" "IV"],
linestyle = [:solid :dash],
lc = ["#0072B2" "#E69F00"],
legend = false,
subplot = k,
margin = 5mm,
title = L"\rho = %$(rho),\; F = %$(F)" * "\nSize = $(emp_size)%",
titlefontsize = 9)
hline!([0.05], linestyle=:dot, lc=:gray50, lw=1, subplot=k, legend=false)
ylims!(0, 1)
xlabel!(L"True $\beta$")
end
end
display(plt_pow)
The OLS power curve (blue solid): its minimum shifts away from $\beta = 0$ under endogeneity. Because $\hat{\beta}_{\text{OLS}}$ concentrates around $\beta + \rho$, OLS fails to reject mainly near $\beta = -\rho$ and rejects almost surely at $\beta = 0$, so its size is badly distorted whenever $\rho > 0$.
The IV power curve (orange dashed): roughly symmetric when $F$ is large, but as $F$ falls it flattens (low power everywhere) and becomes asymmetric, with more rejections on the $\operatorname{sign}(\rho)$ side, the same asymmetry seen in the funnel plots.
The empirical size (power at $\beta=0$) is shown in each panel title. For weak instruments it can exceed 5% considerably, consistent with Week 8 findings.
Bottom line: The familiar rule of thumb, sample $F > 10$, is not sufficient for good power. Even a population $F$ of 10 (sample $F \approx 23$) can produce very low power and distorted tests. Keane & Neal's recommendation is that sample $F$ should be much larger than 10 — and even then, one should report the IV standard error and effect size transparently.
Keane & Neal propose a simple remedy to the power asymmetry problem.
Anderson-Rubin (AR) test: Instead of running the IV regression and testing $H_0:\beta=0$ using the IV t-statistic, simply regress $Y$ directly on $Z$ (the reduced form) and test whether the coefficient is zero. The AR t-statistic is just the OLS t-statistic from this reduced-form regression.
Why does this work? Under the null $H_0:\beta=0$, the structural equation becomes $Y_i = u_i$, which is uncorrelated with $Z_i$ (by the exclusion restriction). So the AR statistic is a valid test of the null. Under the alternative, a non-zero $\beta$ means $Z$ does predict $Y$ through $X$, so the test has power.
Crucially, the AR test bypasses the problematic denominator $s_{XZ}$ that causes the standard IV t-test to be unreliable under weak instruments.
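As a standalone Monte Carlo sketch of this point (the function name `rejection_rates` and the settings are mine, mirroring `dgp_keane_neal` and `iv_estimator` above): under $H_0:\beta=0$ with a very weak instrument, the AR test should hold its 5% size while the IV t-test over-rejects.

```julia
using Random, Statistics

# Compare empirical size of the AR test and the IV t-test under H₀ (β = 0).
function rejection_rates(; F=1.82, rho=0.5, n=1000, rep=5000)
    rej_ar = 0; rej_iv = 0
    for _ in 1:rep
        pi_ = sqrt(F / n)
        u, eta, z = randn(n), randn(n), randn(n)
        v = rho .* u .+ sqrt(1 - rho^2) .* eta
        x = pi_ .* z .+ v
        y = u                                  # β = 0  ⇒  y = u
        # AR: OLS t-statistic of y on z (reduced form)
        b_ar = z \ y
        t_ar = b_ar / sqrt(sum(abs2, y .- z .* b_ar) / n / sum(abs2, z))
        rej_ar += abs(t_ar) > 1.96
        # IV t-statistic (SE built as in iv_estimator above)
        b_iv = (z' * y) / (z' * x)
        ess = n * (z \ x)^2 * var(z)
        t_iv = b_iv / sqrt(sum(abs2, y .- x .* b_iv) / n / ess)
        rej_iv += abs(t_iv) > 1.96
    end
    return rej_ar / rep, rej_iv / rep
end

Random.seed!(7)
ar_size, iv_size = rejection_rates()
println("AR size: ", ar_size, "   IV t-test size: ", iv_size)
```

The AR rejection rate should land near the nominal 5%, while the IV t-test rejects noticeably more often, consistent with the Week 8 size distortions.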
# This cell re-computes the 15 power curves to obtain the AR results (takes a few minutes)
plt_ar = plot(
layout = (length(parms_rho), length(parms_F)),
size = (1800, 800),
plot_title = "Empirical Power: AR (solid) vs IV (dashed)",
plot_titlefontsize = 13)
for (i, rho) in enumerate(parms_rho)
for (j, F) in enumerate(parms_F)
k = length(parms_F) * (i-1) + j
brange, pow_ols, pow_iv, pow_ar = power_function(F=F, rho=rho)
emp_size = round(100 * pow_ar[brange .== 0.0][], digits=2)
plot!(plt_ar, brange, [pow_ar, pow_iv],
label = ["AR" "IV"],
linestyle = [:solid :dash],
lc = ["#009E73" "#E69F00"],
legend = false,
subplot = k,
margin = 5mm,
title = L"\rho = %$(rho),\; F = %$(F)" * "\nSize = $(emp_size)%",
titlefontsize = 9)
hline!([0.05], linestyle=:dot, lc=:gray50, lw=1, subplot=k, legend=false)
ylims!(0, 1)
xlabel!(L"True $\beta$")
end
end
display(plt_ar)
| Recommendation | Rationale |
|---|---|
| Do not use the standard IV t-test for inference when instruments might be weak | Power asymmetry leads to misleading rejections in one direction |
| Use the AR test instead | Symmetric power, valid size even under weak instruments |
| Sample F should be much larger than 10 | Even $\widehat{F} = 23$ (popF = 10) produces poor power |
| Report the first-stage F transparently | Readers can judge the severity of the weak-IV problem |
Looking at the AR power curves, acceptable power (say ≥ 80% at $\beta = 0.20$) requires a population $F$ near 29 or above, which corresponds to a sample $F$ of about 50. The familiar rule of thumb (sample $F > 10$) is nowhere near sufficient.
Recall from Week 7 that the approximate 2SLS bias is $\approx L\rho/F$. This bias formula already hinted that $F$ must be large relative to $L$ and $\rho$. The current analysis makes the same point from the perspective of hypothesis testing: a small $F$ not only biases the estimate but also distorts both size and power in ways that make standard inference unreliable.
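For a rough sense of magnitude (taking the Week 7 approximation at face value), with a single instrument ($L = 1$), $\rho = 0.50$, and population $F = 10$:

$$ \text{bias} \approx \frac{L\,\rho}{F} = \frac{1 \times 0.50}{10} = 0.05, $$

a quarter of the "large" effect size $\beta = 0.20$ discussed above.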