# Week 6 Lab 

# **Estimating the Return to Schooling (continued)**

In week 3 we replicated Card's (1993) OLS estimate of the return to schooling, using this specification:

* $\qquad \log earn = \beta_1 + \beta_2 educ + \beta_3 exper + \beta_4 expersq + \beta_5 black + \beta_6 south + \beta_7 smsa + \beta_8 smsa66 + \beta_9 reg661 + \cdots + \beta_{16} reg668 + u$ 

Labor economists typically believe that `educ` in such earnings regressions is an **endogenous** variable: it is a choice variable that is likely correlated with factors that are not controlled for. Consequently, `educ` and the error term will be correlated. The canonical example is unobserved ability: it is part of the error term, and it is plausibly correlated with education.

## Exercise 1

Redo the OLS estimation from week 3, but this time implement **heteroskedasticity-robust standard errors**. Recall from the lecture that

$$
\begin{align*}
    \sqrt{N} \left( \widehat{\beta}^\text{OLS} - \beta^* \right) 
    &\overset{d}{\to} N (0, \Omega) 
\end{align*}
$$
where $\Omega := E(X_i X_i')^{-1} E(u_i^2 X_i X_i') E(X_i X_i')^{-1}$.

We say that

* $\Omega$ is the asymptotic variance of $\sqrt{N} \left( \widehat{\beta}^\text{OLS} - \beta^* \right)$ 

* $\Omega / N$ is the asymptotic variance of $\widehat{\beta}^\text{OLS}$ 

We take this to mean that $\widehat{\beta}^\text{OLS}$ has an *approximate* normal distribution with mean $\beta^*$ and variance $\Omega / N$.

A consistent estimator for the covariance matrix $\Omega$ is
$$
\begin{align*}
    \widehat{\Omega}
    =   \left( \tfrac{1}{N} \sum_{i=1}^N X_i X_i' \right)^{-1}
        \left( \tfrac{1}{N-K} \sum_{i=1}^N \hat{u}_i^2 X_i X_i' \right)
        \left( \tfrac{1}{N} \sum_{i=1}^N X_i X_i' \right)^{-1}    
\end{align*}
$$

Therefore, the variance of $\widehat{\beta}^\text{OLS}$ is approximately equal to $\widehat{\Omega}/N$. The robust standard errors are the square roots of the diagonal elements of $\widehat{\Omega}/N$.
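The sandwich formula above can be sketched in a few lines of Julia. The block below is only an illustration on simulated data (not the Card data); the helper name `ols_robust`, the sample size, and the data-generating process are assumptions made for this example. It also computes the classical (homoskedastic) standard errors so the two can be compared.

```julia
using LinearAlgebra, Random

# Hypothetical helper: OLS with classical and heteroskedasticity-robust standard errors.
function ols_robust(Y, X)
    n, k = size(X)
    b = X \ Y                                   # OLS coefficients
    u_hat = Y - X * b                           # residuals
    Q_inv = inv(X' * X / n)                     # (1/N * sum of X_i X_i')^{-1}
    S = (X .* u_hat .^ 2)' * X / (n - k)        # 1/(N-K) * sum of u_hat_i^2 X_i X_i'
    Omega_hat = Q_inv * S * Q_inv               # sandwich estimator
    se_robust = sqrt.(diag(Omega_hat) / n)      # sqrt of diagonal of Omega_hat / N
    se_classical = sqrt.(diag(Q_inv) * sum(abs2, u_hat) / (n - k) / n)
    return (b = b, se_robust = se_robust, se_classical = se_classical)
end

# Simulated data (illustrative only): the error variance depends on x.
Random.seed!(1)
n = 5_000
x = randn(n)
X = hcat(ones(n), x)
u = randn(n) .* (0.5 .+ abs.(x))
Y = X * [1.0, 2.0] + u

res = ols_robust(Y, X)
println("coefficients:  ", res.b)
println("robust SE:     ", res.se_robust)
println("classical SE:  ", res.se_classical)
```

In this simulated design the error variance increases with $|x|$, so the classical standard error of the slope should understate its sampling variability relative to the robust one.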



```julia
# read the CSV file
using DelimitedFiles
data = readdlm("card.csv", ',');

# dependent variable
Y = Array{Float64}(data[:, 33])

# now create an n-by-k matrix X by grabbing the correct columns from the data matrix
X = Array{Float64}(data[:, [4, 32, 34, 22, 23, 24, 25, 12, 13, 14, 15, 16, 17, 18, 19]])
X = hcat(ones(length(Y), 1), X) # adding constant to front

n, k = size(X)  # output: (3010, 16)

# implement heteroskedasticity-robust estimation below
```

## The Instrumental Variable

To address the endogeneity of `educ`, Card proposes the "*presence of a nearby college*" [i.e., university] as an instrument. For every person in the sample, he defines the dummy variable
$$
nearc4 =
\begin{cases}
    1 & \text{if person lives near a 4-year college} \\
    0 & \text{otherwise}
\end{cases}
$$

(How does Card define "nearness"?)

Check page 10 of Card's paper to read how he justifies the validity of his IV. Are you convinced?

## Exercise 2

Define the vectors and matrices $Y, X_1, X_2, X, Z_1, Z_2$, and $Z$. You can find their definitions in the week 5 lecture notes.
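As a rough guide to the layout only, here is a sketch on simulated data (not the Card data). It assumes the usual partition: $X_1$ holds the exogenous regressors including the constant, $X_2$ is the single endogenous regressor, $Z_1 = X_1$, and $Z_2$ is the excluded instrument. The column ordering and all variable values are assumptions made for this example; check the week 5 lecture notes for the definitions used in the course.

```julia
using Random

Random.seed!(2)
n = 1_000

# Simulated data, NOT the Card data; all names and values are illustrative.
X1 = hcat(ones(n), randn(n, 2))      # exogenous regressors, constant first
Z2 = Float64.(rand(n) .< 0.5)        # excluded instrument (a dummy, in the role of nearc4)
X2 = 0.5 .* Z2 .+ randn(n)           # regressor in the role of educ

Z1 = X1                              # exogenous regressors serve as their own instruments
X = hcat(X1, X2)                     # all regressors, n-by-4
Z = hcat(Z1, Z2)                     # all instruments, n-by-4
Y = X * [1.0, 0.5, -0.5, 0.1] + randn(n)

println(size(X), " ", size(Z))
```

With one endogenous regressor and one excluded instrument, $X$ and $Z$ have the same number of columns, which is what makes $Z'X$ square in the IV formula.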

```julia
# define vectors and matrices here

```

## Exercise 3

Estimate the **reduced form** model $X_{i2} = Z_i'\pi + v_i$, where $\pi = (\pi_1', \pi_2)'$ and $\pi_2$ is the coefficient that belongs to $Z_2$.

Report the OLS estimate of $\pi_2$ and its heteroskedasticity-robust standard error.

Compare your estimate to Card's table 3.

What does the estimation result tell you about **instrument relevance**? 

```julia
# reduced form between X and Z

```

## Exercise 4

Estimate the **reduced form** model $Y_i = Z_i' \lambda + w_i$, where $\lambda = (\lambda_1', \lambda_2)'$ and $\lambda_2$ is the coefficient that belongs to $Z_2$.

Report the OLS estimate of $\lambda_2$ and its standard error.

Compare your results to Card's table 3.
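The two reduced forms are linked: with one endogenous regressor and one excluded instrument, the ratio $\widehat{\lambda}_2 / \widehat{\pi}_2$ coincides numerically with the IV estimate of the coefficient on the endogenous regressor. The sketch below illustrates this on simulated data (not the Card data); the data-generating process, the variable names, and the ordering of columns (endogenous regressor and excluded instrument last) are assumptions made for this example.

```julia
using LinearAlgebra, Random

Random.seed!(3)
n = 2_000

# Simulated data, NOT the Card data; all names and values are illustrative.
w = randn(n)                                   # an exogenous control
a = randn(n)                                   # unobserved "ability"
Z1 = hcat(ones(n), w)
Z2 = Float64.(rand(n) .< 0.5)                  # excluded instrument (dummy)
X2 = 1.0 .+ 0.8 .* Z2 .+ 0.5 .* w .+ a .+ randn(n)
Y = 0.5 .+ 0.1 .* X2 .+ 0.3 .* w .+ a .+ randn(n)

Z = hcat(Z1, Z2)
X = hcat(Z1, X2)

pi_hat = Z \ X2                                # reduced form of X2 on Z
lambda_hat = Z \ Y                             # reduced form of Y on Z
ratio = lambda_hat[end] / pi_hat[end]          # lambda_2 / pi_2

beta_iv = (Z' * X) \ (Z' * Y)                  # IV estimator for comparison
println("ratio of reduced forms: ", ratio)
println("IV estimate:            ", beta_iv[end])
```

The equality is algebraic in the just-identified case, so the two printed numbers should agree up to floating-point error.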

```julia
# reduced form between Y and Z

```

## Exercise 5

### IV Estimation

Now estimate $\beta_2$ using three different approaches:

1. Using the formula $\widehat{\beta}^{\text{IV}} = (Z'X)^{-1} Z'Y$

2. Using the two-step procedure:

    * regress $X_i$ on $Z_i$, obtain $\widehat{\pi}$, create the fitted values $\widehat{X}_i$ (the part of $X_i$ explained by $Z_i$, i.e. the exogenous version of $X_i$)
    * regress $Y_i$ on $\widehat{X}_i$

3. Using the two-step procedure:

    * regress $X_{i2}$ on $Z_i$, obtain residuals $\widehat{v}_i$
    * regress $Y_i$ on $X_i$ and $\widehat{v}_i$

Can you confirm that you obtain three (almost) identical numerical values for your estimate of $\beta_2$?

(No need to estimate standard errors this week!)
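To see what "(almost) identical" looks like in practice, the three approaches can be sketched on simulated data (not the Card data). The data-generating process, the variable names, and the position of the endogenous regressor in the last column of $X$ are assumptions made for this example.

```julia
using LinearAlgebra, Random

Random.seed!(4)
n = 2_000

# Simulated data, NOT the Card data; all names and values are illustrative.
w = randn(n)                                   # an exogenous control
a = randn(n)                                   # unobserved "ability"
Z1 = hcat(ones(n), w)
Z2 = Float64.(rand(n) .< 0.5)                  # excluded instrument (dummy)
X2 = 1.0 .+ 0.8 .* Z2 .+ 0.5 .* w .+ a .+ randn(n)
Y = 0.5 .+ 0.1 .* X2 .+ 0.3 .* w .+ a .+ randn(n)

X = hcat(Z1, X2)
Z = hcat(Z1, Z2)
j = size(X, 2)                                 # column of the endogenous regressor

# 1. direct formula (Z'X)^{-1} Z'Y
b_iv = (Z' * X) \ (Z' * Y)

# 2. two-step: regress X on Z, then regress Y on the fitted values
X_hat = Z * (Z \ X)
b_2sls = X_hat \ Y

# 3. two-step: regress X2 on Z, then add the residuals as an extra regressor
v_hat = X2 - Z * (Z \ X2)
b_cf = hcat(X, v_hat) \ Y                      # last entry is the coefficient on v_hat

println((b_iv[j], b_2sls[j], b_cf[j]))
```

The three numbers should differ only by floating-point error. Note that in the third approach the coefficient vector has one extra entry for $\widehat{v}_i$, so the coefficient of interest is no longer the last element.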

```julia
# first approach


# second approach


# third approach

```
