# EMET2007 Week 4

This week, you will run your first OLS regression. The broad research question you will explore is: **Do tall people earn more?** (What would be your null hypothesis?)

You will use the Earnings_and_Height data, which were collected in the U.S.; you can find a description of the data on the course website. Heights are therefore measured in inches. The following table helps with translating to the metric system:

| Height in inches | Height in centimetres |
|:--|:--|
| 65 | 165 |
| 67 | 170 |
| 70 | 178 |
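
The table values follow from the definition 1 inch = 2.54 cm. A minimal sketch of the arithmetic (the names `CM_PER_INCH` and `table` are only illustrative and are not part of the course materials):

```python
# Exact conversion factor: 1 inch is defined as 2.54 cm.
CM_PER_INCH = 2.54

# Rebuild the small table above, rounding to whole centimetres.
table = {inches: round(inches * CM_PER_INCH) for inches in (65, 67, 70)}

for inches, cm in table.items():
    print(f'{inches} in is roughly {cm} cm')
```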

## Imports and loading data

In [4]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf

In [2]:
# COLAB USERS: UNCOMMENT THE BELOW LINES:
# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_csv('drive/MyDrive/EMET2007/datasets/earnings_and_height.csv')

In [1]:
# ANACONDA USERS: UNCOMMENT THE BELOW LINE:
#df = pd.read_csv('../datasets/earnings_and_height.csv')

## Two sample t-test function
We wrote the following convenience function for you to conduct a statistical test for the difference in population means. You will need it below.

Here is a gentle reminder (see Section 3.4 in Stock and Watson) of how this test is conducted:

Under the null hypothesis of a zero difference in means,

\begin{align*}
    t &:= \frac{(\overline{Y}_1 - \overline{Y}_0)}{\text{se} (\overline{Y}_1 - \overline{Y}_0)} \overset{approx}{\sim} N(0,1)\\
    \text{se} (\overline{Y}_1 - \overline{Y}_0) &:= \sqrt{s_0^2/n_0 + s_1^2/n_1}
\end{align*}




In [2]:
def t_test(x1, x2):
    """Two-sample t-test for a difference in population means.

    x1, x2: pandas Series holding the outcome in group 1 and group 2.
    Returns the point estimate, the t statistic, and a 95% confidence interval.
    """
    numerator = x1.mean() - x2.mean()  # aka point estimate
    se1 = x1.std() / np.sqrt(len(x1))  # pandas .std() divides by n - 1
    se2 = x2.std() / np.sqrt(len(x2))
    denominator = np.sqrt(se1**2 + se2**2)  # se of the difference in means
    t_stat = numerator / denominator  # our t statistic
    ci_lb = numerator - 1.96 * denominator  # lower bound
    ci_ub = numerator + 1.96 * denominator  # upper bound
    ci = (ci_lb, ci_ub)
    print('Two-sample t-test')
    print(f'Mean in group 1: {x1.mean()}')
    print(f'Mean in group 2: {x2.mean()}')
    print(f'Point estimate for difference in means: {numerator}')
    print(f'Test statistic: {t_stat}')
    print(f'95% confidence interval: {ci}')
    return numerator, t_stat, ci

Execute the above function, so that Jupyter can use it later when we need it. When you execute the function, what happens?
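
As an optional sanity check on the formula, the sketch below computes the t statistic by hand on synthetic samples and compares it with `scipy.stats.ttest_ind` using `equal_var=False`, which is based on the same statistic. The samples `y1` and `y0` are randomly generated placeholders, not the course data, and using SciPy here is an assumption of this sketch rather than a course requirement.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic placeholder samples, NOT the course data.
y1 = rng.normal(loc=50, scale=10, size=200)
y0 = rng.normal(loc=48, scale=12, size=150)

# Formula from above: difference in means divided by its standard error.
diff = y1.mean() - y0.mean()
se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
t_manual = diff / se

# The unequal-variance (Welch) test in SciPy uses the same statistic.
t_scipy = stats.ttest_ind(y1, y0, equal_var=False).statistic
print(t_manual, t_scipy)
```

The two statistics should agree up to floating-point error. SciPy's p-value is based on a t distribution, whereas the reminder above uses the normal approximation, so p-values can differ slightly in small samples.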

## Exercise 1

Now that you have read the csv file into your work environment, use the helper functions `head`/`tail` and `describe` to gain an understanding of the number of observations, the included variables, and first descriptive statistics.
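
A minimal sketch of what these helpers return, using a tiny made-up data frame called `toy` (the numbers are invented for illustration and are not taken from the course data):

```python
import pandas as pd

# Tiny made-up data frame standing in for the course data.
toy = pd.DataFrame({'height': [65, 67, 70, 62, 72],
                    'earnings': [30000, 42000, 51000, 28000, 60000]})

print(toy.shape)       # (number of rows, number of columns)
print(toy.head(3))     # first three rows
print(toy.tail(2))     # last two rows
print(toy.describe())  # count, mean, std, min, quartiles, max per column
```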


## Exercise 2
As you can see, ``height`` is measured in inches. Add a new variable ``heightcm`` to the data frame that captures height measured in centimetres.
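
One way this could look, sketched on a made-up data frame `toy` rather than the course data:

```python
import pandas as pd

toy = pd.DataFrame({'height': [65, 67, 70]})  # made-up heights in inches
toy['heightcm'] = toy['height'] * 2.54        # 1 inch = 2.54 cm
print(toy)
```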

## Exercise 3
Present useful descriptive analysis for ``heightcm`` and visualize the variable in boxplots and histograms.
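
A possible starting point, assuming synthetic heights in place of the real ``heightcm`` column (the data frame `toy` and the choice of 20 bins are illustrative only):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic heights in cm; a placeholder for the real heightcm column.
toy = pd.DataFrame({'heightcm': rng.normal(170, 10, size=500)})

print(toy['heightcm'].describe())

# Boxplot on the left, histogram on the right.
fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
ax_box.boxplot(toy['heightcm'].to_numpy())
ax_box.set_ylabel('Height (cm)')
ax_hist.hist(toy['heightcm'], bins=20)
ax_hist.set_xlabel('Height (cm)')
ax_hist.set_ylabel('Frequency')
```

In a Jupyter notebook the figure is rendered once the cell finishes running.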


## Exercise 4 

Add a new (categorical) variable ``habove`` to your data frame. The definition of ``habove`` is as follows: it equals 1 if ``heightcm`` exceeds the median height, and zero otherwise.

(This makes ``habove`` a so-called *dummy* variable which splits the sample in two.)
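
A sketch of the definition on made-up heights (the data frame `toy` is hypothetical). Note the strict inequality: observations exactly at the median are assigned zero.

```python
import pandas as pd

toy = pd.DataFrame({'heightcm': [160.0, 165.0, 170.0, 175.0, 180.0]})
median_height = toy['heightcm'].median()  # 170.0 for these made-up values

# The comparison gives True/False; astype(int) turns that into 1/0.
toy['habove'] = (toy['heightcm'] > median_height).astype(int)
print(toy)
```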


## Exercise 5 

Is there an earnings difference between the two height groups (people above the median versus people at the median and below)? Can you distinguish the earnings difference statistically from zero? Make use of the ``t_test`` function defined above.
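
Before testing, the sample has to be split by the dummy. A minimal sketch on made-up numbers (the data frame `toy` and the names `tall` and `short` are illustrative assumptions):

```python
import pandas as pd

# Made-up earnings and dummy values, NOT the course data.
toy = pd.DataFrame({'earnings': [28000, 30000, 42000, 51000, 60000],
                    'habove':   [0, 0, 0, 1, 1]})

# Select the earnings of each group with a boolean mask.
tall = toy.loc[toy['habove'] == 1, 'earnings']
short = toy.loc[toy['habove'] == 0, 'earnings']

# Group means as a quick first look at the difference.
print(toy.groupby('habove')['earnings'].mean())
```

The two Series `tall` and `short` have the form that ``t_test`` expects for its arguments `x1` and `x2`.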

## Exercise 6
Create a scatter plot of earnings versus heights (in cm). What's going on here?


## Exercise 7
    
Run a regression of ``earnings`` on ``heightcm``.

### Exercise 7a
Inspect the standard regression summary output from ``statsmodels`` and make sense of the results.
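
A sketch of how the pieces of the summary can be accessed, assuming the formula interface imported above as `smf`. The data are synthetic with a known slope of 500, so the numbers say nothing about the real data set; `toy` and `results` are illustrative names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Synthetic data with a true slope of 500; NOT the course data.
toy = pd.DataFrame({'heightcm': rng.normal(170, 10, size=1000)})
toy['earnings'] = -40000 + 500 * toy['heightcm'] + rng.normal(0, 20000, size=1000)

results = smf.ols('earnings ~ heightcm', data=toy).fit()
print(results.summary())  # full regression table
print(results.params)     # intercept and slope estimates
print(results.bse)        # their standard errors
print(results.rsquared)   # R-squared
```

On synthetic data like this, the slope estimate should land near the value used to generate it, which is a handy way to check that you are reading the table correctly.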

### Exercise 7b
Add your estimated PRF to the scatter plot.
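
One possible approach is to evaluate the fitted line on a grid of heights and draw it on top of the scatter plot. The sketch below uses synthetic data; `toy`, `grid`, and `fitted` are illustrative names, not part of the course materials.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Synthetic data, NOT the course data.
toy = pd.DataFrame({'heightcm': rng.normal(170, 10, size=300)})
toy['earnings'] = -40000 + 500 * toy['heightcm'] + rng.normal(0, 20000, size=300)
results = smf.ols('earnings ~ heightcm', data=toy).fit()

# Evaluate the fitted line on a grid spanning the observed heights.
grid = pd.DataFrame({'heightcm': np.linspace(toy['heightcm'].min(),
                                             toy['heightcm'].max(), 50)})
fitted = results.predict(grid)

fig, ax = plt.subplots()
ax.scatter(toy['heightcm'], toy['earnings'], s=10, alpha=0.4)
ax.plot(grid['heightcm'], fitted, color='red', label='Fitted line')
ax.set_xlabel('Height (cm)')
ax.set_ylabel('Earnings')
ax.legend()
```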

### Exercise 7c
Use your estimated PRF to create `predictions` for the earnings of workers who are 165 cm, 170 cm, and 178 cm tall.
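
A sketch of two equivalent ways to obtain such predictions from a fitted model: the `predict` method, and the formula intercept + slope × height computed by hand. The data are synthetic, and `new_heights` and `by_hand` are illustrative names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Synthetic data, NOT the course data.
toy = pd.DataFrame({'heightcm': rng.normal(170, 10, size=300)})
toy['earnings'] = -40000 + 500 * toy['heightcm'] + rng.normal(0, 20000, size=300)
results = smf.ols('earnings ~ heightcm', data=toy).fit()

# predict() needs a data frame with the same regressor name as the formula.
new_heights = pd.DataFrame({'heightcm': [165, 170, 178]})
predictions = results.predict(new_heights)

# The same numbers by hand: intercept + slope * height.
by_hand = (results.params['Intercept']
           + results.params['heightcm'] * new_heights['heightcm'])
print(predictions)
print(by_hand)
```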
    

## Exercise 8: Important for Your Assignments!

Save your current Jupyter notebook under the file name **week_4.ipynb**.

Create an **html** file with the name **week_4.html** from your Jupyter notebook file **week_4.ipynb**.

Depending on how you are using Jupyter (Colab or Anaconda), there are different ways of doing so:

### Anaconda users:

Open the **File** menu at the top left of your browser and choose **Download as** and navigate your way through to saving the file as an **html** file.

After saving the **html** file, convince yourself that it worked properly: open the file to see that it contains all your comments, code, and graphs! 

Let your tutor know if you are having any problems!

### Colab users:

Execute the following line of code:

In [None]:
# make sure to change the file path to correspond to your own Google drive setup!
!jupyter nbconvert --to html 'drive/MyDrive/EMET2007/notebooks/week_4.ipynb'

This single-line command takes your Jupyter notebook file **week_4.ipynb** and creates the html file **week_4.html**. (The Jupyter notebook file will not be deleted in the process.) This command requires you to:

* call the current Jupyter notebook **week_4.ipynb**, with no deviations in spelling, case, or spaces;
* update the file path to the location on your Google drive where you store the file **week_4.ipynb**.

After creating the html file in your Google drive, download it to your computer and see that it contains all your comments, code, and graphs!

Let your tutor know if you are having any problems!

## Attribution
This exercise is based on Empirical Exercise 4.2 of Stock and Watson, *Introduction to Econometrics*, 4th global edition.