# EMET2007 Week 8

## Tip

Don't forget to use standard errors that are robust to heteroskedasticity. This is the last time that I am explicitly reminding you to do so!
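
As a reminder of the mechanics, here is a minimal sketch of one common way to request heteroskedasticity-robust standard errors in `statsmodels`: pass `cov_type='HC1'` to `.fit()`. The data frame `toy` and its columns `x` and `y` are synthetic and exist only for this illustration; the choice of `HC1` among the available robust options is an assumption, not something this exercise prescribes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (not the Birthweight data)
rng = np.random.default_rng(0)
n = 500
x = rng.binomial(1, 0.2, size=n)                    # a 0/1 regressor
y = 3.0 - 0.5 * x + rng.normal(0, 1 + x, size=n)    # error spread depends on x
toy = pd.DataFrame({'y': y, 'x': x})

# Same coefficients, different standard errors
fit_default = smf.ols('y ~ x', data=toy).fit()                 # homoskedasticity-only
fit_robust = smf.ols('y ~ x', data=toy).fit(cov_type='HC1')    # heteroskedasticity-robust

print(fit_default.bse)
print(fit_robust.bse)
```

The point estimates are identical across the two fits; only the standard errors, and therefore the test statistics and confidence intervals, change.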

<hr>

In this exercise you will investigate, using OLS estimation, the research question:
Do moms who smoke during pregnancy have unhealthier babies?
(What would be your null hypothesis?)

To answer this question, you will use the Birthweight data set, which comes from Pennsylvania; its description is on the course website.

The data set includes, among others, the following variables:


| Variable | Description |
| :-- | :-- |
| `birthweight` | Birth weight of baby in grams |
| `smoker` | Dummy equal to one if mom smoked during pregnancy, zero else |
| `alcohol` | Dummy equal to one if mom drank alcohol during pregnancy, zero else |
| `nprevist` | Total number of prenatal care visits |
    
Birth weight is thought to be an indicator of a baby's health.

## Imports and loading data

```python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
```

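```python
# Option A (Google Colab): uncomment the next three lines
# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_csv('drive/MyDrive/EMET2007/datasets/birthweight.csv')
```

```python
# Option B (local copy of the data): uncomment the next line
# df = pd.read_csv('../datasets/birthweight.csv')
```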

We will again use the 'homemade' t-test function from Week 4. Remember to run the cell below before you use the function:

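```python
def t_test(x1, x2):
    """Two-sample t-test for a difference in means (unequal variances).

    x1, x2: pandas Series holding the outcome for group 1 and group 2.
    Prints and returns the point estimate, the t statistic, and a 95%
    confidence interval based on the normal critical value 1.96.
    """
    numerator = x1.mean() - x2.mean()  # aka point estimate
    se1 = x1.std() / np.sqrt(len(x1))  # standard error of the group 1 mean
    se2 = x2.std() / np.sqrt(len(x2))  # standard error of the group 2 mean
    denominator = np.sqrt(se1**2 + se2**2)  # standard error of the difference
    t_stat = numerator / denominator  # our t statistic
    ci_lb = numerator - 1.96 * denominator  # lower bound
    ci_ub = numerator + 1.96 * denominator  # upper bound
    ci = (ci_lb, ci_ub)
    print('Two-sample t-test')
    print(f'Mean in group 1: {x1.mean()}')
    print(f'Mean in group 2: {x2.mean()}')
    print(f'Point estimate for difference in means: {numerator}')
    print(f'Test statistic: {t_stat}')
    print(f'95% confidence interval: {ci}')
    return numerator, t_stat, ci
```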
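
As a summary of what `t_test` computes: writing $\bar{x}_j$, $s_j$, and $n_j$ for the sample mean, sample standard deviation, and size of group $j$ (notation introduced here for illustration only), the test statistic is

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}
$$

and the 95% confidence interval for the difference in means is

$$
(\bar{x}_1 - \bar{x}_2) \pm 1.96 \sqrt{s_1^2/n_1 + s_2^2/n_2}
$$

The value 1.96 is the normal critical value, so the interval relies on a large-sample approximation.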

## Exercise 1
In the sample:
- What is the average value of `birthweight`?
- What is the average value of `birthweight` for mothers who smoke?
- What is the average value of `birthweight` for mothers who do not smoke?
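
The mechanics of these three averages can be sketched with `pandas`. The data frame `toy` below contains made-up numbers and only borrows the column names from the data set; with the real data you would apply the same calls to `df`.

```python
import pandas as pd

# Made-up numbers for illustration only (not the Birthweight data)
toy = pd.DataFrame({
    'birthweight': [3400, 3100, 3600, 2900, 3300, 3000],
    'smoker':      [0,    1,    0,    1,    0,    1],
})

# Overall sample mean
overall_mean = toy['birthweight'].mean()

# Mean within each value of the dummy (index 0 = non-smokers, 1 = smokers)
mean_by_smoker = toy.groupby('smoker')['birthweight'].mean()

print(overall_mean)
print(mean_by_smoker)
```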

## Exercise 2

Construct a 95% confidence interval for the difference in `birthweight` between moms who smoke and moms who do not smoke.
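
A minimal sketch of the steps, assuming the groups are selected with boolean masks and the interval uses the normal critical value 1.96. The numbers in `toy` are made up, and the sample is far too small for the normal approximation to be credible; it only shows the mechanics.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only (not the Birthweight data)
toy = pd.DataFrame({
    'birthweight': [3400, 3100, 3600, 2900, 3300, 3000, 3500, 3200],
    'smoker':      [0,    1,    0,    1,    0,    1,    0,    1],
})

# Boolean masks select the two groups
smokers = toy.loc[toy['smoker'] == 1, 'birthweight']
nonsmokers = toy.loc[toy['smoker'] == 0, 'birthweight']

# Difference in means and its standard error (unequal variances)
diff = smokers.mean() - nonsmokers.mean()
se = np.sqrt(smokers.var() / len(smokers) + nonsmokers.var() / len(nonsmokers))

# 95% confidence interval based on the normal critical value
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(diff, ci)
```

Passing the same two Series to `t_test` gives the same interval, since `t_test` applies the same formula.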


 
## Exercise 3

### Exercise 3a
Run a simple regression of ``birthweight`` on ``smoker``.


### Exercise 3b
Explain how the estimated intercept and slope coefficients relate to your answers to Exercises 1 and 2.
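
One way to check your reasoning numerically is to regress an outcome on a constant and a single dummy and compare the coefficients with the group means. The sketch below uses plain `numpy` least squares on made-up numbers; the names `y` and `d` are placeholders introduced for this illustration.

```python
import numpy as np

# Made-up numbers for illustration only (not the Birthweight data)
y = np.array([3400., 3100., 3600., 2900., 3300., 3000.])
d = np.array([0., 1., 0., 1., 0., 1.])  # dummy regressor

# OLS of y on a constant and the dummy via least squares
X = np.column_stack([np.ones_like(d), d])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

# Compare the coefficients with the group means
print(b0, y[d == 0].mean())                      # intercept vs. mean of the d = 0 group
print(b1, y[d == 1].mean() - y[d == 0].mean())   # slope vs. difference in group means
```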


### Exercise 3c
Construct a 95% confidence interval for $\beta_1$, that is, the effect of ``smoker`` on ``birthweight``.
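
A sketch of two equivalent routes to such an interval in `statsmodels`, assuming robust standard errors via `cov_type='HC1'`. The data are synthetic: the coefficient values used to generate `toy` are invented and say nothing about the real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (not the Birthweight data)
rng = np.random.default_rng(1)
n = 1000
smoker = rng.binomial(1, 0.2, size=n)
birthweight = 3400 - 250 * smoker + rng.normal(0, 500, size=n)
toy = pd.DataFrame({'birthweight': birthweight, 'smoker': smoker})

fit = smf.ols('birthweight ~ smoker', data=toy).fit(cov_type='HC1')

# Route 1: built-in interval (normal-based by default for robust covariance types)
ci = fit.conf_int(alpha=0.05).loc['smoker']
print(ci)

# Route 2: by hand, estimate +/- 1.96 * robust standard error
b1, se1 = fit.params['smoker'], fit.bse['smoker']
print(b1 - 1.96 * se1, b1 + 1.96 * se1)
```

The two routes differ only in rounding of the critical value.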

## Exercise 4
Do you think smoking is uncorrelated with other factors that cause low birth weight? How might this bias your estimate of $\beta_1$?
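
To see how an omitted factor can distort a slope estimate, the following simulation may help. Everything in it is invented for illustration: the "true" coefficients, the probabilities, and the assumption that drinking and smoking are positively related. The helper `ols` is a hypothetical convenience function, not part of any library.

```python
import numpy as np

# Simulated data for illustration only; all parameter values are made up
rng = np.random.default_rng(42)
n = 100_000
alcohol = rng.binomial(1, 0.1, size=n)
# Smoking is made more likely among drinkers, so the two regressors are correlated
smoker = rng.binomial(1, np.where(alcohol == 1, 0.6, 0.15))
u = rng.normal(0, 500, size=n)
y = 3400 - 200 * smoker - 150 * alcohol + u   # "true" effect of smoker is -200


def ols(y, *regressors):
    """Return OLS coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]


b_short = ols(y, smoker)            # omits alcohol
b_long = ols(y, smoker, alcohol)    # includes alcohol

print(b_short[1])   # also picks up part of the alcohol effect
print(b_long[1])    # close to the true value of -200
```

In this setup the short regression overstates the harm attributed to smoking because the omitted variable lowers the outcome and is positively correlated with the included regressor. Other sign combinations would push the bias the other way.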



## Exercise 5 

### Exercise 5a
Run a multiple regression of ``birthweight`` on ``smoker``, ``alcohol``, and ``nprevist``.


### Exercise 5b

Construct a 95% confidence interval for $\beta_1$, that is, the effect of ``smoker`` on ``birthweight``. Is it substantively different from the interval you obtained from the regression that excludes ``alcohol`` and ``nprevist``?
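
One way to line up the two intervals is sketched below, assuming the `statsmodels` formula interface and `cov_type='HC1'`. The data are synthetic and the generating coefficients are invented; only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (not the Birthweight data)
rng = np.random.default_rng(2)
n = 2000
alcohol = rng.binomial(1, 0.1, size=n)
smoker = rng.binomial(1, np.where(alcohol == 1, 0.6, 0.15))
nprevist = rng.poisson(10, size=n)
birthweight = (3000 - 200 * smoker - 50 * alcohol + 30 * nprevist
               + rng.normal(0, 500, size=n))
toy = pd.DataFrame({'birthweight': birthweight, 'smoker': smoker,
                    'alcohol': alcohol, 'nprevist': nprevist})

# Simple and multiple regression, both with robust standard errors
fit_short = smf.ols('birthweight ~ smoker', data=toy).fit(cov_type='HC1')
fit_long = smf.ols('birthweight ~ smoker + alcohol + nprevist',
                   data=toy).fit(cov_type='HC1')

# Side-by-side 95% intervals for the coefficient on smoker
ci_compare = pd.DataFrame({
    'short': fit_short.conf_int().loc['smoker'],
    'long': fit_long.conf_int().loc['smoker'],
})
ci_compare.index = ['lower', 'upper']
print(ci_compare)
```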

### Exercise 5c
How should you interpret the coefficient on ``nprevist``? Does it measure a causal effect of prenatal visits on birth weight? 


## Exercise 6

An alternative way to control for prenatal visits is to use the dummy variables ``tripre0`` through ``tripre3``. Notice that these four dummies are mutually exclusive and exhaustive: every mother falls into exactly one of the four categories.

| Variable | Description |
| :-- | :-- |
| ``tripre0`` | Dummy equal to one if no prenatal care visits, zero else |
| ``tripre1`` | Dummy equal to one if 1st prenatal care visit in 1st trimester, zero else |
| ``tripre2`` | Dummy equal to one if 1st prenatal care visit in 2nd trimester, zero else |
| ``tripre3`` | Dummy equal to one if 1st prenatal care visit in 3rd trimester, zero else |


Run and compare the following multiple regressions:
* ``birthweight`` on ``smoker``, ``alcohol``, ``tripre0``, ``tripre2``, and ``tripre3`` 
* ``birthweight`` on ``smoker``, ``alcohol``, ``tripre0``, ``tripre1``, and ``tripre3`` 

* Report the coefficient estimate for ``tripre0``. What does it capture?
* Report the coefficient estimates for ``tripre2`` and ``tripre3``. What do they capture?
* Why can we leave one trimester dummy out of either regression without sacrificing any information?
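
The role of the omitted dummy can be explored in a stripped-down simulation. The sketch below is a simplification: it uses four generic categories, invented group means, and no other regressors (the exercise also includes ``smoker`` and ``alcohol``). The helper `ols` and the names `group` and `D` are hypothetical.

```python
import numpy as np

# Simulated data for illustration only; the group means below are made up
rng = np.random.default_rng(3)
n = 4000
group = rng.integers(0, 4, size=n)    # hypothetical category label 0..3
D = np.eye(4)[group]                  # four mutually exclusive, exhaustive dummies
y = np.array([2900., 3400., 3300., 3200.])[group] + rng.normal(0, 400, size=n)


def ols(y, X):
    """OLS coefficients of y on a constant plus the columns of X."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Omit dummy 1 (base category = group 1), then omit dummy 2 (base category = group 2)
b_a = ols(y, D[:, [0, 2, 3]])   # order: const, d0, d2, d3
b_b = ols(y, D[:, [0, 1, 3]])   # order: const, d0, d1, d3

# Each coefficient is a difference in means relative to the omitted group
print(b_a[1], y[group == 0].mean() - y[group == 1].mean())
print(b_b[1], y[group == 0].mean() - y[group == 2].mean())

# The four dummies sum to the constant, so including all of them is perfectly collinear
rank_all = np.linalg.matrix_rank(np.column_stack([np.ones(n), D]))
print(rank_all)   # 4 rather than 5 columns' worth of independent information
```

Changing the omitted category changes what each coefficient is measured against, but both regressions describe the same four group means.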


## Attribution
This exercise is based on Empirical Exercises 5.3 and 6.1 of Stock and Watson, *Introduction to Econometrics*, 4th global edition.