# EMET2007 Week 9
    
This exercise uses the Lead_Mortality data; you can find its description on the website.
    
In this exercise you will investigate the research question: **Does lead in the water supply increase infant mortality rates?** Lead is toxic and should not be consumed. Nevertheless, it was common practice in the early 20th century to use water supply pipes made of lead. As a consequence, people may have been harmed. You will explore this possibility in this exercise.

The main variables of concern are:

| Variable | Description |
|:--|:--|
| `infRate` | Infant mortality rate (deaths per 100 in population) |
| `lead` | Dummy equal to one if city had lead pipes, zero otherwise |
| `ph` | pH level of water in city |

Econometrically, this exercise uses an interaction term between a continuous variable (`ph`) and a binary variable (`lead`) to estimate the effect of lead on mortality.

## Imports and loading data

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf

In [4]:
# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_csv('drive/MyDrive/EMET2007/datasets/lead_mortality.csv')

In [5]:
# df = pd.read_csv('../datasets/lead_mortality.csv')

In [None]:
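def t_test(x1, x2):
    """Two-sample test for a difference in means with unequal variances.

    Expects two pandas Series (so that .std() is the sample standard
    deviation with ddof=1). The 95% confidence interval uses the normal
    critical value 1.96. Returns (point estimate, t statistic, CI).
    """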
    numerator = x1.mean() - x2.mean()  # aka point estimate
    se1 = x1.std() / np.sqrt(len(x1))
    se2 = x2.std() / np.sqrt(len(x2))
    denominator = np.sqrt(se1**2 + se2**2)
    t_stat = numerator / denominator  # our t statistic
    ci_lb = numerator - 1.96 * denominator  # lower bound
    ci_ub = numerator + 1.96 * denominator  # upper bound
    ci = (ci_lb, ci_ub)
    print('Two-sample t-test')
    print(f'Mean in group 1: {x1.mean()}')
    print(f'Mean in group 2: {x2.mean()}')
    print(f'Point estimate for difference in means: {numerator}')
    print(f'Test statistic: {t_stat}')
    print(f'95% confidence interval: {ci}')
    return numerator, t_stat, ci

## Exercise 1

In the sample:
* What is the average value of `lead`?
* What is the average value of `infRate`?
* What is the average value of `infRate` for cities with lead pipes?
* What is the average value of `infRate` for cities with non-lead pipes?
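The mechanics of these four averages can be sketched with pandas. This is only an illustration on made-up data: the `demo` DataFrame below is a hypothetical stand-in that borrows the column names of the real dataset, and its numbers say nothing about the actual answers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for lead_mortality.csv; all values are made up.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "lead": rng.integers(0, 2, size=n),   # 0/1 dummy
    "ph": rng.uniform(5.5, 9.0, size=n),
})
demo["infRate"] = 0.4 + 0.05 * demo["lead"] + rng.normal(0, 0.1, size=n)

# The mean of a 0/1 dummy is the sample share of ones.
share_lead = demo["lead"].mean()
mean_inf = demo["infRate"].mean()

# Conditional means: index 0 = non-lead cities, index 1 = lead cities.
mean_by_lead = demo.groupby("lead")["infRate"].mean()

print(f"Share of cities with lead pipes: {share_lead:.3f}")
print(f"Overall mean of infRate: {mean_inf:.3f}")
print(mean_by_lead)
```

With the real data, the same calls would be applied to `df` once it has been loaded.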

## Exercise 2

Construct a 95% confidence interval for the difference in `infRate` between cities with lead pipes and cities with non-lead pipes.
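As a rough sketch of the arithmetic behind such an interval (the same steps the `t_test` helper above performs), the example below builds the interval as point estimate plus or minus 1.96 standard errors. The two arrays and the function name `diff_in_means_ci` are hypothetical and the data are synthetic.

```python
import numpy as np

# Synthetic outcomes for two groups; values are made up for illustration.
rng = np.random.default_rng(1)
inf_lead = rng.normal(0.40, 0.15, size=120)
inf_nolead = rng.normal(0.38, 0.15, size=80)

def diff_in_means_ci(x1, x2, z=1.96):
    """Difference in means and a normal-approximation CI (unequal variances)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    diff = x1.mean() - x2.mean()
    # Standard error of the difference: sqrt(s1^2/n1 + s2^2/n2), with ddof=1.
    se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
    return diff, se, (diff - z * se, diff + z * se)

diff, se, ci = diff_in_means_ci(inf_lead, inf_nolead)
print(f"Difference in means: {diff:.4f}")
print(f"Standard error: {se:.4f}")
print(f"95% confidence interval: ({ci[0]:.4f}, {ci[1]:.4f})")
```

If the interval contains zero, the difference is not statistically significant at the 5% level under this approximation.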

## Exercise 3


The amount of lead leached from lead pipes depends on the chemistry of the water running through the
pipes. The more acidic the water is (lower pH), the more lead is leached. It seems natural to study
whether cities with lead pipes **and** low pH levels have particularly high infant mortality rates. The
way to do this in a regression analysis is by using interaction terms.

Plot a histogram of ``ph`` to get a sense of the distribution of water acidity.
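A minimal histogram sketch with matplotlib follows. The array `ph_demo` is synthetic and hypothetical, and the choice of 20 bins is an arbitrary assumption, not a recommendation from the exercise.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic pH values; made up for illustration only.
rng = np.random.default_rng(2)
ph_demo = rng.normal(7.3, 0.7, size=200)

fig, ax = plt.subplots()
# ax.hist returns the bin counts, the bin edges, and the drawn patches.
counts, bin_edges, _ = ax.hist(ph_demo, bins=20, edgecolor="black")
ax.set_xlabel("ph")
ax.set_ylabel("Number of cities")
ax.set_title("Distribution of water pH (synthetic data)")
```

With the real data, `df['ph']` would take the place of `ph_demo`.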


## Exercise 4 

### Exercise 4a
Run the following alternative regressions:

* specification 1: ``infRate`` on ``lead``
* specification 2: ``infRate`` on ``lead`` and ``ph``
* specification 3: ``infRate`` on ``lead``, ``ph``, and the interaction term ``lead`` x ``ph``.
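One way to set up the three specifications with the statsmodels formula API is sketched below. The `demo` DataFrame is synthetic with made-up coefficients, and the use of heteroskedasticity-robust standard errors (`cov_type="HC1"`) is an assumption on my part rather than something the exercise prescribes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the real data; coefficients below are made up.
rng = np.random.default_rng(3)
n = 300
demo = pd.DataFrame({
    "lead": rng.integers(0, 2, size=n),
    "ph": rng.uniform(5.5, 9.0, size=n),
})
demo["infRate"] = (
    0.5 + 0.8 * demo["lead"] - 0.02 * demo["ph"]
    - 0.1 * demo["lead"] * demo["ph"] + rng.normal(0, 0.1, size=n)
)

formulas = {
    "spec1": "infRate ~ lead",
    "spec2": "infRate ~ lead + ph",
    # lead:ph is the formula syntax for the product lead x ph
    "spec3": "infRate ~ lead + ph + lead:ph",
}

# Fit each specification; HC1 gives heteroskedasticity-robust standard errors.
results = {
    name: smf.ols(formula, data=demo).fit(cov_type="HC1")
    for name, formula in formulas.items()
}

for name, res in results.items():
    print(name)
    print(res.params.round(3))
```

In specification 1 the coefficient on `lead` equals the difference in group means, which gives a quick consistency check against Exercise 1.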



### Exercise 4b
Discuss the differences between each specification. Carefully discuss all coefficients in specification 3!

## Exercise 5 

Using specification 3, does ``lead`` have a statistically significant effect on ``infRate``?
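In specification 3, `lead` enters through two coefficients (its own and the interaction), so one natural approach is a joint test that both are zero. The sketch below does this with a restriction matrix on synthetic data; the `demo` DataFrame, its coefficients, and the robust covariance choice are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical synthetic data; not the real Lead_Mortality sample.
rng = np.random.default_rng(4)
n = 300
demo = pd.DataFrame({
    "lead": rng.integers(0, 2, size=n),
    "ph": rng.uniform(5.5, 9.0, size=n),
})
demo["infRate"] = (
    0.5 + 0.8 * demo["lead"] - 0.02 * demo["ph"]
    - 0.1 * demo["lead"] * demo["ph"] + rng.normal(0, 0.1, size=n)
)

res = smf.ols("infRate ~ lead + ph + lead:ph", data=demo).fit(cov_type="HC1")

# Build R so that R @ beta = 0 encodes H0: beta_lead = 0 and beta_lead:ph = 0.
names = list(res.params.index)
R = np.zeros((2, len(names)))
R[0, names.index("lead")] = 1.0
R[1, names.index("lead:ph")] = 1.0

joint = res.f_test(R)
f_stat = float(np.squeeze(joint.fvalue))
p_value = float(np.squeeze(joint.pvalue))
print(f"Joint F statistic: {f_stat:.3f}, p-value: {p_value:.4f}")
```

A small p-value would lead to rejecting the hypothesis that `lead` has no effect at any pH level.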


## Exercise 6

Does the effect of ``lead`` on ``infRate`` depend on ``ph``?
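Whether the effect varies with pH comes down to the coefficient on the interaction term. As a hedged sketch on synthetic data (the `demo` DataFrame and its coefficients are made up), the snippet below reads off the estimate, its test statistic, p-value, and 95% confidence interval.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical synthetic data for illustration only.
rng = np.random.default_rng(5)
n = 300
demo = pd.DataFrame({
    "lead": rng.integers(0, 2, size=n),
    "ph": rng.uniform(5.5, 9.0, size=n),
})
demo["infRate"] = (
    0.5 + 0.8 * demo["lead"] - 0.02 * demo["ph"]
    - 0.1 * demo["lead"] * demo["ph"] + rng.normal(0, 0.1, size=n)
)

res = smf.ols("infRate ~ lead + ph + lead:ph", data=demo).fit(cov_type="HC1")

# The interaction coefficient measures how the lead effect changes per unit of ph.
b_inter = res.params["lead:ph"]
t_inter = res.tvalues["lead:ph"]
p_inter = res.pvalues["lead:ph"]
ci_low, ci_high = res.conf_int().loc["lead:ph"]

print(f"Interaction estimate: {b_inter:.4f}")
print(f"Test statistic: {t_inter:.3f}, p-value: {p_inter:.4f}")
print(f"95% confidence interval: ({ci_low:.4f}, {ci_high:.4f})")
```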


## Exercise 7
    
### Exercise 7a
Write down the
* estimated population regression function (PRF) for a city with ``lead=1``
* estimated PRF for a city with ``lead=0``
* difference between the two estimated PRFs
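As a symbolic sketch, write specification 3 with generic coefficient estimates $\hat\beta_0, \dots, \hat\beta_3$ (this notation is an assumption and does not appear in the exercise). Substituting the two values of ``lead`` gives the general form of the three functions, into which your own estimates can be inserted:

$$
\begin{aligned}
\widehat{infRate} &= \hat\beta_0 + \hat\beta_1\, lead + \hat\beta_2\, ph + \hat\beta_3\, (lead \times ph) \\
lead = 1: \quad \widehat{infRate} &= (\hat\beta_0 + \hat\beta_1) + (\hat\beta_2 + \hat\beta_3)\, ph \\
lead = 0: \quad \widehat{infRate} &= \hat\beta_0 + \hat\beta_2\, ph \\
\text{difference:} \quad & \hat\beta_1 + \hat\beta_3\, ph
\end{aligned}
$$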



### Exercise 7b
All three are functions of ``ph``. Produce a scatter plot of ``infRate`` versus ``ph`` and add the three functions to the graph. Use this graph to study how the effect of lead pipes changes as the pH level changes.
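A possible layout for such a graph is sketched below on synthetic data. The `demo` DataFrame, its coefficients, and the styling choices are hypothetical; only the way the three lines are built from the fitted coefficients is the point.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical synthetic data; not the real Lead_Mortality sample.
rng = np.random.default_rng(6)
n = 300
demo = pd.DataFrame({
    "lead": rng.integers(0, 2, size=n),
    "ph": rng.uniform(5.5, 9.0, size=n),
})
demo["infRate"] = (
    0.5 + 0.8 * demo["lead"] - 0.02 * demo["ph"]
    - 0.1 * demo["lead"] * demo["ph"] + rng.normal(0, 0.1, size=n)
)

b = smf.ols("infRate ~ lead + ph + lead:ph", data=demo).fit().params

# Evaluate the three functions on a grid spanning the observed pH range.
ph_grid = np.linspace(demo["ph"].min(), demo["ph"].max(), 100)
prf_lead = (b["Intercept"] + b["lead"]) + (b["ph"] + b["lead:ph"]) * ph_grid
prf_nolead = b["Intercept"] + b["ph"] * ph_grid
prf_diff = prf_lead - prf_nolead  # equals b["lead"] + b["lead:ph"] * ph_grid

fig, ax = plt.subplots()
ax.scatter(demo["ph"], demo["infRate"], s=10, alpha=0.4, color="grey",
           label="cities (synthetic)")
ax.plot(ph_grid, prf_lead, label="lead = 1")
ax.plot(ph_grid, prf_nolead, label="lead = 0")
ax.plot(ph_grid, prf_diff, linestyle="--", label="difference")
ax.set_xlabel("ph")
ax.set_ylabel("infRate")
ax.legend()
```

The pH value at which the dashed line crosses zero is where the estimated effect of lead pipes changes sign.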
    

### Attribution: 
This exercise is based on Empirical Exercise 8.1 of
Stock and Watson, *Introduction to Econometrics*, 4th global edition.