# EMET2007 Week 2

In this tutorial you will become familiar with `Python` and `Jupyter Notebooks` (.ipynb).  You will get to know the two main cell types in Jupyter Notebooks and the keyboard shortcuts that allow you to manipulate them easily.  We will then do some basic exploratory econometric analysis.
    
## Important note
You need to have the datasets available before you run any commands. If you haven't already, please refer to the [instructions](https://juergenmeinecke.github.io/EMET2007/jupyter.html) on the course website for setting this up.

Once you have followed these instructions, you can refer to the corresponding code cell below to load the data into this notebook.

## Tutorial information
The source data for this week's tutorial comes from the World Bank, which maintains a database called *World Development Indicators* (WDI). The WDI database is updated each year in April and collects about 800 different economic variables for more than 150 countries. For example, the World Bank collects basic macroeconomic data on GDP and growth rates, as well as data on land area, population size, the environment, corruption, political institutions, poverty, and much more.

I have pulled a few variables from the following website:
    
https://databank.worldbank.org/source/world-development-indicators.

Note: To make it slightly easier for us syntactically, I have changed the variable names to use `camelCase`.

Here is a short table describing the included variables:

| Variable | Description | 
|:--|:--|
| ``countryName`` |      Name of country |
| ``countryCode`` |       Abbreviation for country name    |
| ``incomeGroup`` |       One of four categories (high, upper-middle, lower-middle, low) |
| ``region`` |              One of seven geographical regions |
| ``GDP`` |                Gross domestic product in 2019 (constant 2010 USD) |
| ``population`` |          Total number of residents (regardless of legal status or citizenship) |
| ``landArea`` |          Square kilometers |
| ``lifeExpecFem`` |      Life expectancy at birth, female |
| ``lifeExpecMale`` |     Life expectancy at birth, male |   
 
## Imports and loading data

Before we start working with our data, there are a few things we need to do:
1. Import the `pandas` package of `Python`. That's what we're doing in the cell below. `pandas` is a package that is widely used for data analysis, and `pandas` plays nicely with many other `Python` packages used for scientific computing.  For more information on `pandas`, see [here](https://pandas.pydata.org/). 

Note: the `as pd` part of the line below is a generally accepted convention.  It allows us to access `pandas` functions by typing `pd` instead of `pandas` every time.

In [2]:
import pandas as pd


2. Tell `pandas` to load the data file using its `read_csv()` function.  We have to specify where `pandas` can find the data file; `read_csv()` then converts the CSV file into a so-called `pandas` data frame. We then give this data frame the name `df`.


**Google Colab** users need to:
- uncomment the code in the cell below (by removing the `# ` from the front of every line)
- run the cell below

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_csv('drive/MyDrive/EMET2007/datasets/world_bank_wdi.csv')

**Anaconda/Jupyter Notebook** users need to:
- uncomment the code in the cell below (by removing the `# ` from the front of every line)
- run the cell below

In [4]:
# df = pd.read_csv('../datasets/world_bank_wdi.csv')
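
If you would like to see what `read_csv()` does before touching the course file, here is a minimal sketch. It reads a tiny made-up CSV from a string instead of from disk; the name `toy`, the country names, and all values are invented for illustration and have nothing to do with the WDI data.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (illustrative values only).
csv_text = """countryName,countryCode,population
Atlantis,ATL,1000
Elbonia,ELB,2500
"""

# read_csv accepts a file path or a file-like object;
# io.StringIO wraps the text so it behaves like an open file.
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy))          # the result is a pandas DataFrame
print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names taken from the first line of the CSV
```

With the real data, `df.shape` and `df.columns` are a quick way to confirm that the file loaded as expected.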

## Notebook shortcuts

Your tutor will now explain to you how you can execute pieces of a script with convenient keyboard shortcuts:

### Google Colab
__Normal mode__    
* `Enter` switches to Edit mode
* `Ctrl`+`Enter` runs the current cell
* `b` adds a new code cell below the current one
* `a` adds a new code cell above the current one
* `Ctrl+M D` deletes a cell
* `Ctrl+H` opens a find/replace dialog for the entire notebook
    
__Edit mode__    
* `Ctrl`+`Enter` runs the current cell and switches out of Edit mode (only works on code cells)
* `Esc` switches out of Edit mode without running

__Other functionality (found in dropdown menus)__
* Inserting text cells
* Copying/cutting cells


### Anaconda/Jupyter Notebook    
__Command mode__ (cells have blue edge)
* `Enter` switches to Edit mode
* `Shift`+`arrows` expands the cell selection
* `Ctrl`+`Enter` runs the current cell(s)
* `b` adds a new code cell below the current one
* `a` adds a new code cell above the current one
* `m` converts a code cell to a text (Markdown) cell
* `y` converts a text (Markdown) cell to a code cell
* `dd` deletes the current cell(s)
* `z` undoes a cell deletion
* `x` cuts the current cell(s)
* `c` copies the current cell(s)
* `v` pastes the copied/cut cell(s)
* `f` opens a find-and-replace dialog for the current cell(s)

    
__Edit mode__ (cells have green edge)  
* `Ctrl`+`Enter` runs the current cell and switches to Command mode
* `Esc` switches to Command mode without running
   

## Exercise 1

Get familiar with these shortcuts!  Create a code cell with the comment 'here is my code' and a text cell with the text 'This is my text answer'.

## Exercise 2
By now, you should have loaded the CSV file **world_bank_wdi.csv** into your notebook by running the corresponding cell above.

Type `df` in a code cell and run it. What happens? 

## Exercise 3 

Look at excerpts of the data using ``df.head()`` and ``df.tail()``. This gives you a bit of a feeling for the variables in the data.
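
As a rough sketch of how these two methods behave, here is a small example on a made-up data frame called `toy` (a hypothetical stand-in, not the WDI data). Both methods show 5 rows by default and accept an optional number of rows.

```python
import pandas as pd

# Made-up toy data; codes and values are invented for illustration.
toy = pd.DataFrame({
    "countryCode": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG"],
    "population": [10, 20, 30, 40, 50, 60, 70],
})

print(toy.head())    # first 5 rows (the default)
print(toy.tail(2))   # last 2 rows
```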

## Exercise 4
Let's look at some summary statistics of variables in this data set.

### Exercise 4a
Get your first summary statistics by using ``describe()`` on individual variables. How do you access individual variables? 
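
One possible approach, sketched on a made-up data frame `toy` (hypothetical values, not the WDI data): a single column can be selected with square brackets or, for simple column names, with dot notation. Either way you get a `pandas` Series, which has its own `describe()` method.

```python
import pandas as pd

# Made-up toy data; values are invented for illustration.
toy = pd.DataFrame({
    "countryCode": ["AAA", "BBB", "CCC", "DDD"],
    "population": [10.0, 20.0, 30.0, 40.0],
})

pop = toy["population"]   # bracket notation: works for any column name
same = toy.population     # dot notation: only for names that are valid identifiers
                          # and do not clash with existing DataFrame attributes

summary = pop.describe()  # count, mean, std, min, quartiles, max
print(summary)
```

Bracket notation is the safer habit, because dot notation silently fails to reach a column whose name matches a data frame method (a column called `count`, for example).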

### Exercise 4b
Accessing individual variables can be a nuisance (lots of typing).  How can we summarise all of the (numeric) variables in the dataset at once?
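
A minimal sketch, again on a made-up data frame `toy` rather than the WDI data: calling `describe()` on the whole data frame summarises every numeric column and, by default, leaves out text columns.

```python
import pandas as pd

# Made-up toy data with one text column and two numeric columns.
toy = pd.DataFrame({
    "countryName": ["Atlantis", "Elbonia", "Freedonia"],
    "population": [10, 20, 30],
    "landArea": [1.5, 2.5, 3.5],
})

summary = toy.describe()  # one column of statistics per numeric variable
print(summary)
```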

### Exercise 4c
What if we have a dataset with lots of numeric variables and we are only interested in a few of them?  How can we summarise multiple variables of our choosing?  Practice this by printing a summary of population, GDP, and the life expectancies for males and females.
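
One way to do this, sketched on a made-up data frame `toy` that borrows the column names from the table above (all values are invented): pass a list of column names inside the square brackets, then call `describe()` on the result.

```python
import pandas as pd

# Made-up toy data; column names follow the variable table, values are invented.
toy = pd.DataFrame({
    "countryCode": ["AAA", "BBB", "CCC"],
    "GDP": [100.0, 200.0, 300.0],
    "population": [10.0, 20.0, 30.0],
    "landArea": [1.5, 2.5, 3.5],
    "lifeExpecFem": [80.0, 75.0, 70.0],
    "lifeExpecMale": [76.0, 71.0, 66.0],
})

# Note the double brackets: the inner pair is a Python list of column names.
wanted = ["population", "GDP", "lifeExpecMale", "lifeExpecFem"]
summary = toy[wanted].describe()
print(summary)
```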

## Exercise 5
    
Play around with ``mean``, ``median``, ``std``, and ``var`` on individual variables. Notice how `pandas` automatically disregards missing values.
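
To illustrate the treatment of missing values, here is a small sketch on a made-up column with one missing entry (the values are invented, and `toy` is a hypothetical stand-in for `df`). By default these methods skip missing values; passing `skipna=False` switches that behaviour off.

```python
import numpy as np
import pandas as pd

# Made-up toy column with one missing value (np.nan).
toy = pd.DataFrame({"lifeExpecFem": [70.0, 80.0, np.nan, 90.0]})
col = toy["lifeExpecFem"]

print(col.mean())     # mean of the three non-missing values
print(col.median())   # median of the three non-missing values
print(col.std())      # sample standard deviation (divides by n - 1)
print(col.var())      # sample variance (divides by n - 1)

# With skipna=False the missing value propagates and the result is NaN.
print(col.mean(skipna=False))
```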

## Exercise 6
Sometimes it can be important to know how many missing values there are for a variable.  How do we find them?
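
One possible approach, sketched on a made-up data frame `toy` (invented values, not the WDI data): `isna()` marks each missing entry as `True`, and summing those marks counts them, either for one variable or for every column at once.

```python
import numpy as np
import pandas as pd

# Made-up toy data with a few missing values.
toy = pd.DataFrame({
    "countryCode": ["AAA", "BBB", "CCC", "DDD"],
    "GDP": [1.0, np.nan, 3.0, np.nan],
    "population": [10.0, 20.0, np.nan, 40.0],
})

# True counts as 1 and False as 0, so the sum is the number of missing values.
n_missing_gdp = toy["GDP"].isna().sum()
missing_per_column = toy.isna().sum()

print(n_missing_gdp)
print(missing_per_column)
```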