Based on Raghabendra Chattopadhyay and Esther Duflo. 2004. Women as Policy Makers: Evidence from a Randomized Policy Experiment in India, Econometrica, 72(5): 1409–43.
All materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).
In a few of this module’s problem sets, we will estimate the average causal effect of having a female politician on two different policy outcomes. For this purpose, we will analyse data from an experiment conducted in India, where villages were randomly assigned to have a female council head. The dataset we will use is in a file called “india.csv”. Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is the village.
Table 1: Variables in “india.csv”
Variable | Description |
---|---|
village | village identifier (“Gram Panchayat number_village number”) |
female | whether village was assigned a female politician: 1=yes, 0=no |
water | number of new (or repaired) drinking water facilities in the village since random assignment |
irrigation | number of new (or repaired) irrigation facilities in the village since random assignment |
In this problem set, we will practice how to load and make sense of data.
Use the function read.csv() to read the CSV file “india.csv” and use the assignment operator <- to store the data in an object called india. You must set your working directory before doing so. You can do this by selecting Session >> Set Working Directory >> Choose Directory… from the RStudio menu.
Use the function head() to view the first few observations of the dataset.
What does each observation in this dataset represent?
How can we substantively interpret the first observation in the dataset?
For each variable in the dataset, please identify the type of variable (character vs. numeric binary vs. numeric non-binary).
How many observations are in the dataset? In other words, how many villages were part of this experiment? (Hint: the function dim() might be helpful here.) Please provide both the R code you used and the substantive answer.
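The steps above can be sketched as follows. This is a minimal illustration, not a model answer: because the real “india.csv” is not bundled with this page, the sketch first writes a small stand-in file containing only the six rows printed in this module's head() output. The names stand_in and csv_path are assumptions introduced for illustration. With the real file in your working directory, only the read.csv() call is needed, and dim() will report more rows than the six shown here.

```r
# Stand-in for "india.csv": only the six rows printed in this module.
# (stand_in and csv_path are hypothetical names used for illustration.)
stand_in <- data.frame(
  village    = c("GP1_village2", "GP1_village1", "GP2_village2",
                 "GP2_village1", "GP3_village2", "GP3_village1"),
  female     = c(1, 1, 1, 1, 0, 0),
  water      = c(10, 0, 2, 31, 0, 0),
  irrigation = c(0, 5, 2, 4, 0, 0)
)
csv_path <- file.path(tempdir(), "india.csv")
write.csv(stand_in, csv_path, row.names = FALSE)

# With the real file in your working directory: india <- read.csv("india.csv")
india <- read.csv(csv_path)

head(india)           # first few observations
dim(india)            # number of rows (villages), then number of columns (variables)
sapply(india, class)  # how R stores each variable
```

The first number returned by dim() is the number of observations and the second is the number of variables.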
library() and the tidyverse
One of the most time-consuming parts of using R, and data analysis in general, is managing, tidying and ‘wrangling’ your data. Often, we start with data in a format that is not particularly useful for analysis. We might come across a dataset, for example, with far too many columns, covering lots of variables we are not interested in.
The base functions in R, which are what the textbook uses, might not always be the best option for data manipulation and analysis. Users can create collections of functions, called “packages”. One of those packages is the tidyverse, which is, in turn, a collection of packages for data analysis, manipulation, and visualisation, among other useful tools.
The first time we use a package, we need to install it with install.packages(). We then use the library() function to load the package. You only need to install the package once, but you will need to load it every time you open RStudio:
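A minimal sketch of both steps, assuming an internet connection for the one-time installation. The install line is commented out here so that it is not re-run by accident; remove the leading # the first time you use the package.

```r
# Run once per computer:
# install.packages("tidyverse")

# Run at the start of every R session:
library(tidyverse)
```

Loading the tidyverse attaches several of its core packages at once, including dplyr, which provides summarise().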
Let’s continue working with the data from the experiment in India. As a reminder, Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is the village.
Table 1: Variables in “india.csv”
Variable | Description |
---|---|
village | village identifier (“Gram Panchayat number_village number”) |
female | whether village was assigned a female politician: 1=yes, 0=no |
water | number of new (or repaired) drinking water facilities in the village since random assignment |
irrigation | number of new (or repaired) irrigation facilities in the village since random assignment |
In this problem set, we practice how to compute and interpret means, among other things.
We will start by loading the data with read.csv() and looking at the first few observations with head(), like so:
       village female water irrigation
1 GP1_village2      1    10          0
2 GP1_village1      1     0          5
3 GP2_village2      1     2          2
4 GP2_village1      1    31          4
5 GP3_village2      0     0          0
6 GP3_village1      0     0          0
Use the mean() function or the tidyverse function summarise() to calculate the average of the variable female. Please provide a full substantive interpretation of what this average means. Make sure to provide the unit of measurement. For example:
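Both approaches can be sketched as follows. To keep the sketch runnable on its own, it rebuilds a stand-in india object from only the six rows printed above, so the number it returns is not the answer for the full dataset. The column name mean_female is an assumption chosen for illustration.

```r
library(tidyverse)

# Stand-in for the full dataset: only the six rows printed above.
india <- data.frame(
  village    = c("GP1_village2", "GP1_village1", "GP2_village2",
                 "GP2_village1", "GP3_village2", "GP3_village1"),
  female     = c(1, 1, 1, 1, 0, 0),
  water      = c(10, 0, 2, 31, 0, 0),
  irrigation = c(0, 5, 2, 4, 0, 0)
)

# Base R: select the variable with $ and pass it to mean()
mean(india$female)

# tidyverse: summarise() collapses the data frame into one row of summaries
india %>%
  summarise(mean_female = mean(female))
```

Because female is coded 0/1, its mean is the share of observations coded 1, which can be read as a percentage after multiplying by 100. The same pattern applies to the other variables once the column name is swapped.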
Now use the tidyverse function summarise() to calculate the average of the variable water. Again, please provide a full substantive interpretation of what this average means and make sure to provide the unit of measurement.
If we wanted to estimate the average causal effect of having a female politician on the number of new (and repaired) drinking water facilities:
If we wanted to estimate the average causal effect of having a female politician on the number of new (and repaired) irrigation facilities:
In both analyses above: