Based on Raghabendra Chattopadhyay and Esther Duflo. 2004. Women as Policy Makers: Evidence from a Randomized Policy Experiment in India, Econometrica, 72(5): 1409–43.
All materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).
Let’s continue working with the data from the experiment in India. As a reminder, Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is the village.
Table 1: Variables in “india.csv”
Variable | Description |
---|---|
village | village identifier (“Gram Panchayat number_village number”) |
female | whether village was assigned a female politician: 1=yes, 0=no |
water | number of new (or repaired) drinking water facilities in the village since random assignment |
irrigation | number of new (or repaired) irrigation facilities in the village since random assignment |
In this problem set, we will practice (1) how to estimate an average treatment effect using data from a randomized experiment and (2) how to write a conclusion statement.
As always, we will start by loading and looking at the data (don’t forget to set your working directory first!):
village female water irrigation
1 GP1_village2 1 10 0
2 GP1_village1 1 0 5
3 GP2_village2 1 2 2
4 GP2_village1 1 31 4
5 GP3_village2 0 0 0
6 GP3_village1 0 0 0
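The loading code itself is not reproduced above, only its output. A minimal sketch of how such output could be produced follows, assuming the data are read with read.csv() and inspected with head(). So that the sketch runs on its own, it writes the six rows printed above to a temporary file; the names csv_lines, csv_path, and india_head are hypothetical stand-ins, and with the real file one would read "india.csv" from the working directory instead.

```r
# Hypothetical stand-in for "india.csv": the six rows printed above,
# written to a temporary file so this sketch is self-contained.
csv_lines <- c(
  "village,female,water,irrigation",
  "GP1_village2,1,10,0",
  "GP1_village1,1,0,5",
  "GP2_village2,1,2,2",
  "GP2_village1,1,31,4",
  "GP3_village2,0,0,0",
  "GP3_village1,0,0,0"
)
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# With the real data this would be: india <- read.csv("india.csv")
india_head <- read.csv(csv_path)

# Show the first six observations (here, all of them)
head(india_head)
```

The temporary file is only a convenience for this illustration; the analysis in the text uses the full dataset of 322 villages.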
install.packages("tidyverse") # installing the tidyverse packages, if we have not already
library(tidyverse) # loading the tidyverse packages, which provide %>% and count()
india %>% count(female) # calculates the number of villages with each value of female in the dataframe india
female n
1 0 214
2 1 108
We can see that 108 of the villages in our sample were assigned female politicians (female == 1) while 214 were not (female == 0).
india %>% count(female) %>% mutate(prop = prop.table(n)) # calculates the proportion of sample villages with each value of female in the dataframe india
female n prop
1 0 214 0.6645963
2 1 108 0.3354037
Proportions can be transformed into percentages by multiplying by 100. Therefore, we can say that 66.46% (0.6645963 * 100, to 2 decimal places) of our sample is made up of villages that have male politicians (female == 0) and 33.54% (0.3354037 * 100, to 2 decimal places) of our sample is made up of villages that have female politicians (female == 1).
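The conversion from counts to proportions to percentages can also be checked by hand. The following is a small base R sketch that starts from the two counts reported above rather than from the full dataframe; the object names counts, props, and percentages are illustrative only.

```r
# Counts taken from the count(female) output above
counts <- c("female == 0" = 214, "female == 1" = 108)

# Each count divided by the total number of villages (322)
props <- prop.table(counts)

# Proportions expressed as percentages, to 2 decimal places
percentages <- round(props * 100, 2)

print(percentages)
```

This reproduces the 66.46% and 33.54% quoted in the text.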
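install.packages("ggplot2") # installing ggplot2 package
library(ggplot2) # loading ggplot2 package
ggplot(data = india, aes(x = irrigation)) + # creating the canvas and mapping irrigation to the x axis
  geom_histogram() # adding the histogram geometry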
First, we install and load the ggplot2 package, which contains the ggplot() function. Then, in the third code chunk, we write the code that produces our histogram. Whenever we create a graph, we start with the ggplot() function, which creates a canvas to add plot elements to. Here we give it two arguments. The first is the dataframe; here we use the dataframe india. The second specifies which columns of the dataframe we want to plot and on which axis, using the aes() function. Here we map the column - or variable - irrigation to the x axis. Specifying these two arguments alone will not draw our histogram; it only sets up the canvas. To produce a histogram, we then need to add a histogram geometry to our code using the function geom_histogram().
Our histogram shows us that the vast majority of the villages in our sample (more than 200 of our 322 villages) have had only a very small number of new (or repaired) irrigation facilities built in the village since random assignment. Only a tiny number of the villages in our sample have had 25 or more new (or repaired) irrigation facilities built in the village since random assignment. It is clear that the variable irrigation does not follow a normal distribution.
mean
1 3.263975
median
1 0
sd
1 9.492506
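The three outputs above are shown without the code that produced them. The sketch below illustrates the same kind of calculation in base R, under the assumption that the figures are the mean, median, and standard deviation of irrigation. It uses only the six irrigation values printed by head() earlier, so its results will not match the figures for the full dataset; irrigation_head and summary_stats are hypothetical names.

```r
# Toy data: irrigation values of the six villages shown by head().
# This is NOT the full 322-village dataset, so the results differ
# from the mean, median, and sd reported in the text.
irrigation_head <- c(0, 5, 2, 4, 0, 0)

# Collect the three summary statistics in a one-row dataframe
summary_stats <- data.frame(
  mean   = mean(irrigation_head),    # arithmetic average
  median = median(irrigation_head),  # middle value of the sorted data
  sd     = sd(irrigation_head)       # sample standard deviation
)

print(summary_stats)
```

With the full data loaded as india, the same functions would be applied to india$irrigation instead of irrigation_head.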
To obtain the variance of the variable irrigation we can simply square the standard deviation for this variable which was returned above. We can ask R to run this calculation for us like so:
9.492506^2 # finding the square of 9.492506 (the standard deviation of irrigation) to calculate the variance of this variable
[1] 90.10767
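Squaring the standard deviation works because the variance is defined as the square of the standard deviation. A short sketch of that relationship follows; it reuses the reported standard deviation and then checks the identity with R's var() and sd() on toy values (the six irrigation values printed by head(), not the full dataset). The object names are illustrative.

```r
# Standard deviation of irrigation as reported above
sd_irrigation <- 9.492506

# Variance obtained by squaring the standard deviation
variance_from_sd <- sd_irrigation^2
print(round(variance_from_sd, 2))

# Check the identity var(x) == sd(x)^2 on toy values
toy <- c(0, 5, 2, 4, 0, 0)
identity_holds <- isTRUE(all.equal(var(toy), sd(toy)^2))
print(identity_holds)
```

With the full data, var(india$irrigation) should agree with the squared standard deviation up to rounding of the printed value.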
We can interpret these figures as follows:
Mean: On average, villages in our sample have had about 3.26 new (or repaired) irrigation facilities built since random assignment.
Median: Recall that the median represents the middle value - or 50th percentile - of a dataset. At least half the values in the dataset are less than or equal to the median, and at least half are greater than or equal to it. Calculating the median is important because it gives us an idea of the ‘typical’ value in the dataset, unlike the mean, which can be distorted by extreme values. Here, the median value of 0 indicates that the typical village in our dataset has had 0 new (or repaired) irrigation facilities built since random assignment; in other words, at least half of the villages have had none.
Standard deviation: the standard deviation of irrigation is 9.492506, which is a relatively large value (especially given the comparatively small size of this variable’s median and mean values). This indicates that the irrigation values are widely spread around their mean.
If the variable irrigation were normally distributed (which we know it is not, based on our histogram!), about 95% of the villages in our experiment would have irrigation values between approximately 0 and 22. We can calculate the bounds within which 95% of irrigation data would lie if this variable were normally distributed by multiplying this variable’s standard deviation by two and then subtracting this quantity from, and adding it to, the mean of this variable (calculated previously). We can ask R to make these calculations for us.
3.263975 - (2*9.492506)
[1] -15.72104
3.263975 + (2*9.492506) # finding the bounds within which 95% of irrigation data would lie if this variable were normally distributed
[1] 22.24899
While R finds that the lower bound of irrigation values would be about -15.72 under a normal distribution, we take a lower bound of 0. This is because it is not possible for a negative number of new (or repaired) irrigation facilities to have been built in the villages since random assignment; the minimum possible value here is 0.
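The two-standard-deviation bounds and the truncation at 0 can be written as one small calculation. This is a sketch using the mean and standard deviation reported above; the names lower_raw, lower, and upper are illustrative, and max(0, ...) is simply one way to express "no fewer than zero facilities".

```r
# Mean and standard deviation of irrigation as reported above
mean_irrigation <- 3.263975
sd_irrigation   <- 9.492506

# Bounds two standard deviations either side of the mean
lower_raw <- mean_irrigation - 2 * sd_irrigation
upper     <- mean_irrigation + 2 * sd_irrigation

# A count of facilities cannot be negative, so truncate the lower bound at 0
lower <- max(0, lower_raw)

print(c(lower = lower, upper = upper))
```

That the untruncated lower bound is negative is itself a sign that the normal approximation fits this variable poorly.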
Variance: the variance of irrigation is approximately 90.11. Again, this value is large, indicating a good deal of spread in the number of new (or repaired) irrigation facilities built in our sample of villages since random assignment.
Remember: we are usually better off using standard deviations as our measure of spread, as they are easier to interpret because they are in the same unit of measurement as the variable.
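The earlier point that the mean, unlike the median, can be distorted by extreme values can be illustrated with a few numbers. The sketch below uses the six water values printed by head() and then replaces the largest one with a deliberately exaggerated, hypothetical value of 310; neither vector is the full dataset, and the object names are illustrative.

```r
# Toy data: water values of the six villages shown by head()
water_head <- c(10, 0, 2, 31, 0, 0)

# Hypothetical variant: the largest value (31) inflated to 310
water_extreme <- replace(water_head, water_head == 31, 310)

# Compare how the mean and the median respond to the extreme value
comparison <- data.frame(
  data   = c("original", "with extreme value"),
  mean   = c(mean(water_head), mean(water_extreme)),
  median = c(median(water_head), median(water_extreme))
)

print(comparison)
```

The mean rises sharply once the extreme value is introduced, while the median stays where it was.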