POL269 - Political Data Research: Solutions Seminar 2

Do Women Promote Different Policies than Men?

Based on Raghabendra Chattopadhyay and Esther Duflo. 2004. Women as Policy Makers: Evidence from a Randomized Policy Experiment in India, Econometrica, 72(5): 1409–43.

All materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).

Let’s continue working with the data from the experiment in India. As a reminder, Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is the village.

Table 1: Variables in “india.csv”

Variable	Description
village	village identifier (“Gram Panchayat number_village number”)
female	whether village was assigned a female politician: 1=yes, 0=no
water	number of new (or repaired) drinking water facilities in the village since random assignment
irrigation	number of new (or repaired) irrigation facilities in the village since random assignment

In this problem set, we will practice (1) how to estimate an average treatment effect using data from a randomized experiment and (2) how to write a conclusion statement.

As always, we will start by loading and looking at the data (don’t forget to set your working directory first!):

india <- read.csv("india.csv") # reads and stores data
head(india) # shows first observations

       village female water irrigation
1 GP1_village2      1    10          0
2 GP1_village1      1     0          5
3 GP2_village2      1     2          2
4 GP2_village1      1    31          4
5 GP3_village2      0     0          0
6 GP3_village1      0     0          0

To produce a frequency table reporting how many of the villages in the sample have female politicians and how many have male politicians, we use the function count():

install.packages("tidyverse") # installing tidyverse packages, if we have not already

library(tidyverse) # loading tidyverse packages, which include the count() and summarise() functions

india %>% count(female) # calculates the number of villages with each value of female in the dataframe india

  female   n
1      0 214
2      1 108

We can see that 108 of the villages in our sample were assigned female politicians (female == 1) while 214 were not (female == 0).

To produce a table of proportions, which reports the share of villages in the sample that have female politicians and the share that have male politicians, we combine the count() function with the mutate() function, like so:

india %>% count(female) %>% mutate(prop = prop.table(n)) # calculates the proportion of sample villages with each value of female in the dataframe india

  female   n      prop
1      0 214 0.6645963
2      1 108 0.3354037

Proportions can be transformed into percentages by multiplying by 100. Therefore, we can say that 66.46% (0.6645963 * 100, to 2 decimal places) of our sample is made up of villages that have male politicians (female == 0) and 33.54% (0.3354037 * 100, to 2 decimal places) of our sample is made up of villages that have female politicians (female == 1).
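The conversion from proportions to percentages can also be done inside the same pipeline. The following is a minimal sketch: so that it runs without india.csv, it rebuilds the female column from the counts reported above (214 zeros and 108 ones). The names female_only, percent_table and percent are our own choices.

```r
library(dplyr) # provides count(), mutate() and the %>% pipe

# Rebuild the female column from the counts in the frequency table above
# (an assumption made so that the example runs without india.csv)
female_only <- data.frame(female = c(rep(0, 214), rep(1, 108)))

percent_table <- female_only %>%
  count(female) %>%                       # number of villages per value of female
  mutate(prop = n / sum(n),               # proportion of the sample
         percent = round(100 * prop, 2))  # percentage, to 2 decimal places

percent_table # one row per value of female, with n, prop and percent
```

With the full dataset loaded, the same pipeline should work with india in place of female_only.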

You can create a histogram which shows the distribution of the variable irrigation using the ggplot() function, like so:

install.packages("ggplot2") # installing ggplot2 package (not needed if tidyverse is already installed, as ggplot2 is part of it)

library(ggplot2) # loading ggplot2 package, which will allow us to plot our histogram

ggplot(data = india, aes(x = irrigation)) +
  geom_histogram()

First, we install and load the ggplot2 package, which contains the ggplot() function (ggplot2 is also loaded by library(tidyverse), so these two steps are optional if you have already run the code above). Then, in the third code chunk, we write the code that produces our histogram. When we create any type of graph, we start with the ggplot() function, which creates a canvas to add plot elements to. Here we give it two arguments. The first is the dataframe, in this case india. The second, built with the aes() function, specifies which columns of the dataframe we want to plot and on which axis. Here we map the column (or variable) irrigation to the x axis. Specifying these two arguments alone will not draw our graph. To produce a histogram, we need to add a histogram geometry to our code using the function geom_histogram().

Our histogram shows us that the vast majority of the villages in our sample (more than 200 of our 322 villages) have had only a very small number of new (or repaired) irrigation facilities built in the village since random assignment. Only a tiny number of the villages in our sample have had 25 or more new (or repaired) irrigation facilities built in the village since random assignment. It is clear that the variable irrigation does not follow a normal distribution.
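Called without arguments, geom_histogram() groups the data into 30 bins by default and prints a message suggesting that we pick a bin width ourselves. As a sketch of how the plot could be refined, the code below sets binwidth explicitly and adds axis labels. To stay self-contained it uses only the six villages printed by head(india) above, stored in a data frame we call india_head (our own name). The bin width of 1 and the label text are illustrative choices, not part of the original exercise.

```r
library(ggplot2) # provides ggplot(), aes(), geom_histogram() and labs()

# The six rows shown by head(india) above; with the full data, use india instead
india_head <- data.frame(
  village = c("GP1_village2", "GP1_village1", "GP2_village2",
              "GP2_village1", "GP3_village2", "GP3_village1"),
  irrigation = c(0, 5, 2, 4, 0, 0)
)

p <- ggplot(data = india_head, aes(x = irrigation)) +
  geom_histogram(binwidth = 1) +                      # one bar per facility count
  labs(x = "New (or repaired) irrigation facilities", # axis labels are our own wording
       y = "Number of villages")

p # printing the object draws the plot
```

For the full sample a wider bin may be easier to read, since irrigation takes values above 25 in a few villages.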

We can compute the mean, median and standard deviation of the variable irrigation as follows:

india %>% summarise(mean = mean(irrigation)) # calculates the average (mean) of irrigation

      mean
1 3.263975

india %>% summarise(median = median(irrigation)) # calculates the median (middle value) of irrigation

  median
1      0

india %>% summarise(sd = sd(irrigation)) # calculates the standard deviation of irrigation

        sd
1 9.492506
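The three statistics can also be requested in a single call to summarise(), which returns them side by side. The following is a minimal sketch that uses only the six villages printed by head(india) earlier, so its numbers differ from the full-sample results above. The names india_head and irrigation_stats are our own.

```r
library(dplyr) # provides summarise() and the %>% pipe

# Irrigation values of the six rows shown by head(india);
# with the full data, use india instead of india_head
india_head <- data.frame(irrigation = c(0, 5, 2, 4, 0, 0))

irrigation_stats <- india_head %>%
  summarise(mean = mean(irrigation),     # arithmetic mean
            median = median(irrigation), # middle value
            sd = sd(irrigation))         # sample standard deviation

irrigation_stats # one row with three columns: mean, median, sd
```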

To obtain the variance of the variable irrigation we can simply square the standard deviation for this variable which was returned above. We can ask R to run this calculation for us like so:

9.492506^2 # finding the square of 9.492506 (the standard deviation of irrigation) to calculate the variance of this variable

[1] 90.10767
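Squaring a rounded standard deviation reproduces the variance only up to rounding error. R also has a built-in var() function, which computes the sample variance directly. A small sketch, using the irrigation values of the six villages shown by head(india) earlier (the object names are our own):

```r
# Irrigation values of the six rows shown by head(india)
irrigation_head <- c(0, 5, 2, 4, 0, 0)

variance_direct <- var(irrigation_head)   # sample variance computed directly
variance_from_sd <- sd(irrigation_head)^2 # square of the sample standard deviation

all.equal(variance_direct, variance_from_sd) # the two routes agree

# With the full data loaded, the equivalent call would be:
# india %>% summarise(variance = var(irrigation))
```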

We can interpret these figures as follows:

Mean: On average, villages in our sample have had about 3.26 new (or repaired) irrigation facilities built since random assignment.

Median: Recall that the median represents the middle value (the 50th percentile) of a dataset. At least half the values in the dataset are less than or equal to the median and at least half are greater than or equal to it. Calculating the median is important because it gives us an idea of the ‘typical’ value in the dataset, unlike the mean, which can be distorted by extreme values. Here, the median of 0 indicates that at least half of the villages in our dataset have had 0 new (or repaired) irrigation facilities built since random assignment.
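To see how a single extreme value pulls the mean but not the median, consider the short sketch below. The vector x contains made-up values chosen for illustration only; it is not taken from india.csv.

```r
# Made-up values for illustration only (not from india.csv)
x <- c(0, 0, 0, 0, 1, 2, 50)

x_mean <- mean(x)     # pulled upwards by the single value of 50
x_median <- median(x) # the 4th of the 7 sorted values

# Share of values that are less than or equal to the median
share_at_or_below <- mean(x <= x_median)

c(mean = x_mean, median = x_median, share_at_or_below = share_at_or_below)
```

The mean lands well above most of the values, while the median stays at the value most observations share, which is the pattern we see for irrigation.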

Standard deviation: the standard deviation of irrigation is 9.492506, which is relatively large (especially compared with this variable’s median of 0 and mean of about 3.26). This indicates that the irrigation values are widely spread around the mean.

If the variable irrigation were normally distributed (which we know it is not, based on our histogram!), about 95% of the villages in our experiment would have irrigation values within two standard deviations of the mean. We can calculate these bounds by multiplying this variable’s standard deviation by two and then subtracting this quantity from, and adding it to, the mean of this variable (calculated previously). This gives a range from about -15.72 to 22.25, which in practice we read as approximately 0 to 22 because a count cannot be negative. We can ask R to make these calculations for us.

3.263975 - (2*9.492506)

[1] -15.72104

3.263975 + (2*9.492506) # finding the bounds within which 95% of irrigation data would lie if this variable were normally distributed

[1] 22.24899

While R finds that the lower bound of irrigation values would be -15.72 under a normal distribution, we take a lower bound of 0. This is because it is not possible for a negative number of new (or repaired) irrigation facilities to have been built in a village since random assignment: the minimum possible value is 0. The negative lower bound is itself another sign that the normal distribution describes this variable poorly.
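The same bounds can be computed from named objects rather than typed-in numbers, which makes the truncation at 0 explicit. This is a sketch under the assumption that we copy the mean and standard deviation from the output above; all object names are our own.

```r
# Mean and standard deviation of irrigation, copied from the output above
irrigation_mean <- 3.263975
irrigation_sd <- 9.492506

lower <- irrigation_mean - 2 * irrigation_sd # mean minus two standard deviations
upper <- irrigation_mean + 2 * irrigation_sd # mean plus two standard deviations

lower_truncated <- max(lower, 0) # a count of facilities cannot be negative

c(lower = lower_truncated, upper = upper)

# With the full data loaded, the share of villages inside these bounds could be checked with:
# india %>% summarise(share = mean(irrigation >= lower_truncated & irrigation <= upper))
```

Comparing that share with 95% would be one more way to judge how far irrigation departs from a normal distribution.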

Variance: the variance of irrigation is about 90.11, measured in squared units (facilities squared). Again, this value is large, indicating a good deal of spread in the number of new (or repaired) irrigation facilities built in our sample of villages since random assignment.

Remember: we are usually better off using standard deviations as our measure of spread, as they are easier to interpret because they are in the same unit of measurement as the variable.