POL269 - Political Data Research: Solutions Seminar 4

Do Women Promote Different Policies than Men?

Based on Raghabendra Chattopadhyay and Esther Duflo. 2004. Women as Policy Makers: Evidence from a Randomized Policy Experiment in India, Econometrica, 72(5): 1409–43.

All materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).

Let’s continue working with the data from the experiment in India. As a reminder, Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is the village.

Table 1: Variables in “india.csv”

Variable	Description
village	village identifier (“Gram Panchayat number_village number”)
female	whether village was assigned a female politician: 1=yes, 0=no
water	number of new (or repaired) drinking water facilities in the village since random assignment
irrigation	number of new (or repaired) irrigation facilities in the village since random assignment

In this problem set, we practice (1) how to create and make sense of visualisations and (2) how to compute and interpret the correlation between two variables, among other things.

As always, we will start by loading and looking at the data (don’t forget to set your working directory first!):

india <- read.csv("india.csv") # reads and stores data
head(india) # shows first observations

       village female water irrigation
1 GP1_village2      1    10          0
2 GP1_village1      1     0          5
3 GP2_village2      1     2          2
4 GP2_village1      1    31          4
5 GP3_village2      0     0          0
6 GP3_village1      0     0          0

As water is a numeric non-binary variable, we will use a histogram to visualise its distribution, using the ggplot2 package (you may remember us doing the same for the variable irrigation in seminar 2). Before we can make graphs with ggplot2, we must install the package if we have not already done so, and then load it. Remember that you can use the Packages tab in RStudio to check whether ggplot2 is currently loaded in your session.

install.packages("ggplot2") # installing the ggplot2 package, if we have not already

library(ggplot2) # loading the ggplot2 package
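If you would rather check from the console than from the Packages tab, the following is a minimal sketch of one way to do it. The object names pkg_installed and pkg_attached are made up for illustration; requireNamespace() and search() are base R functions.

```r
# TRUE if ggplot2 is installed on this machine (this does not attach it)
pkg_installed <- requireNamespace("ggplot2", quietly = TRUE)

if (!pkg_installed) {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") once")
}

# TRUE only if library(ggplot2) has already been run in this session
pkg_attached <- "package:ggplot2" %in% search()

pkg_installed
pkg_attached
```

Installing is needed only once per machine, whereas library(ggplot2) has to be run again in every new R session.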

Once the ggplot2 package is loaded, we then run the following to produce our histogram of the variable water:

ggplot(data = india, aes(x = water)) + geom_histogram() # plotting a histogram of variable water

When we create any type of graph, we start with the ggplot() function, which creates a canvas to add plot elements to. Here we give it two arguments. The first is the dataframe; here we use the dataframe india. The second, built with the aes() function, specifies which columns of the dataframe we want to plot and on which axis. Here we map the column - or variable - water to the x axis. Specifying these two arguments alone will not draw our graph. To produce a histogram, we need to add a histogram geometry to our code using the function geom_histogram().
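To see the two steps separately, here is a small sketch that builds the plot in stages. It uses only the six water values printed by head(india) above as a stand-in for the full dataset, and the names india_head, canvas and p_hist, as well as the bin width of 5, are choices made purely for illustration.

```r
library(ggplot2)

# Stand-in data: the six water values shown by head(india) above
india_head <- data.frame(water = c(10, 0, 2, 31, 0, 0))

# Step 1: data and axis mapping only; printing this draws an empty canvas
canvas <- ggplot(data = india_head, aes(x = water))

# Step 2: adding a geometry is what draws the bars
# (binwidth = 5 is an arbitrary value chosen for this example)
p_hist <- canvas + geom_histogram(binwidth = 5)
p_hist
```

When geom_histogram() is called without bins or binwidth, ggplot2 falls back on 30 bins and prints a message suggesting you pick a value yourself, so setting binwidth explicitly is often worth doing.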

In answer to 1a) no, this variable does not appear to be normally distributed or symmetrical. Instead, it looks positively skewed (right skewed), meaning that most of our data sits in a big ‘clump’ on the left of our x axis, with a long tail of data trailing off to the right.
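A rough numerical check of skew, offered as a sketch rather than a formal test: in a right-skewed variable the long tail usually pulls the mean above the median. The example below uses only the six water values printed by head(india) above; with the full data you would pass india$water instead.

```r
# Stand-in data: the six water values shown by head(india) above
water <- c(10, 0, 2, 31, 0, 0)

mean_water <- mean(water)     # pulled upwards by the few large values
median_water <- median(water) # middle value, not affected by the tail

# TRUE suggests a longer tail on the right
mean_water > median_water
```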

In answer to 1b) we can see from our histogram that very few of the villages in our experiment had about 250 new (or repaired) drinking water facilities. In fact, it appears that there were only two villages with 250 or more new (or repaired) drinking water facilities in our sample. If you want to verify this number you could sort your data by the variable water, in descending order, and count the number of cases with values of 250 or more on this variable.
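The sort-and-count check described above could be sketched as follows. So that the code runs without the file, it rebuilds the six rows printed by head(india) and uses a cutoff of 10; with the full data you would use india and a cutoff of 250. The names india_head, sorted, cutoff and n_above are illustrative only.

```r
# Stand-in data: the six rows shown by head(india) above
india_head <- data.frame(
  village = c("GP1_village2", "GP1_village1", "GP2_village2",
              "GP2_village1", "GP3_village2", "GP3_village1"),
  female = c(1, 1, 1, 1, 0, 0),
  water = c(10, 0, 2, 31, 0, 0),
  irrigation = c(0, 5, 2, 4, 0, 0)
)

# Sort the rows by water, largest values first
sorted <- india_head[order(india_head$water, decreasing = TRUE), ]
head(sorted)

# Count the villages at or above the cutoff (use 250 with the full data)
cutoff <- 10
n_above <- sum(sorted$water >= cutoff)
n_above
```

Summing a logical vector counts its TRUE values, which is why sum() gives the number of villages that meet the condition.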

As the variables water and irrigation are both numeric non-binary variables, we will use a scatter plot to show the relationship of these two variables to one another. We do this using the ggplot2 package again.

As the ggplot2 package has already been loaded in this session, we simply run the following to produce our scatter plot:

ggplot(data = india, aes(x = water, y = irrigation)) + geom_point() # plotting a scatterplot of water and irrigation

This code looks similar to the ggplot code used to produce the histogram above. Again, we first specify our dataframe, using the code data = india. We then specify the columns of the dataframe we are looking to plot and onto which axis, using the aes() function. Here we map the column - or variable - water to the x axis and irrigation onto the y axis. Finally, to produce our scatter plot, we add the point geometry to our code using the function geom_point().

In answer to 2a) we see a positive linear relationship between our two variables. The slope of the line of best fit would be positive here. We can verify this by adding a line of best fit to our scatterplot like so:

ggplot(data = india, aes(x = water, y = irrigation)) + geom_point() + geom_smooth(method=lm, se=FALSE) # plotting a scatterplot of water and irrigation with a line of best fit

The code here is the same as for the previous scatter plot. It simply adds geom_smooth(method=lm, se=FALSE), which asks R to draw a line of best fit on top of the original graph using the linear regression method, and asks that the 95% confidence bounds of this line are not drawn.
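If you want the slope as a number rather than a picture, one option is to fit the same kind of line directly with lm(). This is a sketch on the six rows printed by head(india) above, so the value will differ from what the full data gives; the names fit and slope are illustrative.

```r
# Stand-in data: the six rows shown by head(india) above
water <- c(10, 0, 2, 31, 0, 0)
irrigation <- c(0, 5, 2, 4, 0, 0)

# Fit irrigation as a linear function of water
fit <- lm(irrigation ~ water)

# The coefficient on water is the slope of the line of best fit
slope <- unname(coef(fit)["water"])
slope > 0
```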

In answer to 2b) we can say that the relationship between these two variables does not look particularly strong. Many of our observations are quite far from the line of best fit, which is what we would expect to see when a linear relationship is weak.

We use the base R function cor() to compute the correlation of water and irrigation like so:

cor(india$water, india$irrigation) # computing the correlation of water and irrigation using base R

[1] 0.4073307

The function cor() takes two arguments separated by a comma, each identifying one of the variables to be used. The order does not matter: you would get the same answer here if you switched the order of the variables water and irrigation. Remember that the $ operator tells R to look for a column/variable within an object/dataframe.
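The claim about argument order can be checked directly. The sketch below uses the six rows printed by head(india) above as stand-in data, so the value differs from the 0.41 obtained with the full dataset. It also recomputes the correlation as the covariance divided by the product of the two standard deviations; the names r_xy, r_yx and r_manual are illustrative.

```r
# Stand-in data: the six rows shown by head(india) above
water <- c(10, 0, 2, 31, 0, 0)
irrigation <- c(0, 5, 2, 4, 0, 0)

r_xy <- cor(water, irrigation) # water first
r_yx <- cor(irrigation, water) # irrigation first

# Same quantity computed by hand: covariance over the product of the sds
r_manual <- cov(water, irrigation) / (sd(water) * sd(irrigation))

all.equal(r_xy, r_yx)
all.equal(r_xy, r_manual)
```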

In answer to 3a) no, we should not be surprised to see that the correlation between these variables is positive because, in the scatter plot above, we observe that the line that best summarises the relationship between them has a positive slope.
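The link between the sign of the correlation and the sign of the slope can be made explicit: for a line fitted by lm() with one predictor, the slope equals the correlation multiplied by sd(y) / sd(x), and since standard deviations are positive the two must share a sign. A sketch on the six rows printed by head(india) above, with illustrative object names:

```r
# Stand-in data: the six rows shown by head(india) above
water <- c(10, 0, 2, 31, 0, 0)
irrigation <- c(0, 5, 2, 4, 0, 0)

r <- cor(water, irrigation)
slope_lm <- unname(coef(lm(irrigation ~ water))["water"])

# Slope rebuilt from the correlation and the two standard deviations
slope_from_r <- r * sd(irrigation) / sd(water)

all.equal(slope_lm, slope_from_r)
sign(r) == sign(slope_lm)
```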

In answer to 3b) again, we should not be surprised that the absolute value of the correlation coefficient is about 0.4 because, in the scatter plot above, the relationship between water and irrigation does not appear to be strongly linear: many of the observations lie quite far from the line of best fit.

The missing word was representative. If we wanted to use the sample of villages in this dataset to infer the characteristics of all villages in India, we would have to make sure that the sample is representative of the population.
The best way to make the sample of villages representative of all villages in India would have been to select the villages through random sampling.
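As an illustration of what random sampling looks like in R, the sketch below draws a simple random sample from a made-up list of villages. Everything here is hypothetical: the population of 1,000 village IDs, the sample size of 100 and the seed are assumptions chosen for the example, not features of the original study.

```r
set.seed(269) # arbitrary seed so the same draw can be reproduced

# Hypothetical sampling frame: one row per village in the population
population <- data.frame(village_id = seq_len(1000))

# Draw 100 row numbers at random; sample() draws without replacement by default
n_sample <- 100
sampled_rows <- sample(nrow(population), size = n_sample)

# Keep only the sampled villages
village_sample <- population[sampled_rows, , drop = FALSE]
nrow(village_sample)
```

Because every village in the frame has the same chance of being drawn, the sample should on average resemble the population on both observed and unobserved characteristics.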