Based on Raghabendra Chattopadhyay and Esther Duflo. 2004. Women as Policy Makers: Evidence from a Randomized Policy Experiment in India, Econometrica, 72(5): 1409–43.
All materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).
Let’s continue working with the data from the experiment in India. As a reminder, Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is the village.
Table 1: Variables in “india.csv”
Variable | Description |
---|---|
village | village identifier (“Gram Panchayat number_village number”) |
female | whether village was assigned a female politician: 1=yes, 0=no |
water | number of new (or repaired) drinking water facilities in the village since random assignment |
irrigation | number of new (or repaired) irrigation facilities in the village since random assignment |
In this problem set, we practice (1) how to create and make sense of visualisations and (2) how to compute and interpret the correlation between two variables, among other things.
As always, we will start by loading and looking at the data (don’t forget to set your working directory first!):
       village female water irrigation
1 GP1_village2      1    10          0
2 GP1_village1      1     0          5
3 GP2_village2      1     2          2
4 GP2_village1      1    31          4
5 GP3_village2      0     0          0
6 GP3_village1      0     0          0
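The output above is what head() prints for the first six observations. As a rough sketch of the loading step, the code below rebuilds only those six rows in a temporary file and reads them back in. The name india_head, the temporary file, and the assumption that the file is comma-separated with a header row are ours, for illustration only; with the real file in your working directory, read.csv("india.csv") would take the place of the stand-in lines.

```r
# Stand-in for "india.csv": only the six rows printed above, written to a
# temporary file so that the example runs anywhere (an assumption for
# illustration, not the full dataset).
csv_lines <- c(
  "village,female,water,irrigation",
  "GP1_village2,1,10,0",
  "GP1_village1,1,0,5",
  "GP2_village2,1,2,2",
  "GP2_village1,1,31,4",
  "GP3_village2,0,0,0",
  "GP3_village1,0,0,0"
)
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

india_head <- read.csv(csv_path) # reading the file into a dataframe
head(india_head)                 # showing the first six observations
dim(india_head)                  # number of rows and number of columns
```

Looking at the first few rows and the dimensions is a quick way to confirm that the variables were read in with the names and types we expect.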
install.packages("ggplot2") # installing the ggplot2 package, if we have not already
library(ggplot2) # loading the ggplot2 package for this session
Once the ggplot2 package is loaded, we then run the following to produce our histogram of the variable water:
ggplot(data = india, aes(x = water)) + geom_histogram() #plotting a histogram of variable water
When we create any type of graph, we start with the ggplot() function, which creates a canvas to add plot elements to. Here we pass it two arguments. The first is the dataframe; here we use the dataframe india. The second specifies which columns of the dataframe we want to plot and on which axis, using the aes() function. Here we map the column - or variable - water to the x axis. Specifying these two arguments alone will not draw our graph. To produce a histogram, we then need to add a histogram geometry to our code using the function geom_histogram().
In answer to 1a) no, this variable does not appear to be normally distributed or symmetrical. Instead, it looks positively skewed (right skewed), meaning that most of our data sits in a big ‘clump’ on the left of our x axis, with a long tail of data trailing off to the right.
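One informal way to back up what we see in the histogram is to compare the mean and the median: in a right-skewed variable, the long tail tends to pull the mean above the median. The sketch below uses only the six water values printed at the start of this problem set (the name water_head is ours), so it illustrates the idea rather than the full dataset, and it is a rough indicator rather than a formal test of skewness.

```r
# Six water values from the first rows of the data shown earlier
# (a stand-in; with the full data we would use india$water instead).
water_head <- c(10, 0, 2, 31, 0, 0)

mean_water <- mean(water_head)     # pulled upwards by the few large values
median_water <- median(water_head) # sits in the 'clump' of small values

mean_water > median_water # TRUE suggests a longer tail on the right
```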
In answer to 1b) we can see from our histogram that very few of the villages in our experiment had about 250 new (or repaired) drinking water facilities. In fact, it appears that there were only two villages with 250 or more new (or repaired) drinking water facilities in our sample. If you want to verify this number you could sort your data by the variable water, in descending order, and count the number of cases with values of 250 or more on this variable.
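The sort-and-count check described above can be sketched as follows. The vector water_head holds only the six water values printed at the start of this problem set and is our stand-in for illustration; none of those six villages reaches 250, so the count here is zero. With the full dataframe loaded, the same two lines would be applied to india$water.

```r
# Stand-in data: six water values from the first rows shown earlier.
water_head <- c(10, 0, 2, 31, 0, 0)
threshold <- 250

sorted_water <- sort(water_head, decreasing = TRUE) # largest values first
n_at_or_above <- sum(water_head >= threshold)       # counting the TRUEs

sorted_water
n_at_or_above
```

The comparison water_head >= threshold returns TRUE or FALSE for each village, and sum() counts the TRUE values, so we do not need to count rows by eye.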
As the ggplot2 package has already been loaded in this session, we simply run the following to produce our scatter plot:
ggplot(data = india, aes(x = water, y = irrigation)) + geom_point() #plotting a scatterplot of water and irrigation
This code looks similar to the ggplot code used to produce the histogram above. Again, we first specify our dataframe, using the code data = india. We then specify the columns of the dataframe we are looking to plot and onto which axis, using the aes() function. Here we map the column - or variable - water to the x axis and irrigation onto the y axis. Finally, to produce our scatter plot, we add the point geometry to our code using the function geom_point().
In answer to 2a) we see a positive linear relationship between our two variables. The slope of the line of best fit would be positive here. We can verify this by adding a line of best fit to our scatterplot like so:
ggplot(data = india, aes(x = water, y = irrigation)) + geom_point() + geom_smooth(method=lm, se=FALSE) #plotting a scatterplot of water and irrigation with a line of best fit
The code here is the same as for the previous scatter plot; it just adds a geom_smooth() layer, whose method argument asks R to draw a line of best fit on top of the original graph using linear regression, and whose se=FALSE argument asks that the 95% confidence band around this line not be drawn.
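If we want to check the sign of the slope without reading it off the graph, one option is to fit the same kind of linear regression ourselves with lm() and look at the coefficient. This is a minimal sketch using only the six rows printed at the start of this problem set (water_head and irrigation_head are our names), so the slope it gives is not the slope for the full dataset.

```r
# Stand-in data: the six rows shown earlier, not the full india dataframe.
water_head <- c(10, 0, 2, 31, 0, 0)
irrigation_head <- c(0, 5, 2, 4, 0, 0)

# Regressing irrigation (y axis) on water (x axis), as in the scatter plot
fit <- lm(irrigation_head ~ water_head)
slope <- unname(coef(fit)[2]) # second coefficient is the slope

slope > 0 # TRUE means the line of best fit slopes upwards
```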
In answer to 2b) we can say that the relationship between these two variables does not look particularly strong. Many of our observations are pretty far away from the line of best fit, which indicates that the linear relationship is weak.
cor(india$water, india$irrigation) # computing the correlation of water and irrigation using base R
[1] 0.4073307
The function cor() takes two arguments separated by a comma: the code identifying each of the two variables. The order does not matter, so you would get the same answer here if you switched the order of the variables water and irrigation. Remember that the $ operator tells R to look for a column/variable within an object/dataframe.
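To see why the order of the arguments does not matter, we can compute the correlation both ways and also by hand, as the covariance divided by the product of the two standard deviations. The sketch below uses only the six rows printed at the start of this problem set (water_head and irrigation_head are our names), so its value will differ from the 0.4073307 obtained with the full data.

```r
# Stand-in data: the six rows shown earlier, not the full india dataframe.
water_head <- c(10, 0, 2, 31, 0, 0)
irrigation_head <- c(0, 5, 2, 4, 0, 0)

r_xy <- cor(water_head, irrigation_head) # water first
r_yx <- cor(irrigation_head, water_head) # irrigation first

# Correlation by hand: covariance over the product of standard deviations
r_manual <- cov(water_head, irrigation_head) /
  (sd(water_head) * sd(irrigation_head))

all.equal(r_xy, r_yx)     # same answer in either order
all.equal(r_xy, r_manual) # matches the hand computation
```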
In answer to 3a) no, we should not be surprised to see that the correlation between these variables is positive because, in the scatter plot above, we observe that the line that best summarises the relationship between them has a positive slope.
In answer to 3b) again, we should not be surprised that the absolute value of the correlation coefficient is about 0.4 because, in the scatter plot above, we see that the relationship between water and irrigation does not appear to be strongly linear. Many of the observations were pretty far away from the line of best fit.
The missing word was representative. If we wanted to use the sample of villages in this dataset to infer the characteristics of all villages in India, we would have to make sure that the sample is representative of the population.
The best way to make the sample of villages representative of all villages in India would have been to select the villages through random sampling.
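As a rough illustration of what random sampling means in practice, the sketch below draws a simple random sample from a list of village identifiers. Everything in it is hypothetical: the list village_frame, its size of 1000, the sample size of 50, and the seed are made up for the example and have nothing to do with how the villages in this experiment were actually selected.

```r
set.seed(123) # fixing the seed so that the same sample is drawn each time

# Hypothetical list of every village we could sample from
village_frame <- paste0("village_", 1:1000)

# Drawing 50 villages; sample() draws without replacement by default,
# so each village has the same chance of being selected and none repeats
sampled_villages <- sample(village_frame, size = 50)

length(sampled_villages)
head(sampled_villages)
```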