POL269 - Political Data Research: Solutions Seminar 12

Do Women Promote Different Policies than Men?

Based on Raghabendra Chattopadhyay and Esther Duflo. 2004. Women as Policy Makers: Evidence from a Randomized Policy Experiment in India, Econometrica, 72(5): 1409–43.

All materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).

Let’s go back to the data from the experiment in India, which we used in the first few seminars on this module. As a reminder, Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is villages.

Table 1: Variables in “india.csv”

Variable	Description
village	village identifier (“Gram Panchayat number_village number”)
female	whether village was assigned a female politician: 1=yes, 0=no
water	number of new (or repaired) drinking water facilities in the village since random assignment
irrigation	number of new (or repaired) irrigation facilities in the village since random assignment

In this problem set, we’ll be recapping some of the key methods, analyses and skills we’ve learned throughout this course. First, we’ll practice how to estimate an average treatment effect using data from a randomised experiment and then we’ll consider how we determine whether the estimated average treatment effect is statistically significant at the 5% level, or not.

As always, we’ll start by loading and looking at the data (remembering to set our working directory first!):

india <- read.csv("india.csv") # reads and stores data as object called india
head(india) # looking at first few rows of dataset

# A tibble: 6 × 4
  village      female water irrigation
  <chr>         <dbl> <dbl>      <dbl>
1 GP1_village2      1    10          0
2 GP1_village1      1     0          5
3 GP2_village2      1     2          2
4 GP2_village1      1    31          4
5 GP3_village2      0     0          0
6 GP3_village1      0     0          0

If we wanted to estimate the average causal effect of having a female politician on the number of new (or repaired) drinking water facilities…

Our Y variable, or outcome variable, would be water. We are looking to explore how the number of new (or repaired) drinking water facilities in these villages has changed, on average, since the random assignment of female politicians took place.
Our X, or treatment variable, would be female. We want to measure whether outcomes are different in villages led by female politicians than in ones led by males.
We might consider irrigation to be a confounding variable, an additional variable which could be included as an X, or predictor, variable in a regression model. In a context where villages have limited finances, building new irrigation facilities or repairing existing ones may prevent the building of new water facilities or the repairing of existing ones, and male and female leaders may prefer to build or repair different kinds of facilities. So, we might expect the association between the gender of village leaders and the number of new or repaired water facilities in those villages to ‘look different’ after we control for the number of new or repaired irrigation facilities in those same villages.

What is the estimated average causal effect of having a female politician on the number of new (or repaired) drinking water facilities?

Fit a linear regression model to the data in such a way that the estimated slope coefficient is equivalent to the difference-in-means estimator you are interested in and store the fitted model in an object called fit.

We do this like so:

fit <- lm(water ~ female, data = india) #running regression and storing as an object called fit, using separate data argument
fit

## 
## Call:
## lm(formula = water ~ female, data = india)
## 
## Coefficients:
## (Intercept)       female  
##      14.738        9.252


fit <- lm(india$water ~ india$female) #running regression and storing as an object called fit, using integrated data argument
fit

## 
## Call:
## lm(formula = india$water ~ india$female)
## 
## Coefficients:
##  (Intercept)  india$female  
##       14.738         9.252

What is the estimated slope coefficient returned by this model?

The estimated slope coefficient is 9.25. We’ll move on to interpret this in the next part of this question.

Now, let’s answer the question: What is the estimated average treatment effect? Provide a full substantive answer, making sure to include the assumption you make when using the difference-in-means estimator, why the assumption is or is not reasonable in this case, and what we can gauge from the direction, size, and unit of measurement of this average treatment effect.

What’s the assumption? We assume that the villages that were assigned to have a female politician (the treatment group) are comparable to the villages that were assigned to NOT have a female politician (the control group). If this assumption were not true then the difference-in-means estimator would NOT produce a valid estimate of the average treatment effect.

Is the assumption reasonable? Yes. Because the female politicians were assigned at random (equivalently, because the data come from a randomized experiment), the assumption is reasonable. Remember: random treatment assignment makes the treatment and control groups identical to each other in all observed and unobserved pre-treatment characteristics, on average.
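To see why random assignment makes the two groups comparable on average, here is a small simulation sketch. Everything in it is made up for illustration: the object names (population, treated), the number of villages, and the values of the pre-treatment characteristic are assumptions, not part of india.csv.

```r
# Simulation sketch with made-up values (not taken from india.csv)
set.seed(269)                                  # arbitrary seed, for reproducibility
n <- 100000                                    # hypothetical number of villages
population <- rnorm(n, mean = 1000, sd = 200)  # hypothetical pre-treatment characteristic
treated <- sample(rep(c(0, 1), each = n / 2))  # random assignment: half treated, half control

mean_treated <- mean(population[treated == 1]) # average in the treatment group
mean_control <- mean(population[treated == 0]) # average in the control group
balance_gap <- mean_treated - mean_control     # should be tiny relative to the mean of 1000
balance_gap
```

Because assignment ignores the characteristic entirely, the two group averages end up very close to each other. The same logic applies to characteristics we never observe.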

What is the direction, size, and unit of measurement of the average causal effect? There is an increase of 9 facilities, on average. It is an increase because we are measuring change — the change in the outcome variable caused by the treatment — and our slope coefficient is positive. Because X is the treatment variable and Y is the outcome variable, and there are no other variables included in our regression model, this slope coefficient is equal to the difference-in-means estimator.

In this case, the difference-in-means estimator = average number of facilities in villages with female politician - average number of facilities in villages with male politician = 9.25 facilities (but we round to 9 because it is not possible to have a quarter of a facility).
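The equivalence between the slope coefficient and the difference-in-means estimator can be checked on a tiny example. The sketch below uses a made-up data frame called toy (eight hypothetical villages), not the real dataset, so the numbers differ from the ones above.

```r
# Toy data, made up for illustration (not the real india.csv)
toy <- data.frame(
  female = c(1, 1, 1, 1, 0, 0, 0, 0),
  water  = c(10, 0, 2, 31, 0, 0, 5, 3)
)

# Difference-in-means estimator: treated average minus control average
diff_means <- mean(toy$water[toy$female == 1]) - mean(toy$water[toy$female == 0])

# Slope coefficient from a regression of the outcome on the binary treatment
fit_toy <- lm(water ~ female, data = toy)
slope <- coef(fit_toy)[["female"]]

all.equal(diff_means, slope)  # TRUE: the two estimates coincide
```

With a binary X and no other predictors, the intercept is the control-group average and the slope is the gap between the two group averages, which is why the two approaches agree.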

Final Answer: Assuming that the villages that were randomly assigned to have a female politician were comparable to the villages that were randomly assigned to NOT have a female politician (a reasonable assumption since the female politicians were assigned at random), we estimate that having a female politician increases the number of new or repaired drinking water facilities by 9 facilities, on average.

Fit a linear regression model to the data which includes a confounding variable in addition to the key Y and X variables. Comment on whether, and how, the association of Y and X changes after controlling for the effects of this confounding variable. Store this new fitted model in an object called fit_controls.

We do this like so:

fit_controls <- lm(water ~ female + irrigation, data = india) #running regression with controls for confounding and storing as an object called fit_controls, using separate data argument
fit_controls

## 
## Call:
## lm(formula = water ~ female + irrigation, data = india)
## 
## Coefficients:
## (Intercept)       female   irrigation  
##       9.812        9.789        1.454


fit_controls <- lm(india$water ~ india$female + india$irrigation) #running regression with controls for confounding and storing as an object called fit_controls, using integrated data argument
fit_controls

## 
## Call:
## lm(formula = india$water ~ india$female + india$irrigation)
## 
## Coefficients:
##      (Intercept)      india$female  india$irrigation  
##            9.812             9.789             1.454

Remember, we add confounding variables into our regression model by including these on the right hand side of the regression equation in R, using a + sign after our key X variable, or treatment variable.

We can see that after accounting for (or controlling for) the variable irrigation, the association between water and female remains largely unchanged. The slope coefficient of female increases only slightly, from 9.25 in the model without controls to 9.79 in the model with controls.
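To put a number on "only slightly", we can compare the two slope coefficients directly. This is a minimal sketch that types in the rounded coefficients printed above; the object names (b_without, b_with) are ours.

```r
# Slope coefficients for female, copied from the printed output above
b_without <- 9.252  # model without controls (fit)
b_with    <- 9.789  # model with controls (fit_controls)

change <- b_with - b_without          # absolute change in the coefficient
pct_change <- 100 * change / b_without # change relative to the original estimate

round(change, 3)      # 0.537
round(pct_change, 1)  # 5.8
```

A change of roughly half a facility, or under 6% of the original estimate, does not alter the substantive conclusion.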

Is the effect of having a female politician on the number of new (or repaired) drinking water facilities statistically significant at the 5% level?

Let’s start by specifying both the null and alternative hypotheses, providing both the mathematical notations and their meaning.

The null and alternative hypotheses are:

\(H_0 {:} \,\, \beta{=}0\) (meaning that having a female politician has no average causal effect on the number of new or repaired drinking water facilities at the population level).

\(H_1 {:} \,\, \beta{\neq}0\) (meaning that having a female politician either increases or decreases the number of new or repaired drinking water facilities, on average, at the population level)

Note that the null and alternative hypotheses refer to \(\beta\), which is the true average causal effect at the population level, not to \(\widehat{\beta}\), which is the estimated average causal effect at the sample level.

What is the value of the observed test statistic in the with- (fit_controls) and without- (fit) controls regression models we have estimated?

We can access this information like so:

summary(fit) # returning coefficients for the without controls model, saved as object fit

## 
## Call:
## lm(formula = india$water ~ india$female)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.991 -14.738  -7.865   2.262 316.009 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    14.738      2.286   6.446 4.22e-10 ***
## india$female    9.252      3.948   2.344   0.0197 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.45 on 320 degrees of freedom
## Multiple R-squared:  0.01688,    Adjusted R-squared:  0.0138 
## F-statistic: 5.493 on 1 and 320 DF,  p-value: 0.0197


summary(fit_controls) # returning coefficients for the with controls model, saved as object fit_controls

## 
## Call:
## lm(formula = india$water ~ india$female + india$irrigation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -86.686 -11.266  -5.812   4.188 289.861 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.8118     2.1719   4.518 8.82e-06 ***
## india$female       9.7895     3.6011   2.719  0.00692 ** 
## india$irrigation   1.4542     0.1794   8.106 1.13e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.5 on 319 degrees of freedom
## Multiple R-squared:  0.1848, Adjusted R-squared:  0.1797 
## F-statistic: 36.16 on 2 and 319 DF,  p-value: 7.019e-15

In the without controls model (fit), the test statistic for female is 2.34.

In the with controls model (fit_controls), the test statistic for female is 2.72.
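As a reminder of where these numbers come from, the test statistic is the estimated coefficient divided by its standard error. The sketch below recomputes it from the rounded values in the summary() output, so the third decimal place may differ slightly from what R prints; the object names are ours.

```r
# Test statistic = estimate / standard error, using values printed by summary()
t_fit <- 9.252 / 3.948             # without controls
t_fit_controls <- 9.7895 / 3.6011  # with controls

round(t_fit, 2)           # 2.34
round(t_fit_controls, 2)  # 2.72
```

If you prefer to pull the values from the fitted model rather than typing them in, coef(summary(fit)) returns the coefficient table as a matrix, with a column named "t value".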

What is the associated p-value in both models?

In the without controls model (fit), the p-value for female is 1.97e-02, which equals 0.0197 (the decimal point in 1.97 shifted two places to the left). We can interpret this as indicating that, if the null hypothesis were true, the probability of observing a test statistic equal to or larger than 2.34 (in absolute value) is about 1.97%. This is a small probability, well below 5%, so we will reject the null hypothesis.

In the with controls model (fit_controls), the p-value for female is even smaller, at 6.92e-03, which equals 0.00692. We can interpret this as indicating that, if the null hypothesis were true, the probability of observing a test statistic equal to or larger than 2.72 (in absolute value) is less than 1%. Here, again, we will reject the null hypothesis.

This shows that accounting for the confounding variable irrigation makes little difference here. We draw the same substantive conclusions about the effect of female on water whether or not we control for this confounding variable.
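The p-values reported by summary() can be approximately reproduced by hand from the test statistics and the residual degrees of freedom (320 and 319 in the output above). The sketch below uses pt(), the cumulative distribution function of the t distribution in base R; because the inputs are rounded, expect small differences in the last digits.

```r
# Two-sided p-value: probability of a t statistic at least this far from 0,
# in either direction, if the null hypothesis were true
p_fit <- 2 * pt(-abs(2.344), df = 320)           # without controls
p_fit_controls <- 2 * pt(-abs(2.719), df = 319)  # with controls

p_fit           # close to the 0.0197 reported above
p_fit_controls  # close to the 0.00692 reported above
```

Multiplying by 2 is what makes the test two-sided, matching an alternative hypothesis that allows for either an increase or a decrease.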

Now, let’s answer the question: Is the effect statistically significant at the 5% level? Please provide your reasoning, and refer to the results of both models estimated in this session.

Yes, the effect is statistically significant at the 5% level. Because (a) the absolute value of the observed test statistic is higher than 1.96 in both models (|2.34| > 1.96 and |2.72| > 1.96), and/or (b) the p-value is lower than 0.05 in both models (0.0197 < 0.05 and 0.00692 < 0.05), we reject the null hypothesis and conclude that there is likely to be an average treatment effect different than zero at the population level.

In other words, we conclude that having a female politician is likely to have an average effect different than zero on the number of new (or repaired) drinking water facilities at the population level.

Note: you do not need to provide both reasons, (a) and (b). One of them suffices since both procedures should lead to the same conclusion.
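A quick way to confirm that the two procedures agree is to apply both decision rules to the values reported above. This is a sketch with our own object names (t_values, p_values), using the rounded numbers from the summary() output.

```r
# Test statistics and p-values for female, copied from the output above
t_values <- c(fit = 2.344, fit_controls = 2.719)
p_values <- c(fit = 0.0197, fit_controls = 0.00692)

reject_by_t <- abs(t_values) > 1.96  # rule (a): compare |t| with the critical value
reject_by_p <- p_values < 0.05       # rule (b): compare the p-value with 0.05

reject_by_t                          # TRUE for both models
identical(reject_by_t, reject_by_p)  # TRUE: both rules give the same decision
```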