POL269 - Political Data Research: Seminar 5

Predicting Overall Scores in POL269 Using Midterm Scores

All materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).

Today we are going to explore a new dataset. We will be analysing real historical student performance data from POL269. The goal is to model the relationship between midterm scores and overall scores on the module, so that we can later predict overall scores from midterm scores. The dataset we will use is stored in the grades2.csv file, which can be downloaded from the POL269 website.

Remember to save this file as a .csv file (otherwise we will not be able to read it into RStudio using the read.csv() function) and to note where you save it. This will be important when we set the working directory that the data is loaded from.
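As a minimal sketch of that step, the lines below set the working directory and check that R can see the file from there. The folder path is hypothetical and only for illustration; replace it with the folder where you actually saved grades2.csv.

```r
# Hypothetical folder: replace with the folder where you saved grades2.csv
data_dir <- "~/Documents/POL269"

# Only change directory if the folder exists, so the sketch does not error
if (dir.exists(data_dir)) {
  setwd(data_dir)
}

getwd()                     # prints the current working directory
file.exists("grades2.csv")  # TRUE if R can see the file from here
```

If file.exists() returns FALSE, read.csv("grades2.csv") will fail, so it is worth checking the folder before loading the data.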

Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is the student.

Table 1: Variables in “grades2.csv”

Variable	Description
Assignment.1	students’ scores in the midterm (from 0 to 100 points)
Take.Home.Exam	students’ scores in the final exam (from 0 to 100 points)
Course.total	students’ scores in the class overall (from 0 to 100 points)
distinction	identifies students who earned a distinction on the module

In this problem set, we practise fitting a linear regression model to make predictions.

As always, we start by loading and looking at the data (don’t forget to set your working directory first!):

grades <- read.csv("grades2.csv") # reads and stores data
head(grades) # shows first observations

  X.1 X Assignment.1 Take.Home.Exam Course.total distinction
1   1 1           75             86         80.5         yes
2   2 2           75             86         80.5         yes
3   3 3           74             86         80.0         yes
4   4 4           55             86         70.5         yes
5   5 5           80             84         82.0         yes
6   6 6           78             82         80.0         yes
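The output above contains two columns, X.1 and X, that are not described in Table 1. They look like row indices, although that is an assumption rather than something the data documentation confirms. The sketch below shows one way to inspect a data frame and keep only the documented variables. So that it runs without the file, it uses only the six rows printed above, typed in by hand; the names grades_head, keep and grades_small are made up for this illustration.

```r
# The six rows printed by head() above, typed in by hand for illustration
grades_head <- data.frame(
  X.1            = 1:6,
  X              = 1:6,
  Assignment.1   = c(75, 75, 74, 55, 80, 78),
  Take.Home.Exam = c(86, 86, 86, 86, 84, 82),
  Course.total   = c(80.5, 80.5, 80.0, 70.5, 82.0, 80.0),
  distinction    = rep("yes", 6)
)

dim(grades_head)  # number of rows (observations) and columns (variables)
str(grades_head)  # type of each variable

# Keep only the variables described in Table 1
keep <- c("Assignment.1", "Take.Home.Exam", "Course.total", "distinction")
grades_small <- grades_head[, keep]
names(grades_small)
```

With the full dataset, the same calls can be applied to grades instead of grades_head.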

In this dataset, what does each observation represent?
What should be our X variable? In other words, which variable are we going to use as the predictor variable?
What should be our Y variable? In other words, which variable are we going to use as the outcome variable?
Create a visualisation of the relationship between the two variables using the ggplot2 package. Remember to load the package before using it, and to install it first if you have not already done so.
Use the base R function lm() to fit a linear model to the data.
What is the fitted line for this model? In other words, write out the formula \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\), replacing each term with the corresponding name or value from our regression model: \(Y\) becomes the name of our outcome variable, \(\widehat{\alpha}\) becomes the estimated value of the intercept coefficient, \(\widehat{\beta}\) becomes the estimated value of the slope coefficient, and \(X\) becomes the name of the predictor.
Finally, use this fitted line to make some predictions.
1. Firstly, please compute \(\widehat{Y}\) based on \(X\), supposing that someone earns 80 points in the midterm. This will give us a best guess of this student’s overall score in the class, given that they scored 80 on their midterm.
2. Secondly, please compute \(\triangle \widehat{Y}\) based on \(\triangle X\), supposing that a student studies for a few extra hours and earns 10 extra points in their midterm as a result. This will tell us how much a student’s predicted overall score changes when their midterm score increases by 10 points.
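The mechanics of the steps above can be sketched as follows. This is an illustration under stated assumptions, not the solution: it uses only the six rows printed by head() earlier, typed in by hand so that it runs without the file, and it treats Assignment.1 as the predictor and Course.total as the outcome. Estimates from the full grades2.csv will differ. The object names (grades_head, fit, alpha_hat, beta_hat and so on) are made up for this sketch. The second prediction relies on the fact that the intercept cancels when we take a difference, so \(\triangle \widehat{Y} = \widehat{\beta} \, \triangle X\).

```r
library(ggplot2)  # run install.packages("ggplot2") once if it is not installed

# Only the six rows shown by head(); estimates from the full data will differ
grades_head <- data.frame(
  Assignment.1 = c(75, 75, 74, 55, 80, 78),
  Course.total = c(80.5, 80.5, 80.0, 70.5, 82.0, 80.0)
)

# Scatter plot of predictor (x-axis) against outcome (y-axis),
# with the least-squares line added on top
p <- ggplot(grades_head, aes(x = Assignment.1, y = Course.total)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p

# Fit the linear model: outcome ~ predictor
fit <- lm(Course.total ~ Assignment.1, data = grades_head)
alpha_hat <- coef(fit)[["(Intercept)"]]   # estimated intercept
beta_hat  <- coef(fit)[["Assignment.1"]]  # estimated slope

# Prediction 1: predicted overall score when the midterm score is 80
y_hat_80 <- alpha_hat + beta_hat * 80
y_hat_80

# The same prediction using predict()
predict(fit, newdata = data.frame(Assignment.1 = 80))

# Prediction 2: change in predicted overall score for 10 extra midterm points
delta_y_hat <- beta_hat * 10
delta_y_hat
```

To work with the full dataset, replace grades_head with grades after loading the file with read.csv().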