Solutions Seminar 5

Elizabeth Simon
2024-02-21

Predicting Overall Scores in POL269 Using Midterm Scores

All materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).

Today we are going to explore a new dataset. We will be analysing real, historical, student performance data from POL269. The goal is to model the relationship between midterm and overall scores achieved on the module so that we can later predict overall scores based on midterm scores. The dataset we will use is stored in the grades2.csv file, which can be downloaded from the POL269 website.

Remember to save this file as a .csv file (otherwise we will not be able to read it into RStudio using the read.csv() function) and to pay attention to where you save it. This will be important when we set the working directory from which the data are loaded.

Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is students.

Table 1: Variables in “grades2.csv”

Variable         Description
Assignment.1     students’ scores in the midterm (from 0 to 100 points)
Take.Home.Exam   students’ scores in the final exam (from 0 to 100 points)
Course.total     students’ scores in the class overall (from 0 to 100 points)
distinction      identifies students who earned a distinction on the module

In this problem set, we practise fitting a linear regression model to make predictions.

As always, we start by loading and looking at the data (don’t forget to set your working directory first!):

grades <- read.csv("grades2.csv") # reads and stores data
head(grades) # shows first observations
  X.1 X Assignment.1 Take.Home.Exam Course.total distinction
1   1 1           75             86         80.5         yes
2   2 2           75             86         80.5         yes
3   3 3           74             86         80.0         yes
4   4 4           55             86         70.5         yes
5   5 5           80             84         82.0         yes
6   6 6           78             82         80.0         yes
  1. In this dataset, each observation (or row of data) represents a student. In other words, the unit of observation is students, as stated at the beginning of this problem set.

  2. Our x variable, or predictor variable, is Assignment.1. We are trying to predict students’ overall scores on our course based on their midterm scores.

  3. Our y variable, or outcome variable, is Course.total. We are trying to predict these overall scores on POL269 using the information contained in the predictor variable Assignment.1, which records students’ midterm scores on this same course.

  4. Before we run a linear regression model, it is always good practice to visualise the relationship between our x and y variables. This helps us get a sense of how we may expect our regression results to look.

As the variables Assignment.1 and Course.total are both numeric, non-binary variables, we will use a scatter plot to show how they relate to one another. We do this using the ggplot2 package, as we did in the seminar last week (remember to install this package if you have not already, and to load it into RStudio before running the code that draws your scatter plot).

We load the ggplot2 package and then run the following to produce our scatter plot:

library(ggplot2)
ggplot(data = grades, aes(x = Assignment.1, y = Course.total)) +
  geom_point() # scatter plot of midterm scores against overall scores

To produce our scatter plot, we must first specify our dataframe, using the code data = grades. We then specify which columns of the dataframe we want to plot, and on which axis, using the aes() function. Here we map the column (or variable) Assignment.1 to the x axis and Course.total to the y axis (note: we plot Assignment.1 on the x axis as it is our predictor variable, and Course.total on the y axis as it is our outcome variable). Finally, to draw the points, we add the point geometry to our code using the function geom_point().

We can clearly see from our scatter plot that there is a strong, positive linear relationship between the midterm grades students achieve on POL269 and their overall grades on this same course. This means that the higher a student’s midterm grade is, the higher their overall grade tends to be, on average.
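As a rough numerical companion to the scatter plot, we can compute the correlation between the two variables with the base R function cor(). The sketch below is only an illustration: the dataframe grades_head is a hypothetical name, and it contains just the six rows printed by head(grades) above, so the value it gives will not match the correlation in the full dataset.

```r
# Toy data: ONLY the six rows shown by head(grades), not the full dataset.
# grades_head is a name made up for this illustration.
grades_head <- data.frame(
  Assignment.1 = c(75, 75, 74, 55, 80, 78),
  Course.total = c(80.5, 80.5, 80.0, 70.5, 82.0, 80.0)
)

# Pearson correlation: values close to +1 indicate a strong,
# positive linear relationship between the two variables
r <- cor(grades_head$Assignment.1, grades_head$Course.total)
print(r)
```

With the full data loaded, the same call would be cor(grades$Assignment.1, grades$Course.total).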

  5. We use the base R function lm() to fit a linear regression model to our data, like so:
lm(grades$Course.total ~ grades$Assignment.1)

Call:
lm(formula = grades$Course.total ~ grades$Assignment.1)

Coefficients:
        (Intercept)  grades$Assignment.1  
            15.6116               0.7687  

The function lm() fits a linear model. It requires a formula of the type Y ~ X, where Y identifies the outcome variable and X identifies the predictor variable. To specify the dataframe where the variables are stored, we can use either the $ operator (as in the code above) or the optional argument data. If we wanted to use the latter, the code to fit the linear model would be:

lm(Course.total ~ Assignment.1, data = grades)

Call:
lm(formula = Course.total ~ Assignment.1, data = grades)

Coefficients:
 (Intercept)  Assignment.1  
     15.6116        0.7687  

You can see clearly from the outputs above that both methods of running a linear regression model will provide the same results.
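It can also be convenient to store the fitted model in an object, so that the coefficients can be pulled out with coef() rather than read off the printed output. The sketch below assumes a toy dataframe, grades_head (a hypothetical name), holding only the six rows printed by head(grades), so its coefficients will differ from the 15.6116 and 0.7687 reported for the full dataset. It also checks that the two ways of calling lm() agree.

```r
# Toy data: ONLY the six rows shown by head(grades), not the full dataset.
grades_head <- data.frame(
  Assignment.1 = c(75, 75, 74, 55, 80, 78),
  Course.total = c(80.5, 80.5, 80.0, 70.5, 82.0, 80.0)
)

# Same model, specified in the two ways described above
fit_dollar <- lm(grades_head$Course.total ~ grades_head$Assignment.1)
fit_data   <- lm(Course.total ~ Assignment.1, data = grades_head)

# coef() returns the estimated intercept and slope as a named vector
print(coef(fit_data))

# The coefficient names differ between the two calls, so we drop them
# before comparing the estimates themselves
same <- isTRUE(all.equal(unname(coef(fit_dollar)), unname(coef(fit_data))))
print(same)
```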

  6. The fitted regression line helps us make predictions about the likely value of our outcome variable (Course.total) at different values of our predictor variable (Assignment.1).

A fitted regression line, in general, takes the form: \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\)

Here, we need to replace these symbols with the names and values from our regression model, such that \(Y\) becomes the name of our outcome variable, \(\widehat{\alpha}\) becomes the estimated value of the intercept coefficient, \(\widehat{\beta}\) becomes the estimated value of the slope coefficient, and \(X\) becomes the name of our predictor variable.

This gives us the following fitted line equation: \(\widehat{\textrm{Course.total}} = 15.61 + 0.77 \, \textrm{Assignment.1}\).

The Y variable is Course.total, \(\widehat{\alpha}\)=15.61, \(\widehat{\beta}\)=0.77, and the X variable is Assignment.1.

This tells us that our best guess of a student’s Course.total grade on this module is to take a score of 15.61 (the predicted overall score for a student whose midterm score is 0) and add to this ‘base’ score 0.77 times the student’s midterm grade.
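One way to make the fitted line concrete is to write it as a small R function. This is only a sketch: the names alpha_hat, beta_hat and predict_total are made up for illustration, and the coefficients are the rounded values used in the fitted line equation above.

```r
# Rounded coefficients from the fitted line equation
alpha_hat <- 15.61 # estimated intercept
beta_hat  <- 0.77  # estimated slope

# The fitted line as a function of the midterm score
predict_total <- function(midterm) {
  alpha_hat + beta_hat * midterm
}

# A midterm score of 0 returns the intercept (the 'base' score)
print(predict_total(0))

# The function is vectorised, so several midterm scores can be passed at once
print(predict_total(c(40, 60, 80)))
```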

  7. We are now going to use our fitted line to make some predictions.

Firstly, as in 7a), we will compute \(\widehat{Y}\) based on \(X\), supposing that a student earns 80 points in the midterm. This gives us our best guess of this student’s overall score in the class, given a score of 80 on their midterm.

\[\begin{align*} \widehat{\textrm{Course.total}} &= \widehat{\alpha} + \widehat{\beta} \, \textrm{Assignment.1}\\ &= 15.61 + 0.77 \, \textrm{Assignment.1}\\ &= 15.61 + 0.77 \times 80 \color{gray} \,\,\,\textrm{(if Assignment.1=80)}\\ &= 15.61 + 61.6 = 77.21 \end{align*}\]

Substituting this value of 80 for Assignment.1 into our fitted line equation, we can see that a student who earns 80 points in their midterm is predicted to earn 77.21 points in the class overall, on average. Note: \(\widehat{Y}\) is in the same unit of measurement as \(\overline{Y}\); in this case, Y is non-binary and measured in points so \(\overline{Y}\) and \(\widehat{Y}\) are also measured in points.
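The hand calculation uses coefficients rounded to two decimal places. As a quick check, the sketch below repeats the arithmetic in R with both the rounded values and the unrounded values printed in the lm() output (15.6116 and 0.7687). The variable names are made up for illustration. The unrounded version comes out at about 77.11, so the two answers differ by roughly a tenth of a point purely because of rounding.

```r
midterm <- 80

# Prediction with the rounded coefficients used in the worked example
pred_rounded <- 15.61 + 0.77 * midterm

# Prediction with the coefficients as printed in the lm() output
pred_unrounded <- 15.6116 + 0.7687 * midterm

print(pred_rounded)
print(pred_unrounded)

# Size of the gap introduced by rounding the coefficients
print(pred_rounded - pred_unrounded)
```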

Secondly, as in 7b), we will compute \(\triangle \widehat{Y}\) based on \(\triangle X\), supposing that a student studies for a few extra hours and earns 10 extra points in their midterm as a result. In this case, what is our best guess of how much their predicted overall score will change as a result of these 10 extra points in the midterm?

\[\begin{align*} \triangle \widehat{\textrm{Course.total}} &= \widehat{\beta} \, \triangle \textrm{Assignment.1} \\ &= 0.77 \times \triangle \textrm{Assignment.1}\\ &= 0.77 \times 10 \color{gray} \,\,\,\textrm{(if $\triangle$ Assignment.1=10)}\\ &= 7.7 \end{align*}\]

By dropping the intercept from our fitted line equation (the intercept is the predicted Course.total score for a student whose Assignment.1 score is 0), and substituting in the value of 10 for the change in the predictor variable Assignment.1 (representing an extra 10 points scored), we can see that if a student increases their midterm score by 10 points, we predict that their overall score will increase by 7.7 points, on average.

The intercept is omitted here as we are not looking to predict a student’s overall score, but just the amount by which this overall score would increase if their midterm score was increased by a certain amount.

Note again that \(\triangle\widehat{Y}\) is in the same unit of measurement as \(\triangle\overline{Y}\); in this case, Y is non-binary and measured in points so \(\triangle\overline{Y}\) and \(\triangle\widehat{Y}\) are also measured in points.
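To see why the intercept drops out, we can compare the slope-only calculation with the difference between two full predictions that are 10 midterm points apart. The sketch below uses the rounded coefficients from the fitted line; the variable names and the choice of 70 and 80 as midterm scores are assumptions made purely for illustration.

```r
alpha_hat <- 15.61 # estimated intercept (rounded)
beta_hat  <- 0.77  # estimated slope (rounded)

# Change in the predicted overall score for a 10-point change in the midterm
delta_midterm <- 10
delta_total <- beta_hat * delta_midterm
print(delta_total)

# Same quantity as the difference between two full predictions:
# the intercept appears in both terms and cancels out
diff_check <- (alpha_hat + beta_hat * 80) - (alpha_hat + beta_hat * 70)
print(diff_check)
```

Any two midterm scores that are 10 points apart would give the same difference, because the fitted line has a constant slope.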