All materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).
Today we are going to explore a new dataset. We will be analysing real, historical, student performance data from POL269. The goal is to model the relationship between midterm and overall scores achieved on the module so that we can later predict overall scores based on midterm scores. The dataset we will use is stored in the grades2.csv file, which can be downloaded from the POL269 website.
Remember to save this file as a .csv file (otherwise we will not be able to read it into RStudio using the read.csv() function) and to note where you save it. We will need to set our working directory to that folder before loading the data.
Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is the student.
Table 1: Variables in “grades2.csv”
Variable | Description |
---|---|
Assignment.1 | students’ scores in the midterm (from 0 to 100 points) |
Take.Home.Exam | students’ scores in the final exam (from 0 to 100 points) |
Course.total | students’ scores in the class overall (from 0 to 100 points) |
distinction | identifies students who earned a distinction on the module |
In this problem set, we practice fitting a linear regression model to make predictions.
As always, we start by loading and looking at the data (don’t forget to set your working directory first!):
  X.1 X Assignment.1 Take.Home.Exam Course.total distinction
1   1 1           75             86         80.5         yes
2   2 2           75             86         80.5         yes
3   3 3           74             86         80.0         yes
4   4 4           55             86         70.5         yes
5   5 5           80             84         82.0         yes
6   6 6           78             82         80.0         yes
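The loading step that produces output like the above can be sketched as follows. The object name `grades` and the folder path in the comment are assumptions made for illustration, not taken from the course materials. If grades2.csv is not found in the working directory, the sketch falls back on the six rows printed above so that it still runs.

```r
# Point R at the folder where you saved the file (hypothetical path):
# setwd("~/POL269")

if (file.exists("grades2.csv")) {
  # Read the full dataset from the working directory
  grades <- read.csv("grades2.csv")
} else {
  # Stand-in: only the six rows shown by head() in this handout
  grades <- data.frame(
    X.1 = 1:6,
    X = 1:6,
    Assignment.1 = c(75, 75, 74, 55, 80, 78),
    Take.Home.Exam = c(86, 86, 86, 86, 84, 82),
    Course.total = c(80.5, 80.5, 80.0, 70.5, 82.0, 80.0),
    distinction = rep("yes", 6)
  )
}

# Show the first six observations
head(grades)
```

Running `getwd()` beforehand is a quick way to check which folder R is currently looking in.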
In this dataset, what does each observation represent?
What should be our X variable? In other words, which variable are we going to use as the predictor variable?
What should be our Y variable? In other words, which variable are we going to use as the outcome variable?
Create a visualisation of the relationship between the two variables using the ggplot2 package. Remember to install this package if you have not already done so, and to load it before use.
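One possible way to draw this plot is a scatter plot with the midterm score on the x-axis and the overall score on the y-axis, which follows from the stated goal of predicting overall scores from midterm scores. This is a minimal sketch: the object names `grades` and `p` and the axis labels are assumptions, and the small data frame is a stand-in containing only the six rows shown earlier, to be replaced by the full data read from grades2.csv.

```r
# install.packages("ggplot2")  # run once if the package is not installed
library(ggplot2)

# Stand-in data: the six rows printed by head(); replace with
# grades <- read.csv("grades2.csv") to use the full dataset
grades <- data.frame(
  Assignment.1 = c(75, 75, 74, 55, 80, 78),
  Course.total = c(80.5, 80.5, 80.0, 70.5, 82.0, 80.0)
)

# Scatter plot: predictor on the x-axis, outcome on the y-axis
p <- ggplot(grades, aes(x = Assignment.1, y = Course.total)) +
  geom_point() +
  labs(x = "Midterm score (Assignment.1)",
       y = "Overall score (Course.total)")

print(p)
```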
Use the base R function lm() to fit a linear model to the data.
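As a sketch of this step, `lm()` takes a formula of the form `outcome ~ predictor` and a `data` argument. The object names `grades` and `fit` are assumptions, and the data frame below holds only the six rows shown earlier, so the estimates it produces will differ from those obtained with the full dataset.

```r
# Stand-in data: the six rows printed by head(); replace with
# grades <- read.csv("grades2.csv") to use the full dataset
grades <- data.frame(
  Assignment.1 = c(75, 75, 74, 55, 80, 78),
  Course.total = c(80.5, 80.5, 80.0, 70.5, 82.0, 80.0)
)

# Fit the linear model: outcome on the left of ~, predictor on the right
fit <- lm(Course.total ~ Assignment.1, data = grades)

# Printing the model shows the estimated intercept and slope
fit
```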
What is the fitted line for this model? In other words, provide the formula \(\widehat{Y} = \widehat{\alpha} + \widehat{\beta} X\), replacing each term with the corresponding name or value from our regression output: \(Y\) becomes the name of our outcome variable, \(\widehat{\alpha}\) the estimated value of the intercept coefficient, \(\widehat{\beta}\) the estimated value of the slope coefficient, and \(X\) the name of the predictor.
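The two estimates can also be pulled out of the fitted model with `coef()` and pasted into the formula. The sketch below assumes the object names `grades`, `fit`, `alpha_hat` and `beta_hat`, and again uses only the six rows shown earlier as stand-in data, so the numbers it prints are not the answer for the full dataset.

```r
# Stand-in data: the six rows printed by head(); replace with
# grades <- read.csv("grades2.csv") to use the full dataset
grades <- data.frame(
  Assignment.1 = c(75, 75, 74, 55, 80, 78),
  Course.total = c(80.5, 80.5, 80.0, 70.5, 82.0, 80.0)
)

fit <- lm(Course.total ~ Assignment.1, data = grades)

# Extract the estimated intercept and slope by name
alpha_hat <- coef(fit)[["(Intercept)"]]
beta_hat <- coef(fit)[["Assignment.1"]]

# Write out the fitted line with the estimates substituted in
cat(sprintf("predicted Course.total = %.2f + %.2f x Assignment.1\n",
            alpha_hat, beta_hat))
```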
Finally, use this fitted line to make some predictions.
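Predictions can be computed by hand from the fitted line, or with `predict()` and a data frame of new predictor values. In this sketch the midterm scores of 60 and 70 are hypothetical values chosen for illustration, the object names are assumptions, and the model is fitted to only the six rows shown earlier.

```r
# Stand-in data: the six rows printed by head(); replace with
# grades <- read.csv("grades2.csv") to use the full dataset
grades <- data.frame(
  Assignment.1 = c(75, 75, 74, 55, 80, 78),
  Course.total = c(80.5, 80.5, 80.0, 70.5, 82.0, 80.0)
)

fit <- lm(Course.total ~ Assignment.1, data = grades)

# Hypothetical midterm scores; the column name must match the predictor
new_students <- data.frame(Assignment.1 = c(60, 70))

# Predicted overall scores from the fitted line
preds <- predict(fit, newdata = new_students)
preds

# The same predictions computed by hand: alpha_hat + beta_hat * X
by_hand <- coef(fit)[["(Intercept)"]] +
  coef(fit)[["Assignment.1"]] * new_students$Assignment.1
by_hand
```

Both approaches give the same numbers; `predict()` is simply more convenient once there are many new values.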