To begin your work read the assignment description . It contains information that is relevant to your learning, beyond the mere request to accomplish a task. To solve this assignment you need to load the this data into your workspace. After you give it a try to solve the assignment and conclude that you are in need of help, you can go ahead and download a Jupyter Notebook with the solution to this lab . If you are just perusing and do not intend to run the code, there is a static version of the notebook below.

1_A_Data,_Models_and_Decisions

Let us pretend that we are tasked with designing a system to make admission decisions at the University of Pennsylvania (Penn). In order to design this system we need to acquire data, choose a model, and train it to make admission decisions.

To do so we leverage data on the performance of past students to construct an AI model that predicts the Penn GPA of students based on their High School GPA and their SAT scores. We then use this AI to predict the future performance of applicants based on their High School GPA and SAT scores.

Environment setup¶

Before we beign with the lab proper, we import some necessary Python packages. Python is a very pithy programming language. It can do very little on its own. Most of the functionalities that we need require that we load specific packages. For this particular lab we load Numpy, Pytorch, and matplotlib. These packages are used to load data, process data, and make plots, respectively.

In [37]:

# To import a library we use the "import" command. When we import a library
# we can give it a nickname. This nickname is useful because when we use a
# function provided by a library we need to specify the library that we are using.
# E.g., here we import the "numpy" library and we give it the "np" nickname.
import numpy as np
import matplotlib.pyplot as plt
import torch

# We can also give nicknames to sublibraries. E.g., to have access to neural network
# functions we re-import the "torch.nn" library and give it the nickname "nn"
import torch.nn as nn

# This is a technicality. It has to be specified.
torch.set_default_dtype(torch.float64)

# Parameters for plots. It controls their appearance. That's all.
plt.style.use('default')
plt.rcParams['font.size'] = '14'

There is nothing to understand in the code above. This is just knowledge that you need to have. Like what is the command in a text editor to write in boldface.

Data¶

This file contains data that we have available to make admission decisions. For a collection of former students we have access to their high school (HS) grade point average (GPA), their Scholastic Assessment Test (SAT) scores, their gender and their Penn GPA. There are a total of 600 students in this file.

Task 1. Load the GPA and SAT scrore data and plot: (i) The Penn GPA as a function of the High School GPA. (ii) The Penn GPA as a function of the SAT score.

Loading Data¶

In [38]:

# The data is given to us a CSV file. This is a data file in which
# columns are sparated by commas and in which each row corresponds to a
# different data entry. We need to load this file and convert it into a format
# that we can manipulate in Python. The following command performs this
# conversion. It creates a tensor named "data."
data = torch.from_numpy(
           np.genfromtxt (
               'Penn-GPA-and-HS-Data.csv',
               delimiter = ",",
               skip_header=1,
               dtype = float ) )

# The tensor "data" is a two dimensioanl table (a matrix). The first dimension
# (dimension 0) sweeps through students. The second dimension (dimension 1)
# sweeps through different pieces of data. We visualize this by printing the
# shape of the "data" tensor
print(f"\nNumber of students: {data.shape[0]}")
print(f"Number of variables: {data.shape[1]}\n")

Number of students: 600
Number of variables: 5

In [39]:

# To facilitate further processing we load high school GPA, SAT scores and
# Penn GPA in different vectors. Each of these is a vector containing data
# for the 600 students that are given to us.
high_school_gpa = data[:,1] # Column 1 of the data tensor.
sat_scores      = data[:,2] # Column 2 of the data tensor.
penn_gpa        = data[:,4] # Column 4 of the data tensor.

# We print data for some selected students
print(f"\nExample students\n")
print(f"HS GPA: {high_school_gpa[0]} \t SAT: {sat_scores[0]}\t Penn GPA: {penn_gpa[0]}")
print(f"HS GPA: {high_school_gpa[1]} \t SAT: {sat_scores[1]}\t Penn GPA: {penn_gpa[1]}")
print(f"HS GPA: {high_school_gpa[2]} \t SAT: {sat_scores[2]}\t Penn GPA: {penn_gpa[2]}")
print(f"HS GPA: {high_school_gpa[3]} \t SAT: {sat_scores[3]}\t Penn GPA: {penn_gpa[3]}")
print(f"HS GPA: {high_school_gpa[4]} \t SAT: {sat_scores[4]}\t Penn GPA: {penn_gpa[4]}")

Example students

HS GPA: 3.95 	 SAT: 1570.0	 Penn GPA: 3.9
HS GPA: 4.0 	 SAT: 1580.0	 Penn GPA: 3.97
HS GPA: 3.69 	 SAT: 1560.0	 Penn GPA: 3.57
HS GPA: 3.86 	 SAT: 1550.0	 Penn GPA: 3.7
HS GPA: 3.64 	 SAT: 1400.0	 Penn GPA: 3.7

You can corroborate that the numbers above correspond the first 5 rows of the GPA and SAT scrore data.

Penn GPA as a Function of High School GPA¶

Continuing with Task 1 we now proceed to plot Penn GPAs as a function of High School GPAs. This is done by calling a command to plot the data.

In [40]:

# We plot the data. The command plot specifies the x axis, the y axis,
# and the marker used to plot datapoints.
plt.plot(high_school_gpa, penn_gpa, ".")

# We label the axes of the plot. Never, ever, leave a plot without labels.
plt.xlabel("High-School GPA")
plt.ylabel("Penn GPA")

# Adjust the axes. We specify limits for the x and y coordinates. This is
# just so the plot looks pretier. You can skip these commands.
plt.xlim([3.4, 4.1])
plt.ylim([3.2, 4.1])

# Save the plot. It may be useful for your lab report
plt.savefig("HSGPAvsPennGPA.pdf")

No description has been provided for this image

¶

In this figure the horizontal axis is the high school GPA and the vertical axis is the Penn GPA. This plot shows that high school GPA is predictive of Penn GPA. Although there is significant variation we can see that higher high school GPA corresponds with higher Penn GPA. This indicates that high school GPAs are useful information for admission decisions as they can predict with some accuracy the GPA that an admitted student may attain at Penn.

Penn GPA as a Function of SAT Scores¶

To finalize Task 1 we plot Penn GPAs as a function of SAT scores. This is more or less the same code that wrote for plotting Penn GPAs with respect to High School GPAs. It may be good exercise to do this part without looking at this solution but looking at the previous one instead.

In [41]:

# We plot the data. The command plot specifies the x axis, the y axis,
# and the marker used to plot datapoints. Relative to the previous plot, we
# just change the "high_school_gpa" vector by the "sat_scores" vector.
plt.plot(sat_scores, penn_gpa, ".")

# We label the axes of the plot. Never, ever, leave a plot without labels.
plt.xlabel("SAT score")
plt.ylabel("Penn GPA")

# Adjust the axes. We specify limits for the x and y coordinates. This is
# just so the plot looks pretier. You can skip these commands.
plt.xlim([1380, 1620])
plt.ylim([3.2, 4.1])

# Save the plot. It may be useful for your lab report
plt.savefig("SATvsPennGPA.pdf")

¶

In this figure the horizontal axis is the SAT score and the vertical axis is the Penn GPA. The predictive power here is weaker than the predictive power of High School GPA. There ir, nonetheless, useful information here that we can leverage to design an AI that predicts Penn GPA of admission candidates.

System¶

To make admission decisions we interpret Penn as a system that takes as inputs high school GPA, SAT scores and gender information of a student and produces as an output the Penn GPA of the corresponding graduate. If we define the input data as a vector $x = [\text{HS GPA}; \text{SAT}; \text{Gender}]$ and we denote the output as $y = \text{Penn GPA}$ we can represent this system as the function,

\begin{equation}\tag{1} \text{Penn GPA} ~=~ y ~=~ P(x) ~=~ P \left[ \begin{array}{l} \text{HS GPA} \\ \text{SAT} \\ \text{Gender} \end{array} \right] \ . \end{equation}

Notice that this is a poor representation of Penn. Incoming students are much more than their gender, High School GPA, and SAT scores. Penn graduates are much more than their Penn GPAs and the institution itself does much more to a high school graduate than transforming their High School GPA and SAT scores into a Penn GPA. This is just one aspect of the whole system on which we are choosing to focus. The distinction between what a system is and what an engineer chooses to say that a system is warrants some discussion that we undertake in Section 4 of the lab assgignment.

Model and Decisions¶

To make admission decisions we design an AI model that predicts the Penn GPA of prospective students.

To make matters simpler let us begin by ignoring SAT scores and gender and attempt predictions based on high school GPAs. This means that the system above is replaced by the system

\begin{equation}\tag{2} \text{Penn GPA} ~=~ y ~=~ P(x) ~=~ P ( \, \text{HS GPA} \, ) . \end{equation}

The function $P(x)$ is the true effect of Penn on scholastic accomplishment. This is information that becomes available after the fact. When a student graduates Penn, we have access to their high school GPA $x$ and their Penn GPA $y$.

Penn GPA predictions are to be made prior the fact. Before a student attends Penn we want to estimate their Penn GPA based on their high school GPA $x$. We choose to postulate a linear relationship and make predictions of the form

\begin{equation}\tag{3} \hat{y} = \alpha x. \end{equation}

In this equation, $\hat{y}$ is a prediction of the true Penn GPA $y$ that will be available after the fact. The coefficient $\alpha$ is to be determined with the goal of making predictions $\hat{y}$ close to to actual Penn GPA $y$. We can then use Penn GPA predictions to make admission decisions.

Least Squares Estimation¶

To determine a proper value for the coefficient $\alpha$ in (3) we utilize the data we have available on the scholastic performance of past students. Use $N$ to denote the total number of available data points. Introduce a subindex $i$ to differentiate past students so that the pair $(x_i, y_i)$ denotes the high school GPA and Penn GPA os student $i$. For these students we can make GPA predictions $\hat{y}_i = \alpha x_i$. For a given coefficient $\alpha$ we define the mean squared error (MSE),

\begin{equation}\tag{4} \text{MSE}(\alpha) ~=~ \frac{1}{N} \sum_{i=1}^{N} \Big( y_i – \hat{y}_i \Big)^2 ~=~ \frac{1}{N} \sum_{i=1}^{N} \Big( y_i – \alpha x_i \Big)^2 . \end{equation}

The mean squared error $\text{MSE}(\alpha)$ measures the predictive power of coefficient $\alpha$. The quantity $( y_i – \hat{y}_i )^2$ is always nonnegative and indicates good the predicted GPA $\hat{y}_i$ is to the true GPA $y_i$. The MSE averages this metric over all students. It follows that a natural choice for $\alpha$ is the value that makes the MSE smallest. We therefore define the optimal coefficient

\begin{equation}\tag{5} \alpha^* ~=~ \text{argmin}_\alpha \frac{1}{2} \text{MSE}(\alpha) ~=~ \text{argmin}_\alpha \frac{1}{2N} \sum_{i=1}^{N} \Big( y_i – \alpha x_i \Big)^2 , \end{equation}

and proceed to make Penn GPA predictions as $\hat{y} = \alpha^* x$. This GPA predictor is called the linear minimum mean squared error (MMSE) prediction. This is because the predictor is the linear function that minimizes the MSE.

Task 2 Prove that the MMSE estimator coefficient $\alpha^*$ defined in (5) is given by the expression

\begin{equation}\tag{6} \alpha^* = \sum_{i=1}^{N} x_i y_i ~\bigg/~ \sum_{i=1}^{N} x_i^2 . \end{equation}

Compute $\alpha^*$ for the data loaded in Task 1. Plot the Penn GPA with respect to HS GPA and superimpose the prediction line $\hat{y} = \alpha^* x$.

In [42]:

# In Equation (6) the variable x corresponds to High School GPA and the
# variable y corresponds to Penn GPA. We create vectors x and y with the
# corresponding information to make the code more similar to the equation
x = high_school_gpa
y = penn_gpa

# We use a while loop to compute numerator and denominator of Equation (6).
# This loop sweeps through the vectors x and y and updates the numerator and
# denominator summands. To understand a while loop it is easier to read it from
# right to left. Code is indented for a reason.

numerator   = 0 # Initialize numerator to zero
denominator = 0 # Initialize denominator to zero
i = 0 # Initialize the index of x and y to zero

while i < len(x): # Repeat the following instructions while i is smaller than the lenght of x
    numerator   += x[i]*y[i] # Add the product of x[i] with y[i] to the numerator
    denominator += x[i]**2   # Add the square of x[i] to the denominator
    i += 1 # Increase the index of x and y
# End of the while loop

# The coefficient alpha_star is the ratio of the numerator to the denominator.
alpha_star = numerator/denominator

# Print the value of alpha_star with 3 significant digits
print(f"alpha_star = {alpha_star:.3f}")
print(f"Predicted_Penn_GPA = {alpha_star:.3f} * High_School_GPA ")

alpha_star = 0.975
Predicted_Penn_GPA = 0.975 * High_School_GPA

In [43]:

# The same computation of alpha_star can be done more efficiently using vector operations.
# To do so the first thing we need to do is redefine the vectors x and y as matrices with
# one column. This is done with the unsqueeze method.
x = high_school_gpa.unsqueeze(1)
y = penn_gpa.unsqueeze(1)

# The numerator is the product of x transpose with y. The matrix product of A and B is
# written as A @ BThe transpose of x is obtained with the method x.T.
numerator   = x.T @ y

# The denominator is the product of x transpose with x.
denominator = x.T @ x

# The coefficient alpha_star is the ratio of the numerator to the denominator.
alpha_star = numerator/denominator

# Print the value of alpha_star with 3 significant digits. Observe that we have to say
# alpha_star.item(). This is because alpha_star is an object with a lot of attributes and
# methods. We will learn what this means as we advance in the course.
print(f"alpha_star = {alpha_star.item():.3f}")
print(f"Predicted_Penn_GPA = {alpha_star.item():.3f} * High_School_GPA ")

alpha_star = 0.975
Predicted_Penn_GPA = 0.975 * High_School_GPA

In [44]:

# As per Equation (3) the predicted Penn GPAs are given by the product between
# alpha and x. Since we are using the optimal coefficient alpha_star the predicted
# Penn GPAs are the product of alpha_star with x
y_hat = alpha_star * x

# We plot estimates of Penn GPA with respect to High School GPA. Recall that the
# command plot specifies the x axis, the y axis, and the marker used to plot
# datapoints. Since we want a continuous line here, we do not specify a marker.
plt.plot(high_school_gpa, y_hat)

# We superimpose a plot of the data as we did in Task 1.
plt.plot(high_school_gpa, penn_gpa, ".")

# We label the axes of the plot because we never, ever, leave a plot without labels.
plt.xlabel("High-School GPA")
plt.ylabel("Penn GPA")

# Adjust the axes. We specify limits for the x and y coordinates. This is
# just so the plot looks pretier. You can skip these commands.
plt.xlim([3.4, 4.1])
plt.ylim([3.2, 4.1])

# Save the plot for your lab report
plt.savefig("linear_model_HSGPA.pdf")

In Task 2 we make predictions of the Penn GPA of students that have graduated Penn. Predicting Penn GPAs of past students is unnecessary given that we already know their true GPAs. Our motivation for solving this unnecessary problem is to determine the coefficient $\alpha^*$ that we can use to make predictions $\hat{y} = \alpha^* x$ of students that have not yet attended Penn — for which $x$ is available but $y$ is not. The effectiveness of this prediction depends on the extent to which the past is a good representation of the future.

It is germane to emphasize that in Task 2 we are using something we know — the GPA of former students — to answer a new question — the GPA of a prospective student. However primitive, this is a form of intelligence.

Root Mean Squared Error¶

We evaluate the merit of $\alpha^*$ with the root mean squared error (RMSE)

\begin{equation}\tag{7} \text{MSE}(\alpha) ~=~ \sqrt { \text{MSE}(\alpha) } ~=~ \Bigg [\, \frac{1}{N} \sum_{i=1}^{N} \Big( y_i – \alpha x_i \Big)^2 \, \Bigg]^{1/2}. \end{equation}

The reason we use the RMSE to evaluate the merit of $\alpha^*$ instead of the MSE is that the RMSE has the same units as our target variable. It is easier to interpret than the MSE. Since the difference between the two is just a square root function, the coefficient $\alpha^*$ that minimizes the MSE also minimizes the RMSE.

Task 3. Compute the RMSE of $\alpha^*$ and comment on the quality of the Penn GPA predictions.

In [63]:

# This cell computes equation (7).

# Here we compute the squared difference between each prediction and the target value.
# This is also a vector operation: this line computes (y[i] - y_hat[i])**2 for each
# entry in the tensors. Note that **2 here means "power of 2".
# We store this intermediate result in a variable called "squared_error".
squared_error = ( y - y_hat ) ** 2

# The torch function torch.mean now sums over all elements of the tensor, then
# divides by n. This results in the mean squared error
mean_squared_error = torch.mean(squared_error)

# Finally, for the root mean squared error, we use torch.sqrt.
r_mean_squared_error = torch.sqrt(mean_squared_error)

# Print the value of r_mean_squared_error with 3 significant digits
print(f"The root mean squared error is: {r_mean_squared_error:.3f}")

The root mean squared error is: 0.094

The RMSE is 0.094. We can think of this number as the accuracy of our Penn GPA predictions. This number seems to imply that our Penn GPA predictions are quite accurate because Penn GPAs can range from 0 to 4. However, the actual range of Penn GPAs observed in the dataset is between 3.3 and 4.0. In a variable whose range spans 0.7 units, a prediction error of 0.094 is not very accurate. This is apparent in the Figure we prepared in response to Task 2 where the line of predicted Penn GPAs is a rough estimate of observed Penn GPAs.

Linear Regression¶

We consider now a more complete model in which the Penn GPA is deemed to depend on the high school GPA and the SAT score. We therefore define the input vector $x = [x_1; x_2] = [\text{HS GPA}; \text{SAT}; \text{Gender}]$

\begin{equation}\tag{8} \text{Penn GPA} ~=~ y ~=~ P(x) ~=~ P \left[ \begin{array}{l} x_1 \\ x_2 \\ \end{array} \right] \ . ~=~ P \left[ \begin{array}{l} \text{HS GPA} \\ \text{SAT} \\ \end{array} \right] \ . \end{equation}

To make Penn GPA predictions we postulate an input-output relationship of the form

\begin{equation}\tag{9} \hat{y} ~=~ w_1 x_1 + w_2 x_2 ~=~ w^T x . \end{equation}

As is the case of (3), $\hat{y}$ in (9) is a prediction of the true Penn GPA $y$ that will be available after the fact. The coefficient $w=[w_1; w_2]$ is to be determined with the goal of making predictions $\hat{y}$ close to to actual Penn GPA $y$.

Task 4 Define and compute the MMSE estimator coefficients $w^*$, that would extend the MMSE definition in (5). Show that this coefficient is given by the expression

\begin{equation}\tag{10} w^* = \Bigg[ \sum_{i=1}^{N} x_i x_i^T \Bigg]^{-1} \sum_{i=1}^{N} x_i y_i . \end{equation}

In matrix form, this is equivalent to

\begin{equation} \Bigg[ x^T x \Bigg]^{-1} \cdot x^Ty \end{equation}

Compute $w^*$ for the data loaded in Task 1. Compute the RMSE of $w^*$ and comment on the quality of the Penn GPA predictions.

In [76]:

# This gets all rows from columns 1 and 2, that is, columns in range [1,3) in Python notation.
x = data[:,1:3]

#  Again, we take Penn GPA as our y, and use unsqueeze to get a (600,1) matrix.
y = penn_gpa.unsqueeze(1)

# We compute the w coefficients using equation (10) using matrix multiplications
# and transposes. torch.linalg.inv computes the matrix inverse.
w = torch.linalg.inv(x.T@x) @ x.T @ y

# Print our linear regression formula with the coefficients we just computed.
print(f"Optimal predictor: \nPredicted Penn GPA = {w[0].item():.4f} * High-School GPA + {w[1].item():.4f} * SAT score")

Optimal predictor: 
Predicted Penn GPA = 0.6332 * High-School GPA + 0.0009 * SAT score

In [78]:

# Compute predictions with the optimal w
y_pred = x@w

# Calculate the RMSE of this new set of predictions, again using the formula in Equation (7)
mean_squared_error = torch.mean((y-y_pred)**2)
root_mean_squared_error = torch.sqrt(mean_squared_error)

# Print the Root Mean squared error with 3 significant digits
print(f"The root mean squared error is: {root_mean_squared_error:.3f}")

The root mean squared error is: 0.088