Download the assignment at the following link.
Download the data at this link. Note: Do not use the dataset from Lab 1B; use this one instead.
Download the data and the Jupyter notebook at this link.
Instructions on how to download and use Jupyter Notebooks can be found here. You can find a static version of the notebook below.
Lab 1: Grade Point Average Prediction
In this Lab we are tasked with designing a system to make admission decisions at the University of Pennsylvania (Penn). In order to design this system we need to acquire data and create a model to substantiate our decisions.
Lab 1.C: Neural Networks
0. Environment setup
# There are three different sections in this cell. We import and configure packages,
# we load and normalize the data, and we print data for a student to verify
# that the data loaded correctly.
# Import Packages
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.nn.functional import relu
import random
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from math import sqrt
# Configuration commands for packages
torch.set_default_dtype(torch.float64)
plt.style.use('default')
# Set random seed for reproducibility
np.random.seed(1)
torch.manual_seed(1)
random.seed(1)
# Load data from the CSV file, skipping the header, and convert to a PyTorch tensor
data = np.genfromtxt('1C-Penn-GPA-and-HS-Data.csv', delimiter=",", skip_header=1, dtype = float)
data = torch.from_numpy(data)
# Compute the mean and standard deviation of each variable
means = torch.mean(data[:,1:4], dim = 0)
stds = torch.std(data[:,1:4], dim = 0)
# Subtract the mean from each variable and divide by its standard deviation
data_norm = data.clone()
data_norm[:, 1:4] = (data[:, 1:4] - means) / stds
# Select the normalized HS GPA and SAT scores
x = data_norm[:,1:3]
# The targets are the normalized Penn GPA values
y = data_norm[:,3]
# Print data sizes and data for an example student to verify that data loaded properly
print(f"Number of students: {data.shape[0]}")
print(f"Number of variables: {data.shape[1]}")
print(f"\nExample student:")
print(f"\t ID: {data[0, 0]}")
print(f"\t HS GPA: {data[0, 1]}")
print(f"\t SAT Score: {data[0, 2]}")
print(f"\t Penn GPA: {data[0, 3]}")
print(f"\t Female: {data[0, 4]}")
print(f"\t Male: {data[0, 5]}")
Number of students: 600
Number of variables: 6

Example student:
	 ID: 991962.0
	 HS GPA: 3.81
	 SAT Score: 1560.0
	 Penn GPA: 3.7
	 Female: 0.0
	 Male: 1.0
1 Neural networks
Neural Networks are information processing architectures made up of a composition of layers, each of which is itself the composition of a linear transformation with a pointwise nonlinearity.
For a network with an arbitrary number of layers $L$ we define the input-output relationship through the recursion
\begin{equation}\tag{1} \mathbf{x}_0 = \mathbf{x}, \qquad \mathbf{x}_\ell = \sigma \Big(\, \mathbf{z}_\ell \,\Big) = \sigma \Big(\, \mathbf{A}_\ell\mathbf{x}_{\ell-1} \,\Big) ,\qquad \mathbf{x}_L = \mathbf{\Phi}(\mathbf{x}; \mathcal{A}). \end{equation}
In this recursion the output of Layer $\ell-1$ produces the output of Layer $\ell$ as $\mathbf{x}_\ell = \sigma ( \mathbf{A}_\ell \mathbf{x}_{\ell-1})$. This operation is a composition of the linear map $\mathbf{A}_\ell\mathbf{x}_{\ell-1}$ with a pointwise nonlinear function $\sigma$, that is, a function that acts separately on each individual component. To complete the recursion we redefine the input $\mathbf{x}$ as the output of Layer 0, $\mathbf{x}_0 = \mathbf{x}$. The output of the neural network is the output of Layer $L$, $\mathbf{x}_L = \mathbf{\Phi}(\mathbf{x}; \mathcal{A})$. In this notation $\mathcal{A}$ is the tensor $\mathcal{A} := [\mathbf{A}_1,\ldots,\mathbf{A}_L]$ that groups the $L$ matrices that are used at each of the $L$ layers.
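As an illustration of the recursion in (1), the following sketch evaluates $\mathbf{\Phi}(\mathbf{x};\mathcal{A})$ for an arbitrary list of matrices. This snippet is not part of the lab code; the function name, the dimensions in the example, and the use of torch.relu as the pointwise nonlinearity are our own illustrative choices.
import torch

def neural_network(x, A_list, sigma=torch.relu):
    # Implements the recursion in (1): x_0 = x and x_l = sigma(A_l x_{l-1})
    x_l = x
    for A_l in A_list:
        z_l = torch.matmul(A_l, x_l)  # Linear map z_l = A_l x_{l-1}
        x_l = sigma(z_l)              # Pointwise nonlinearity
    return x_l                        # x_L = Phi(x; A)

# Example with L = 3 layers and illustrative dimensions p0 = 2, p1 = 4, p2 = 4, p3 = 1
A_list = [torch.rand(4, 2), torch.rand(4, 4), torch.rand(1, 4)]
x_example = torch.rand(2)
print(neural_network(x_example, A_list))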
A neural network with three layers is depicted in Figure 1. The input to the neural network is a vector, or signal, $\mathbf{x}$, which we will also denote as $\mathbf{x}_0 = \mathbf{x}$. This signal is passed to the first layer of the neural network, where it is processed with a linear transformation defined by the matrix $\mathbf{A}_1$. This produces the intermediate signal $\mathbf{z}_1 = \mathbf{A}_1\mathbf{x}$, which is then passed through a pointwise nonlinearity $\sigma$ to produce the first layer output
\begin{equation}\tag{2} \mathbf{x}_1 = \sigma(\mathbf{z}_1) = \sigma(\mathbf{A}_1\mathbf{x}_0). \end{equation}
The output $\mathbf{x_1}$ of Layer 1 becomes an input to Layer 2. In this layer, the signal is processed by a linear transformation defined by the matrix $\mathbf{A_2}$ and a pointwise nonlinearity to produce the Layer 2 output
\begin{equation}\tag{3} \mathbf{x}_2 = \sigma(\mathbf{z}_2) = \sigma(\mathbf{A}_2\mathbf{x}_1). \end{equation}
This output becomes now an input to the third layer where an analogous composition of a linear map with a pointwise nonlinearity produces the output of the neural network:
\begin{equation}\tag{4} \mathbf{\Phi}(\mathbf{x};\mathcal{A}) = \mathbf{x}_3 = \sigma(\mathbf{z}_3) = \sigma(\mathbf{A}_3\mathbf{x}_2). \end{equation}
In (4) we use $\mathbf{\Phi}(\mathbf{x};\mathcal{A})$ to denote the neural network output. This output is a function of the input $\mathbf{x}$ and a function of the matrices $\mathbf{A}_1$, $\mathbf{A}_2$, and $\mathbf{A}_3$. These three matrices are grouped in the tensor $\mathcal{A} := [\mathbf{A}_1, \mathbf{A}_2, \mathbf{A}_3]$. This is the parameter that we must determine during training.
This training process entails the solution of an empirical risk minimization (ERM) problem. For clarity, recall that we are given data pairs $(\mathbf{x}_i, \mathbf{y}_i)$ in which $\mathbf{x}_i$ is the input to a system of interest and $\mathbf{y}_i$ is the respective observed output. In the ERM problem we evaluate the neural network outputs $\hat{\mathbf{y}}_i:=\mathbf{\Phi}(\mathbf{x}_i; \mathcal{A})$ and compare them with the corresponding system outputs $\mathbf{y}_i$ using a given loss function $\ell(\mathbf{y}, \hat{\mathbf{y}})$. The ERM objective is the average of these individual losses. Thus, training a neural network implies finding the tensor $\mathcal{A}^*$ that solves the ERM problem,
\begin{equation}\tag{5} \mathcal{A}^* ~=~ \operatorname{argmin}_\mathcal{A} \frac{1}{N} \sum_{i=1}^{N} \ell \Big( \mathbf{y}_i, \mathbf{\Phi}(\mathbf{x}_i; \mathcal{A}) \Big) ~ \end{equation}
Notice that in (5) the dimensions of the system outputs $\mathbf{y}_i$ and of the neural network outputs $\hat{\mathbf{y}}_i:=\mathbf{\Phi}(\mathbf{x}_i; \mathcal{A})$ must be the same.
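As a concrete example, with the squared loss $\ell(\mathbf{y},\hat{\mathbf{y}}) = (\mathbf{y}-\hat{\mathbf{y}})^2$ that we use later in this lab, the objective in (5) can be evaluated with a short function. This is only a sketch; here estimator stands for any object that implements the parameterization $\mathbf{\Phi}(\cdot\,;\mathcal{A})$.
def empirical_risk(estimator, x, y):
    # Average of the squared losses over the N data pairs, as in (5)
    y_hat = estimator(x).squeeze()
    return torch.mean((y - y_hat) ** 2)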
1.1 Pointwise nonlinearities
That the nonlinear function σ in (1) is pointwise means that it acts on each component of the input separately. This means that if we write the intermediate signal as $\mathbf{z_{\ell}} = [z_{\ell1}, \ldots, z_{\ell p}]$ , the nonlinearity σ produces as an output the vector: \begin{equation}\tag{6} σ(\mathbf{z_{\ell}}) = σ\left( \begin{array}{c} z_{\ell1} \\ z_{\ell2} \\ \vdots \\ z_{\ell p} \end{array} \right) = \left( \begin{array}{c} σ(z_{\ell1}) \\ σ(z_{\ell2}) \\ \vdots \\ σ(z_{\ell p}) \end{array} \right). \end{equation}
The most popular choice for a pointwise nonlinearity is a rectified linear unit (ReLU). A ReLU is a simple function that just zeros the entry when it is negative:
\begin{equation}\tag{7} σ(z) = \max(0, z). \end{equation}
In this expression we can think of the output $σ(z)$ as activated when $z \geq 0$ and not activated when $z < 0$. For this reason pointwise nonlinearities are also called activation functions, whether we use a ReLU or not.
A variation of a ReLU nonlinearity is a leaky ReLU. Given a constant $0 < \alpha < 1$, a leaky ReLU is defined as
\begin{equation}\tag{8} σ(z) = \max(\alpha z, z). \end{equation}
Given that $0 < \alpha < 1$ the output of the leaky ReLU is $σ(z) = z$ when $z \geq 0$ and $σ(z) = \alpha z$ when $z < 0$. The idea is to use a constant $\alpha \ll 1$ so that negative outputs are attenuated significantly. The advantage of a leaky ReLU is that it does not eliminate negative outputs entirely. This is particularly useful during training, because the derivative of a leaky ReLU is $\alpha$ rather than zero for negative arguments, so gradients keep flowing through units with negative outputs.
Another common choice of activation function is the logistic or sigmoid
\begin{equation}\tag{9} σ(z) = \frac{1}{1 + e^{-z}}. \end{equation}
This function approaches 0 when $z$ is a large negative number and approaches 1 when $z$ is a large positive number. When $z$ has a small absolute value the sigmoid is approximately linear. Sigmoids are useful when we want outputs that range between 0 and 1.
The final activation function we define is the hyperbolic tangent
\begin{equation}\tag{10} σ(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}. \end{equation}
This function is similar in shape to the sigmoid except that its range is between plus and minus 1. An interesting interpretation of the hyperbolic tangent is that it tapers large values. When the argument $z$ has small absolute value, the output of the hyperbolic tangent is close to its input, $σ(z) \approx z$. When the argument $z$ is large, the output saturates at plus or minus 1.
Notice that the ReLU and leaky ReLU functions in (7) and (8) are not differentiable at $z = 0$ whereas the sigmoid and hyperbolic tangent in (9) and (10) are differentiable everywhere. This is an important distinction but it has little practical significance.
Further observe that all these four activation functions are such that the values at the output have less variability than the values at the input. ReLUs and leaky ReLUs maintain positive values but eliminate or drastically reduce negative values. Sigmoids and hyperbolic tangents squash the range of the input to $(0, 1)$ or $(-1, 1)$. Reducing variability improves the stability of information processing with neural networks relative to information processing with linear functions.
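As a quick illustration, and not something required by the tasks below, the four activation functions above are available in PyTorch and can be evaluated on a few sample points. The value $\alpha = 0.01$ for the leaky ReLU is an arbitrary choice.
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print("ReLU:      ", torch.relu(z))                         # Equation (7)
print("Leaky ReLU:", F.leaky_relu(z, negative_slope=0.01))  # Equation (8) with alpha = 0.01
print("Sigmoid:   ", torch.sigmoid(z))                      # Equation (9)
print("Tanh:      ", torch.tanh(z))                         # Equation (10)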
1.2 Neural network specification
The input to each layer of a neural network is a vector and the output is another vector. We say that we specify a neural network when we specify the number of entries of each of these vectors. Thus, if $p_0=p$ is the number of components of the input $\mathbf{x}_0$ and $p_\ell$ is the number of components at the output $\mathbf{x}_\ell$ of layer $\ell$ of a neural network with $L$ layers, the neural network is specified by the numbers $L$ and $p_0, p_1, \ldots, p_L$. Of these numbers $p_0$ and $p_L$ are determined by the data because $p_0$ is the dimension of the input $\mathbf{x}$ and $p_L$ matches the dimension of the output $\mathbf{y}$.
To complete the specification of a neural network we must also specify the nonlinearities that are used at each layer. In most cases, the same activation function is used in all layers.
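To illustrate what a specification looks like in code, the sketch below stores one matrix per layer for given widths $p_0, \ldots, p_L$ and applies the same activation at every layer. This generic class is not used in the tasks that follow; the name GenericNN and the default choice of a ReLU activation are our own.
class GenericNN(torch.nn.Module):
    def __init__(self, dims, sigma=relu):
        # dims = [p0, p1, ..., pL] specifies the number of layers L and their widths
        super().__init__()
        self.sigma = sigma
        self.A = nn.ParameterList(
            [nn.Parameter(torch.rand(dims[l], dims[l + 1]) - 0.5) for l in range(len(dims) - 1)]
        )

    def forward(self, x):
        # Recursion in (1), with inputs stored as row vectors
        for A_l in self.A:
            x = self.sigma(torch.matmul(x, A_l))
        return x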
Task 1
Define a class to represent a neural network to process scholastic achievement data. The inputs are high school GPA and SAT scores. The output is an estimate of the Penn GPA. This neural network has 2 layers and the output of Layer 1 has dimension $p_1 = 20$. The neural network uses ReLU nonlinearities in both layers.
Endow the object with a method that takes as inputs vectors $\mathbf{x} = [\text{HS GPA}; \text{SAT}]$ with high school GPA and SAT scores and produces estimates of Penn GPAs.
Notice that while we say that we are specifying the neural network, we are, in reality, specifying a family or a class of neural networks. An actual neural network is a concrete choice of the tensor $\mathcal{A} := [\mathbf{A}_1, \ldots, \mathbf{A}_L]$. The intent of a neural network specification like the one we give in Task 1 is to specify the search space for the ERM problem in (5).
We will define a class that inherits from torch.nn.Module and implements our two-layer NN. As explained in Lab 1.X, this inheritance is so that we can later evaluate gradients and train the neural network.
A detailed explanation of this code is in the cell immediately below.
class TwoLayerNN(torch.nn.Module):  # The TwoLayerNN class inherits from torch.nn.Module

    def __init__(self, p1=20):
        super().__init__()  # Call the parent's init method
        # The linear maps are defined by their dimensions. We specify them by initializing the
        # matrices A1 and A2 with random entries with the proper dimensionality.
        # The first layer has input dimension 2 (HS GPA, SAT) and output dimension p1
        self.A1 = nn.parameter.Parameter( torch.rand(2, p1) - 0.5 )
        # The second layer has input dimension p1 and output dimension 1 (predicted Penn GPA)
        self.A2 = nn.parameter.Parameter( torch.rand(p1, 1) - 0.5 )
To specify the neural network we need to define the parameters of our model in the init method of the class. Recall that the linear maps are defined by their dimensions. In our case the input layer (Layer 1) has input dimension 2 (HS GPA, SAT) and an output dimension given by the number of hidden units. We choose to make the number of hidden units a parameter $p_1$ that can be passed as an argument when the NN is instantiated. This parameter is optional and set to $p_1=20$ by default. Layer 2 (the output layer) has input dimension $p_1$ and output dimension 1 (Penn GPA).
To specify these linear transformations, we create randomly initialized parameter matrices A1 and A2 with the corresponding input and output dimensions. We could also initialize the matrices A1 and A2 with a fixed set of values (say, all zeros or all ones). However, it is known that random parameter initialization helps with the training of neural networks.
Now we need to implement the forward method that given an input computes the predictions of our neural network using these previously defined parameters (A1, A2).
class TwoLayerNN(torch.nn.Module):  # The TwoLayerNN class inherits from torch.nn.Module

    def __init__(self, p1=20):  # The same init method explained above with shortened comments
        super().__init__()
        self.A1 = nn.parameter.Parameter( torch.rand(2, p1) - 0.5 )  # Layer 1 parameter: input dimension 2, output dimension p1
        self.A2 = nn.parameter.Parameter( torch.rand(p1, 1) - 0.5 )  # Layer 2 parameter: input dimension p1, output dimension 1

    def forward(self, x0):  # The forward method implements the neural network
        # First layer takes x0 as input (passed when calling the function) and gives x1 as output
        z1 = torch.matmul( x0, self.A1 )  # Multiplication by A1, a matrix stored in the NN object
        x1 = relu( z1 )                   # ReLU nonlinearity
        # Second layer takes x1 as input and gives x2 as output
        z2 = torch.matmul( x1, self.A2 )  # Multiplication by A2, a matrix stored in the NN object
        x2 = relu( z2 )                   # ReLU nonlinearity
        return x2                         # This is a two-layer NN. The output is x2
The forward method is a straightforward implementation of the NN parametrization. We compute the linear transformation of the inputs $\mathbf{A}_1 \mathbf{x}_0$ in the first layer by calling torch.matmul(x0, self.A1) and then apply the pointwise nonlinearity relu() to obtain the first layer output x1. We then use x1 as an input to the second layer, which is implemented in the same way (relu(torch.matmul(x1, self.A2))).
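As a quick sanity check, and not part of the task itself, we can instantiate the class and run a forward pass on a few rows of the normalized data to verify that the output has one prediction per student. The variable name nn_check is our own.
# Instantiate a two-layer NN with the default p1 = 20 hidden units
nn_check = TwoLayerNN()
# Forward pass on the first 5 students; the output should have shape (5, 1)
with torch.no_grad():
    y_check = nn_check(x[:5])
print(y_check.shape)
print(y_check.squeeze())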
2 Train and Test Sets
The minimization of empirical risks carries a subtle challenge. To see what it is, consider an ERM problem in which we do not choose a parameterization. In this case we are simply seeking an artificial intelligence $\mathbf{\Phi}(\mathbf{x})$ that solves the following problem
\begin{equation}\tag{11} \mathbf{\Phi}^* ~=~ \operatorname{argmin}_{\mathbf{\Phi}} \frac{1}{N} \sum_{i=1}^{N} \ell \Big( \mathbf{y}_i, \mathbf{\Phi}(\mathbf{x}_i) \Big) ~. \end{equation}
This problem is the opposite of challenging. We just choose the map $\mathbf{\Phi}(\mathbf{x}_i)=\mathbf{y}_i$ and select $\mathbf{\Phi}(\mathbf{x})$ at random for any other $\mathbf{x}$. This solution is as easy as it is nonsensical. It does not solve the artificial intelligence problem of finding a function $\mathbf{\Phi}(\mathbf{x})$ that can make predictions about the future. It memorizes the past perfectly but it says nothing about future inputs that are not exact matches of some past inputs.
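To make this argument concrete, here is a sketch of such a memorizing map. It attains zero empirical risk on the given pairs and produces arbitrary predictions everywhere else; it is purely illustrative and not something we would ever deploy.
# A "memorizer" that attains zero empirical risk but is useless for prediction
memory = {tuple(x_i.tolist()): y_i.item() for x_i, y_i in zip(x, y)}

def phi_memorizer(x_query):
    key = tuple(x_query.tolist())
    if key in memory:                 # Exact match of a past input: return the stored output
        return memory[key]
    return float(torch.randn(1))      # Any other input: an arbitrary prediction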
Given that it is nonsensical there is no point in working with (11). And, indeed, we are not attempting to solve (11). We are attempting to solve the ERM problem in (5) in which predictions $\hat{\mathbf{y}}_i=\mathbf{\Phi}(\mathbf{x}_i; \mathcal{A})$ are outputs of a family of neural networks that follow some specification, as discussed in Section 1.2. Nevertheless, there is the mystifying fact that from the perspective of empirical risk a function $\mathbf{\Phi}^*$ that solves (11) is at least as good, and in all likelihood better, than a function $\mathbf{\Phi}(\mathbf{x}_i; \mathcal{A}^*)$ that solves (5). Indeed, since the empirical risk of $\mathbf{\Phi}^*$ is null and the empirical risk is nonnegative, we must have,
\begin{equation}\tag{12} ~0~ = \frac{1}{N} \sum_{i=1}^{N} \ell \Big( \mathbf{y}_i, \mathbf{\Phi}^*(\mathbf{x}_i) \Big) ~\leq~ \frac{1}{N} \sum_{i=1}^{N} \ell \Big( \mathbf{y}_i, \mathbf{\Phi}(\mathbf{x}_i; \mathcal{A}^*) \Big) . \end{equation}
The answer to this riddle is that the merit of an artificial intelligence is not its empirical risk but the risk it attains during operation.
Suppose then that we observe $\tilde{N}$ points $(\tilde{\mathbf{x}}_j, \tilde{\mathbf{y}}_j)$ during operation of the neural network. The true metric of performance of the neural network is its ability to make predictions $\mathbf{\Phi}(\tilde{\mathbf{x}}_j; \mathcal{A})$ that are close to the observed outcomes. This is measured by the operational risk
\begin{equation}\tag{13} \tilde{r}(\mathcal{A}) ~=~ \frac{1}{\tilde{N}} \sum_{j=1}^{\tilde{N}} \ell \Big( \tilde{\mathbf{y}}_j, \mathbf{\Phi}(\tilde{\mathbf{x}}_j; \mathcal{A}) \Big) ~. \end{equation}
This solves the riddle of (12) because it is possible that the operational risk of the neural network is better than the operational risk of the function $\mathbf{\Phi}^*$. This is not just possible but does happen in practice. The reasons why this may happen or not are the subject matter of statistical learning theory. We will not study this here.
Observe that the operational risk becomes available only a posteriori, after the neural network goes into service and starts making predictions. To gain a handle on the operational risk before service we divide the available data into two sets: a training set with the $N$ samples $(\mathbf{x}_i, \mathbf{y}_i)$ that we use for solving (5), and a test set with $\tilde{N}$ samples $(\tilde{\mathbf{x}}_j, \tilde{\mathbf{y}}_j)$ that we can use to estimate the operational risk. It is important that the test set is not used in training and it is important that the test set is representative of the data as a whole. In this context the operational risk is also called the test error.
Task 2
Divide the data of Lab 1A into a test set with $\tilde{N} = 100$ samples and use the remaining samples as a training set. Use it to train the neural network of Task 1. Try with a learning rate of $0.01$ and train for $200$ epochs. Evaluate the training error and the test error. They will be different.
Please report the train and test error. Additionally, write a paragraph discussing differences between train and test risks.
We begin by partitioning the data into two sets.
# Randomly shuffle indices for creating training and testing splits
dataset_size = len(x)
indices = list(range(dataset_size))
np.random.shuffle(indices)
# Number of test samples
numTest = 100
# Number of training samples
numTrain = dataset_size - numTest
print(f"Num Train = {numTrain}, numTest = {numTest}")
train_indices, test_indices = indices[numTest:], indices[:numTest]
# Create training and testing datasets using the shuffled indices
# This ensures a random split of the data for unbiased evaluation
x_train = x[train_indices]
y_train = y[train_indices]
x_test = x[test_indices]
y_test = y[test_indices]
Num Train = 500, numTest = 100
torch.manual_seed(1) #Fix the seed for reproducibility of this cell
# Specification of learning parametrization
estimator = TwoLayerNN(40)
# Define parameters used in training loop
epsilon=0.01
batch_size = 64
# Specify the optimizer that is used to perform descent updates
optimizer = optim.SGD(estimator.parameters(), lr=epsilon, momentum=0)
# Lists to record the evolution of the MSE during training and testing
mse_evolution = []
mse_evolution_test = []
# Training loop: iterates over the dataset n_epoch times
n_epochs = 200
# After num_train_examples / batch_size iterations, we complete an epoch
num_train_examples = x_train.shape[0]
epoch_iterations = num_train_examples // batch_size
#So total iterations is n_epochs * epoch_iterations
for _ in range(n_epochs*epoch_iterations):
    # Sample with replacement
    batch_indices = torch.randint(0, num_train_examples, (batch_size,))
    x_batch = x_train[batch_indices]
    y_batch = y_train[batch_indices]
    # Set gradients to zero
    estimator.zero_grad()
    # Compute predictions
    yHat = estimator.forward(x_batch).squeeze()
    # Compute MSE
    mse = torch.mean((yHat-y_batch)**2)
    # Compute gradients
    mse.backward()
    # Update model parameters
    optimizer.step()
    # Evaluate the model on the whole training dataset and record the MSE
    with torch.no_grad():
        y_pred_train = estimator(x_train).squeeze()
        epoch_mse = torch.mean((y_pred_train - y_train) ** 2).item()
        mse_evolution.append(epoch_mse)
    # Evaluate the model on the whole testing dataset and record the MSE
    with torch.no_grad():
        y_pred_test = estimator(x_test).squeeze()
        mseTest = torch.mean((y_pred_test - y_test) ** 2).item()
        mse_evolution_test.append(mseTest)
# Print the final MSE on the training dataset
print(f"Train loss: {mse_evolution[-1]} ")
# Print the final MSE on the testing dataset
print(f"Test loss: {mse_evolution_test[-1]} ")
# Plot the evolution of the MSE loss over epochs for training and testing
_ = plt.plot(mse_evolution, ".", label="train")
_ = plt.title("Mean squared error evolution")
_ = plt.xlabel("Epoch")
_ = plt.ylabel("MSE")
_ = plt.plot(mse_evolution_test, "-", label="test")
_ = plt.legend()
Train loss: 0.5908371658605887
Test loss: 0.666048637746284
The mean squared errors evaluated on the train and test sets are different. This difference entails a penalty in the performance of the neural network (NN) when it is deployed to make actual predictions. Indeed, the loss averaged over the training set converges to a value around 0.59. This is the performance that we would expect the NN to attain when deployed. However, the deployment performance (as evaluated on the test set) is around 0.67. The difference between the training loss and the deployment loss (of which the test loss is a proxy) arises because the empirical distribution of the training data is not an exact match for the true distribution of the data. This is the mismatch between empirical risk minimization and statistical risk minimization, which is the subject of study of statistical learning theory.
Task 3
We will incorporate the gender data that we have so far ignored. To that end encode gender as $\text{Female} = 1$ and $\text{Male} = -1$. Redefine the object in Task 1 so that it can now process this information. In this neural network we use a single intermediate layer with $20$ hidden neurons.
This is a task that you should be able to accomplish without looking at the solution shown below. This is because the solution is all but identical to the solution of Task 1. There is, in fact, a single line of code that has to be modified. Even more, within this single line of code there is a single character that has to change.
class TwoLayerNN(torch.nn.Module):

    def __init__(self, p1=20):
        super().__init__()
        # Modify the input dimension to 3
        self.A1 = nn.parameter.Parameter((torch.rand(3, p1)-0.5))
        self.A2 = nn.parameter.Parameter((torch.rand(p1, 1)-0.5))

    def forward(self, x0):
        # Compute output of first layer
        x1 = relu(torch.matmul(x0, self.A1))
        # Compute output of second layer
        x2 = relu(torch.matmul(x1, self.A2))
        return x2
As we discussed above, this is very similar to the solution of Task 1. The forward method is identical and the init method is almost the same. The only change is that the input dimension is 3. We therefore have to change the initialization of the matrix $A_1$ to make it a $p_1 \times 3$ matrix instead of the $p_1 \times 2$ matrix used in Task 1. The adventurous among you can even make this quantity a parameter that is passed to the neural network at initialization.
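For those who want to try the suggested variation, a sketch of a class in which the input dimension is also a constructor argument could look as follows. The class name TwoLayerNNFlexible and the default values are our own choices.
class TwoLayerNNFlexible(torch.nn.Module):

    def __init__(self, p0=3, p1=20):
        super().__init__()
        # Both the input dimension p0 and the hidden dimension p1 are now parameters
        self.A1 = nn.parameter.Parameter(torch.rand(p0, p1) - 0.5)
        self.A2 = nn.parameter.Parameter(torch.rand(p1, 1) - 0.5)

    def forward(self, x0):
        x1 = relu(torch.matmul(x0, self.A1))
        x2 = relu(torch.matmul(x1, self.A2))
        return x2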
Task 4
Divide the data of Lab 1A into a test set with $\tilde{N} = 100$ samples and use the remaining samples as a training set. Use it to train the neural network of Task 3. Use a step size of $0.01$, a batch size of $64$ and train for $200$ epochs. Evaluate the training error and the test error.
This is another task that you should be able to accomplish without looking at the solution shown below. This is because the solution of this task and the solution of Task 2 are essentially the same. In both tasks we are training a neural network and the training loop is the same regardless of the architecture that is being trained.
# Create a gender tensor that is < 0 for Male and > 0 for Female
gender = data_norm[:,4] - data_norm[:,5]
# Normalize
gender = (gender - torch.mean(gender)) / torch.std(gender)
# Select the normalized HS GPA, SAT scores, and add the gender as a column
x = torch.hstack((data_norm[:,1:3], gender.unsqueeze(1)))
# The targets are the normalized Penn GPA values
y = data_norm[:,3]
# Create train and test sets again
x_train = x[train_indices]
y_train = y[train_indices]
x_test = x[test_indices]
y_test = y[test_indices]
torch.manual_seed(3) #Fix the seed for reproducibility of this cell
# Specification of learning parametrization
estimator = TwoLayerNN(40)
# Define parameters used in training loop
epsilon=0.01
batch_size = 64
# Specify the optimizer that is used to perform descent updates
optimizer = optim.SGD(estimator.parameters(), lr=epsilon, momentum=0)
# Lists to record the evolution of the MSE during training and testing
mse_evolution = []
mse_evolution_test = []
# Training loop: iterates over the dataset n_epoch times
n_epochs = 200
# After num_train_examples / batch_size iterations, we complete an epoch
num_train_examples = x_train.shape[0]
epoch_iterations = num_train_examples // batch_size
#So total iterations is n_epochs * epoch_iterations
for _ in range(n_epochs*epoch_iterations):
    # Sample with replacement
    batch_indices = torch.randint(0, num_train_examples, (batch_size,))
    x_batch = x_train[batch_indices]
    y_batch = y_train[batch_indices]
    # Set gradients to zero
    estimator.zero_grad()
    # Compute predictions
    yHat = estimator.forward(x_batch).squeeze()
    # Compute MSE
    mse = torch.mean((yHat-y_batch)**2)
    # Compute gradients
    mse.backward()
    # Update model parameters
    optimizer.step()
    # Evaluate the model on the whole training dataset and record the MSE
    with torch.no_grad():
        y_pred_train = estimator(x_train).squeeze()
        epoch_mse = torch.mean((y_pred_train - y_train) ** 2).item()
        mse_evolution.append(epoch_mse)
    # Evaluate the model on the whole testing dataset and record the MSE
    with torch.no_grad():
        y_pred_test = estimator(x_test).squeeze()
        mseTest = torch.mean((y_pred_test - y_test) ** 2).item()
        mse_evolution_test.append(mseTest)
# Print the final MSE on the training dataset
print(f"Train loss: {mse_evolution[-1]} ")
# Print the final MSE on the testing dataset
print(f"Test loss: {mse_evolution_test[-1]} ")
# Plot the evolution of the MSE loss over epochs for training and testing
_ = plt.plot(mse_evolution, ".", label="train")
_ = plt.title("Mean squared error evolution")
_ = plt.xlabel("Epoch")
_ = plt.ylabel("MSE")
_ = plt.plot(mse_evolution_test, "-", label="test")
_ = plt.legend()
Train loss: 0.584802666916238
Test loss: 0.6613269555077533
Task 5
Given that we have decided to use gender information in our decisions, it is fair that we are asked what effect this choice has. Evaluate the average predicted GPA for Female and Male applicants and observe that data can be perilous. Or not. Comment.
Please report the train error, test error and test errors for Male and Female applicants. Comment on the perils of data.
We should note that this task has little to do with the course. We have already learned how to code and train a neural network and we are ready to move on to the higher heights of convolutional neural networks. That said, when one climbs a mountain it is worth enjoying the view. It is interesting to compute these quantities and ponder how our algorithms may end up doing things that are incompatible with our principles.
# Calculate average predicted GPA for males
male_indexes = gender < 0
male_yHat = torch.mean(estimator.forward(x[male_indexes]).squeeze())
# Calculate average predicted GPA for females
female_indexes = gender > 0
female_yHat = torch.mean(estimator.forward(x[female_indexes]).squeeze())
print(f"Average Male GPA: {stds[2] * male_yHat + means[2]} ")
print(f"Average Female GPA: {stds[2] * female_yHat + means[2]}")
Average Male GPA: 3.7713411394401843
Average Female GPA: 3.7696337341131434
def evaluate_gender_mse(x, y, estimator, gender):
    # Initialize variables to accumulate the MSE and count of male and female samples
    male_mse_sum = 0
    female_mse_sum = 0
    male_count = 0
    female_count = 0
    with torch.no_grad():
        # Determine which samples are male and which are female
        male_indices = gender < 0
        female_indices = gender > 0
        # Male MSE
        if any(male_indices):
            male_x = x[male_indices]
            male_y = y[male_indices]
            # Compute predictions
            male_y_pred = estimator.forward(male_x).squeeze()
            # Accumulate the sum of squared errors for male samples
            male_mse_sum = (male_y_pred - male_y).pow(2).sum().item()
            male_count = male_indices.sum().item()
        # Female MSE
        if any(female_indices):
            female_x = x[female_indices]
            female_y = y[female_indices]
            # Compute predictions
            female_y_pred = estimator.forward(female_x).squeeze()
            # Accumulate the sum of squared errors for female samples
            female_mse_sum = (female_y_pred - female_y).pow(2).sum().item()
            female_count = female_indices.sum().item()
    # Compute the MSE for male and female
    male_mse = male_mse_sum / male_count if male_count > 0 else None
    female_mse = female_mse_sum / female_count if female_count > 0 else None
    return male_mse, female_mse
male_mse, female_mse = evaluate_gender_mse(x, y, estimator, gender)
# Print the calculated MSE for male and female samples.
print(f"Test MSE for Males: {male_mse}")
print(f"Test MSE for Females: {female_mse}")
Test MSE for Males: 0.5606643664954948
Test MSE for Females: 0.6344490635341531
The test MSEs for male and female candidates are different. We make predictions that are more accurate for male applicants than for female applicants. This means that, in a sense, we are giving male applicants a fairer hearing than female applicants. I think we can all agree that this is not something we want to do. In that sense, data is perilous. On the other hand, we didn’t do anything to foster this outcome. This is just the way the data turns out to be. In that sense, data is not perilous.
As engineers our job is to design systems that affect the lives of our fellow citizens. It is our responsibility to make sure that we do not affect them for the worse. Alas, what is worse is not always clear. This is why you should take a class on engineering ethics.
Sampling Without Replacement
In Tasks 2 and 4 we trained our neural networks using the stochastic gradient descent (SGD) algorithm we introduced in Lab 1B. As a reminder, this algorithm utilizes the stochastic gradients,
$$ \hat{g}(w) ~=~ \frac{1}{B} \sum_{i=1}^{B} \frac{\partial}{\partial w} \ell \Big( y_i, \Phi(x_i; w) \Big)~, $$
which are averages of gradients associated with individual data pairs over a batch of $B\ll N$ samples. These stochastic gradients form the basis of the SGD recursion,
$$ w(k+1) = w(k) - \epsilon \hat{g}(w(k)). $$
In the stochastic gradient computation the samples in the batch are chosen independently at random at each SGD iteration. While this is justifiable in theory, it tends to perform erratically in practice because there is significant variability in the samples that are drawn in different runs. Indeed, there is a significant likelihood that some samples are never chosen in any batch and a significant likelihood that some samples are drawn in several batches.
We overcome this erratic behavior by dividing the training set into $N/B$ batches of size $B$ containing different samples. We call this choice of batches sampling without replacement. This nomenclature indicates that samples in different batches are different and that all samples are part of one batch.
This process is sketched in Figure 6. The training set in the illustration contains $N=30$ samples that we divide into $N/B = 5$ batches of $B=6$ samples each. If we need to run more than $N/B = 5$ stochastic gradient descent iterations we repeat the process of dividing the training set into batches. To avoid having batches with the exact same set of samples, the training set is reshuffled before it is divided into batches. Each new set of batches created after reshuffling is called an epoch.
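Before turning to the PyTorch DataLoader used in the solution below, it may help to see sampling without replacement written by hand: permute the training indices at the start of each epoch and slice them into consecutive batches. This is only a sketch of the batching logic; the gradient step itself would be the same as in Task 2.
# Sampling without replacement, written by hand (illustrative sketch)
for epoch in range(n_epochs):
    # Reshuffle the training set at the start of each epoch
    perm = torch.randperm(num_train_examples)
    # Slice the shuffled indices into consecutive batches of size batch_size
    for start in range(0, num_train_examples, batch_size):
        batch_indices = perm[start:start + batch_size]
        x_batch = x_train[batch_indices]
        y_batch = y_train[batch_indices]
        # ... perform the gradient step on (x_batch, y_batch) as in Task 2 ...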
Task 6
Reimplement Task 2 using sampling without replacement. This is a task that is easier to implement using functionality built into PyTorch. We recommend that you look at the posted solution.
Compare the test loss of this trained neural network with the test loss of the neural network trained in Task 2. They should have similar test losses in this case because the dataset has limited complexity, but this is not true in general. In upcoming labs we will use sampling without replacement unless we say otherwise. $\blacksquare$
torch.manual_seed(3)
# Specification of learning parametrization
estimator = TwoLayerNN(40)
# Define parameters used in training loop
epsilon=0.01
batch_size = 64
# Specify the optimizer that is used to perform descent updates
optimizer = optim.SGD(estimator.parameters(), lr=epsilon, momentum=0)
## Instantiate Datasets and Dataloaders
# TensorDataset combines tensors into a dataset
# It's useful for creating a dataset from two tensors (x_train, y_train)
train_dataset = TensorDataset(x_train, y_train)
test_dataset = TensorDataset(x_test, y_test)
# DataLoader wraps an iterable around the Dataset to enable easy access to samples
# It provides features like automatic batching, shuffling, and multiprocessing for data loading
# batch_size determines how many samples per batch to load
# shuffle=True shuffles the data at every epoch
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
# Lists to record the evolution of the MSE during training and testing
mse_evolution = []
mse_evolution_test = []
# Training loop: iterates over the dataset n_epoch times
n_epochs = 200
for _ in range(n_epochs):
    # Iterate over all batches in the dataset
    for x_batch, y_batch in train_loader:
        # Set gradients to zero
        estimator.zero_grad()
        # Compute predictions
        yHat = estimator.forward(x_batch).squeeze()
        # Compute MSE
        mse = torch.mean((yHat-y_batch)**2)
        # Compute gradients
        mse.backward()
        # Update model parameters
        optimizer.step()
    # Evaluate the model on the training dataset and record the MSE
    train_mse = 0
    with torch.no_grad():
        for x_batch, y_batch in train_loader:
            yHat = estimator.forward(x_batch).squeeze()
            train_mse += torch.sum((yHat-y_batch)**2)
    train_mse /= len(train_loader.dataset)
    mse_evolution.append(train_mse.item())
    # Evaluate the model on the testing dataset and record the MSE
    test_mse = 0
    with torch.no_grad():
        for x_batch, y_batch in test_loader:
            yHat = estimator.forward(x_batch).squeeze()
            test_mse += torch.sum((yHat-y_batch)**2)
    test_mse /= len(test_loader.dataset)
    mse_evolution_test.append(test_mse.item())
# Print the final MSE on the training dataset
print(f"Train loss: {mse_evolution[-1]} ")
# Print the final MSE on the testing dataset
print(f"Test loss: {mse_evolution_test[-1]} ")
# Plot the evolution of the MSE loss over epochs for training and testing
_ = plt.plot(mse_evolution, ".", label="train")
_ = plt.title("Mean squared error evolution")
_ = plt.xlabel("Epoch")
_ = plt.ylabel("MSE")
_ = plt.plot(mse_evolution_test, "-", label="test")
_ = plt.legend()
Train loss: 0.5843111034719861
Test loss: 0.6594311505172723