Lab 1: Grade Point Average Prediction¶

PyTorch tutorial¶

We will use as an example the ERM problem from lab 1.b:

Consider a generic collection of $N$ data pairs $(x_i, y_i)$ along with predictions $\hat{y} = \Phi(x;w)$. In the function $\Phi(x;w)$ the variable $x$ is an input and $w$ is a parameter to be determined. We say that $\Phi(x;w)$ is a learning parameterization. The goal is to compare outputs $y_i$ with predictions $\hat{y}_i = \Phi(x_i;w)$ so as to find a suitable value of $w$. With this parameter value on hand we can then make predictions $\hat{y} = \Phi(x;w)$ for input variables $x$ for which the output $y$ has not been observed yet.

In order to find suitable values for $w$ we introduce a loss function $\ell(y,\hat{y})\geq0$ which we use to evaluate the cost of predicting $\hat{y}$ when the true value realized by the world is $y$. Given that we have $N$ data points $x_i$ for which the true output of the system is known to be $y_i$, we define the empirical risk associated with parameter $w$ as

\begin{equation}\label{eqn_er} r(w) ~=~ \frac{1}{N} \sum_{i=1}^{N} \ell \big( y_i, \hat{y}_i \big) ~=~ \frac{1}{N} \sum_{i=1}^{N} \ell \Big( y_i, \Phi(x_i; w) \Big) ~. \end{equation}

The empirical risk $r(w)$ measures the predictive power of coefficient $w$. The quantity $\ell ( y_i, \hat{y}_i )$ is always nonnegative and indicates how close the predicted output $\hat{y}_i$ is to the true output $y_i$. The empirical risk averages this metric over all available datapoints. It follows that a natural choice for $w$ is the value that makes the empirical risk smallest. We therefore define the optimal coefficient as the one that solves the following empirical risk minimization (ERM) problem \begin{equation}\label{eqn_erm} w^* ~=~ \text{argmin}_w r(w) ~=~ \text{argmin}_w \frac{1}{N} \sum_{i=1}^{N} \ell \Big( y_i, \Phi(x_i; w) \Big) ~. \end{equation}
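As a concrete illustration of the empirical risk, the following minimal sketch evaluates $r(w)$ for a linear parametrization with the squared loss. The toy dataset and the names `empirical_risk`, `x_toy`, `y_toy` and `w_true` are assumptions made for this example only; they are not part of the lab data.

```python
import torch

def empirical_risk(w, x, y):
    """Average squared loss of the linear predictions x @ w over the dataset."""
    y_hat = x @ w                        # predictions Phi(x_i; w) for every i
    return torch.mean((y - y_hat) ** 2)  # (1/N) * sum of the individual losses

# Hypothetical toy dataset: N = 4 pairs with two input features each.
x_toy = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
w_true = torch.tensor([2.0, -1.0])
y_toy = x_toy @ w_true                   # outputs generated without noise

risk_at_truth = empirical_risk(w_true, x_toy, y_toy)
risk_at_zero = empirical_risk(torch.zeros(2), x_toy, y_toy)
print(risk_at_truth.item(), risk_at_zero.item())
```

The parameter that generated the data attains zero risk, while any other choice, such as the zero vector, attains a strictly larger value. This is the sense in which the ERM problem selects $w$.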

0. Environment setup¶

We will first import the necessary Python Packages and load the data from the lab to use as an example.

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
torch.set_default_dtype(torch.float64)
plt.style.use('default')

data = torch.from_numpy( np.genfromtxt ('data.csv', delimiter=",", skip_header=1, dtype = float ) )
# Compute the mean and standard deviation of each variable.
means = torch.mean(data[:,:3], dim = 0)
stds = torch.std(data[:,:3], dim = 0)
# Subtract the mean of each variable and divide by its standard deviation.
data_norm = data.clone()
data_norm[:, :3] = (data[:, :3] - means) / stds 
# We use both the normalized HS GPA and the SAT scores
x = data_norm[:,:2] 
# The targets are the normalized Penn GPA values
y = data_norm[:,2]
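Since the model is trained on normalized variables, its predictions are also in normalized units. The sketch below, which uses synthetic numbers as a stand-in for the lab data (the tensor `fake_data` is an assumption), checks that the normalization yields zero mean and unit standard deviation and shows how a normalized column can be mapped back to its original units.

```python
import torch

torch.manual_seed(0)
# Hypothetical stand-in for the lab data: 100 rows, 3 columns on different scales.
scales = torch.tensor([4.0, 1600.0, 4.0], dtype=torch.float64)
fake_data = torch.rand(100, 3, dtype=torch.float64) * scales

means = fake_data.mean(dim=0)
stds = fake_data.std(dim=0)
fake_norm = (fake_data - means) / stds

# Map the normalized third column back to its original units.
recovered = fake_norm[:, 2] * stds[2] + means[2]
print(fake_norm.mean(dim=0), fake_norm.std(dim=0))
```

The same inverse map, multiplying by the standard deviation and adding the mean, can be applied to a normalized prediction to express it as a GPA.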

Classes and Objects¶

Using PyTorch is easy but it can look complicated because it requires that you either learn or remember that Python is an object oriented language. You have to create objects that instantiate classes where you specify the operations that are to be performed. This results in code that can look weird and complicated but that is easier to modify. And while it may look complicated, it is not, in reality, complicated.

import torch

Classes: Attributes and Methods¶

The first concept to understand is the difference between a class and an object. A class is an abstract entity that contains attributes and methods, and an object is a specific instance of a class. This is most easily explained with an example. Suppose that we are interested in linear functions of the form $\mathbf{y}=w\mathbf{x}$ with $w$ being a matrix with $m$ rows and $n$ columns. We therefore create a class LinearFunction defined as follows

class LinearFunction:
    def __init__(self, w):
        self.w = w
    def evaluate(self, x):
        y = x@self.w
        return y

The class definition contains two methods. The method __init__ plays a special role in the creation of objects which we will explain soon. At this point, observe how it specifies the attributes that are part of the class. In this specific example, the class contains one attribute, the matrix $w$. The __init__ method is where the attributes of a class are specified, and self always has to be its first parameter.

evaluate is a function, which in object oriented programming we call a method. This method takes a variable as an input and returns the matrix product as an output. Observe that the matrix is not an input to this function. The matrix is an attribute that belongs to the class. Further notice that self is the first parameter of the method. Any method that is defined in a class has to take self as its first parameter.

Objects: Concrete Instances of Abstract Classes¶

The class is an abstract entity whose methods specify how to manipulate its attributes. If we want to actually process data, we create a specific instance. This is an object. For example, if we want a linear transformation specified by a specific matrix $w$, we create the object as an instance of the class.

# We use a random w as an example
w = torch.rand(2)
# create object
random_linear_map = LinearFunction(w)

When we create the object we implicitly call the __init__ method. In doing so we instantiate the attributes that belong to the object.

The object is a variable that stores these attributes. Thus, we can access and modify them:

# Read the current weights stored in the object
current_weight = random_linear_map.w
print(f"The linear map weight matrix is {current_weight}")
# Set New random weights 
random_linear_map.w = torch.rand(2)
print(f"The linear map weight matrix is {random_linear_map.w}")

The linear map weight matrix is tensor([0.3503, 0.0703])
The linear map weight matrix is tensor([0.5778, 0.5944])

We can now evaluate the output of our linear map for an input $x$ by using the evaluate method that we defined in the class.

linear_pred = random_linear_map.evaluate(x[0])
print(f"Predicted output: {linear_pred}")

Predicted output: -0.18316594087472796

If this looks like a lot of trouble for a matrix product, it is because it is, indeed, a lot of trouble. However, suppose that you now find a more efficient algorithm for implementing matrix computations. Perhaps because you have decided to take advantage of a GPU. You go into the definition of the class and update the method. The change is now implemented in the hundreds of places in your code where you had used matrix multiplication.

Inheritance¶

A third concept of object oriented programming we have to introduce is inheritance. This is the possibility of defining a “child” subclass that inherits methods from a “parent” class. As an example, suppose that you intend to create several random linear maps. Instead of generating the matrix and passing it as an argument in the creation of the object, it is more convenient to encapsulate the generation of the random matrix inside of the class. To do that, create a class RandomLinearFunction, which we define as a child of the class LinearFunction,

class RandomLinearFunction(LinearFunction):
    def __init__(self):
        super(RandomLinearFunction, self).__init__(torch.rand(2))

Observe that the specification of RandomLinearFunction as a child of LinearFunction is done by making the latter an argument in the class statement. The use of inheritance allows us to reuse our hard work in the creation of the LinearFunction class.

super(RandomLinearFunction, self).__init__ calls the __init__ function of the parent class, passing a random weight matrix as an argument.

We do not need to specify the evaluate function for RandomLinearFunction because we are reusing it from the parent class LinearFunction. We are inheriting, to use the more technical term. If at some point in the future we update the evaluate method in the LinearFunction class, that updated method is automatically inherited by the child class.

With this new class, the creation of random Linear maps simplifies to the code:

random_linear_map = RandomLinearFunction()

The code for the evaluation of the linear functions is still the same because it has been inherited.

linear_pred = random_linear_map.evaluate(x[0])
print(f"Predicted output: {linear_pred}")

Predicted output: -0.2156904216848885

The most important advantage of defining a new class this way is that modifications to the LinearFunction class will now propagate to all the places where a RandomLinearFunction is used.
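The propagation of changes from parent to child can be checked with a small sketch. The classes are repeated so that the example is self-contained. Reassigning `LinearFunction.evaluate` is used here only as a stand-in for editing the class definition, and `evaluate_with_offset` is a hypothetical replacement method introduced for illustration.

```python
import torch

class LinearFunction:
    def __init__(self, w):
        self.w = w
    def evaluate(self, x):
        return x @ self.w

class RandomLinearFunction(LinearFunction):
    def __init__(self):
        super().__init__(torch.rand(2))

# Hypothetical updated method: the linear map plus a constant offset.
def evaluate_with_offset(self, x):
    return x @ self.w + 1.0

child = RandomLinearFunction()
x_example = torch.tensor([1.0, 2.0])
before = child.evaluate(x_example)

# Stand-in for editing the evaluate method in the parent class definition.
LinearFunction.evaluate = evaluate_with_offset
after = child.evaluate(x_example)
print(before.item(), after.item())
```

The child object was never touched, yet its evaluate method now returns the updated result, because the method is looked up in the parent class.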

A Simple Training Loop¶

The reason why training with PyTorch may look complicated is that part of the operations are encapsulated in an object that inherits methods from a parent class. Having developed an understanding of the encapsulation of operations inside of objects, it is now easy to understand how to write a training loop in PyTorch.

In this section we focus on the ERM problem in which the loss associated with individual observations is the squared loss $\ell(y, \hat{y}) = (y - \hat{y})^2$ and the learning parametrization is the linear function $\hat{y} = w\mathbf{x}$.

The Parametrization Class¶

Our first task is to specify the learning parametrization that we will use. We do that by creating a class, which we will instantiate later, that we will call Parametrization. This class must have an __init__ method and a method called forward. Most importantly, the class must inherit from the class nn.Module that is part of the torch library. This is what will allow its use in a training loop. To describe this in more detail, here is minimal code that defines a class for estimates $\hat{y} = w\mathbf{x}$,

class Parametrization(torch.nn.Module):
    def __init__(self, w_init):
        super().__init__()
        self.w = torch.nn.parameter.Parameter(w_init)
    def forward(self, x):
        return x@self.w

The definition of the class specifies that nn.Module is a parent of the class. This allows Parametrization to inherit methods from nn.Module. Most notable among these inherited methods are those related to the computation of gradients, which we will use in the training loop below.

Aside from that, we specify the __init__ method and the forward method.

The first line of the __init__ method calls the __init__ function of the parent class (nn.Module) and the second line of the method specifies that the variable self.w is a parameter. This means exactly what you think it means: it indicates that self.w is a variable that we will train, a fact that has to be specified for gradients to be computed correctly. The Parametrization class could include other attributes that are not trained. The argument w_init passed to Parameter further specifies the initial value of this variable and, with it, its dimensions.
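To see what declaring a variable as a parameter changes, the following sketch compares two minimal modules. The class names `WithPlainTensor` and `WithParameter` are hypothetical and exist only for this comparison.

```python
import torch

class WithPlainTensor(torch.nn.Module):
    def __init__(self, w_init):
        super().__init__()
        self.w = w_init  # plain tensor attribute: not registered as trainable

class WithParameter(torch.nn.Module):
    def __init__(self, w_init):
        super().__init__()
        self.w = torch.nn.parameter.Parameter(w_init)  # registered as trainable

w_init = torch.zeros(2)
plain_model = WithPlainTensor(w_init)
param_model = WithParameter(w_init)

n_plain = len(list(plain_model.parameters()))
n_param = len(list(param_model.parameters()))
print(n_plain, n_param)
print(param_model.w.requires_grad)
```

Only the attribute wrapped in Parameter is returned by parameters(), which is the list that an optimizer receives. A plain tensor attribute would be ignored during training.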

The forward method is where the parametrization is specified. It says that when given an input x, the class is to produce estimates according to x@self.w. This is the line of the code that we have to change if we want to use a different parametrization. To illustrate ideas, suppose we want to change the parametrization to the perceptron $y=\text{ReLU}(w\mathbf{x})$, with $\text{ReLU}(y)=\max(y,0)$ representing a rectified linear unit (ReLU). We can do this by simply redefining the forward function as follows,

from torch.nn.functional import relu
class Perceptron(Parametrization):
    def forward(self, x):
        return relu(x@self.w)

This is a good moment to appreciate the advantage of using objects. The parametrization is encapsulated inside of the class. Once we write a training loop, this training loop can be used for any parametrization. We can experiment with different versions of the forward method without having to meddle with the training loop.
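The claim that one training procedure serves any parametrization can be sketched as follows. The helper `one_sgd_step` and the toy batch are assumptions made for illustration; the two classes are repeated from the text so that the block runs on its own.

```python
import torch
from torch.nn.functional import relu

class Parametrization(torch.nn.Module):
    def __init__(self, w_init):
        super().__init__()
        self.w = torch.nn.parameter.Parameter(w_init)
    def forward(self, x):
        return x @ self.w

class Perceptron(Parametrization):
    def forward(self, x):
        return relu(x @ self.w)

def one_sgd_step(estimator, x_batch, y_batch, lr=0.1):
    """Hypothetical helper: one SGD update, agnostic to the estimator's class."""
    optimizer = torch.optim.SGD(estimator.parameters(), lr=lr)
    optimizer.zero_grad()
    mse = torch.mean((estimator(x_batch) - y_batch) ** 2)
    mse.backward()
    optimizer.step()
    return mse.item()  # loss evaluated before the update

x_batch = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y_batch = torch.tensor([1.0, 2.0])

losses = {}
for cls in (Parametrization, Perceptron):
    estimator = cls(torch.tensor([0.5, 0.5]))
    losses[cls.__name__] = one_sgd_step(estimator, x_batch, y_batch)
print(losses)
```

The helper never refers to the class of the estimator. It only relies on the methods inherited from nn.Module and on the forward method, so swapping the parametrization requires no change to it.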

The Training Loop¶

The training loop is going to contain the instructions you expect it to contain. We read the dataset, we compute gradients and we update the parameters. The computation of the gradients is going to have a form that may look a little strange and that is the part we will explain here.

The following code trains the linear model that we encapsulated in the class defined in the previous Section,

import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
# Parameters
epsilon=0.03
w_init = torch.rand(2)-0.5
# instantiate estimator
estimator = Parametrization(w_init)
# instantiate optimizer
optimizer = optim.SGD(estimator.parameters(), lr=epsilon, momentum=0)
# Data Loader
dataset = TensorDataset(x, y)
train_loader = DataLoader(dataset, batch_size=64)
#iterate over all batches in the dataset
for x_batch, y_batch in train_loader:
    
    # set gradients to zero
    estimator.zero_grad() 
    
    # Compute predictions
    yHat = estimator.forward(x_batch)
                             
    # Compute error
    mse = torch.mean((yHat-y_batch)**2)
                             
    # Compute gradients
    mse.backward()
                             
    # Update parameters
    optimizer.step()

The first line after the import commands and parameter definitions in the code above instantiates estimator as an object belonging to class Parametrization. This object is essentially a matrix. It is not really a matrix: it is an object of class Parametrization, which inherits from class nn.Module. This endows it with methods which allow the computation of gradients. But this is transparent to us. All that matters is that estimator is a matrix that we are learning. If we want to access the actual matrix we have to call estimator.w.

The second line specifies the optimization method, which we choose to be SGD. In the specification of the optimization method, the important argument is the passing of estimator.parameters(). This is telling the optimizer which are the variables that have to be updated, or trained. If we recall the definition of the Parametrization class, we see that it contains a command in the __init__ method that specifies the trainable parameters. That command says that the trainable parameter is the attribute w. When we pass estimator.parameters() to the optimizer, we are therefore telling the optimizer that it has to update the variable w.
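What the optimizer does with the parameters it receives can be made explicit. With momentum set to zero, an SGD step is the update $w \leftarrow w - \epsilon \nabla r(w)$. The sketch below compares that formula against optimizer.step() on a hypothetical two-sample batch.

```python
import torch

w = torch.nn.parameter.Parameter(torch.tensor([1.0, -2.0]))
optimizer = torch.optim.SGD([w], lr=0.1, momentum=0)

# Hypothetical batch chosen so that the gradient is easy to compute by hand.
x_batch = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
y_batch = torch.tensor([0.0, 0.0])

optimizer.zero_grad()
mse = torch.mean((x_batch @ w - y_batch) ** 2)
mse.backward()

# Manual update: w minus the learning rate times the gradient.
expected = w.detach() - 0.1 * w.grad
optimizer.step()
print(w.detach(), expected)
```

For this batch the loss is $(w_1^2 + w_2^2)/2$, so the gradient equals $w$ itself and the step shrinks each entry by ten percent.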

Then we instantiate the torch DataLoader class, which we will use to iterate over the data batches.

Note that we write our training as a single for loop that iterates once over the whole dataset, one batch at a time. In each iteration we access a batch and perform two actions: (i) we compute stochastic gradients and (ii) we perform an SGD step by calling optimizer.step().

The command estimator.zero_grad() acts in tandem with the command mse.backward(). PyTorch accumulates gradients across calls to backward(), so estimator.zero_grad() first clears the gradients left over from the previous iteration. The call mse.backward() then computes the gradient of mse with respect to every parameter involved in the operations that produced it. In this particular case, the instruction yHat = estimator.forward(x_batch) calls the forward method of the estimator object, which we defined in the Parametrization class, and applies it to the input x_batch we read from the batch.

The instruction mse = torch.mean((yHat-y_batch)**2) computes the mean squared loss.

The mechanics of how gradients are computed are fascinating and worth learning. But you don’t need to know them to run training loops. Beginners don’t even need to modify the training loop. They just modify the Parametrization class. That suffices to train a different system. The explanations here are enough to make us Intermediate users.

These are the facts we have learned:

We have to define a Parametrization class that inherits from nn.Module. Instantiating it creates an estimator object.
We instantiate an optimizer from the optim.SGD class. This optimizer needs access to the trainable parameters of estimator, which we pass by invoking the parameters() method of the estimator object.
The zero_grad() method of the estimator object, in conjunction with the backward() method of the loss object, implements the computation of gradients. Calling estimator.zero_grad() clears the gradients accumulated in earlier iterations, and calling mse.backward() initiates a backward chain of gradient computations through the operations that produced the loss.
The mse.backward() call computes gradients with respect to all the trainable variables that are involved. Each gradient is stored in the grad attribute of the corresponding variable. Among these are the gradients with respect to estimator.parameters().
These gradients are accessed by the optimizer object to update the values of estimator.parameters().
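The interplay between zero_grad() and backward() summarized above can be observed on a one-parameter model. The class `Scalar` below is a hypothetical example whose loss is $w^2$, so that the gradient $2w$ is known in closed form.

```python
import torch
import torch.nn as nn

class Scalar(nn.Module):  # hypothetical one-parameter model
    def __init__(self):
        super().__init__()
        self.w = nn.parameter.Parameter(torch.tensor([3.0]))
    def forward(self):
        return (self.w ** 2).sum()

model = Scalar()

model().backward()
first = model.w.grad.clone()        # d(w^2)/dw = 2w = 6

model().backward()                  # no reset in between, so gradients add up
accumulated = model.w.grad.clone()  # 6 + 6 = 12

model.zero_grad()                   # clear the stored gradient
model().backward()
after_reset = model.w.grad.clone()  # 6 again
print(first.item(), accumulated.item(), after_reset.item())
```

Without the reset, the second call to backward() adds to the stored gradient instead of replacing it. This is why the training loop clears gradients at the start of every iteration.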

We will revisit these learned facts by discussing the training of a Neural Network in the next section.

Training a Neural Network¶

Suppose that we are interested in another type of parameterization. To be concrete, suppose you want to use a fully connected neural network (NN) with two layers. In this case the learning parametrization is replaced by the following composition of linear transformations and pointwise nonlinearities,

\begin{equation}\label{eqn_nn} y \ =\ H_2 z \ =\ H_2 \Big (\, \sigma \big(\, H_1 x\,\big) \,\Big) , \end{equation}

where the matrix $ H_1$ is $h \times n$ and the matrix $ H_2$ is $m \times h$. The scalar constant $h$ is the number of hidden units.

If we keep using the squared Euclidean error loss $\ell(y, \hat{y}) = \| y - \hat{y}\|^2$, the ERM problem we want to solve is obtained by replacing the linear parametrization $\hat{y} = w\mathbf{x}$ with the NN parametrization. This yields the ERM problem

\begin{align}\label{eqn_ERM_nn} (H_1^*, H_2^*) \ =\ \text{argmin}_{H_1, H_2} \frac{1}{N} \sum_{i=1}^N \, \left \| \, y_i- H_2\Big(\sigma\big( H_1 x_i\big)\Big) \,\right \|^2_2 . \end{align}

To implement SGD we need to compute gradients of the summands with respect to $H_1$ and $H_2$. This is painful. Lucky for us, we have access to automatic differentiation in PyTorch.
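To build some trust in automatic differentiation, the sketch below checks it on the simpler linear model, for which the gradient of the mean squared error has the closed form $\frac{2}{N} X^T (Xw - y)$. The random data and variable names are assumptions made for this check.

```python
import torch

torch.manual_seed(0)
X = torch.rand(8, 2, dtype=torch.float64)   # 8 hypothetical samples, 2 features
y = torch.rand(8, dtype=torch.float64)
w = torch.rand(2, dtype=torch.float64, requires_grad=True)

# Gradient computed by automatic differentiation.
mse = torch.mean((X @ w - y) ** 2)
mse.backward()

# Gradient computed from the closed-form expression.
with torch.no_grad():
    manual_grad = (2.0 / X.shape[0]) * (X.T @ (X @ w - y))

print(w.grad, manual_grad)
```

For the two-layer network the manual derivation requires the chain rule through the nonlinearity, whereas the call to backward() stays exactly the same.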

To use PyTorch to train the neural network, the training loop doesn’t have to change. All we have to do is replace the Parametrization class by the class TwoLayerNN that implements the NN parametrization. This class has an __init__ method and a forward method and is defined as follows,

class TwoLayerNN(nn.Module):
    def __init__(self, n, m, h):
        super().__init__()
        self.H1 = nn.parameter.Parameter(torch.rand(n, h))
        self.H2 = nn.parameter.Parameter(torch.rand(h, m))
    def forward(self, x):
        sigma = nn.ReLU()
        z = sigma(torch.matmul(x,self.H1))
        yHat = torch.matmul(z,self.H2)
        return yHat

The __init__ method initializes two different parameters, H1 and H2, because the NN is defined by two matrices. Note that the code stores them transposed relative to the equation above, with H1 of size $n \times h$ and H2 of size $h \times m$, because inputs are stored as rows and multiply the matrices from the left. Here we initialize the parameters at random inside the class, instead of passing an initial value as an argument, just to illustrate a different way of initializing parameters. The forward method is a straightforward implementation of the NN parametrization. We compute the intermediate output as z = sigma(torch.matmul(x,self.H1)) and we follow with the NN output computation yHat = torch.matmul(z,self.H2).

We have said that the training loop does not change. This is true except that when we instantiate the estimator object we need to instantiate it as a member of the TwoLayerNN class. For completeness, we rewrite the training loop here with that modification,

# Parameters
epsilon=0.05
# instantiate estimator
estimator = TwoLayerNN(2, 1, 40)
# instantiate optimizer
optimizer = optim.SGD(estimator.parameters(), lr=epsilon, momentum=0)
# Data Loader
dataset = TensorDataset(x, y)
train_loader = DataLoader(dataset, batch_size=64)
#iterate over all batches in the dataset
for x_batch, y_batch in train_loader:
    # set gradients to zero
    estimator.zero_grad()
    # Compute predictions
    yHat = estimator.forward(x_batch).squeeze()
    # Compute error
    mse = torch.mean((yHat-y_batch)**2)
    # Compute gradients
    mse.backward()
    # Update parameters
    optimizer.step()
# Evaluate final train Loss
mseTrain = 0
with torch.no_grad():
    for x_batch, y_batch in train_loader:
        yHatTrain = estimator.forward(x_batch).squeeze()
        mseTrain += torch.sum((yHatTrain-y_batch)**2)
mseTrain /= len(dataset)
# Print train results
print(f"Final Train loss: {mseTrain.item()} ")

Final Train loss: 1.8330609422879793

There are two differences between this training loop and the training loop for linear parametrizations. The first difference is the use of a different class for the estimator object. In this loop, the combined calls to estimator.zero_grad() and mse.backward() result in computations of gradients with respect to the NN parameters H1 and H2. The call to optimizer.step() results in a stochastic gradient update of these parameters. These changes are implemented by the expedient action of replacing the definition of the estimator object. If we want to train a graph neural network, we just need to define a proper class and instantiate a proper object. The training loop remains unchanged.

Secondly, the output of the network has one column because $m = 1$, so we call squeeze() to remove that trailing dimension before comparing the predictions with y_batch. We also evaluate the training loss once training is over. This evaluation is wrapped in torch.no_grad() because no gradients are needed for it.
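The loop above calls squeeze() on the output of the estimator. The following sketch, with a hypothetical batch of five samples, shows the shapes involved and what would happen without that call.

```python
import torch
import torch.nn as nn

class TwoLayerNN(nn.Module):
    def __init__(self, n, m, h):
        super().__init__()
        self.H1 = nn.parameter.Parameter(torch.rand(n, h))
        self.H2 = nn.parameter.Parameter(torch.rand(h, m))
    def forward(self, x):
        sigma = nn.ReLU()
        z = sigma(torch.matmul(x, self.H1))
        return torch.matmul(z, self.H2)

torch.manual_seed(0)
estimator = TwoLayerNN(2, 1, 40)
x_batch = torch.rand(5, 2)   # hypothetical batch: 5 samples, 2 features
y_batch = torch.rand(5)      # targets stored as a vector of length 5

raw = estimator(x_batch)          # shape (5, 1)
squeezed = raw.squeeze()          # shape (5,)

wrong_diff = raw - y_batch        # broadcasting produces shape (5, 5)
right_diff = squeezed - y_batch   # shape (5,)
print(raw.shape, wrong_diff.shape, right_diff.shape)
```

Without squeeze() the subtraction does not fail. It silently compares every prediction with every target, and the resulting mean is not the mean squared error. One caveat is that squeeze() without arguments would also remove the batch dimension of a batch with a single sample, so squeeze(-1) may be a safer choice in general.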

A More Comprehensive Learning Loop¶

In the last section we described a basic training loop. In this section, we introduce and discuss techniques that can be used to improve it in order to learn better representations.

1. Epochs and Batches¶

When the samples of a batch are selected at random from the training set in each training step, not every sample is guaranteed to be used. If the number of training steps is large enough, this is not an issue, as it is highly likely that all training samples have been included in a batch, and therefore used to train the model, at least once. However, the randomness of this approach might make it so that some samples are selected multiple times before the dataset is considered in full. This affects the gradient descent path and, if the number of training steps is not chosen judiciously, it can have a negative effect on the resulting model.

To address this, we can train the model in epochs, which are full passes over the dataset. In each epoch, the samples are permuted and partitioned in fixed-size batches covering the entire dataset in order to use every training sample an equal number of times. Training in epochs is helpful because epochs are more interpretable than training steps: it makes more sense to specify the number of full passes over the data than the total number of steps. Given a certain number of epochs and the size of a batch, the number of training steps is calculated as the number of epochs multiplied by the number of batches necessary to cover the training set. Note that the DataLoader in the code below is created without shuffle=True, so it visits the samples in the same order in every epoch; passing shuffle=True enables the permutation.

# Specification of learning parametrization.
estimator = TwoLayerNN(2, 1, 40)
# Parameters used in training loop
epsilon=0.05
batch_size = 64
# Specify the optimizer that is used to perform descent updates
optimizer = optim.SGD(estimator.parameters(), lr=epsilon, momentum=0)
# Data
train_dataset = TensorDataset(x, y)
# Instantiate Data Loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size)
mse_evolution = []
# Training loop
n_epochs = 200
# Iterate n_epochs times over the whole dataset.
for _ in range(n_epochs):
    epoch_mse = 0
    #iterate over all batches in the dataset
    for x_batch, y_batch in train_loader:
        
        # set gradients to zero
        estimator.zero_grad() 
        
        # Compute predictions
        yHat = estimator.forward(x_batch).squeeze()
        
        # Compute error
        mse = torch.mean((yHat-y_batch)**2)
        epoch_mse += mse.item()
        # Compute gradients
        mse.backward()
        
        # Update parameters
        optimizer.step()
        
    # Store the average of the batch MSEs seen in this epoch
    mse_evolution.append(epoch_mse / len(train_loader))
# Evaluate final train Loss
mseTrain = 0
with torch.no_grad():
    for x_batch, y_batch in train_loader:
        yHatTrain = estimator.forward(x_batch).squeeze()
        mseTrain += torch.sum((yHatTrain-y_batch)**2)
mseTrain /= len(train_dataset)
# Print train results
print(f"Train loss: {mseTrain.item()} ")

Train loss: 0.8182163128064673

_ = plt.plot(mse_evolution[:20], "-")
_ = plt.title("Mean squared error evolution")
_ = plt.xlabel("Epoch")
_ = plt.ylabel("MSE")
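The relation between epochs, batches and training steps described in this section can be verified with a short computation. The sizes below are assumptions chosen for illustration, and the random tensors are a stand-in for the dataset. The loader is created with shuffle=True so that the samples are permuted at the start of every epoch.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical sizes for illustration.
n_samples, batch_size, n_epochs = 600, 64, 200

toy_dataset = TensorDataset(torch.rand(n_samples, 2), torch.rand(n_samples))
loader = DataLoader(toy_dataset, batch_size=batch_size, shuffle=True)

# The last batch is smaller when batch_size does not divide n_samples.
batches_per_epoch = math.ceil(n_samples / batch_size)
total_steps = n_epochs * batches_per_epoch

samples_seen = sum(len(x_batch) for x_batch, _ in loader)
print(len(loader), total_steps, samples_seen)
```

Each epoch covers every sample exactly once, in ten batches for these sizes, so 200 epochs amount to 2000 SGD steps.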

2. Testing¶

The objective of learning is to obtain a model that generalizes well to unseen input data. The ability of a model to generalize is measured by the generalization error, which is the expectation of the error realized by the model on an unseen input. In order to approximate it, we need to observe the error realized on data drawn from the same distribution as the training data, but which is not used for training. This is the test set.

Unlike the validation set used to tune the hyperparameters and keep track of the best model, the samples in the test set are only accessed once the training loop is over. The learned model is run on these samples to compute the test error, which provides a measure of how well the model generalizes to unseen data. In particular, for a good model the gap between the training and the test error should be small. A large gap usually indicates that the model has overfitted the training data.

Splitting the Dataset¶

In real-world scenarios we are usually given a chunk of data consisting of all the available samples, which then have to be split between the training and test sets. In most train-test splits, the largest portion of the data (80-90%) is used for training and the rest for testing. Before splitting the data between the training and test sets, the samples are randomized. This is an important step because in real-world scenarios we don’t usually know whether the available samples are random or ordered in some way. In practice, randomizing the samples is also necessary to assess the quality of the model parametrization independently of the quality of a particular train-test split. This is done by running several experiments, where estimators are trained on multiple train-test splits to compute the average test error realized by models with a given parametrization.

# Specification of learning parametrization.
estimator = TwoLayerNN(2, 1, 40)
# Parameters used in training loop
epsilon=0.05
batch_size = 64
# Specify the optimizer that is used to perform descent updates
optimizer = optim.SGD(estimator.parameters(), lr=epsilon, momentum=0)
# Creating data indices for training and test splits:
dataset_size = len(x)
indices = list(range(dataset_size))
np.random.shuffle(indices)
numTest = int(np.floor(0.2 * dataset_size))
numTrain = dataset_size - numTest
print(f"Num Train = {numTrain}, numTest = {numTest}")
train_indices, test_indices = indices[numTest:], indices[:numTest]
# Data
train_dataset = TensorDataset(x[train_indices], y[train_indices])
test_dataset = TensorDataset(x[test_indices], y[test_indices])
# Instantiate Data Loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
mse_evolution = []
# Training loop
n_epochs = 200
# Iterate n_epochs times over the whole dataset.
for _ in range(n_epochs):
    epoch_mse = 0
    #iterate over all batches in the dataset
    for x_batch, y_batch in train_loader:
        
        # set gradients to zero
        estimator.zero_grad() 
        
        # Compute predictions
        yHat = estimator.forward(x_batch).squeeze()
        
        # Compute error
        mse = torch.mean((yHat-y_batch)**2)
        epoch_mse += mse.item()
        # Compute gradients
        mse.backward()
        
        # Update parameters
        optimizer.step()
        
    # Store the average of the batch MSEs seen in this epoch
    mse_evolution.append(epoch_mse / len(train_loader))
# Evaluate final train Loss
mseTrain = 0
with torch.no_grad():
    for x_batch, y_batch in train_loader:
        yHatTrain = estimator.forward(x_batch).squeeze()
        mseTrain += torch.sum((yHatTrain-y_batch)**2)
mseTrain /= numTrain
# Print train results
print(f"Train loss: {mseTrain.item()} ")
mseTest = 0
with torch.no_grad():
    for x_batch, y_batch in test_loader:
        yHatTest = estimator.forward(x_batch).squeeze()
        mseTest += torch.sum((yHatTest-y_batch)**2)
mseTest /= numTest 
# Print test results
print(f"Test loss: {mseTest.item()} ")

Num Train = 480, numTest = 120
Train loss: 0.7968306206843984 
Test loss: 0.9188826164916073

_ = plt.plot(mse_evolution[:20], "-")
_ = plt.title("Mean squared error evolution")
_ = plt.xlabel("Epoch")
_ = plt.ylabel("MSE")
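The text above mentions running several experiments on different train-test splits. A minimal sketch of how such splits could be generated is given below. It uses random_split from torch.utils.data rather than the manual index shuffling of the code above, and the random tensors are a hypothetical stand-in for the dataset.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in for the full dataset.
toy = TensorDataset(torch.rand(600, 2), torch.rand(600))
n_test = int(0.2 * len(toy))
n_train = len(toy) - n_test

splits = []
for seed in range(3):
    # A seeded generator makes each split reproducible.
    gen = torch.Generator().manual_seed(seed)
    train_set, test_set = random_split(toy, [n_train, n_test], generator=gen)
    splits.append((train_set, test_set))

for train_set, test_set in splits:
    print(len(train_set), len(test_set))
```

Training one estimator per split and averaging the resulting test errors would then give an assessment of the parametrization that depends less on any particular split.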