Download the Jupyter Notebook for this lab at the following link.
Find the assignment PDF at the following link.
Instructions on how to download and use Jupyter Notebooks can be found here. You can find a static version of the notebook below.
Lab 4: MNIST digit classification with CNNs¶
0. Environment setup¶
# Use this cell for Google Colab integration with Google Drive
# from google.colab import drive
# import os
# # This will prompt for authorization.
# drive.mount('/content/drive')
# # If you want to work within a specific folder, specify the path
# folder_path = '/content/drive/My Drive/ese2000 ailab/Lab4B'
# # You can then change the directory to this folder to perform operations within it
# os.chdir(folder_path)
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt
import numpy as np
import time
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device("cpu")
#device = torch.device("mps")
print(f"Using device: {device}")
Using device: cuda:0
1. Convolutional Neural Networks in Space¶
To process images we use spatial Convolutional Neural Networks (CNNs). Spatial CNNs are defined as compositions of layers, which are themselves compositions of spatial convolutional filters with pointwise nonlinearities. Thus, for a network with $L$ layers we consider $L$ different filters. The filter at Layer $\ell$ has coefficients $h_{\ell, uv}$ that are entries of the matrix $\mathbf{H}_\ell$. We then define the input-output relationship of the CNN through the recursion $$ \def\ccalH{{{\mathcal H}}} $$ \begin{equation} \tag{1} \mathbf{X}_\ell = \sigma \Big(\, \mathbf{Z}_\ell \,\Big) = \sigma \left(\, \sum_{~u={-K_\ell}~}^{K_\ell} \sum_{~v={-K_\ell}~}^{K_\ell} h_{\ell, uv} \, S_V^u \, S_H^v \, \mathbf{X}_{\ell-1} \, \right) , \end{equation}
which we initialize with $\mathbf{X}_0=\mathbf{X}$ and terminate at Layer $L$. We further define $\ccalH$ as a tensor that groups all of the filters of all of the layers and write the output of the CNN as,
\begin{equation} \tag{2} \Phi(\mathbf{X}; \ccalH) = \mathbf{X}_L. \end{equation}
This is the same architecture that we used to define CNNs for time signals, which is also the same architecture we used to define GNNs. The difference between these three types of CNNs is the use of convolutions that are adapted to the specific structure of the input. We used one dimensional convolutions to process signals in time, graph convolutions to process graph signals, and we are now using two dimensional convolutions to process images.
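To make the recursion in (1) concrete, here is a minimal sketch (not part of the lab code) that applies one layer with a single random filter to a single image. The helper name shift2d is illustrative; it implements the shift operators $S_V^u S_H^v$ with zeros outside the image boundary.
# Minimal sketch of one step of recursion (1): a single filter applied to a single image.
# shift2d implements S_V^u S_H^v X with zero fill outside the image boundary.
def shift2d(X, u, v):
    M, N = X.shape[-2:]
    Z = torch.zeros_like(X)
    Z[..., max(u, 0):M + min(u, 0), max(v, 0):N + min(v, 0)] = \
        X[..., max(-u, 0):M + min(-u, 0), max(-v, 0):N + min(-v, 0)]
    return Z

K = 1
h = torch.randn(2*K + 1, 2*K + 1)              # filter taps h_{uv}, with u, v in [-K, K]
X = torch.randn(28, 28)                        # input image X_0 = X
Z1 = sum(h[u + K, v + K] * shift2d(X, u, v)
         for u in range(-K, K + 1) for v in range(-K, K + 1))
X1 = torch.relu(Z1)                            # X_1 = sigma(Z_1)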
1.1 Layers with Multiple Features¶
$$ \def\bbX{{{\mathbf X}}} \def\bbH{{{\mathbf H}}} \def\bbZ{{{\mathbf Z}}} \def\Sv{\mathcal{S}_{\mathrm{V}}} \def\Sh{\mathcal{S}_{\mathrm{H}}} $$ As we did for one dimensional CNNs and GNNs, we increase the representation power of two dimensional CNNs with the addition of multiple features per layer. To explain this, we introduce the notation $\bbH \star \bbX$ to denote the spatial convolution of filter $\bbH$ with image $\bbX$,
\begin{equation} \tag{3} \bbZ = \bbH \star \bbX = \sum_{~k={-K}~}^{K} \sum_{~l={-K}~}^{K} h_{kl} \, \Sv^k \, \Sh^l \, \bbX . \end{equation}
With this notation the CNN recursion in (1) can be written as
\begin{equation}\tag{4} \bbX_\ell = \sigma \Big(\, \bbZ_\ell \,\Big) = \sigma \Big(\, \bbH_\ell \star \bbX_{\ell-1} \, \Big) . \end{equation}
In a CNN with multiple features, each layer processes multiple images in parallel to produce a number of images at the output. To write this formally we let each layer produce as an output a collection of $F_{\ell}$ features $\bbX_\ell^g$. Each of these features is an image. These features are produced by processing the $F_{\ell-1}$ features $\bbX_{\ell-1}^f$ that are output by Layer $\ell-1$.
The mapping from features $\bbX_{\ell-1}^f$ into features $\bbX_{\ell}^g$ is determined by a collection of convolutional filters $\bbH^{fg}$. The specific relationship is
\begin{equation}\tag{5} \bbX_\ell^g = \sigma\left(\sum_{f=1}^{F_{\ell-1}} \bbH^{fg} \star \bbX_{\ell-1}^f\right) \, . \end{equation}
The expression in (5) is such that all input features $\bbX_{\ell-1}^f$ can affect all output features $\bbX_\ell^g$. The influence of input $\bbX_{\ell-1}^f$ on output $g$ is the convolution $\bbH^{fg} \star \bbX_{\ell-1}^f$. All of the convolutions for a fixed output feature index $g$ are summed and then passed through a pointwise nonlinearity.
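As a quick illustration of (5) (a sketch, not part of the lab code), PyTorch's F.conv2d with a weight tensor of shape $F_\ell \times F_{\ell-1} \times (2K+1) \times (2K+1)$ performs exactly this sum of per-feature convolutions; the shapes below are arbitrary.
# Sketch of a multi-feature layer as in (5): each output feature g sums the
# convolutions H^{fg} * X^f over all input features f, then applies the nonlinearity.
F_in, F_out, K, M = 2, 3, 1, 28
H = torch.randn(F_out, F_in, 2*K + 1, 2*K + 1)   # one (2K+1)x(2K+1) filter per (f, g) pair
X_in = torch.randn(1, F_in, M, M)                # a batch with one image of F_in features
X_out = torch.relu(F.conv2d(X_in, H, padding=K)) # shape (1, F_out, M, M)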
1.2 Spatial Convolutional Neural Network Specification¶
To specify a CNN we need to specify the number of layers $L$ and the characteristics of the filters that are used at each layer. The latter are the number of filter taps $K_\ell$ and the number of features $F_\ell$ at the output of the layer. The number of features $F_0$ must match the number of features at the input and the number of features $F_L$ must match the number of features at the output. Observe that the number of features at the output of Layer $(\ell-1)$ determines the number of features at the input of Layer $\ell$.
Task 1:¶
Program a class that implements a spatial CNN with $L$ layers. This class receives as initialization parameters a CNN specification consisting of the number of layers $L$ and vectors $[K_1, \ldots, K_L]$ and $[F_0, F_1, \ldots, F_L]$ containing the number of taps and the number of features of each layer.
Endow the class with a method that takes an input feature $\mathbf{X}$ and produces the corresponding output feature $\mathbf{\Phi}(\mathbf{X}; \mathcal{H})$.
Loading the data¶
# Image normalization
# Converts the image into a tensor and normalizes to achieve a mean of 0 and variance of 1
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
# Download MNIST train and test set and normalize images
train_set = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_set = datasets.MNIST('./data', train=False, download=True, transform=transform)
Visualize images¶
# Creating a matplotlib grid to plot three images.
fig, axes = plt.subplots(1,3, figsize=(10, 10))
# axes[i].imshow takes a matrix and plots its pixel values as an image. The cmap argument specifies to plot
# the image in black and white.
# .squeeze() removes the extra dimension that is added by default when the image is loaded
# Image 1
img_index1 = 20
image1, label1 = train_set[img_index1]
axes[0].imshow(image1.squeeze(), cmap='binary_r')
axes[0].set_title(f"Label: {label1}")
axes[0].axis('off')
print(f"Image 1 size: {image1.squeeze().shape}")
# Image 2
img_index2 = 100
image2, label2 = train_set[img_index2]
axes[1].imshow(image2.squeeze(), cmap='binary_r')
axes[1].set_title(f"Label: {label2}")
axes[1].axis('off')
print(f"Image 2 size: {image2.squeeze().shape}")
# Image 3
img_index3 = 1000
image3, label3 = train_set[img_index3]
axes[2].imshow(image3.squeeze(), cmap='binary_r')
axes[2].set_title(f"Label: {label3}")
axes[2].axis('off')
print(f"Image 3 size: {image3.squeeze().shape}")
plt.tight_layout()
plt.show()
Image 1 size: torch.Size([28, 28])
Image 2 size: torch.Size([28, 28])
Image 3 size: torch.Size([28, 28])
Solving task¶
# We use the same code as in the previous lab (Lab 4B)
class ConvolutionalFilter2D(nn.Module):
"""
A very simple implementation of a convolutional filter. Given an image, it returns the result of convolving the image with the filter.
It is very simplified and processes each filter sequentially, to make the operation easier to understand.
It also assumes F=1 because MNIST has only one input channel.
A more sophisticated version of this filter would process all the filters and images in parallel, to take advantage of the batch processing.
Args:
num_filters (Int): The number of filters in the bank.
K (Int): Defines the size of each filter in the bank. The true size is (2K+1)^2.
Recall that we sweep from -K to K. So if K is 1, each filter in our bank is a 3x3 grid.
The filter is assumed to be square.
"""
def __init__(self, num_filters, K):
super(ConvolutionalFilter2D, self).__init__()
        # L is the side length of each filter (for the example above L = 1*2 + 1 = 3 and each filter is LxL = 3x3)
L = K*2+1
# The number of filters in the bank determines the number of output channels AKA output features
self.num_filters = num_filters
self.filters = torch.zeros((num_filters, L, L))
def forward(self, X):
"""
Performs a 2D convolution on the given image with the given
Args:
X (torch.Tensor): input image of shape (B, F=1, M, N)
filters (torch.Tensor): filters of shape (G, 2K+1, 2K+1).
Returns:
torch.Tensor: Convolution of the image with shape (G, M, N).
"""
# We are assuming square filters. The true size is filter_size^2
G, num_coefficients, _ = self.filters.shape
B, M, N = X.shape # Here F=1
# By definition num_coefficients = 2*K+1
K = (num_coefficients-1)//2
# First we will create a padded version of our image, to follow the convention that SX_ij = 0 for i,j out of bounds
        # Create the padded dimensions. There are 2K elements (K on each side) to be added to each dimension of the image.
V_pad = M + 2*K
H_pad = N + 2*K
# Initialize padded image with the new dimensions
X_padded = torch.zeros((B, G, V_pad, H_pad))
# Place the original images in the center of the X_padded tensor.
X_padded[:, :, K:(V_pad-K), K:(H_pad-K)] = X.unsqueeze(1)
# Initialize the convolution output tensor
Y = torch.zeros((B, G, M, N))
# Iterate over the filters
for g in range(G):
# Get the current filter H^(g)
H = self.filters[g]
# Iterate over the filter taps (these are the parameters we are learning)
for u in range(2*K+1):
for v in range(2*K+1):
# Shift the image vertically and horizontally
# This operation is equivalent to Z_uv = S_V^u * S_H^v * X.
# u:u + M shifts in the vertical direction
# v:v + N shifts in the horizontal direction
Z_uv = X_padded[:, g, u:u + M, v:v + N]
# (ALTERNATIVE IMPLEMENTATION)
# Another way to implement the line above is using torch.roll: roll u to the top and v to the left
# Roll-based version: roll u to the top and v to the left, then take the first MxN elements (to match the dimensions of Y)
#Z_uv = torch.roll(torch.roll(X_padded[g], -u, dims=0), -v, dims=1)[:M, :N]
# Multiply the kernel with the rolled image and sum the result:
# conceptually this operation is Y[g] = Z_uv * H
Y[:,g] += Z_uv * H[u, v]
return Y
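As an optional sanity check (a sketch, not part of the original lab), we can verify that this loop-based filter matches PyTorch's F.conv2d when both use the same filter bank and zero padding of K pixels.
# Optional sanity check: with identical weights, ConvolutionalFilter2D and F.conv2d
# (with zero padding K) should produce the same output.
torch.manual_seed(0)
K_check, G_check = 1, 3
filt = ConvolutionalFilter2D(num_filters=G_check, K=K_check)
filt.filters = torch.randn(G_check, 2*K_check + 1, 2*K_check + 1)  # random filter bank
X_check = torch.randn(2, 28, 28)                                   # two single-channel images
Y_ours = filt(X_check)                                             # shape (2, G_check, 28, 28)
Y_torch = F.conv2d(X_check.unsqueeze(1), filt.filters.unsqueeze(1), padding=K_check)
print(torch.allclose(Y_ours, Y_torch, atol=1e-5))                  # expected: True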
We have provided two solutions for Task 1. In the first solution we use our own implementation of the convolutional filter. In the second implementation we call the PyTorch implementation. We do that because the PyTorch function is a wrapper that calls a compiled (C++/CUDA) implementation of convolutions with multiple features. This is computationally more efficient.
Solution 1: Using our implementation of the convolution.¶
class CNN1(nn.Module):
def __init__(self, L=1, taps=[1,]):
"""
A simple implementation of a convolutional neural network, making use of the convolutional filter previously defined.
Args:
L (Int): The number of layers in the network.
taps (list of length L): value of k for each layer k_1,...,k_L
"""
super(CNN1, self).__init__()
# We initialize a list in which to save all the layers of the network.
self.convs = []
for layer in range(L):
# Filter Bank with one filter and the corresponding taps
self.convs.append(ConvolutionalFilter2D(num_filters=1, K=taps[layer]))
# Store conv list as module_list
self.convs = torch.nn.ModuleList(self.convs)
# activation
self.act = nn.ReLU()
def forward(self, x):
# Iterate over layers
for conv in self.convs:
x = conv(x)
x = self.act(x)
return x
Solution 2: Using pytorch implementation of the convolution.¶
class CNNPytorch(nn.Module):
def __init__(self, L=1, taps=[1,], features=[1,3]):
"""
param:L: (int) number of layers
param:taps: (list of length L) value of k for each layer k_1,...,k_L
param:features: (list of length L+1) number of features f_0,...,f_L
"""
super(CNNPytorch, self).__init__()
        # Check that the CNN specification has consistent dimensions
        assert len(taps) == L and len(features) == L + 1, "CNN specification has inconsistent dimensions"
        self.convs = []
for layer in range(L):
# Filter Bank with num_filter filters and num_taps taps.
self.convs.append(nn.Conv2d(in_channels = features[layer], out_channels = features[layer+1],
kernel_size = 2*taps[layer]+1, stride = 1, padding = 'same', bias=False))
# Store conv list as module_list
self.convs = torch.nn.ModuleList(self.convs)
# activation
self.act = nn.ReLU()
def forward(self, x):
# Iterate over layers
for conv in self.convs:
x = conv(x)
x = self.act(x)
return x
Task 2¶
Instantiate a CNN with 1 layer. In this CNN we have $K_1=1$, $F_0=1$, and $F_1=3$. Set the filters to
\begin{equation} \tag{6} {\small \mathbf{H}^{f1} = \left[\begin{array}{rrr} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{array} \right] ,~ \mathbf{H}^{f2} = \left[\begin{array}{rrr} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{array} \right],~ \mathbf{H}^{f3} = \left[\begin{array}{rrr} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{array} \right]}. \end{equation}
Create an input signal made up of three different images that correspond to the same digit. Process these inputs with the CNN and plot the outputs. Comment on any interesting observations.
# Instantiate our previous model
model = CNNPytorch(L=1, taps=[1], features=[1,3])
# Define the weights of the filters. Torch expects H to be f_out x f_in x 3 x 3
# we define each filter as a tensor of 1 x 3 x 3
f1 = torch.ones((1, 3, 3)) # Hf1
f2 = torch.tensor([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]]).unsqueeze(0) # Hf2
f3 = torch.tensor([[-1.0, -1.0, -1.0],
[0.0, 0.0, 0.0],
[1.0, 1.0, 1.0]]).unsqueeze(0) # Hf3
# Stack the filters to obtain 3 x 1 x 3 x 3
H = torch.stack((f1, f2, f3))
# We finally load the weights into the model
model.convs[0].weight.data = H
# We find three samples of the number "4" in the training set and take a look at the images.
class_idx = 4
num_samples = 3
samples = []
for x, y in train_set:
if y==class_idx:
samples.append(x)
plt.imshow(x.squeeze(), cmap = 'gray')
plt.show()
if len(samples)==num_samples:
break
samples = torch.stack(samples)
# We compute the output of the CNN when the inputs are the previous images of the number "4".
# We need to do this without computing gradients.
with torch.no_grad():
output = model(samples)
# We plot the results
fig, axs = plt.subplots(nrows=3, ncols=3, layout=None)
for img_idx, sample in enumerate(samples):
for out_feat in range(3):
axs[img_idx, out_feat].imshow(output[img_idx, out_feat].squeeze(), cmap='gray')
axs[img_idx, out_feat].set_title(f'Image {img_idx}, f{out_feat+1}')
# Hide X and Y axes label marks
axs[img_idx, out_feat].xaxis.set_tick_params(labelbottom=False)
axs[img_idx, out_feat].yaxis.set_tick_params(labelleft=False)
# Hide X and Y axes tick marks
axs[img_idx, out_feat].set_xticks([])
axs[img_idx, out_feat].set_yticks([])
2. Pooling¶
As we did for signals in time, we define a pooling operator to reduce dimensionality. The procedures are analogous except that when we implement pooling in images we need to average in the horizontal and vertical direction. For instance average pooling is an average over a square window with $\Delta$ pixels in each direction,
\begin{equation}\tag{7} x(m,n) ~=~ \frac{1}{\Delta^2} \left( \, \sum_{u=m\Delta}^{m\Delta + (\Delta-1)} \sum_{v=n\Delta}^{n\Delta + (\Delta-1)} w(u,v) \,\right) ~. \end{equation}
Likewise, max pooling chooses the maximum value over a square window with $\Delta$ pixels in each direction,
\begin{equation} \tag{8} x(m,n) ~=~ \max_{~u\in [m\Delta,~ m\Delta + (\Delta-1)]~} \max_{~v\in[n\Delta,~ n\Delta + (\Delta-1)]~} w(u,v) ~. \end{equation}
Max pooling is the most common choice in practice but average pooling is also popular. As in the case of signals in time, pooling is effective when the elements that are being pooled are similar. In that case, there is not much difference between using max or average pooling.
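As a short illustration (a sketch for intuition only, not the implementation used below), PyTorch's built-in nn.AvgPool2d and nn.MaxPool2d implement (7) and (8) over non-overlapping $\Delta \times \Delta$ windows.
# Toy example of (7) and (8) with Delta = 2 on a 4x4 "image"
W = torch.arange(16.0).reshape(1, 1, 4, 4)   # batch and feature dimensions of size 1
avg = nn.AvgPool2d(kernel_size=2)(W)         # average over non-overlapping 2x2 windows
mx = nn.MaxPool2d(kernel_size=2)(W)          # maximum over the same windows
print(W.squeeze(), avg.squeeze(), mx.squeeze(), sep='\n')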
Task 3¶
Modify the CNN of Task 1 to incorporate pooling. This can be done by modifying the method that implements convolutional layers to incorporate the pooling operation.
As in any CNN the initialization parameters include the number of layers $L$ along with vectors $[K_1, \ldots, K_L]$ and $[F_0, F_1, \ldots, F_L]$. These vectors contain the number of taps $K_\ell$ of the filters used at each layer and the number of features $F_\ell$ at the output of each layer.
Since we are incorporating pooling the initialization parameters must also include the vector $[N_0, N_1, \ldots, N_{L}]$ containing the dimension $N_\ell$ of the features at the output of each layer. Notice that $N_0$ matches the dimension of the input signal.
The forward method of this class takes a tensor $\mathbf{X}$ in which each slice is an image with $N_0\times N_0$ pixels. The total number of slices in this tensor is $F_0$.
Use ReLU nonlinearities in all layers. Use average pooling in all convolutional layers.
First we implement average pooling, reusing the shifting idea from our convolution implementation.
def avg_pool(images, delta=2):
"""
Performs 2D convolution over a set of batched images
Args:
image (torch.Tensor): input image of shape (batch_size, num_features, height, width).
delta (torch.Tensor): widow size in each direction
Returns:
torch.Tensor: pooled image of shape (batch_size, num_features, height//delta, width//delta).
"""
batch_size, num_features, height, width = images.shape
# Initialize output tensor
output = torch.zeros((batch_size, num_features, height // delta, width // delta), device=images.device)
# We define the indexes to keep just one value per window
pool_idx_x = torch.arange(0, height, delta, device=images.device)
pool_idx_y = torch.arange(0, width, delta, device=images.device)
# Meshgrid computes the combinations of indexes
# eg if pool_idx_x = pool_idx_y = (1, 3)
# grid_x = (1, 1, 3, 3), grid_y = (1, 3, 1, 3)
# so the points of the form (grid_x[i], grid_y[i]) correspond to the grid
# (1, 1), (1, 3), (3,1), (3,3)
grid_x, grid_y = torch.meshgrid(pool_idx_x, pool_idx_y, indexing='ij')
# Average over the pooling window
for m in range(delta):
for n in range(delta):
# Shift the image vertically and then horizontally
rolled_image = torch.roll(torch.roll(images, -m, dims=2), -n, dims=3)
            # Accumulate the pixel at offset (m, n) within each pooling window
output += rolled_image[:,:, grid_x, grid_y]
# Divide by delta^2
output = output/(delta**2)
return output
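As an optional check (a sketch, not part of the original lab), this implementation should agree with PyTorch's F.avg_pool2d when the window size divides the image dimensions.
# Optional sanity check: our avg_pool should match F.avg_pool2d for non-overlapping windows
X_check = torch.randn(2, 3, 28, 28)
print(torch.allclose(avg_pool(X_check, delta=2), F.avg_pool2d(X_check, kernel_size=2), atol=1e-6))  # expected: True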
Now we will modify the CNN class from Task 1 to add pooling.
class CNNPooling(nn.Module):
def __init__(self, L=2, taps=[1,1], features=[1,4, 8], feat_dim=[28, 14, 7]):
"""
param:L: (int) number of layers
param:taps: (list of length L) value of k for each layer k_1,...,k_L
param:features: (list of length L+1) number of features f_0,...,f_L
        param:feat_dim: (list of length L+1) feature dimension (image side length) N_0,...,N_L
"""
super(CNNPooling, self).__init__()
        # Similarly, we define an empty list and append the layers of the CNN to said list.
self.L = L
convs = []
self.deltas = []
for layer in range(L):
# Filter Bank using Pytorch's implementation
convs.append(nn.Conv2d(in_channels = features[layer], out_channels = features[layer+1],
kernel_size = 2*taps[layer]+1, stride = 1, padding = 'same', bias=False))
# store delta for each average pooling layer
self.deltas.append(feat_dim[layer]//feat_dim[layer+1])
# Store conv list as module_list
self.convs = torch.nn.ModuleList(convs)
# Activation
self.act = nn.ReLU()
def forward(self, x):
# Iterate over layers and apply the corresponding operations
for l in range(self.L):
x = self.convs[l](x)
x = self.act(x)
x = avg_pool(x, self.deltas[l])
return x
Task 4¶
Modify the CNN of Task 3 to incorporate a readout layer. This readout layer is the same as the readout layer that we implemented for time convolutions in Lab 2C.
class CNNPoolingAndReadout(nn.Module):
    def __init__(self, L=2, taps=[1,1], features=[1,4, 8], feat_dim=[28, 14, 7], num_classes=10):
"""
param:L: (int) number of layers
param:taps: (list of length L) value of k for each layer k_1,...,k_L
param:features: (list of length L+1) number of features f_0,...,f_L
        param:feat_dim: (list of length L+1) feature dimension (image side length) N_0,...,N_L
        param:num_classes: (int, default 10) number of output classes for the readout layer
"""
super(CNNPoolingAndReadout, self).__init__()
        # Similarly, we define an empty list and append the layers of the CNN to said list.
self.L = L
convs = []
self.deltas = []
for layer in range(L):
# Filter Bank using Pytorch's implementation
convs.append(nn.Conv2d(in_channels=features[layer], out_channels=features[layer + 1],
kernel_size=2 * taps[layer] + 1, stride=1, padding='same', bias=False))
# store delta for each average pooling layer
self.deltas.append(feat_dim[layer] // feat_dim[layer + 1])
# Store conv list as module_list
self.convs = torch.nn.ModuleList(convs)
# Activation
self.act = nn.ReLU()
# Readout layer
readout_in_dim = features[-1]*feat_dim[-1]**2
        readout_init = torch.randn(readout_in_dim, num_classes)
self.readout = nn.Parameter(readout_init)
def forward(self, x):
# Iterate over layers and apply the corresponding operations
for l in range(self.L):
x = self.convs[l](x)
x = self.act(x)
x = avg_pool(x, self.deltas[l])
# Flatten the output for the readout layer
x = x.flatten(start_dim=1)
x = x@self.readout
return x
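As a quick shape check (a sketch, not required by the task), the default specification maps a batch of $28\times 28$ single-channel images to one logit per digit class.
# Quick shape check: 28x28 inputs -> 14x14 -> 7x7 feature maps -> 10 class logits
check_model = CNNPoolingAndReadout(L=2, taps=[1, 1], features=[1, 4, 8], feat_dim=[28, 14, 7])
with torch.no_grad():
    logits = check_model(torch.randn(5, 1, 28, 28))
print(logits.shape)  # expected: torch.Size([5, 10])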
Task 5:¶
Instantiate the CNN of Task 4 with 2 convolutional layers.
Use this CNN to train a classifier for digit classification. Remember that this requires splitting the dataset into train and test sets. Evaluate the train loss, the test loss, and the classification error rate.
# Instantiate the CNN of task 4, using pooling and readout
model = CNNPoolingAndReadout(L=2, taps=[1,1], features=[1,4, 8], feat_dim=[28, 14, 7])
Train¶
We train and evaluate the model.
# We define a function to evaluate the loss. This is an auxiliary function. We do not need it to
# implement or train the neural network. We use it for evaluation after training and/or at some
# intermediate training checkpoints. This is just for visualization.
def evaluate(test_dataloader, estimator):
"""
Evaluate the performance of the estimator on the given dataloader.
Args:
        test_dataloader (torch.utils.data.DataLoader): Data on which to evaluate performance.
estimator (torch.nn.Module): Learning parameterization to evaluate.
Returns:
float: Accuracy of the estimator evaluated on the dataloader.
"""
correct = 0 # Initialize counter for correct predictions
with torch.no_grad(): # Disable gradient computations (not needed for evaluation)
for data, target in test_dataloader: # Sweep over all batches
data = data.to(device) # Move the data tensor to the device (e.g. from CPU memory to GPU memory)
target = target.to(device) # Move the target tensor to the device (e.g. from CPU memory to GPU memory)
output = estimator(data) # Get model predictions for the batch
# Select the most likely index (class) from the output tensor
pred = output.argmax(dim=-1)
# Count the number of correct predictions in the batch
correct += torch.sum(pred == target)
    # Calculate overall accuracy: fraction of correct predictions
test_accuracy = correct / len(test_dataloader.dataset)
return test_accuracy
# Instantiate model
model = CNNPoolingAndReadout(L=2, taps=[1,1], features=[1,4, 8], feat_dim=[28, 14, 7])
model = model.to(device)
# Set the parameters of stochastic gradient descent (SGD). These include the learning rate,
# batch size, and number of epochs.
lr = 0.01
batch_size = 128
n_epochs = 5
optimizer = optim.SGD(model.parameters(), lr=lr)
# Specify the loss function. For the CNN, we use cross-entropy loss since this is a classification task.
loss = nn.CrossEntropyLoss()
# The train and test dataloaders handle randomized batches in the training set and non-shuffled
# batches in the test set. We keep the same structure as before, where we use these dataloaders
# to handle loading data into memory.
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=1024, shuffle=False)
# Initialize null structures for storing the evolution of the training loss and test accuracy
# at the end of each epoch. This is not needed for training, just for displaying results.
losses = []
test_acc = []
print('\n')
# Begin the training loop. This is similar to the stochastic gradient descent (SGD) implementation
# used for the convolutional filter. In the inner loop, we sweep over batches of size batch_size.
# In the outer loop, we sweep over epochs. One epoch is a complete pass through the shuffled dataset.
#
# We follow the three steps required to run SGD: (i) Load the data, (ii) Evaluate Gradients,
# (iii) Take a gradient descent step. The only difference is the estimator, which is now a CNN.
for epoch in range(n_epochs): # Iterate over n_epochs epochs
for batch_idx, (x_batch, y_batch) in enumerate(train_loader): # Iterate over all batches in the dataset
# (Step i) Load the data. These commands send the data to the GPU memory.
x_batch = x_batch.to(device)
y_batch = y_batch.to(device)
# (Step ii) Compute the gradients. Use automatic differentiation.
model.zero_grad() # Reset the gradients to zero
y_hat = model(x_batch).squeeze() # Forward pass through the CNN and squeeze the output.
cross_entropy_value = loss(y_hat, y_batch.type(torch.LongTensor).to(device)) # Compute the loss.
cross_entropy_value.backward() # Compute gradients by moving backwards to the gradient reset.
# (Step iii) Update parameters by taking an SGD (or other optimizer) step.
optimizer.step()
# Print training stats at specified intervals to track progress.
        if batch_idx % 50 == 0:
            print(f"Train Epoch: {epoch+1} \tLoss: {cross_entropy_value.item():.3f}")
# Record the loss at each iteration for visualization.
losses.append(cross_entropy_value.item())
# End of batch loop.
# Evaluate the performance of the CNN on the test set at the end of each epoch.
test_accuracy = evaluate(test_loader, model)
test_acc.append(test_accuracy.cpu())
# Print the test accuracy at the end of each epoch to track performance.
    print(f'Epoch {epoch+1} / {n_epochs}: Test Accuracy: {test_accuracy*100:.2f}%')
    # End of epoch loop.
print('\n\nDone\n')
# Plot the training loss versus the number of iterations.
plt.plot(losses)
plt.title("training loss")
Train Epoch: 1 	Loss: 8.802
Train Epoch: 1 	Loss: 1.978
Train Epoch: 1 	Loss: 1.763
Train Epoch: 1 	Loss: 1.740
Train Epoch: 1 	Loss: 1.593
Train Epoch: 1 	Loss: 1.416
Train Epoch: 1 	Loss: 1.329
Train Epoch: 1 	Loss: 1.327
Train Epoch: 1 	Loss: 1.212
Train Epoch: 1 	Loss: 1.225
Epoch 1 / 5: Test Accuracy: 62.32%
Train Epoch: 2 	Loss: 1.243
Train Epoch: 2 	Loss: 1.107
Train Epoch: 2 	Loss: 1.132
Train Epoch: 2 	Loss: 1.090
Train Epoch: 2 	Loss: 0.964
Train Epoch: 2 	Loss: 1.054
Train Epoch: 2 	Loss: 0.994
Train Epoch: 2 	Loss: 1.015
Train Epoch: 2 	Loss: 1.008
Train Epoch: 2 	Loss: 0.866
Epoch 2 / 5: Test Accuracy: 70.89%
Train Epoch: 3 	Loss: 0.880
Train Epoch: 3 	Loss: 0.828
Train Epoch: 3 	Loss: 0.950
Train Epoch: 3 	Loss: 0.788
Train Epoch: 3 	Loss: 0.828
Train Epoch: 3 	Loss: 1.030
Train Epoch: 3 	Loss: 0.889
Train Epoch: 3 	Loss: 0.855
Train Epoch: 3 	Loss: 0.814
Train Epoch: 3 	Loss: 0.707
Epoch 3 / 5: Test Accuracy: 75.60%
Train Epoch: 4 	Loss: 0.846
Train Epoch: 4 	Loss: 0.698
Train Epoch: 4 	Loss: 0.962
Train Epoch: 4 	Loss: 0.808
Train Epoch: 4 	Loss: 0.608
Train Epoch: 4 	Loss: 0.568
Train Epoch: 4 	Loss: 0.803
Train Epoch: 4 	Loss: 0.689
Train Epoch: 4 	Loss: 0.795
Train Epoch: 4 	Loss: 0.771
Epoch 4 / 5: Test Accuracy: 78.11%
Train Epoch: 5 	Loss: 0.743
Train Epoch: 5 	Loss: 0.603
Train Epoch: 5 	Loss: 0.647
Train Epoch: 5 	Loss: 0.624
Train Epoch: 5 	Loss: 0.589
Train Epoch: 5 	Loss: 0.571
Train Epoch: 5 	Loss: 0.610
Train Epoch: 5 	Loss: 0.726
Train Epoch: 5 	Loss: 0.561
Train Epoch: 5 	Loss: 0.512
Epoch 5 / 5: Test Accuracy: 80.86%
Done
Text(0.5, 1.0, 'training loss')