Download the files for Lab 5A from the following links:

We recommend that you use Google Colab, as training will be faster on the GPU.

To enable the GPU on Colab, go to Edit / Notebook settings / Hardware accelerator / select T4 GPU

Instructions on how to download and use Jupyter Notebooks can be found here. You can find a static version of the notebook below.

Lab_9

Lab 6: Language Models¶

We can think of language as a time series. In this interpretation, each word in a sentence corresponds to the equivalent of a different point in time and the words themselves represent different vectors of the time series. Consider as an example the first lines spoken by Miranda in The Tempest^[1]:

If by your art, my dearest father, you have put the wild waters in this roar, allay them.

We can, as we illustrate in Figure fig_language_is_a_time_series, parse this sentence as a time series. In this time series the first vector is $\mathbf{x}_0 = \text{“If”}$, the second vector is $\mathbf{x}_1 = \text{“by”}$, the third is $\mathbf{x}_2 = \text{“your”}$, and so on.

If we interpret language as a time series, we can use a transformer to predict the next word in a sequence as we did in Chapter 5. If we then execute this predictor recursively, we can use it to predict several words in a sequence. This is a strategy for generating language.

The first challenge to implement this strategy is how to represent words numerically. We do that with word embeddings as we discuss in the following section.

0. Environment setup¶

In [1]:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import os
from IPython.core.display_functions import clear_output
import matplotlib.pyplot as plt
import math
import wandb
import re
from tqdm import tqdm,trange
device = "cpu"
if torch.backends.mps.is_available():
    device = "mps:0"
elif torch.cuda.is_available():
    device = "cuda:0" 
print(f"Using device: {device}")

Using device: cuda:0

Word Embeddings¶

A simple approach is to encode words in the index of a long vector. Formally, suppose that we are given a collection of texts that collectively contain a total of $c$ words. We then consider a set of $c$ vectors $\mathbf{e}_i$ whose length is also $c$. These vectors have all zero entries except for the $i$th entry which we set to 1:

$$ (\mathbf{e}_i)_i = 1 \quad \text{~and~} \quad (\mathbf{e}_i)_j = 0 \text{~~for~~} i \neq j. $$

We use the vector $\mathbf{e}_i$ to encode the $i$th word in the corpus.

In corpuses used in practice, we have thousands of different words and anywhere between hundreds of thousands to trillions of sentences. In this lab, we work with a subset of Shakespeare’s plays which contains $c = 14,295$ different words and a total of 292,072 words in the corpus. But to illustrate ideas, let us work with a corpus made up of just two quotes:

If by your art, my dearest father, you have put the wild waters in this roar, allay them.
Sir, are not you my father?

In this corpus, we have a total of 24 different words including 3 punctuation marks. We therefore represent the words in the corpus with $c = 24$ vectors of length $c = 24$. Proceeding in the order of the sentence, the vector $\mathbf{e}_1 = [1, 0, \ldots, 0]$ represents the word “If,” the vector $\mathbf{e}_2 = [0, 1, 0, \ldots, 0]$ represents the word “by,” and so on. The word “father” is the eighth word that appears in the sentence and is therefore represented by the vector $\mathbf{e}_8$. This vector’s value at index 8 is $(\mathbf{e}_8)_8 = 1$ and all of its other entries are zero.

When the same word appears again in the corpus, we encode it with the same vector. For example, when the word “father” appears a second time, we still encode it with the vector $\mathbf{e}_8$. This also happens with the comma (“,”) which appears three times and is encoded with the vector $\mathbf{e}_5$ in all three appearances, and with the words “my” and “you” that appear twice and are encoded in the vectors $\mathbf{e}_6$ and $\mathbf{e}_9$. So encoded, our corpus becomes:

$\mathbf{e}_1 \quad \mathbf{e}_2 \quad \mathbf{e}_3 \quad \mathbf{e}_4 \quad \mathbf{e}_5 \quad \mathbf{e}_6 \quad \mathbf{e}_7 \quad \mathbf{e}_8 \quad \mathbf{e}_5 \quad \mathbf{e}_9 \quad \mathbf{e}_{10}$
$\mathbf{e}_{11} \quad \mathbf{e}_{12} \quad \mathbf{e}_{13} \quad \mathbf{e}_{14} \quad \mathbf{e}_{15} \quad \mathbf{e}_{16} \quad \mathbf{e}_{17} \quad \mathbf{e}_5$
$\mathbf{e}_{18} \quad \mathbf{e}_{19} \quad \mathbf{e}_{20}$
$\mathbf{e}_{21} \quad \mathbf{e}_5 \quad \mathbf{e}_{22} \quad \mathbf{e}_{23} \quad \mathbf{e}_9 \quad \mathbf{e}_6 \quad \mathbf{e}_8 \quad \mathbf{e}_{24}$

This is a defilement of Shakespeare’s work. However, this representation of the corpus can be processed with numerical techniques.

Encoding language with these index vectors is not creative and does not work well. We discuss more interesting and useful word embeddings in the next section.

Task 1¶

Get the data for this lab from dsd.seas.upenn.edu/lab6 and load it into your environment. This is a text file containing around 40,000 lines of dialogue from Shakespeare’s plays. Split the text into words, defined here to include punctuation marks and line breaks. We associate words with vectors $\mathbf{e}_i$ as in equation $(1)$. Since it would be wasteful to store vectors in which all but one entry is 0, we just store the index of the vector that represents each individual word. For example, if “father” is represented by the index vector $\mathbf{e}_8$, we do not store $\mathbf{e}_8$ to represent this word; we just store the index $i=8$.

Implement a function that turns a word into an index and the inverse function that turns an index into a word. We recommend that you use the code that we provide for this task. It is a somewhat cumbersome and not very enlightening activity.

Data¶

Data Loading¶

In [2]:

with open('input.txt') as f:
    text = f.read()
    

print("----Sample Shakespeare----")
print(text[:250])

----Sample Shakespeare----
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

Tokenization¶

Tokenization converts raw sub-sequences of text (substrings) to sequences of integers. For example, "ll." -> 208. We will be developing a character level language model, so we will be converting each individual word into an integer. For example, "Hello" -> 48.

In [3]:

def split_to_words(text):
    return re.findall(r"\w+(?:'\w+)?|[.,!?;:\"()\[\]{}<>\\/\-—–…]|\n", text)

vocab = list(set(split_to_words(text)))
c = len(vocab)
print("Number of words: {}".format(len(split_to_words(text))))
print("Number of distinct words in text: {}".format(c))

Number of words: 292072
Number of distinct words in text: 14295

Functions to encode and decode words into indices¶

In [4]:

stoi = {word:i for i, word in enumerate(vocab)}
itos = {i:word for i, word in enumerate(vocab)}

def words_to_tokens(words):
    """
    Convert a list of words to a list of tokens
    """
    return [stoi[w] for w in words]

def tokens_to_words(index_list):
    """
    Convert a list of tokens to a list of words
    """
    decoded = " ".join([itos[i] for i in index_list])
    return re.sub(r'\s+([.,!?;:"(){}\[\]<>\\/\-—–…])', r'\1', decoded)

# Checking that the word to token and back conversion works
sample_words = text[:36]
token_ids = words_to_tokens(split_to_words(sample_words))
recovered_words = tokens_to_words(token_ids)
print(f"Original text: {sample_words}\n")
print(f"Encoded text: {token_ids}\n")
print(f"Recovered text: {recovered_words}\n")

Original text: First Citizen:
Before we proceed any

Encoded text: [3263, 3770, 13378, 10129, 750, 8206, 5361, 8053]

Recovered text: First Citizen: 
 Before we proceed any

Converting dataset into tokens¶

In [5]:

tokenized_text = words_to_tokens(split_to_words(text))
print("Encoded text sample: {}".format(tokenized_text[:10]))
print(tokens_to_words(tokenized_text[:10]))

# The works of Shakespeare are now a sequence of integers representing the words in the text. Sorry, William.
tokenized_text = torch.tensor(tokenized_text)
tokenized_text.shape

Encoded text sample: [3263, 3770, 13378, 10129, 750, 8206, 5361, 8053, 3232, 12489]
First Citizen: 
 Before we proceed any further,

Out[5]:

torch.Size([292072])

Cooccurrence Matrices¶

To create richer word embeddings, we leverage the cooccurrence matrix $\mathbf{C}$. To construct this matrix, we consider a window of length $W + 1$ and scan the corpus for joint occurrences of words $\mathbf{e}_i$ and $\mathbf{e}_j$. The cooccurrence $C_{ij}$ is the number of times that $\mathbf{e}_j$ appears in a window centered at $\mathbf{e}_i$. If we index the corpus by an index $t$ and use $\mathbf{w}_t$ to represent the $t$th word in the corpus, we can write cooccurrences as:

$$ C_{ij} = \sum_t \mathbb{I}(\mathbf{w}_t = \mathbf{e}_i) = \sum_{u = -W/2}^{u = W/2} \mathbb{I}(\mathbf{w}_u = \mathbf{e}_j), $$

where we assume that the window is even for simplicity. In the above equation, the first indicator function $\mathbb{I}(\mathbf{w}_t = \mathbf{e}_i) = 1$ only when the window is centered at $\mathbf{w}_t$ and $\mathbf{w}_t = \mathbf{e}_i$. The second indicator function $\mathbb{I}(\mathbf{w}_u = \mathbf{e}_j) = 1$ whenever the word $\mathbf{e}_j$ appears in the window centered at $\mathbf{w}_t$. Thus, the second sum counts the number of times that $\mathbf{e}_j$ appears centered in a window centered at $\mathbf{w}_t = \mathbf{e}_i$. The first sum is counting the number of times that $\mathbf{e}_i$ appears in the corpus.

The cooccurrence matrix $\mathbf{C}$ is relevant because related words tend to appear near each other, and they also tend to appear next to words that indicate their relationships. In an extensive corpus, we expect to find several cooccurrences of the words “birds” and “fly,” indicating that these two words are related. We do not expect to see many cooccurrences of “dogs” and “fly” because dogs do not fly. We also expect to see cooccurrences of the words “bird” and “albatross” and of the words “bird” and “swallow,” indicating that there is some relationship between “albatross” and “swallow.”

We highlight that the cooccurrence matrix $\mathbf{C}$ is symmetric:

$$ \mathbf{C} = \mathbf{C}^T \quad \Leftrightarrow \quad C_{ij} = C_{ji} $$

This is because whenever the word $\mathbf{e}_j$ appears in a window centered at an occurrence of the word $\mathbf{e}_i$, these two words are less than $W/2$ words apart. This implies that the word $\mathbf{e}_i$ must appear in a window centered at an occurrence of the word $\mathbf{e}_j$.

Task 2¶

Compute the cooccurrence matrix for the Shakespeare corpus loaded in Task 6.1. Use a window of length $W=10$.

In [6]:

# Create co-occurrence matrix
# The co-occurrence matrix C is a c x c (c is our vocab size) symmetric matrix where C_ij is how many times the ith word appears within W words away from the jth word.
with torch.no_grad():
    W = 10
    C = torch.zeros(len(vocab),len(vocab))
    for t_idx in trange(len(tokenized_text)):
        left_bound = max(t_idx-W//2,0)
        right_bound = min(t_idx+W//2+1,len(tokenized_text))
        context_words = tokenized_text[left_bound : right_bound]
        for u_idx in range(left_bound, right_bound):
            t = tokenized_text[t_idx]
            u = tokenized_text[u_idx]
            C[t, u] += 1.0
    C = C.to(device)
    
# C should be a symmetric matrix
torch.isclose(C, C.T, atol=1e-3).all()

/tmp/ipykernel_218776/1785963264.py:22: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  C = torch.load("C.pt").to(device)

Eigenvector Embeddings¶

A vector $\mathbf{v}_k$ is said to be an eigenvector of the cooccurrence matrix $\mathbf{C}$ if there exists a constant $\lambda_k$ such that

$$ \mathbf{C} \mathbf{v}_k = \lambda_k \mathbf{v}_k. $$

Eigenvectors are peculiar vectors because the matrix multiplication $\mathbf{C} \mathbf{e}$ yields a vector that is, in general, quite different from $\mathbf{e}$. In the case of an eigenvector, the product $\mathbf{C} \mathbf{v}_k = \lambda_k \mathbf{v}_k$ is a simple scaling of $\mathbf{v}_k$. All of the components of $\mathbf{v}_k$ are multiplied by the same number.

It is known that the symmetric matrix $\mathbf{C} \in \mathbb{R}^{c \times c}$ has $c$ distinct eigenvectors. It is customary to order the corresponding $c$ eigenvalues from largest to smallest so that $\lambda_k \geq \lambda_\ell$ when $k < \ell$. Since eigenvector $\mathbf{v}_k$ is associated with eigenvalue $\lambda_k$, the eigenvectors inherit this order. When $k < \ell$, eigenvector $\mathbf{v}_k$ is associated with an eigenvalue that is not smaller than the eigenvalue associated with eigenvector $\mathbf{v}_\ell$ — it is most often larger. We will say that eigenvector $\mathbf{v}_k$ is not smaller than eigenvector $\mathbf{v}_\ell$ or that $\mathbf{v}_k$ is larger than $\mathbf{v}_\ell$ if $\lambda_k > \lambda_\ell$.

It is also customary to group eigenvectors in the eigenvector matrix

$$ \mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{c}], $$

in which column $k$ registers the value of eigenvector $\mathbf{v}_k$. The eigenvector matrix is an $n \times n$ matrix. It has $c$ columns representing $c$ distinct eigenvectors, which have $c$ rows each.

We consider now a number $n \leq c$ and define the dimensionality reduction matrix $\mathbf{V}_n$ grouping the first $n$ eigenvectors of $\mathbf{C}$,

$$ \mathbf{V}_n = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{n}]. $$

This is a tall matrix because it has $c$ rows but only $n$ columns. These columns coincide with the first $n$ columns of $\mathbf{V}$. Instead of storing all eigenvectors, we are storing only the $n$ largest eigenvectors of $\mathbf{C}$.

We use the dimensionality reduction matrix $\mathbf{V}_n$ to construct representations of vectors $\mathbf{e} \in \mathbb{R}^{c}$ in a space of dimensionality $n$. These representations are

$$ \mathbf{x} = \mathbf{V}_n^T \mathbf{e}. $$

We say that this is a dimensionality reduction because $\mathbf{x} \in \mathbb{R}^{n}$ is a vector with $n$ components, which is (much) smaller than the number of components $c$ of the vector $\mathbf{e} \in \mathbb{R}^{c}$.

We use dimensionality reduction to compute word embeddings. Given the collection of words $\mathbf{e}_i$, we transform them into the collection of embeddings

$$ \mathbf{x}_i = \mathbf{V}_n^T \mathbf{e}_i. $$

Representations $\mathbf{x}_i$ are preferable to representations $\mathbf{e}_i$ because they have smaller dimensionality. They also turn out to capture some semantic properties in the sense that vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ that are close represent similar words. This is different from the index embeddings $\mathbf{e}_i$ in which comparisons between different vectors $\mathbf{e}_i$ and $\mathbf{e}_j$ have no meaning.

Task 3

Compute the first $n = 256$ eigenvectors of the cooccurrence matrix computed in Task 6.2. Use these eigenvectors to compute the eigenvector embeddings of all of the $c$ words in the corpus loaded in Task 6.1. Store the corpus using these eigenvector embeddings. This is the time series with which we will work in subsequent tasks.

In [83]:

# n is the number of eigenvectors we want to keep
n = 256
with torch.no_grad():
    # Normalize the data
    Z = C - C.mean(dim=1, keepdim=True)
    Z /= Z.std(dim=1, keepdim=True)

    # Compute the covariance matrix
    cov = (Z @ Z.T)/(Z.shape[0] - 1)
    # Compute the eigenvectors and eigenvalues
    L, Q = torch.linalg.eigh(cov)
    # Get the n largest eigenvectors
    principal_eigv = Q[:, -n:].T

    # PCA embeddings for training
    pca_embeddings = Z @ principal_eigv.T # (c, n)

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[83], line 11
      9 cov = (Z @ Z.T)/(Z.shape[0] - 1)
     10 # Compute the eigenvectors and eigenvalues
---> 11 L, Q = torch.linalg.eigh(cov)
     12 # Get the n largest eigenvectors
     13 principal_eigv = Q[:, -n:].T

KeyboardInterrupt:

Principal Component Analysis¶

The principal component analysis (PCA) transform of a vector $\mathbf{e}$ is its projection in the eigenvector space of $\mathbf{C}$,

$$ \mathbf{y} = \mathbf{V}^T \mathbf{e}. $$

This is similar to the dimensionality reduction operation in

$$ \mathbf{x} = \mathbf{V}_n^T \mathbf{e} $$

except that we are using all $c$ eigenvectors instead of the largest $n$.

The PCA representation has the important property that it can be undone by multiplication with the eigenvector matrix. I.e., given the PCA transform $\mathbf{y}$, we can recover the original vector $\mathbf{e}$ as

$$ \mathbf{e} = \mathbf{V} \mathbf{y}. $$

The combination of the PCA transform and its inverse indicates that $\mathbf{e}$ and $\mathbf{y}$ are equivalent representations of the same information. Given $\mathbf{e}$ we can compute $\mathbf{y}$ and given $\mathbf{y}$ we can compute $\mathbf{e}$.

The same is not true of the dimensionality reduction transformation. When going from $\mathbf{e}$ to $\mathbf{x}$ we lose information precisely because we are reducing dimensionality. In this context, it is interesting to implement the dimensionality recovery operation,

$$ \tilde{\mathbf{e}} = \mathbf{V}_n \mathbf{x} = \mathbf{V}_n \left( \mathbf{V}_n^T \mathbf{e} \right), $$

and ask how close the recovered vector $\mathbf{e}$ is to the original $\tilde{\mathbf{e}}$. The answer is that for word vectors $\mathbf{e}_i$, the error is small. That is, for most word vectors,

$$ \tilde{\mathbf{e}}_i = \mathbf{V}_n \mathbf{x}_i = \mathbf{V}_n \left( \mathbf{V}_n^T \mathbf{e}_i \right) \approx \mathbf{e}_i. $$

Language Transformers¶

We use here a softmax attention transformer with multiple heads to process language sequences. For reference, a transformer with multiple heads is defined by the recursion,

$$ \mathbf{A}_\ell^h = \text{sm} \left( (\mathbf{Q}^h_\ell \mathbf{X}_{\ell-1})^T (\mathbf{K}^h_\ell \mathbf{X}_{\ell-1}) \right), $$$$ \mathbf{Y}_\ell^h = \mathbf{W}_\ell^h{}^T \mathbf{V}^h_\ell \mathbf{X}_{\ell-1} \mathbf{A}_\ell^h{}^T, $$$$ \mathbf{X}_\ell = \mathbf{X}_{\ell-1} + \sigma \left( \sum_{h=1}^H \mathbf{Y}_\ell^h \right). $$

The input to the transformer is a sequence of $T$ eigenvector word embeddings $\mathbf{X}_0 = \mathbf{X}$ and its output $\mathbf{X}_L = \Phi (\mathbf{X}, \mathcal{A})$ is another sequence of $T$ eigenvector word embeddings. The trainable tensor $\mathcal{A} = \{\mathbf{Q}^h_\ell, \mathbf{K}^h_\ell, \mathbf{V}^h_\ell, \mathbf{W}^h_\ell\}$ contains all of the query, key, value, and dimension recovery matrices of all heads and layers. We use the output sequence to predict the next word, $\mathbf{x}_T$, in the sequence in Section sec_language_readout.

$$ \mathbf{A}_\ell^h = \text{sm} \left( (\mathbf{Q}^h_\ell \mathbf{X}_{\ell-1})^T (\mathbf{K}^h_\ell \mathbf{X}_{\ell-1}) \right) $$

is the computation of softmax attention coefficients $\mathbf{A}_\ell^h$ for Layer $\ell$ and Head $h$. We use these attention coefficients to create contextual representations $\mathbf{Y}_\ell^h$ in

$$ \mathbf{Y}_\ell^h = \mathbf{W}_\ell^h{}^T \mathbf{V}^h_\ell \mathbf{X}_{\ell-1} \mathbf{A}_\ell^h{}^T. $$

The output of Layer $\ell$ is computed by summing all heads and passing the output through a nonlinear operation in

$$ \mathbf{X}_\ell = \mathbf{X}_{\ell-1} + \sigma \left( \sum_{h=1}^H \mathbf{Y}_\ell^h \right). $$

We also add the skip connection $\mathbf{X}_{\ell-1}$ to the output of Layer $\ell$ of the transformer.

Recall that in the attention and representation equations we create the intermediate representations $\mathbf{Q}^h_\ell \mathbf{X}_{\ell-1}$ (queries), $\mathbf{K}^h_\ell \mathbf{X}_{\ell-1}$ (keys), and $\mathbf{V}^h_\ell \mathbf{X}_{\ell-1}$ (values) which are of dimension $m \ll n$. In this lab and in language models in general, the reduction of dimension is aggressive. We have here that $n = 256$ at the input and choose $m = 2$ for intermediate representations.

Task 4

Code a Pytorch module to implement the language transformer as specified by $\text{(13)-(15)}$. This transformer takes sequences of length $T$ and dimension $n$ as inputs and produces sequences of length $T$ and dimension $n$ as outputs. Make the number of layers $L$ and the number of heads $H$ parameters of the transformer. Queries, keys, and values are of dimension $m$, which is also a parameter of the transformer. Use ReLU nonlinearities at each layer.

This is the same transformer of Lab 5. It is a copy and paste task. That the time series represents language is irrelevant.

MultiHeadLayer¶

In [8]:

# This is the same as the MultiheadLayer in the lab 6 notebook. It corresponds to the equations in Section 3 of this lab's writeup.
class MultiHeadLayer(nn.Module):
    """
    An implementation of the multihead attention layer.
    The difference between AttentionLayer and this class is,
    now Q,K,V are matrices of shape (H, m, n), and the attention matrix B is of shape (H, T, T)
    (one attention feature per head)
    Args:
        m (int): The dimension of the Q and K matrices.
        n (int): The number of features, n=12 in our case.
        k (int): The dimension of the W matrix.
        H (int): The number of heads.
    """
    def __init__(self, m, n, H):
        super(MultiHeadLayer, self).__init__()
        self.m = m
        self.H = H

        self.Q = nn.Parameter(torch.empty(H, m, n))
        self.K = nn.Parameter(torch.empty(H, m, n))
        self.V = nn.Parameter(torch.empty(H, m, n))

        self.W = nn.Parameter(torch.empty(H, n, m))
        
        self.nonlinearity = nn.ReLU()
        self.initialize_parameters()

    def initialize_parameters(self):
        """
        Initialize the values of the learnable parameter matrices.
        Kaiming uniform is just a type of random initialization, you don't need to 
        worry about it. It is a good default initialization for linear layers.
        """
        nn.init.kaiming_uniform_(self.Q, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.K, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.V, a=math.sqrt(5))

        nn.init.kaiming_uniform_(self.W, a=math.sqrt(5))

    def forward(self, X):
        """
        Args:
            X (torch.Tensor): The input embeddings.
        Returns:
            X_l (torch.Tensor): The output of the multihead attention layer.
        """
        B, n, T = X.shape  # X: (B, n, T)

        # Expand X to include the head dimension
        X_expanded = X.unsqueeze(1)  # (B, 1, n, T)

        # Compute QX, KX, VX for each head
        QX = torch.matmul(self.Q.unsqueeze(0), X_expanded)  # (B, H, m, T)
        KX = torch.matmul(self.K.unsqueeze(0), X_expanded)  # (B, H, m, T)
        VX = torch.matmul(self.V.unsqueeze(0), X_expanded)  # (B, H, m, T)
        
        QX_t = QX.transpose(-2, -1)  # (B, H, T, m)

        # Compute attention scores B per head
        B = torch.matmul(QX_t, KX)  # (B, H, T, T)
        A = F.softmax(B, dim=-1)
    
        A_t = A.transpose(-2,-1)
        VXA_t = torch.matmul(VX, A_t) # (B, H, m, T)
        Y = torch.matmul(self.W, VXA_t) # (B, H, T, n)
        
        X_l = X + self.nonlinearity(Y.sum(dim=1))

        return X_l
    
    
model = MultiHeadLayer(m=32, n=256, H=8).to(device)
X_tilde = torch.randn(1,256,64).to(device)
out = model(X_tilde)

print(f"out.shape: {out.shape}")

out.shape: torch.Size([1, 256, 64])

LanguageTransformer¶

In [9]:

class LanguageTransformer(nn.Module):
    """
    
    Mutlihead Transformer, analogous to the Transformer class, in the single head case.
    Args:
        m (int): The dimension of the Q and K matrices.
        n (int): The number of features, n=12 in our case.
        k (int): The dimension of the W matrix.
        L (int): The number of layers.
        H (int): The number of heads.
    """
    def __init__(self, m, n, L, H):
        super(LanguageTransformer, self).__init__()
        self.layers = nn.ModuleList([
            MultiHeadLayer(m, n, H) for _ in range(L)
        ])
        # Word embedding table. This is the only change from the previous lab's code. We have 
        # PCA embeddings to convert word indices to embeddings.
        self.embedding_table = pca_embeddings
        
    def forward(self, E):
        """
        The forward pass of the multihead transformer, stacks L multihead layers.
        This class is essentially the same as the Transformer class, but using the 
        MultiHeadLayer class instead of the AttentionLayer class.
        Args:
            E (torch.Tensor): The input word indices.
        Returns:
            X_L^{T-1} (torch.Tensor): The last vector of the output of the transformer.
        """
        # Convert word indices to embeddings. We need to transpose the result to get the shape (B, n, T).
        X = self.embedding_table[E].transpose(1,2)
        B, n, T = X.shape

        # Compute the mean token to append to the sequence.
        X_tilde = X.mean(dim=2, keepdim=True) # mean over the time dimension
        X_tilde = torch.cat((X, X_tilde), dim=-1)
        
        # X_l has shape (B, n, T+1)
        X_l = X_tilde
        for layer in self.layers:
            X_l = layer(X_l)
        
        # Output the last vector.
        return X_l[:,:,-1]

# Test
model = LanguageTransformer(L=2, H=2, m=32, n=256).to(device)
E = torch.randint(0, pca_embeddings.shape[0], (1,5)).to(device).long()
out = model(E)
print(f"output.shape: {out.shape}")

output.shape: torch.Size([1, 256])

Next Word Prediction¶

To predict word $\mathbf{x}_T$, we read the output $\mathbf{X}_L = \Phi (\mathbf{X}, \mathcal{A})$ of the transformer. A possible approach is to take the average across time. To set up this readout strategy, let $\mathbf{X}_u$ denote a sequence of $T$ words—in the form of eigenvector embeddings—starting at time $u$,

$$ \mathbf{X}_u = \big[\, \mathbf{x}_u, \mathbf{x}_{u+1}, \ldots, \mathbf{x}_{T + u -1} \,\big] = \mathbf{x}_{u: u + T-1} . $$

This is a recorded history of the language sequence. Our goal is to predict the next word $\mathbf{x}_{u+T}$ using this recorded history. We do that using the average of the output of the transformer,

$$ \hat{\mathbf{x}}_{u+T} = \Big[ \, \Phi (\mathbf{X}_u, \mathcal{A}) \, \Big] \mathbf{1}. $$

We then train the tensor $\mathcal{A} = \{\mathbf{Q}^h_\ell, \mathbf{K}^h_\ell, \mathbf{V}^h_\ell, \mathbf{W}^h_\ell\}$ to maximize prediction accuracy over the corpus. Utilizing a mean squared error (MSE), the prediction task reduces to

$$ \mathcal{A}^* = \arg\min_\mathcal{A ~} \frac{1}{C}~\sum_{u=0}^{C-1} ~ \Big\| \, \Phi \big(\, \mathbf{X}_{u}, \, \mathcal{A} \, \big) \mathbf{1} – \mathbf{x}_{u+T} \,\Big \|^2 \, . $$

In this equation, we compare the estimate $\hat{\mathbf{x}}_{u+T}$ read out from the transformer’s output with the true next word $\mathbf{x}_{u+T}$. We average the resulting MSE loss over the corpus and seek the tensor $\mathcal{A}^*$ that minimizes it. Notice that to simplify notation, we sum over the whole corpus. In reality, we can’t predict the last $T$ words because we are using histories $\mathbf{X}_u$ of length $T$. In fact, we have several other limitations in the construction of the training dataset. We may, e.g., want to avoid running over the end of a play or the end of an act. We choose to ignore these practicalities as they have little effect.

Task 5¶

Split the corpus loaded in Task 6.3 into a training set containing 90% of the total number of words and a test set containing 10% of the words. Recall that this is a time series of word embeddings. Use this training set to train a transformer that predicts the next word embedding using the loss in

$$ \mathcal{A}^* = \arg\min_\mathcal{A ~} \frac{1}{C}~\sum_{u=0}^{C-1} ~ \Big\| \, \Phi \big(\, \mathbf{X}_{u}, \, \mathcal{A} \, \big) \mathbf{1} – \mathbf{x}_{u+T} \,\Big \|^2 \, . $$

Use $ T = 64 $ for the length of the history $ \mathbf{X}_u $. Transformer parameters are your choice. If you need a recommendation, use $ L = 6 $, $ H = 8 $, and $ m = 32 $.

Evaluate the test MSE and compare it to the train MSE. Both of these MSE values indicate good prediction. However, this does not mean that we are making good predictions of the next word in the sequence. Explain.

Data Split¶

In [10]:

T = 64 # context size
split_factor = 0.9
split_index = int(split_factor * len(tokenized_text))
    
# Splitting into train and test sets
train = tokenized_text[:split_index].to(device)
test = tokenized_text[split_index:].to(device)

Dataset¶

In [11]:

class WordIndexDataset(Dataset):
    """
    This Dataset class takes and encoded tensor of word indices and returns a tensor of context windows of size T.
    The tensors returned by this dataset are not yet one-hot encoded.
    """
    def __init__(self, text, T):
        self.text = text
        self.T = T
        assert self.T < len(text), "context_size (T) must be less than len(text)"

    def __len__(self):
        return len(self.text) - self.T

    def __getitem__(self, idx):
        """
        Return a single context window of size T. 
        The context window is a sequence of T words.

        During training, we will predict the next token of every word in the context window,
        so Y_item is the next word for every word in the context window.
        """
        X_item = self.text[idx:idx + self.T]
        Y_item = self.text[idx + 1:idx + self.T + 1]

        return X_item, Y_item

train_dataset = WordIndexDataset(train, T)
test_dataset = WordIndexDataset(test, T)


# Example of a batch
B = 64
train_loader = DataLoader(train_dataset, batch_size=B, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=B, shuffle=False)

E, y_idx = next(iter(train_loader))
print(f"X_idx shape: {E.shape}")
print(f"y_idx shape: {y_idx.shape}")

X_idx shape: torch.Size([64, 64])
y_idx shape: torch.Size([64, 64])

In [12]:

# Training
n_epochs = 3
m = 32
n = 256
L = 6
T = 64
H = 8

estimator = LanguageTransformer(m, n, L, H).float().to(device)
optimizer = torch.optim.SGD(estimator.parameters(), lr=1e-5)

train_loader = DataLoader(train_dataset, batch_size=B, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=B, shuffle=False)

cross_entropy_loss = nn.MSELoss()
estimator.train()
train_loss = []

for epoch in range(n_epochs): # Iterate over n_epochs epochs

    for x_batch, y_batch in tqdm(train_loader): # Iterate over all batches in the dataset 
        # Load the embeddings for the target word
        # We want to predict the last word of the context window for this exercise.
        y_word_to_predict = y_batch[:,-1]
        Y_embeddings = pca_embeddings[y_word_to_predict].transpose(0,1).to(device) # (B, n)
        
        # (Step i) Load the data. These commands send the data to the GPU memory.
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        # (Step ii) Compute the gradients. We use automated differentiation.
        optimizer.zero_grad() # Gradient reset to indicate where the backward computation stops.

        # Call the neural network. In this case, we will take the average of the output of the
        # transformer as the prediction.
        y_hat = estimator(x_batch).mean(dim=-1)
        cross_entropy_value = cross_entropy_loss(y_hat,Y_embeddings)

        cross_entropy_value.backward() # Compute gradients moving backwards untit the gradient reset.

        # (Step iii) Update parameters by taking an SGD (or other optimizer) step.
        optimizer.step()

        train_loss.append(cross_entropy_value.item())
    if epoch print(f"Epoch {epoch}/{n_epochs} Loss: {train_loss[-1]}")

    # End of batch loop.

# Evaluate test loss
estimator.eval()
with torch.no_grad():
    test_losses = []
    for x_batch, y_batch in tqdm(test_loader):
        y_word_to_predict = y_batch[:,-1]
        Y_embeddings = pca_embeddings[y_word_to_predict].transpose(0,1).to(device) # (B, n)
        y_hat = estimator(x_batch).mean(dim=-1)
        test_losses.append(cross_entropy_loss(y_hat,Y_embeddings).item())
    test_loss = torch.tensor(test_losses).mean().item()

print(f"Train loss: {train_loss[-1]}")
print(f"Test loss: {test_loss}")

  0%|          | 0/4107 [00:00<?, ?it/s]/home/jporras/miniconda3/envs/lab_5a_trf/lib/python3.12/site-packages/torch/nn/modules/loss.py:608: UserWarning: Using a target size (torch.Size([256, 64])) that is different to the input size (torch.Size([64])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)
100%|█████████▉| 4106/4107 [01:14<00:00, 55.12it/s]/home/jporras/miniconda3/envs/lab_5a_trf/lib/python3.12/site-packages/torch/nn/modules/loss.py:608: UserWarning: Using a target size (torch.Size([256, 16])) that is different to the input size (torch.Size([16])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)
100%|██████████| 4107/4107 [01:14<00:00, 55.26it/s]

Epoch 0/3 Loss: 1.7093441486358643

100%|██████████| 4107/4107 [01:14<00:00, 55.14it/s]
100%|██████████| 4107/4107 [01:14<00:00, 55.06it/s]
 98%|█████████▊| 445/456 [00:03<00:00, 128.91it/s]/home/jporras/miniconda3/envs/lab_5a_trf/lib/python3.12/site-packages/torch/nn/modules/loss.py:608: UserWarning: Using a target size (torch.Size([256, 24])) that is different to the input size (torch.Size([24])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)
100%|██████████| 456/456 [00:03<00:00, 129.09it/s]

Train loss: 1.551069974899292
Test loss: 1.2948185205459595

6. Probability Readout¶

The predictions $\hat{\mathbf{x}}_{u+T}$ in

$$ \hat{\mathbf{x}}_{u+T} = \Big[ \, \Phi (\mathbf{X}_u, \mathcal{A}) \, \Big] \mathbf{1} $$

may have a small MSE when compared to the observed words $\mathbf{x}_{u+T}$ but they are not a good strategy for estimating the next word. This is because $\hat{\mathbf{x}}_{T}$ need not be a valid word. Indeed, it most likely will not be a valid word.

Word $\mathbf{e}_i$ is represented by the eigenvector encoding $\mathbf{x}_i = \mathbf{V}_n^T \mathbf{e}_i$ as stated in

$$ \mathbf{x}_i = \mathbf{V}_n^T \mathbf{e}_i. $$

Since there are a total of $c$ words in our corpus, there are a total of $c$ vectors $\mathbf{x}_i$ that represent valid words. The vectors at the output of the transformer are most unlikely to be one of these vectors, and the estimate $\hat{\mathbf{x}}_{T}$ is just as unlikely unless we manage to drive the train and test MSEs to zero.

To solve this problem, we must force the readout to be a valid word. We do that with a readout layer whose output is a vector of $\tilde{d}_n$ probabilities for each of the $\tilde{d}_n$ words in the corpus. This readout layer is a softmax applied to the output of a fully connected layer that acts on the output of the transformer,

$$ \boldsymbol{\pi} (\mathbf{X}) = \text{sm} \Big[\, \mathbf{A} \, \text{vec} \big( \Phi (\mathbf{X}, \mathcal{A})\big) \, \Big] . $$

The matrix $\mathbf{A}$ is a trainable parameter with $nT$ columns and $c$ rows. After applying the softmax normalization, the entries of the output $\boldsymbol{\pi}(\mathbf{X})$ add up to one and can be interpreted as a set of probabilities that dictate the likelihood of the next word in the sequence. The $i$th entry $\boldsymbol{\pi}_i(\mathbf{X})$ is the predicted probability that the next word is $\mathbf{e}_i$.

We refer to the probabilities in

$$ \boldsymbol{\pi} (\mathbf{X}) $$

as a policy. To train this policy, we minimize the cross-entropy loss between the true word at time $u+T$ and the probabilities $\boldsymbol{\pi}(\mathbf{X})$,

$$ \mathcal{A}^*, \mathbf{A}^* = \arg\min_{\mathcal{A},\, \mathbf{A}} ~ \frac{1}{C}~\sum_{u=0}^{C-1} ~ \big(\mathbf{e}_{u+T}\big)^T \big( \log \boldsymbol{\pi}(\mathbf{X}_u) \big) . $$

Notice that in

$$ \mathcal{A}^*, \mathbf{A}^* = \arg\min_{\mathcal{A},\, \mathbf{A}} ~ \frac{1}{C}~\sum_{u=0}^{C-1} ~ \big(\mathbf{e}_{u+T}\big)^T \big( \log \boldsymbol{\pi}(\mathbf{X}_u) \big) . $$

the vector $\mathbf{e}_{u+T}$ is the index encoding of the word at time $u+T$. This is a vector with all zeros except that it has a 1 at the entry that corresponds to the index of the word that is observed at time $u+T$. It is therefore a valid probability index that we can incorporate into a cross-entropy comparison.

Further notice that the optimization is joint over the trainable parameters $\mathcal{A}$ of the transformer and the readout matrix $\mathbf{A}$. These two parameters are implicit in

$$ \mathcal{A}^*, \mathbf{A}^* = \arg\min_{\mathcal{A},\, \mathbf{A}} ~ \frac{1}{C}~\sum_{u=0}^{C-1} ~ \big(\mathbf{e}_{u+T}\big)^T \big( \log \boldsymbol{\pi}(\mathbf{X}_u) \big) . $$

They appear because $\boldsymbol{\pi} (\mathbf{X}_u)$ depends on $\mathbf{A}$ and $\mathcal{A}$. In the hope that it is revealing to make this dependence explicit, we instantiate $\mathbf{X} = \mathbf{X}_u$ in

$$ \boldsymbol{\pi} (\mathbf{X}) = \text{sm} \Big[\, \mathbf{A} \, \text{vec} \big( \Phi (\mathbf{X}, \mathcal{A})\big) \, \Big] $$

and substitute the result in

$$ \mathcal{A}^*, \mathbf{A}^* = \arg\min_{\mathcal{A},\, \mathbf{A}} ~ \frac{1}{C}~\sum_{u=0}^{C-1} ~ \big(\mathbf{e}_{u+T}\big)^T \big( \log \boldsymbol{\pi}(\mathbf{X}_u) \big) . $$

to write

$$ \mathcal{A}^*, \mathbf{A}^* = \arg\min_{\mathcal{A},\, \mathbf{A}} ~ \frac{1}{C}~\sum_{u=0}^{C-1} ~ \Big[\mathbf{e}_{u+T}\Big]^T \bigg[ \log \text{sm} \Big[\, \mathbf{A} \, \text{vec} \big(\, \Phi (\mathbf{X}_u, \mathcal{A}) \, \big) \, \Big]\, \bigg] . $$

We solve this empirical risk minimization (ERM) problem to predict the next word in a sequence of text. This prediction is based on observing a history of length $T$ that is processed by a transformer

$$ \mathbf{X}_\ell = \mathbf{X}_{\ell-1} + \sigma\bigg( \sum_{h=1}^H \mathbf{Y}_\ell^h \,\bigg) . $$

with a probability readout layer

$$ \boldsymbol{\pi} (\mathbf{X}) = \text{sm} \Big[\, \mathbf{A} \, \text{vec} \big( \Phi (\mathbf{X}, \mathcal{A})\big) \, \Big] . $$

Different from the readout strategy in

$$ \hat{\mathbf{x}}_{u+T} = \Big[ \, \Phi (\mathbf{X}_u, \mathcal{A}) \, \Big] \mathbf{1} $$

and the training procedure in

$$ \mathcal{A}^* = \arg\min_{\mathcal{A}} \frac{1}{C}~\sum_{u=0}^{C-1} ~ \Big\| \, \Phi \big(\, \mathbf{X}_{u}, \, \mathcal{A} \, \big) \mathbf{1} – \mathbf{x}_{u+T} \,\Big \|^2 \, . $$

the ERM problem in

$$ \mathcal{A}^*, \mathbf{A}^* = \arg\min_{\mathcal{A},\, \mathbf{A}} ~ \frac{1}{C}~\sum_{u=0}^{C-1} ~ \big(\mathbf{e}_{u+T}\big)^T \big( \log \boldsymbol{\pi}(\mathbf{X}_u) \big) . $$

produces parameters $\mathcal{A}^*$ and $\mathbf{A}^*$ that map directly to predictions of actual words.

Task 6¶

Modify the transformer of Task 6.4 to add the readout layer in

$$ \boldsymbol{\pi} (\mathbf{X}) = \text{sm} \Big[\, \mathbf{A} \, \text{vec} \big( \Phi (\mathbf{X}, \mathcal{A})\big) \, \Big]. $$

In [13]:

class LanguageTransformerWithReadout(nn.Module):
    """
    A slight modification of the LanguageTransformer class of Task 4.
    Args:
        m (int): The dimension of the Q and K matrices.
        n (int): The number of features, n=12 in our case.
        k (int): The dimension of the W matrix.
        L (int): The number of layers.
        H (int): The number of heads.
        c (int): The vocabulary size.
    """
    def __init__(self, m, n, L, H, c):
        super(LanguageTransformerWithReadout, self).__init__()
        self.layers = nn.ModuleList([
            MultiHeadLayer(m, n, H) for _ in range(L)
        ])

        self.embedding_table = pca_embeddings

        # Adding readout layer
        self.readout = nn.Parameter(torch.empty(c, n).to(device))
        nn.init.kaiming_uniform_(self.readout, a=math.sqrt(5))
        
    def forward(self, E):
        """
        We change the forward pass from the previous Transformer.
        Instead of concatenating a vector to the sequence, we now output a vector of probabilities for each word in the sequence.
        Args:
            E (torch.Tensor): The input word indices.
        Returns:
            Y_hat (torch.Tensor): The output of the transformer, passed through the readout layer.
        """
        X = self.embedding_table[E].transpose(1,2)

        B, n, T = X.shape
        
        # X_l has shape (B, n, T+1)
        X_l = X
        for layer in self.layers:
            X_l = layer(X_l)

        # We implement the readout layer as a linear mapping on each word in the sequence.
        Y_hat = torch.matmul(self.readout, X_l) # (B, c, T)

        # Notice, we don't apply the softmax here, because we keep the probabilities unnormalized until 
        # we call the loss function, for numerical stability.
        return Y_hat

# testing. Now the transformer outputs a vector of probabilities for each word in the sequence.
E = torch.randint(0, len(vocab), (1,5)).to(device).long()
model = LanguageTransformerWithReadout(m=32, n=256, L=6, H=8, c=c).to(device)
out = model(E)
print(f"out.shape: {out.shape}")

out.shape: torch.Size([1, 14295, 5])

Task 7¶

Split the corpuses loaded in Task 6.1 and Task 6.3 into a training set containing 90% of the total number of words and a test set containing 10% of the words. Recall that these two are equivalent time series except that the information is encoded differently. In Task 6.1, we store words using index encodings, and in Task 6.3, we store words using eigenvector embeddings. We are loading both here because the eigenvector encodings are the input to the transformer, and the index encodings are needed for the crossentropy comparison in

$$ \mathcal{A}^*, \mathbf{A}^* = \arg\min_{\mathcal{A}, \, \mathbf{A}} \frac{1}{C} \sum_{u=0}^{C-1} \Big[\mathbf{e}_{u+T}\Big]^T \bigg[ \log \text{sm} \Big[\, \mathbf{A} \, \text{vec} \big(\, \Phi (\mathbf{X}_u, \mathcal{A}) \, \big) \, \Big]\, \bigg]. $$

Make sure that time indexes match in your data.

Use the training set to train a transformer that predicts next word probabilities using the transformer with readout of Task 6.6. Use $T=32$ for the length of the history $\mathbf{X}_u$. Transformer parameters are your choice. If you need a recommendation, use $L=6$, $H=8$, and $m=32$.

Evaluate the crossentropy loss in the test set and compare it to the crossentropy loss in the training set.

In [75]:

# Training
n_epochs = 5
B = 64
m = 32
n = 256
L = 6
T = 32
H = 8

estimator = LanguageTransformerWithReadout(m, n, L, H, c).float().to(device)
optimizer = torch.optim.SGD(estimator.parameters(), lr=1e-5)

train_loader = DataLoader(train_dataset, batch_size=B, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=B, shuffle=False)

# We use the Cross Entropy loss for estimating the probabilities of the next word.
cross_entropy_loss = nn.CrossEntropyLoss()
estimator.train()
train_loss = []

for epoch in range(n_epochs): # Iterate over n_epochs epochs

    for x_batch, y_batch in tqdm(train_loader): # Iterate over all batches in the dataset 
        # We want to predict tha last word of the context window for this exercise.
        y_word_to_predict = y_batch[:,-1]
        
        # Load the embeddings for the words.
        X_embeddings = pca_embeddings[x_batch].transpose(1,2) # (B, n, T)
        
        # (Step i) Load the data. These commands send the data to the GPU memory.
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        # (Step ii) Compute the gradients. We use automated differentiation.
        optimizer.zero_grad() # Gradient reset to indicate where the backward computation stops.

        # Call the neural network. Get the prediction for the last word.
        y_hat = estimator(x_batch)[:,:,-1]

        # The softmax function is applied internally to the transformer's output y_hat.
        cross_entropy_value = cross_entropy_loss(y_hat,y_word_to_predict)

        cross_entropy_value.backward() # Compute gradients moving backwards untit the gradient reset.

        # (Step iii) Update parameters by taking an SGD (or other optimizer) step.
        optimizer.step()

        train_loss.append(cross_entropy_value.item())

    print(f"Epoch {epoch}/{n_epochs} Loss: {train_loss[-1]}")

    # End of batch loop.
# Evaluate test loss at the end of training
estimator.eval()
with torch.no_grad():
    test_losses = []
    for x_batch, y_batch in tqdm(test_loader):
        y_word_to_predict = y_batch[:,-1]
        Y_embeddings = pca_embeddings[y_word_to_predict].transpose(0,1).to(device) # (B, n)
        y_hat = estimator(x_batch).mean(dim=-1)
        cross_entropy_value = cross_entropy_loss(y_hat,y_word_to_predict)
        test_losses.append(cross_entropy_value.item())
    test_loss = torch.tensor(test_losses).mean().item()

print(f"Train loss: {train_loss[-1]}")
print(f"Test loss: {test_loss}")

100%|██████████| 4107/4107 [02:05<00:00, 32.77it/s]

Epoch 0/5 Loss: 9.349272727966309

100%|██████████| 4107/4107 [02:06<00:00, 32.48it/s]
100%|██████████| 4107/4107 [02:06<00:00, 32.45it/s]
100%|██████████| 4107/4107 [02:06<00:00, 32.45it/s]
100%|██████████| 4107/4107 [02:06<00:00, 32.46it/s]
100%|██████████| 456/456 [00:06<00:00, 75.92it/s]

Train loss: 5.0682373046875
Test loss: 7.467085361480713

Model Sampling¶

After solving the empirical risk minimization (ERM) problem in

we have trained values $\mathcal{A}^*$ for the transformer and $\mathbf{A}^*$ for the probability readout layer. With these trained values, we can execute

$$ \mathbf{\pi}^* (\mathbf{X}) = \text{sm} \Big[\, \mathbf{A}^* \, \text{vec} \big( \Phi (\mathbf{X}, \mathcal{A}^*)\big) \, \Big] $$

for any given text sequence $\mathbf{X}$ of length $T$. The result is the (optimal) vector of probabilities.

This is not yet a word; it is a vector of probabilities that assigns probabilities to each of the $c$ words in the corpus. To generate a word, we need to implement a sampling strategy.

Let us denote as $\pi^* (\mathbf{e}_i | \mathbf{X})$ the probability of choosing word $\mathbf{e}_i$. This is the $i$th entry of the vector of probabilities $\mathbf{\pi}$. A possible sampling strategy is to sample the word $\mathbf{e}_i$ with the highest probability:

$$ \hat{\mathbf{e}} = \arg\max_{\mathbf{e}_i} \pi^* (\mathbf{e}_i | \mathbf{X}). $$

Alternatively, we can sample predictions randomly by choosing different words according to their corresponding probabilities. We write

$$ \hat{\mathbf{e}} = \mathbf{e}_i \sim \pi^* (\mathbf{e}_i | \mathbf{X}) $$

to signify that we choose $\hat{\mathbf{e}} = \mathbf{e}_i$ with probability $\pi^* (\mathbf{e}_i | \mathbf{X})$.

Sampling according to the largest probability (as in the first equation) is a good strategy if we want to predict the next word in the sequence. However, sampling randomly according to word probabilities (as in the second equation) is a better strategy for generating text. Random sampling better imitates the natural variability of human language, and we will use random sampling.

Task 8¶

Given trained parameters $\mathcal{A}^*$ and $\mathbf{A}^*$, implement the following:

(a) A transformer with parameters $\mathcal{A}^*$ that takes language sequences $\mathbf{X}$ of length $T$ as inputs.

(b) A readout layer that postprocesses the output of the transformer to yield a vector of probabilities $\mathbf{\pi}^* (\mathbf{X})$.

(c) A sampler that takes probabilities $\mathbf{\pi}^*(\mathbf{X})$ as inputs and returns words $\hat{\mathbf{e}}$ sampled according to

$$ \hat{\mathbf{e}} = \mathbf{e}_i \sim \pi^* (\mathbf{e}_i | \mathbf{X}). $$

The transformer and readout implementations are just instances of the transformer and readout modules from Tasks 6.4 and 6.6. The only new piece here is the sampler.

Try your sampler for a few input sequences.

In [15]:

# Taking a snippet of the text set to test the model.
starting_point = torch.randint(0, len(test)-T, (1,))
initial_indices = test[starting_point:starting_point+T].unsqueeze(0)

log_probabilities = model(initial_indices)
print(f"log_probabilities.shape: {log_probabilities.shape}")
last_word_probabilities = log_probabilities[:,:,-1]
probabilities = F.softmax(last_word_probabilities, dim=-2)

print(f"Input text: {tokens_to_words(initial_indices.reshape(-1).tolist())}")
print(f"\nThe most likely next word is: {tokens_to_words([torch.argmax(probabilities).item()])}")

print("\nSampled words according to a multinomial distribution (either could be the next word when using sampling):")
for _ in range(10):
    sampled_word = torch.multinomial(probabilities, num_samples=1).item()
    print(f"{tokens_to_words([sampled_word])}", end=" ")

log_probabilities.shape: torch.Size([1, 14295, 32])
Input text: you are cloudy. 
 
 SEBASTIAN: 
 Foul weather? 
 
 ANTONIO: 
 Very foul. 
 
 GONZALO: 
 Had I plantation of this isle,

The most likely next word is: behove

Sampled words according to a multinomial distribution (either could be the next word when using sampling):
mutinous along sans cheerly banish'd prophet's keen Foul rebellion predecessors

Language Generation¶

In the Language Generation section we adopted a transformer to predict the next word of a sequence of length $ T $. We adapt this model to language generation with a rolling execution.

Begin with a language sequence entered by a user, which we call a prompt. From the prompt we construct a time series $ \mathbf{X}_0 $ with the eigenvector encodings of its words

$$ \mathbf{X}_0 = [\mathbf{x}_0, \ldots, \mathbf{x}_{T-1}]. $$

We assume, for simplicity, that this prompt has length $ T $. Using this prompt we predict the next word in the sequence using the policy $ \boldsymbol{\pi}^* $,

$$ \mathbf{x}_T \sim \boldsymbol{\pi}^*(\mathbf{X}_0). $$

Although the input $ \mathbf{x}_T $ has been generated by the policy $ \boldsymbol{\pi}^* $, we reinterpret it as a given word. We then roll the prompt backward and append the generated word $ \mathbf{x}_T $ to construct the series

$$ \mathbf{X}_1 = [\mathbf{x}_1, \ldots, \mathbf{x}_{T-1}, \mathbf{x}_{T}]. $$

In this sequence the first $ T-1 $ entries are part of the user prompt. The last one, $ \mathbf{x}_{T} $, has been generated. We ignore this distinction and proceed to estimate word $ T+1 $ as

$$ \mathbf{x}_{T+1} \sim \boldsymbol{\pi}^*(\mathbf{X}_1). $$

We then proceed to append this generated word to the time series in the previous equation and roll the series backward. This procedure yields the time series,

$$ \mathbf{X}_2 = [\mathbf{x}_2, \ldots, \mathbf{x}_{T-1}, \mathbf{x}_{T}, \mathbf{x}_{T+1}]. $$

In this time series we have the last $ T-2 $ words of the user prompt and two words that have been generated by policy $ \boldsymbol{\pi}^* $. These are the words $ \mathbf{x}_{T} $ and $ \mathbf{x}_{T+1} $ generated in the previous equations. We again ignore this distinction and generate the next word as,

$$ \mathbf{x}_{T+2} \sim \boldsymbol{\pi}^*(\mathbf{X}_2). $$

We append word $ \mathbf{x}_{T+2} $ to the time series, roll the time series backward, and use the updated series to predict the next word in the sequence. In general, at generative step $ u $ we take as an input the time series

$$ \mathbf{X}_u = [\mathbf{x}_u, \ldots, \mathbf{x}_{T-1+u}], $$

in which the last $ u $ samples have been generated—it can be that all of the samples are generated if $ u \geq T $. From this time series we generate the word in position $ T+u $ as,

$$ \mathbf{x}_{T+u} \sim \boldsymbol{\pi}^*(\mathbf{X}_u). $$

The output of the generative language model is the string of text $[\mathbf{x}_T, \ldots, \mathbf{x}_{T_{\max}}]$ where $ T_{\max} $ is a prespecified limit for the length of the language sequence to be generated. Of course, rather than returning the eigenvector embeddings $[\mathbf{x}_T, \ldots, \mathbf{x}_{T_{\max}}]$ we return the sequence of corresponding words.

Task 9¶

Implement the generative language model as specified by the recursion $\text{(31)-(32)}$

Take prompts of length $ T=64 $ as inputs and generate language sequences of length $ T_{\max}=500 $. To make this task more interesting, modify your implementation to accept prompts of any length $ T’ \leq T $. This is not difficult because absent words in the prompt can be set to zero.

Try your generative model for some prompts.

In [16]:

def generate_text(model, X, max_generate_tokens=500):
    """
    Generate text from a model given an initial input token sequence.
    Args:
        model (nn.Module): The model to use for generation.
        input_tokens (torch.Tensor): The initial input token sequence.
        max_generate_tokens (int): The maximum number of tokens to generate.
    Returns:
        torch.Tensor: The generated token sequence.
    """
    with torch.no_grad():
        context = X.clone()
        generated_sequence = X.cpu().squeeze().tolist()  # Ensure it's a 1D list
        for _ in range(max_generate_tokens):
            logits = model(context)
            
            last_word_embeddings = logits[:,:,-1]
            probs = F.softmax(last_word_embeddings, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            
            # Slide context window: remove the first token and append the next token
            context = torch.cat([context[:, 1:], next_token], dim=1)  
            generated_sequence.append(next_token.squeeze().item())  # Add new token to generated sequence
        generated_words = tokens_to_words(generated_sequence)
        generation_string = "".join(generated_words)
        return generation_string

# Test generate
#model = LanguageTransformerWithReadout(m=32, n=256, L=6, H=8, c=c).to(device)
starting_point = torch.randint(0, len(test)-T, (1,))
initial_indices = test[starting_point:starting_point+T].unsqueeze(0)
print(f"========== INPUT TEXT ==========")
print(f"{tokens_to_words(initial_indices.reshape(-1).tolist())}\n")

# This is the model from task 7
print(f"========== INPUT + GENERATED TEXT ==========")
print(generate_text(estimator, initial_indices, max_generate_tokens=100))
print(f"====================================")

========== INPUT TEXT ==========

 Or as twere perfumed by a fen. 
 
 GONZALO: 
 Here is everything advantageous to life. 
 
 ANTONIO: 
 True; save means to live

========== INPUT + GENERATED TEXT ==========

 Or as twere perfumed by a fen. 
 
 GONZALO: 
 Here is everything advantageous to life. 
 
 ANTONIO: 
 True; save means to live Merry destined buckets Clifford Lieutenant unthought Vienna Usurps undertakes sighing divorced blacks Judge boldness flaky groans affable Officed Weak northern Day mocker squire complaining expecting foil'd ibat You'd An descends Show lethargy Trimm'd Lend sprightly woe's thief Praised medlars Drop incident discolour'd furred mightst honourably weary Bastards resolute enjoin'd appeall'd Twas look pass'd celestial standest withholds avoid Philip thanks amazed rarely shreds equall'd Unwieldy stall hairs life of parson's swounded death suggestion camels detect Fertile discharged ungentle darkened presumes attorneys assuage wherewith bay dwell cedars conditions O'erthrows kine seem'd grinning distemper horsing He'ld gentlemen fare daws crowning hole press'd shaft
====================================

In [74]:

def generate_from_prompt(estimator,prompt,max_generate_tokens=100):
    words = split_to_words(prompt)
    tokens = words_to_tokens(words)
    token_tensor = torch.tensor(tokens).to(device).unsqueeze(0)
    return generate_text(estimator, token_tensor, max_generate_tokens=max_generate_tokens)
print(generate_from_prompt(estimator,"Alas mother,",max_generate_tokens=10))

Alas mother, which didst, 
 By your daughter's my hate:

In [82]:

# Some examples of generated text with the trained model
shakespeare_quotes = [
    "All the world's a stage, and all the men and women merely players.",  # As You Like It (Act 2, Scene 7)
    "A fool thinks himself to be wise, but a wise man knows himself to be a fool.",  # As You Like It (Act 5, Scene 1)
    "How beauteous mankind is! O brave new world!",  # The Tempest (Act 5, Scene 1) – Miranda.
    "O brave new world, that has such people in't!",  # The Tempest (Act 5, Scene 1) – Miranda.
    "Love all, trust a few, do wrong to none.",  # All's Well That Ends Well (Act 1, Scene 1)
    "To be or not to be, that is the question.",  # Hamlet (Act 3, Scene 1)
]

for quote in shakespeare_quotes:
    try: 
        print(f"========== INPUT ==========")
        print(f"{' '.join(split_to_words(quote))}")
        print(f"========== INPUT + GENERATED TEXT ==========")
        print(generate_from_prompt(estimator,quote,max_generate_tokens=15))
        print(f"====================================")
    except Exception as e:
        # Some of those words weren't on our vocabulary so the model doesn't know what to do.
        print(f"Error generating from prompt: {e}")

========== INPUT ==========
All the world's a stage , and all the men and women merely players .
========== INPUT + GENERATED TEXT ==========
All the world's a stage, and all the men and women merely players. wanton's Drums 
 
 ability posts WILLOUGHBY sympathy pierced Blown Fitzwater longing need privileged candle
====================================
========== INPUT ==========
A fool thinks himself to be wise , but a wise man knows himself to be a fool .
========== INPUT + GENERATED TEXT ==========
A fool thinks himself to be wise, but a wise man knows himself to be a fool. Fortune steel'd 
 departure collar spread noble medlar you Rather Split'st unharm'd begins wolvish Bores
====================================
========== INPUT ==========
How beauteous mankind is ! O brave new world !
========== INPUT + GENERATED TEXT ==========
How beauteous mankind is! O brave new world! need hawk palsy bleared prevailing the, dined Beauty heels planets ruth 
 earliness 

====================================
========== INPUT ==========
O brave new world , that has such people in't !
========== INPUT + GENERATED TEXT ==========
O brave new world, that has such people in't!: suggest Padua Uninhabitable 
 ungracious convict unfinish'd Beg Coriolanus,, Capitol rate
====================================
========== INPUT ==========
Love all , trust a few , do wrong to none .
========== INPUT + GENERATED TEXT ==========
Love all, trust a few, do wrong to none. wavering harden'd, velvet 
 numbering wont 
 I Providence merriment cardinally whiteness labours 

====================================
========== INPUT ==========
To be or not to be , that is the question .
========== INPUT + GENERATED TEXT ==========
To be or not to be, that is the question., 
 dews rescue direful maidenhead, incensed parcels forswore bull penny inky Mist boasts
====================================

Positional Encoding¶

The output of a transformer is equivariant to a permutation of the entries of the time series. If we exchange the positions of $\mathbf{x}_t$ and $\mathbf{x}_u$, the output of the transformer remains the same, except that $[\Phi(\mathbf{X}_u, \mathcal{A})]_u$ and $[\Phi(\mathbf{X}_t, \mathcal{A})]_u$ also exchange places. Positional encoding is a strategy to break this symmetry so that words can have different effects depending on their positions.

We use oscillations to define positional encodings. For a time series made up of vectors $\mathbf{x}_t$ with $n$ entries, we define $n/2$ frequencies $\alpha_i$. For each of these frequencies, we define a time series $\mathbf{P}$ in which the values for time $t$ and index $i$ are given by

$$ p_{ti} = \begin{cases} \cos \left(2\pi\, \alpha_{(i+1)/2} \, \frac{t}{T}\right), & \text{if } i \text{ is odd} \\ \sin \left(2\pi\, \alpha_{i/2} \, \frac{t}{T}\right), & \text{if } i \text{ is even}. \end{cases} $$

As per the equation above, positional encoding includes sines and cosines of different frequencies in different rows of the positional encoding time series. Odd rows of $\mathbf{P}$ are cosines of frequency $\alpha_{(i+1)/2}$. Even rows of $\mathbf{P}$ are sines of frequency $\alpha_{i/2}$.

The use of sines and cosines in this context is motivated by the Fourier basis, which has intimate connections with convolution. This is a story for another day.

Task 10¶

Modify the transformer of Task 6.6 to incorporate positional encoding. Implement the positional encoding $\mathbf{P}$ as a learnable parameter.

In [18]:

class LanguageTransformerWithReadoutAndPositionalEncoding(nn.Module):
    """
    Modification of the LanguageTransformerWithReadout class of Task 7 to include positional encoding.
    Positional encoding is a learnable matrix that is added to the embeddings of the input tokens.
    
    Each entry in the positional encoding matrix is is a vector of size n that represents a position in the sequence.
    """
    def __init__(self, m, n, L, H, c):
        super(LanguageTransformerWithReadoutAndPositionalEncoding, self).__init__()
        self.layers = nn.ModuleList([
            MultiHeadLayer(m, n, H) for _ in range(L)
        ])

        # Learnable parameters for positional encoding. 
        # Each entry in the positional encoding matrix is is a vector of size n that represents a position in the sequence.
        self.position_embedding = nn.Embedding(T, n)

        self.embedding_table = pca_embeddings

        # Adding readout layer
        self.readout = nn.Parameter(torch.empty(c, n).to(device))
        nn.init.kaiming_uniform_(self.readout, a=math.sqrt(5))
        
    def forward(self, E):
        """
        We change the forward pass from the previous Transformer.
        Instead of concatenating a vector to the sequence, we now output a vector of probabilities for each word in the sequence.
        Args:
            E (torch.Tensor): The input word indices.
        Returns:
            Y_hat (torch.Tensor): The output of the transformer, passed through the readout layer.
        """
        B, T = E.shape

        # Word embeddings
        X = self.embedding_table[E].transpose(1,2) # (B, n, T)

        # To create positional encodings, we need to create a vector for each position in the sequence.
        P = self.position_embedding(torch.arange(T, device=device)).transpose(0,1) # (n, T)
        
        # Adding word embeddings and positional encoding
        # Although P is (n,T), this is broadcasted to (B, n, T), which means that the same 
        # positional encoding is added to every sequence in the batch.
        X_tilde = X + P
        
        # X_l has shape (B, n, T+1)
        X_l = X_tilde
        for layer in self.layers:
            X_l = layer(X_l)

        # We implement the readout layer as a linear mapping on each word in the sequence.
        Y_hat = torch.matmul(self.readout, X_l) # (B, c, T)

        # Notice, we don't apply the softmax here, because we keep the probabilities unnormalized until 
        # we call the loss function, for numerical stability.
        return Y_hat

# testing. Now the transformer outputs a vector of probabilities for each word in the sequence.
E = torch.randint(0, len(vocab), (1,5)).to(device).long()
print(f"E.shape: {E.shape}")
model = LanguageTransformerWithReadoutAndPositionalEncoding(m=32, n=256, L=6, H=8, c=c).to(device)
out = model(E)
print(f"out.shape: {out.shape}")

E.shape: torch.Size([1, 5])
out.shape: torch.Size([1, 14295, 5])

Practical Considerations¶

Layer Normalization¶

In order to improve training stability and convergence, one common implementation trick is to normalize the output vectors of a layer to have zero mean and unit variance. This is commonly referred to as layer normalization. If $\mathbf{X} = \mathbf{W}^\ell \mathbf{X}_\ell$ is the output of layer $\ell$, the normalized output $\hat{\mathbf{X}}$ at layer $\ell$ is computed as:

$$ \hat{\mathbf{X}}_{ij} = \gamma_i \cdot \frac{\mathbf{X}_{ij} – \mu_i}{\sqrt{\sigma_{i}^2 + \epsilon}} + \beta_i $$

Here, $\epsilon > 0$ is a small number to avoid dividing by zero, and $\mu$ and $\sigma^2$ are the row-wise mean and variance of the elements $\mathbf{X}_{ij}$ at layer $\ell$:

$$ \mu_i = \frac{1}{n} \sum_{j=1}^n \mathbf{X}_{ij} $$$$ \sigma_i^2 = \frac{1}{n} \sum_{j=1}^n (\mathbf{X}_{ij} – \mu_i)^2 $$

The learnable parameters $\gamma_i$ and $\beta_i$ play the role of recovering the mean and the variance. This might seem like we didn’t do anything, but now the learnable parameters do not depend on the computation of $\mathbf{X}$. This results in more stable training.

By normalizing each hidden vector, layer normalization helps to mitigate internal covariate shift and ensures more stable gradients during training.

Task 11¶

Modify the transformer of Task 6.10 to incorporate layer normalization at the output of each transformer layer. Use the PyTorch function nn.LayerNorm.

In [19]:

# We now need to modify both MultiHeadLayer and the LanguageTransformer class to include layer normalization.
class MultiHeadLayer(nn.Module):
    """
    A modified version of the MultiHeadLayer class with layer normalization.
    It will have two normalization layers, one after the multi-head attention and one after the nonlinearity.
    """
    def __init__(self, m, n, H):
        super(MultiHeadLayer, self).__init__()
        self.m = m
        self.H = H

        self.Q = nn.Parameter(torch.empty(H, m, n))
        self.K = nn.Parameter(torch.empty(H, m, n))
        self.V = nn.Parameter(torch.empty(H, m, n))

        self.W = nn.Parameter(torch.empty(n, m))
        
        # First layer normalization object.
        # Layernorm will average over the n dimensions of each element in the sequence.
        self.layer_norm1 = nn.LayerNorm(n)
        
        self.nonlinearity = nn.ReLU()
        
        # Second layer normalization object.
        self.layer_norm2 = nn.LayerNorm(n)
        
        self.initialize_parameters()

    def initialize_parameters(self):
        """
        Initialize the values of the learnable parameter matrices.
        Kaiming uniform is just a type of random initialization, you don't need to 
        worry about it. It is a good default initialization for linear layers.
        """
        nn.init.kaiming_uniform_(self.Q, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.K, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.V, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.W, a=math.sqrt(5))

    def forward(self, X):
        """
        Forward pass of the multihead attention layer with layer normalization.

        Args:
            X (torch.Tensor): The input embeddings.
        Returns:
            X_l (torch.Tensor): The output of the multihead attention layer.
        """
        B, n, T = X.shape  # X: (B, n, T)

        # First layer normalization.
        # An annoying Pytorch detail: layer norm function expects the normalization to be over the last dimension.
        # Therefore, we need to transpose the last two dimensions of the input to shape (B, T, n) each time we normalize, then transpose back.
        # (X.transpose(-2,-1) means that we are transposing over the last two dimensions)
        X = self.layer_norm1(X.transpose(-2,-1)).transpose(-2,-1)

        # Expand X to include the head dimension
        X_expanded = X.unsqueeze(1)  # (B, 1, n, T)

        # Compute QX, KX, VX for each head
        QX = torch.matmul(self.Q.unsqueeze(0), X_expanded)  # (B, H, m, T)
        KX = torch.matmul(self.K.unsqueeze(0), X_expanded)  # (B, H, m, T)
        VX = torch.matmul(self.V.unsqueeze(0), X_expanded)  # (B, H, m, T)
        
        QX_t = QX.transpose(-2, -1)  # (B, H, T, m)

        # Compute attention scores B per head
        B = torch.matmul(QX_t, KX)  # (B, H, T, T)
        A = F.softmax(B, dim=-1)
    
        A_t = A.transpose(-2,-1)
        VXA_t = torch.matmul(VX, A_t) # (B, H, m, T)
        Y = torch.matmul(self.W, VXA_t) # (B, H, T, n)

        # Second layer normalization. Transpose over the last two dimensions
        Y = self.layer_norm2(Y.transpose(-2,-1)).transpose(-2,-1)
        
        X_l = X + self.nonlinearity(Y.sum(dim=1))

        return X_l

# Testing the change
model = MultiHeadLayer(m=32, n=256, H=2).to(device)
X = torch.randn(1,256,5).to(device)
out = model(X)
print(f"out.shape: {out.shape}")

out.shape: torch.Size([1, 256, 5])

In [20]:

class LanguageTransformer(nn.Module):
    """
    Taken from Task 10 and added layer normalization. This is the final version of this class.
    """
    def __init__(self, m, n, L, H, c, T):
        super(LanguageTransformer, self).__init__()
        
        self.layers = nn.ModuleList([
            MultiHeadLayer(m, n, H) for _ in range(L)
        ])

        # PCA Word embeddings
        self.embedding_table = pca_embeddings
        
        # Positional encoding
        self.position_embedding = nn.Embedding(T, n)

        # Layer normalization
        self.layer_norm = nn.LayerNorm(n)

        # Adding readout layer
        self.readout = nn.Parameter(torch.empty(c, n).to(device))
        nn.init.kaiming_uniform_(self.readout, a=math.sqrt(5))
        
    def forward(self, E):
        """
        Args:
            E (torch.Tensor): The input word indices.
        Returns:
            Y_hat (torch.Tensor): The output of the transformer, passed through the readout layer.
        """
        B, T = E.shape

        # Word embeddings
        X = self.embedding_table[E].transpose(1,2) # (B, n, T)

        # To create positional encodings, we need to create a vector for each position in the sequence.
        P = self.position_embedding(torch.arange(T, device=device)).transpose(0,1) # (n, T)
        
        X_tilde = X + P
        
        # X_l has shape (B, n, T+1)
        X_l = X_tilde
        for layer in self.layers:
            X_l = layer(X_l)

        X_l = self.layer_norm(X_l.transpose(-2,-1)).transpose(-2,-1)

        # We implement the readout layer as a linear mapping on each word in the sequence.
        Y_hat = torch.matmul(self.readout, X_l) # (B, c, T)

        
        return Y_hat

# testing. 
E = torch.randint(0, pca_embeddings.shape[0], (1,5)).to(device).long()
print(f"E.shape: {E.shape}")
model = LanguageTransformer(m=32, n=256, L=6, H=8, c=c, T=5).to(device)
out = model(E)
print(f"out.shape: {out.shape}")

E.shape: torch.Size([1, 5])
out.shape: torch.Size([1, 14295, 5])

Future Masking¶

Notice that in Equation $\text{(13)}$, we have attention coefficients for each pair of words in a sequence. This means that our model can potentially learn to have $\mathbf{A}_{ij}$ will have nonzero attention even if the word $\mathbf{w}_j$ is ahead of the word $\mathbf{w}_i$. For tasks such as word generation, this is undesirable: we want to ensure that our attention coefficients only focus on past words, so that we can effectively predict the next one better.

We can apply future masking to ensure this. The idea is to reweight $\mathbf{a}_t$ so that the attention weight is zero for all the words beyond $t$:

$$ \mathbf{a}_t = [a_{1t}\ a_{2t}\ \dots a_{tt}\ 0\ \dots\ 0], $$

while making sure that the nonzero attention coefficients sum to 1.

An implementation trick to achieve this is to manually set the coefficients to $-\infty$ before passing them to softmax:

$$ \mathbf{B}_{ij} = [(\mathbf{Q}\mathbf{X})^T(\mathbf{K}\mathbf{X})]_{ij} $$$$ \tilde{\mathbf{B}}_{ij} = \begin{cases} \mathbf{B}_{ij} & \text{if } j > i \\ -\infty & \text{otherwise} \end{cases} $$

So for each head in $\text{(13)}$, $\mathbf{A}^h_\ell$ is replaced by

$$ \tilde{\mathbf{A}}^h_\ell = \text{sm}(\tilde{\mathbf{B}}^h_\ell). $$

Task 12¶

Modify the transformer of Task 6.11 to incorporate future masking at all layers.

In [ ]:

# this is a small example to gain an intuition of how masking will work. 
B = torch.randn(5,5)
print("B:")
display(B)
# FUTURE MASKING: 
# To mask attention, we create a matrix that indicates if an entry in B is a word in the future
mask = torch.triu(torch.ones(T, T), diagonal=1).to(device)
print()

# If an entry is in the future, we set it to -inf, 
# so that when we apply softmax, the probability of that word is 0, while 
# the rest of the words sum to 1.
B = B.masked_fill(mask == 1, float('-inf'))

In [21]:

# We need to modify both MultiHeadLayer and the LanguageTransformer class to include layer normalization.
class MultiHeadLayer(nn.Module):
    """
    A modified version of the MultiHeadLayer class with layer normalization.
    It will have two normalization layers, one after the multi-head attention and one after the nonlinearity.
    """
    def __init__(self, m, n, H):
        super(MultiHeadLayer, self).__init__()
        self.m = m
        self.H = H

        self.Q = nn.Parameter(torch.empty(H, m, n))
        self.K = nn.Parameter(torch.empty(H, m, n))
        self.V = nn.Parameter(torch.empty(H, m, n))

        self.W = nn.Parameter(torch.empty(n, m))
        
        # First layer normalization object.
        # Layernorm will average over the n dimensions of each element in the sequence.
        self.layer_norm1 = nn.LayerNorm(n)
        
        self.nonlinearity = nn.ReLU()
        
        # Second layer normalization object.
        self.layer_norm2 = nn.LayerNorm(n)
        
        self.initialize_parameters()

    def initialize_parameters(self):
        """
        Initialize the values of the learnable parameter matrices.
        Kaiming uniform is just a type of random initialization, you don't need to 
        worry about it. It is a good default initialization for linear layers.
        """
        nn.init.kaiming_uniform_(self.Q, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.K, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.V, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.W, a=math.sqrt(5))

    def forward(self, X):
        """
        Forward pass of the multihead attention layer with layer normalization.

        Args:
            X (torch.Tensor): The input embeddings.
        Returns:
            X_l (torch.Tensor): The output of the multihead attention layer.
        """
        B, n, T = X.shape  # X: (B, n, T)

        # First layer normalization.
        # An annoying Pytorch detail: layer norm function expects the normalization to be over the last dimension.
        # Therefore, we need to transpose the last two dimensions of the input to shape (B, T, n) each time we normalize, then transpose back.
        # (X.transpose(-2,-1) means that we are transposing over the last two dimensions)
        X = self.layer_norm1(X.transpose(-2,-1)).transpose(-2,-1)

        # Expand X to include the head dimension
        X_expanded = X.unsqueeze(1)  # (B, 1, n, T)

        # Compute QX, KX, VX for each head
        QX = torch.matmul(self.Q.unsqueeze(0), X_expanded)  # (B, H, m, T)
        KX = torch.matmul(self.K.unsqueeze(0), X_expanded)  # (B, H, m, T)
        VX = torch.matmul(self.V.unsqueeze(0), X_expanded)  # (B, H, m, T)
        
        QX_t = QX.transpose(-2, -1)  # (B, H, T, m)

        # Compute attention scores B per head
        B = torch.matmul(QX_t, KX)  # (B, H, T, T)

        # FUTURE MASKING: 
        # To mask attention, we create a matrix that indicates if an entry in B is a word in the future
        mask = torch.triu(torch.ones(T, T), diagonal=1).to(device)
        
        # If an entry is in the future, we set it to -inf, 
        # so that when we apply softmax, the probability of that word is 0, while 
        # the rest of the words sum to 1.
        B = B.masked_fill(mask == 1, float('-inf'))

        # Now when we apply softmax, only the words in the past are have nonzero probability.
        A = F.softmax(B, dim=-1)
    
        A_t = A.transpose(-2,-1)
        VXA_t = torch.matmul(VX, A_t) # (B, H, m, T)
        Y = torch.matmul(self.W, VXA_t) # (B, H, T, n)

        # Second layer normalization. Transpose over the last two dimensions
        Y = self.layer_norm2(Y.transpose(-2,-1)).transpose(-2,-1)
        
        X_l = X + self.nonlinearity(Y.sum(dim=1))

        return X_l

# Testing the change
model = MultiHeadLayer(m=32, n=256, H=2).to(device)
X = torch.randn(1,256,5).to(device)
out = model(X)
print(f"out.shape: {out.shape}")

out.shape: torch.Size([1, 256, 5])

Task 13¶

Modify the transformer in Task 6.12 to incorporate future masking in the attention layers.

In [27]:

# Training
n_epochs=5

L = 6
H = 8
m=32
n=256
lr = 1e-4
T=64
B=32

train_loader = DataLoader(train_dataset, batch_size=B, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=B, shuffle=False)

estimator = LanguageTransformer(m=m, n=n, L=L, H=H, c=c,T=T).to(device)
estimator.train()

# Here we use AdamW instead of SGD. It is just a different optimizer.
optimizer = optim.AdamW(estimator.parameters(), lr=lr)

# We use the Cross Entropy loss for estimating the probabilities of the next word.
cross_entropy_loss = nn.CrossEntropyLoss()

train_loss = []
for epoch in range(n_epochs): # Iterate over n_epochs epochs

    for x_batch, y_batch in tqdm(train_loader): # Iterate over all batches in the dataset 
        # (Step i) Load the data. These commands send the data to the GPU memory.
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)
        
        batch_size, T = x_batch.shape

        # (Step ii) Compute the gradients. We use automated differentiation.
        optimizer.zero_grad() # Gradient reset to indicate where the backward computation stops.

        # Call the neural network. In this case, we will take the average of the output of the
        # transformer as the prediction.
        y_hat = estimator(x_batch)

        # Reshape logits and y to be able to evaluate cross entropy on 
        # each token in the sequence.
        y_hat = y_hat.permute(0,2,1)
        y_hat = y_hat.reshape(batch_size * T, c)

        # Y should also be condensed into one dimension.
        y_batch = y_batch.view(batch_size * T, -1).squeeze()

        # When using cross entropy loss, we need to pass the target as a 1D tensor of class indices.
        # The softmax function is applied internally to the transformer's output y_hat.
        cross_entropy_value = cross_entropy_loss(y_hat,y_batch)

        cross_entropy_value.backward() # Compute gradients moving backwards untit the gradient reset.

        # (Step iii) Update parameters by taking an SGD (or other optimizer) step.
        optimizer.step()

        train_loss.append(cross_entropy_value.item())
    print(f"Epoch {epoch}/{n_epochs} Loss: {train_loss[-1]}")

    # End of batch loop.

estimator.eval()
with torch.no_grad():
    test_losses = []
    for x_batch, y_batch in tqdm(test_loader):
        x_batch = x_batch.to(device)
        batch_size, T = x_batch.shape
        y_batch = y_batch.to(device)

        optimizer.zero_grad()
        y_hat = estimator(x_batch)

        y_hat = y_hat.permute(0,2,1)
        y_hat = y_hat.reshape(batch_size * T, c)
        y_batch = y_batch.view(batch_size * T, -1).squeeze()
        cross_entropy_value = cross_entropy_loss(y_hat,y_batch)
        test_losses.append(cross_entropy_value.item())
    test_loss = torch.tensor(test_losses).mean().item()

print(f"Train loss: {train_loss[-1]}")
print(f"Test loss: {test_loss}")

100%|██████████| 8213/8213 [02:21<00:00, 58.23it/s]

Epoch 0/5 Loss: 3.9054839611053467

100%|██████████| 8213/8213 [02:21<00:00, 57.97it/s]
100%|██████████| 8213/8213 [02:21<00:00, 57.98it/s]
100%|██████████| 8213/8213 [02:21<00:00, 57.97it/s]
100%|██████████| 8213/8213 [02:21<00:00, 57.98it/s]

Task 14¶

Repeat the generative exercise in Task 6.9 using the transformer trained in Task 6.13.

In [31]:

initial_indices.shape

Out[31]:

torch.Size([1, 64])

In [32]:

starting_point = torch.randint(0, len(test)-T, (1,))
# Example sampled from test set.
initial_indices = test[starting_point:starting_point+T].unsqueeze(0)
print(f"========== INPUT TEXT ==========")
print(f"{tokens_to_words(initial_indices.reshape(-1).tolist())}\n")

# This is the model from task 7
print(f"========== INPUT + GENERATED TEXT ==========")
print(generate_text(estimator, initial_indices, max_generate_tokens=100))
print(f"====================================")

========== INPUT TEXT ==========
despair not. 
 
 HORTENSIO: 
 Madam, tis now in tune. 
 
 LUCENTIO: 
 All but the base. 
 
 HORTENSIO: 
 The base is right; tis the base knave that jars. 
 How fiery and forward our pedant is! 
 Now, for my life, the knave doth court my love

========== INPUT + GENERATED TEXT ==========
despair not. 
 
 HORTENSIO: 
 Madam, tis now in tune. 
 
 LUCENTIO: 
 All but the base. 
 
 HORTENSIO: 
 The base is right; tis the base knave that jars. 
 How fiery and forward our pedant is! 
 Now, for my life, the knave doth court my love! 
 Sirs him there: how fares I'll write; I give them 
 To the heir night that I shall plead 
 And Juliet without good to go. 
 What, ho there! fear, let's away. 
 
 COMINIUS: 
 My lord; I hear it well. 
 
 POLIXENES: 
 Do not pray, sir, be gone: come, 
 I come get your brother. 
 
 FLORIZEL: 
 Let them accuse, 
 In your people to our peace else: pray, play 

====================================

Compare the following with the tranformer we trained without positional encoding and other implementation tricks:

In [70]:

# Some examples of generated text with the trained model
shakespeare_quotes = [
    "All the world's a stage, and all the men and women merely players.",  # As You Like It (Act 2, Scene 7)
    "A fool thinks himself to be wise, but a wise man knows himself to be a fool.",  # As You Like It (Act 5, Scene 1)
    "How beauteous mankind is! O brave new world!",  # The Tempest (Act 5, Scene 1) – Miranda.
    "O brave new world, that has such people in't!",  # The Tempest (Act 5, Scene 1) – Miranda.
    "Love all, trust a few, do wrong to none.",  # All's Well That Ends Well (Act 1, Scene 1)
    "To be or not to be, that is the question.",  # Hamlet (Act 3, Scene 1)
]

for quote in shakespeare_quotes:
    try: 
        print(f"========== INPUT ==========")
        print(f"{' '.join(split_to_words(quote))}")
        print(f"========== INPUT + GENERATED TEXT ==========")
        print(generate_from_prompt(estimator,quote,max_generate_tokens=15))
        print(f"====================================")
    except Exception as e:
        # Some of those words weren't on our vocabulary so the model doesn't know what to do.
        print(f"Error generating from prompt: {e}")

========== INPUT ==========
Alas mother ,
========== INPUT + GENERATED TEXT ==========
Alas mother, 
 All children puts your suit 
 Marcus that, 
 That taught was for herself. Why, 
 It is life and prey to play their chamber! and have I, seeing thou tapp'd cousin, but hear have effect my forest woods for us his face bride canst, 
 That thus affect'st disorder'd: 
 I thank an elder; 
 Let me, 
 Dare as a judge. 
 
 BENVOLIO: 
 Out, unvalued this: 
 I think not; 
 Lest them, or stones? anon, ho
====================================
========== INPUT ==========
All the world's a stage , and all the men and women merely players .
========== INPUT + GENERATED TEXT ==========
All the world's a stage, and all the men and women merely players. 
 
 ABHORSON: 
 Who is't of your highness? 
 
 ISABELLA: 
 Friar but my the wrong; for, sir, now, I must also. 
 
 ABHORSON: 
 We take you to you both some news: I must, and you may it, 
 give it your honour's, shall not be true; 
 For I be your best- bed, 
 Against the duty be join'd: although your natural stay, 
 And follow for't. 
 
 LUCIO: 
 
 ANGELO: 

====================================
========== INPUT ==========
A fool thinks himself to be wise , but a wise man knows himself to be a fool .
========== INPUT + GENERATED TEXT ==========
A fool thinks himself to be wise, but a wise man knows himself to be a fool. 
 
 ABHORSON: 
 Pray, let it strike from the right- place; 
 Be your due. O and that's the boy, 
 When you are weary most set, he hath done 
 I'll hold him for a man; no man could not 
 desire his issue grace. 
 
 SICINIUS: 
 But, if you love. 
 
 MENENIUS: 
 That's o'er hands Edward, Myself, and death again, Angelo. 
 
 RICHMOND: 
 The princes and noble King Henry- tree 
 Is
====================================
========== INPUT ==========
How beauteous mankind is ! O brave new world !
========== INPUT + GENERATED TEXT ==========
How beauteous mankind is! O brave new world! 
 Never what news! arm, and not I; 
 For he hath a man, some power in the rain. 
 
 Third Watchman: 
 Tis so, my good lord. 
 
 LEONTES: 
 Why, then, then, tis well better to be; 
 The blood that art not, to the nobility doth ride 
 An cold makes what I return to be at home, 
 When he. My liege, if you were gracious people, 
 Your ancient father to make his heavenly Edward
====================================
========== INPUT ==========
O brave new world , that has such people in't !
========== INPUT + GENERATED TEXT ==========
O brave new world, that has such people in't! 
 
 QUEEN MARGARET: 
 Thou hast the throne; and thou shalt be proclaim'd, 
 Be valiant to thy purpose. 
 
 HENRY BOLINGBROKE: 
 Kind oath, Catesby, wronged; and, friends, and dare, and-- 
 
 FROTH: 
 Nay, take your honour. 
 
 First Senator: 
 Be you the city at your sin. 
 
 ISABELLA: 
 Nay, dispatch, indeed, she speaks your love length. 
 
 LADY CAPULET: 
 Are down have I,
====================================
========== INPUT ==========
The fault , dear Brutus , is not in our stars , but in ourselves , that we are underlings .
========== INPUT + GENERATED TEXT ==========
Error generating from prompt: 'underlings'
========== INPUT ==========
To be , or not to be : that is the question .
========== INPUT + GENERATED TEXT ==========
To be, or not to be: that is the question. 
 
 LUCIO: 
 
 Provost: 
 What further abroad abroad? 
 
 POMPEY: 
 Thou art a fool, and not a need, 
 Soon, dishonour than thy riches, hath lengthen'd 
 Till forth and empty; he bears my oath, 
 To meet and answer, I beseech you, now: out 
 And steal such a life? 
 
 MAMILLIUS: 
 I learnt your moved the wrong, ask. 
 
 DUCHESS OF YORK: 
 Either me my royal use? I am heard
====================================
========== INPUT ==========
Cowards die many times before their deaths ; the valiant never taste of death but once .
========== INPUT + GENERATED TEXT ==========
Error generating from prompt: 'Cowards'
========== INPUT ==========
The better part of Valour , is Discretion .
========== INPUT + GENERATED TEXT ==========
Error generating from prompt: 'Valour'
========== INPUT ==========
Love all , trust a few , do wrong to none .
========== INPUT + GENERATED TEXT ==========
Love all, trust a few, do wrong to none. 
 
 GLOUCESTER: 
 Welcome, how now the mark Juliet up, 
 To know the king is nothing: meantime my death 
 Persuade all but seldom for that time means; and, as cheap, 
 Take more than one, more to kiss the open 
 As we should do, for themselves for the gods serve. 
 Meantime, Romeo! what a one was gadding is to live. 
 
 GLOUCESTER: 
 What ever the sun devise out to- morrow morning? 
 Ah, what a man
====================================
========== INPUT ==========
Some are born great , some achieve greatness , and some have greatness thrust upon them .
========== INPUT + GENERATED TEXT ==========
Some are born great, some achieve greatness, and some have greatness thrust upon them. 
 I saw his soul: and his noble heart 
 To undertake we home you and your grace 
 Upon his party with him. And to tell him: 
 An the glory is he tyrannous, and took 
 A moiety like the witness o the worst. 
 
 MENENIUS: 
 The consent and stand his again is out: 
 A most heir- sickness will command and 
 Only their swords. 
 
 ISABELLA: 
 Hath not the matter; the people they shall be gone. 
 
 AUTOLYCUS:
====================================
========== INPUT ==========
The lady doth protest too much , methinks .
========== INPUT + GENERATED TEXT ==========
The lady doth protest too much, methinks. 
 Hence, gentle Richard: they must have used 
 Till, as mistaken were. 
 
 LUCIO: 
 Then shall know my constant may do this brawl. 
 
 LADY ANNE: 
 If I had the grace so fair up. 
 
 QUEEN MARGARET: 
 So much will all within within that are mad 
 Betwixt to nose that is a king of grief. 
 
 KING RICHARD II: 
 What if you will, I hear. 
 
 BARNARDINE: 
 Nay, nurse, gentlemen; grieve
====================================
========== INPUT ==========
Good night , good night ! Parting is such sweet sorrow , that I shall say good night till it be morrow .
========== INPUT + GENERATED TEXT ==========
Error generating from prompt: 'Parting'
========== INPUT ==========
We are such stuff as dreams are made on , and our little life is rounded with a sleep .
========== INPUT + GENERATED TEXT ==========
Error generating from prompt: 'rounded'
========== INPUT ==========
But , soft ! What light through yonder window breaks ? It is the east , and Juliet is the sun .
========== INPUT + GENERATED TEXT ==========
But, soft! What light through yonder window breaks? It is the east, and Juliet is the sun. 
 good uncle not- beards. 
 
 SICINIUS: 
 Here. 
 
 BRUTUS: 
 Attend your tribunes? 
 
 BRUTUS: 
 Let's with point counsel safe. 
 
 MENENIUS: 
 With fair gentlewoman, and make one, and go on, 
 Have more given now and made us ill lies. 
 Have you no man that, he has amen fled. 
 
 BRAKENBURY: 
 I do, my lord, I would not move you; but what he hath? 
 
 CAPULET:
====================================
========== INPUT ==========
Good night , good night ! Parting is such sweet sorrow , that I shall say good night till it be morrow .
========== INPUT + GENERATED TEXT ==========
Error generating from prompt: 'Parting'

In honor of the best Miranda.