Download the Jupyter Notebook for this lab at the following link.
Download the data for the lab at the following link.
Download the PDF assignment for the lab at the following link.
1 Collaborative Filtering¶
The objective of this lab is to design a recommendation system that predicts the ratings that customers would give to a certain product. Say, the rating that a moviegoer would give to a specific movie, or the rating that an online shopper would give to a particular offering. A possible approach to making these predictions is to leverage the ratings that customers have given to this or similar products in the past. This is called collaborative filtering.
We model collaborative filtering as a problem that involves a graph $\mathbf{S}$ and graph signals $\mathbf{x}_u$ supported on the nodes of the graph. Nodes $p$ of the graph $\mathbf{S}$ represent different products and the weighted edges $S_{pq}$ denote similarities between products $p$ and $q$. The entries of the graph signal $\mathbf{x}_u$ represent the ratings that user $u$ has given to each product.
The motivation for developing a recommendation system is that customers have rated only a subset of products. Each of us has seen a small number of movies or bought a small number of offerings. Thus, the ratings vector $\mathbf{x}_u$ contains entries only for the products that have been rated. For the remaining entries we adopt the convention that $x_{up}=0$. Our goal is to learn a function $\Phi(\mathbf{x}_{u}; \mathcal{H})$ that takes as input the vector $\mathbf{x}_u$ of available ratings and fills in predictions for the ratings that are not available.
We will develop and evaluate a graph neural network (GNN) for making these rating predictions.
Movie Rating Data¶
As a specific example, we use the MovieLens-100k dataset. The MovieLens-100k dataset consists of ratings given by $U$ users to $P$ movies (products). The existing movie ratings are integer values between 1 and 5. Therefore, the data are represented by a $U\times P$ matrix $X$ where $x_{up}$ is the rating that user $u$ gives to movie $p$. If user $u$ has not rated movie $p$, then $x_{up} = 0$. We see that each row of this matrix corresponds to a vector of ratings $\mathbf{x}_u$ of a particular user.
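As an illustration of this format, the sketch below assembles a small rating matrix from rating triplets. This is not part of the lab code; it assumes the raw ratings are given as (user, movie, rating) triplets with 1-based indices, as in the u.data file of MovieLens-100k.
import numpy as np
# Minimal sketch: build the U x P rating matrix from (user, movie, rating)
# triplets. Unrated entries remain 0 by convention.
def build_rating_matrix(triplets, U, P):
    X = np.zeros((U, P))
    for u, p, r in triplets:
        X[u - 1, p - 1] = r
    return X
# Toy example with two users, three movies, and three observed ratings
X_toy = build_rating_matrix([(1, 2, 5), (1, 3, 4), (2, 2, 1)], U=2, P=3)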
To build the collaborative filtering system, we use the rating history of all movies to compute a graph of movie similarities. In this graph, edge weights are cross-correlations of movie ratings. The graph can be constructed from the raw $U\times P$ movie rating matrix, but to make matters simpler we have constructed this graph already and are making it available as part of the dataset.
To train this collaborative filtering system we use rating histories to create a dataset with entries $(\mathbf{x}_n, y_n, p_n)$. In these entries $\mathbf{x}_n$ is a vector that contains the ratings of a particular user, $y_n$ is a scalar that contains a rating that we want to predict, and $p_n$ is the index of the movie (product) that corresponds to the rating $y_n$. To evaluate this collaborative filtering system we use rating histories to create a dataset with entries having the same format. Both of these datasets can be constructed from the raw $U\times P$ movie rating matrix, but to make matters simpler we have constructed them already and are making them available as part of the dataset.
Environment setup¶
Before we begin, we need to import the necessary Python packages. We use NumPy to load data, PyTorch for training, and Matplotlib to plot and visualize results.
import numpy as np
import torch; torch.set_default_dtype(torch.float64)
import torch.nn as nn
import torch.optim as optim
import copy
import pickle as pkl
import math
import matplotlib.pyplot as plt
from tqdm import tqdm
device = 'cuda' if torch.cuda.is_available() else 'cpu'
1.1 Product Similarity Graph¶
To build the collaborative filtering system, we use the rating history of all movies and all users to compute a graph of product similarities. This is a graph in which nodes $p$ represent different movies and weighted edges $S_{pq}$ denote similarities between products $p$ and $q$. The edges of the graph are grouped in the adjacency matrix $\mathbf{S}$.
To compute the entries $S_{pq}$ of the product similarity graph we use the raw $U \times P$ movie rating matrix to evaluate cross-correlations between the ratings of products $p$ and $q$. To make matters simpler we have constructed this graph already and are making it available as part of the dataset.
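The sketch below illustrates one way such a similarity graph could be computed from the raw rating matrix. It is not the exact procedure used to generate the graph provided in "movie_data.p"; in particular, using the Pearson correlation over commonly rated movies and zeroing the diagonal are simplifying choices made only for illustration.
# Hypothetical sketch: build a movie similarity graph from a U x P rating
# matrix X by correlating the ratings of pairs of movies over the users
# that rated both. The graph used in this lab is precomputed and loaded below.
def similarity_graph(X):
    P = X.shape[1]
    S = np.zeros((P, P))
    for p in range(P):
        for q in range(P):
            common = (X[:, p] > 0) & (X[:, q] > 0)  # users who rated both
            if common.sum() > 1:
                S[p, q] = np.corrcoef(X[common, p], X[common, q])[0, 1]
    np.fill_diagonal(S, 0)       # no self-loops
    return np.nan_to_num(S)      # constant columns yield NaN correlations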
TASK 1¶
Download the movie rating data to your computer and upload the data “movie_data.p” to this processing environment. Plot the adjacency matrix $\mathbf{S}$ as an image.
Report deliverable: Adjacency matrix plotted as an image
# The following command loads the data. For this to work the file "movie_data.p"
# must be uploaded to this notebook. To do that, navigate to the Jupyter Notebook
# interface home page and use the “Upload” option.
contents = pkl.load(open("movie_data.p","rb"))
data = contents['data']
# The data file contains a graph of movie similarities. The graph has P nodes.
# This is the graph we will use to run the GNN.
S = data['S']
P = S.shape[0]
# The data file contains a training set with entries (x_n, y_n, p_n).
# These entries are stored in the arrays xTrain, yTrain, and pTrain,
# respectively. The first dimension of each of these arrays indexes the
# nTrain entries of the dataset.
xTrain = data['train']['x'] # Available ratings for making predictions.
yTrain = data['train']['y'] # Rating to be predicted
pTrain = data['train']['p'] # Index of the movie whose rating is being predicted
nTrain = xTrain.shape[0]
# Same thing with test data
xTest = data['test']['x']
yTest = data['test']['y']
pTest = data['test']['p']
nTest = xTest.shape[0]
# Plot the adjacency matrix S
plt.figure(1)
plt.title('Pattern of the Movie Recommendation Adjacency Matrix')
plt.xlabel('Movie index (p)')
plt.ylabel('Movie index (q)')
plt.imshow(S.numpy())
[Figure: Pattern of the Movie Recommendation Adjacency Matrix]
Successful completion of Task 1 should have produced the plot shown above.
In the figure above each bright dot corresponds to a large entry $S(p,q)$. This denotes a pair of movies to which watchers tend to give similar scores. For instance, say that when someone scores “Star Wars IV” highly, they are likely to score “Star Wars V” highly, and that the converse is also true: poor scores in one correlate with poor scores in the other. The entry $S(\text{“Star Wars IV”}, \text{“Star Wars V”})$ is then large.
Fainter entries $S(p,q)$ denote pairs of movies with weaker score correlation, perhaps between “Star Wars” and “Star Trek,” which have overlapping but not identical fan bases. Dark entries $S(p,q)$ correspond to pairs of movies with no correlation between audience scores, say when $p$ is the index of “Star Wars” and $q$ is the index of “Little Miss Sunshine.”
1.2 Rating Signals¶
The vector of ratings $\mathbf{x}_u$ of a particular user is interpreted as a signal supported on the graph. That is, a signal in which the pth component $x_{up}$ is associated with node $p$. In this context, the weights of the product similarity graph become an expectation of similarity between ratings $x_{up}$ and $x_{uq}$. If $S_{pq}$ is large we expect these ratings to be similar. If $S_{pq}$ is small we have no expectation of proximity or not between them.
We then have a system with the architecture shown in Figure 3. Rating signals $x_u$ of individual users are extracted from the raw rating matrix and are interpreted as signals supported on the graph $\mathbf{S}$ that we loaded in Task 1. We want to leverage the graph $\mathbf{S}$ to make rating predictions $y_u$ for this particular user.
We will, more precisely, develop and evaluate a graph filter and a graph neural network (GNN) for making these rating predictions.
1.3 Rating Data Format and Rating Loss¶
To train the collaborative filtering system we use rating histories to create a dataset with entries ($\mathbf{x}_n$, $y_n$, $p_n$). In these entries $\mathbf{x}_n$ is a vector that contains the ratings of a particular user, $y_n$ is a scalar that contains a rating that we want to predict, and $p_n$ is the index of the movie (product) that corresponds to the rating $y_n$. To evaluate this collaborative filtering system we use rating histories to create a dataset with entries having the same format. Both of these datasets can be constructed from the raw $U \times P$ movie rating matrix, but to make matters simpler we have constructed them already and are making them available as part of the dataset.
If we have a function $\hat{\mathbf{y}}_n=\Phi(\mathbf{x}_{n}; \mathcal{H})$ that makes rating predictions out of available ratings, we can evaluate the goodness of this function with the squared loss
\begin{align}\tag{1} \ell \big(\Phi(\mathbf{x}_{n}; \mathcal{H}), y_{n} \big) ~=~ \frac{1}{2} \Big[ \big(\hat{\mathbf{y}}_n \big)_{p_n} - y_n \Big]^2 ~=~ \frac{1}{2} \Big[ \big( \Phi(\mathbf{x}_{n}; \mathcal{H}) \big)_{p_n} - y_n \Big]^2. \end{align}Notice that in this expression the function $\hat{\mathbf{y}}_n=\Phi(\mathbf{x}_{n}; \mathcal{H})$ makes predictions for all movies. However, we isolate entry $p_n$ and compare it against the rating $y_n$. We do this because the rating $y_n$ of movie $p_n$ is the one we have available in the training or test sets.
We remark that the fact that the function $\hat{\mathbf{y}}=\Phi(\mathbf{x}; \mathcal{H})$ makes predictions for all movies is important during operation. The idea of the recommendation system is to identify the subset of products that the customer would rate highly. These are the ones we will recommend. This is why we want a system that has a graph signal as an output even though the available dataset has scalar outputs.
def movieMSELoss(yHat, y, idxMovie):
    """
    Evaluates the squared error between the predicted rating of the movie
    indexed by idxMovie and the observed rating y. The function accepts as
    input tensors with multiple movie ratings.
    Inputs:
        yHat: torch.Tensor (NumUsers x 1 x NumMovies)
            A set of rating estimates. It includes estimates for all movies.
        y: torch.Tensor (NumUsers x 1)
            The observed rating of a specific movie for each user.
        idxMovie: torch.Tensor (NumUsers)
            The index of the movie whose rating is given by y.
    Outputs:
        mse: torch.Tensor
            Computed mean squared error.
    """
    # The .squeeze() method is needed for dimension match between y and yHat
    yHat = yHat.squeeze()
    # Isolate, for each sample, the prediction for its own movie index
    prediction = yHat[torch.arange(yHat.shape[0]), idxMovie.long()]
    return torch.mean((prediction - y.squeeze())**2)
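As a quick sanity check, the loss can be exercised on toy tensors whose shapes follow the docstring; the numbers below are made up for illustration.
# Toy example: 3 users, 4 movies. Predictions exist for all movies, but the
# loss only compares the entry of the movie indexed by idxMovie.
yHat_toy = torch.rand(3, 1, 4)                # NumUsers x 1 x NumMovies
y_toy = torch.tensor([[4.0], [2.0], [5.0]])   # NumUsers x 1
idx_toy = torch.tensor([1, 0, 3])             # movie rated by each user
print(movieMSELoss(yHat_toy, y_toy, idx_toy))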
2 Graph Convolutions¶
Let $\mathbf{S}$ denote a matrix representation of a graph. Supported on the nodes of the graph we are given a graph signal $\mathbf{x}$. We also consider a set of $K$ coefficients $h_k$ for $k = 0$ to $k = K-1$. A graph convolutional filter is a linear map acting on $\mathbf{x}$ defined as a polynomial on the matrix representation of the graph with coefficients $h_k$,
\begin{equation}\tag{2} \mathbf{z} = \mathbf{h} *_{\mathbf{S}} \mathbf{x} = \sum_{k=0}^{K-1} h_k \mathbf{S}^k \mathbf{x} \end{equation}Graph convolutions generalize convolutions in time to graphs. That this is true can be seen if we represent time with a directed line graph. Considering (2) for the particular case in which $\mathbf{S}$ is the adjacency matrix of this line graph, the product $\mathbf{S}^k\mathbf{x}$ results in a $k$-shift of the time signal $\mathbf{x}$. For this reason we sometimes refer to $\mathbf{S}$ as a shift operator. We also point out that although we work with an adjacency matrix in this lab, any matrix representation of the graph can be used in (2).
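Before moving on, it may help to see (2) evaluated for a single graph signal. The short sketch below is our own illustration and is not part of the lab code; the filter taps h are given as a list of scalars.
# Minimal sketch of (2): z = sum_k h_k S^k x for a single graph signal x.
# S is an N x N shift operator, x an N-vector, h a list of K scalar taps.
def graph_convolution(h, S, x):
    z = torch.zeros_like(x)
    xk = x.clone()              # S^0 x = x
    for hk in h:
        z = z + hk * xk         # accumulate h_k S^k x
        xk = S @ xk             # diffuse: S^{k+1} x
    return z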
One advantage that graph filters share with time convolutions is their locality. To see this, define the diffusion sequence as the collection of graph signals $\mathbf{u}_k = \mathbf{S}^k\mathbf{x}$ and rewrite the filter in (2) as,
\begin{equation}\tag{3} \mathbf{z} = \sum_{k=0}^{K-1} h_k \mathbf{S}^k \mathbf{x} = \sum_{k=0}^{K-1} h_k \mathbf{u}_k \end{equation}It is easy to see that the diffusion sequence is given by the recursion $\mathbf{u}_k = \mathbf{S}\mathbf{u}_{k-1}$ with $\mathbf{u}_0 = \mathbf{x}$. Further observing that $S_{ij} \neq 0$ only when the pair $(i, j)$ is an edge of the graph, we see that the entries of the diffusion sequence satisfy
\begin{equation}\tag{4} u_{k,i} \ = \ \sum_{j:(i,j)\in\mathcal{E}} S_{ij} u_{k-1,j}. \end{equation}We can therefore interpret graph filters as operators that propagate information through adjacent nodes. This is analogous to the propagation of information in time with the application of time shifts. The locality of graph convolutions is one of the motivations for their use in the processing of information supported on graphs. The other reason is their equivariance to permutations.
Because it aggregates the information from neighboring nodes with a weighted sum, the operation in $(4)$ is sometimes called an aggregation operation. Because it aggregates at node $i$ information that is passed from adjacent nodes $j$, we sometimes say that graph filters are message-passing architectures, and the GNNs that are derived from them are called message-passing GNNs.
2.1 Graph Convolutions with Multiple Features¶
To increase the representation power of graph filters we extend them to incorporate multiple features. In these filters the input is a matrix $\mathbf{X}$ and the output is another matrix $\mathbf{Z}$. The filter coefficients are matrices $\mathbf{H}_k$ and the filter itself is a generalization of $(2)$ in which the matrices $\mathbf{H}_k$ replace the scalars $h_k$,
\begin{equation}\tag{5} \mathbf{Z} = \sum_{k=0}^{K-1} \mathbf{S}^k \mathbf{X} \mathbf{H}_k \end{equation}In $(5)$, the input feature matrix $\mathbf{X}$ has dimension $N \times F$ and the output feature matrix $\mathbf{Z}$ has dimension $N \times G$. This means that each of the $F$ columns of $\mathbf{X}$ represents a separate input feature, whereas each of the $G$ columns of $\mathbf{Z}$ represents an output feature. To match dimensions, the filter coefficient matrices $\mathbf{H}_k$ must be of dimension $F \times G$.
Other than the fact that it represents an input-output relationship between matrices instead of vectors, $(5)$ has the same structure as $(2)$.
In particular, we can define a diffusion sequence $\mathbf{U}_k$ through the recursion $\mathbf{U}_k = \mathbf{S}\mathbf{U}_{k-1}$, with $\mathbf{U}_0 = \mathbf{X}$, and rewrite $(5)$ as
\begin{equation}\tag{6} \mathbf{Z} = \sum_{k=0}^{K-1} \mathbf{S}^k \mathbf{X} \mathbf{H}_k = \sum_{k=0}^{K-1} \mathbf{U}_k \mathbf{H}_k \end{equation}This is worth remarking because we can write the diffusion sequence as a message-passing aggregation operation. Indeed, if $\mathbf{u}_{k,i}$ is the $i$th row of $\mathbf{U}_k$ we can write the diffusion sequence recursion as
\begin{equation}\tag{7} \mathbf{u}_{k,i} \ = \ \sum_{j:(i,j)\in\mathcal{E}} S_{ij} \mathbf{u}_{k-1,j}. \end{equation}In $(7)$, nodes $j$ in the neighborhood of $i$ pass the message $\mathbf{u}_{k-1,j}$. Node $i$ aggregates these messages to create the updated message $\mathbf{u}_{k,i}$ that it passes on to its neighbors.
TASK 3¶
Write a function that implements a graph filter. This function takes as inputs the shift operator $\mathbf{S}$, the filter coefficients $\mathbf{H}_k$, and the input signal $\mathbf{X}$. To further improve practical performance we add a bias term $\mathbf{B}$ to the filter operation. That is, we refine $(5)$ with the operation,
\begin{equation}\tag{8} \mathbf{Z} = \sum_{k=0}^{K-1} \mathbf{S}^k \mathbf{X} \mathbf{H}_k + \mathbf{B} \end{equation}The bias $\mathbf{B}$ is also passed as a parameter to the filter function.
Report deliverable: Do not report
def FilterFunction(x, h, S, b):
'''
The following function defines a Graph Filter.
Inputs:
x: Input to Graph Filter
h: Filter
S: Graph Shift Operator
b: Bias term
    Outputs:
y: Output of Graph Filter
'''
# X is B x G x N
# S is NxN
B, G, N = x.shape
K, _, F = h.shape
    # Create the output on the same device as the input signal
    y = torch.zeros((B, N, F), device=x.device)
# The following for-loop is utilized to perform the graph diffusions
for k in range(K):
# sum S^k x * h_k
        # We permute the dimensions of x to get B x N x G
        # The filter tap h[k] has dimensions G x F
y += torch.matmul(x.permute((0,2,1)), h[k])
# diffuse signal
x = torch.matmul(x, S)
# y has dims B x N x F,
# add bias of shape F
y = y + b
# permute dimensions to get the desired dimensions B x F x N
y = y.permute(0, 2, 1)
return y
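A quick shape check of FilterFunction on random tensors can help catch dimension errors before training; the dimensions below are chosen arbitrarily for illustration.
# Shape check: batch of 4 signals with 1 input feature on a 10-node graph,
# filtered with a 3-tap filter that produces 8 output features.
B_, G_, N_, K_, F_ = 4, 1, 10, 3, 8
x_check = torch.rand(B_, G_, N_)
h_check = torch.rand(K_, G_, F_)
S_check = torch.rand(N_, N_)
b_check = torch.zeros(F_)
print(FilterFunction(x_check, h_check, S_check, b_check).shape)  # B x F x N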
TASK 4¶
Train a graph filter to predict movie ratings. Plot the evolution of the training loss and evaluate the loss on the test dataset. To obtain a good loss we need to experiment with the length of the filter, that is, the number of filter taps $K$.
Report deliverables:
- Number of filter taps used
- Plot with training loss and test loss
- Train and test loss at stopping time
class FilterModule(nn.Module):
#First we initialize the graph filter class and define the attributes
def __init__(self, k, f_in, f_out):
"""
initialize graph filter
:param k: order of the filter
:param f_in: filter input dimension
:param f_out: filter output dimension
"""
super().__init__()
self.k = k
self.f_in = f_in
self.f_out = f_out
# initialize weights and bias
self.bias = nn.parameter.Parameter(torch.Tensor(self.f_out))
self.h = nn.Parameter(torch.randn(self.k, self.f_in, self.f_out))
self.reset_parameters()
def reset_parameters(self):
stdv = 1. / math.sqrt(self.f_in * self.k)
self.h.data.uniform_(-stdv, stdv)
self.bias.data.uniform_(-stdv, stdv)
def forward(self, x, S):
"""
PyTorch module interface implementation; just apply the filter
"""
return FilterFunction(x, self.h, S, self.bias)
We now define a graph filter that combines a bank of $f$ filters with a readout layer producing a single output feature.
class GraphFilter(nn.Module):
def __init__(self, k, f):
"""
initialize graph filter
:param k: order of the filters
:param f: number of filters
"""
super().__init__()
self.k = k
self.f = f
# one input feature, f filter outputs
self.filter_bank = FilterModule(self.k, 1, self.f)
# add last linear layer, output dim=1
self.readout = FilterModule(1, self.f, 1)
def forward(self, x, S):
x = self.filter_bank(x, S)
return self.readout(x, S)
We now instantiate a Graph filter with the following properties:
- Filter order: 5
- Output Features: 64
k = 5
f = 64
model = GraphFilter(k, f)
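Before training, it can be useful to verify that the untrained model produces outputs of the expected shape. This check is our own addition; it assumes that xTrain and S are stored as PyTorch tensors, as in the provided data file.
# Sanity check: forward pass of the untrained model on two training signals.
# The output should contain one rating prediction per movie for each user.
with torch.no_grad():
    yHatCheck = model(xTrain[:2], S)
print(yHatCheck.shape)  # expected: 2 x 1 x P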
Training method and parameters¶
Below we define the training parameters and specify the optimization methods.
# learning parameters
nEpochs = 5
learningRate = 0.01
batchSize = 20
optimizer = optim.Adam(model.parameters(), lr=learningRate)
Training loop¶
Below is the main training loop implementation. This is a straightforward implementation based on a fixed number of epochs.
# Helper variables for data loading during training.
batchIndex = np.append(np.arange(0, nTrain, batchSize), nTrain)
nBatches = int(np.ceil(nTrain/batchSize))
lossTrain = []
lossTest = []
for epoch in range(nEpochs):
randomPermutation = np.random.permutation(nTrain)
idxEpoch = [int(i) for i in randomPermutation]
epoch_loss=0
for batch in tqdm(range(nBatches)):
# Determine batch indices
thisBatchIndices = idxEpoch[batchIndex[batch]
: batchIndex[batch+1]]
# Get the samples in this batch
xTrainBatch = xTrain[thisBatchIndices,:,:]
yTrainBatch = yTrain[thisBatchIndices]
pTrainBatch = pTrain[thisBatchIndices]
model.zero_grad()
# Obtain the output of the architectures
yHatTrainBatch = model(xTrainBatch, S)
# Compute loss
lossValueTrain = movieMSELoss(yHatTrainBatch, yTrainBatch, pTrainBatch)
# Compute gradients
lossValueTrain.backward()
# Update parameters
optimizer.step()
#accumulate loss
epoch_loss+=lossValueTrain.item()
lossTrain.append(lossValueTrain.item())
    # Now compute the test loss at the end of the epoch
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():  # No need to track gradients
        # Evaluate on the test set (xTest, yTest, pTest)
        yHatTest = model(xTest, S)
        test_loss = movieMSELoss(yHatTest, yTest, pTest).item()  # Compute test loss
        lossTest.append(test_loss)  # Store test loss
    model.train()  # Return to training mode before the next epoch
    print(f"Epoch: {epoch+1} \t Loss: {epoch_loss/nBatches:.2f} \n")
100%|█████████████████████████████████████████| 138/138 [00:07<00:00, 19.24it/s]
Epoch: 1 Loss: 1.96
100%|█████████████████████████████████████████| 138/138 [00:07<00:00, 19.33it/s]
Epoch: 2 Loss: 1.09
100%|█████████████████████████████████████████| 138/138 [00:07<00:00, 18.87it/s]
Epoch: 3 Loss: 1.10
100%|█████████████████████████████████████████| 138/138 [00:07<00:00, 17.92it/s]
Epoch: 4 Loss: 1.10
100%|█████████████████████████████████████████| 138/138 [00:07<00:00, 18.39it/s]
Epoch: 5 Loss: 1.10
Test error evaluation¶
We evaluate test performance for the trained graph filter.
with torch.no_grad():
    yHatTest = model(xTest, S)
    testMSE = movieMSELoss(yHatTest, yTest, pTest)
print(f"Final Training Loss: {lossTrain[-1]:.2f}")
print(f"Final Test Loss: {testMSE.item():.2f}")
Final Training Loss: 1.03
Final Test Loss: 1.00
Training error evolution¶
We plot the training loss after every batch for the graph filter:
plt.figure(figsize = (8,8))
plt.plot(lossTrain, label = 'Train MSE')
plt.ylabel('Mean Squared Error')
plt.xlabel('Batches')
plt.title('Graph Filter Training Loss Evolution')
[Figure: Graph Filter Training Loss Evolution]
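The Task 4 deliverable also asks for the test loss. Since lossTest stores one value per epoch while lossTrain stores one value per batch, a simple option is to plot the test loss against epochs in a separate figure; the sketch below uses the variables defined in the training loop above.
# Plot the test loss recorded at the end of every epoch.
plt.figure(figsize=(8, 8))
plt.plot(range(1, nEpochs + 1), lossTest, label='Test MSE')
plt.ylabel('Mean Squared Error')
plt.xlabel('Epochs')
plt.title('Graph Filter Test Loss Evolution')
plt.legend()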