Download the files for Lab 2B from the following links:

We recommend that you use Google Colab, as training will be faster on the GPU.

To enable the GPU on Colab, go to Edit / Notebook settings / Hardware accelerator / select T4 GPU

Instructions on how to download and use Jupyter Notebooks can be found here. You can find a static version of the notebook below.

notebook_lab_2B

Lab 2: Signals in Time: Audio Processing and Classification¶

Audio is mathematically modeled as a function $x(t)$ in which $t$ represents time and $x(t)$ is an electric signal that is generated by transforming air pressure waves with a microphone. The same pressure waves can be reconstructed from the electrical signal using a speaker. In this lab we want to use what we learned in Lab 1 to process audio. In particular, we want to clean up an audio signal (Labs 2A and 2B) and we want to recognize spoken words (Lab 2C).

Lab 2A: Audio Processing with Convolutional Neural Networks (CNNs)¶

0. Environment setup¶

In [1]:

# To run this notebook on a local computer, go back to the Jupyter home page and click the upload
# button. A dialog box will appear. Locate the file data_lab_2b.pt and upload.
#
# To run on Google Colab, upload the data_lab_2B.pt file to a folder in your Google Drive and 
# uncomment the next four lines of code. The code assumes the data_lab_2B.pt file is in a folder 
# labeled "ese2000/Lab2B." Change the name to the proper folder if needed.

## import os
# from google.colab import drive

## Mount the drive. It prompts for an authorization.
# drive.mount('/content/drive')

## Specify the directory where the data is located. Change "ese2000/Lab2B" to your own folder name.
# folder_path = '/content/drive/My Drive/ese2000/Lab2B'

## Change the directory to this folder to perform operations within it
# os.chdir(folder_path)

# Import libraries
import torch
import torch.nn as nn
import torch.functional as F
import torch.optim as optim
from torch.nn.functional import relu
from torch.utils.data import TensorDataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio, display
import random
np.random.seed(1)
torch.manual_seed(1)
random.seed(1)

print('\n Done \n')

 Done

Convolutional Neural Networks¶

A convolutional neural network (CNN) is a neural network in which the linear maps that are used in each layer are convolutional filters. Using $h_{\ell k}$ to denote the filter coefficients used at Layer $\ell$, a CNN is defined by the recursion

\begin{equation}\tag{1} \mathbf{x}_0 = \mathbf{x}, \quad \mathbf{x}_\ell = \sigma \Big(\, \mathbf{z}_\ell \,\Big) = \sigma \left(\, \sum_{k=0}^K h_{lk}\,\mathcal{S}^k \mathbf{x} \, \right) ,\quad \mathbf{x}_L = \mathbf{\phi}(\mathbf{x}; \mathcal{H}). \end{equation}

In (1) we use $\mathbf{\phi}(\mathbf{x}; \mathcal{H})$ to denote the output of the CNN, with $\mathcal{H}=[h_{11},\ldots,h_{KL}]$ being a tensor that groups all of the filter coefficients that make up the layers of the CNN.

Notice that since convolutions are linear operations, a CNN is a particular case of a standard neural network. This being true, a CNN is still a composition of layers each of which is itself the composition of a linear map with a pointwise nonlinearity. The only difference is that instead of generic linear maps, layers use (linear) convolutions.n When we want to highlight that a neural network is not convolutional, we call it a fully connected neural network (FCNN).

An important observation to make is that since the nonlinear operations that are used at each layer are pointwise, the CNN inherits the shift invariance of convolutions. Locality and shift invariance are the motivations for introducing convolutions. CNNs inherit these two properties. We can then think of CNNs as generalizations of convolutions. We can, in fact, argue that CNNs are minor variations of convolutions. They are nonlinear operators designed to stay as close as possible to linear convolutional filters. We just add pointwise nonlinearities.

Do notice that we are not saying that this minor architectural variation is irrelevant. Quite the contrary, it has a major effect on the practical performance of CNNs as we explore in this lab.

TASK 1¶

Program a class to implement CNNs. This class receives as initialization parameters the number of layers $L$ and the number of taps of the filters of each layer. The number of taps per layer is given as a vector with $L$ entries. Endow the class with a forward methods that takes a vector $\mathbf{x}$ as an input and works through the recursion in (1) to return the CNN output $\mathbf{\phi}(\mathbf{x}; \mathcal{H})$. Use tanh nonlinearities in all layers.

In [2]:

# Define the convolutional function as in LAB2A
def convolution_1D(signal,filter):
    taps = filter.shape[-1]
    z = torch.zeros(signal.shape)
    for k in range(taps):
        z = z + signal * filter[k]
        signal = torch.roll(signal, shifts=1, dims=-1)
        signal[:,:,0] = 0
    return z

# Define the convolutional layer that inherits from nn.Module
# and implements (1) using convolution_1D
class Convolutional_Layer(nn.Module):
    def __init__(self,taps):
        super().__init__()
        self.filter = torch.nn.Parameter(torch.ones(taps))

    def forward(self,x):
        return convolution_1D(x,self.filter)

# Define the CNN, which also inherits from nn.Module
# and has L layers, where each layer instantiates the convolutional_layer class
class CNN(nn.Module):
    def __init__(self, L,taps):
        '''
        inputs:
            L (int): number of layers
            taps (list[int]): list with number of taps for each layer
        '''
        super().__init__()
        self.L = L
        assert len(taps) == L, f"There are {L} layers, but taps is a vector of length {len(taps)}"
        self.conv_layers = []
        for i in range(L):
            self.conv_layers.append(Convolutional_Layer(taps[i]))


    def forward(self, x):
        for i in range(self.L):
            # compute the convolution
            x = self.conv_layers[i](x)
            if i != self.L-1:
                x = torch.tanh(x)
        return x

print('\n Done \n')

 Done

Multiple-Input-Multiple-Output (MIMO) Filters¶

Multiple input (MI) features can be processed with separate filterbanks to produce multiple output (MO) features. In these MIMO filters the input is a matrix $\mathbf{x}$ and the output is another matrix matrix $\mathbf{y}$. The MIMO filter coefficients are matrices $\mathbf{H}_k$ and the MIMO filter itself is a generalization of (1) in which the matrix $\mathbf{H}_k$ replaces the vector $\mathbf{h}_k$,

\begin{equation}\tag{2} \mathbf{y} = \sum_{k=0}^K \mathcal{S}^k \mathbf{x} \mathbf{H}_k . \end{equation}

In (2), the input feature matrix $\mathbf{x}$ has dimension $N \times F$ and the output feature matrix $\mathbf{y}$ has dimension $N \times G$. This means that each of the $F$ columns of $\mathbf{x}$ represents a separate input feature whereas each of the $G$ columns of $\mathbf{y}$ represents an output feature. To match dimensions, the filter coefficient matrices $\mathbf{H}_k$ must be of dimension $F \times G$.

TASK 2¶

Program a class that implements a MIMO filter. This class has as attributes the length of the filter $K$ and the dimensions $F$ and $G$ of the filter coefficients. The filter coefficients themselves are also an attribute of the class. Endow the class with a forward method that takes an input feature $\mathbf{x}$ and produces the corresponding output feature $\mathbf{y}$.

In [3]:

# Define the function that implementss a MIMO convolution
def MIMO_convolution_1D(signal,filter):
    taps = filter.shape[0]
    G = filter.shape[1]
    N = signal.shape[-1]
    batch = signal.shape[0]
    z = torch.zeros((batch, G, N)).to(device)
    for k in range(taps):
        z += torch.matmul(filter[k], signal)
        signal = torch.roll(signal, shifts=1, dims=-1)
        signal[:,:,0] = 0
    return z

# Define the convolutional layer class that inherits from nn.module
class MIMO_convolutional_layer(nn.Module):
    def __init__(self,F,G,K):
        '''
        inputs:
            G(int): Number of output features
            F(int): Number of input features
            K(int): Number of filter taps
        '''
        super().__init__()
        self.filter = torch.nn.Parameter(torch.empty(K,G,F))
        nn.init.xavier_normal_(self.filter)
    def forward(self,x):
        return MIMO_convolution_1D(x,self.filter)

print('\n Done \n')

 Done

Real Convolutional Neural Networks¶

The MIMO filters of the last Section can be used to define MIMO CNNs. A MIMO CNN is a neural network in which each of the layers is the composition of a MIMO convolutional filter with a pointwise nonlinearity. If at each layer we denote filter coefficients as $\mathbf{H}_{\ell k}$, the MIMO CNN is given by the recursion

\begin{equation} \mathbf{x}_0 = \mathbf{x}, \quad \mathbf{x}_\ell = \sigma \Big(\, \mathbf{z}_\ell \,\Big) = \sigma \left(\, \sum_{k=0}^{K_\ell} \mathcal{S}^k \mathbf{x} \mathbf{H}_{\ell k} \, \right) ,\quad \mathbf{x}_L = \mathbf{\phi}(\mathbf{x}; \mathcal{H}). \end{equation}

Thus, we have a composition of layers each of which is itself the composition of a MIMO convolutional filter with a pointwise nonlinearity.

The CNN of the first Section was introduced for didactic purposes because almost all CNNs used in practice are MIMO CNNs — hence, the title of this section. For that reason we never call them MIMO CNNs, we just call them CNNs.

Convolutional Neural Network Specification¶

To specify a CNN we need to specify the number of layers $L$ and the characteristics of the filers that are used at each layer. The latter are the number of filter taps $K_\ell$ and the number of features $F_\ell$ at the output of the layer. The number of features $F_0$ must match the number of features at the input and the number of features $F_L$ must match the number of features at the output. Observe that the number of features at the output of Layer $(\ell-1)$ determines the number of features at the input of Layer $\ell$. Then, the filter coefficients at Layer $\ell$ are of dimension $F_{\ell-1} \times F_\ell$.

TASK 3¶

Program a class that implements a CNN with $L$ layers. This class receives as initialization parameters a CNN specification consisting of the number of layers $L$ and vectors $[K_1, \ldots, K_L]$ and $[F_0, F_1, \ldots, F_L]$ containing the number of taps and the number of features of each layer.

Endow the class with a forward method that takes an input feature $\mathbf{x}$ and produces the corresponding output feature $\mathbf{\phi}(\mathbf{x}; \mathcal{H})$.

In [4]:

# Define the CNN

class MIMO_CNN(nn.Module):
    def __init__(self, L,taps,features):
        '''
        inputs:
            L (int): number of layers
            taps (list[int]): list with number of taps for each layer [K1, ..., KL]
            features: list with number of features of each layer [F0, ..., FL]
        '''
        super().__init__()
        self.L = L
        assert len(taps) == L, f"There are {L} layers, but taps is a vector of length {len(taps)}"
        assert len(features) == L+1, f"There are {L} layers, but features is a vector of length {len(features)} (There has to be L+1 features)"
        self.layers = nn.ModuleList()
        for i in range(L):
            self.layers.append(MIMO_convolutional_layer(F=features[i],G=features[i+1],K=taps[i]))
            
    def forward(self, x):
        for i in range(self.L):
            x = self.layers[i](x)
            if i != self.L-1:
                x = torch.tanh(x)
        return x

print('\n Done \n')

 Done

Audio Processing with Convolutional Neural Networks¶

We have prepared a dataset similar to the one we used in Lab 2A. We ask that you process it with a CNN.

TASK 4¶

Load audio data from the link provided in the course site, and unzip it. This file contains a pytorch tensor with Q = 300 pairs of audio recordings $(\mathbf{x}_q, \mathbf{y}_q)$. Split the dataset into training and testing sets, keep 50 samples for testing. Train a CNN to remove the background noise. This CNN has 2 layers, the numbers of features per layer are $[1,5,1]$ and the filter lengths are $[40,40]$. Use a learning rate of 0.5, batch size of 32 and train for 20 epochs. Evaluate the train and test loss (use L1 loss as in lab 2A).

In [5]:

dataset_name = 'data_lab_2B.pt'
sample_rate = 32000

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device("cpu")
dataB = torch.load(dataset_name, map_location=device, weights_only = False)
print(f'Device: {device}')

# Take one sample and separate x and y
data_0 = dataB[0]
x_0 = data_0[0].cpu()
y_0 = data_0[1].cpu()

print(len(dataB))
print(x_0.shape)

# Noisy signal plot and play
plt.plot(x_0[0])
plt.show()
display(Audio(x_0[0],rate=sample_rate))

# Original signal plot and play
plt.plot(y_0[0])
plt.show()
display(Audio(y_0,rate=sample_rate))

# Split into train and test

test_size = 100
train_size = len(dataB) - test_size

test_set_B, train_set_B =  torch.utils.data.random_split(dataB, [test_size, train_size])

print('\n Done \n')

Device: cpu
1500
torch.Size([1, 32000])

No description has been provided for this image

 Done

In [7]:

def evaluate(dataloader, estimator):
    loss = 0
    with torch.no_grad():
        for x_batch, y_batch in dataloader:
            yHat = estimator.forward(x_batch)
            loss += torch.mean((yHat-y_batch)**2)
    numBatchs = len(dataloader)
    loss /= numBatchs
    return loss.item()

print('\n Done \n')

 Done

In [20]:

# SAME TRAINING LOOP as in previous lab BUT WITH MIMO_CNN AS ESTIMATOR

L = 2
taps = [40, 40]
features = [1,5,1]
estimator = MIMO_CNN(L=L,taps=taps,features=features)
lr = 0.5
batch_size = 32
n_batches = np.ceil(train_size/batch_size).astype(int)

optimizer = optim.SGD(estimator.parameters(), lr=lr)
loss_L1 = nn.L1Loss()
estimator.to(device)

# Instantiate Data Loaders
train_loader_CL = DataLoader(
    train_set_B,
    batch_size=batch_size,
    shuffle=True,
)

test_loader_CL = DataLoader(
    test_set_B,
    batch_size=batch_size,
    shuffle=False,
)

loss_evolution = []
loss_evolution_test = []

# Training loop
n_epochs = 20

print('\nStart Training Loop\n')

# Iterate n_epochs times over the whole dataset.
for ep in range(n_epochs):
    i=0
    #iterate over all batches in the dataset
    for x_batch, y_batch in train_loader_CL:
        i+=1

        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        # Set gradients to zero
        estimator.zero_grad()

        # Compute predictions
        yHat = estimator.forward(x_batch)

        # Compute loss
        loss = loss_L1(yHat,y_batch)
        loss.backward()

        # Update parameters
        optimizer.step()

        print(f'epoch {ep+1} / {n_epochs}, batch {i} / {n_batches},  ', end='\r')
    
    epoch_loss_filter = evaluate(train_loader_CL, estimator)
    loss_evolution.append(epoch_loss_filter)

    lossTest_filter = evaluate(test_loader_CL, estimator)
    loss_evolution_test.append(lossTest_filter)
    print(f'epoch {ep+1} / {n_epochs}, batch {i} / {n_batches}, loss {lossTest_filter}  ', end='\r')

print('\n\nDone\n')

Start Training Loop

epoch 20 / 20, batch 44 / 44, loss 6.924397411012251e-08   

Done

In [23]:

# Plot results
plt.plot(loss_evolution, ".", label="train_filter")
plt.title("Mean squared error evolution")
plt.xlabel("Epoch")
plt.ylabel("L1 loss")
plt.plot(loss_evolution_test, ".", label="test_filter")
plt.legend()
plt.savefig("cnn_mse.pdf")
plt.show()

print('\nDone\n')

Done

In [22]:

# Listen to the samples

# (change to try other audios)
nmbr_example = 0

x_batch, y_batch = next(iter(test_loader_CL))
sample_x, sample_y = x_batch[nmbr_example], y_batch[nmbr_example]
sample_x = sample_x.reshape(sample_x.shape[0], 1, sample_x.shape[1])
predicted_y = estimator(sample_x).detach()[0]

sample_x = sample_x.cpu()
sample_y = sample_y.cpu()
predicted_y = predicted_y.cpu()

print(f'Original Audio')
display(Audio(sample_y, rate=sample_rate))

print(f'Noisy Audio')
display(Audio(sample_x.reshape(sample_x.shape[0],sample_x.shape[2]), rate=sample_rate))

print(f'Audio recovered with filter')
display(Audio(predicted_y, rate=sample_rate))

Original Audio

Noisy Audio

Audio recovered with filter

In [ ]:

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio

signal1 = sample_x.cpu()[0]
signal2 = predicted_y.cpu()

sample_rate=32000

# Compute the Fourier Transform
fft_signal1 = np.fft.fft(signal1)
fft_signal2 = np.fft.fft(signal2)

# Compute the frequency axis
N = signal1.shape[1]
freq = np.fft.fftfreq(N, d=1/sample_rate)

# Plot the magnitude spectrum of the Fourier Transform for both signals
plt.figure(figsize=(12, 6))

plt.subplot(2, 1, 1)
plt.plot(freq, np.abs(fft_signal1.T))
plt.title('Original')
plt.xlabel('Frequency')
plt.ylabel('Magnitude')
plt.xlim(0, sample_rate / 2)  # Limiting x-axis to half the sampling rate (Nyquist frequency)

plt.subplot(2, 1, 2)
plt.plot(freq, np.abs(fft_signal2.T))
plt.title('Denoised')
plt.xlabel('Frequency')
plt.ylabel('Magnitude')
plt.xlim(0, sample_rate / 2)  # Limiting x-axis to half the sampling rate (Nyquist frequency)

plt.tight_layout()
plt.show()

display(Audio(signal1,rate=sample_rate))
display(Audio(signal2,rate=sample_rate))

In [ ]: