Download the files for Lab 2A in the following links:

Instructions on how to download and use Jupyter Notebooks can be found here. You can find a static version of the notebook below.

notebook_lab_2A

Lab 2: Signals in Time: Audio Processing and Classification¶

Audio is mathematically modeled as a function $x(t)$ in which $t$ represents time and $x(t)$ is an electric signal that is generated by transforming air pressure waves with a microphone. The same pressure waves can be reconstructed from the electrical signal using a speaker. In this lab we want to use what we learned in Lab 1 to process audio. In particular, we want to clean up an audio signal (Labs 2A and 2B) and we want to recognize spoken words (Lab 2C).

Lab 2A: Audio Processing with Convolutions¶

0. Environment setup¶

In [1]:

# To run this notebook on a local computer, go back to the Jupyter home page and click the upload
# button. A dialog box will appear. Locate the file data_lab_2A.pt and upload.
#
# To run on Google Colab, upload the data_lab_2A.pt file to a folder in your Google Drive and 
# uncomment the next four lines of code. The code assumes the data_lab_2A.pt file is in a folder 
# labeled "ese2000/Lab2A." Change the name to the proper folder if needed.

## import os
# from google.colab import drive

## Mount the drive. It prompts for an authorization.
# drive.mount('/content/drive')

## Specify the directory where the data is located. Change "ese2000/Lab2A" to your own folder name.
# folder_path = '/content/drive/My Drive/ese2000/Lab2A'

## Change the directory to this folder to perform operations within it
# os.chdir(folder_path)

# Import libraries
import torch
import matplotlib.pyplot as plt
from IPython.display import Audio,display
import numpy as np
import torch.nn as nn
import torch.functional as F
from torch.nn.functional import relu
import torch.optim as optim
from torch.utils.data import DataLoader

1. Audio Signals¶

Audio is a function $x(t)$ in which $t$ represents time. We can create a digital representation of an audio signal through sampling. To do so define a sampling time $T_s$ and a number of components $N$ and proceed to sample the signal $x(t)$ every $T_s$ units of time. This results in the digital audio signal $\mathbf{x}$, which is mathematically represented by the vector

\begin{equation}\tag{1} \mathbf{x} ~=~ [x(0); x(T_s); x(2T_s); \ldots; x((N-1)T_s) ] ~=~ [x_0; x_1; x_2; \ldots; x_{N-1}] . \end{equation}

A digital audio signal is more convenient than the original analog representation because it is easier to process. The vector $\mathbf{x}$ can be manipulated to extract information or make it better in a number of ways. In this lab we have excerpts of human speech that are contaminated with background noise. Our goal is to remove as much of the background noise as possible.

Task 1¶

Load audio data dataA.zip. Unzip it. This file contains a pytorch tensor with Q = 1500 pairs of audio recordings $(\mathbf{x}_q, \mathbf{y}_q)$. The signals $\mathbf{x}_q$ are human speech recorded with background noise. The signals $\mathbf{y}_q$ are the same speech files recorded without background noise. Each of these audio signals contains $N=64000$ samples recorded with a sampling time of $T_s = 17 \mu s$. Play sample speech recordings for clean and contaminated audio.

In [2]:

# Specify the file containing the data. This has to be loaded into the notebook (see Section 0)
dataset_name = 'data_lab_2A.pt'

# Computations are getting intensive. The amount of samples that we are processing, the size of each
# individual sample and the complexity of the learning parameterization are all much larger than in 
# Lab 1. The first line below sets up the notebook to use a GPU if it is available. The second line
# loads the data into the processing device. The GPU if available, the CPU otherwise. Loading the 
# data creates a tensor dataA 
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device("cpu")
dataA = torch.load(dataset_name, map_location = device, weights_only = False)

# The dataset dataA contains a list of 1,500 tuples (x,y). Both entries in each tuple are audio 
# signals with a duration of T = 1 second (s) and a sampling frequency of 32 KHz (32 thousand 
# samples per second). This implies a sampling times of T_s = 31.25 micro seconds and audio vectors
# with N = 32,000 entries. In each tuple, x is a signal contaminated with noise and y is a clean 
# signal without the noise.

# Dataset parameters. THEY DO NOT MAKE SENSE TO ME
sample_freq = 32000
seconds_per_clip = 1
n_time_samples = sample_freq * seconds_per_clip

# Select one sample and separate x and y for display and playblack
nr_data_sample =  0
data_sample = dataA[nr_data_sample,]
x_data_sample = data_sample[0].cpu()
y_data_sample = data_sample[1].cpu()

# Show the dimension of each audio vector. This is a good time to remember that the last coordinate
# of a tensor is the row coordinate and that the next to last coordinate is the column coordinate.
print('\n')
print('Dimension of x samples:', x_data_sample.shape, '.')
print('    Thus, this is a vector with', x_data_sample.shape[1], 'rows', end = ' ')
print('and', x_data_sample.shape[0], 'column(s).')
print('\n')
print('Dimension of y samples:', y_data_sample.shape)
print('    Thus, this is a vector with', y_data_sample.shape[1], 'rows', end = ' ')
print('and', y_data_sample.shape[0], 'column(s).')
print('\n')

# Plot the noisy sample x and the clean sample y.
plt.plot(x_data_sample[0,:], label = "Noisy Data Sample")
plt.plot(y_data_sample[0,:], label = "Clean Data Sample")
plt.title("Data Sample, Clean and Noisy Audio Waveforms")
plt.xlabel("Time Sample Index")
plt.ylabel("Signal Value")
plt.legend()
plt.savefig("audio_samples.pdf")
plt.show()

print('\n')

# Play the clean sample and the noisy sample
print(f'Noisy Audio Sample:')
display(Audio(x_data_sample[0,:], rate = sample_freq))
print(f'Clean Audio Sample:')
display(Audio(y_data_sample[0,:], rate = sample_freq))

print('\nDone\n')


Dimension of x samples: torch.Size([1, 32000]) .
    Thus, this is a vector with 32000 rows and 1 column(s).


Dimension of y samples: torch.Size([1, 32000])
    Thus, this is a vector with 32000 rows and 1 column(s).

No description has been provided for this image


Noisy Audio Sample:

Clean Audio Sample:

Done

It is clear that this is a problem that we can formulate as an empirical risk minimization (ERM). Given audio inputs $\mathbf{x}$ contaminate with background noise we want to produce estimates $\hat{\mathbf{y}}$ of the corresponding clean audio signals $\mathbf{y}$. Using a learning parametrization that produces estimates $\hat{\mathbf{y}}=\mathbf{\Phi}(\mathbf{x}; \mathbf{h})$ we want to find the parameter $\mathbf{h}$ that minimizes the empirical risk over the $Q$ audio pairs for a given loss function $\ell(\mathbf{y},\hat{\mathbf{y}})$,

\begin{equation}\tag{2} \mathbf{h}^* = \text{argmin}_{\mathbf{h}} \frac{1}{Q} \sum_{q=1}^Q \ell \Big(\,\mathbf{y}_q,\,\mathbf{\Phi}(\mathbf{x}_q; \mathbf{h}) \,\Big). \end{equation}

It is interesting that this mathematical formulation is searching for an artificial intelligence (AI) that is trying to undo a natural effect — rather than mimicking a natural effect. The original data are clean audio inputs $\mathbf{y}_q$. These data are contaminated with background noise to produce the audio files $\mathbf{x}_q$. Our AI takes as inputs these signals contaminated with background noise and attempts to estimate the clean audio signals that generated the data contaminated with background noise. This difference in interpretation of the role of the AI does not alter the mathematical formulation of the ERM problem.

Throughout this lab we will use $L_1$ losses to compare clean audio signals and their estimates. The $L_1$ loss is defined as the sum of the absolute values of individual component differences. If we write $\mathbf{y} = [y_0; \ldots; y_{N-1}]$ and $\hat{\mathbf{y}}=[\hat{y}_0; \ldots; \hat{y}_{N-1}]$ the $L_1$ loss is given by

\begin{equation} \ell \big(\, \mathbf{y},\,\hat{\mathbf{y}} \,\big) ~=~ \big|\, \mathbf{y} – \hat{\mathbf{y}} \,\big| ~=~ \sum_{n=0}^{N-1} \big|\, y_n – \hat{y}_n \,\big| . \end{equation}

We are now ready to make a first attempt at training an AI to remove background noise from audio signals.

Task 2¶

Split the data loaded in Task 1 into a test set with 100 samples and a training set containing the remaining samples. Use as learning parametrization a neural network with $N_1 = 100$ hidden neurons in a single layer.

Evaluate the training and test error. Play some sample speech recordings of entries of the test dataset after they are cleaned with the neural network. You are likely surprised by the sound. Comment.

In [3]:

# We split the dataset into a trainin set and a test set. We specify the size of the test set and
# use the remaining data samples to construct the training set. 
test_size = 100
train_size = len(dataA) - test_size

# We use a Pytorch utility to split into train and test randomly. We did this manually in Lab 1C by
# splitting data tesnors after an index reshufling operation. The end outcome is the same.
train_set, test_set =  torch.utils.data.random_split(dataA, [train_size, test_size])

In [4]:

# We define a fullly connected (FC) neural network (NN) with two layers. This is the same NN that we 
# coded in Lab 1C. We modify the linear operation in the layers to add bias parameters. Biases are 
# numbers added after the linear operation. I.e., instead of having z = Ax we have z = Ax + b, where 
# both, the matrix A and the bias term b are trainable parameters.

class FCNN(nn.Module):

    # Init method. This is where attributes are initialized
    def __init__(self, n_input, n_layer_1, n_output): # numbers of neurons are passed as arguments
        
        super().__init__() 
        
        self.A1 = nn.parameter.Parameter( torch.rand(n_input, n_layer_1) - 0.5 ) # Layer 1 matrix
        self.A2 = nn.parameter.Parameter( torch.rand(n_layer_1, n_output) - 0.5 ) # Layer 2 matrix

        self.bias1 = nn.parameter.Parameter( torch.rand(1, n_layer_1) - 0.5 ) # # Layer 1 bias
        self.bias2 = nn.parameter.Parameter( torch.rand(1, n_output) - 0.5 ) # Layer 2 bias
   
    # Forward method. This is where the NN is implemented. 
    #
    #     Input: A tensor x0 with three coordinates, x0[sample, column, row]. The sample coordinate
    #            indexes samples in the batch. The column coordinate must be 1, because these are
    #            vectors that we input to the NN. The row coordinate indexes different entries of
    #            the data. Its length must match the number if input components n_input.
    #
    #     Output: A tensor x2 with three coordinates, x2[sample, column, row]. Same interpretation
    #             as Input tensor. Except that the length of the row coordinate must match the 
    #             number of output components n_outut
    #             
    # The forward method processes all of the vectors in the bacth simultaneously  
    
    def forward(self, x0): 
        
        # First layer takes x0 as input (passed when calling the function) and gives x1 as output
        z1 = torch.matmul( x0, self.A1 ) + self.bias1 # z1 = A1 x0 + b1 
        x1 = nn.functional.relu( z1 ) # x1 = sigma(z1) = relu(z1)
        
        # Second layer takes x1 as input and gives x2 as output
        z2 = torch.matmul( x1, self.A2 ) + self.bias2 # z2 = A2 x1 + b2 
        x2 = relu( z2 ) # x1 = sigma(z2) = relu(z2)
        
        return x2 # This is a two-layer NN. The output is x2

In [5]:

# We define a function to evaluate the loss. This is an auxiliary function. We do not need it to 
# implement or train the neural network. We use it for evaluation after training and/or at some
# intermediate training checkpoints. This is just for visualization.
#
#     Input: dataloader = Data on which we want to evaluate performance.
#     Input: estimator = Learning parameterization we want to evaluate.
#     Output: Mean squared error (MSE) of estimator evluated in dataloader

def evaluate(dataloader, estimator): 
    
    MSE = 0 # Initialize MSE
    
    with torch.no_grad(): # Disable gradient computations (the are not needed for evaluation).
        for x_batch, y_batch in dataloader: # Sweep over all batches
            yHat = estimator.forward(x_batch) # Estimate outputs on a batch
            MSE += torch.mean((yHat-y_batch)**2) # Evaluate MSE on batch and accumulate to MSE total
       # end for
    # end with    
    
    n_batchs = len(dataloader) # Total number of batches
    MSE /= n_batchs # Divide accumulated error by number of batches
    
    return MSE.item()

In [6]:

# Create an estimator using the FCNN class and load it to the processing device (CPU or GPU). This 
# FCNN has two layers. At the input of Layer 1 and the output of Layer 2 the dimensionality (number
# of neurons) is the number of the samples in the audio signal. The dimensioanlity of the output of
# Layer 1 (number of hidden neurons) is set to 100.
n_hidden_neurons = 100
estimator = FCNN(n_time_samples, n_hidden_neurons, n_time_samples).to(device)

# We set the parameters of stochastic gradient descent (SGD). They are the stepsize or learning 
# rate, the batch size, and the number of epochs. We furhter specify the use of SGD as the 
# optimization algorithm
lr = 0.1 
batch_size = 128
n_epochs = 50
optimizer = optim.SGD(estimator.parameters(), lr=lr)

# Specify the loss function
loss = nn.MSELoss(reduction = 'mean')

# Instantiate the train and test dataloaders. The train data loader handles the selection of 
# randomized batches for implementing sampling without replacement. Both data loaders handle the 
# loading of data on the device's memory. 
train_loader = DataLoader( train_set, batch_size = batch_size, shuffle = True )
test_loader = DataLoader( test_set, batch_size = batch_size, shuffle = False )

# Initialize null structures for storing the evolution of the train and test MSE. We will save both
# at the end of every epoch. This is not needed for training. It's just for displaying results
mse_evolution_train = []
mse_evolution_test = []
print('\n')

# Begin training loop. This is an implementation of stochastic gradient descent (SGD) using sampling 
# without replacement. In the inner loop we are sweeping over batches of size batch_size. Samples in
# the batch are selected by the train_loader object. In the outer loop we are sweeping over epochs.
# One epoch is a complete run over the randomly reshufled dataset. 
#
# In each inner iteration we have the three steps required to run SGD: (i) Load the data. 
# (ii) Evaluate Gradients. (iii) Take a gradient descent step.

for epoch in range(n_epochs): # Iterate over n_epochs epochs

    for x_batch, y_batch in train_loader: # Iterate over all batches in the dataset 

        # (Step i) Load the data. These commands send the data to the GPU memory.
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        # (Step ii) Compute the gradients. We use automated differentiation.
        estimator.zero_grad() # Gradient reset to indicate where the backward computation stops.

        yHat = estimator.forward(x_batch) # Call the neural network.
        mse = loss(yHat,y_batch) # Call the loss functions.

        mse.backward() # Compute gradients moving backwards untit the gradient reset.

        # (Step iii) Update parameters by taking an SGD (or other optimizer) step.
        optimizer.step()
    
    # End of batch loop.

    # Evaluate the performance of the neural netowrk on the training data at the end of each epoch
    mse_train_set = evaluate(train_loader, estimator)
    mse_evolution_train.append(mse_train_set)

    # Evaluate the performance of the neural netowrk on the test data at the end of each epoch
    mse_test_set = evaluate(test_loader, estimator)
    mse_evolution_test.append(mse_test_set)
    
    # Print train and test MSE to track training performance
    print(f'Epoch {epoch+1} / {n_epochs}: ',end='')
    print(f'Train loss: {mse_train_set}; ',end='')
    print(f'Test Loss: {mse_test_set}',end='\r')

# End of epoch loop.

print('\n\nDone\n')


Epoch 50 / 50: Train loss: 0.04202495515346527; Test Loss: 0.0420018173754215246

Done

In [7]:

# Plot MSE across epochs for the train and test sets
plt.plot(mse_evolution_train, ".", label = "Training Set MSE")
plt.plot(mse_evolution_test, "x", label = "Test Set MSE")
plt.title("Mean Squared Error (MSE) as a Function of Epoch Index")
plt.xlabel("Epoch Index")
plt.ylabel("Mean Squared Error")
plt.legend()
plt.savefig("fcnn_mse.pdf")
plt.show()

print('\nDone\n')

Done

In [8]:

# Listen to some samples to hear the quality of noise removal

# Isolate an audio sample and pass it to the neural network
n_data_sample = 1042 # Sample choice. Change to hear other samples
sample_x, sample_y = dataA[n_data_sample] # Isolate sample from dataset
predicted_y = estimator.forward(sample_x) # Pass sample to the forward method of the neural network
predicted_y = predicted_y.detach() # Detatch tensor values before passing to audio library. 
                                   # This is technicality

# Play the clean audio, the noisy audio, and the processed audio
print('\n')
print(f'Clean Audio Sample:')
display(Audio(sample_y.cpu(), rate = sample_freq))
print(f'Noisy Audio Sample:')
display(Audio(sample_x.cpu(), rate = sample_freq))
print(f'Denoised Audio Sample:')
display(Audio(predicted_y.cpu(), rate = sample_freq))

print('\nDone\n')


Clean Audio Sample:

Noisy Audio Sample:

Denoised Audio Sample:

Done

The neural network in Task 2 has failed to clean the audio files. This is a surprise because we have seen neural networks work in Lab 1C and the problem formulation in (2) is essentially the same problem formulation. What has changed between Lab 1C and Lab 2A is dimensionality. In Lab 1Aa we were processing signals with $N=2$ components. We are now processing signals with $N=64,000$. The complexity of the data is such that a neural network fails to learn any meaningful action that may result in removal of background noise.

This phenomenon is typical. When we consider problems in which the input dimension is small pretty much any learning parametrization works fine. Neural networks, in particular, work well and have become a de-facto standard. When we consider signals in high dimensions, not all parameterizations work well. Finding good parameterizations requires that we leverage the structure of the signal. In the case of signals in time, this is done with convolutions.

2. Convolutions¶

Convolutions are linear operations that we use to process time signals. Consider then an input signal $\mathbf{x}$ with $N$ components $x(n)$ along with a filter $\mathbf{h}$ having $K$ coefficients $h(k)$. The convolution of the filter $\mathbf{h}$ with the signal $\mathbf{x}$ is a signal $\mathbf{y}= \mathbf{h} * \mathbf{x}$. This signal has $N$ components $x(n)$ which are given by

\begin{equation}\tag{3} y(n) = \sum_{k=0}^{K-1} h(k) x(n-k), \quad \end{equation}

In this definition we adopt the convention that $x(n-k)=0$ whenever the argument $(n-k) \notin [0,N-1]$. This is needed because for some values of $n$ and $k$ we may have that $n-k$ is outside of the range $[0,N-1]$ — for example, when $n=0$ and $k>0$. When this happens $x(n-k)$ is not defined. It follows that without the adoption of some convention for these values of the index $(n-k)$ the definition in (3) is improper. This convention is called a border artifact and it is not a significant issue if $K \ll N$. This is not required in the definition in (3) but it is almost always true in practice.

Task 3¶

Implement a class to represent convolutional filters with $K$ taps. In this class $K$ is an initialization parameter and the filter taps $h_k$ are class attributes. Endow the class with a forward method that takes a signal $\mathbf{x}$ as an input and produces as an output the signal $\mathbf{y}= \mathbf{h} * \mathbf{x}$.

In [9]:

# Define the ConvolutionalFilter class to represent a convolutional filter

class ConvolutionalFilter(nn.Module):

    def __init__(self, taps):

        super().__init__()
        self.filter = torch.nn.Parameter(torch.ones(taps))
        self.taps = taps
        
    def forward(self,x):
       
        # Instantiate the accumulated results from the convolution.
        z = torch.zeros(x.shape).to(device)
        
        for k in range(self.taps):
        
            # Add the results for the current tap
            z = z + x * self.filter[k]
            
            # Rolls the tensor one positions (see pytorch docs for more details)
            x = x.roll(1)
            
            # Set the signal at the first position to zero
            x[:,0] = 0
        
        # End sweep of filter taps
        
        return z

Task 4¶

Split the data loaded in Task 1 into a test set with 100 samples and a training set containing the remaining samples. Use the class of Task 3 to train a convolutional filter with $K = 81$ taps to remove background noise.

Evaluate the test error. Play some sample speech recordings of entries of the test dataset after they are cleaned with the neural network.

In [14]:

# To train the convolutional filter we run the same training loop we run for training the FCNN. The
# only difference is that we need to change the estimator. Instead of instantiating an object from 
# the FCNN class, we instantiate an object from the ConvolutionalFilter class. We also need to 
# modify some hyperparameters (learning rate, batch size, and number of epochs) as detailed below.
estimator = ConvolutionalFilter( taps = 10 ).to(device)

# We set the parameters of stochastic gradient descent (SGD). They are the stepsize or learning 
# rate, the batch size, and the number of epochs. We furhter specify the use of SGD as the 
# optimization algorithm. These numbers are different from the numbers used for neural networks. 
lr = 0.005
batch_size = 16
n_epochs = 20
optimizer = optim.SGD(estimator.parameters(), lr=lr)

# Specify the loss function
loss = nn.MSELoss(reduction = 'sum')

# Instantiate the train and test dataloaders. The train data loader handles the selection of 
# randomized batches for implementing sampling without replacement. Both data loaders handle the 
# loading of data on the device's memory. We do not rewrite this code because we already run it in
# an earlier cell. But we keep it here for completeness.
train_loader = DataLoader( train_set, batch_size = batch_size, shuffle = True )
test_loader = DataLoader( test_set, batch_size = batch_size, shuffle = False )

# Initialize null structures for storing the evolution of the train and test MSE. We will save both
# at the end of every epoch. This is not needed for training. It's just for displaying results
mse_evolution_train = []
mse_evolution_test = []
print('\n')

# Begin training loop. This is an implementation of stochastic gradient descent (SGD) using sampling 
# without replacement. In the inner loop we are sweeping over batches of size batch_size. Samples in
# the batch are selected by the train_loader object. In the outer loop we are sweeping over epochs.
# One epoch is a complete run over the randomly reshufled dataset. 
#
# In each inner iteration we have the three steps required to run SGD: (i) Load the data. 
# (ii) Evaluate Gradients. (iii) Take a gradient descent step. It is the exact same code we wrote
# above. It is just that we run with a different estimator. A convolutional filter instead of a
# fully connected neural network.

for epoch in range(n_epochs): # Iterate over n_epochs epochs

    for x_batch, y_batch in train_loader: # Iterate over all batches in the dataset 

        # (Step i) Load the data. These commands send the data to the GPU memory.
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        # (Step ii) Compute the gradients. We use automated differentiation.
        estimator.zero_grad() # Gradient reset to indicate where the backward computation stops.

        yHat = estimator.forward(x_batch) # Call the neural network.
        mse = loss(yHat,y_batch) # Call the loss functions.

        mse.backward() # Compute gradients moving backwards untit the gradient reset.

        # (Step iii) Update parameters by taking an SGD (or other optimizer) step.
        optimizer.step()

    # End of batch loop.

    # Evaluate the performance of the neural netowrk on the training data at the end of each epoch
    mse_train_set = evaluate(train_loader, estimator)
    mse_evolution_train.append(mse_train_set)

    # Evaluate the performance of the neural netowrk on the test data at the end of each epoch
    mse_test_set = evaluate(test_loader, estimator)
    mse_evolution_test.append(mse_test_set)
    
    # Print train and test MSE to track training performance
    print(f'Epoch {epoch+1} / {n_epochs}: ',end='')
    print(f'Train loss: {mse_train_set}; ',end='')
    print(f'Test Loss: {mse_test_set}  ',end='\r')

# End of epoch loop.

print('\n\nDone\n')


Epoch 20 / 20: Train loss: 6.037453914586877e-08; Test Loss: 8.118066574525074e-08   

Done

In [11]:

# Plot MSE across epochs for the train and test sets. The same code we wrote above.
plt.plot(mse_evolution_train, ".", label = "Training Set MSE")
plt.plot(mse_evolution_test, "x", label = "Test Set MSE")
plt.title("Mean Squared Error (MSE) as a Function of Epoch Index")
plt.xlabel("Epoch Index")
plt.ylabel("Mean Squared Error")
plt.legend()
plt.savefig("convolution_mse.pdf")
plt.show()

print('\nDone\n')

Done

In [12]:

# Listen to some samples to hear the quality of noise removal. This is the same code we wrote above.
# However, the estimator now is the trained convolutional filter. We therefore to hear a different
# audio when we play the denoised signal.

# Isolate an audio sample and pass it to the neural network
n_data_sample = 500 # Sample choice. Change to hear other samples
sample_x, sample_y = dataA[n_data_sample] # Isolate sample from dataset
predicted_y = estimator.forward(sample_x) # Pass sample to the forward method of the neural network
predicted_y = predicted_y.detach() # Detatch tensor values before passing to audio library. 
                                   # This is technicality

# Play the clean audio, the noisy audio, and the processed audio
print('\n')
print(f'Clean Audio Sample:')
display(Audio(sample_y.cpu(), rate = sample_freq))
print(f'Noisy Audio Sample:')
display(Audio(sample_x.cpu(), rate = sample_freq))
print(f'Denoised Audio Sample:')
display(Audio(predicted_y.cpu(), rate = sample_freq))

print('\nDone\n')


Clean Audio Sample:

Noisy Audio Sample:

Denoised Audio Sample:

Done

Contrary to the fully connected neural network of Task 2, this simple linear convolutional filter works well. This illustrates the importance of leveraging signal structure in machine learning. In Lab 2B we will see how to use convolutions to construct convolutional neural networks.

Appendix: Spectral Representation¶

Convolutions have frequency representations. These frequency representations are Fourier transforms of inputs and outputs and it can be shown that a convolutional filter can always be interpreted as increasing or attenuating the height of different components of this Fourier transform.

This cell plots the frequency spectrum of the noisy audio and its denoised version. Notice how the noise in the high frequencies is diminished by the learned filter.

In [13]:

# Isolate an audio sample and pass it to the neural network
n_data_sample = 1042 # Sample choice. Change to hear other samples
sample_x, sample_y = dataA[n_data_sample] # Isolate sample from dataset
predicted_y = estimator.forward(sample_x) # Pass sample to the forward method of the neural network
sample_x = sample_x.detach() # Detatch tensor values before passing to audio library. 
                             # This is technicality
predicted_y = predicted_y.detach() # Detatch tensor values before passing to audio library. 
                                   # This is technicality

signal1 = sample_x
signal2 = predicted_y

sample_freq = 32000

# Compute the Fourier Transform
fft_signal1 = np.fft.fft(signal1)
fft_signal2 = np.fft.fft(signal2)

# Compute the frequency axis
N = len(signal1)
freq = np.fft.fftfreq(N, d=1/sample_freq)

# Plot the magnitude spectrum of the Fourier Transform for both signals
plt.figure(figsize=(12, 6))

plt.subplot(2, 1, 1)
plt.plot(freq, np.abs(fft_signal1))
plt.title('Original')
plt.xlabel('Frequency')
plt.ylabel('Magnitude')
plt.xlim(0, sample_freq / 2)  # Limiting x-axis to half the sampling rate (Nyquist frequency)

plt.subplot(2, 1, 2)
plt.plot(freq, np.abs(fft_signal2))
plt.title('Denoised')
plt.xlabel('Frequency')
plt.ylabel('Magnitude')
plt.xlim(0, sample_freq / 2)  # Limiting x-axis to half the sampling rate (Nyquist frequency)

plt.tight_layout()
plt.show()

display(Audio(signal1, rate = sample_freq))
display(Audio(signal2, rate = sample_freq))

print('\nDone\n')

Done

In [ ]: