TP1: The Multilayer Perceptron

The First Deep Learning Model

By: Alexandre Verine

In this practical work, we will implement a simple neural network with two hidden layers. The hidden layers use the ReLU activation function and the output layer uses the softmax activation function. We will train the model with the cross-entropy loss and the backpropagation algorithm, using the MNIST dataset for both training and testing.
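
As a quick reminder (these are the standard definitions, not specific to this notebook), for a vector of scores $z \in \mathbb{R}^{10}$ produced by the network and a target class $y$, the softmax and the cross-entropy loss are:

$$\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{10} e^{z_j}}, \qquad \mathcal{L}(z, y) = -\log \mathrm{softmax}(z)_y.$$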

First, we load the necessary libraries:

# PyTorch library, provides tensor computation and deep neural networks
import torch

# Package that provides access to popular datasets and image transformations for computer vision
import torchvision

import torch.nn as nn  # Provides classes to define and manipulate neural networks
import torch.nn.functional as F  # Contains functions that do not have any parameters, such as relu, tanh, etc.
import torch.optim as optim  # Package implementing various optimization algorithms

# Library for the Python programming language, adding support for large, multi-dimensional arrays and matrices.
import numpy as np

import matplotlib.pyplot as plt  # Library for creating static, animated, and interactive visualizations in Python

We set the main hyperparameters of the training algorithm:

n_epochs = 5 # Number of epochs to train the model
batch_size_train = 100 # Number of training examples utilized in one iteration
batch_size_test = 10000 # Number of test examples utilized in one iteration
learning_rate = 5e-4 # Learning rate for the optimizer
log_interval = 100 # Number of batches to wait before logging training status

random_seed = 1 # Random seed for reproducibility
torch.backends.cudnn.enabled = False # Disables cuDNN for reproducibility
torch.manual_seed(random_seed); # Sets the seed for generating random numbers

We load the MNIST dataset:

transform = torchvision.transforms.Compose([  # Preprocessing the data
    torchvision.transforms.ToTensor(),  # Converts the image to a tensor
    torchvision.transforms.Normalize((0.1307,), (0.3081,))  # Normalizes the image
])

# Loads the MNIST dataset
train_dataset = torchvision.datasets.MNIST('../data/', train=True, download=True, transform=transform)  
test_dataset = torchvision.datasets.MNIST('../data/', train=False, download=True, transform=transform)

# Creates a DataLoader for the datasets
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size_train, shuffle=True)   
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size_test, shuffle=False)
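
The values 0.1307 and 0.3081 passed to Normalize are the mean and the standard deviation of the MNIST training pixels. As a quick sketch of where they come from (assuming the raw dataset is already downloaded to ../data/; this loads all 60,000 images, so it takes a moment):

raw_train = torchvision.datasets.MNIST('../data/', train=True, download=True,
                                       transform=torchvision.transforms.ToTensor())
pixels = torch.stack([img for img, _ in raw_train])  # Tensor of shape (60000, 1, 28, 28), values in [0, 1]
print(f"Mean: {pixels.mean().item():.4f}, Std: {pixels.std().item():.4f}")  # Approximately 0.1307 and 0.3081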

We can observe the first batch of the dataset. Each time we iterate over train_loader, it returns a batch of images and their corresponding labels:

examples = enumerate(train_loader)
batch_idx, (example_data, example_targets) = next(examples)
print(f"Shape of the image: {example_data.shape}")
print(f"Shape of the target: {example_targets.shape}")
Shape of the image: torch.Size([100, 1, 28, 28])
Shape of the target: torch.Size([100])
plt.figure(figsize=(8, 4)) 
for i in range(6):
    plt.subplot(2,3,i+1)
    plt.tight_layout()
    plt.imshow(example_data[i][0], cmap='gray', interpolation='none')
    plt.title(f"Label: {example_targets[i]}")
    plt.xticks([])
    plt.yticks([])

We can now define the model. The model is a class that inherits from nn.Module. The __init__ method defines the layers of the model and the forward method defines the forward pass.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        pass  # The layers will be defined here

    def forward(self, x):
        pass  # The forward pass will be defined here

In particular, the model has two hidden layers with 64 and 32 neurons, respectively, both using the ReLU activation function. The output layer has 10 neurons, one per digit class. Note that the forward pass returns raw scores (logits): the softmax is applied implicitly by the cross-entropy loss defined below.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(784, 64)   # Fully connected layer with 784 input features and 64 output features
        self.fc2 = nn.Linear(64, 32)   # Fully connected layer with 64 input features and 32 output features
        self.fc3 = nn.Linear(32, 10)  # Fully connected layer with 32 input features and 10 output features

    def forward(self, x):
        x = x.view(-1, 784) # Flattens each 28x28 image into a vector of 784 features
        x = F.relu(self.fc1(x)) # Applies the rectified linear unit function to the output of the first fully connected layer
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
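
As a quick sanity check (not part of the original notebook), we can verify the output shape of the model on a random input, before moving it to the GPU:

dummy_input = torch.randn(5, 1, 28, 28)  # A batch of 5 random 28x28 "images"
print(Net()(dummy_input).shape)  # Expected: torch.Size([5, 10])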

We initialize the model (the call to .cuda() moves it to the GPU), the optimizer, and the loss function:

network = Net().cuda() # Instantiates the neural network
optimizer = optim.SGD(network.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss() # Instantiates the loss function
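
Note that nn.CrossEntropyLoss expects raw logits: internally it applies a log-softmax followed by the negative log-likelihood. A small illustrative check (not part of the original notebook):

logits = torch.randn(4, 10)                  # Fake logits for a batch of 4 examples
labels = torch.randint(0, 10, (4,))          # Fake target classes
loss_ce = nn.CrossEntropyLoss()(logits, labels)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss_ce, loss_manual))  # True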

We can count the number of parameters of the model:

model_parameters = filter(lambda p: p.requires_grad, network.parameters())
params = sum([np.prod(p.size()) for p in model_parameters])
print(f"Number of trainable parameters: {params}")
Number of trainable parameters: 52650
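
This count can be verified by hand: each nn.Linear layer has in_features × out_features weights plus out_features biases, so the total is (784 × 64 + 64) + (64 × 32 + 32) + (32 × 10 + 10) = 50,240 + 2,080 + 330 = 52,650.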

Let us observe the output of the model before training:

first_image = example_data[0:1].cuda()
first_target = example_targets[0:1].cuda()
with torch.no_grad():
    print(f"Output shape: {network(first_image).shape}")
    print(f"Output: {network(first_image)}")
Output shape: torch.Size([1, 10])
Output: tensor([[-0.0082, -0.0773,  0.0621,  0.1702, -0.1853,  0.1528,  0.0084,  0.2774,
          0.3845,  0.1428]], device='cuda:0')

The predicted label is the index of the neuron with the highest activation:

print(f"Prediction: {network(first_image).argmax()} vs. Target: {first_target}")
Prediction: 8 vs. Target: tensor([4], device='cuda:0')
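
Since the forward pass returns unnormalized scores, we can apply the softmax explicitly if we want class probabilities (an optional check, not needed for training because the loss handles it):

with torch.no_grad():
    probabilities = F.softmax(network(first_image), dim=1)  # Converts the logits into probabilities
print(f"Sum of the probabilities: {probabilities.sum().item():.4f}")  # ~1.0
print(f"Most probable class: {probabilities.argmax().item()}")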

If we compute the accuracy on the first batch, we obtain:

predictions = network(example_data.cuda()).argmax(dim=1)
number_of_correct_predictions = (predictions == example_targets.cuda()).sum().item()
print(f"Accuracy: {number_of_correct_predictions}/{batch_size_train} ({100*number_of_correct_predictions/batch_size_train:.2f}%)")
Accuracy: 6/100 (6.00%)

The model is not trained yet, so its predictions are essentially random (we expect an accuracy of about 10% with 10 classes).

For a given batch of the dataset, we will update the parameters using gradient descent with the backpropagation algorithm:

# Here, data and target denote one batch of images and labels from the train_loader
optimizer.zero_grad() # Clears the gradients of all optimized torch.Tensors
output = network(data.cuda()) # Forward pass: computes predicted outputs by passing inputs to the model
loss = criterion(output, target.cuda())  # Computes the loss
loss.backward() # Backward pass: computes gradient of the loss with respect to model parameters
optimizer.step() # Updates the parameters of the model

We will apply this loop to every batch of the dataset, multiple times:

def train(epoch):
  for batch_idx, (data, target) in enumerate(train_loader): # Iterates over the training dataset
    optimizer.zero_grad() # Clears the gradients of all optimized torch.Tensors
    output = network(data.cuda()) # Forward pass: computes predicted outputs by passing inputs to the model
    loss = criterion(output, target.cuda())  # Computes the loss
    loss.backward() # Backward pass: computes gradient of the loss with respect to model parameters
    optimizer.step() # Updates the parameters of the model

We add a few lines of code to monitor the training: we log the loss at regular intervals and store it for plotting later (network.train() switches the model to training mode; it has no effect for this particular model, but it is good practice):

def train(epoch, log_interval):
  network.train()
  for batch_idx, (data, target) in enumerate(train_loader): # Iterates over the training dataset
    optimizer.zero_grad() # Clears the gradients of all optimized torch.Tensors
    output = network(data.cuda()) # Forward pass: computes predicted outputs by passing inputs to the model
    loss = criterion(output, target.cuda())  # Computes the loss
    loss.backward() # Backward pass: computes gradient of the loss with respect to model parameters
    optimizer.step() # Updates the parameters of the model

    if batch_idx % log_interval == 0 or batch_idx == len(train_loader)-1:
        print('Train Epoch: {} [{:5d}/{:5d} ({:3.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
    train_losses.append(loss.item())
    train_counter.append(
            (batch_idx*batch_size_train) + ((epoch-1)*len(train_loader.dataset)))

We can also define a function to evaluate the model on the test set:

def test():
  network.eval()  # Switches the model to evaluation mode
  test_loss = 0
  correct = 0
  with torch.no_grad():  # No gradients are needed for evaluation
    for data, target in test_loader:  # A single batch here, since batch_size_test = 10000
      output = network(data.cuda())
      test_loss += criterion(output, target.cuda()).sum()  # Mean loss over the batch (i.e. the whole test set here)
      pred = output.data.max(1, keepdim=True)[1]  # Index of the class with the highest score
      correct += pred.eq(target.cuda().data.view_as(pred)).sum()  # Number of correct predictions
  test_losses.append(test_loss.item())
  print('Test set: Average loss: {:.4f},\t Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))

We can now train the model:

train_losses, train_counter, test_losses = [], [], []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

test()
for epoch in range(1, n_epochs + 1):
    train(epoch, 100)
    test()
Test set: Average loss: 2.3118,	 Accuracy: 846/10000 (8%)

Train Epoch: 1 [    0/60000 (  0%)]	Loss: 2.319541
Train Epoch: 1 [10000/60000 ( 17%)]	Loss: 2.287766
Train Epoch: 1 [20000/60000 ( 33%)]	Loss: 2.280313
Train Epoch: 1 [30000/60000 ( 50%)]	Loss: 2.287734
Train Epoch: 1 [40000/60000 ( 67%)]	Loss: 2.268912
Train Epoch: 1 [50000/60000 ( 83%)]	Loss: 2.254742
Train Epoch: 1 [59900/60000 (100%)]	Loss: 2.245768
Test set: Average loss: 2.2522,	 Accuracy: 1537/10000 (15%)

Train Epoch: 2 [    0/60000 (  0%)]	Loss: 2.231626
Train Epoch: 2 [10000/60000 ( 17%)]	Loss: 2.236296
Train Epoch: 2 [20000/60000 ( 33%)]	Loss: 2.231608
Train Epoch: 2 [30000/60000 ( 50%)]	Loss: 2.225644
Train Epoch: 2 [40000/60000 ( 67%)]	Loss: 2.218160
Train Epoch: 2 [50000/60000 ( 83%)]	Loss: 2.193765
Train Epoch: 2 [59900/60000 (100%)]	Loss: 2.170033
Test set: Average loss: 2.1845,	 Accuracy: 2521/10000 (25%)

Train Epoch: 3 [    0/60000 (  0%)]	Loss: 2.192061
Train Epoch: 3 [10000/60000 ( 17%)]	Loss: 2.179559
Train Epoch: 3 [20000/60000 ( 33%)]	Loss: 2.184042
Train Epoch: 3 [30000/60000 ( 50%)]	Loss: 2.128206
Train Epoch: 3 [40000/60000 ( 67%)]	Loss: 2.168460
Train Epoch: 3 [50000/60000 ( 83%)]	Loss: 2.097113
Train Epoch: 3 [59900/60000 (100%)]	Loss: 2.091380
Test set: Average loss: 2.0877,	 Accuracy: 3294/10000 (33%)

Train Epoch: 4 [    0/60000 (  0%)]	Loss: 2.038798
Train Epoch: 4 [10000/60000 ( 17%)]	Loss: 2.061571
Train Epoch: 4 [20000/60000 ( 33%)]	Loss: 2.046076
Train Epoch: 4 [30000/60000 ( 50%)]	Loss: 2.022704
Train Epoch: 4 [40000/60000 ( 67%)]	Loss: 2.045566
Train Epoch: 4 [50000/60000 ( 83%)]	Loss: 2.022622
Train Epoch: 4 [59900/60000 (100%)]	Loss: 1.921506
Test set: Average loss: 1.9549,	 Accuracy: 4655/10000 (47%)

Train Epoch: 5 [    0/60000 (  0%)]	Loss: 1.946001
Train Epoch: 5 [10000/60000 ( 17%)]	Loss: 1.863767
Train Epoch: 5 [20000/60000 ( 33%)]	Loss: 1.935797
Train Epoch: 5 [30000/60000 ( 50%)]	Loss: 1.856108
Train Epoch: 5 [40000/60000 ( 67%)]	Loss: 1.749203
Train Epoch: 5 [50000/60000 ( 83%)]	Loss: 1.838765
Train Epoch: 5 [59900/60000 (100%)]	Loss: 1.793422
Test set: Average loss: 1.7810,	 Accuracy: 5573/10000 (56%)

We can define a function to plot the training and test losses:

def plot_loss(train_counter, train_losses, test_counter, test_losses):
    plt.figure(figsize=(10, 6))
    plt.annotate('', xy=(0, np.max(train_losses)*1.2), xytext=(0, -0.02),
                    arrowprops=dict(linewidth=2, arrowstyle='->', color='k'), 
                    annotation_clip=False) 
    plt.annotate('', xy=(1.05*max(train_counter), -0.0), xytext=(-100, -0.0),
                    arrowprops=dict(linewidth=2, arrowstyle='->', color='k'), 
                    annotation_clip=False)

    plt.plot(train_counter, train_losses, color='#08457E', clip_on=False)
    plt.ylim([0, np.max(train_losses)*1.2])
    plt.xlim([0, max(train_counter)])
    plt.scatter(test_counter, test_losses, color='#F44336',zorder=+100, clip_on=False)
    plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
    plt.xlim([0, max(train_counter)])
    plt.xlabel('Number of Examples Seen by the model')
    plt.ylabel('Cross-Entropy')
    plt.show()
plot_loss(train_counter, train_losses, test_counter, test_losses)

Since training with this learning rate is slow, we can try increasing it:

learning_rate = 0.9 # Learning rate for the optimizer

network = Net().cuda() # Instantiates the neural network
optimizer = optim.SGD(network.parameters(), lr=learning_rate)
train_losses, train_counter, test_losses = [], [], []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

test()
for epoch in range(1, n_epochs + 1):
    train(epoch, 1000)
    test()
Test set: Average loss: 2.3283,	 Accuracy: 758/10000 (8%)

Train Epoch: 1 [    0/60000 (  0%)]	Loss: 2.319880
Train Epoch: 1 [59900/60000 (100%)]	Loss: 1.490455
Test set: Average loss: 1.3445,	 Accuracy: 5187/10000 (52%)

Train Epoch: 2 [    0/60000 (  0%)]	Loss: 1.353252
Train Epoch: 2 [59900/60000 (100%)]	Loss: 1.857095
Test set: Average loss: 1.7789,	 Accuracy: 2350/10000 (24%)

Train Epoch: 3 [    0/60000 (  0%)]	Loss: 1.791084
Train Epoch: 3 [59900/60000 (100%)]	Loss: 1.898722
Test set: Average loss: 1.7090,	 Accuracy: 2527/10000 (25%)

Train Epoch: 4 [    0/60000 (  0%)]	Loss: 1.754364
Train Epoch: 4 [59900/60000 (100%)]	Loss: 1.540458
Test set: Average loss: 1.6095,	 Accuracy: 3389/10000 (34%)

Train Epoch: 5 [    0/60000 (  0%)]	Loss: 1.632120
Train Epoch: 5 [59900/60000 (100%)]	Loss: 1.860286
Test set: Average loss: 1.7762,	 Accuracy: 2129/10000 (21%)

plot_loss(train_counter, train_losses, test_counter, test_losses)

With such a large learning rate, the optimization oscillates and the test accuracy does not improve steadily. To train both quickly and well, we must find a suitable learning rate:

learning_rate = 5e-2 # Learning rate for the optimizer

network = Net().cuda() # Instantiates the neural network
optimizer = optim.SGD(network.parameters(), lr=learning_rate)
train_losses, train_counter, test_losses = [], [], []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

test()
for epoch in range(1, n_epochs + 1):
    train(epoch, 1000)
    test()
Test set: Average loss: 2.3157,	 Accuracy: 1009/10000 (10%)

Train Epoch: 1 [    0/60000 (  0%)]	Loss: 2.316067
Train Epoch: 1 [59900/60000 (100%)]	Loss: 0.384729
Test set: Average loss: 0.2376,	 Accuracy: 9275/10000 (93%)

Train Epoch: 2 [    0/60000 (  0%)]	Loss: 0.317440
Train Epoch: 2 [59900/60000 (100%)]	Loss: 0.187381
Test set: Average loss: 0.1667,	 Accuracy: 9509/10000 (95%)

Train Epoch: 3 [    0/60000 (  0%)]	Loss: 0.170198
Train Epoch: 3 [59900/60000 (100%)]	Loss: 0.183906
Test set: Average loss: 0.1368,	 Accuracy: 9584/10000 (96%)

Train Epoch: 4 [    0/60000 (  0%)]	Loss: 0.055773
Train Epoch: 4 [59900/60000 (100%)]	Loss: 0.115236
Test set: Average loss: 0.1222,	 Accuracy: 9615/10000 (96%)

Train Epoch: 5 [    0/60000 (  0%)]	Loss: 0.057502
Train Epoch: 5 [59900/60000 (100%)]	Loss: 0.114205
Test set: Average loss: 0.1100,	 Accuracy: 9659/10000 (97%)

plot_loss(train_counter, train_losses, test_counter, test_losses)

We can now visualize the predictions of the model:

predictions = network(example_data.cuda()).argmax(dim=1)
plt.figure(figsize=(8, 4)) 
for i in range(6):
    plt.subplot(2,3,i+1)
    plt.tight_layout()
    plt.imshow(example_data[i][0], cmap='gray', interpolation='none')
    plt.title(f"Prediction: {predictions[i].item()}")
    plt.xticks([])
    plt.yticks([])