
Lab 5: Intro to automatic differentiation and PyTorch#

This notebook is intended to introduce:

  • the basic ideas of automatic differentiation and how to use it in PyTorch;

  • the basic ideas of PyTorch and how to use it to train a simple MLP.

Installation#

Follow the instructions on the PyTorch website to install PyTorch. We will only use the CPU version in this lab, so you can ignore the CUDA-related options.

PyTorch basics#

PyTorch can be used as a replacement for NumPy#

import numpy as np
import torch
M1 = np.array([[1., 2., 3.], [4., 5., 6.]])
M1
array([[1., 2., 3.],
       [4., 5., 6.]])
M1.shape
(2, 3)
v1 = np.array([1., 2., 3.]).reshape(3, 1)

v1
array([[1.],
       [2.],
       [3.]])
M1@v1
array([[14.],
       [32.]])
M2 = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
M2
tensor([[1., 2., 3.],
        [4., 5., 6.]])
M2.shape
torch.Size([2, 3])
v2 = torch.tensor([1., 2., 3.]).reshape(3, 1)

M2@v2
tensor([[14.],
        [32.]])

Converting between numpy and torch#

torch.from_numpy(M1)
tensor([[1., 2., 3.],
        [4., 5., 6.]], dtype=torch.float64)
M2.numpy()
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)
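
A detail worth knowing (not required for this lab): torch.from_numpy does not copy the data, so the tensor shares memory with the source array, and in-place changes on one side are visible on the other (.numpy() behaves the same way for CPU tensors). A minimal check, where A and A_t are illustrative names:

A = np.zeros(3)
A_t = torch.from_numpy(A)  # A_t is a view of A's memory, not a copy
A[0] = 100.
print(A_t)  # expected: tensor([100., 0., 0.], dtype=torch.float64)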

Automatic differentiation with PyTorch#

We will use the following simple example to illustrate the basic ideas of automatic differentiation.

\(L = (y - wx)^2\)

\(\frac{\partial L}{\partial w} = 2(y - wx)(-x)\)

\(\frac{\partial L}{\partial x} = 2(y - wx)(-w)\)

x = torch.tensor([1., 2., 3.])
w = torch.tensor([4., 5., 6.], requires_grad=True)
y = torch.zeros(3)
diff =  y - w*x
diff
tensor([ -4., -10., -18.], grad_fn=<SubBackward0>)
square = diff**2
square
tensor([ 16., 100., 324.], grad_fn=<PowBackward0>)
L = torch.sum(square)
L
tensor(440., grad_fn=<SumBackward0>)

Get the gradient of L with respect to w with torch.autograd.grad (left commented out here; see the note below)

# dw = torch.autograd.grad(L, w)

# dw

Cannot get the gradient of L with respect to x, because x was created without requires_grad=True

# dx = torch.autograd.grad(L, x)
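
If the gradient with respect to x is needed, x has to be created with requires_grad=True before the graph is built. A minimal sketch (x2 and L2 are illustrative names; this builds a fresh graph and does not interfere with the cells below):

x2 = torch.tensor([1., 2., 3.], requires_grad=True)
L2 = torch.sum((torch.zeros(3) - w * x2)**2)  # same loss, rebuilt so x2 is part of the graph
dx2 = torch.autograd.grad(L2, x2)
print(dx2)  # expected: (tensor([ 32., 100., 216.]),)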

Alternative way to get grad instead of calling autograd.grad#

Note:

  • calling .backward() is the recommended way to get gradients.

  • backward() frees the computation graph, so you cannot backpropagate through it a second time. Keep the autograd.grad cells above commented out (as done here) before running the cell below.

L.backward()
w.grad
tensor([  8.,  40., 108.])
print(x.grad)
None
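
As a sanity check, w.grad matches the analytic formula derived above (the name manual_grad is just for illustration):

manual_grad = 2 * (y - w * x).detach() * (-x)   # 2(y - wx)(-x) at the same values
print(manual_grad)                              # expected: tensor([  8.,  40., 108.])
print(torch.allclose(w.grad, manual_grad))      # expected: True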

You can even get higher-order derivatives#

def get_loss(w):
    x = torch.tensor([1., 2., 3.])
    y = torch.zeros(3)

    L = torch.sum((y - w*x)**2)

    return L

Get the Hessian

from torch.autograd.functional import hessian

w = torch.tensor([4., 5., 6.], requires_grad=True)

hessian(get_loss, w)
tensor([[ 2.,  0.,  0.],
        [ 0.,  8.,  0.],
        [ 0.,  0., 18.]])
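
For reference, the same second derivatives can be obtained by hand with two calls to torch.autograd.grad, passing create_graph=True so the gradient itself can be differentiated (a sketch; g and h_rowsum are illustrative names):

L = get_loss(w)
(g,) = torch.autograd.grad(L, w, create_graph=True)  # dL/dw, kept differentiable
(h_rowsum,) = torch.autograd.grad(g.sum(), w)        # row sums of the Hessian
print(h_rowsum)  # expected: tensor([ 2.,  8., 18.]) -- equals the diagonal, since the Hessian is diagonal here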

MLP with PyTorch#

import torch.nn as nn

class MyTwoLayerMLP(nn.Module):

    def __init__(self, num_input_nodes, num_hidden_nodes, num_out_nodes):
        super().__init__()

        self.num_input_nodes = num_input_nodes
        self.num_hidden_nodes = num_hidden_nodes
        self.num_out_nodes = num_out_nodes

        self.linear1 = nn.Linear(num_input_nodes, num_hidden_nodes)
        self.linear2 = nn.Linear(num_hidden_nodes, num_out_nodes)
    def forward(self, X):

        h = self.linear1(X)
        h = torch.relu(h)
        y = self.linear2(h)

        return y
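
For comparison only (the rest of the lab uses the class above), the same architecture can also be written with nn.Sequential; model_seq is an illustrative name:

model_seq = nn.Sequential(
    nn.Linear(3, 10),   # input -> hidden
    nn.ReLU(),
    nn.Linear(10, 1),   # hidden -> output
)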

Let’s create some data to test our model#

torch.manual_seed(0)  # random seed

def generate_data(N, num_in=3):
    """Generate some random data with num_in=3 features x1, x2, x3.
    The target is y = x1^2 + x2^2 + x3^2.

    Args:
        N: number of samples
        num_in: number of input features
    """
    X = torch.randn(N, num_in)
    y = torch.square(X).sum(dim=1).reshape(N, 1)

    return X, y
X, y = generate_data(N=100, num_in=3)
X.shape
torch.Size([100, 3])
y.shape
torch.Size([100, 1])

Create a model with 10 hidden nodes#

num_in = 3
num_hidden = 10
num_out  = 1

model = MyTwoLayerMLP(num_in, num_hidden, num_out)

Let’s check that the model works

y_pred = model(X)

y_pred.shape
torch.Size([100, 1])
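
A quick check (not part of the original lab): the learnable parameters that the optimizer will update can be inspected through model.parameters().

num_params = sum(p.numel() for p in model.parameters())
print(num_params)  # expected: 51 = (3*10 + 10) + (10*1 + 1)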

Train the model#

def train_one_step(model, optimizer, X, y):
    """Performs one step of gradient descent on the given model.

    Args:
        model: the model to train
        optimizer: the optimizer to use
        X: the input data
        y: the target data
    """
    y_pred = model(X)
    loss = torch.mean((y_pred - y)**2)

    optimizer.zero_grad()  # clear gradients accumulated from the previous step
    loss.backward()        # backpropagate: populate .grad for every model parameter
    optimizer.step()       # update the parameters using the gradients

    return loss
Using SGD optimizer#

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
num_steps = 30

losses = []
for s in range(num_steps):          # each outer iteration is one pass over the dataset
    for X_i, y_i in zip(X, y):      # one gradient step per individual sample (batch size 1)
        l = train_one_step(model, optimizer, X_i, y_i)
        losses.append(l.item())
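
The loop above takes one gradient step per individual sample. An alternative sketch using full-batch gradient descent, with one step per pass over the whole dataset (full_batch_losses is an illustrative name; running it continues training the same model):

full_batch_losses = []
for s in range(num_steps):
    l = train_one_step(model, optimizer, X, y)  # one step on all 100 samples at once
    full_batch_losses.append(l.item())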
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

ax.plot(losses)

ax.set_xlabel('step')
ax.set_ylabel('loss')

# Note: each recorded loss is the training loss of a single sample, which is why the curve is noisy
[Figure: per-sample training loss vs. step]
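
A possible follow-up (not part of the original lab): evaluate the trained model on freshly generated test data inside torch.no_grad(), so no computation graph is built.

X_test, y_test = generate_data(N=20, num_in=3)
with torch.no_grad():                       # no graph is recorded during evaluation
    test_mse = torch.mean((model(X_test) - y_test)**2)
print(test_mse.item())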