# Learning Rate Scheduling

Run Jupyter Notebook

You can run the code for this section in this Jupyter notebook link.

## Optimization Algorithm: Mini-batch Stochastic Gradient Descent (SGD)

• We will be using mini-batch gradient descent in all our examples here when scheduling our learning rate
• $\theta = \theta - \eta \cdot \nabla J(\theta, x^{i: i+n}, y^{i:i+n})$
• Characteristics
• Compute the gradient of the loss function w.r.t. the parameters for n training samples (n inputs and n labels), $\nabla J(\theta, x^{i: i+n}, y^{i:i+n})$
• Use this gradient to update our parameters at every iteration
• Typically in deep learning, some variation of mini-batch gradient descent is used, where the batch size is a hyperparameter to be determined
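The update rule above can be sketched directly in PyTorch. This is a minimal illustration on a made-up one-parameter regression problem (the data, `eta`, and batch size `n` are all invented for the sketch; the variable names mirror the equation):

```python
import torch

# Made-up data for the sketch: fit y = 2x with a single weight theta
torch.manual_seed(0)
x = torch.randn(100, 1)
y = 2 * x

theta = torch.zeros(1, requires_grad=True)  # parameter to learn
eta = 0.1                                   # learning rate
n = 4                                       # mini-batch size

for i in range(0, 100, n):
    xb, yb = x[i:i+n], y[i:i+n]             # x^{i:i+n}, y^{i:i+n}
    loss = ((xb * theta - yb) ** 2).mean()  # J(theta, x^{i:i+n}, y^{i:i+n})
    loss.backward()                         # gradient of J w.r.t. theta
    with torch.no_grad():
        theta -= eta * theta.grad           # theta = theta - eta * grad
        theta.grad.zero_()

print(theta.item())  # close to 2 after one pass over the data
```

Each iteration uses only `n` samples to estimate the gradient, which is exactly what the MNIST examples below do with batches of 100 images.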

## Learning Intuition Recap

• Learning process
• Original parameters $\rightarrow$ given input, get output $\rightarrow$ compare output with labels $\rightarrow$ get loss from this comparison $\rightarrow$ get gradients of loss w.r.t. parameters $\rightarrow$ update parameters so model can churn out output closer to labels $\rightarrow$ repeat
• For a detailed mathematical account of how this works and how to implement from scratch in Python and PyTorch, you can read our forward- and back-propagation and gradient descent post.

## Learning Rate Pointers

• Update parameters so model can churn out output closer to labels, lowering the loss
• $\theta = \theta - \eta \cdot \nabla J(\theta, x^{i: i+n}, y^{i:i+n})$
• If we set $\eta$ to be a large value $\rightarrow$ learn too much (rapid learning)
• Unable to converge to a good local minimum (unable to gradually decrease the loss; overshoots the lowest local value)
• If we set $\eta$ to be a small value $\rightarrow$ learn too little (slow learning)
• May take too long, or be unable, to converge to a good local minimum
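These two failure modes can be seen with plain gradient descent on a toy loss $J(\theta) = \theta^2$, whose gradient is $2\theta$. A minimal sketch (the step counts and learning rates are illustrative, not from the tutorial):

```python
# Gradient descent on J(theta) = theta**2 (minimum at theta = 0)
# dJ/dtheta = 2 * theta, so each step multiplies theta by (1 - 2*eta)
def descend(eta, steps=20, theta=1.0):
    for _ in range(steps):
        theta = theta - eta * (2 * theta)
    return theta

print(descend(eta=1.1))   # too large: |1 - 2*eta| > 1, overshoots and diverges
print(descend(eta=0.01))  # too small: still far from the minimum after 20 steps
print(descend(eta=0.3))   # a reasonable eta: essentially at the minimum
```

The schedules below try to get the best of both: start with a large $\eta$ for rapid learning, then shrink it to settle into a good minimum.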

## Need for Learning Rate Schedules

• Benefits
• Converge faster
• Higher accuracy

## Top Basic Learning Rate Schedules

1. Step-wise Decay
2. Reduce on Loss Plateau Decay

### Step-wise Learning Rate Decay

#### Step-wise Decay: Every Epoch

• At every epoch,
• $\eta_t = \eta_{t-1}\gamma$
• $\gamma = 0.1$
• Optimization Algorithm: SGD Nesterov
• Modification of SGD Momentum
• $v_t = \gamma v_{t-1} + \eta \cdot \nabla J(\theta - \gamma v_{t-1}, x^{i: i+n}, y^{i:i+n})$
• $\theta = \theta - v_t$
• Practical example
• Given $\eta_t = 0.1$ and $\gamma = 0.1$
• Epoch 0: $\eta_t = 0.1$
• Epoch 1: $\eta_{t+1} = 0.1 (0.1) = 0.01$
• Epoch 2: $\eta_{t+2} = 0.1 (0.1)^2 = 0.001$
• Epoch n: $\eta_{t+n} = 0.1 (0.1)^n$
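The schedule itself can be checked in isolation, without any training: step a `StepLR` on a throwaway optimizer and watch the learning rate. A minimal sketch (the dummy parameter is invented just to drive the scheduler; it is not part of the MNIST example below):

```python
import torch
from torch.optim.lr_scheduler import StepLR

# Dummy parameter and optimizer, used only to drive the scheduler
param = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)

# Decay LR by gamma = 0.1 after every 1 epoch
scheduler = StepLR(optimizer, step_size=1, gamma=0.1)

for epoch in range(4):
    optimizer.step()   # training iterations would happen here
    scheduler.step()   # decay once per epoch
    print(epoch, optimizer.param_groups[0]['lr'])
# LR: 0.01, 0.001, 1e-4, 1e-5 (up to floating-point rounding)
```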

Code for step-wise learning rate decay at every epoch

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# Set seed
torch.manual_seed(0)

# Where to add a new import
from torch.optim.lr_scheduler import StepLR

'''
STEP 1: LOADING DATASET
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
STEP 2: MAKING DATASET ITERABLE
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

'''
STEP 3: CREATE MODEL CLASS
'''
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        out = self.fc2(out)
        return out
'''
STEP 4: INSTANTIATE MODEL CLASS
'''
input_dim = 28*28
hidden_dim = 100
output_dim = 10

model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

'''
STEP 5: INSTANTIATE LOSS CLASS
'''
criterion = nn.CrossEntropyLoss()

'''
STEP 6: INSTANTIATE OPTIMIZER CLASS
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)

'''
STEP 7: INSTANTIATE STEP LEARNING SCHEDULER CLASS
'''
# step_size: at how many multiples of epoch you decay
# step_size = 1, after every 1 epoch, new_lr = lr*gamma
# step_size = 2, after every 2 epoch, new_lr = lr*gamma

# gamma = decaying factor
scheduler = StepLR(optimizer, step_size=1, gamma=0.1)

'''
STEP 8: TRAIN THE MODEL
'''
iter = 0
for epoch in range(num_epochs):
    # Decay Learning Rate
    scheduler.step()
    # Print Learning Rate
    print('Epoch:', epoch, 'LR:', scheduler.get_lr())
    for i, (images, labels) in enumerate(train_loader):
        # Load images with gradient accumulation abilities
        images = images.view(-1, 28*28).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1

        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch tensor
                images = images.view(-1, 28*28)

                # Forward pass only to get logits/output
                outputs = model(images)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                correct += (predicted == labels).sum()

            accuracy = 100 * correct / total

            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

Epoch: 0 LR: [0.1]
Iteration: 500. Loss: 0.15292978286743164. Accuracy: 96
Epoch: 1 LR: [0.010000000000000002]
Iteration: 1000. Loss: 0.1207798570394516. Accuracy: 97
Epoch: 2 LR: [0.0010000000000000002]
Iteration: 1500. Loss: 0.12287932634353638. Accuracy: 97
Epoch: 3 LR: [0.00010000000000000003]
Iteration: 2000. Loss: 0.05614742264151573. Accuracy: 97
Epoch: 4 LR: [1.0000000000000003e-05]
Iteration: 2500. Loss: 0.06775809079408646. Accuracy: 97
Iteration: 3000. Loss: 0.03737065941095352. Accuracy: 97


#### Step-wise Decay: Every 2 Epochs

• At every 2 epochs,
• $\eta_t = \eta_{t-1}\gamma$
• $\gamma = 0.1$
• Optimization Algorithm: SGD Nesterov
• Modification of SGD Momentum
• $v_t = \gamma v_{t-1} + \eta \cdot \nabla J(\theta - \gamma v_{t-1}, x^{i: i+n}, y^{i:i+n})$
• $\theta = \theta - v_t$
• Practical example
• Given $\eta_t = 0.1$ and $\gamma = 0.1$
• Epoch 0: $\eta_t = 0.1$
• Epoch 1: $\eta_{t+1} = 0.1$
• Epoch 2: $\eta_{t+2} = 0.1 (0.1) = 0.01$

Code for step-wise learning rate decay at every 2 epochs

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# Set seed
torch.manual_seed(0)

# Where to add a new import
from torch.optim.lr_scheduler import StepLR

'''
STEP 1: LOADING DATASET
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
STEP 2: MAKING DATASET ITERABLE
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

'''
STEP 3: CREATE MODEL CLASS
'''
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        out = self.fc2(out)
        return out
'''
STEP 4: INSTANTIATE MODEL CLASS
'''
input_dim = 28*28
hidden_dim = 100
output_dim = 10

model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

'''
STEP 5: INSTANTIATE LOSS CLASS
'''
criterion = nn.CrossEntropyLoss()

'''
STEP 6: INSTANTIATE OPTIMIZER CLASS
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)

'''
STEP 7: INSTANTIATE STEP LEARNING SCHEDULER CLASS
'''
# step_size: at how many multiples of epoch you decay
# step_size = 1, after every 1 epoch, new_lr = lr*gamma
# step_size = 2, after every 2 epochs, new_lr = lr*gamma

# gamma = decaying factor
scheduler = StepLR(optimizer, step_size=2, gamma=0.1)

'''
STEP 8: TRAIN THE MODEL
'''
iter = 0
for epoch in range(num_epochs):
    # Decay Learning Rate
    scheduler.step()
    # Print Learning Rate
    print('Epoch:', epoch, 'LR:', scheduler.get_lr())
    for i, (images, labels) in enumerate(train_loader):
        # Load images with gradient accumulation abilities
        images = images.view(-1, 28*28).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1

        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch tensor
                images = images.view(-1, 28*28)

                # Forward pass only to get logits/output
                outputs = model(images)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                correct += (predicted == labels).sum()

            accuracy = 100 * correct / total

            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

Epoch: 0 LR: [0.1]
Iteration: 500. Loss: 0.15292978286743164. Accuracy: 96
Epoch: 1 LR: [0.1]
Iteration: 1000. Loss: 0.11253029108047485. Accuracy: 96
Epoch: 2 LR: [0.010000000000000002]
Iteration: 1500. Loss: 0.14498558640480042. Accuracy: 97
Epoch: 3 LR: [0.010000000000000002]
Iteration: 2000. Loss: 0.03691177815198898. Accuracy: 97
Epoch: 4 LR: [0.0010000000000000002]
Iteration: 2500. Loss: 0.03511016443371773. Accuracy: 97
Iteration: 3000. Loss: 0.029424520209431648. Accuracy: 97


#### Step-wise Decay: Every Epoch, Larger Gamma

• At every epoch,
• $\eta_t = \eta_{t-1}\gamma$
• $\gamma = 0.96$
• Optimization Algorithm: SGD Nesterov
• Modification of SGD Momentum
• $v_t = \gamma v_{t-1} + \eta \cdot \nabla J(\theta - \gamma v_{t-1}, x^{i: i+n}, y^{i:i+n})$
• $\theta = \theta - v_t$
• Practical example
• Given $\eta_t = 0.1$ and $\gamma = 0.96$
• Epoch 0: $\eta_t = 0.1$
• Epoch 1: $\eta_{t+1} = 0.1 (0.96) = 0.096$
• Epoch 2: $\eta_{t+2} = 0.1 (0.96)^2 = 0.092$
• Epoch n: $\eta_{t+n} = 0.1 (0.96)^n$

Code for step-wise learning rate decay at every epoch with larger gamma

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# Set seed
torch.manual_seed(0)

# Where to add a new import
from torch.optim.lr_scheduler import StepLR

'''
STEP 1: LOADING DATASET
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
STEP 2: MAKING DATASET ITERABLE
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

'''
STEP 3: CREATE MODEL CLASS
'''
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        out = self.fc2(out)
        return out
'''
STEP 4: INSTANTIATE MODEL CLASS
'''
input_dim = 28*28
hidden_dim = 100
output_dim = 10

model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

'''
STEP 5: INSTANTIATE LOSS CLASS
'''
criterion = nn.CrossEntropyLoss()

'''
STEP 6: INSTANTIATE OPTIMIZER CLASS
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)

'''
STEP 7: INSTANTIATE STEP LEARNING SCHEDULER CLASS
'''
# step_size: at how many multiples of epoch you decay
# step_size = 1, after every 1 epoch, new_lr = lr*gamma
# step_size = 2, after every 2 epochs, new_lr = lr*gamma

# gamma = decaying factor
scheduler = StepLR(optimizer, step_size=2, gamma=0.96)

'''
STEP 8: TRAIN THE MODEL
'''
iter = 0
for epoch in range(num_epochs):
    # Decay Learning Rate
    scheduler.step()
    # Print Learning Rate
    print('Epoch:', epoch, 'LR:', scheduler.get_lr())
    for i, (images, labels) in enumerate(train_loader):
        # Load images with gradient accumulation abilities
        images = images.view(-1, 28*28).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1

        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch tensor
                images = images.view(-1, 28*28)

                # Forward pass only to get logits/output
                outputs = model(images)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                correct += (predicted == labels).sum()

            accuracy = 100 * correct / total

            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

Epoch: 0 LR: [0.1]
Iteration: 500. Loss: 0.15292978286743164. Accuracy: 96
Epoch: 1 LR: [0.1]
Iteration: 1000. Loss: 0.11253029108047485. Accuracy: 96
Epoch: 2 LR: [0.096]
Iteration: 1500. Loss: 0.11864850670099258. Accuracy: 97
Epoch: 3 LR: [0.096]
Iteration: 2000. Loss: 0.030942382290959358. Accuracy: 97
Epoch: 4 LR: [0.09216]
Iteration: 2500. Loss: 0.04521659016609192. Accuracy: 97
Iteration: 3000. Loss: 0.027839098125696182. Accuracy: 97


#### Pointers on Step-wise Decay

• You would want to decay your LR gradually when you're training for more epochs
• If you decay too rapidly, you converge too fast to a poor loss/accuracy
• To decay slower
• Larger $\gamma$
• Larger interval of decay
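Both knobs can be compared directly with the closed form $\eta_n = \eta_0 \, \gamma^{\lfloor n / \text{step\_size} \rfloor}$, which is what StepLR implements. A quick sketch of the LR after 30 epochs (the epoch count and values below are illustrative):

```python
def lr_at_epoch(lr0, gamma, step_size, epoch):
    # StepLR closed form: decay once every `step_size` epochs
    return lr0 * gamma ** (epoch // step_size)

# Fast decay: gamma=0.1 every epoch -> LR vanishes, training effectively freezes
print(lr_at_epoch(0.1, 0.1, 1, 30))   # 1e-31
# Slower: larger gamma
print(lr_at_epoch(0.1, 0.96, 1, 30))  # ~0.0294
# Slower still: larger gamma and a larger decay interval
print(lr_at_epoch(0.1, 0.96, 2, 30))  # ~0.0542
```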

### Reduce on Loss Plateau Decay

#### Reduce on Loss Plateau Decay, Patience=0, Factor=0.1

• Reduce learning rate whenever the tracked metric (here, validation accuracy) plateaus
• Patience: number of epochs with no improvement after which the learning rate will be reduced
• Patience = 0
• Factor: multiplier applied to the learning rate, $lr = lr \cdot factor$ (factor plays the role of $\gamma$)
• Factor = 0.1
• Optimization Algorithm: SGD Nesterov
• Modification of SGD Momentum
• $v_t = \gamma v_{t-1} + \eta \cdot \nabla J(\theta - \gamma v_{t-1}, x^{i: i+n}, y^{i:i+n})$
• $\theta = \theta - v_t$
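Unlike StepLR, ReduceLROnPlateau is driven by a monitored metric rather than the epoch count. A minimal sketch with a made-up accuracy sequence that stalls (the dummy parameter and the accuracy values are invented; the scheduler settings match the example below):

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Dummy parameter and optimizer, used only to drive the scheduler
param = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)

# mode='max': higher metric is better; patience=0: one bad epoch triggers a decay
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=0)

# Made-up validation accuracies: improve, then stall
for acc in [96.0, 97.0, 96.8, 96.9]:
    scheduler.step(acc)
    print(acc, optimizer.param_groups[0]['lr'])
# LR stays 0.1 while accuracy improves, then is cut to 0.01 and 0.001
```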

Code for reduce on loss plateau learning rate decay of factor 0.1 and 0 patience

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# Set seed
torch.manual_seed(0)

# Where to add a new import
from torch.optim.lr_scheduler import ReduceLROnPlateau

'''
STEP 1: LOADING DATASET
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
STEP 2: MAKING DATASET ITERABLE
'''

batch_size = 100
n_iters = 6000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

'''
STEP 3: CREATE MODEL CLASS
'''
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        out = self.fc2(out)
        return out
'''
STEP 4: INSTANTIATE MODEL CLASS
'''
input_dim = 28*28
hidden_dim = 100
output_dim = 10

model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

'''
STEP 5: INSTANTIATE LOSS CLASS
'''
criterion = nn.CrossEntropyLoss()

'''
STEP 6: INSTANTIATE OPTIMIZER CLASS
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)

'''
STEP 7: INSTANTIATE STEP LEARNING SCHEDULER CLASS
'''
# new_lr = lr * factor
# mode='max': track the maximum of the monitored quantity (here, validation accuracy)
# patience: number of epochs with no improvement to wait before decreasing the LR
# patience = 0: reduce the LR after the first epoch without improvement
# factor: decaying factor
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=0, verbose=True)

'''
STEP 8: TRAIN THE MODEL
'''
iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Load images with gradient accumulation abilities
        images = images.view(-1, 28*28).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1

        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch tensor
                images = images.view(-1, 28*28)

                # Forward pass only to get logits/output
                outputs = model(images)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                # Without .item(), it is a uint8 tensor which will not work when you pass this number to the scheduler
                correct += (predicted == labels).sum().item()

            accuracy = 100 * correct / total

            # Print Loss
            # print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

    # Decay Learning Rate, pass validation accuracy for tracking at every epoch
    print('Epoch {} completed'.format(epoch))
    print('Loss: {}. Accuracy: {}'.format(loss.item(), accuracy))
    print('-'*20)
    scheduler.step(accuracy)

Epoch 0 completed
Loss: 0.17087846994400024. Accuracy: 96.26
--------------------
Epoch 1 completed
Loss: 0.11688263714313507. Accuracy: 96.96
--------------------
Epoch 2 completed
Loss: 0.035437121987342834. Accuracy: 96.78
--------------------
Epoch     2: reducing learning rate of group 0 to 1.0000e-02.
Epoch 3 completed
Loss: 0.0324370414018631. Accuracy: 97.7
--------------------
Epoch 4 completed
Loss: 0.022194599732756615. Accuracy: 98.02
--------------------
Epoch 5 completed
Loss: 0.007145566865801811. Accuracy: 98.03
--------------------
Epoch 6 completed
Loss: 0.01673538237810135. Accuracy: 98.05
--------------------
Epoch 7 completed
Loss: 0.025424446910619736. Accuracy: 98.01
--------------------
Epoch     7: reducing learning rate of group 0 to 1.0000e-03.
Epoch 8 completed
Loss: 0.014696130529046059. Accuracy: 98.05
--------------------
Epoch     8: reducing learning rate of group 0 to 1.0000e-04.
Epoch 9 completed
Loss: 0.00573748117312789. Accuracy: 98.04
--------------------
Epoch     9: reducing learning rate of group 0 to 1.0000e-05.


#### Reduce on Loss Plateau Decay, Patience=0, Factor=0.5

• Reduce learning rate whenever the tracked metric (here, validation accuracy) plateaus
• Patience: number of epochs with no improvement after which the learning rate will be reduced
• Patience = 0
• Factor: multiplier applied to the learning rate, $lr = lr \cdot factor$
• Factor = 0.5
• Optimization Algorithm: SGD Nesterov
• Modification of SGD Momentum
• $v_t = \gamma v_{t-1} + \eta \cdot \nabla J(\theta - \gamma v_{t-1}, x^{i: i+n}, y^{i:i+n})$
• $\theta = \theta - v_t$

Code for reduce on loss plateau learning rate decay with factor 0.5 and 0 patience

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# Set seed
torch.manual_seed(0)

# Where to add a new import
from torch.optim.lr_scheduler import ReduceLROnPlateau

'''
STEP 1: LOADING DATASET
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
STEP 2: MAKING DATASET ITERABLE
'''

batch_size = 100
n_iters = 6000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

'''
STEP 3: CREATE MODEL CLASS
'''
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        out = self.fc2(out)
        return out
'''
STEP 4: INSTANTIATE MODEL CLASS
'''
input_dim = 28*28
hidden_dim = 100
output_dim = 10

model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

'''
STEP 5: INSTANTIATE LOSS CLASS
'''
criterion = nn.CrossEntropyLoss()

'''
STEP 6: INSTANTIATE OPTIMIZER CLASS
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)

'''
STEP 7: INSTANTIATE STEP LEARNING SCHEDULER CLASS
'''
# new_lr = lr * factor
# mode='max': track the maximum of the monitored quantity (here, validation accuracy)
# patience: number of epochs with no improvement to wait before decreasing the LR
# patience = 0: reduce the LR after the first epoch without improvement
# factor: decaying factor
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=0, verbose=True)

'''
STEP 8: TRAIN THE MODEL
'''
iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Load images with gradient accumulation abilities
        images = images.view(-1, 28*28).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1

        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch tensor
                images = images.view(-1, 28*28)

                # Forward pass only to get logits/output
                outputs = model(images)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                # Without .item(), it is a uint8 tensor which will not work when you pass this number to the scheduler
                correct += (predicted == labels).sum().item()

            accuracy = 100 * correct / total

            # Print Loss
            # print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

    # Decay Learning Rate, pass validation accuracy for tracking at every epoch
    print('Epoch {} completed'.format(epoch))
    print('Loss: {}. Accuracy: {}'.format(loss.item(), accuracy))
    print('-'*20)
    scheduler.step(accuracy)

Epoch 0 completed
Loss: 0.17087846994400024. Accuracy: 96.26
--------------------
Epoch 1 completed
Loss: 0.11688263714313507. Accuracy: 96.96
--------------------
Epoch 2 completed
Loss: 0.035437121987342834. Accuracy: 96.78
--------------------
Epoch     2: reducing learning rate of group 0 to 5.0000e-02.
Epoch 3 completed
Loss: 0.04893001914024353. Accuracy: 97.62
--------------------
Epoch 4 completed
Loss: 0.020584167912602425. Accuracy: 97.86
--------------------
Epoch 5 completed
Loss: 0.006022400688380003. Accuracy: 97.95
--------------------
Epoch 6 completed
Loss: 0.028374142944812775. Accuracy: 97.87
--------------------
Epoch     6: reducing learning rate of group 0 to 2.5000e-02.
Epoch 7 completed
Loss: 0.013204765506088734. Accuracy: 98.0
--------------------
Epoch 8 completed
Loss: 0.010137186385691166. Accuracy: 97.95
--------------------
Epoch     8: reducing learning rate of group 0 to 1.2500e-02.
Epoch 9 completed
Loss: 0.0035198689438402653. Accuracy: 98.01
--------------------


## Pointers on Reduce on Loss Plateau Decay

• In these examples, we used patience=0 because we are running only a few epochs
• With many more epochs (say 500), you should consider a larger patience such as 5
• You should experiment with 2 properties
• Patience
• Decay factor
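To see what a larger patience buys, compare against the same kind of made-up stalled accuracy sequence with patience=2 (a sketch; the dummy parameter and accuracy numbers are illustrative):

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Dummy parameter and optimizer, used only to drive the scheduler
param = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)

# patience=2: tolerate two epochs without improvement before decaying
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=2)

for acc in [96.0, 97.0, 96.8, 96.9, 96.7]:  # stalls after the second epoch
    scheduler.step(acc)
    print(acc, optimizer.param_groups[0]['lr'])
# With patience=0 the LR would be cut on every bad epoch;
# with patience=2 it is cut once, on the third epoch without improvement
```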

## Summary

We've learnt...


• Learning Rate Intuition
• Update parameters so model can churn output closer to labels
• Learning Rate Pointers
• If we set $\eta$ to be a large value $\rightarrow$ learn too much (rapid learning)
• If we set $\eta$ to be a small value $\rightarrow$ learn too little (slow learning)
• Learning Rate Schedules
• Step-wise Decay
• Reduce on Loss Plateau Decay
• Step-wise Decay
• Every 1 epoch
• Every 2 epoch
• Every 1 epoch, larger gamma
• Step-wise Decay Pointers
• Larger $\gamma$
• Larger interval of decay (increase epoch)
• Reduce on Loss Plateau Decay
• Patience=0, Factor=0.1
• Patience=0, Factor=0.5
• Pointers on Reduce on Loss Plateau Decay
• Larger patience with more epochs
• 2 hyperparameters to experiment
• Patience
• Decay factor

## Citation

If you have found these useful in your research, presentations, school work, projects or workshops, feel free to cite using this DOI.