# Learning Rate Scheduling

Run Jupyter Notebook

You can run the code for this section in this Jupyter notebook link.

## Optimization Algorithm: Mini-batch Stochastic Gradient Descent (SGD)

• We will be using mini-batch gradient descent in all our examples here when scheduling our learning rate
• $\theta = \theta - \eta \cdot \nabla J(\theta, x^{i: i+n}, y^{i:i+n})$
• Characteristics
• Compute the gradient of the loss function w.r.t. the parameters for n training samples (n inputs and n labels), $\nabla J(\theta, x^{i: i+n}, y^{i:i+n})$
• Use this gradient to update our parameters at every iteration
• Typically in deep learning, some variation of mini-batch gradient descent is used, where the batch size is a hyperparameter to be determined
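The update rule above can be sketched directly in PyTorch. This is a minimal illustration on a made-up one-parameter regression problem (the data, `eta`, and batch size `n` are all invented for the sketch; the variable names mirror the equation):

```python
import torch

# Made-up data for the sketch: fit y = 2x with a single weight theta
torch.manual_seed(0)
x = torch.randn(100, 1)
y = 2 * x

theta = torch.zeros(1, requires_grad=True)  # parameter to learn
eta = 0.1                                   # learning rate
n = 4                                       # mini-batch size

for i in range(0, 100, n):
    xb, yb = x[i:i+n], y[i:i+n]             # x^{i:i+n}, y^{i:i+n}
    loss = ((xb * theta - yb) ** 2).mean()  # J(theta, x^{i:i+n}, y^{i:i+n})
    loss.backward()                         # gradient of J w.r.t. theta
    with torch.no_grad():
        theta -= eta * theta.grad           # theta = theta - eta * grad
        theta.grad.zero_()

print(theta.item())  # close to 2 after one pass over the data
```

Each iteration uses only `n` samples to estimate the gradient, which is exactly what the MNIST examples below do with batches of 100 images.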

## Learning Intuition Recap

• Learning process
• Original parameters $\rightarrow$ given input, get output $\rightarrow$ compare output with labels $\rightarrow$ get loss from this comparison $\rightarrow$ get gradients of loss w.r.t. parameters $\rightarrow$ update parameters so model can churn out output closer to labels $\rightarrow$ repeat
• For a detailed mathematical account of how this works and how to implement from scratch in Python and PyTorch, you can read our forward- and back-propagation and gradient descent post.

## Learning Rate Pointers

• Update parameters so model can churn out output closer to labels, lowering the loss
• $\theta = \theta - \eta \cdot \nabla J(\theta, x^{i: i+n}, y^{i:i+n})$
• If we set $\eta$ to be a large value $\rightarrow$ learn too much (rapid learning)
• Unable to converge to a good local minimum (unable to gradually decrease the loss; overshoots the lowest local value)
• If we set $\eta$ to be a small value $\rightarrow$ learn too little (slow learning)
• May take too long, or be unable, to converge to a good local minimum
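These two failure modes can be seen with plain gradient descent on a toy loss $J(\theta) = \theta^2$, whose gradient is $2\theta$. A minimal sketch (the step counts and learning rates are illustrative, not from the tutorial):

```python
# Gradient descent on J(theta) = theta**2 (minimum at theta = 0)
# dJ/dtheta = 2 * theta, so each step multiplies theta by (1 - 2*eta)
def descend(eta, steps=20, theta=1.0):
    for _ in range(steps):
        theta = theta - eta * (2 * theta)
    return theta

print(descend(eta=1.1))   # too large: |1 - 2*eta| > 1, overshoots and diverges
print(descend(eta=0.01))  # too small: still far from the minimum after 20 steps
print(descend(eta=0.3))   # a reasonable eta: essentially at the minimum
```

The schedules below try to get the best of both: start with a large $\eta$ for rapid learning, then shrink it to settle into a good minimum.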

## Need for Learning Rate Schedules

• Benefits
• Converge faster
• Higher accuracy

## Top Basic Learning Rate Schedules

1. Step-wise Decay
2. Reduce on Loss Plateau Decay

### Step-wise Learning Rate Decay

#### Step-wise Decay: Every Epoch

• At every epoch,
• $\eta_t = \eta_{t-1}\gamma$
• $\gamma = 0.1$
• Optimization Algorithm: SGD Nesterov
• Modification of SGD Momentum
• $v_t = \gamma v_{t-1} + \eta \cdot \nabla J(\theta - \gamma v_{t-1}, x^{i: i+n}, y^{i:i+n})$
• $\theta = \theta - v_t$
• Practical example
• Given $\eta_t = 0.1$ and $\gamma = 0.1$
• Epoch 0: $\eta_t = 0.1$
• Epoch 1: $\eta_{t+1} = 0.1 (0.1) = 0.01$
• Epoch 2: $\eta_{t+2} = 0.1 (0.1)^2 = 0.001$
• Epoch n: $\eta_{t+n} = 0.1 (0.1)^n$
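The schedule itself can be checked in isolation, without any training: step a `StepLR` on a throwaway optimizer and watch the learning rate. A minimal sketch (the dummy parameter is invented just to drive the scheduler; it is not part of the MNIST example below):

```python
import torch
from torch.optim.lr_scheduler import StepLR

# Dummy parameter and optimizer, used only to drive the scheduler
param = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)

# Decay LR by gamma = 0.1 after every 1 epoch
scheduler = StepLR(optimizer, step_size=1, gamma=0.1)

for epoch in range(4):
    optimizer.step()   # training iterations would happen here
    scheduler.step()   # decay once per epoch
    print(epoch, optimizer.param_groups[0]['lr'])
# LR: 0.01, 0.001, 1e-4, 1e-5 (up to floating-point rounding)
```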

Code for step-wise learning rate decay at every epoch

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# Set seed
torch.manual_seed(0)

# Where to add a new import
from torch.optim.lr_scheduler import StepLR

'''
STEP 1: LOADING DATASET
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
STEP 2: MAKING DATASET ITERABLE
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

'''
STEP 3: CREATE MODEL CLASS
'''
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        out = self.fc2(out)
        return out
'''
STEP 4: INSTANTIATE MODEL CLASS
'''
input_dim = 28*28
hidden_dim = 100
output_dim = 10

model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

'''
STEP 5: INSTANTIATE LOSS CLASS
'''
criterion = nn.CrossEntropyLoss()

'''
STEP 6: INSTANTIATE OPTIMIZER CLASS
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)

'''
STEP 7: INSTANTIATE STEP LEARNING SCHEDULER CLASS
'''
# step_size: at how many multiples of epoch you decay
# step_size = 1, after every 1 epoch, new_lr = lr*gamma
# step_size = 2, after every 2 epoch, new_lr = lr*gamma

# gamma = decaying factor
scheduler = StepLR(optimizer, step_size=1, gamma=0.1)

'''
STEP 8: TRAIN THE MODEL
'''
iter = 0
for epoch in range(num_epochs):
    # Decay Learning Rate
    scheduler.step()
    # Print Learning Rate
    print('Epoch:', epoch, 'LR:', scheduler.get_lr())
    for i, (images, labels) in enumerate(train_loader):
        # Load images with gradient accumulation abilities
        images = images.view(-1, 28*28).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1

        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch tensor
                images = images.view(-1, 28*28)

                # Forward pass only to get logits/output
                outputs = model(images)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                correct += (predicted == labels).sum()

            accuracy = 100 * correct / total

            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

Epoch: 0 LR: [0.1]
Iteration: 500. Loss: 0.15292978286743164. Accuracy: 96
Epoch: 1 LR: [0.010000000000000002]
Iteration: 1000. Loss: 0.1207798570394516. Accuracy: 97
Epoch: 2 LR: [0.0010000000000000002]
Iteration: 1500. Loss: 0.12287932634353638. Accuracy: 97
Epoch: 3 LR: [0.00010000000000000003]
Iteration: 2000. Loss: 0.05614742264151573. Accuracy: 97
Epoch: 4 LR: [1.0000000000000003e-05]
Iteration: 2500. Loss: 0.06775809079408646. Accuracy: 97
Iteration: 3000. Loss: 0.03737065941095352. Accuracy: 97


#### Step-wise Decay: Every 2 Epochs

• At every 2 epochs,
• $\eta_t = \eta_{t-1}\gamma$
• $\gamma = 0.1$
• Optimization Algorithm: SGD Nesterov
• Modification of SGD Momentum
• $v_t = \gamma v_{t-1} + \eta \cdot \nabla J(\theta - \gamma v_{t-1}, x^{i: i+n}, y^{i:i+n})$
• $\theta = \theta - v_t$
• Practical example
• Given $\eta_t = 0.1$ and $\gamma = 0.1$
• Epoch 0: $\eta_t = 0.1$
• Epoch 1: $\eta_{t+1} = 0.1$
• Epoch 2: $\eta_{t+2} = 0.1 (0.1) = 0.01$

Code for step-wise learning rate decay at every 2 epochs

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# Set seed
torch.manual_seed(0)

# Where to add a new import
from torch.optim.lr_scheduler import StepLR

'''
STEP 1: LOADING DATASET
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
STEP 2: MAKING DATASET ITERABLE
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

'''
STEP 3: CREATE MODEL CLASS
'''
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        out = self.fc2(out)
        return out
'''
STEP 4: INSTANTIATE MODEL CLASS
'''
input_dim = 28*28
hidden_dim = 100
output_dim = 10

model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

'''
STEP 5: INSTANTIATE LOSS CLASS
'''
criterion = nn.CrossEntropyLoss()

'''
STEP 6: INSTANTIATE OPTIMIZER CLASS
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)

'''
STEP 7: INSTANTIATE STEP LEARNING SCHEDULER CLASS
'''
# step_size: at how many multiples of epoch you decay
# step_size = 1, after every 1 epoch, new_lr = lr*gamma
# step_size = 2, after every 2 epochs, new_lr = lr*gamma

# gamma = decaying factor
scheduler = StepLR(optimizer, step_size=2, gamma=0.1)

'''
STEP 8: TRAIN THE MODEL
'''
iter = 0
for epoch in range(num_epochs):
    # Decay Learning Rate
    scheduler.step()
    # Print Learning Rate
    print('Epoch:', epoch, 'LR:', scheduler.get_lr())
    for i, (images, labels) in enumerate(train_loader):
        # Load images with gradient accumulation abilities
        images = images.view(-1, 28*28).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1

        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch tensor
                images = images.view(-1, 28*28)

                # Forward pass only to get logits/output
                outputs = model(images)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                correct += (predicted == labels).sum()

            accuracy = 100 * correct / total

            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

Epoch: 0 LR: [0.1]
Iteration: 500. Loss: 0.15292978286743164. Accuracy: 96
Epoch: 1 LR: [0.1]
Iteration: 1000. Loss: 0.11253029108047485. Accuracy: 96
Epoch: 2 LR: [0.010000000000000002]
Iteration: 1500. Loss: 0.14498558640480042. Accuracy: 97
Epoch: 3 LR: [0.010000000000000002]
Iteration: 2000. Loss: 0.03691177815198898. Accuracy: 97
Epoch: 4 LR: [0.0010000000000000002]
Iteration: 2500. Loss: 0.03511016443371773. Accuracy: 97
Iteration: 3000. Loss: 0.029424520209431648. Accuracy: 97


#### Step-wise Decay: Every Epoch, Larger Gamma

• At every epoch,
• $\eta_t = \eta_{t-1}\gamma$
• $\gamma = 0.96$
• Optimization Algorithm: SGD Nesterov
• Modification of SGD Momentum
• $v_t = \gamma v_{t-1} + \eta \cdot \nabla J(\theta - \gamma v_{t-1}, x^{i: i+n}, y^{i:i+n})$
• $\theta = \theta - v_t$
• Practical example
• Given $\eta_t = 0.1$ and $\gamma = 0.96$
• Epoch 0: $\eta_t = 0.1$
• Epoch 1: $\eta_{t+1} = 0.1 (0.96) = 0.096$
• Epoch 2: $\eta_{t+2} = 0.1 (0.96)^2 = 0.092$
• Epoch n: $\eta_{t+n} = 0.1 (0.96)^n$

Code for step-wise learning rate decay at every epoch with larger gamma

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# Set seed
torch.manual_seed(0)

# Where to add a new import
from torch.optim.lr_scheduler import StepLR

'''
STEP 1: LOADING DATASET
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
STEP 2: MAKING DATASET ITERABLE
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

'''
STEP 3: CREATE MODEL CLASS
'''
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        out = self.fc2(out)
        return out
'''
STEP 4: INSTANTIATE MODEL CLASS
'''
input_dim = 28*28
hidden_dim = 100
output_dim = 10

model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

'''
STEP 5: INSTANTIATE LOSS CLASS
'''
criterion = nn.CrossEntropyLoss()

'''
STEP 6: INSTANTIATE OPTIMIZER CLASS
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)

'''
STEP 7: INSTANTIATE STEP LEARNING SCHEDULER CLASS
'''
# step_size: at how many multiples of epoch you decay
# step_size = 1, after every 1 epoch, new_lr = lr*gamma
# step_size = 2, after every 2 epochs, new_lr = lr*gamma

# gamma = decaying factor
scheduler = StepLR(optimizer, step_size=2, gamma=0.96)

'''
STEP 8: TRAIN THE MODEL
'''
iter = 0
for epoch in range(num_epochs):
    # Decay Learning Rate
    scheduler.step()
    # Print Learning Rate
    print('Epoch:', epoch, 'LR:', scheduler.get_lr())
    for i, (images, labels) in enumerate(train_loader):
        # Load images with gradient accumulation abilities
        images = images.view(-1, 28*28).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1

        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch tensor
                images = images.view(-1, 28*28)

                # Forward pass only to get logits/output
                outputs = model(images)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                correct += (predicted == labels).sum()

            accuracy = 100 * correct / total

            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

Epoch: 0 LR: [0.1]
Iteration: 500. Loss: 0.15292978286743164. Accuracy: 96
Epoch: 1 LR: [0.1]
Iteration: 1000. Loss: 0.11253029108047485. Accuracy: 96
Epoch: 2 LR: [0.096]
Iteration: 1500. Loss: 0.11864850670099258. Accuracy: 97
Epoch: 3 LR: [0.096]
Iteration: 2000. Loss: 0.030942382290959358. Accuracy: 97
Epoch: 4 LR: [0.09216]
Iteration: 2500. Loss: 0.04521659016609192. Accuracy: 97
Iteration: 3000. Loss: 0.027839098125696182. Accuracy: 97


#### Pointers on Step-wise Decay

• You would want to decay your LR gradually when you're training for more epochs
• If you decay too rapidly, you converge too fast to a poor loss/accuracy
• To decay slower
• Larger $\gamma$
• Larger interval of decay
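Both knobs can be compared directly with the closed form $\eta_n = \eta_0 \, \gamma^{\lfloor n / \text{step\_size} \rfloor}$, which is what StepLR implements. A quick sketch of the LR after 30 epochs (the epoch count and values below are illustrative):

```python
def lr_at_epoch(lr0, gamma, step_size, epoch):
    # StepLR closed form: decay once every `step_size` epochs
    return lr0 * gamma ** (epoch // step_size)

# Fast decay: gamma=0.1 every epoch -> LR vanishes, training effectively freezes
print(lr_at_epoch(0.1, 0.1, 1, 30))   # 1e-31
# Slower: larger gamma
print(lr_at_epoch(0.1, 0.96, 1, 30))  # ~0.0294
# Slower still: larger gamma and a larger decay interval
print(lr_at_epoch(0.1, 0.96, 2, 30))  # ~0.0542
```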

### Reduce on Loss Plateau Decay

#### Reduce on Loss Plateau Decay, Patience=0, Factor=0.1

• Reduce learning rate whenever the tracked metric (here, validation accuracy) plateaus
• Patience: number of epochs with no improvement after which the learning rate will be reduced
• Patience = 0
• Factor: multiplier applied to the learning rate, $lr = lr \cdot factor$ (factor plays the role of $\gamma$)
• Factor = 0.1
• Optimization Algorithm: SGD Nesterov
• Modification of SGD Momentum
• $v_t = \gamma v_{t-1} + \eta \cdot \nabla J(\theta - \gamma v_{t-1}, x^{i: i+n}, y^{i:i+n})$
• $\theta = \theta - v_t$
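Unlike StepLR, ReduceLROnPlateau is driven by a monitored metric rather than the epoch count. A minimal sketch with a made-up accuracy sequence that stalls (the dummy parameter and the accuracy values are invented; the scheduler settings match the example below):

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Dummy parameter and optimizer, used only to drive the scheduler
param = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)

# mode='max': higher metric is better; patience=0: one bad epoch triggers a decay
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=0)

# Made-up validation accuracies: improve, then stall
for acc in [96.0, 97.0, 96.8, 96.9]:
    scheduler.step(acc)
    print(acc, optimizer.param_groups[0]['lr'])
# LR stays 0.1 while accuracy improves, then is cut to 0.01 and 0.001
```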

Code for reduce on loss plateau learning rate decay of factor 0.1 and 0 patience

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# Set seed
torch.manual_seed(0)

# Where to add a new import
from torch.optim.lr_scheduler import ReduceLROnPlateau

'''
STEP 1: LOADING DATASET
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
STEP 2: MAKING DATASET ITERABLE
'''

batch_size = 100
n_iters = 6000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

'''
STEP 3: CREATE MODEL CLASS
'''
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        out = self.fc2(out)
        return out
'''
STEP 4: INSTANTIATE MODEL CLASS
'''
input_dim = 28*28
hidden_dim = 100
output_dim = 10

model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

'''
STEP 5: INSTANTIATE LOSS CLASS
'''
criterion = nn.CrossEntropyLoss()

'''
STEP 6: INSTANTIATE OPTIMIZER CLASS
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)

'''
STEP 7: INSTANTIATE STEP LEARNING SCHEDULER CLASS
'''
# new_lr = lr * factor
# mode='max': track the maximum of the monitored quantity (here, validation accuracy)
# patience: number of epochs with no improvement to wait before decreasing the LR
# patience = 0: reduce the LR after the first epoch without improvement
# factor: decaying factor
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=0, verbose=True)

'''
STEP 8: TRAIN THE MODEL
'''
iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Load images with gradient accumulation abilities
        images = images.view(-1, 28*28).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1

        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch tensor
                images = images.view(-1, 28*28)

                # Forward pass only to get logits/output
                outputs = model(images)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                # Without .item(), it is a uint8 tensor which will not work when you pass this number to the scheduler
                correct += (predicted == labels).sum().item()

            accuracy = 100 * correct / total

            # Print Loss
            # print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

    # Decay Learning Rate, pass validation accuracy for tracking at every epoch
    print('Epoch {} completed'.format(epoch))
    print('Loss: {}. Accuracy: {}'.format(loss.item(), accuracy))
    print('-'*20)
    scheduler.step(accuracy)

Epoch 0 completed
Loss: 0.17087846994400024. Accuracy: 96.26
--------------------
Epoch 1 completed
Loss: 0.11688263714313507. Accuracy: 96.96
--------------------
Epoch 2 completed
Loss: 0.035437121987342834. Accuracy: 96.78
--------------------
Epoch     2: reducing learning rate of group 0 to 1.0000e-02.
Epoch 3 completed
Loss: 0.0324370414018631. Accuracy: 97.7
--------------------
Epoch 4 completed
Loss: 0.022194599732756615. Accuracy: 98.02
--------------------
Epoch 5 completed
Loss: 0.007145566865801811. Accuracy: 98.03
--------------------
Epoch 6 completed
Loss: 0.01673538237810135. Accuracy: 98.05
--------------------
Epoch 7 completed
Loss: 0.025424446910619736. Accuracy: 98.01
--------------------
Epoch     7: reducing learning rate of group 0 to 1.0000e-03.
Epoch 8 completed
Loss: 0.014696130529046059. Accuracy: 98.05
--------------------
Epoch     8: reducing learning rate of group 0 to 1.0000e-04.
Epoch 9 completed
Loss: 0.00573748117312789. Accuracy: 98.04
--------------------
Epoch     9: reducing learning rate of group 0 to 1.0000e-05.


#### Reduce on Loss Plateau Decay, Patience=0, Factor=0.5

• Reduce learning rate whenever the tracked metric (here, validation accuracy) plateaus
• Patience: number of epochs with no improvement after which the learning rate will be reduced
• Patience = 0
• Factor: multiplier applied to the learning rate, $lr = lr \cdot factor$
• Factor = 0.5
• Optimization Algorithm: SGD Nesterov
• Modification of SGD Momentum
• $v_t = \gamma v_{t-1} + \eta \cdot \nabla J(\theta - \gamma v_{t-1}, x^{i: i+n}, y^{i:i+n})$
• $\theta = \theta - v_t$

Code for reduce on loss plateau learning rate decay with factor 0.5 and 0 patience

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

# Set seed
torch.manual_seed(0)

# Where to add a new import
from torch.optim.lr_scheduler import ReduceLROnPlateau

'''
STEP 1: LOADING DATASET
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
STEP 2: MAKING DATASET ITERABLE
'''

batch_size = 100
n_iters = 6000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

'''
STEP 3: CREATE MODEL CLASS
'''
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Non-linearity
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        out = self.fc2(out)
        return out
'''
STEP 4: INSTANTIATE MODEL CLASS
'''
input_dim = 28*28
hidden_dim = 100
output_dim = 10

model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

'''
STEP 5: INSTANTIATE LOSS CLASS
'''
criterion = nn.CrossEntropyLoss()

'''
STEP 6: INSTANTIATE OPTIMIZER CLASS
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)

'''
STEP 7: INSTANTIATE STEP LEARNING SCHEDULER CLASS
'''
# new_lr = lr * factor
# mode='max': track the maximum of the monitored quantity (here, validation accuracy)
# patience: number of epochs with no improvement to wait before decreasing the LR
# patience = 0: reduce the LR after the first epoch without improvement
# factor: decaying factor
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=0, verbose=True)

'''
STEP 8: TRAIN THE MODEL
'''
iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Load images with gradient accumulation abilities
        images = images.view(-1, 28*28).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1

        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch tensor
                images = images.view(-1, 28*28)

                # Forward pass only to get logits/output
                outputs = model(images)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                # Without .item(), it is a uint8 tensor which will not work when you pass this number to the scheduler
                correct += (predicted == labels).sum().item()

            accuracy = 100 * correct / total

            # Print Loss
            # print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

    # Decay Learning Rate, pass validation accuracy for tracking at every epoch
    print('Epoch {} completed'.format(epoch))
    print('Loss: {}. Accuracy: {}'.format(loss.item(), accuracy))
    print('-'*20)
    scheduler.step(accuracy)

Epoch 0 completed
Loss: 0.17087846994400024. Accuracy: 96.26
--------------------
Epoch 1 completed
Loss: 0.11688263714313507. Accuracy: 96.96
--------------------
Epoch 2 completed
Loss: 0.035437121987342834. Accuracy: 96.78
--------------------
Epoch     2: reducing learning rate of group 0 to 5.0000e-02.
Epoch 3 completed
Loss: 0.04893001914024353. Accuracy: 97.62
--------------------
Epoch 4 completed
Loss: 0.020584167912602425. Accuracy: 97.86
--------------------
Epoch 5 completed
Loss: 0.006022400688380003. Accuracy: 97.95
--------------------
Epoch 6 completed
Loss: 0.028374142944812775. Accuracy: 97.87
--------------------
Epoch     6: reducing learning rate of group 0 to 2.5000e-02.
Epoch 7 completed
Loss: 0.013204765506088734. Accuracy: 98.0
--------------------
Epoch 8 completed
Loss: 0.010137186385691166. Accuracy: 97.95
--------------------
Epoch     8: reducing learning rate of group 0 to 1.2500e-02.
Epoch 9 completed
Loss: 0.0035198689438402653. Accuracy: 98.01
--------------------


## Pointers on Reduce on Loss Plateau Decay

• In these examples, we used patience=0 because we are running only a few epochs
• With many more epochs (say 500), you should consider a larger patience such as 5
• You should experiment with 2 properties
• Patience
• Decay factor
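To see what a larger patience buys, compare against the same kind of made-up stalled accuracy sequence with patience=2 (a sketch; the dummy parameter and accuracy numbers are illustrative):

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Dummy parameter and optimizer, used only to drive the scheduler
param = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)

# patience=2: tolerate two epochs without improvement before decaying
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=2)

for acc in [96.0, 97.0, 96.8, 96.9, 96.7]:  # stalls after the second epoch
    scheduler.step(acc)
    print(acc, optimizer.param_groups[0]['lr'])
# With patience=0 the LR would be cut on every bad epoch;
# with patience=2 it is cut once, on the third epoch without improvement
```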

## Summary

We've learnt...


• Learning Rate Intuition
• Update parameters so model can churn output closer to labels
• Learning Rate Pointers
• If we set $\eta$ to be a large value $\rightarrow$ learn too much (rapid learning)
• If we set $\eta$ to be a small value $\rightarrow$ learn too little (slow learning)
• Learning Rate Schedules
• Step-wise Decay
• Reduce on Loss Plateau Decay
• Step-wise Decay
• Every 1 epoch
• Every 2 epoch
• Every 1 epoch, larger gamma
• Step-wise Decay Pointers
• Larger $\gamma$
• Larger interval of decay (increase epoch)
• Reduce on Loss Plateau Decay
• Patience=0, Factor=0.1
• Patience=0, Factor=0.5
• Pointers on Reduce on Loss Plateau Decay
• Larger patience with more epochs
• 2 hyperparameters to experiment
• Patience
• Decay factor

## Citation

If you have found these useful in your research, presentations, school work, projects or workshops, feel free to cite using this DOI.