Lecture 14: Introduction to PyTorch
[Figure: ML workflow with data engineering & preprocessing, training, model output, and accuracy evaluation]
[Figure: a single neuron. The input x passes through a linear function and then a nonlinear activation to produce the output y]
$y' = \sum_{i=1}^{d} w_i x_i + b, \qquad y = f(y')$
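As a minimal sketch, this single-neuron computation can be written directly in PyTorch; the dimension d, the input and weight values, and the choice of ReLU as the nonlinearity f are assumptions for illustration:

```python
import torch

# A single neuron: y' = sum_i w_i * x_i + b, then y = f(y').
# All values and the ReLU nonlinearity are illustrative assumptions.
d = 4
x = torch.randn(d)               # input vector x
w = torch.randn(d)               # weights w_1, ..., w_d
b = torch.tensor(0.5)            # bias b

y_prime = torch.inner(w, x) + b  # linear part
y = torch.relu(y_prime)          # nonlinear activation f
print(y)
```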
What is Deep Learning?
[Figure: a neural network with a single hidden layer of 3 neurons. The input x feeds the hidden neurons, whose outputs feed the output layer producing y]
What is Deep Learning?
[Figure: a neural network with 3 hidden layers whose widths are (3, 4, 2). The input x feeds the first hidden layer; the output layer produces y]
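As a sketch, a network with these hidden widths could be defined in PyTorch as follows; the input and output dimensions (set to 1 here) and the ReLU activation are assumptions, since the slide only fixes the hidden widths (3, 4, 2):

```python
import torch.nn as nn

# Hidden layers of widths 3, 4, and 2; input/output dims and ReLU are assumed.
model = nn.Sequential(
    nn.Linear(1, 3), nn.ReLU(),  # hidden layer 1 (width 3)
    nn.Linear(3, 4), nn.ReLU(),  # hidden layer 2 (width 4)
    nn.Linear(4, 2), nn.ReLU(),  # hidden layer 3 (width 2)
    nn.Linear(2, 1),             # output layer producing y
)
```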
What is Deep Learning?
Neural networks are a type of ML model:
• They use a cascade of multiple layers of nonlinear processing units (neurons); each successive layer uses the output of the previous layer as its input.
• They learn multiple levels of representation that correspond to different levels of abstraction; the levels form a hierarchy of concepts.
• They have a long history, perhaps dating back to 1943, but saw limited success until the 2000s.
$e_i$: error of fitting data point $i$
$$loss(w, b) = \frac{1}{N} \sum_{i} \big(y_i - (w x_i + b)\big)^2$$
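As a sketch, this loss can be evaluated directly in PyTorch; the data and parameter values below are made up for illustration:

```python
import torch

# Toy data (made up): roughly y = 2x + 1
x = torch.tensor([0.0, 1.0, 2.0, 3.0])
y = torch.tensor([1.0, 3.0, 5.0, 7.0])

w, b = torch.tensor(1.5), torch.tensor(0.0)

errors = y - (w * x + b)       # e_i for each data point i
loss = (errors ** 2).mean()    # (1/N) * sum_i e_i^2
print(loss)
```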
How does training work?
The training/fitting process finds the w,b with the smallest loss, but how?
Optimization: Gradients
Given a function $loss(\theta)$ that depends on a 2-dimensional parameter $\theta = [w, b]$, its gradient $\nabla loss(\theta)$ is the direction from $\theta$ that leads to the largest increase in $loss(\theta)$.
Optimization: Gradients
[Figure: contour plot of $loss(\theta)$ (example), with axes w and b. Think of it as a terrain: each contour connects parameter values with the same loss. The lowest loss is achieved at the origin; the farther away from the origin, the larger the loss]
Optimization: Gradients
To find a parameter with a lower loss, we should move from the current parameter $\theta$ along the negative gradient direction:
$$\theta \leftarrow \theta - \eta \, \nabla loss(\theta)$$
[Figure: the same contour plot of $loss(\theta)$, showing the current parameter $\theta$ and the negative gradient direction at the current iteration]
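A minimal sketch of this update rule using autograd; the toy data, the zero initialization, the learning rate $\eta$, and the number of iterations are assumptions:

```python
import torch

# Toy data (made up): exactly y = 2x + 1
x = torch.tensor([0.0, 1.0, 2.0, 3.0])
y = torch.tensor([1.0, 3.0, 5.0, 7.0])

theta = torch.zeros(2, requires_grad=True)   # theta = [w, b]
eta = 0.1                                    # learning rate (assumed)

for step in range(500):
    w, b = theta[0], theta[1]
    loss = ((y - (w * x + b)) ** 2).mean()   # forward pass: loss(theta)
    loss.backward()                          # compute grad loss(theta)
    with torch.no_grad():
        theta -= eta * theta.grad            # theta <- theta - eta * grad
    theta.grad.zero_()                       # reset gradient for the next step

print(theta)   # converges to approximately w = 2, b = 1 on this toy data
```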
$e_i$: error of fitting data point $i$
$$loss(w, b) = \frac{1}{N} \sum_{i} \big(y_i - (w x_i + b)\big)^2$$
Side note:
• In PyTorch, the tensor is the most basic building block; nn.Parameter is a special kind of Tensor used to represent model parameters.
• In our code, both parameters are initialized with torch.zeros, i.e. as all-zero tensors.
• Our __init__ function takes d as input, which is the input dimension (we will set d = 1).
PyTorch: Linear Regression Model
In the forward function, we define how the output is computed from the input. torch.inner computes an inner product, so this line of code simply computes $w_1 x_1 + \cdots + w_d x_d + b$.
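The original code listing is not reproduced here, so the following is a hedged reconstruction of the model class described in the side note and the text above (the class name is an assumption):

```python
import torch
import torch.nn as nn

class LinearRegression(nn.Module):
    def __init__(self, d):
        super().__init__()
        # d is the input dimension (we will set d = 1).
        # Both parameters are nn.Parameter tensors initialized with torch.zeros.
        self.w = nn.Parameter(torch.zeros(d))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # torch.inner computes the inner product w_1*x_1 + ... + w_d*x_d;
        # adding b gives the model output.
        return torch.inner(self.w, x) + self.b
```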
Gradient in PyTorch
The core of PyTorch (and TensorFlow) is their automatic differentiation (autograd)
1. Define a linear regression model
2. Generate some training data
3. Calculate gradient and conduct gradient descent
Gradient in PyTorch
Recall: we want to calculate the gradient of this loss function
$$loss(w, b) = \frac{1}{N} \sum_{i} \big(y_i - (w x_i + b)\big)^2$$
Steps in PyTorch:
• Step 1: Forward pass: calculate the loss function value.
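A sketch of this step, reusing the LinearRegression class sketched earlier; the training data is made up for illustration (y ≈ 2x + 1 plus noise), since the original data-generation code is not shown here:

```python
import torch

# Generate some toy training data (made up): y = 2x + 1 plus noise, with d = 1
x = torch.rand(100, 1)
y = 2 * x.squeeze() + 1 + 0.1 * torch.randn(100)

# LinearRegression is the model class sketched above
model = LinearRegression(d=1)

# Step 1: forward pass, calculate the loss function value
pred = model(x)                     # w*x_i + b for each data point
loss = ((y - pred) ** 2).mean()
print(loss)
```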
Gradient in PyTorch
Recall: we want to calculate the gradient of this loss function
$$loss(w, b) = \frac{1}{N} \sum_{i} \big(y_i - (w x_i + b)\big)^2$$
Steps in PyTorch:
• Step 2: Backward pass.
Before calling backward, let's first check the current gradient values.
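Continuing the sketch above:

```python
# Before calling backward, no gradients have been computed yet,
# so the .grad attribute of each parameter is still None.
print(model.w.grad, model.b.grad)   # None None
```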
Gradient in PyTorch
Recall: we want to calculate the gradient of this loss function
$$loss(w, b) = \frac{1}{N} \sum_{i} \big(y_i - (w x_i + b)\big)^2$$
Steps in PyTorch:
• Step 2: Backward pass.
Let's now do the backward pass and check the gradients again.
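Continuing the sketch:

```python
# Step 2: backward pass. autograd traverses the computation graph of `loss`
# and fills in the gradient of the loss w.r.t. each parameter.
loss.backward()
print(model.w.grad, model.b.grad)   # now both .grad attributes hold tensors
```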
Up next: gradient descent, i.e. iteratively compute the gradient and update the parameters!
Tell the optimizer which parameters to optimize!
How do we choose the learning rate and maxIter? Let's now visualize the gradient descent process!
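A hedged sketch of the full loop, reusing the model and data from the sketches above; torch.optim.SGD is one standard way to apply the update $\theta \leftarrow \theta - \eta \, \nabla loss(\theta)$, the learning rate 0.001 is the first trial from the slides below, and the maxIter value is an assumption:

```python
import torch

# Fresh model; x and y are the toy data from the earlier sketch.
model = LinearRegression(d=1)

# Tell the optimizer which parameters to optimize, and set the learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
maxIter = 1000   # number of gradient descent iterations (an assumed value)

for it in range(maxIter):
    optimizer.zero_grad()              # clear gradients from the previous step
    pred = model(x)                    # forward pass
    loss = ((y - pred) ** 2).mean()    # loss function value
    loss.backward()                    # backward pass: compute gradients
    optimizer.step()                   # gradient descent update

print(model.w, model.b)
```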
Visualizing Gradient Descent
Learning rate = 0.001
Visualizing Gradient Descent
Learning rate = 0.0005 (smaller than our first trial)
Visualizing Gradient Descent
Learning rate = 0.005 (larger than our first trial)
Visualizing Gradient Descent
Learning rate = 0.025 (much larger than our first trial)
Visualizing Gradient Descent
Learning rate = 0.028 (much larger than our first trial)
Visualizing Gradient Descent
Learning rate = 0.03 (much larger than our first trial)
Lessons Learned on Learning Rate
• Learning rate too small:
  • Converges too slowly and takes many iterations
• Learning rate too large:
  • Exhibits unstable (oscillating) behavior and may diverge