
Introduction to PyTorch

Lecture 14 for 14-763/18-763


Guannan Qu

Oct 23/24, 2023


Recall: the ML process
Phase II: ML Modeling
• Identify the proper ML model
• Data engineering & preprocessing
• Train, evaluate, and tune parameters
  • Try different models; hyper-parameter tuning via cross validation
  • Evaluation metrics: accuracy, confusion matrix, ROC/AUC
• Obtain the final tuned model
Traditional ML vs Deep Learning
Traditional ML suffers from several issues:
• Not good at handling high-dimensional data (e.g., images and text).
  • For a 32×32 image, the number of input features is 1024.
• Needs feature extraction (like a Fourier transform), which is difficult.

Deep Learning is capable of:
• Handling high-dimensional data.
• No need for manual feature extraction.
  • Feature extraction is done automatically in deep learning.

What is Deep Learning?

Recall linear regression: the input $x$ is fed to a linear model, which produces the output $y = ax + b$.

Deep learning replaces the linear model with "layers" of linear models with non-linear activations!
What is Deep Learning?
A.k.a. activation function. A popular choice is the ReLU: $f(y') = \max(y', 0)$.

The input $x$ first goes through a linear function, $y' = \sum_{i=1}^{d} w_i x_i + b$, then through the nonlinearity to produce the output $y = f(y')$.

In deep learning, the input is typically high-dimensional: $x = [x_1, \dots, x_d]$.

This is a neural network with 1 layer of width 1. Next: increase the width.
What is Deep Learning?
Each neuron $j$ in the hidden layer applies a linear function followed by a nonlinearity:

$z^j = \sum_{i=1}^{d} w_i^j x_i + b^j, \qquad y^j = f(z^j)$

The output layer then linearly combines the hidden neurons' outputs:

$y = \sum_j w_j^{o} y^j + b^{o}$

Together, these neurons form a hidden layer.
What is Deep Learning?
[Figure: the input $x$ feeds a hidden layer of three neurons, whose outputs feed the output layer, producing $y$.]
What is Deep Learning?
[Figure: a neural network with 3 hidden layers of widths (3, 4, 2): the input $x$ passes through successive layers of neurons to the output layer, producing $y$.]
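As a concrete illustration (not from the slides), a network with these widths can be sketched in PyTorch with nn.Sequential; the input dimension d = 5 here is an illustrative assumption:

    import torch
    import torch.nn as nn

    d = 5  # illustrative input dimension (assumption)
    model = nn.Sequential(
        nn.Linear(d, 3), nn.ReLU(),  # hidden layer 1, width 3
        nn.Linear(3, 4), nn.ReLU(),  # hidden layer 2, width 4
        nn.Linear(4, 2), nn.ReLU(),  # hidden layer 3, width 2
        nn.Linear(2, 1),             # output layer, scalar output y
    )
    y = model(torch.randn(d))        # forward pass on one random input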
What is Deep Learning?
Neural Networks are a type of ML model:
• Use a cascade of multiple layers of nonlinear processing units (neurons). Each successive
layer uses the output from the previous layer as input.
• Learns multiple levels of representations that correspond to different levels of
abstraction; the levels form a hierarchy of concepts.
• Has a long history, perhaps first dating back to 1943, but saw limited success until the 2000s

Deep learning uses Deep Neural Networks as the ML model:


• “Deep” means that the number of layers is large
• Became extremely successful in the 2010s in various domains (image classification, NLP…)
• Comes in various architectures: Multi-Layer Perceptron (MLP), Convolutional NN (CNN), Recurrent NN (RNN), Transformer…
Deep learning requires new ML platform
SparkML (based on the Transformer/Estimator abstraction) is not adequate for deep learning:
• Deep neural networks have a highly flexible structure
  • # of layers, # of neurons per layer, choice of activation function
  • CNNs, RNNs, and ResNets have even more complicated structures
• Training a neural network requires a lot of tuning, and therefore requires access to low-level details
• Training a neural network is data-intensive and computationally heavy

We need specialized ML platform for deep learning!


What is TensorFlow?
• History: Developed by the Google Brain team to accelerate machine learning and deep neural network research. It was first made public in late 2015, and the first stable version appeared in 2017. It is open source under the Apache license.
• Built to run on multiple CPUs or GPUs, and even on mobile operating systems.
• Supports multiple languages such as Python, C/C++, and Java.
• End-to-end, free, and open source
• One of the most popular program frameworks for building deep
learning applications.
TensorFlow vs PyTorch
• PyTorch was released in 2017 by Facebook AI (now Meta) and soon became popular
• Known for its simplicity, ease of use, flexibility
• Uses dynamic computation graph
• In contrast, TensorFlow at the time
• Not user friendly, steeper learning curve, not well organized
• Used static computation graph
• But TensorFlow still had advantages, e.g. in deployment, visualization
• The comparison became more complicated when TensorFlow 2 was released in 2019
• TensorFlow 2 became much more user friendly and the APIs were cleaned up
TensorFlow vs PyTorch
• PyTorch and TensorFlow are far and away the two most popular deep learning frameworks today. The debate over which framework is superior is longstanding, with each camp having its share of fervent supporters.
• TensorFlow has a reputation for being an industry-focused framework, while PyTorch has a reputation for being a research-focused framework.
Introduction to PyTorch
We still use linear regression as a warm-up, but this time we go into the low-level details.

Understanding ML: loss function and optimization


Loss Function
Linear Model: $y = wx + b$
Model Parameters: $w, b$

Let $e_i = y_i - (w x_i + b)$ be the error of fitting data point $i$. Then

$loss(w, b) = \frac{1}{N} \sum_i \big(y_i - (w x_i + b)\big)^2$

The training/fitting process finds the $w, b$ with the smallest loss. But how?
How does training work?

The training/fitting process finds the $w, b$ with the smallest loss. But how?
Optimization: Gradients
Given a function $loss(\theta)$ that depends on a two-dimensional parameter $\theta = [w, b]$, its gradient $\nabla loss(\theta)$ is the direction from $\theta$ that leads to the largest increase in $loss(\theta)$.
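Concretely, for $\theta = [w, b]$ the gradient collects the partial derivatives:

$\nabla loss(\theta) = \left[\dfrac{\partial\, loss}{\partial w},\ \dfrac{\partial\, loss}{\partial b}\right]$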
Optimization: Gradients
[Contour plot of $loss(\theta)$ (example), with $b$ on the horizontal axis and $w$ on the vertical axis. Think of it as a terrain altitude map: each circle represents the set of $\theta$ with the same loss value. The lowest loss is achieved at the origin; the farther from the origin, the larger the loss.]
Optimization: Gradients
To find a parameter with a lower loss, one should follow the negative gradient direction.

[Contour plot of $loss(\theta)$ (example), showing the current parameter $\theta$, its gradient direction (toward higher loss), and the negative gradient direction (toward the lowest loss at the origin).]
Optimization: Gradients
Gradient Descent: keep following the negative gradient direction!

[Contour plot of $loss(\theta)$ (example), showing $\theta$ at the current iteration moving along the negative gradient to $\theta$ at the next iteration, toward the lowest loss at the origin.]
Optimization: Gradients
Gradient Descent:
• Initialize $\theta$
• Repeat for maxIter steps: $\theta \leftarrow \theta - \eta \nabla loss(\theta)$

$\eta$ is the learning rate, i.e., how large a step one makes in each iteration.

[Contour plot of $loss(\theta)$ (example), as above.]
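A minimal sketch of this update rule in Python, using the illustrative one-dimensional $loss(\theta) = \theta^2$ (so $\nabla loss(\theta) = 2\theta$), not the regression loss:

    import torch

    theta = torch.tensor(1.0)   # initialize theta
    eta = 0.1                   # learning rate
    max_iter = 5
    for t in range(max_iter):
        grad = 2 * theta        # gradient of loss(theta) = theta**2
        theta = theta - eta * grad
        print(t, theta.item()) # 0.8, 0.64, 0.512, ...

Each step moves theta closer to 0, where the loss is smallest.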
Optimization: Gradients
Linear Model: $y = wx + b$
Model Parameters: $w, b$

$loss(w, b) = \frac{1}{N} \sum_i \big(y_i - (w x_i + b)\big)^2, \qquad e_i = y_i - (w x_i + b)$

How do we calculate the gradient of this loss function?
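For this simple model, the gradient can also be derived by hand via the chain rule, which is a useful sanity check for the automatic approach below:

$\dfrac{\partial\, loss}{\partial w} = -\dfrac{2}{N}\sum_i x_i\big(y_i - (w x_i + b)\big), \qquad \dfrac{\partial\, loss}{\partial b} = -\dfrac{2}{N}\sum_i \big(y_i - (w x_i + b)\big)$

For a deep network, doing this by hand quickly becomes impractical, which is exactly what autograd solves.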


Gradient in PyTorch
The core of PyTorch (and TensorFlow) is automatic differentiation (autograd). We proceed in three steps:
1. Define a linear regression model
2. Generate some training data
3. Calculate gradient and conduct gradient descent
PyTorch: Linear Regression Model
Subclassing nn.Module
PyTorch: Linear Regression Model
In the __init__ function, define all the parameters of the model as nn.Parameter and give them initial values.

Side note:
• In PyTorch, the tensor is the most basic building block. nn.Parameter is a special kind of tensor used to represent model parameters.
• In our code, both parameters are initialized with torch.zeros, i.e., as all-zero tensors.
• Our __init__ function takes d as input, which is the input dimension (we will set d = 1).
PyTorch: Linear Regression Model

In the forward function, define how the output is computed from the input. torch.inner computes the inner product, so this line of code simply computes $w_1 x_1 + \cdots + w_d x_d + b$.
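The slides show the model as code screenshots; the following is a minimal sketch consistent with the description above (the attribute names w and b are assumptions):

    import torch
    import torch.nn as nn

    class LinearRegression(nn.Module):
        def __init__(self, d):
            super().__init__()
            # Model parameters, initialized as all-zero tensors
            self.w = nn.Parameter(torch.zeros(d))
            self.b = nn.Parameter(torch.zeros(1))

        def forward(self, x):
            # Inner product plus bias: w1*x1 + ... + wd*xd + b
            return torch.inner(self.w, x) + self.b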
Gradient in PyTorch
The core of PyTorch (and TensorFlow) is automatic differentiation (autograd). Recall the three steps:
1. Define a linear regression model
2. Generate some training data
3. Calculate gradient and conduct gradient descent
Gradient in PyTorch
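The data-generation code (step 2) appears only as a screenshot; a minimal sketch for d = 1 follows (the true parameters 2.0 and 1.0 and the noise level 0.1 are illustrative assumptions):

    torch.manual_seed(0)
    N = 100
    x = torch.rand(N, 1)                                 # N inputs of dimension d = 1
    y = 2.0 * x.squeeze() + 1.0 + 0.1 * torch.randn(N)   # y = w*x + b + noise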
Gradient in PyTorch
Recall: we want to calculate the gradient of the loss function

$loss(w, b) = \frac{1}{N} \sum_i \big(y_i - (w x_i + b)\big)^2$

Steps in PyTorch:
• Step 1: Forward pass: calculate the loss function value.
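A sketch of the forward pass, continuing the code above:

    model = LinearRegression(d=1)
    y_pred = model(x)                   # forward pass (torch.inner broadcasts over the batch)
    loss = ((y - y_pred) ** 2).mean()   # the loss function value (mean squared error)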
Gradient in PyTorch
Steps in PyTorch:
• Step 2: Backward pass.
Before doing the backward pass, let's first check the current gradient values.
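Continuing the sketch, the .grad fields are empty before any backward call:

    print(model.w.grad, model.b.grad)   # None None, since backward has not been called yet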
Gradient in PyTorch
Steps in PyTorch:
• Step 2 (continued): Backward pass.
Let's now do the backward pass and check the gradients again.
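Continuing the sketch:

    loss.backward()                     # backward pass: autograd populates .grad
    print(model.w.grad, model.b.grad)   # tensors holding d(loss)/dw and d(loss)/db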

Up next: the full gradient descent loop, i.e., iteratively compute the gradient and update the parameters!
Putting it together in a training loop (see the sketch below):
• Tell the optimizer which parameters to optimize
• Set the learning rate
• Forward pass
• Backward pass to compute the gradients
  • Note: it is VERY IMPORTANT to run optimizer.zero_grad() to reset the gradients to zero before each backward pass! Otherwise, gradients accumulate across iterations and the updates will be incorrect.
• Run a gradient descent step on the parameters using the computed gradients and the learning rate
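A minimal sketch of this training loop, continuing the earlier code (the learning rate 0.001 and maxIter = 1000 are illustrative assumptions):

    import torch.optim as optim

    optimizer = optim.SGD(model.parameters(), lr=0.001)  # parameters to optimize + learning rate
    max_iter = 1000
    for t in range(max_iter):
        optimizer.zero_grad()             # reset gradients to zero (essential!)
        y_pred = model(x)                 # forward pass
        loss = ((y - y_pred) ** 2).mean() # compute loss
        loss.backward()                   # backward pass: compute gradients
        optimizer.step()                  # gradient descent step: theta <- theta - eta * grad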
Summary So Far
• Creating a Module: subclassing nn.Module
• Define parameters, define forward function
• Calculating gradient
• Forward and backward pass
• Perform training (gradient descent)
• Create the optimizer, specify the parameters to optimize, and set the learning rate
• Write training loop that does gradient descent
• Forward and compute loss
• Zero-grad
• Backward
• Step

How do we choose the learning rate and maxIter? Let's visualize the gradient descent process!
Visualizing Gradient Descent
[Plots of the gradient descent trajectory for various learning rates:]
• Learning rate = 0.001 (our first trial)
• Learning rate = 0.0005 (smaller than our first trial)
• Learning rate = 0.005 (larger than our first trial)
• Learning rate = 0.025 (much larger than our first trial)
• Learning rate = 0.028 (much larger than our first trial)
• Learning rate = 0.03 (much larger than our first trial)
Lessons Learned on Learning Rate
• Learning rate too small:
  • Converges too slowly and takes many iterations
• Learning rate too large:
  • Exhibits unstable (oscillating) behavior and may diverge

• How to find a good learning rate (see the sketch after this list):
  • Find a small enough learning rate that does not diverge
  • Increase the learning rate and plot the training loss curve
    • If the loss curve appears to be converging and "stable", you can increase it further
    • If the loss curve appears unstable and shows signs of divergence, decrease the learning rate
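A hedged sketch of this procedure, reusing the earlier model and data (the candidate learning rates are illustrative):

    for lr in [0.0005, 0.001, 0.005, 0.025]:
        model = LinearRegression(d=1)
        optimizer = optim.SGD(model.parameters(), lr=lr)
        losses = []
        for t in range(200):
            optimizer.zero_grad()
            loss = ((y - model(x)) ** 2).mean()
            loss.backward()
            optimizer.step()
            losses.append(loss.item())  # record the loss curve for this lr
        print(lr, losses[-1])           # a too-large lr shows a growing or oscillating curve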
