Version 1.1.1
Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide
Volume I—Fundamentals
• 2021-05-18: v1.0
• 2021-12-15: v1.1
• 2022-02-12: v1.1.1
Although the author has used his best efforts to ensure that the information and
instructions contained in this book are accurate, under no circumstances shall the
author be liable for any loss, damage, liability, or expense incurred or suffered as a
consequence, directly or indirectly, of the use and/or application of any of the
contents of this book. Any action you take upon the information in this book is
strictly at your own risk. If any code samples or other technology this book contains
or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights. The author does not have any control over and does not
assume any responsibility for third-party websites or their content. All trademarks
are the property of their respective owners. Screenshots are used for illustrative
purposes only.
Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
About the Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Why PyTorch? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Why This Book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
What Do I Need to Know? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
How to Read This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
What’s Next?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Setup Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Official Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Google Colab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Binder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Local Installation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1. Anaconda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2. Conda (Virtual) Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3. PyTorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4. TensorBoard. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5. GraphViz and Torchviz (optional) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
6. Git. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
7. Jupyter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Moving On. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Chapter 0: Visualizing Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Spoilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Imports. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Visualizing Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Data Generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Synthetic Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Train-Validation-Test Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Step 0 - Random Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Step 1 - Compute Model’s Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Step 2 - Compute the Loss. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Loss Surface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Cross-Sections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Step 3 - Compute the Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Visualizing Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Backpropagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Step 4 - Update the Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Low Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
High Learning Rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Very High Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
"Bad" Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Scaling / Standardizing / Normalizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Step 5 - Rinse and Repeat!. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
The Path of Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Chapter 1: A Simple Regression Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Spoilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Imports. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
A Simple Regression Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Data Generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Synthetic Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Gradient Descent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Step 0 - Random Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Step 1 - Compute Model’s Predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Step 2 - Compute the Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Step 3 - Compute the Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Step 4 - Update the Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Step 5 - Rinse and Repeat! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Linear Regression in Numpy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
PyTorch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Loading Data, Devices, and CUDA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Creating Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Autograd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
backward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
grad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
zero_ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Updating Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
no_grad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Dynamic Computation Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Optimizer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
step / zero_grad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
state_dict . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Forward Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
train . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Nested Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Sequential Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Model Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Recap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Chapter 2: Rethinking the Training Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Spoilers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Jupyter Notebook. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Imports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Rethinking the Training Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Training Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
TensorDataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
DataLoader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Mini-Batch Inner Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Random Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Plotting Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
TensorBoard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Running It Inside a Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Running It Separately (Local Installation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Running It Separately (Binder) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
SummaryWriter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
add_graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
add_scalars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Saving and Loading Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Model State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Saving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Resuming Training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Deploying / Making Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Setting the Model’s Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Recap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Chapter 2.1: Going Classy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Spoilers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Jupyter Notebook. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Imports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Going Classy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
The Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
The Constructor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Placeholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Training Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Saving and Loading Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Visualization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
The Full Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Classy Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Making Predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Resuming Training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Recap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Chapter 3: A Simple Classification Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Spoilers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Jupyter Notebook. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Imports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
A Simple Classification Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Data Preparation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Logits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Odds Ratio. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Log Odds Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
From Logits to Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Sigmoid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
BCELoss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
BCEWithLogitsLoss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Imbalanced Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Model Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Decision Boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Classification Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Confusion Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
True and False Positive Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Precision and Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
Trade-offs and Curves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Low Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
High Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
ROC and PR Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
The Precision Quirk. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Best and Worst Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Comparing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Recap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
Thank You! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
Preface
If you’re reading this, I probably don’t need to tell you that deep learning is amazing
and PyTorch is cool, right?
But I will tell you, briefly, how this series of books came to be. In 2016, I started
teaching a class on machine learning with Apache Spark and, a couple of years later,
another class on the fundamentals of machine learning.
At some point, I tried to find a blog post that would visually explain, in a clear and
concise manner, the concepts behind binary cross-entropy so that I could show it
to my students. Since I could not find any that fit my purpose, I decided to write one
myself. Although I thought of it as a fairly basic topic, it turned out to be my most
popular blog post[1]! My readers have welcomed the simple, straightforward, and
conversational way I explained the topic.
Then, in 2019, I used the same approach for writing another blog post:
"Understanding PyTorch with an example: a step-by-step tutorial."[2] Once again, I
was amazed by the reaction from the readers!
It was their positive feedback that motivated me to write this series of books to
help beginners start their journey into deep learning and PyTorch.
In this first volume, I cover the basics of gradient descent, the fundamentals of
PyTorch, training linear and logistic regressions, evaluation metrics, and more. If
you have absolutely no experience with PyTorch, this is your starting point.
The second volume is mostly focused on computer vision: deeper models and
activation functions, convolutional neural networks, initialization schemes,
schedulers, and transfer learning. If your goal is to learn about deep learning
models for computer vision, and you’re already comfortable training simple models
in PyTorch, the second volume is the right one for you.
Then, the third volume focuses on all things sequence: recurrent neural networks
and their variations, sequence-to-sequence models, attention, self-attention, and
the Transformer architecture. The very last chapter of the third volume is a crash
course on natural language processing: from the basics of word tokenization all the
way up to fine-tuning large models (BERT and GPT-2) using the HuggingFace
library. This volume is more demanding than the other two, and you’re going to
enjoy it more if you already have a solid understanding of deep learning models.
These books are meant to be read in order, and, although they can be read
independently, I strongly recommend you read them as the one, long book I
originally wrote :-)
I hope you enjoy reading this series as much as I enjoyed writing it.
[1] https://bit.ly/2UW5iTg
[2] https://bit.ly/2TpzwxR
Acknowledgements
First and foremost, I’d like to thank YOU, my reader, for making this book possible.
If it weren’t for the amazing feedback I got from the thousands of readers of my
blog post about PyTorch, I would have never mustered the strength to start and
finish such a major undertaking as writing a 1,000-page book series!
I’d like to thank my good friends Jesús Martínez-Blanco (who managed to read
absolutely everything that I wrote), Jakub Cieslik, Hannah Berscheid, Mihail Vieru,
Ramona Theresa Steck, Mehdi Belayet Lincon, and António Góis for helping me out
and dedicating a good chunk of their time to reading, proofing, and suggesting
improvements to my drafts. I’m forever grateful for your support! I’d also like to
thank my friend José Luis Lopez Pino for the initial push I needed to actually start
writing this book.
Many thanks to my friends José Quesada and David Anderson for taking me as a
student at the Data Science Retreat in 2015 and, later on, for inviting me to be a
teacher there. That was the starting point of my career both as a data scientist and
as a teacher.
I’d also like to thank the PyTorch developers for developing such an amazing
framework, and the teams from Leanpub and Towards Data Science for making it
incredibly easy for content creators like me to share their work with the
community.
Finally, I’d like to thank my wife, Jerusa, for always being supportive throughout
the entire writing of this series of books, and for taking the time to read every single
page in it :-)
About the Author
Daniel is a data scientist, developer, writer, and teacher. He has been teaching
machine learning and distributed computing technologies at Data Science Retreat,
the longest-running Berlin-based bootcamp, since 2016, helping more than 150
students advance their careers.
Daniel is also the main contributor of two Python packages: HandySpark and
DeepReplay.
Why PyTorch?
Second, maybe there are even some unexpected benefits to your health—check
Andrej Karpathy’s tweet[3] about it!
• Tesla: Watch Andrej Karpathy (AI director at Tesla) speak about "how Tesla is
using PyTorch to develop full self-driving capabilities for its vehicles" in this video.[8]
• fastai: fastai[10] is a library built on top of PyTorch to simplify model training and
is used in its "Practical Deep Learning for Coders"[11] course. The fastai library is
deeply connected to PyTorch and "you can’t become really proficient at using
fastai if you don’t know PyTorch well, too."[12]
• Airbnb: PyTorch sits at the core of the company’s dialog assistant for customer
service.[15]
This series of books aims to get you started with PyTorch while giving you a solid
understanding of how it works.
Why This Book?
If you’re looking for a book where you can learn about deep learning and PyTorch
without having to spend hours deciphering cryptic text and code, and one that’s
easy and enjoyable to read, this is it :-)
First, this is not a typical book: most tutorials start with some nice and pretty image
classification problem to illustrate how to use PyTorch. It may seem cool, but I
believe it distracts you from the main goal: learning how PyTorch works. In this
book, I present a structured, incremental, and from-first-principles approach to
learning PyTorch.
Second, this is not a formal book in any way: I am writing this book as if I were
having a conversation with you, the reader. I will ask you questions (and give you
answers shortly afterward), and I will also make (silly) jokes.
My job here is to make you understand the topic, so I will avoid fancy
mathematical notation as much as possible and spell it out in plain English.
In this first book of the Deep Learning with PyTorch Step-by-Step series, I will guide
you through the development of many models in PyTorch, showing you why
PyTorch makes it much easier and more intuitive to build models in Python:
autograd, dynamic computation graph, model classes, and much, much more.
We will build, step-by-step, not only the models themselves but also your
understanding as I show you both the reasoning behind the code and how to avoid
some common pitfalls and errors along the way.
There is yet another advantage of focusing on the basics: this book is likely to have
a longer shelf life. It is fairly common for technical books, especially those focusing
on cutting-edge technology, to become outdated quickly. Hopefully, this is not
going to be the case here, since the underlying mechanics are not changing and
neither are the concepts. It is expected that some syntax changes over time, but I
do not see backward compatibility-breaking changes coming anytime soon.
The best example is gradient descent, which most people are familiar with at some
level. Maybe you know its general idea, perhaps you’ve seen it in Andrew Ng’s
Machine Learning course, or maybe you’ve even computed some partial
derivatives yourself!
Maybe you already know some of these concepts well: If this is the case, you can
simply skip them, since I’ve made these explanations as independent as possible
from the rest of the content.
But I want to make sure everyone is on the same page, so, if you have just heard
about a given concept or if you are unsure if you have entirely understood it, these
explanations are for you.
That being said, this is what I expect from you, the reader:
• to be able to work with the PyData stack (numpy, matplotlib, and pandas) and
Jupyter notebooks
• to be familiar with basic machine learning concepts, like:
◦ training-validation-test split
◦ underfitting and overfitting (bias-variance trade-off)
Even so, I am still briefly touching on some of these topics, but I need to draw a line
somewhere; otherwise, this book would be gigantic!
This book is visually different from other books: As I’ve mentioned already in the
"Why This Book?" section, I really like to make use of visual cues. Although this is
not, strictly speaking, a convention, this is how you can interpret those cues:
• Every code cell is followed by another cell showing the corresponding outputs
(if any)
• All code presented in the book is available at its official repository on GitHub:
https://github.com/dvgodoy/PyTorchStepByStep
If there is any output to the code cell, titled or not, there will be another code cell
depicting the corresponding output so you can check if you successfully
reproduced it or not.
Some code cells do not have titles—running them does not affect the workflow.
WARNING
Potential problems or things to look out for.
INFORMATION
Important information to pay attention to.
IMPORTANT
Really important information to pay attention to.
TECHNICAL
Technical aspects of a concept or topic.
DISCUSSION
Really brief discussion on a concept or topic.
LATER
Important topics that will be covered in more detail later.
SILLY
Jokes, puns, memes, quotes from movies.
What’s Next?
It’s time to set up an environment for your learning journey using the Setup Guide.
[3] https://bit.ly/2MQoYRo
[4] https://bit.ly/37uZgLB
[5] https://pytorch.org/ecosystem/
[6] https://bit.ly/2MTN0Lh
[7] https://bit.ly/2UFHFve
[8] https://bit.ly/2XXJkyo
[9] https://openai.com/blog/openai-pytorch/
Setup Guide
Official Repository
This book’s official repository is available on GitHub:
https://github.com/dvgodoy/PyTorchStepByStep
It contains one Jupyter notebook for every chapter in this book. Each notebook
contains all the code shown in its corresponding chapter, and you should be able to
run its cells in sequence to get the same outputs, as shown in the book. I strongly
believe that being able to reproduce the results brings confidence to the reader.
Environment
There are three options for you to run the Jupyter notebooks:

• Google Colab
• Binder
• Local installation

Let’s briefly explore the pros and cons of each of these options.
Google Colab
Google Colab "allows you to write and execute Python in your browser, with zero
configuration required, free access to GPUs and easy sharing."[18]
You can easily load notebooks directly from GitHub using Colab’s special URL
(https://colab.research.google.com/github/). Just type in the GitHub’s user or
organization (like mine, dvgodoy), and it will show you a list of all its public
repositories (like this book’s, PyTorchStepByStep).
After choosing a repository, it will list the available notebooks and corresponding
links to open them in a new browser tab.
You also get access to a GPU, which is very useful to train deep learning models
faster. More important, if you make changes to the notebook, Google Colab will
keep them. The whole setup is very convenient; the only cons I can think of are:
Binder
Binder "allows you to create custom computing environments that can be shared and
used by many remote users."[19]
You can also load notebooks directly from GitHub, but the process is slightly
different. Binder will create something like a virtual machine (technically, it is a
container, but let’s leave it at that), clone the repository, and start Jupyter. This
allows you to have access to Jupyter’s home page in your browser, just like you
would if you were running it locally, but everything is running in a JupyterHub
server on their end.
Just go to Binder’s site (https://mybinder.org/) and type in the URL to the GitHub
repository you want to explore (for instance,
https://github.com/dvgodoy/PyTorchStepByStep) and click on Launch. It will take
a couple of minutes to build the image and open Jupyter’s home page.
You can also launch Binder for this book’s repository directly using the following
link: https://mybinder.org/v2/gh/dvgodoy/PyTorchStepByStep/master.
Binder is very convenient since it does not require a prior setup of any kind. Any
Python packages needed to successfully run the environment are likely installed
during launch (if provided by the author of the repository).
On the other hand, it may take time to start, and it does not keep your changes
after your session expires (so, make sure you download any notebooks you
modify).
Local Installation
This option will give you more flexibility, but it will require more effort to set up. I
encourage you to try setting up your own environment. It may seem daunting at
first, but you can surely accomplish it by following seven easy steps:
Checklist
☐ 1. Install Anaconda.
☐ 2. Create and activate a virtual environment.
☐ 3. Install PyTorch package.
☐ 4. Install TensorBoard package.
☐ 5. Install GraphViz software and TorchViz package (optional).
☐ 6. Install git and clone the repository.
☐ 7. Start Jupyter notebook.
1. Anaconda
If you don’t have Anaconda’s Individual Edition[20] installed yet, this would be a
good time to do it. It is a convenient way to start since it contains most of the
Python libraries a data scientist will ever need to develop and train models.
• Windows (https://docs.anaconda.com/anaconda/install/windows/)
• macOS (https://docs.anaconda.com/anaconda/install/mac-os/)
• Linux (https://docs.anaconda.com/anaconda/install/linux/)
Make sure you choose Python 3.X version since Python 2 was
discontinued in January 2020.
"What is an environment?"
It is pretty much a replication of Python itself and some (or all) of its libraries, so,
effectively, you’ll end up with multiple Python installations on your computer.
"Why can’t I just use one single Python installation for everything?"
It is beyond the scope of this guide to debate these issues, but take my word for it
(or Google it!)—you’ll benefit a great deal if you pick up the habit of creating a
different environment for every project you start working on.
First, you need to choose a name for your environment :-) Let’s call ours
pytorchbook (or anything else you find easy to remember). Then, you need to open
a terminal (in Ubuntu) or Anaconda Prompt (in Windows or macOS) and type the
following command:
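# a sketch, assuming you stick with the name "pytorchbook";
# the trailing "anaconda" installs Anaconda's default packages into it
conda create -n pytorchbook anaconda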
Did it finish creating the environment? Good! It is time to activate it, meaning,
making that Python installation the one to be used now. In the same terminal (or
Anaconda prompt), just type:
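# activates the environment we have just created
conda activate pytorchbook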
Your prompt should look like this (if you’re using Linux):

(pytorchbook)$

Or like this (if you’re using Windows):

(pytorchbook)C:\>
Done! You are using a brand new Conda environment now. You’ll need to activate
it every time you open a new terminal, or, if you’re a Windows or macOS user, you
can open the corresponding Anaconda prompt (it will show up as Anaconda
Prompt (pytorchbook), in our case), which will have it activated from the start.
3. PyTorch
PyTorch is the coolest deep learning framework, just in case you skipped the
introduction.
It is "an open source machine learning framework that accelerates the path from
research prototyping to production deployment."[22] Sounds good, right? Well, I
probably don’t have to convince you at this point :-)
It is time to install the star of the show :-) We can go straight to the Start Locally
(https://pytorch.org/get-started/locally/) section of PyTorch’s website; it will
automatically select the options that best suit your local environment and show
you the command to run.
Using GPU / CUDA
CUDA "is a parallel computing platform and programming model developed by NVIDIA
for general computing on graphical processing units (GPUs)."[23]
If you have a GPU in your computer (likely a GeForce graphics card), you can
leverage its power to train deep learning models much faster than using a CPU. In
this case, you should choose a PyTorch installation that includes CUDA support.
This is not enough, though: If you haven’t done so yet, you need to install up-to-
date drivers, the CUDA Toolkit, and the CUDA Deep Neural Network library
(cuDNN). Unfortunately, more detailed installation instructions for CUDA are
outside the scope of this book.
The advantage of using a GPU is that it allows you to iterate faster and experiment
with more-complex models and a more extensive range of hyper-parameters.
In my case, I use Linux, and I have a GPU with CUDA version 10.2 installed. So I
would run the following command in the terminal (after activating the
environment):
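# a sketch for a Linux setup with CUDA 10.2 — PyTorch's "Start Locally"
# page generates the exact command for your own setup
(pytorchbook)$ conda install pytorch torchvision cudatoolkit=10.2 -c pytorch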
Using CPU
If you do not have a GPU, you should choose None for CUDA.
"Can I still follow the book without a GPU?"

Sure! The code and the examples in this book were designed to allow all readers to
follow them promptly. Some examples may demand a bit more computing power,
but we are talking about a couple of minutes in a CPU, not hours. If you do not have
a GPU, don’t worry! Besides, you can always use Google Colab if you need to use a
GPU for a while!
If I had a Windows computer, and no GPU, I would have to run the following
command in the Anaconda prompt (pytorchbook):
(pytorchbook) C:\> conda install pytorch torchvision cpuonly\
-c pytorch
Installing CUDA
CUDA: Installing drivers for a GeForce graphics card, NVIDIA’s cuDNN, and
CUDA Toolkit can be challenging and is highly dependent on which model
you own and which OS you use.
For installing NVIDIA’s CUDA Deep Neural Network library (cuDNN), you
need to register at https://developer.nvidia.com/cudnn.
4. TensorBoard
TensorBoard is a powerful tool, and we can use it even if we are developing models
in PyTorch. Luckily, you don’t need to install the whole TensorFlow to get it; you
can easily install TensorBoard alone using Conda. You just need to run this
command in your terminal or Anaconda prompt (again, after activating the
environment):
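# a sketch — TensorBoard is available as a standalone conda package
(pytorchbook)$ conda install -c conda-forge tensorboard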
5. GraphViz and Torchviz (optional)
First, you need to install the GraphViz software itself.[25] If you’re a Windows user,
you may also need to find GraphViz’s installation folder; once you find it, you need
to set or change the PATH accordingly, adding GraphViz’s location to it.
For additional information, you can also check the "How to Install Graphviz
Software"[27] guide.
After installing GraphViz, you can install the torchviz[28] package. This package is
not part of Anaconda Distribution Repository[29] and is only available at PyPI[30], the
Python Package Index, so we need to pip install it.
Once again, open a terminal or Anaconda prompt and run this command (just once
more: after activating the environment):
(pytorchbook)$ pip install torchviz
To check your GraphViz / TorchViz installation, you can try the Python code below:
(pytorchbook)$ python
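A minimal check could look like this (a sketch: torchviz’s make_dot builds a graph
for a tensor that requires gradients, and returns a graphviz Digraph object if
everything is properly installed):

>>> import torch
>>> from torchviz import make_dot
>>> v = torch.tensor(1.0, requires_grad=True)
>>> make_dot(v)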
If you get an error of any kind (an ExecutableNotFound error complaining about
the dot executable is a pretty common one), it means there is still some kind of
installation issue with GraphViz.
6. Git
It is way beyond this guide’s scope to introduce you to version control and its most
popular tool: git. If you are familiar with it already, great, you can skip this section
altogether!
Otherwise, I’d recommend you to learn more about it; it will definitely be useful for
you later down the line. In the meantime, I will show you the bare minimum so you
can use git to clone the repository containing all code used in this book and get
your own, local copy of it to modify and experiment with as you please.
First, you need to install it. So, head to its downloads page (https://git-scm.com/
downloads) and follow instructions for your OS. Once the installation is complete,
please open a new terminal or Anaconda prompt (it’s OK to close the previous
one). In the new terminal or Anaconda prompt, you should be able to run git
commands.
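To clone this book’s repository, a command like this should do it (using the
repository URL from the beginning of this guide):

git clone https://github.com/dvgodoy/PyTorchStepByStep.git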
The command above will create a PyTorchStepByStep folder that contains a local
copy of everything available on GitHub’s repository.
Although they may seem equivalent at first sight, you should prefer conda
install over pip install when working with Anaconda and its virtual
environments.
To learn more about the differences between conda and pip, read
"Understanding Conda and Pip."[32]
As a rule, first try to conda install a given package and, only if it does not
exist there, fall back to pip install, as we did with torchviz.
7. Jupyter
After cloning the repository, navigate to the PyTorchStepByStep folder and, once
inside it, start Jupyter on your terminal or Anaconda prompt:
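(pytorchbook)$ jupyter notebook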
This will open your browser, and you will see Jupyter’s home page containing the
repository’s notebooks and code.
Moving On
Regardless of which of the three environments you chose, now you are ready to
move on and tackle the development of your first PyTorch model, step-by-step!
[17] https://pytorch.org/docs/stable/notes/randomness.html
[18] https://colab.research.google.com/notebooks/intro.ipynb
[19] https://mybinder.readthedocs.io/en/latest/
[20] https://www.anaconda.com/products/individual
[21] https://bit.ly/2MVk0CM
[22] https://pytorch.org/
[23] https://developer.nvidia.com/cuda-zone
[24] https://www.tensorflow.org/tensorboard
[25] https://www.graphviz.org/
[26] https://bit.ly/3fIwYA5
[27] https://bit.ly/30Ayct3
[28] https://github.com/szagoruyko/pytorchviz
[29] https://docs.anaconda.com/anaconda/packages/pkg-docs/
[30] https://pypi.org/
[31] https://bit.ly/37onBTt
[32] https://bit.ly/2AAh8J5
Chapter 0
Visualizing Gradient Descent
Spoilers
In this chapter, we will:

• define a simple linear regression model
• walk through every step of gradient descent: initializing the parameters, making
predictions, computing errors and the loss, computing gradients, and updating
the parameters
• understand the effects of the learning rate and of feature scaling on training
There is no actual PyTorch code in this chapter… it is Numpy all along because our
focus here is to understand, inside and out, how gradient descent works. PyTorch
will be introduced in the next chapter.
Jupyter Notebook
The Jupyter notebook corresponding to Chapter 0[33] is part of the official Deep
Learning with PyTorch Step-by-Step repository on GitHub. You can also run it
directly in Google Colab[34].
If you’re using a local installation, open your terminal or Anaconda prompt and
navigate to the PyTorchStepByStep folder you cloned from GitHub. Then, activate
the pytorchbook environment and run jupyter notebook:
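(pytorchbook)$ jupyter notebook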
If you’re using Jupyter’s default settings, a link like
http://localhost:8888/notebooks/Chapter00.ipynb should open Chapter 0’s
notebook. If not, just click on Chapter00.ipynb on Jupyter’s home page.
Imports
For the sake of organization, all libraries needed throughout the code used in any
given chapter are imported at its very beginning. For this chapter, we’ll need the
following imports:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
Visualizing Gradient Descent
I believe the way gradient descent is usually explained lacks intuition. Students and
beginners are left with a bunch of equations and rules of thumb—this is not the way
one should learn such a fundamental topic.
If you really understand how gradient descent works, you will also understand how
the characteristics of your data and your choice of hyper-parameters (mini-batch
size and learning rate, for instance) have an impact on how well and how fast the
model training is going to be.
By really understanding, I do not mean working through the equations manually: this
does not develop intuition either. I mean visualizing the effects of different
settings; I mean telling a story to illustrate the concept. That’s how you develop
intuition.
But first, we need some data to work with. Instead of using some external dataset,
let’s generate our own synthetic data, based on a model we define next.
Model
The model must be simple and familiar, so you can focus on the inner workings of
gradient descent.
So, I will stick with a model as simple as it can be: a linear regression with a single
feature, x!
In this model, we use a feature (x) to try to predict the value of a label (y). There are
three elements in our model:
• parameter b, the bias (or intercept), which tells us the expected average value of
y when x is zero
• parameter w, the weight (or slope), which tells us how much y increases, on
average, if we increase x by one unit
• and that last term (why does it always have to be a Greek letter?), epsilon, which
is there to account for the inherent noise; that is, the error we cannot get rid of
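
Putting the three elements together, the model reads:

y = b + w x + ε (epsilon)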
We can also conceive the very same model structure in a less abstract way:

salary = minimum wage + increase per year * years of experience + noise
And to make it even more concrete, let’s say that the minimum wage is $1,000
(whatever the currency or time frame, this is not important). So, if you have no
experience, your salary is going to be the minimum wage (parameter b).
Also, let’s say that, on average, you get a $2,000 increase (parameter w) for every
year of experience you have. So, if you have two years of experience, you are
expected to earn a salary of $5,000. But your actual salary is $5,600 (lucky you!).
Since the model cannot account for those extra $600, your extra money is,
technically speaking, noise.
Data Generation
We know our model already. In order to generate synthetic data for it, we need to
pick values for its parameters. I chose b = 1 and w = 2 (as in thousands of dollars)
from the example above.
First, let’s generate our feature (x): We use Numpy's rand() method to randomly
generate 100 (N) points between 0 and 1.
Then, we plug our feature (x) and our parameters b and w into our equation to
compute our labels (y). But we need to add some Gaussian noise[36] (epsilon) as well;
otherwise, our synthetic dataset would be a perfectly straight line. We can
generate noise using Numpy's randn() method, which draws samples from a normal
distribution (of mean 0 and variance 1), and then multiply it by a factor to adjust for
the level of noise. Since I don’t want to add too much noise, I picked 0.1 as my
factor.
1 true_b = 1
2 true_w = 2
3 N = 100
4
5 # Data Generation
6 np.random.seed(42)
7 x = np.random.rand(N, 1)
8 epsilon = (.1 * np.random.randn(N, 1))
9 y = true_b + true_w * x + epsilon
Did you notice the np.random.seed(42) at line 6? This line of code is actually more
important than it looks. It guarantees that, every time we run this code, the same
random numbers will be generated.
Well, you know, random numbers are not quite random… They are really
pseudo-random, which means Numpy's number generator spits out a
sequence of numbers that looks like it’s random. But it is not, really.
The good thing about this behavior is that we can tell the generator to start a
particular sequence of pseudo-random numbers. To some extent, it works
as if we tell the generator: "please generate sequence #42," and it will spill out
a sequence of numbers. That number, 42, which works like the index of the
sequence, is called a seed. Every time we give it the same seed, it generates
the same numbers.
This means we have the best of both worlds: On the one hand, we do
generate a sequence of numbers that, for all intents and purposes, is
considered to be random; on the other hand, we have the power to
reproduce any given sequence. I cannot stress enough how convenient that
is for debugging purposes!
Moreover, you can guarantee that other people will be able to reproduce
your results. Imagine how annoying it would be to run the code in this book
and get different outputs every time, having to wonder if there is anything
wrong with it. But since I’ve set a seed, you and I can achieve the very same
outputs, even if it involved generating random data!
Next, let’s split our synthetic data into train and validation sets, shuffling the array
of indices and using the first 80 shuffled points for training.
"Why do we need to shuffle randomly generated data points? Aren’t they random
enough already?"

Yes, they are random enough, and shuffling them is indeed redundant in this
example. But it is best practice to always shuffle your data points before training a
model to improve the performance of gradient descent.
There is one exception to the "always shuffle" rule, though: time
series problems, where shuffling can lead to data leakage.
Train-Validation-Test Split
It is beyond the scope of this book to explain the reasoning behind the train-
validation-test split, but there are two points I’d like to make:
1. The split should always be the first thing you do—no preprocessing, no
transformations; nothing happens before the split. That’s why we do this
immediately after the synthetic data generation.
2. In this chapter we will use only the training set, so I did not bother to create a
test set, but I performed a split nonetheless to highlight point #1 :-)
Train-Validation Split
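# A sketch of the split described above — the variable names are my choice,
# but consistent with how x_train and y_train are used later on

# Shuffles the indices
idx = np.arange(N)
np.random.shuffle(idx)

# Uses the first 80 random indices for training
train_idx = idx[:int(N * .8)]
# Uses the remaining indices for validation
val_idx = idx[int(N * .8):]

# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]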
"Why not simply use train_test_split() from Scikit-Learn?"

That’s a fair point. Later on, we will refer to the indices of the data points belonging
to either train or validation sets, instead of the points themselves. So, I thought of
using them from the very start.
Step 0 - Random Initialization
We know that b = 1, w = 2, but now let’s see how close we can get to the true
values by using gradient descent and the 80 points in the training set (for training,
N = 80).
OK, given that we’ll never know the true values of the parameters, we need to set
initial values for them. How do we choose them? It turns out a random guess is as
good as any other.
For training a model, you need to randomly initialize the parameters / weights (we
have only two, b and w).
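In Numpy, a random guess amounts to drawing two values from the standard normal distribution (the output below matches this exact code):

# Step 0 - Initializes parameters "b" and "w" randomly
np.random.seed(42)
b = np.random.randn(1)
w = np.random.randn(1)
print(b, w)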
Output
[0.49671415] [-0.1382643]
Step 1
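The forward pass is a one-liner (a sketch, using the b and w initialized above):

# Computes our model's predicted output (forward pass)
yhat = b + w * x_train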
The error is the difference between the predicted value computed for a single data point and the actual value (its label). So, for a given i-th point (from our dataset of N points), its error is:

$$\text{error}_i = \hat{y}_i - y_i$$

The error of the first point in our dataset (i = 0) can be represented like this:

$$\text{error}_0 = \hat{y}_0 - y_0$$
The loss, on the other hand, is some sort of aggregation of errors for a set of data
points.
It seems rather obvious to compute the loss for all (N) data points, right? Well, yes
and no. Although it will surely yield a more stable path from the initial random
parameters to the parameters that minimize the loss, it will also surely be slow.
This means one needs to sacrifice (a bit of) stability for the sake of speed. This is easily
accomplished by randomly choosing (without replacement) a subset of n out of N
data points each time we compute the loss.
For a regression problem, the loss is given by the mean squared error (MSE); that
is, the average of all squared errors; that is, the average of all squared differences
between labels (y) and predictions (b + wx).
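For our n data points, that is:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \text{error}_i^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (b + w x_i - y_i)^2$$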
In the code below, we are using all data points of the training set to compute the
loss, so n = N = 80, meaning we are indeed performing batch gradient descent.
Step 2
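A sketch of that computation:

# Computes the loss (MSE) for all data points (batch)
error = yhat - y_train
loss = (error ** 2).mean()
print(loss)
Output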
2.7421577700550976
Loss Surface
We have just computed the loss (2.74) corresponding to our randomly initialized
parameters (b = 0.49 and w = -0.13). What if we did the same for ALL possible
values of b and w? Well, not all possible values, but all combinations of evenly spaced
values in a given range, like:
# Reminder:
# true_b = 1
# true_w = 2
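# A sketch of the grid: 101 evenly spaced values around each
# true value (the exact ranges are an assumption for illustration)
b_range = np.linspace(true_b - 3, true_b + 3, 101)
w_range = np.linspace(true_w - 3, true_w + 3, 101)
bs, ws = np.meshgrid(b_range, w_range)
bs.shape, ws.shape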
Output
((101, 101), (101, 101))
The result of the meshgrid() operation was two (101, 101) matrices representing
the values of each parameter inside a grid. What does one of these matrices look
like?
bs
Sure, we’re somewhat cheating here, since we know the true values of b and w, so
we can choose the perfect ranges for the parameters. But it is for educational
purposes only :-)
Next, we could use those values to compute the corresponding predictions, errors,
and losses. Let’s start by taking a single data point from the training set and
computing the predictions for every combination in our grid:
dummy_x = x_train[0]
dummy_yhat = bs + ws * dummy_x
dummy_yhat.shape
Output
(101, 101)
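One way of computing a full grid of predictions for every training point (a sketch; any equivalent broadcasting would do):

all_predictions = np.apply_along_axis(
    func1d=lambda x: bs + ws * x,
    axis=1,
    arr=x_train,
)
all_predictions.shape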
Output
(80, 101, 101)
Cool! We got 80 matrices of shape (101, 101), one matrix for each data point, each
matrix containing a grid of predictions.
The errors are the difference between the predictions and the labels, but we
cannot perform this operation right away—we need to work a bit on our labels (y),
so they have the proper shape for it (broadcasting is good, but not that good):
all_labels = y_train.reshape(-1, 1, 1)
all_labels.shape
Output
(80, 1, 1)
Our labels turned out to be 80 matrices of shape (1, 1)—the most boring kind of
matrix—but that is enough for broadcasting to work its magic. We can compute the
errors now:
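# Broadcasts the (80, 1, 1) labels against the
# (80, 101, 101) predictions
all_errors = (all_predictions - all_labels)
all_errors.shape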
Output
(80, 101, 101)
Each prediction has its own error, so we get 80 matrices of shape (101, 101) once again, one grid of errors per data point.
The only step missing is to compute the mean squared error. First, we take the
square of all errors. Then, we average the squares over all data points. Since our
data points are in the first dimension, we use axis=0 to compute this average:
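A sketch of that step:

all_losses = (all_errors ** 2).mean(axis=0)
all_losses.shape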
Output
(101, 101)
The result is a grid of losses, a matrix of shape (101, 101), each loss corresponding
to a different combination of the parameters b and w.
These losses are our loss surface, which can be visualized in a 3D plot, where the
vertical axis (z) represents the loss values. If we connect the combinations of b and
w that yield the same loss value, we’ll get an ellipse. Then, we can draw this ellipse
in the original b x w plane (in blue, for a loss value of 3). This is, in a nutshell, what a
contour plot does. From now on, we’ll always use the contour plot, instead of the
corresponding 3D version.
In the center of the plot, where parameters (b, w) have values close to (1, 2), the loss is at its minimum value. This is the point we’re trying to reach using gradient descent.
At the bottom, slightly to the left, there is the random start point, corresponding to our randomly initialized parameters.
our randomly initialized parameters.
This is one of the nice things about tackling a simple problem like a linear
regression with a single feature: We have only two parameters, and thus we can
compute and visualize the loss surface.
Cross-Sections
Another nice thing is that we can cut a cross-section in the loss surface to check
what the loss would look like if the other parameter were held constant.
Let’s start by making b = 0.52 (the value from b_range that is closest to our initial
random value for b, 0.4967). We cut a cross-section vertically (the red dashed line)
on our loss surface (left plot), and we get the resulting plot on the right:
What does this cross-section tell us? It tells us that, if we keep b constant (at 0.52),
the loss, seen from the perspective of parameter w, can be minimized if w gets
increased (up to some value between 2 and 3).
OK, so far, so good… What about the other cross-section? Let’s cut it horizontally now, making w = -0.16 (the value from w_range that is closest to our initial random value for w, -0.1382). The resulting plot is on the right:
Now, if we keep w constant (at -0.16), the loss, seen from the perspective of
parameter b, can be minimized if b gets increased (up to some value close to 2).
Now I have a question for you: Which of the two dashed curves,
red (w changes, b is constant) or black (b changes, w is constant)
yields the largest changes in loss when we modify the changing
parameter?
A derivative tells you how much a given quantity changes when you slightly vary
some other quantity. In our case, how much does our MSE loss change when we
vary each of our two parameters separately?
The right-most part of the equations below is what you usually see in
implementations of gradient descent for simple linear regression. In the
intermediate step, I show you all elements that pop up from the application of the
chain rule,[37] so you know how the final expression came to be.
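Here they are, for our MSE loss:

$$\frac{\partial \text{MSE}}{\partial b} = \frac{\partial \text{MSE}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} 2(\hat{y}_i - y_i) \cdot 1 = 2 \cdot \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$

$$\frac{\partial \text{MSE}}{\partial w} = \frac{\partial \text{MSE}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} 2(\hat{y}_i - y_i) \cdot x_i = 2 \cdot \frac{1}{n} \sum_{i=1}^{n} x_i (\hat{y}_i - y_i)$$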
Just to be clear: We will always use our "regular" error computed at the beginning
of Step 2. The loss surface is surely eye candy, but, as I mentioned before, it is only
feasible to use it for educational purposes.
Step 3
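Translating the right-most expressions above into Numpy (using the error from Step 2):

# Computes gradients for both "b" and "w" parameters
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()
print(b_grad, w_grad)
Output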
-3.044811379650508 -1.8337537171510832
Visualizing Gradients
Since the gradient for b is larger (in absolute value, 3.04) than the gradient for w (in
absolute value, 1.83), the answer for the question I posed you in the "Cross-
Sections" section is: The black curve (b changes, w is constant) yields the largest
changes in loss.
"Why is that?"
To answer that, let’s first put both cross-section plots side-by-side, so we can more
easily compare them. What is the main difference between them?
The curve on the right is steeper. That’s your answer! Steeper curves have larger
gradients.
Cool! That’s the intuition… Now, let’s get a bit more geometrical. So, I am zooming
in on the regions given by the red and black squares of Figure 0.7.
From the "Cross-Sections" section, we already know that to minimize the loss, both
b and w needed to be increased. So, keeping in the spirit of using gradients, let’s
increase each parameter a little bit (always keeping the other one fixed!).
What effect do these increases have on the loss? Let’s check it out:
On the left plot, increasing w by 0.12 yields a loss reduction of 0.21. The
geometrically computed and roughly approximate gradient is given by the ratio
between the two values: -1.79. How does this result compare to the actual value of
the gradient (-1.83)? It is actually not bad for a crude approximation. Could it be
better? Sure, if we make the increase in w smaller and smaller (like 0.01, instead of
0.12), we’ll get better and better approximations. In the limit, as the increase
approaches zero, we’ll arrive at the precise value of the gradient. Well, that’s the
definition of a derivative!
The same reasoning goes for the plot on the right: increasing b by the same 0.12
yields a larger loss reduction of 0.35. Larger loss reduction, larger ratio, larger
gradient—and larger error, too, since the geometric approximation (-2.90) is
farther away from the actual value (-3.04).
Time for another question: Which curve, red or black, do you like best to reduce
the loss? It should be the black one, right? Well, yes, but it is not as straightforward
as we’d like it to be. We’ll dig deeper into this in the "Learning Rate" section.
Backpropagation
Now that you’ve learned about computing the gradient of the loss function w.r.t. each of its parameters, here is how the term backpropagation is commonly defined:
The term backpropagation strictly refers only to the algorithm for computing
the gradient, not how the gradient is used; but the term is often used loosely
to refer to the entire learning algorithm, including how the gradient is used,
such as by stochastic gradient descent.
Does it seem familiar? That’s it; backpropagation is nothing more than "chained"
gradient descent. That’s, in a nutshell, how a neural network is trained: It uses
backpropagation, starting at its last layer and working its way back, to update the
weights through all the layers.
In our example, we have a single layer, even a single neuron, so there is no need to
backpropagate anything (more on that in the next chapter).
$$b \leftarrow b - \eta \frac{\partial \text{MSE}}{\partial b} \qquad w \leftarrow w - \eta \frac{\partial \text{MSE}}{\partial w}$$
Equation 0.5 - Updating coefficients b and w using computed gradients and a learning rate (η)
We can also interpret this a bit differently: Each parameter is going to have its value updated by a constant step (the learning rate, η), weighted by how much that parameter contributes to minimizing the loss (its gradient).
Honestly, I believe this way of thinking about the parameter update makes more
sense. First, you decide on a learning rate that specifies your step size, while the
gradients tell you the relative impact (on the loss) of taking a step for each
parameter. Then, you take a given number of steps that’s proportional to that
relative impact: more impact, more steps.
"How do you choose a learning rate?"
That is a topic on its own and beyond the scope of this section as well. We’ll get back to it later on, in the second volume of the series.
In our example, let’s start with a value of 0.1 for the learning rate (which is a
relatively high value, as far as learning rates are concerned).
Step 4
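A direct translation of Equation 0.5, printing the parameters before and after the update:

# Sets learning rate
lr = 0.1
print(b, w)

# Updates parameters using gradients and the learning rate
b = b - lr * b_grad
w = w - lr * w_grad
print(b, w)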
Output
[0.49671415] [-0.1382643]
[0.80119529] [0.04511107]
What’s the impact of one update on our model? Let’s visually check its predictions.
Learning Rate
Maybe you’ve seen this famous graph[38](from Stanford’s CS231n class) that shows
how a learning rate that is too high or too low affects the loss during training. Most
people will see it (or have seen it) at some point in time. This is pretty much general
knowledge, but I think it needs to be thoroughly explained and visually
demonstrated to be truly understood. So, let’s start!
I will tell you a little story (trying to build an analogy here, please bear with me!):
Imagine you are coming back from hiking in the mountains and you want to get
back home as quickly as possible. At some point in your path, you can either choose
to go ahead or to make a right turn.
The path ahead is almost flat, while the path to your right is kinda steep. The
steepness is the gradient. If you take a single step one way or the other, it will lead
to different outcomes (you’ll descend more if you take one step to the right instead
of going ahead).
But, here is the thing: You know that the path to your right is getting you home faster, so that is the direction you take.
But, you still have one choice: You can adjust the size of your step. You can choose
to take steps of any size, from tiny steps to long strides. That’s your learning rate.
OK, let’s see where this little story brought us so far. That’s how you’ll move, in a nutshell: Your step is the product of the step size you chose (the learning rate) and the steepness of the path (the gradient).
You get the point, right? I hope so, because the analogy completely falls apart now.
At this point, after moving in one direction (say, the right turn we talked about),
you’d have to stop and move in the other direction (for just a fraction of a step,
because the path was almost flat, remember?). And so on and so forth. Well, I don’t
think anyone has ever returned from hiking in such an orthogonal zig-zag path!
Anyway, let’s explore further the only choice you have: the size of your step—I
mean, the learning rate.
It makes sense to start with baby steps, right? This means using a low learning rate.
Low learning rates are safe(r), as expected. If you were to take tiny steps while
returning home from your hiking, you’d be more likely to arrive there safe and
sound—but it would take a lot of time. The same holds true for training models:
Low learning rates will likely get you to (some) minimum point, eventually.
Unfortunately, time is money, especially when you’re paying for GPU time in the
cloud, so, there is an incentive to try higher learning rates.
How does this reasoning apply to our model? From computing our (geometric) gradients, we know we need to take a given number of steps: 1.79 (parameter w) and 2.90 (parameter b). Let’s set our step size (the learning rate) to 0.1 and update the parameters accordingly.
Where does this movement lead us? As you can see in the plots below (as shown by
the new dots to the right of the original ones), in both cases, the movement took us
closer to the minimum; more so on the right because the curve is steeper.
What would have happened if we had used a high learning rate instead, say, a step
size of 0.8? As we can see in the plots below, we start to, literally, run into trouble.
Even though everything is still OK on the left plot, the right plot shows us a
completely different picture: We ended up on the other side of the curve. That is
not good… You’d be going back and forth, alternately hitting both sides of the
curve.
"Well, even so, I may still reach the minimum; why is it so bad?"
In our simple example, yes, you’d eventually reach the minimum because the curve
is nice and round.
But, in real problems, the "curve" has a really weird shape that allows for bizarre
outcomes, such as going back and forth without ever approaching the minimum.
In our analogy, you moved so fast that you fell down and hit the other side of the
valley, then kept going down like a ping-pong. Hard to believe, I know, but you
definitely don’t want that!
Wait, it may get worse than that! Let’s use a really high learning rate, say, a step
size of 1.1!
Ok, that is bad. On the right plot, not only did we end up on the other side of the
curve again, but we actually climbed up. This means our loss increased, instead of
decreased! How is that even possible? You’re moving so fast downhill that you end up
climbing it back up?! Unfortunately, the analogy cannot help us anymore. We need
to think about this particular case in a different way.
First, notice that everything is fine on the left plot. The enormous learning rate did
not cause any issues, because the left curve is less steep than the one on the right.
In other words, the curve on the left can take higher learning rates than the curve
on the right.
"Bad" Feature
How do we achieve equally steep curves? I’m on it! First, let’s take a look at a slightly
modified example, which I am calling the "bad" dataset:
• I multiplied our feature (x) by 10, so it is in the range [0, 10] now, and renamed
it bad_x.
• But since I do not want the labels (y) to change, I divided the original true_w
parameter by 10 and renamed it bad_w—this way, both bad_w * bad_x and w *
x yield the same results.
# Data Generation
np.random.seed(42)

# We divide w by 10
bad_w = true_w / 10
# And multiply x by 10
bad_x = np.random.rand(N, 1) * 10

# So, the net effect on y is zero: it is still the same as before
y = true_b + bad_w * bad_x + epsilon
Then, I performed the same split as before for both original and bad datasets and
plotted the training sets side-by-side, as you can see below:
Does this simple scaling have any meaningful impact on our gradient descent?
Well, if it hadn’t, I wouldn’t be asking it, right? Let’s compute a new loss surface and
compare to the one we had before.
Figure 0.14 - Loss surface—before and after scaling feature x (Obs.: left plot looks a bit different
than Figure 0.6 because it is centered at the "after" minimum)
Look at the contour values of Figure 0.14: The dark blue line was 3.0, and now it is
50.0! For the same range of parameter values, loss values are much higher.
Let’s look at the cross-sections before and after we multiplied feature x by 10.
What happened here? The red curve got much steeper (larger gradient), and thus
we must use a lower learning rate to safely descend along it.
How can we fix it? Well, we ruined it by scaling it 10x larger. Perhaps we can make
it better if we scale it in a different way.
Different how? There is this beautiful thing called the StandardScaler, which
transforms a feature in such a way that it ends up with zero mean and unit
standard deviation.
How does it achieve that? First, it computes the mean and the standard deviation of
a given feature (x) using the training set (N points):
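In symbols:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad \sigma_x = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

Then it uses both values to scale the feature:

$$x_{\text{scaled}} = \frac{x - \bar{x}}{\sigma_x}$$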
If we were to recompute the mean and the standard deviation of the scaled
feature, we would get 0 and 1, respectively. This pre-processing step is commonly
referred to as normalization, although, technically, it should always be referred to as
standardization.
After using the training set only to fit the StandardScaler, you
should use its transform() method to apply the pre-processing
step to all datasets: training, validation, and test.
Let’s start with the unit standard deviation; that is, scaling the feature
values such that its standard deviation equals one. This is one of the most
important pre-processing steps, not only for the sake of improving the
performance of gradient descent, but for other techniques such as principal
component analysis (PCA) as well. The goal is to have all numerical features
in a similar scale, so the results are not affected by the original range of each
feature.
Think of two common features in a model: age and salary. While age usually
varies between 0 and 110, salaries can go from the low hundreds (say, 500)
to several thousand (say, 9,000). If we compute the corresponding standard
deviations, we may get values like 25 and 2,000, respectively. Thus, we need
to standardize both features to have them on equal footing.
And then there is the zero mean; that is, centering the feature at zero.
Deeper neural networks may suffer from a very serious condition called
vanishing gradients. Since the gradients are used to update the parameters,
smaller and smaller (that is, vanishing) gradients mean smaller and smaller
updates, up to the point of a standstill: The network simply stops learning.
One way to help the network to fight this condition is to center its inputs,
the features, at zero. We’ll get back to this later on, in the second volume of
the series, while discussing activation functions.
Notice that we are not regenerating the data—we are using the original feature x
as input for the StandardScaler and transforming it into a scaled x. The labels (y) are left untouched.
Once again, the only difference between the plots is the scale of feature x. Its
range was originally [0, 1], then we made it [0, 10], and now the StandardScaler
made it [-1.5, 1.5].
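A sketch of this pre-processing, following the fit-on-train-only rule from the aside above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=True, with_std=True)
# We use the TRAIN set ONLY to fit the scaler...
scaler.fit(x_train)
# ...and use the already fit scaler to TRANSFORM both sets
scaled_x_train = scaler.transform(x_train)
scaled_x_val = scaler.transform(x_val)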
OK, time to check the loss surface: To illustrate the differences, I am plotting the
three of them side-by-side: original, "bad", and scaled. It looks like Figure 0.17.
Figure 0.17 - Loss surfaces for different scales for feature x (Obs.: left and center plots look a bit
different than Figure 0.14 because they are centered at the "scaled" minimum)
In practice, this is the best surface one could hope for: The cross-sections are going to be similarly steep, and a good learning rate for one of them is also good for the other.
Sure, in the real world, you’ll never get a pretty bowl like that. But our conclusion still holds: Always standardize (scale) your features.
Definition of Epoch
An epoch is complete whenever every point in the training set (N) has already been used for computing the loss and updating the parameters:
• For batch gradient descent, this is trivial, as it uses all points to compute the loss, so one epoch is the same as one update.
• For stochastic gradient descent, one epoch means N updates, since every individual data point is used to perform an update.
• For mini-batch (of size n), one epoch has N/n updates, since a mini-batch of n data points is used to perform an update.
Repeating this process over and over for many epochs is, in a nutshell, training a
model.
In the next chapter, we’ll put all these steps together and run it for 1,000 epochs, so
we’ll get to the parameters depicted in the figure above, b = 1.0235 and w = 1.9690.
"Why 1,000 epochs?"
No particular reason, but this is a fairly simple model, and we can afford to run it
over a large number of epochs. In more-complex models, though, a couple of dozen
epochs may be enough. We’ll discuss this a bit more in Chapter 1.
In Step 3, we have seen the loss surface and both random start and minimum points. Which path is gradient descent going to take from one to the other? How long will it take? Will it actually reach the minimum?
The answers to all these questions depend on many things, like the learning rate, the shape of the loss surface, and the number of points we use to compute the loss.
To illustrate the differences, I’ve generated paths over 100 epochs using either 80 data points (batch), 16 data points (mini-batch), or a single data point (stochastic) to compute the loss.
Figure 0.19 - The paths of gradient descent (Obs.: random start is different from Figure 0.4)
You can see that the resulting parameters at the end of Epoch 1 differ greatly from
one another. This is a direct consequence of the number of updates happening
during one epoch, according to the batch size. In our example, for 100 epochs:
• batch gradient descent performs 100 updates (one per epoch);
• mini-batch (n = 16) gradient descent performs 500 updates (80 / 16 = 5 per epoch);
• stochastic gradient descent performs 8,000 updates (80 per epoch).
So, for both center and right plots, the path between random start and Epoch 1
contains multiple updates, which are not depicted in the plot (otherwise it would
be very cluttered)—that’s why the line connecting two epochs is dashed, instead of
solid. In reality, there would be zig-zagging lines connecting every two epochs.
• The stochastic gradient descent path is somewhat weird: It gets quite close to
the minimum point at the end of Epoch 1 already, but then it seems to fail to
actually reach it. But this is expected since it uses a single data point for each
update; it will never stabilize, forever hovering in the neighborhood of the
minimum point.
Clearly, there is a trade-off here: Either we have a stable and smooth trajectory, or
we move faster toward the minimum.
In time, with practice, you’ll observe the behaviors described here in your own
models. Make sure to try plenty of different combinations: mini-batch sizes,
learning rates, etc. This way, not only will your models learn, but so will you :-)
Recap
In this chapter, we covered, among other things:
• visualizing an example of a loss surface and using its cross-sections to get the
loss curves for individual parameters
• learning that a gradient is a partial derivative and it represents how much the
loss changes if one parameter changes a little bit
• computing the gradients for our model’s parameters using equations, code,
and geometry
• learning that loss curves for all parameters should be, ideally, similarly steep
• visualizing the effects of using a feature with a larger range, making the loss
curve for the corresponding parameter much steeper
• learning that preprocessing steps like scaling should be applied after the train-
validation split to prevent leakage
• figuring out that performing all steps (forward pass, loss, gradients, and
parameter update) makes one epoch
• visualizing the path of gradient descent over many epochs and realizing it is
heavily dependent on the kind of gradient descent used: batch, mini-batch, or
stochastic
• learning that there is a trade-off between the stable and smooth path of batch
gradient descent and the fast and somewhat chaotic path of stochastic gradient
descent, making the use of mini-batch gradient descent a good compromise
between the other two
You are now ready to put it all together and actually train a model using PyTorch!
[33] https://github.com/dvgodoy/PyTorchStepByStep/blob/master/Chapter00.ipynb
[34] https://colab.research.google.com/github/dvgodoy/PyTorchStepByStep/blob/master/Chapter00.ipynb
[35] https://en.wikipedia.org/wiki/Gradient_descent
[36] https://en.wikipedia.org/wiki/Gaussian_noise
[37] https://en.wikipedia.org/wiki/Chain_rule
[38] https://bit.ly/2BxCxTO
Jupyter Notebook
The Jupyter notebook corresponding to Chapter 1[39] is part of the official Deep
Learning with PyTorch Step-by-Step repository on GitHub. You can also run it
directly in Google Colab[40].
If you’re using a local installation, open your terminal or Anaconda prompt and
navigate to the PyTorchStepByStep folder you cloned from GitHub. Then, activate
the pytorchbook environment and run jupyter notebook:
If you’re using Jupyter’s default settings, the link shown in your terminal should open Chapter 1’s notebook. If not, just click on Chapter01.ipynb on your Jupyter’s home page.
Imports
For the sake of organization, all libraries needed throughout the code used in any
given chapter are imported at its very beginning. For this chapter, we’ll need the
following imports:
import numpy as np
from sklearn.linear_model import LinearRegression
import torch
import torch.optim as optim
import torch.nn as nn
from torchviz import make_dot
In this first example, I will stick with a simple and familiar problem: a linear regression with a single feature x! It doesn’t get much simpler than that!
It is also possible to think of it as the simplest neural network possible: one input,
one output, and no activation function (that is, linear).
If you have read Chapter 0, you can either choose to skip to the
"Linear Regression in Numpy" section or to use the next two
sections as a review.
Data Generation
Let’s start generating some synthetic data. We start with a vector of 100 (N) points
for our feature (x) and create our labels (y) using b = 1, w = 2, and some Gaussian
noise[41] (epsilon).
1 true_b = 1
2 true_w = 2
3 N = 100
4
5 # Data Generation
6 np.random.seed(42)
7 x = np.random.rand(N, 1)
8 epsilon = (.1 * np.random.randn(N, 1))
9 y = true_b + true_w * x + epsilon
Next, let’s split our synthetic data into train and validation sets, shuffling the array
of indices and using the first 80 shuffled points for training.
Notebook Cell 1.1 - Splitting synthetic dataset into train and validation sets for linear regression
We know that b = 1, w = 2, but now let’s see how close we can get to the true
values by using gradient descent and the 80 points in the training set (for training,
N = 80).
Gradient Descent
I’ll cover the five basic steps you’ll need to go through to use gradient descent and
the corresponding Numpy code.
For training a model, you need to randomly initialize the parameters / weights (we
have only two, b and w).
Step 0
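In Numpy, as in the previous chapter:

# Initializes parameters "b" and "w" randomly
np.random.seed(42)
b = np.random.randn(1)
w = np.random.randn(1)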
print(b, w)
Output
[0.49671415] [-0.1382643]
This is the forward pass; it simply computes the model’s predictions using the current
values of the parameters / weights. At the very beginning, we will be producing really
bad predictions, as we started with random values from Step 0.
Step 1
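A sketch of the forward pass:

# Computes our model's predicted output (forward pass)
yhat = b + w * x_train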
For a regression problem, the loss is given by the mean squared error (MSE); that
is, the average of all squared errors; that is, the average of all squared differences
between labels (y) and predictions (b + wx).
In the code below, we are using all data points of the training set to compute the
loss, so n = N = 80, meaning we are performing batch gradient descent.
Step 2
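A sketch, consistent with the output below:

# Computes the loss (MSE) using all data points (batch)
error = yhat - y_train
loss = (error ** 2).mean()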
print(loss)
Output
2.7421577700550976
A derivative tells you how much a given quantity changes when you slightly vary
some other quantity. In our case, how much does our MSE loss change when we
vary each of our two parameters separately?
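In code, using the expressions derived in the previous chapter:

# Computes gradients for both "b" and "w" parameters
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()
print(b_grad, w_grad)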
Output
-3.044811379650508 -1.8337537171510832
In the final step, we use the gradients to update the parameters. Since we are
trying to minimize our losses, we reverse the sign of the gradient for the update.
"How do you choose a learning rate?"
That is a topic on its own and beyond the scope of this section as well. We’ll get back to it in the second volume of the series.
In our example, let’s start with a value of 0.1 for the learning rate (which is a
relatively high value, as far as learning rates are concerned!).
Step 4
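A sketch, printing the parameters before updating them:

# Sets learning rate
lr = 0.1
print(b, w)

# Updates parameters using gradients and the learning rate
b = b - lr * b_grad
w = w - lr * w_grad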
print(b, w)
Output
[0.49671415] [-0.1382643]
[0.80119529] [0.04511107]
Now we use the updated parameters to go back to Step 1 and restart the process.
Definition of Epoch
An epoch is complete whenever every point in the training set (N) has already been used for computing the loss and updating the parameters:
• For batch gradient descent, this is trivial, as it uses all points to compute the loss, so one epoch is the same as one update.
• For stochastic gradient descent, one epoch means N updates, since every individual data point is used to perform an update.
• For mini-batch (of size n), one epoch has N/n updates, since a mini-batch of n data points is used to perform an update.
Repeating this process over and over for many epochs is, in a nutshell, training a
model.
For training a model, there is a first initialization step (line numbers refer to
Notebook Cell 1.2 code below):
• Initialization of hyper-parameters (in our case, only learning rate and number of
epochs)—lines 9 and 11
For each epoch, there are four training steps (line numbers refer to Notebook Cell
1.2 code below):
For now, we will be using batch gradient descent only, meaning, we’ll use all data
points for each one of the four steps above. It also means that going once through
all of the steps is already one epoch. Then, if we want to train our model over 1,000
epochs, we just need to add a single loop.
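A condensed sketch of that cell’s logic (the original cell’s line numbering, referenced above, is omitted here):

# Step 0 - Initializes parameters "b" and "w" randomly
np.random.seed(42)
b = np.random.randn(1)
w = np.random.randn(1)

# Initializes hyper-parameters: learning rate and number of epochs
lr = 0.1
n_epochs = 1000

for epoch in range(n_epochs):
    # Step 1 - Computes model's predicted output (forward pass)
    yhat = b + w * x_train
    # Step 2 - Computes the loss (MSE)
    error = yhat - y_train
    loss = (error ** 2).mean()
    # Step 3 - Computes gradients for both parameters
    b_grad = 2 * error.mean()
    w_grad = 2 * (x_train * error).mean()
    # Step 4 - Updates parameters using gradients and learning rate
    b = b - lr * b_grad
    w = w - lr * w_grad

print(b, w)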
② Initialization of hyper-parameters
Output
"Do we really need to run it for 1,000 epochs?"
Good question: We don’t need to run it for 1,000 epochs. There are ways of
stopping it earlier, once the progress is considered negligible (for instance, if the
loss was barely reduced). These are called, most appropriately, early stopping
methods. For now, since our model is a very simple one, we can afford to train it for
1,000 epochs.
Just to make sure we haven’t made any mistakes in our code, we can use Scikit-
Learn’s linear regression to fit the model and compare the coefficients.
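A sketch of that check (LinearRegression was imported at the beginning of the chapter):

# Fits the same data with Scikit-Learn's linear regression
linr = LinearRegression()
linr.fit(x_train, y_train)
print(linr.intercept_, linr.coef_[0])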
Output
PyTorch
First, we need to cover a few basic concepts that may throw you off-balance if you
don’t grasp them well enough before going full-force on modeling.
Tensor
In Numpy, you may have an array that has three dimensions, right? That is,
technically speaking, a tensor.
But, to keep things simple, it is commonplace to call vectors and matrices tensors as
well—so, from now on, everything is either a scalar or a tensor.
You can create tensors in PyTorch pretty much the same way you create arrays in
Numpy. Using tensor() you can create either a scalar or a tensor.
scalar = torch.tensor(3.14159)
vector = torch.tensor([1, 2, 3])
matrix = torch.ones((2, 3), dtype=torch.float)
tensor = torch.randn((2, 3, 4), dtype=torch.float)
print(scalar)
print(vector)
print(matrix)
print(tensor)
Output
tensor(3.1416)
tensor([1, 2, 3])
tensor([[1., 1., 1.],
[1., 1., 1.]])
tensor([[[-1.0658, -0.5675, -1.2903, -0.1136],
[ 1.0344, 2.1910, 0.7926, -0.7065],
[ 0.4552, -0.6728, 1.8786, -0.3248]],
You can get the shape of a tensor using its size() method or its shape attribute.
print(tensor.size(), tensor.shape)
Output
torch.Size([2, 3, 4]) torch.Size([2, 3, 4])
All tensors have shapes, but scalars have "empty" shapes, since they are
dimensionless (or zero dimensions, if you prefer):
print(scalar.size(), scalar.shape)
Output
torch.Size([]) torch.Size([])
You can also reshape a tensor using its view() (preferred) or reshape() methods. Beware: The view() method only returns a tensor with the desired shape that shares the underlying data with the original tensor; it does not make a copy.
If you want to copy all data, that is, duplicate the data in memory, you may use
either its new_tensor() or clone() methods.
"What does detach() do?"
It removes the tensor from the computation graph, which probably raises more questions than it answers, right? Don’t worry, we’ll get back to it later in this chapter.
It is time to start converting our Numpy code to PyTorch: We’ll start with the
training data; that is, our x_train and y_train arrays.
x_train_tensor = torch.as_tensor(x_train)
x_train.dtype, x_train_tensor.dtype
Output
(dtype('float64'), torch.float64)
You can also easily cast it to a different type, like a lower-precision (32-bit) float,
which will occupy less space in memory, using float():
float_tensor = x_train_tensor.float()
float_tensor.dtype
Output
torch.float32
dummy_array = np.array([1, 2, 3])
dummy_tensor = torch.as_tensor(dummy_array)
# Modifies the numpy array
dummy_array[1] = 0
# Tensor gets modified too...
dummy_tensor
Output
tensor([1, 0, 3])
"Couldn’t we have used torch.tensor() instead?"
Well, you could … just keep in mind that torch.tensor() always makes a copy of the data, instead of sharing the underlying data with the Numpy array.
You can also perform the opposite operation, namely, transforming a PyTorch
tensor back to a Numpy array. That’s what numpy() is good for:
dummy_tensor.numpy()
Output
array([1, 0, 3])
So far, we have only created CPU tensors. What does it mean? It means the data in
the tensor is stored in the computer’s main memory and any operations performed
on it are going to be handled by its CPU (the central processing unit; for instance,
an Intel® Core™ i7 Processor). So, although the data is, technically speaking, in the
memory, we’re still calling this kind of tensor a CPU tensor.
Yes, there is also a GPU tensor. A GPU (which stands for graphics processing unit) is the processor of a graphics card. These tensors store their data in the graphics card’s memory, and operations on top of them are performed by the GPU.
If you have a graphics card from NVIDIA, you can use the power of its GPU to
speed up model training. PyTorch supports the use of these GPUs for model
training using CUDA (Compute Unified Device Architecture), which needs to be
previously installed and configured (please refer to the "Setup Guide" for more
information on this).
If you do have a GPU (and you managed to install CUDA), we’re getting to the part
where you get to use it with PyTorch. But, even if you do not have a GPU, you
should stick around in this section anyway. Why? First, you can use a free GPU
from Google Colab, and, second, you should always make your code GPU-ready;
that is, it should automatically run in a GPU, if one is available.
PyTorch has your back once more—you can use cuda.is_available() to find out if
you have a GPU at your disposal and set your device accordingly. So, it is good
practice to figure this out at the top of your code:
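That check is a one-liner; the very same line reappears in the model configuration code at the end of this chapter:

device = 'cuda' if torch.cuda.is_available() else 'cpu'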
So, if you don’t have a GPU, your device is called cpu. If you do have a GPU, your
device is called cuda or cuda:0. Why isn’t it called gpu, then? Don’t ask me… The
important thing is, your code will be able to always use the appropriate device.
"Why cuda:0? Are there others, like cuda:1, cuda:2 and so on?"
There may be if you are lucky enough to have multiple GPUs in your computer. Since
this is usually not the case, I am assuming you have either one GPU or none. So,
when we tell PyTorch to send a tensor to cuda without any numbering, it will send it
to the current CUDA device, which is device #0 by default.
If you are using someone else’s computer and you don’t know how many GPUs it
has, or which model they are, you can figure it out using cuda.device_count() and
cuda.get_device_name():
n_cudas = torch.cuda.device_count()
for i in range(n_cudas):
    print(torch.cuda.get_device_name(i))
Output
In my case, I have only one GPU, and it is a GeForce GTX 1060 model with 6 GB RAM.
There is only one thing left to do: turn our tensor into a GPU tensor. That’s what
to() is good for. It sends a tensor to the specified device.
gpu_tensor = torch.as_tensor(x_train).to(device)
gpu_tensor[0]
Output - GPU
tensor([0.7713], device='cuda:0', dtype=torch.float64)
Output - CPU
tensor([0.7713], dtype=torch.float64)
In this case, there is no device information in the printed output because PyTorch
simply assumes the default (cpu).
"Should I use to(device) even if I’m using only a CPU?"
Yes, you should, because there is no cost in doing so. If you have only a CPU, your
tensor is already a CPU tensor, so nothing will happen. But if you share your code
with others on GitHub, whoever has a GPU will benefit from it.
Let’s put it all together now and make our training data ready for PyTorch.
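That is exactly what Notebook Cell 1.3 does (it is reproduced once more in the "Data Preparation" section at the end of this chapter):

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Our data was in Numpy arrays; we transform them into
# PyTorch's tensors, cast them to float, and send them
# to the chosen device
x_train_tensor = torch.as_tensor(x_train).float().to(device)
y_train_tensor = torch.as_tensor(y_train).float().to(device)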
So, we defined a device, converted both Numpy arrays into PyTorch tensors, cast
them to floats, and sent them to the device. Let’s take a look at the types:
Output - GPU
Output - CPU
If you compare the types of both variables, you’ll get what you’d expect:
numpy.ndarray for the first one and torch.Tensor for the second one.
But where does the x_train_tensor "live"? Is it a CPU or a GPU tensor? You can’t
say, but if you use PyTorch’s type(), it will reveal its location
—torch.cuda.FloatTensor—a GPU tensor in this case (assuming the output was generated using a GPU, of course).
There is one more thing to be aware of when using GPU tensors. Remember
numpy()? What if we want to turn a GPU tensor back into a Numpy array? We’ll get
an error:
back_to_numpy = x_train_tensor.numpy()
Output
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
Unfortunately, Numpy cannot handle GPU tensors! You need to make them CPU
tensors first using cpu():
back_to_numpy = x_train_tensor.cpu().numpy()
So, to avoid this error, first use cpu() and then numpy(), even if you are using a CPU. It follows the same principle as to(device): You can share your code with others who may be using a GPU.
Creating Parameters
What distinguishes a tensor used for training data (or validation, or test)—like the
ones we’ve just created—from a tensor used as a (trainable) parameter / weight?
The latter requires the computation of its gradients, so we can update their values
(the parameters’ values, that is). That’s what the requires_grad=True argument is
good for. It tells PyTorch to compute gradients for us.
You may be tempted to create a simple tensor for a parameter and, later on, send it
to your chosen device, as we did with our data, right? Not so fast…
The first chunk of code below creates two tensors for our parameters, including
gradients and all. But they are CPU tensors, by default.
# FIRST
# Initializes parameters "b" and "w" randomly, ALMOST as we
# did in Numpy, since we want to apply gradient descent on
# these parameters we need to set REQUIRES_GRAD = TRUE
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float)
w = torch.randn(1, requires_grad=True, dtype=torch.float)
print(b, w)
Output
tensor([0.3367], requires_grad=True)
tensor([0.1288], requires_grad=True)
"If I use the same seed in PyTorch as I used in Numpy (or, to put it
differently, if I use 42 everywhere), will I get the same numbers?"
Unfortunately, NO.
You’ll get the same numbers for the same seed in the same package. PyTorch
generates a number sequence that is different from the one generated by Numpy,
even if you use the same seed in both.
I am assuming you’d like to use your GPU (or the one from Google Colab), right? So
we need to send those tensors to the device. We can try the naive approach, the
one that worked well for sending the training data to the device. That’s our second
(and failed) attempt:
# SECOND
# But what if we want to run it on a GPU? We could just
# send them to device, right?
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)
w = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)
print(b, w)
# Sorry, but NO! The to(device) "shadows" the gradient...
Output
In the third chunk, we first send our tensors to the device and then use the
requires_grad_() method to set its requires_grad attribute to True in place.
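A sketch of this third attempt:

# THIRD
# We create "regular" tensors, send them to the device (as we
# did with our data)...
torch.manual_seed(42)
b = torch.randn(1, dtype=torch.float).to(device)
w = torch.randn(1, dtype=torch.float).to(device)
# ...and THEN set them as requiring gradients, in place
b.requires_grad_()
w.requires_grad_()
print(b, w)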
Output
Yes, we can do better: We can assign tensors to a device at the moment of their
creation.
# FINAL
# We can specify the device at the moment of creation
# RECOMMENDED!
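# (a sketch following the creation pattern used in the
# computation graph example later in this chapter)
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
print(b, w)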
Output
If you do not have a GPU, your outputs are going to be slightly different:
Output - CPU
tensor([0.3367], requires_grad=True)
tensor([0.1288], requires_grad=True)
Similar to what happens when using the same seed in different packages (Numpy
and PyTorch), we also get different sequences of random numbers if PyTorch
generates them in different devices (CPU and GPU).
Now that we know how to create tensors that require gradients, let’s see how
PyTorch handles them. That’s the role of the…
Autograd
Autograd is PyTorch’s automatic differentiation package. Thanks to it, we don’t need
to worry about partial derivatives, chain rule, or anything like it.
backward
So, how do we tell PyTorch to do its thing and compute all gradients? That’s the
role of the backward() method. It will compute gradients for all (gradient-requiring)
tensors involved in the computation of a given variable.
Do you remember the starting point for computing the gradients? It was the loss,
as we computed its partial derivatives w.r.t. our parameters. Hence, we need to
invoke the backward() method from the corresponding Python variable:
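# (a sketch of Steps 1 to 3, matching the graph example later on)
# Step 1 - Computes our model's predictions (forward pass)
yhat = b + w * x_train_tensor
# Step 2 - Computes the loss
error = yhat - y_train_tensor
loss = (error ** 2).mean()
# Step 3 - No more manual computation of gradients!
loss.backward()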
Which tensors are going to be handled by the backward() method applied to the
loss?
• b
• w
• yhat
• error
Do you see the pattern here? If a tensor in the list is used to compute another
tensor, the latter will also be included in the list. Tracking these dependencies is
exactly what the dynamic computation graph is doing, as we’ll see shortly.
print(error.requires_grad, yhat.requires_grad, \
b.requires_grad, w.requires_grad)
print(y_train_tensor.requires_grad, x_train_tensor.requires_grad)
Output
True True True True
False False
grad
What about the actual values of the gradients? We can inspect them by looking at
the grad attribute of a tensor.
print(b.grad, w.grad)
Output
tensor([-3.3881], device='cuda:0')
tensor([-1.9439], device='cuda:0')
If you check the method’s documentation, it clearly states that gradients are
accumulated. What does that mean? It means that, if we run Notebook Cell 1.5's
code (Steps 1 to 3) twice and check the grad attribute afterward, we will end up
with:
Output
tensor([-6.7762], device='cuda:0')
tensor([-3.8878], device='cuda:0')
If you do not have a GPU, your outputs are going to be slightly different:
Output
tensor([-3.1125]) tensor([-1.8156])
tensor([-6.2250]) tensor([-3.6313])
These gradient values are exactly twice what they were before, as expected!
OK, but that is actually a problem: We need to use the gradients corresponding to
the current loss to perform the parameter update. We should NOT use
accumulated gradients.
During the training of large models, the necessary number of data points in a mini-
batch may be too large to fit in memory (of the graphics card). How can one solve
this, other than buying more-expensive hardware?
One can split a mini-batch into "sub-mini-batches" (horrible name, I know, don’t
quote me on this!), compute the gradients for those "subs" and accumulate them to
achieve the same result as computing the gradients on the full mini-batch.
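A minimal, self-contained sketch of this idea (names and sizes made up for illustration): accumulating gradients over two halves of a mini-batch reproduces the full mini-batch gradient.

import torch

torch.manual_seed(42)
w = torch.randn(1, requires_grad=True)
x = torch.randn(8, 1)  # one "full" mini-batch of eight points
y = 2 * x              # made-up labels

# Gradient over the full mini-batch, in one go
loss = ((w * x - y) ** 2).mean()
loss.backward()
full_grad = w.grad.clone()
w.grad.zero_()

# Same gradient, accumulated over two "sub-mini-batches"
for x_sub, y_sub in ((x[:4], y[:4]), (x[4:], y[4:])):
    sub_loss = ((w * x_sub - y_sub) ** 2).mean()
    # each half contributes half of the overall mean,
    # so we scale it down before calling backward()
    (sub_loss / 2).backward()  # backward() ADDS to w.grad

print(torch.allclose(full_grad, w.grad))  # True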
zero_
Every time we use the gradients to update the parameters, we need to zero the
gradients afterward. And that’s what zero_() is good for.
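In our case, that is a single line:

# This goes right after the parameter update (Step 4)
b.grad.zero_(), w.grad.zero_()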
Output
(tensor([0.], device='cuda:0'),
tensor([0.], device='cuda:0'))
What does the underscore (_) at the end of the method’s name
mean? Do you remember? If not, go back to the previous section
and find out.
So, let’s ditch the manual computation of gradients and use both the backward()
and zero_() methods instead.
That’s it? Well, pretty much … but there is always a catch, and this time it has to do
with the update of the parameters.
Updating Parameters
Unfortunately, our Numpy code for updating parameters is not enough. Why not?! Let’s try it out, simply copying and pasting it (this is the first attempt), changing it slightly (second attempt), and then asking PyTorch to back off (yes, it is PyTorch’s fault!).
56 # need to tell it to let it go...
57 b.grad.zero_() ④
58 w.grad.zero_() ④
59
60 print(b, w)
In the first attempt, if we use the same update structure as in our Numpy code, we’ll
get the weird error below, but we can get a hint of what’s going on by looking at the
tensor itself. Once again, we "lost" the gradient while reassigning the update
results to our parameters. Thus, the grad attribute turns out to be None, and it
raises the error.
Why?! It turns out to be a case of "too much of a good thing." The culprit is
PyTorch’s ability to build a dynamic computation graph from every Python
operation that involves any gradient-computing tensor or its dependencies.
We’ll go deeper into the inner workings of the dynamic computation graph in the
next section.
So, how do we tell PyTorch to "back off" and let us update our parameters without
messing up its fancy dynamic computation graph? That’s what torch.no_grad() is
good for. It allows us to perform regular Python operations on tensors without
affecting PyTorch’s computation graph.
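A minimal sketch of the update wrapped in no_grad(), using in-place subtraction so the parameters keep their grad attributes:

# Step 4, inside the no_grad() context
with torch.no_grad():
    b -= lr * b.grad
    w -= lr * w.grad

# The gradients are then zeroed for the next iteration
b.grad.zero_()
w.grad.zero_()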
Finally, we managed to successfully run our model and get the resulting
parameters. Surely enough, they match the ones we got in our Numpy-only
implementation.
Remember: One does not simply update parameters… without no_grad()!
It was true for going into Mordor, and it is also true for updating parameters.
It turns out, no_grad() has another use case other than allowing us to update
parameters; we’ll get back to it in Chapter 2 when dealing with a model’s
evaluation.
"Unfortunately, no one can be told what the dynamic computation graph is. You have to see it for yourself."
Morpheus
How great was The Matrix? Right? Right? But, jokes aside, I want you to see the graph for yourself too!
So, let’s stick with the bare minimum: two (gradient-computing) tensors for our
parameters, predictions, errors, and loss—these are Steps 0, 1, and 2.
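In code (reusing the creation pattern from the "FINAL" chunk, and torchviz’s make_dot() to draw the graph):

b = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
yhat = b + w * x_train_tensor
error = yhat - y_train_tensor
loss = (error ** 2).mean()
# Visualizes the graph for the predictions
make_dot(yhat)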
Figure 1.5 - Computation graph generated for yhat; Obs.: the corresponding variable names were
inserted manually
• green box ((80, 1)): the tensor used as the starting point for the computation
of gradients (assuming the backward() method is called from the variable used
to visualize the graph)—they are computed from the bottom-up in a graph
Now, take a closer look at the gray box at the bottom of the graph: Two arrows are
pointing to it since it is adding up two variables, b and w*x. Seems obvious, right?
Then, look at the other gray box (MulBackward0) of the same graph: It is performing
a multiplication operation, namely, w*x. But there is only one arrow pointing to it!
The arrow comes from the blue box that corresponds to our parameter w.
So, even though there are more tensors involved in the operations performed by
the computation graph, it only shows gradient-computing tensors and their
dependencies.
What would happen to the computation graph if we set requires_grad to False for
our parameter b?
make_dot(yhat)
The best thing about the dynamic computation graph is that you can make it as
complex as you want it. You can even use control flow statements (e.g., if
statements) to control the flow of the gradients.
Figure 1.7 shows an example of this. And yes, I do know that the computation itself
is complete nonsense!
b = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)

yhat = b + w * x_train_tensor
error = yhat - y_train_tensor
loss = (error ** 2).mean()

# this makes no sense!!
if loss > 0:
    yhat2 = w * x_train_tensor
    error2 = yhat2 - y_train_tensor
    # neither does this!!
    loss += error2.mean()

make_dot(loss)
Even though the computation is nonsensical, you can clearly see the effect of
adding a control flow statement like if loss > 0: It branches the computation
graph into two parts. The right branch performs the computation inside the if
statement, which gets added to the result of the left branch in the end. Cool, right?
Even though we are not building more-complex models like that in this book, this
small example illustrates very well PyTorch’s capabilities and how easily they can
be implemented in code.
Optimizer
So far, we’ve been manually updating the parameters using the computed
gradients. That’s probably fine for two parameters, but what if we had a whole lot
of them? We need to use one of PyTorch’s optimizers, like SGD, RMSprop, or
Adam.
There are many optimizers: SGD is the most basic of them, and
Adam is one of the most popular.
step / zero_grad
An optimizer takes the parameters we want to update, the learning rate we want
to use (and possibly many other hyper-parameters as well!), and performs the
updates through its step() method.
Besides, we also don’t need to zero the gradients one by one anymore. We just
invoke the optimizer’s zero_grad() method, and that’s it!
In the code below, we create a stochastic gradient descent (SGD) optimizer to update
our parameters b and w.
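Its definition is a single line (matching annotation ① in the fragment below):

# Defines an SGD optimizer to update the parameters
optimizer = optim.SGD([b, w], lr=lr)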
41 optimizer.zero_grad() ③
42
43 print(b, w)
① Defining an optimizer
Let’s inspect our two parameters just to make sure everything is still working fine:
Output
Loss
We now tackle the loss computation. As expected, PyTorch has us covered once
again. There are many loss functions to choose from, depending on the task at
hand. Since ours is a regression, we are using the mean squared error (MSE) as loss,
and thus we need PyTorch’s nn.MSELoss():
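A sketch of that definition (the reduction argument defaults to 'mean'; it is spelled out here for clarity):

# Defines an MSE loss function
loss_fn = nn.MSELoss(reduction='mean')
loss_fn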
Output
MSELoss()
Notice that nn.MSELoss() is NOT the loss function itself: We do not pass
predictions and labels to it! Instead, as you can see, it returns another function,
which we called loss_fn: That is the actual loss function. So, we can pass a
prediction and a label to it and get the corresponding loss value:
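For instance, with made-up predictions and labels, chosen here so that they reproduce the output below:

predictions = torch.tensor([0.5, 1.0])
labels = torch.tensor([2.0, 1.3])
loss_fn(predictions, labels)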
Output
tensor(1.1700)
We then use the created loss function in the code below, at line 29, to compute the
loss, given our predictions and our labels:
Notebook Cell 1.8 - PyTorch’s loss in action: no more manual loss computation!
Output
loss
Output
What if we wanted to have it as a Numpy array? I guess we could just use numpy()
again, right? (And cpu() as well, since our loss is in the cuda device.)
loss.cpu().numpy()
Output
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
What happened here? Unlike our data tensors, the loss tensor is actually computing
gradients; to use numpy(), we need to detach() the tensor from the computation
graph first:
loss.detach().cpu().numpy()
Output
array(0.00804466, dtype=float32)
This seems like a lot of work; there must be an easier way! And there is one,
indeed: We can use item(), for tensors with a single element, or tolist()
otherwise (it still returns a scalar if there is only one element, though).
print(loss.item(), loss.tolist())
Output
0.008044655434787273 0.008044655434787273
At this point, there’s only one piece of code left to change: the predictions. It is
then time to introduce PyTorch’s way of implementing a…
Model
In PyTorch, a model is represented by a regular Python class that inherits from the
Module class.
So, assuming you’re already comfortable with OOP, let’s dive into developing a
model in PyTorch.
• __init__(self): It defines the parts that make up the model—in our case, two parameters, b and w.
• forward(self, x): It performs the actual computation; that is, it outputs a prediction, given the input x.
To make a prediction, though, you should call the whole model, as in model(x), instead of calling model.forward(x) directly. The reason is, the call to the whole model involves extra steps, namely, handling forward and backward hooks. If you don’t use hooks (and we don’t use any right now), both calls are equivalent.
Let’s build a proper (yet simple) model for our regression task. It should look like
this:
Notebook Cell 1.9 - Building our "Manual" model, creating parameter by parameter!
1 class ManualLinearRegression(nn.Module):
2 def __init__(self):
3 super().__init__()
4 # To make "b" and "w" real parameters of the model,
5 # we need to wrap them with nn.Parameter
6 self.b = nn.Parameter(torch.randn(1,
7 requires_grad=True,
8 dtype=torch.float))
9 self.w = nn.Parameter(torch.randn(1,
10 requires_grad=True,
11 dtype=torch.float))
12
13 def forward(self, x):
14 # Computes the outputs / predictions
15 return self.b + self.w * x
Parameters
In the __init__() method, we define our two parameters, b and w, using the
Parameter class, to tell PyTorch that these tensors, which are attributes of the
ManualLinearRegression class, should be considered parameters of the model the
class represents.
Why should we care about that? By doing so, we can use our model’s parameters()
method to retrieve an iterator over the model’s parameters, including parameters
of nested models. Then we can use it to feed our optimizer (instead of building a list
of parameters ourselves!).
torch.manual_seed(42)
# Creates a "dummy" instance of our ManualLinearRegression model
dummy = ManualLinearRegression()
list(dummy.parameters())
Output
[Parameter containing:
tensor([0.3367], requires_grad=True), Parameter containing:
tensor([0.1288], requires_grad=True)]
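Feeding the optimizer then becomes a one-liner (a sketch, using the same 0.1 learning rate as before):

optimizer = optim.SGD(dummy.parameters(), lr=0.1)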
state_dict
Moreover, we can get the current values of all parameters using our model’s
state_dict() method.
dummy.state_dict()
Output
OrderedDict([('b', tensor([0.3367])), ('w', tensor([0.1288]))])
The state_dict() of a given model is simply a Python dictionary that maps each
attribute / parameter to its corresponding tensor. But only learnable parameters
are included, as its purpose is to keep track of parameters that are going to be
updated by the optimizer.
By the way, the optimizer itself has a state_dict() too, which contains its internal
state, as well as other hyper-parameters. Let’s take a quick look at it:
optimizer.state_dict()
Output
{'state': {},
'param_groups': [{'lr': 0.1,
'momentum': 0,
'dampening': 0,
'weight_decay': 0,
'nesterov': False,
'params': [140535747664704, 140535747688560]}]}
"What do we need this for?"
It turns out, state dictionaries can also be used for checkpointing a model, as we will
see in Chapter 2.
Device
If we were to send our dummy model to a device, it would look like this:
torch.manual_seed(42)
# Creates a "dummy" instance of our ManualLinearRegression model
# and sends it to the device
dummy = ManualLinearRegression().to(device)
Forward Pass
The forward pass is the moment when the model makes predictions.
You should make predictions by calling the model itself, as in model(x); do not call model.forward(x) directly. Otherwise, your model’s hooks will not work (if you have them).
We can use all these handy methods to change our code, which should be looking
like this:
① Instantiating a model
② What IS this?!?
Now, the printed statements will look like this—final values for parameters b and w
are still the same, so everything is OK :-)
Output
train
I hope you noticed one particular statement in the code (line 21), to which I
assigned a comment "What is this?!?"—model.train().
It is good practice to call model.train() in the training loop; despite its name, the method does not perform a training step, it only sets the model to training mode. It is also possible to set a model to evaluation mode, but this is a topic for the next chapter.
Nested Models
We are implementing a single-feature linear regression, one input and one output, so
the corresponding linear model would look like this:
linear = nn.Linear(1, 1)
linear
Output
Linear(in_features=1, out_features=1, bias=True)
linear.state_dict()
Output
OrderedDict([('weight', tensor([[-0.2191]])),
('bias', tensor([0.2018]))])
So, our former parameter b is the bias, and our former parameter w is the weight
(your values will be different since I haven’t set up a random seed for this example).
Now, let’s use PyTorch’s Linear model as an attribute of our own, thus creating a
nested model.
Even though this clearly is a contrived example, since we are pretty much wrapping
the underlying model without adding anything useful (or, at all!) to it, it illustrates the
concept well.
class MyLinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        # Instead of our custom parameters, we use a Linear model
        # with a single input and a single output
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        # Now it only makes a call to the nested model
        return self.linear(x)
In the __init__() method, we create an attribute that contains our nested Linear
model.
In the forward() method, we call the nested model itself to perform the forward
pass (notice, we are not calling self.linear.forward(x)!).
Now, if we call the parameters() method of this model, PyTorch will figure out the
parameters of its attributes recursively.
torch.manual_seed(42)
dummy = MyLinearRegression().to(device)
list(dummy.parameters())
Output
[Parameter containing:
tensor([[0.7645]], device='cuda:0', requires_grad=True),
Parameter containing:
tensor([0.8300], device='cuda:0', requires_grad=True)]
You can also add extra Linear attributes, and, even if you don’t use them at all in
the forward pass, they will still be listed under parameters().
If you prefer, you can also use state_dict() to get the parameter values, together
with their names:
dummy.state_dict()
Output
OrderedDict([('linear.weight',
tensor([[0.7645]], device='cuda:0')),
('linear.bias',
tensor([0.8300], device='cuda:0'))])
Notice that both bias and weight have a prefix with the attribute name: linear, from
the self.linear in the __init__() method.
Our model was simple enough. You may be thinking: "Why even bother to build a
class for it?!" Well, you have a point…
For straightforward models that use a series of built-in PyTorch models (like
Linear), where the output of one is sequentially fed as an input to the next, we can
use a, er … Sequential model :-)
In our case, we would build a sequential model with a single argument; that is, the
Linear model we used to train our linear regression. The model would look like
this:
1 torch.manual_seed(42)
2 # Alternatively, you can use a Sequential model
3 model = nn.Sequential(nn.Linear(1, 1)).to(device)
4
5 model.state_dict()
Output
OrderedDict([('0.weight', tensor([[0.7645]], device='cuda:0')),
             ('0.bias', tensor([0.8300], device='cuda:0'))])
We’ve been talking about models inside other models. This may get confusing real
quick, so let’s follow convention and call any internal model a layer.
Layers
In the figure above, the hidden layer would be nn.Linear(3, 5) (since it takes
three inputs—from the input layer—and generates five outputs), and the output
layer would be nn.Linear(5, 1) (since it takes five inputs—the outputs from the
hidden layer—and generates a single output).
torch.manual_seed(42)
# Building the model from the figure above
model = nn.Sequential(nn.Linear(3, 5), nn.Linear(5, 1)).to(device)
model.state_dict()
Output
OrderedDict([
('0.weight',
tensor([[ 0.4414, 0.4792, -0.1353],
[ 0.5304, -0.1265, 0.1165],
[-0.2811, 0.3391, 0.5090],
[-0.4236, 0.5018, 0.1081],
[ 0.4266, 0.0782, 0.2784]],
device='cuda:0')),
('0.bias',
tensor([-0.0815, 0.4451, 0.0853, -0.2695, 0.1472],
device='cuda:0')),
('1.weight',
tensor([[-0.2060, -0.0524, -0.1816, 0.2967, -0.3530]],
device='cuda:0')),
('1.bias',
tensor([-0.2062], device='cuda:0'))])
Since this sequential model does not have attribute names, state_dict() uses
numeric prefixes.
You can also use a model’s add_module() method to name the layers:
torch.manual_seed(42)
# Building the model from the figure above
model = nn.Sequential()
model.add_module('layer1', nn.Linear(3, 5))
model.add_module('layer2', nn.Linear(5, 1))
model.to(device)
Output
Sequential(
(layer1): Linear(in_features=3, out_features=5, bias=True)
(layer2): Linear(in_features=5, out_features=1, bias=True)
)
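Alternatively (this variant is not shown in the book’s cells, but it is part of PyTorch’s API), you can name the layers at construction time by passing an OrderedDict to Sequential:
from collections import OrderedDict

torch.manual_seed(42)
model = nn.Sequential(OrderedDict([
    ('layer1', nn.Linear(3, 5)),
    ('layer2', nn.Linear(5, 1)),
])).to(device)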
There are MANY different layers that can be used in PyTorch:
• Convolution Layers
• Pooling Layers
• Padding Layers
• Non-linear Activations
• Normalization Layers
• Recurrent Layers
• Transformer Layers
• Linear Layers
• Dropout Layers
• Sparse Layers (embeddings)
• Vision Layers
• DataParallel Layers (multi-GPU)
• Flatten Layer
So far, we have just used a Linear layer. In the next volume of the series, we’ll use
many others, like convolution, pooling, padding, flatten, dropout, and non-linear
activations.
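As a small preview of what’s to come (a sketch, not part of this chapter’s model), using other layer types is simply a matter of passing more arguments to Sequential:
# A non-linear activation sandwiched between two linear layers
model = nn.Sequential(nn.Linear(3, 5), nn.ReLU(), nn.Linear(5, 1))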
It is time to put it all together and organize our code into three fundamental parts,
namely:
• data preparation
• model configuration
• model training
Data Preparation
There hasn’t been much data preparation up to this point, to be honest. After
generating our data points in Notebook Cell 1.1, the only preparation step
performed so far has been transforming Numpy arrays into PyTorch tensors, as in
Notebook Cell 1.3, which is reproduced below:
%%writefile data_preparation/v0.py

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Our data was in Numpy arrays, but we need to transform them
# into PyTorch's Tensors and then send them to the
# chosen device
x_train_tensor = torch.as_tensor(x_train).float().to(device)
y_train_tensor = torch.as_tensor(y_train).float().to(device)
%run -i data_preparation/v0.py
This part will get much more interesting in the next chapter when we get to use
Dataset and DataLoader classes :-)
We know we have to run the full sequence to train a model: data preparation,
model configuration, and model training. In Chapter 2, we’ll gradually improve each
of these parts, versioning them inside each corresponding folder. So, saving them to
files allows us to run a full sequence using different versions without having to
duplicate code.
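The resulting folder layout looks like this (model_configuration/v1.py being the improved version we’ll build in Chapter 2):
data_preparation/
    v0.py
model_configuration/
    v0.py
    v1.py
model_training/
    v0.py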
Let’s say we start improving model configuration (and we will do exactly that in
Chapter 2), but the other two parts are still the same; how do we run the full
sequence?
%run -i data_preparation/v0.py
%run -i model_configuration/v1.py
%run -i model_training/v0.py
Since we’re using the -i option, it works exactly as if we had copied the code from
the files into a cell and executed it.
We are using the following two magics to better organize our code:
• %%writefile, which writes the contents of a cell to the file named in its argument
• %run, which executes the named file; its -i option runs the file inside the notebook’s current namespace
Model Configuration
We have seen plenty of this part: from defining parameters b and w manually, then
wrapping them up using the Module class, to using layers in a Sequential model.
We have also defined a loss function and an optimizer for our particular linear
regression model.
For the purpose of organizing our code, we’ll include the following elements in the
model configuration part:
• a model
• an optimizer
• a loss function
Most of the corresponding code can be found in Notebook Cell 1.10, lines 1-15, but
we’ll replace the ManualLinearRegression model with the Sequential model from
Notebook Cell 1.12:
%%writefile model_configuration/v0.py

# This is redundant now, but it won't be when we introduce
# Datasets...
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Sets learning rate - this is "eta" ~ the "n"-like Greek letter
lr = 0.1

torch.manual_seed(42)
# Now we can create a model and send it at once to the device
model = nn.Sequential(nn.Linear(1, 1)).to(device)

# Defines an SGD optimizer to update the parameters
# (now retrieved directly from the model)
optimizer = optim.SGD(model.parameters(), lr=lr)

# Defines an MSE loss function
loss_fn = nn.MSELoss(reduction='mean')
%run -i model_configuration/v0.py
Model Training
This is the last part, where the actual training takes place. It loops over the gradient
descent steps we saw at the beginning of this chapter:
• Step 1: compute the model’s predictions (forward pass)
• Step 2: compute the loss
• Step 3: compute the gradients
• Step 4: update the parameters
This sequence is repeated over and over until the number of epochs is reached.
The corresponding code for this part also comes from Notebook Cell 1.10, lines 17-
36.
%%writefile model_training/v0.py

# Defines number of epochs
n_epochs = 1000

for epoch in range(n_epochs):
    # Sets model to TRAIN mode
    model.train()

    # Step 1 - Computes model's predicted output - forward pass
    yhat = model(x_train_tensor)

    # Step 2 - Computes the loss
    loss = loss_fn(yhat, y_train_tensor)

    # Step 3 - Computes gradients for both "b" and "w" parameters
    loss.backward()

    # Step 4 - Updates parameters using gradients and
    # the learning rate
    optimizer.step()
    optimizer.zero_grad()
%run -i model_training/v0.py
print(model.state_dict())
Output
Now, take a close, hard look at the code inside the training loop and ask yourself: if
we were using a different optimizer, a different loss, or even a different model,
would any of that code need to change?
Before I give you the answer, let me address something else that may be on your
mind: "What is the point of all this?"
Well, in the next chapter we’ll get fancier, using more of PyTorch’s classes (like
Dataset and DataLoader) to further refine our data preparation step, and we’ll also
try to reduce boilerplate code to a minimum. So, splitting our code into three
logical parts will allow us to better handle these improvements.
And here is the answer: NO, the code inside the loop would not change.
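To make that concrete, here is a quick sketch (the optimizer and loss choices below are arbitrary stand-ins, not the book’s): swap them in and the loop in model_training/v0.py still runs, untouched.
# Only the configuration part changes; the training loop stays the same
optimizer = optim.Adam(model.parameters(), lr=lr)
loss_fn = nn.L1Loss(reduction='mean')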
I guess you figured out which boilerplate I was referring to, right?
Recap
In this chapter, we have covered, among other things:
• transforming the original Numpy implementation into a PyTorch one using the
elements above
You are now ready for the next chapter. We’ll see more of PyTorch’s capabilities,
and we’ll further develop our training loop so it can be used for different problems
and models. You’ll be building your own small draft of a library for training deep
learning models.
Keep on reading, keep on learning!
Take your skills to the next level: tackle computer vision tasks using convolutional
neural networks and transfer learning in the second volume!