
Deep Learning with PyTorch

Step-by-Step: A Beginner’s Guide


Volume I—Fundamentals
Daniel Voigt Godoy

Version 1.1.1
Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide

Volume I—Fundamentals

by Daniel Voigt Godoy

Copyright © 2020-2022 by Daniel Voigt Godoy. All rights reserved.

May 2021: First Edition

Revision History for the First Edition:

• 2021-05-18: v1.0
• 2021-12-15: v1.1
• 2022-02-12: v1.1.1

For more information, please send an email to [email protected]

Although the author has used his best efforts to ensure that the information and
instructions contained in this book are accurate, under no circumstances shall the
author be liable for any loss, damage, liability, or expense incurred or suffered as a
consequence, directly or indirectly, of the use and/or application of any of the
contents of this book. Any action you take upon the information in this book is
strictly at your own risk. If any code samples or other technology this book contains
or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights. The author does not have any control over and does not
assume any responsibility for third-party websites or their content. All trademarks
are the property of their respective owners. Screenshots are used for illustrative
purposes only.

No part of this book may be reproduced or transmitted in any form or by any
means (electronic, mechanical, photocopying, recording, or otherwise), or by any
information storage and retrieval system without the prior written permission of
the copyright owner, except where permitted by law. Please purchase only
authorized electronic editions. Your support of the author’s rights is appreciated.
"What I cannot create, I do not understand."

Richard P. Feynman
Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
About the Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Why PyTorch? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Why This Book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
What Do I Need to Know? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
How to Read This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
What’s Next?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Setup Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Official Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Google Colab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Binder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Local Installation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1. Anaconda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2. Conda (Virtual) Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3. PyTorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4. TensorBoard. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5. GraphViz and Torchviz (optional) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
6. Git. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
7. Jupyter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Moving On. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Chapter 0: Visualizing Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Spoilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Imports. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Visualizing Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Data Generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Synthetic Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Train-Validation-Test Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Step 0 - Random Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Step 1 - Compute Model’s Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Step 2 - Compute the Loss. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Loss Surface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Cross-Sections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Step 3 - Compute the Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Visualizing Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Backpropagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Step 4 - Update the Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Low Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
High Learning Rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Very High Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
"Bad" Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Scaling / Standardizing / Normalizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Step 5 - Rinse and Repeat!. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
The Path of Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Chapter 1: A Simple Regression Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Spoilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Imports. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
A Simple Regression Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Data Generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Synthetic Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Gradient Descent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Step 0 - Random Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Step 1 - Compute Model’s Predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Step 2 - Compute the Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Step 3 - Compute the Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Step 4 - Update the Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Step 5 - Rinse and Repeat! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Linear Regression in Numpy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
PyTorch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Loading Data, Devices, and CUDA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Creating Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Autograd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
backward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
grad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
zero_ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Updating Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
no_grad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Dynamic Computation Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Optimizer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
step / zero_grad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
state_dict . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Forward Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
train . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Nested Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Sequential Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Model Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Recap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Chapter 2: Rethinking the Training Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Spoilers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Jupyter Notebook. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Imports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Rethinking the Training Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Training Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
TensorDataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
DataLoader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Mini-Batch Inner Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Random Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Plotting Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
TensorBoard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Running It Inside a Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Running It Separately (Local Installation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Running It Separately (Binder) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
SummaryWriter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
add_graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
add_scalars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Saving and Loading Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Model State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Saving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Resuming Training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Deploying / Making Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Setting the Model’s Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Recap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Chapter 2.1: Going Classy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Spoilers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Jupyter Notebook. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Imports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Going Classy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
The Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
The Constructor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Placeholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Training Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Saving and Loading Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Visualization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
The Full Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Classy Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Making Predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Resuming Training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Recap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Chapter 3: A Simple Classification Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Spoilers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Jupyter Notebook. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Imports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
A Simple Classification Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Data Preparation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Logits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Odds Ratio. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Log Odds Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
From Logits to Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Sigmoid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
BCELoss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
BCEWithLogitsLoss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Imbalanced Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Model Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Decision Boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Classification Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Confusion Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
True and False Positive Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Precision and Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
Trade-offs and Curves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Low Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
High Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
ROC and PR Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
The Precision Quirk. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Best and Worst Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Comparing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Recap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
Thank You! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
Preface
If you’re reading this, I probably don’t need to tell you that deep learning is amazing
and PyTorch is cool, right?

But I will tell you, briefly, how this series of books came to be. In 2016, I started
teaching a class on machine learning with Apache Spark and, a couple of years later,
another class on the fundamentals of machine learning.

At some point, I tried to find a blog post that would visually explain, in a clear and
concise manner, the concepts behind binary cross-entropy so that I could show it
to my students. Since I could not find any that fit my purpose, I decided to write one
myself. Although I thought of it as a fairly basic topic, it turned out to be my most
popular blog post[1]! My readers have welcomed the simple, straightforward, and
conversational way I explained the topic.

Then, in 2019, I used the same approach for writing another blog post:
"Understanding PyTorch with an example: a step-by-step tutorial."[2] Once again, I
was amazed by the reaction from the readers!

It was their positive feedback that motivated me to write this series of books to
help beginners start their journey into deep learning and PyTorch.

In this first volume, I cover the basics of gradient descent, the fundamentals of
PyTorch, training linear and logistic regressions, evaluation metrics, and more. If
you have absolutely no experience with PyTorch, this is your starting point.

The second volume is mostly focused on computer vision: deeper models and
activation functions, convolutional neural networks, initialization schemes,
schedulers, and transfer learning. If your goal is to learn about deep learning
models for computer vision, and you’re already comfortable training simple models
in PyTorch, the second volume is the right one for you.

Then, the third volume focuses on all things sequence: recurrent neural networks
and their variations, sequence-to-sequence models, attention, self-attention, and
the Transformer architecture. The very last chapter of the third volume is a crash
course on natural language processing: from the basics of word tokenization all the
way up to fine-tuning large models (BERT and GPT-2) using the HuggingFace
library. This volume is more demanding than the other two, and you’re going to
enjoy it more if you already have a solid understanding of deep learning models.

These books are meant to be read in order, and, although they can be read
independently, I strongly recommend you read them as the one, long book I
originally wrote :-)

I hope you enjoy reading this series as much as I enjoyed writing it.

[1] https://bit.ly/2UW5iTg
[2] https://bit.ly/2TpzwxR

Acknowledgements
First and foremost, I’d like to thank YOU, my reader, for making this book possible.
If it weren’t for the amazing feedback I got from the thousands of readers of my
blog post about PyTorch, I would have never mustered the strength to start and
finish such a major undertaking as writing a 1,000-page book series!

I’d like to thank my good friends Jesús Martínez-Blanco (who managed to read
absolutely everything that I wrote), Jakub Cieslik, Hannah Berscheid, Mihail Vieru,
Ramona Theresa Steck, Mehdi Belayet Lincon, and António Góis for helping me out
and dedicating a good chunk of their time to reading, proofing, and suggesting
improvements to my drafts. I’m forever grateful for your support! I’d also like to
thank my friend José Luis Lopez Pino for the initial push I needed to actually start
writing this book.

Many thanks to my friends José Quesada and David Anderson for taking me as a
student at the Data Science Retreat in 2015 and, later on, for inviting me to be a
teacher there. That was the starting point of my career both as a data scientist and
as teacher.

I’d also like to thank the PyTorch developers for developing such an amazing
framework, and the teams from Leanpub and Towards Data Science for making it
incredibly easy for content creators like me to share their work with the
community.

Finally, I’d like to thank my wife, Jerusa, for always being supportive throughout
the entire writing of this series of books, and for taking the time to read every single
page in it :-)

About the Author

Daniel is a data scientist, developer, writer, and teacher. He has been teaching
machine learning and distributed computing technologies at Data Science Retreat,
the longest-running Berlin-based bootcamp, since 2016, helping more than 150
students advance their careers.

Daniel is also the main contributor of two Python packages: HandySpark and
DeepReplay.

His professional background includes 20 years of experience working for companies in several industries: banking, government, fintech, retail, and mobility.



Frequently Asked Questions (FAQ)
Why PyTorch?
First, coding in PyTorch is fun :-) Really, there is something to it that makes it very
enjoyable to write code in. Some say it is because it is very pythonic, or maybe
there is something else, who knows? I hope that, by the end of this book, you feel
like that too!

Second, maybe there are even some unexpected benefits to your health—check
Andrej Karpathy’s tweet[3] about it!

Jokes aside, PyTorch is the fastest-growing[4] framework for developing deep learning models, and it has a huge ecosystem.[5] That is, there are many tools and libraries developed on top of PyTorch. It is already the preferred framework[6] in academia and is making its way into industry.

Several companies are already powered by PyTorch;[7] to name a few:

• Facebook: The company is the original developer of PyTorch, released in October 2016.

• Tesla: Watch Andrej Karpathy (AI director at Tesla) speak about "how Tesla is
using PyTorch to develop full self-driving capabilities for its vehicles" in this video.[8]

• OpenAI: In January 2020, OpenAI decided to standardize its deep learning framework on PyTorch.[9]

• fastai: fastai[10] is a library built on top of PyTorch to simplify model training and is used in its "Practical Deep Learning for Coders"[11] course. The fastai library is deeply connected to PyTorch and "you can’t become really proficient at using fastai if you don’t know PyTorch well, too."[12]

• Uber: The company is a significant contributor to PyTorch’s ecosystem, having developed libraries like Pyro[13] (probabilistic programming) and Horovod[14] (a distributed training framework).

• Airbnb: PyTorch sits at the core of the company’s dialog assistant for customer
service.[15]

This series of books aims to get you started with PyTorch while giving you a solid
understanding of how it works.

Why This Book?
If you’re looking for a book where you can learn about deep learning and PyTorch
without having to spend hours deciphering cryptic text and code, and one that’s
easy and enjoyable to read, this is it :-)

First, this is not a typical book: most tutorials start with some nice and pretty image
classification problem to illustrate how to use PyTorch. It may seem cool, but I
believe it distracts you from the main goal: learning how PyTorch works. In this
book, I present a structured, incremental, and from-first-principles approach to
learn PyTorch.

Second, this is not a formal book in any way: I am writing this book as if I were
having a conversation with you, the reader. I will ask you questions (and give you
answers shortly afterward), and I will also make (silly) jokes.

My job here is to make you understand the topic, so I will avoid fancy
mathematical notation as much as possible and spell it out in plain English.

In this first book of the Deep Learning with PyTorch Step-by-Step series, I will guide
you through the development of many models in PyTorch, showing you why
PyTorch makes it much easier and more intuitive to build models in Python:
autograd, dynamic computation graph, model classes, and much, much more.

We will build, step-by-step, not only the models themselves but also your
understanding as I show you both the reasoning behind the code and how to avoid
some common pitfalls and errors along the way.

There is yet another advantage of focusing on the basics: this book is likely to have
a longer shelf life. It is fairly common for technical books, especially those focusing
on cutting-edge technology, to become outdated quickly. Hopefully, this is not
going to be the case here, since the underlying mechanics are not changing and
neither are the concepts. It is expected that some syntax changes over time, but I
do not see backward compatibility-breaking changes coming anytime soon.

One more thing: If you hadn’t noticed already, I really like to make use of visual cues, that is, bold and italic highlights. I firmly believe this helps the reader to grasp the key ideas I am trying to convey in a sentence more easily. You can find more on that in the section "How to Read This Book."



Who Should Read This Book?
I wrote this book for beginners in general—not only PyTorch beginners. Every now
and then, I will spend some time explaining some fundamental concepts that I
believe are essential to have a proper understanding of what’s going on in the
code.

The best example is gradient descent, which most people are familiar with at some
level. Maybe you know its general idea, perhaps you’ve seen it in Andrew Ng’s
Machine Learning course, or maybe you’ve even computed some partial
derivatives yourself!

In real life, the mechanics of gradient descent will be handled automatically by PyTorch (uh, spoiler alert!). But, I will walk you through it anyway (unless you choose to skip Chapter 0 altogether, of course), because lots of elements in the code, as well as choices of hyper-parameters (learning rate, mini-batch size, etc.), can be much more easily understood if you know where they come from.

Maybe you already know some of these concepts well: If this is the case, you can
simply skip them, since I’ve made these explanations as independent as possible
from the rest of the content.

But I want to make sure everyone is on the same page, so, if you have just heard
about a given concept or if you are unsure if you have entirely understood it, these
explanations are for you.

What Do I Need to Know?


This is a book for beginners, so I am assuming as little prior knowledge as possible;
as mentioned in the previous section, I will take the time to explain fundamental
concepts whenever needed.

That being said, this is what I expect from you, the reader:

• to be able to code in Python (if you are familiar with object-oriented programming [OOP], even better)

• to be able to work with the PyData stack (numpy, matplotlib, and pandas) and Jupyter notebooks

• to be familiar with some basic concepts used in machine learning, like:



◦ supervised learning: regression and classification
◦ loss functions for regression and classification (mean squared error, cross-entropy, etc.)

◦ training-validation-test split
◦ underfitting and overfitting (bias-variance trade-off)

Even so, I am still briefly touching on some of these topics, but I need to draw a line
somewhere; otherwise, this book would be gigantic!
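For a quick refresher on one of these concepts, this is how mean squared error could be computed for a handful of predictions; the numbers below are made up purely for illustration:

```python
import numpy as np

# made-up labels and predictions, just for illustration
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])

# mean squared error: the average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)
```

The same one-liner pattern works for any pair of label and prediction arrays of matching shape.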

How to Read This Book


Since this book is a beginner’s guide, it is meant to be read sequentially, as ideas
and concepts are progressively built. The same holds true for the code inside the
book—you should be able to reproduce all outputs, provided you execute the
chunks of code in the same order as they are introduced.

This book is visually different than other books: As I’ve mentioned already in the
"Why This Book?" section, I really like to make use of visual cues. Although this is
not, strictly speaking, a convention, this is how you can interpret those cues:

• I use bold to highlight what I believe to be the most relevant words in a sentence or paragraph, while italicized words are considered important too (not important enough to be bold, though)

• Variables, coefficients, and parameters in general, are italicized


• Classes and methods are written in a monospaced font, and they link to PyTorch documentation[16] the first time they are introduced, so you can easily follow it (unlike other links in this book, links to documentation are numerous and thus not included in the footnotes)

• Every code cell is followed by another cell showing the corresponding outputs
(if any)

• All code presented in the book is available at its official repository on GitHub:

https://github.com/dvgodoy/PyTorchStepByStep



Code cells with titles are an important piece of the workflow:

Title Goes Here

# Whatever is being done here is going to impact OTHER code
# cells. Besides, most cells have COMMENTS explaining what
# is happening
x = [1., 2., 3.]
print(x)

If there is any output to the code cell, titled or not, there will be another code cell
depicting the corresponding output so you can check if you successfully
reproduced it or not.

Output

[1.0, 2.0, 3.0]

Some code cells do not have titles—running them does not affect the workflow:

# Those cells illustrate HOW TO CODE something, but they are
# NOT part of the main workflow
dummy = ['a', 'b', 'c']
print(dummy[::-1])

But even these cells have their outputs shown!

Output

['c', 'b', 'a']

I use asides to communicate a variety of things, according to the corresponding icon:

WARNING
Potential problems or things to look out for.



TIP
Key aspects I really want you to remember.

INFORMATION
Important information to pay attention to.

IMPORTANT
Really important information to pay attention to.

TECHNICAL

Technical aspects of parameterization or inner workings of algorithms.

QUESTION AND ANSWER

Asking myself questions (pretending to be you, the reader) and answering them, either in the same block or shortly after.

DISCUSSION
Really brief discussion on a concept or topic.

LATER
Important topics that will be covered in more detail later.

SILLY
Jokes, puns, memes, quotes from movies.

What’s Next?
It’s time to set up an environment for your learning journey using the Setup Guide.

[3] https://bit.ly/2MQoYRo
[4] https://bit.ly/37uZgLB
[5] https://pytorch.org/ecosystem/
[6] https://bit.ly/2MTN0Lh
[7] https://bit.ly/2UFHFve
[8] https://bit.ly/2XXJkyo
[9] https://openai.com/blog/openai-pytorch/



[10] https://docs.fast.ai/
[11] https://course.fast.ai/
[12] https://course.fast.ai/
[13] http://pyro.ai/
[14] https://github.com/horovod/horovod
[15] https://bit.ly/30CPhm5
[16] https://bit.ly/3cT1aH2

Setup Guide
Official Repository
This book’s official repository is available on GitHub:

https://github.com/dvgodoy/PyTorchStepByStep

It contains one Jupyter notebook for every chapter in this book. Each notebook
contains all the code shown in its corresponding chapter, and you should be able to
run its cells in sequence to get the same outputs, as shown in the book. I strongly
believe that being able to reproduce the results brings confidence to the reader.

Even though I did my best to ensure the reproducibility of the results, you may still find some minor differences in your outputs (especially during model training). Unfortunately, completely reproducible results are not guaranteed across PyTorch releases, and results may not be reproducible between CPU and GPU executions, even when using identical seeds.[17]
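To illustrate what a seed actually buys you (using Numpy here; the same principle applies to PyTorch’s torch.manual_seed), two generators created with the same seed produce identical sequences of pseudo-random numbers:

```python
import numpy as np

# two generators created with the same seed...
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)

# ...produce exactly the same sequence of "random" numbers
a = rng1.normal(size=3)
b = rng2.normal(size=3)
print(np.allclose(a, b))  # True
```

Identical seeds only guarantee identical sequences within the same library version and device, which is exactly the caveat about PyTorch releases and CPU vs. GPU executions.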

Environment
There are three options for you to run the Jupyter notebooks:

• Google Colab (https://colab.research.google.com)
• Binder (https://mybinder.org)
• Local Installation

Let’s briefly explore the pros and cons of each of these options.

Google Colab

Google Colab "allows you to write and execute Python in your browser, with zero
configuration required, free access to GPUs and easy sharing."[18].

You can easily load notebooks directly from GitHub using Colab’s special URL
(https://colab.research.google.com/github/). Just type in the GitHub’s user or
organization (like mine, dvgodoy), and it will show you a list of all its public
repositories (like this book’s, PyTorchStepByStep).

After choosing a repository, it will list the available notebooks and corresponding
links to open them in a new browser tab.

Figure S.1 - Google Colab’s special URL

You also get access to a GPU, which is very useful to train deep learning models
faster. More important, if you make changes to the notebook, Google Colab will
keep them. The whole setup is very convenient; the only cons I can think of are:

• You need to be logged in to a Google account.
• You need to (re)install Python packages that are not part of Google Colab’s default configuration.

Binder

Binder "allows you to create custom computing environments that can be shared and
used by many remote users."[19]

You can also load notebooks directly from GitHub, but the process is slightly
different. Binder will create something like a virtual machine (technically, it is a
container, but let’s leave it at that), clone the repository, and start Jupyter. This
allows you to have access to Jupyter’s home page in your browser, just like you
would if you were running it locally, but everything is running in a JupyterHub
server on their end.

Just go to Binder’s site (https://mybinder.org/) and type in the URL to the GitHub
repository you want to explore (for instance,
https://github.com/dvgodoy/PyTorchStepByStep) and click on Launch. It will take
a couple of minutes to build the image and open Jupyter’s home page.

You can also launch Binder for this book’s repository directly using the following link: https://mybinder.org/v2/gh/dvgodoy/PyTorchStepByStep/master.

Figure S.2 - Binder’s page

Binder is very convenient since it does not require a prior setup of any kind. Any
Python packages needed to successfully run the environment are likely installed
during launch (if provided by the author of the repository).

On the other hand, it may take time to start, and it does not keep your changes
after your session expires (so, make sure you download any notebooks you
modify).

Local Installation

This option will give you more flexibility, but it will require more effort to set up. I
encourage you to try setting up your own environment. It may seem daunting at
first, but you can surely accomplish it by following seven easy steps:

Checklist

☐ 1. Install Anaconda.
☐ 2. Create and activate a virtual environment.
☐ 3. Install PyTorch package.
☐ 4. Install TensorBoard package.
☐ 5. Install GraphViz software and TorchViz package (optional).
☐ 6. Install git and clone the repository.
☐ 7. Start Jupyter notebook.

1. Anaconda

If you don’t have Anaconda’s Individual Edition[20] installed yet, this would be a
good time to do it. It is a convenient way to start since it contains most of the
Python libraries a data scientist will ever need to develop and train models.

Please follow the installation instructions for your OS:

• Windows (https://docs.anaconda.com/anaconda/install/windows/)
• macOS (https://docs.anaconda.com/anaconda/install/mac-os/)
• Linux (https://docs.anaconda.com/anaconda/install/linux/)

Make sure you choose the Python 3.X version, since Python 2 was discontinued in January 2020.

After installing Anaconda, it is time to create an environment.

2. Conda (Virtual) Environments

Virtual environments are a convenient way to isolate Python installations associated with different projects.

 "What is an environment?"

It is pretty much a replication of Python itself and some (or all) of its libraries, so,
effectively, you’ll end up with multiple Python installations on your computer.

 "Why can’t I just use one single Python installation for everything?"

With so many independently developed Python libraries, each having many different versions and each version having various dependencies (on other libraries), things can get out of hand real quick.

It is beyond the scope of this guide to debate these issues, but take my word for it
(or Google it!)—you’ll benefit a great deal if you pick up the habit of creating a
different environment for every project you start working on.

 "How do I create an environment?"

First, you need to choose a name for your environment :-) Let’s call ours pytorchbook (or anything else you find easy to remember). Then, you need to open
a terminal (in Ubuntu) or Anaconda Prompt (in Windows or macOS) and type the
following command:

$ conda create -n pytorchbook anaconda

The command above creates a Conda environment named pytorchbook and includes all Anaconda packages in it (time to get a coffee, it will take a while…). If you want to learn more about creating and using Conda environments, please check Anaconda’s "Managing Environments"[21] user guide.

Did it finish creating the environment? Good! It is time to activate it, meaning,
making that Python installation the one to be used now. In the same terminal (or
Anaconda prompt), just type:

$ conda activate pytorchbook

Your prompt should look like this (if you’re using Linux):

(pytorchbook)$

or like this (if you’re using Windows):

(pytorchbook)C:\>

Done! You are using a brand new Conda environment now. You’ll need to activate
it every time you open a new terminal, or, if you’re a Windows or macOS user, you
can open the corresponding Anaconda prompt (it will show up as Anaconda
Prompt (pytorchbook), in our case), which will have it activated from the start.

IMPORTANT: From now on, I am assuming you’ll activate the pytorchbook environment every time you open a terminal or Anaconda prompt. Further installation steps must be executed inside the environment.

3. PyTorch

PyTorch is the coolest deep learning framework, just in case you skipped the
introduction.

It is "an open source machine learning framework that accelerates the path from
research prototyping to production deployment."[22] Sounds good, right? Well, I
probably don’t have to convince you at this point :-)

It is time to install the star of the show :-) We can go straight to the Start Locally
(https://pytorch.org/get-started/locally/) section of PyTorch’s website, and it will
automatically select the options that best suit your local environment, and it will
show you the command to run.

Figure S.3 - PyTorch’s Start Locally page

Some of these options are given:

• PyTorch Build: Always select a Stable version.
• Package: I am assuming you’re using Conda.
• Language: Obviously, Python.

So, two options remain: Your OS and CUDA.

 "What is CUDA?" you ask.

Using GPU / CUDA

CUDA "is a parallel computing platform and programming model developed by NVIDIA
for general computing on graphical processing units (GPUs)."[23]

If you have a GPU in your computer (likely a GeForce graphics card), you can
leverage its power to train deep learning models much faster than using a CPU. In
this case, you should choose a PyTorch installation that includes CUDA support.

This is not enough, though: If you haven’t done so yet, you need to install up-to-
date drivers, the CUDA Toolkit, and the CUDA Deep Neural Network library
(cuDNN). Unfortunately, more detailed installation instructions for CUDA are
outside the scope of this book.

The advantage of using a GPU is that it allows you to iterate faster and experiment
with more-complex models and a more extensive range of hyper-parameters.

In my case, I use Linux, and I have a GPU with CUDA version 10.2 installed. So I
would run the following command in the terminal (after activating the
environment):

(pytorchbook)$ conda install pytorch torchvision \
cudatoolkit=10.2 -c pytorch

Using CPU

If you do not have a GPU, you should choose None for CUDA.

 "Would I be able to run the code without a GPU?" you ask.

Sure! The code and the examples in this book were designed to allow all readers to
follow them promptly. Some examples may demand a bit more computing power,
but we are talking about a couple of minutes in a CPU, not hours. If you do not have
a GPU, don’t worry! Besides, you can always use Google Colab if you need to use a
GPU for a while!

If I had a Windows computer, and no GPU, I would have to run the following
command in the Anaconda prompt (pytorchbook):

(pytorchbook) C:\> conda install pytorch torchvision cpuonly\
-c pytorch

Installing CUDA

CUDA: Installing drivers for a GeForce graphics card, NVIDIA’s cuDNN, and
CUDA Toolkit can be challenging and is highly dependent on which model
you own and which OS you use.

For installing GeForce’s drivers, go to GeForce’s website (https://www.geforce.com/drivers), select your OS and the model of your graphics card, and follow the installation instructions.

For installing NVIDIA’s CUDA Deep Neural Network library (cuDNN), you
need to register at https://developer.nvidia.com/cudnn.

For installing the CUDA Toolkit (https://developer.nvidia.com/cuda-toolkit), please follow the instructions for your OS and choose a local installer or executable file.

macOS: If you’re a macOS user, please beware that PyTorch’s binaries DO NOT support CUDA, meaning you’ll need to install PyTorch from source if you want to use your GPU. This is a somewhat complicated process (as described in https://github.com/pytorch/pytorch#from-source), so, if you don’t feel like doing it, you can choose to proceed without CUDA, and you’ll still be able to execute the code in this book promptly.

4. TensorBoard

TensorBoard is TensorFlow’s visualization toolkit, and it "provides the visualization and tooling needed for machine learning experimentation."[24]

TensorBoard is a powerful tool, and we can use it even if we are developing models
in PyTorch. Luckily, you don’t need to install the whole TensorFlow to get it; you
can easily install TensorBoard alone using Conda. You just need to run this
command in your terminal or Anaconda prompt (again, after activating the
environment):

(pytorchbook)$ conda install -c conda-forge tensorboard

5. GraphViz and Torchviz (optional)

This step is optional, mostly because the installation of GraphViz can sometimes be challenging (especially on Windows). If for any reason you do not succeed in installing it correctly, or if you decide to skip this installation step, you will still be able to execute the code in this book (except for a couple of cells that generate images of a model’s structure in the "Dynamic Computation Graph" section of Chapter 1).

GraphViz is open source graph visualization software. It is "a way of representing structural information as diagrams of abstract graphs and networks."[25]

We need to install GraphViz to use TorchViz, a neat package that allows us to visualize a model’s structure. Please check the installation instructions for your OS at https://www.graphviz.org/download/.

If you are using Windows, please use GraphViz’s Windows Package installer at https://graphviz.gitlab.io/_pages/Download/windows/graphviz-2.38.msi.

You also need to add GraphViz to the PATH (environment variable) in Windows. Most likely, you can find the GraphViz executable file at C:\Program Files (x86)\Graphviz2.38\bin. Once you find it, you need to set or change the PATH accordingly, adding GraphViz’s location to it.

For more details on how to do that, please refer to "How to Add to Windows PATH Environment Variable."[26]

For additional information, you can also check the "How to Install Graphviz
Software"[27] guide.

After installing GraphViz, you can install the torchviz[28] package. This package is
not part of Anaconda Distribution Repository[29] and is only available at PyPI[30], the
Python Package Index, so we need to pip install it.

Once again, open a terminal or Anaconda prompt and run this command (just once
more: after activating the environment):

(pytorchbook)$ pip install torchviz

To check your GraphViz / TorchViz installation, you can try the Python code below:

(pytorchbook)$ python

Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from torchviz import make_dot
>>> v = torch.tensor(1.0, requires_grad=True)
>>> make_dot(v)

If everything is working correctly, you should see something like this:

Output

<graphviz.dot.Digraph object at 0x7ff540c56f50>

If you get an error of any kind (the one below is pretty common), it means there is
still some kind of installation issue with GraphViz.

Output

ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make
sure the Graphviz executables are on your systems' PATH

6. Git

It is way beyond this guide’s scope to introduce you to version control and its most
popular tool: git. If you are familiar with it already, great, you can skip this section
altogether!

Otherwise, I’d recommend you to learn more about it; it will definitely be useful for
you later down the line. In the meantime, I will show you the bare minimum so you
can use git to clone the repository containing all code used in this book and get
your own, local copy of it to modify and experiment with as you please.

First, you need to install it. So, head to its downloads page (https://git-scm.com/
downloads) and follow instructions for your OS. Once the installation is complete,
please open a new terminal or Anaconda prompt (it’s OK to close the previous
one). In the new terminal or Anaconda prompt, you should be able to run git
commands.

To clone this book’s repository, you only need to run:

(pytorchbook)$ git clone https://github.com/dvgodoy/\
PyTorchStepByStep.git

The command above will create a PyTorchStepByStep folder that contains a local
copy of everything available on GitHub’s repository.

conda install vs pip install

Although they may seem equivalent at first sight, you should prefer conda
install over pip install when working with Anaconda and its virtual
environments.

This is because conda install is sensitive to the active virtual environment: The package will be installed only for that environment. If you use pip install, and pip itself is not installed in the active environment, it will fall back to the global pip, and you definitely do not want that.

Why not? Remember the problem with dependencies I mentioned in the virtual environment section? That’s why! The conda installer assumes it handles all packages that are part of its repository and keeps track of the complicated network of dependencies among them (to learn more about this, check this link[31]).

To learn more about the differences between conda and pip, read
"Understanding Conda and Pip."[32]

As a rule, first try to conda install a given package and, only if it does not
exist there, fall back to pip install, as we did with torchviz.

7. Jupyter

After cloning the repository, navigate to the PyTorchStepByStep folder and, once
inside it, start Jupyter on your terminal or Anaconda prompt:

(pytorchbook)$ jupyter notebook

This will open your browser, and you will see Jupyter’s home page containing the
repository’s notebooks and code.

Figure S.4 - Running Jupyter

Moving On
Regardless of which of the three environments you chose, now you are ready to
move on and tackle the development of your first PyTorch model, step-by-step!

[17] https://pytorch.org/docs/stable/notes/randomness.html
[18] https://colab.research.google.com/notebooks/intro.ipynb
[19] https://mybinder.readthedocs.io/en/latest/
[20] https://www.anaconda.com/products/individual
[21] https://bit.ly/2MVk0CM
[22] https://pytorch.org/
[23] https://developer.nvidia.com/cuda-zone
[24] https://www.tensorflow.org/tensorboard
[25] https://www.graphviz.org/
[26] https://bit.ly/3fIwYA5

[27] https://bit.ly/30Ayct3
[28] https://github.com/szagoruyko/pytorchviz
[29] https://docs.anaconda.com/anaconda/packages/pkg-docs/
[30] https://pypi.org/
[31] https://bit.ly/37onBTt
[32] https://bit.ly/2AAh8J5

Chapter 0
Visualizing Gradient Descent
Spoilers
In this chapter, we will:

• define a simple linear regression model
• walk through every step of gradient descent: initializing parameters, performing a forward pass, computing errors and loss, computing gradients, and updating parameters

• understand gradients using equations, code, and geometry
• understand the difference between batch, mini-batch, and stochastic gradient descent
• visualize the effects on the loss of using different learning rates
• understand the importance of standardizing / scaling features
• and much, much more!

There is no actual PyTorch code in this chapter… it is Numpy all along because our
focus here is to understand, inside and out, how gradient descent works. PyTorch
will be introduced in the next chapter.

Jupyter Notebook
The Jupyter notebook corresponding to Chapter 0[33] is part of the official Deep
Learning with PyTorch Step-by-Step repository on GitHub. You can also run it
directly in Google Colab[34].

If you’re using a local installation, open your terminal or Anaconda prompt and
navigate to the PyTorchStepByStep folder you cloned from GitHub. Then, activate
the pytorchbook environment and run jupyter notebook:

$ conda activate pytorchbook

(pytorchbook)$ jupyter notebook

If you’re using Jupyter’s default settings, this link should open Chapter 0’s notebook. If not, just click on Chapter00.ipynb on your Jupyter’s home page.

Imports

For the sake of organization, all libraries needed throughout the code used in any
given chapter are imported at its very beginning. For this chapter, we’ll need the
following imports:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

Visualizing Gradient Descent


According to Wikipedia[35]: "Gradient descent is a first-order
iterative optimization algorithm for finding a local minimum of a
differentiable function."

But I would go with: "Gradient descent is an iterative technique
commonly used in machine learning and deep learning to find the
best possible set of parameters / coefficients for a given model,
data points, and loss function, starting from an initial, and usually
random, guess."

"Why visualize gradient descent?"

I believe the way gradient descent is usually explained lacks intuition. Students and
beginners are left with a bunch of equations and rules of thumb—this is not the way
one should learn such a fundamental topic.

If you really understand how gradient descent works, you will also understand how
the characteristics of your data and your choice of hyper-parameters (mini-batch
size and learning rate, for instance) have an impact on how well and how fast your
model trains.

By really understanding, I do not mean working through the equations manually: this
does not develop intuition either. I mean visualizing the effects of different
settings; I mean telling a story to illustrate the concept. That’s how you develop
intuition.



That being said, I’ll cover the five basic steps you’d need to go through to use
gradient descent. I’ll show you the corresponding Numpy code while explaining lots
of fundamental concepts along the way.

But first, we need some data to work with. Instead of using some external dataset,
let’s:

• define which model we want to train, to better understand gradient descent; and
• generate synthetic data for that model.

Model
The model must be simple and familiar, so you can focus on the inner workings of
gradient descent.

So, I will stick with a model as simple as it can be: a linear regression with a single
feature, x!

y = b + w x + ε

Equation 0.1 - Simple linear regression model

In this model, we use a feature (x) to try to predict the value of a label (y). There are
three elements in our model:

• parameter b, the bias (or intercept), which tells us the expected average value of
y when x is zero

• parameter w, the weight (or slope), which tells us how much y increases, on
average, if we increase x by one unit

• and that last term (why does it always have to be a Greek letter?), epsilon, which
is there to account for the inherent noise; that is, the error we cannot get rid of

We can also conceive the very same model structure in a less abstract way:

salary = minimum wage + increase per year * years of experience + noise

And to make it even more concrete, let’s say that the minimum wage is $1,000
(whatever the currency or time frame, this is not important). So, if you have no
experience, your salary is going to be the minimum wage (parameter b).

Also, let’s say that, on average, you get a $2,000 increase (parameter w) for every
year of experience you have. So, if you have two years of experience, you are
expected to earn a salary of $5,000. But your actual salary is $5,600 (lucky you!).
Since the model cannot account for those extra $600, your extra money is,
technically speaking, noise.
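The salary story is just the model equation with concrete numbers plugged in. As a quick sketch of that arithmetic (the values below are the made-up ones from the analogy, not real data):

```python
# Made-up values from the salary analogy above
b = 1000   # minimum wage (the bias / intercept)
w = 2000   # increase per year of experience (the weight / slope)

years = 2
expected_salary = b + w * years   # what the model predicts: 5000
actual_salary = 5600              # the observed label

# The part the model cannot account for is, technically speaking, noise
noise = actual_salary - expected_salary
print(expected_salary, noise)  # 5000 600
```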

Data Generation
We know our model already. In order to generate synthetic data for it, we need to
pick values for its parameters. I chose b = 1 and w = 2 (as in thousands of dollars)
from the example above.

First, let’s generate our feature (x): We use Numpy's rand() method to randomly
generate 100 (N) points between 0 and 1.

Then, we plug our feature (x) and our parameters b and w into our equation to
compute our labels (y). But we need to add some Gaussian noise[36] (epsilon) as well;
otherwise, our synthetic dataset would be a perfectly straight line. We can
generate noise using Numpy's randn() method, which draws samples from a normal
distribution (of mean 0 and variance 1), and then multiply it by a factor to adjust for
the level of noise. Since I don’t want to add too much noise, I picked 0.1 as my
factor.

Data Generation

1 true_b = 1
2 true_w = 2
3 N = 100
4
5 # Data Generation
6 np.random.seed(42)
7 x = np.random.rand(N, 1)
8 epsilon = (.1 * np.random.randn(N, 1))
9 y = true_b + true_w * x + epsilon

Did you notice the np.random.seed(42) at line 6? This line of code is actually more
important than it looks. It guarantees that, every time we run this code, the same
random numbers will be generated.



"Wait; what?! Aren’t the numbers supposed to be random? How could
they possibly be the same numbers?" you ask, perhaps even a bit
annoyed by this.

(Not So) Random Numbers

Well, you know, random numbers are not quite random… They are really
pseudo-random, which means Numpy's number generator spits out a
sequence of numbers that looks like it’s random. But it is not, really.

The good thing about this behavior is that we can tell the generator to start a
particular sequence of pseudo-random numbers. To some extent, it works
as if we tell the generator: "please generate sequence #42," and it will spill out
a sequence of numbers. That number, 42, which works like the index of the
sequence, is called a seed. Every time we give it the same seed, it generates
the same numbers.

This means we have the best of both worlds: On the one hand, we do
generate a sequence of numbers that, for all intents and purposes, is
considered to be random; on the other hand, we have the power to
reproduce any given sequence. I cannot stress enough how convenient that
is for debugging purposes!

Moreover, you can guarantee that other people will be able to reproduce
your results. Imagine how annoying it would be to run the code in this book
and get different outputs every time, having to wonder if there is anything
wrong with it. But since I’ve set a seed, you and I can achieve the very same
outputs, even if it involved generating random data!
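You can check this behavior yourself: resetting the seed restarts the sequence, so two seeded runs produce exactly the same "random" numbers.

```python
import numpy as np

np.random.seed(42)
first_draw = np.random.rand(3)

np.random.seed(42)  # same seed, so the generator restarts sequence #42
second_draw = np.random.rand(3)

# Both draws are identical, element by element
print(np.allclose(first_draw, second_draw))  # True
```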

Next, let’s split our synthetic data into train and validation sets, shuffling the array
of indices and using the first 80 shuffled points for training.

"Why do you need to shuffle randomly generated data points? Aren’t
they random enough?"

Yes, they are random enough, and shuffling them is indeed redundant in this
example. But it is best practice to always shuffle your data points before training a
model to improve the performance of gradient descent.

There is one exception to the "always shuffle" rule, though: time
series problems, where shuffling can lead to data leakage.

Train-Validation-Test Split

It is beyond the scope of this book to explain the reasoning behind the train-
validation-test split, but there are two points I’d like to make:

1. The split should always be the first thing you do—no preprocessing, no
transformations; nothing happens before the split. That’s why we do this
immediately after the synthetic data generation.
2. In this chapter we will use only the training set, so I did not bother to create a
test set, but I performed a split nonetheless to highlight point #1 :-)

Train-Validation Split

# Shuffles the indices
idx = np.arange(N)
np.random.shuffle(idx)

# Uses first 80 random indices for train
train_idx = idx[:int(N*.8)]
# Uses the remaining indices for validation
val_idx = idx[int(N*.8):]

# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

"Why didn’t you use train_test_split() from Scikit-Learn?" you
may be asking.

That’s a fair point. Later on, we will refer to the indices of the data points belonging
to either train or validation sets, instead of the points themselves. So, I thought of
using them from the very start.
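Just for reference, here is a sketch of the alternative we are not using: Scikit-Learn’s one-liner split. It does the shuffling and splitting in a single call, but it does not hand you the indices themselves.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same synthetic data as before
true_b, true_w, N = 1, 2, 100
np.random.seed(42)
x = np.random.rand(N, 1)
y = true_b + true_w * x + (.1 * np.random.randn(N, 1))

# One-liner split: shuffles by default; random_state plays the role of the seed
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=.2, random_state=42
)
print(x_train.shape, x_val.shape)  # (80, 1) (20, 1)
```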



Figure 0.1 - Synthetic data: train and validation sets

We know that b = 1, w = 2, but now let’s see how close we can get to the true
values by using gradient descent and the 80 points in the training set (for training,
N = 80).

Step 0 - Random Initialization


In our example, we already know the true values of the parameters, but this will
obviously never happen in real life: If we knew the true values, why even bother to
train a model to find them?!

OK, given that we’ll never know the true values of the parameters, we need to set
initial values for them. How do we choose them? It turns out a random guess is as
good as any other.

Even though the initialization is random, there are some clever
initialization schemes that should be used when training more-complex
models. We’ll get back to them (much) later, in the
second volume of the series.

For training a model, you need to randomly initialize the parameters / weights (we
have only two, b and w).



Random Initialization

# Step 0 - Initializes parameters "b" and "w" randomly
np.random.seed(42)
b = np.random.randn(1)
w = np.random.randn(1)

print(b, w)

Output

[0.49671415] [-0.1382643]

Step 1 - Compute Model’s Predictions


This is the forward pass; it simply computes the model’s predictions using the current
values of the parameters / weights. At the very beginning, we will be producing really
bad predictions, as we started with random values in Step 0.

Step 1

# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train

Figure 0.2 - Model’s predictions (with random parameters)



Step 2 - Compute the Loss
There is a subtle but fundamental difference between error and loss.

The error is the difference between the actual value (label) and the predicted
value computed for a single data point. So, for a given i-th point (from our dataset
of N points), its error is:

error_i = ŷ_i - y_i = (b + w x_i) - y_i

Equation 0.2 - Error

The error of the first point in our dataset (i = 0) can be represented like this:

Figure 0.3 - Prediction error (for one data point)

The loss, on the other hand, is some sort of aggregation of errors for a set of data
points.

It seems rather obvious to compute the loss for all (N) data points, right? Well, yes
and no. Although it will surely yield a more stable path from the initial random
parameters to the parameters that minimize the loss, it will also surely be slow.

This means one needs to sacrifice (a bit of) stability for the sake of speed. This is easily
accomplished by randomly choosing (without replacement) a subset of n out of N
data points each time we compute the loss.



Batch, Mini-batch, and Stochastic Gradient Descent

• If we use all points in the training set (n = N) to compute the
loss, we are performing a batch gradient descent.
• If we were to use a single point (n = 1) each time, it would be a
stochastic gradient descent.
• Anything else (n) in between 1 and N characterizes a mini-batch
gradient descent.
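Drawing a mini-batch boils down to sampling n indices out of N without replacement. A sketch of how one could do it in Numpy (the batch_size name is mine; this chapter’s code sticks with all N points):

```python
import numpy as np

np.random.seed(42)
N = 80           # number of points in the training set
batch_size = 16  # n: any value between 1 and N means mini-batch gradient descent

# Randomly chooses "batch_size" distinct indices (without replacement)
mini_batch_idx = np.random.choice(N, size=batch_size, replace=False)

print(mini_batch_idx.shape)      # (16,)
print(len(set(mini_batch_idx)))  # 16, so no point was picked twice
```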

For a regression problem, the loss is given by the mean squared error (MSE); that
is, the average of all squared errors; that is, the average of all squared differences
between labels (y) and predictions (b + wx).

MSE = (1/n) Σ_i error_i² = (1/n) Σ_i (ŷ_i - y_i)² = (1/n) Σ_i (b + w x_i - y_i)²

Equation 0.3 - Loss: mean squared error (MSE)

In the code below, we are using all data points of the training set to compute the
loss, so n = N = 80, meaning we are indeed performing batch gradient descent.

Step 2

# Step 2 - Computing the loss
# We are using ALL data points, so this is BATCH gradient
# descent. How wrong is our model? That's the error!
error = (yhat - y_train)

# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()

print(loss)



Output

2.7421577700550976

Loss Surface

We have just computed the loss (2.74) corresponding to our randomly initialized
parameters (b = 0.49 and w = -0.13). What if we did the same for ALL possible
values of b and w? Well, not all possible values, but all combinations of evenly spaced
values in a given range, like:

# Reminder:
# true_b = 1
# true_w = 2

# we have to split the ranges in 100 evenly spaced intervals each
b_range = np.linspace(true_b - 3, true_b + 3, 101)
w_range = np.linspace(true_w - 3, true_w + 3, 101)
# meshgrid is a handy function that generates a grid of b and w
# values for all combinations
bs, ws = np.meshgrid(b_range, w_range)
bs.shape, ws.shape

Output

((101, 101), (101, 101))

The result of the meshgrid() operation was two (101, 101) matrices representing
the values of each parameter inside a grid. What does one of these matrices look
like?

bs



Output

array([[-2. , -1.94, -1.88, ..., 3.88, 3.94, 4. ],
       [-2. , -1.94, -1.88, ..., 3.88, 3.94, 4. ],
       [-2. , -1.94, -1.88, ..., 3.88, 3.94, 4. ],
       ...,
       [-2. , -1.94, -1.88, ..., 3.88, 3.94, 4. ],
       [-2. , -1.94, -1.88, ..., 3.88, 3.94, 4. ],
       [-2. , -1.94, -1.88, ..., 3.88, 3.94, 4. ]])

Sure, we’re somewhat cheating here, since we know the true values of b and w, so
we can choose the perfect ranges for the parameters. But it is for educational
purposes only :-)

Next, we could use those values to compute the corresponding predictions, errors,
and losses. Let’s start by taking a single data point from the training set and
computing the predictions for every combination in our grid:

dummy_x = x_train[0]
dummy_yhat = bs + ws * dummy_x
dummy_yhat.shape

Output

(101, 101)

Thanks to its broadcasting capabilities, Numpy is able to understand we want to
multiply the same x value by every entry in the ws matrix. This operation resulted
in a grid of predictions for that single data point. Now we need to do this for every
one of our 80 data points in the training set.

We can use Numpy's apply_along_axis() to accomplish this:

Look ma, no loops!



all_predictions = np.apply_along_axis(
func1d=lambda x: bs + ws * x,
axis=1,
arr=x_train,
)
all_predictions.shape

Output

(80, 101, 101)

Cool! We got 80 matrices of shape (101, 101), one matrix for each data point, each
matrix containing a grid of predictions.
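By the way, apply_along_axis() is convenient but not particularly fast, since it loops in Python under the hood. The same grid of predictions can be produced with broadcasting alone by giving x_train two extra singleton dimensions. A sketch of the equivalent computation (rebuilding the pieces so it runs on its own):

```python
import numpy as np

# Rebuilds data, split, and parameter grid from the previous steps
true_b, true_w, N = 1, 2, 100
np.random.seed(42)
x = np.random.rand(N, 1)
y = true_b + true_w * x + (.1 * np.random.randn(N, 1))
idx = np.arange(N)
np.random.shuffle(idx)
x_train = x[idx[:int(N*.8)]]

b_range = np.linspace(true_b - 3, true_b + 3, 101)
w_range = np.linspace(true_w - 3, true_w + 3, 101)
bs, ws = np.meshgrid(b_range, w_range)

# The apply_along_axis() version, as in the text
all_predictions = np.apply_along_axis(
    func1d=lambda x: bs + ws * x, axis=1, arr=x_train
)

# Pure broadcasting: (80, 1, 1) against (101, 101) -> (80, 101, 101)
broadcast_predictions = bs + ws * x_train.reshape(-1, 1, 1)

print(np.allclose(all_predictions, broadcast_predictions))  # True
```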

The errors are the difference between the predictions and the labels, but we
cannot perform this operation right away—we need to work a bit on our labels (y),
so they have the proper shape for it (broadcasting is good, but not that good):

all_labels = y_train.reshape(-1, 1, 1)
all_labels.shape

Output

(80, 1, 1)

Our labels turned out to be 80 matrices of shape (1, 1)—the most boring kind of
matrix—but that is enough for broadcasting to work its magic. We can compute the
errors now:

all_errors = (all_predictions - all_labels)
all_errors.shape

Output

(80, 101, 101)

Each prediction has its own error, so we get 80 matrices of shape (101, 101), again,
one matrix for each data point, each matrix containing a grid of errors.

The only step missing is to compute the mean squared error. First, we take the
square of all errors. Then, we average the squares over all data points. Since our
data points are in the first dimension, we use axis=0 to compute this average:

all_losses = (all_errors ** 2).mean(axis=0)
all_losses.shape

Output

(101, 101)

The result is a grid of losses, a matrix of shape (101, 101), each loss corresponding
to a different combination of the parameters b and w.

These losses are our loss surface, which can be visualized in a 3D plot, where the
vertical axis (z) represents the loss values. If we connect the combinations of b and
w that yield the same loss value, we’ll get an ellipse. Then, we can draw this ellipse
in the original b x w plane (in blue, for a loss value of 3). This is, in a nutshell, what a
contour plot does. From now on, we’ll always use the contour plot, instead of the
corresponding 3D version.

Figure 0.4 - Loss surface

In the center of the plot, where parameters (b, w) have values close to (1, 2), the loss
is at its minimum value. This is the point we’re trying to reach using gradient
descent.

At the bottom, slightly to the left, there is the random start point, corresponding to
our randomly initialized parameters.

This is one of the nice things about tackling a simple problem like a linear
regression with a single feature: We have only two parameters, and thus we can
compute and visualize the loss surface.

Unfortunately, for the vast majority of problems, computing
the loss surface is not going to be feasible: we have to rely on
gradient descent’s ability to reach a point of minimum, even if it
starts at some random point.
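While we do have the loss surface, though, nothing stops us from querying it numerically: the grid cell with the smallest loss should sit very close to the true parameters. A self-contained sketch (this check is mine, for illustration), rebuilding the grid of losses from the steps above:

```python
import numpy as np

# Data generation and split, as before
true_b, true_w, N = 1, 2, 100
np.random.seed(42)
x = np.random.rand(N, 1)
y = true_b + true_w * x + (.1 * np.random.randn(N, 1))
idx = np.arange(N)
np.random.shuffle(idx)
x_train, y_train = x[idx[:80]], y[idx[:80]]

# Grid of parameter combinations
b_range = np.linspace(true_b - 3, true_b + 3, 101)
w_range = np.linspace(true_w - 3, true_w + 3, 101)
bs, ws = np.meshgrid(b_range, w_range)

# Grid of losses: (80, 101, 101) squared errors, averaged over the data points
all_predictions = bs + ws * x_train.reshape(-1, 1, 1)
all_losses = ((all_predictions - y_train.reshape(-1, 1, 1)) ** 2).mean(axis=0)

# The cell with the lowest loss should be close to (b, w) = (1, 2)
row, col = np.unravel_index(all_losses.argmin(), all_losses.shape)
print(bs[row, col], ws[row, col])
```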

Cross-Sections

Another nice thing is that we can cut a cross-section in the loss surface to check
what the loss would look like if the other parameter were held constant.

Let’s start by making b = 0.52 (the value from b_range that is closest to our initial
random value for b, 0.4967). We cut a cross-section vertically (the red dashed line)
on our loss surface (left plot), and we get the resulting plot on the right:

Figure 0.5 - Vertical cross-section; parameter b is fixed

What does this cross-section tell us? It tells us that, if we keep b constant (at 0.52),
the loss, seen from the perspective of parameter w, can be minimized if w gets
increased (up to some value between 2 and 3).



Sure, different values of b produce different cross-section loss curves for w. And
those curves will depend on the shape of the loss surface (more on that later, in the
"Learning Rate" section).

OK, so far, so good… What about the other cross-section? Let’s cut it horizontally
now, making w = -0.16 (the value from w_range that is closest to our initial random
value for w, -0.1382). The resulting plot is on the right:

Figure 0.6 - Horizontal cross-section; parameter w is fixed

Now, if we keep w constant (at -0.16), the loss, seen from the perspective of
parameter b, can be minimized if b gets increased (up to some value close to 2).

In general, the purpose of this cross-section is to get the effect on
the loss of changing a single parameter, while keeping
everything else constant. This is, in a nutshell, a gradient :-)

Now I have a question for you: Which of the two dashed curves,
red (w changes, b is constant) or black (b changes, w is constant),
yields the largest changes in loss when we modify the changing
parameter?

The answer is coming right up in the next section!

Step 3 - Compute the Gradients


A gradient is a partial derivative—why partial? Because one computes it with
respect to (w.r.t.) a single parameter. We have two parameters, b and w, so we must
compute two partial derivatives.

A derivative tells you how much a given quantity changes when you slightly vary
some other quantity. In our case, how much does our MSE loss change when we
vary each of our two parameters separately?

Gradient = how much the loss changes if ONE parameter
changes a little bit!

The right-most part of the equations below is what you usually see in
implementations of gradient descent for simple linear regression. In the
intermediate step, I show you all elements that pop up from the application of the
chain rule,[37] so you know how the final expression came to be.

∂MSE/∂b = (1/n) Σ_i 2 (ŷ_i - y_i) ∂ŷ_i/∂b = 2 (1/n) Σ_i (ŷ_i - y_i)
∂MSE/∂w = (1/n) Σ_i 2 (ŷ_i - y_i) ∂ŷ_i/∂w = 2 (1/n) Σ_i x_i (ŷ_i - y_i)

Equation 0.4 - Computing gradients w.r.t coefficients b and w using n points

Just to be clear: We will always use our "regular" error computed at the beginning
of Step 2. The loss surface is surely eye candy, but, as I mentioned before, it is only
feasible to use it for educational purposes.

Step 3

# Step 3 - Computes gradients for both "b" and "w" parameters
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()
print(b_grad, w_grad)



Output

-3.044811379650508 -1.8337537171510832

Visualizing Gradients

Since the gradient for b is larger (in absolute value, 3.04) than the gradient for w (in
absolute value, 1.83), the answer to the question I posed in the "Cross-Sections"
section is: The black curve (b changes, w is constant) yields the largest
changes in loss.

"Why is that?"

To answer that, let’s first put both cross-section plots side-by-side, so we can more
easily compare them. What is the main difference between them?

Figure 0.7 - Cross-sections of the loss surface

The curve on the right is steeper. That’s your answer! Steeper curves have larger
gradients.

Cool! That’s the intuition… Now, let’s get a bit more geometrical. So, I am zooming
in on the regions given by the red and black squares of Figure 0.7.

From the "Cross-Sections" section, we already know that to minimize the loss, both
b and w needed to be increased. So, keeping in the spirit of using gradients, let’s
increase each parameter a little bit (always keeping the other one fixed!). By the
way, in this example, a little bit equals 0.12 (for convenience’s sake, so it results in a
nicer plot).

What effect do these increases have on the loss? Let’s check it out:

Figure 0.8 - Computing (approximate) gradients, geometrically

On the left plot, increasing w by 0.12 yields a loss reduction of 0.21. The
geometrically computed and roughly approximate gradient is given by the ratio
between the two values: -1.79. How does this result compare to the actual value of
the gradient (-1.83)? It is actually not bad for a crude approximation. Could it be
better? Sure, if we make the increase in w smaller and smaller (like 0.01, instead of
0.12), we’ll get better and better approximations. In the limit, as the increase
approaches zero, we’ll arrive at the precise value of the gradient. Well, that’s the
definition of a derivative!

The same reasoning goes for the plot on the right: increasing b by the same 0.12
yields a larger loss reduction of 0.35. Larger loss reduction, larger ratio, larger
gradient—and larger error, too, since the geometric approximation (-2.90) is
farther away from the actual value (-3.04).
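This geometric reasoning is just a finite-difference approximation, and we can reproduce it in code: nudge one parameter by a small amount h, recompute the loss, and compare the ratio to the analytical gradients from Step 3. A sketch (using h = 0.01 instead of 0.12, so the approximation is much tighter):

```python
import numpy as np

# Rebuilds data, split, and the randomly initialized parameters
true_b, true_w, N = 1, 2, 100
np.random.seed(42)
x = np.random.rand(N, 1)
y = true_b + true_w * x + (.1 * np.random.randn(N, 1))
idx = np.arange(N)
np.random.shuffle(idx)
x_train, y_train = x[idx[:80]], y[idx[:80]]

np.random.seed(42)
b, w = np.random.randn(1), np.random.randn(1)

def mse(b, w):
    return (((b + w * x_train) - y_train) ** 2).mean()

# Analytical gradients, exactly as in Step 3
error = (b + w * x_train) - y_train
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()

# Finite differences: (loss(p + h) - loss(p)) / h, one parameter at a time
h = 0.01
b_grad_approx = (mse(b + h, w) - mse(b, w)) / h
w_grad_approx = (mse(b, w + h) - mse(b, w)) / h

print(b_grad, b_grad_approx)  # both close to -3.04
print(w_grad, w_grad_approx)  # both close to -1.83
```

As h shrinks toward zero, the approximation converges to the analytical value, which is exactly the definition of a derivative mentioned above.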

Time for another question: Which curve, red or black, do you like best to reduce
the loss? It should be the black one, right? Well, yes, but it is not as straightforward
as we’d like it to be. We’ll dig deeper into this in the "Learning Rate" section.

Backpropagation

Now that you’ve learned about computing the gradient of the loss function w.r.t.
each parameter using the chain rule, let me show you how Wikipedia describes
backpropagation (highlights are mine):

The backpropagation algorithm works by computing the gradient of the loss
function with respect to each weight by the chain rule, computing the
gradient one layer at a time, iterating backward from the last layer to avoid
redundant calculations of intermediate terms in the chain rule;

The term backpropagation strictly refers only to the algorithm for computing
the gradient, not how the gradient is used; but the term is often used loosely
to refer to the entire learning algorithm, including how the gradient is used,
such as by stochastic gradient descent.

Does it seem familiar? That’s it; backpropagation is nothing more than "chained"
gradient descent. That’s, in a nutshell, how a neural network is trained: It uses
backpropagation, starting at its last layer and working its way back, to update the
weights through all the layers.

In our example, we have a single layer, even a single neuron, so there is no need to
backpropagate anything (more on that in the next chapter).

Step 4 - Update the Parameters


In the final step, we use the gradients to update the parameters. Since we are
trying to minimize our losses, we reverse the sign of the gradient for the update.

There is still another (hyper-)parameter to consider: the learning rate, denoted by
the Greek letter eta (that looks like the letter n), which is the multiplicative factor
that we need to apply to the gradient for the parameter update.

b = b - η ∂MSE/∂b
w = w - η ∂MSE/∂w

Equation 0.5 - Updating coefficients b and w using computed gradients and a learning rate

We can also interpret this a bit differently: Each parameter is going to have its
value updated by a constant value eta (the learning rate), but this constant is going
to be weighted by how much that parameter contributes to minimizing the loss
(its gradient).

Honestly, I believe this way of thinking about the parameter update makes more
sense. First, you decide on a learning rate that specifies your step size, while the
gradients tell you the relative impact (on the loss) of taking a step for each
parameter. Then, you take a given number of steps that’s proportional to that
relative impact: more impact, more steps.

"How do you choose a learning rate?"

That is a topic on its own and beyond the scope of this section as
well. We’ll get back to it later on, in the second volume of the
series.

In our example, let’s start with a value of 0.1 for the learning rate (which is a
relatively high value, as far as learning rates are concerned).

Step 4

# Sets learning rate - this is "eta" ~ the "n"-like Greek letter
lr = 0.1
print(b, w)

# Step 4 - Updates parameters using gradients and the
# learning rate
b = b - lr * b_grad
w = w - lr * w_grad

print(b, w)

Output

[0.49671415] [-0.1382643]
[0.80119529] [0.04511107]

What’s the impact of one update on our model? Let’s visually check its predictions.


Figure 0.9 - Updated model’s predictions

It looks better … at least it started pointing in the right direction!
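Since one update helped, nothing stops us from repeating Steps 1 to 4 over and over. A sketch of the full loop (the 1,000-epoch count is my choice for illustration, not a value from this chapter):

```python
import numpy as np

# Data generation and split, as before
true_b, true_w, N = 1, 2, 100
np.random.seed(42)
x = np.random.rand(N, 1)
y = true_b + true_w * x + (.1 * np.random.randn(N, 1))
idx = np.arange(N)
np.random.shuffle(idx)
x_train, y_train = x[idx[:80]], y[idx[:80]]

# Step 0 - random initialization
np.random.seed(42)
b, w = np.random.randn(1), np.random.randn(1)

lr = 0.1
for epoch in range(1000):
    # Step 1 - forward pass
    yhat = b + w * x_train
    # Step 2 - error (the loss itself is not needed for the update)
    error = yhat - y_train
    # Step 3 - gradients
    b_grad = 2 * error.mean()
    w_grad = 2 * (x_train * error).mean()
    # Step 4 - parameter update
    b = b - lr * b_grad
    w = w - lr * w_grad

print(b, w)  # close to the true values (1, 2)
```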

Learning Rate

The learning rate is the most important hyper-parameter. There is a gigantic
amount of material on how to choose a learning rate, how to modify the learning
rate during the training, and how the wrong learning rate can completely ruin the
model training.

Maybe you’ve seen this famous graph[38] (from Stanford’s CS231n class) that shows
how a learning rate that is too high or too low affects the loss during training. Most
people will see it (or have seen it) at some point in time. This is pretty much general
knowledge, but I think it needs to be thoroughly explained and visually
demonstrated to be truly understood. So, let’s start!

I will tell you a little story (trying to build an analogy here, please bear with me!):
Imagine you are coming back from hiking in the mountains and you want to get
back home as quickly as possible. At some point in your path, you can either choose
to go ahead or to make a right turn.

The path ahead is almost flat, while the path to your right is kinda steep. The
steepness is the gradient. If you take a single step one way or the other, it will lead
to different outcomes (you’ll descend more if you take one step to the right instead
of going ahead).

But, here is the thing: You know that the path to your right is getting you home



faster, so you don’t take just one step, but multiple steps in that direction. The
steeper the path, the more steps you take! Remember, "more impact, more steps!"
You just cannot resist the urge to take that many steps; your behavior seems to be
completely determined by the landscape (This analogy is getting weird, I know…)

But, you still have one choice: You can adjust the size of your step. You can choose
to take steps of any size, from tiny steps to long strides. That’s your learning rate.

OK, let’s see where this little story brought us so far. That’s how you’ll move, in a
nutshell:

updated location = previous location + step size * number of steps

Now, compare it to what we did with the parameters:

updated value = previous value - learning rate * gradient

You get the point, right? I hope so, because the analogy completely falls apart now.
At this point, after moving in one direction (say, the right turn we talked about),
you’d have to stop and move in the other direction (for just a fraction of a step,
because the path was almost flat, remember?). And so on and so forth. Well, I don’t
think anyone has ever returned from hiking in such an orthogonal zig-zag path!

Anyway, let’s explore further the only choice you have: the size of your step—I
mean, the learning rate.

"Choose your learning rate wisely."

Grail Knight

Low Learning Rate

It makes sense to start with baby steps, right? This means using a low learning rate.
Low learning rates are safe(r), as expected. If you were to take tiny steps while
returning home from your hiking, you’d be more likely to arrive there safe and
sound—but it would take a lot of time. The same holds true for training models:
Low learning rates will likely get you to (some) minimum point, eventually.
Unfortunately, time is money, especially when you’re paying for GPU time in the
cloud, so, there is an incentive to try higher learning rates.

How does this reasoning apply to our model? From computing our (geometric)
gradients, we know we need to take a given number of steps: 1.79 (parameter w)
and 2.90 (parameter b), respectively. Let’s set our step size to 0.2 (low-ish). It
means we move 0.36 for w and 0.58 for b.

IMPORTANT: In real life, a learning rate of 0.2 is usually
considered HIGH—but in our very simple linear regression
example, it still qualifies as low-ish.

Where does this movement lead us? As you can see in the plots below (as shown by
the new dots to the right of the original ones), in both cases, the movement took us
closer to the minimum; more so on the right because the curve is steeper.

Figure 0.10 - Using a low-ish learning rate

High Learning Rate

What would have happened if we had used a high learning rate instead, say, a step
size of 0.8? As we can see in the plots below, we start to, literally, run into trouble.



Figure 0.11 - Using a high learning rate

Even though everything is still OK on the left plot, the right plot shows us a
completely different picture: We ended up on the other side of the curve. That is
not good… You’d be going back and forth, alternately hitting both sides of the
curve.

"Well, even so, I may still reach the minimum; why is it so bad?"

In our simple example, yes, you’d eventually reach the minimum because the curve
is nice and round.

But, in real problems, the "curve" has a really weird shape that allows for bizarre
outcomes, such as going back and forth without ever approaching the minimum.

In our analogy, you moved so fast that you fell down and hit the other side of the
valley, then kept going down like a ping-pong. Hard to believe, I know, but you
definitely don’t want that!

Very High Learning Rate

Wait, it may get worse than that! Let’s use a really high learning rate, say, a step
size of 1.1!


Figure 0.12 - Using a really high learning rate

"He chose … poorly."

Grail Knight

Ok, that is bad. On the right plot, not only did we end up on the other side of the
curve again, but we actually climbed up. This means our loss increased, instead of
decreased! How is that even possible? You’re moving so fast downhill that you end up
climbing it back up?! Unfortunately, the analogy cannot help us anymore. We need
to think about this particular case in a different way.

First, notice that everything is fine on the left plot. The enormous learning rate did
not cause any issues, because the left curve is less steep than the one on the right.
In other words, the curve on the left can take higher learning rates than the curve
on the right.

What can we learn from it?



Too high, for a learning rate, is a relative concept: It depends on how
steep the curve is, or, in other words, it depends on how large the
gradient is.

We do have many curves, many gradients: one for each parameter. But we only
have one single learning rate to choose (sorry, that’s the way it is!).

It means that the size of the learning rate is limited by the steepest
curve. All other curves must follow suit, meaning they’d be using a
suboptimal learning rate, given their shapes.

The reasonable conclusion is: It is best if all the curves are equally
steep, so the learning rate is closer to optimal for all of them!
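To make "too high is relative" concrete, here is a minimal sketch (my own, not from the book's notebooks) running plain gradient descent on two quadratic loss curves, loss = a * w^2, where a controls the steepness. The same learning rate that works on the shallow curve makes the steep one diverge:

```python
def descend(a, lr, w0=1.0, n_steps=20):
    # Gradient descent on loss = a * w**2; the gradient is 2 * a * w
    w = w0
    for _ in range(n_steps):
        grad = 2 * a * w
        w = w - lr * grad
    return w

lr = 0.8
print(descend(a=1.0, lr=lr))  # shallow curve: w shrinks toward zero
print(descend(a=2.0, lr=lr))  # steep curve: w blows up - this lr is "too high"
```

With a = 1.0 each step multiplies w by (1 - 1.6) = -0.6, so it converges; with a = 2.0 the factor is (1 - 3.2) = -2.2, so every step overshoots farther up the other side of the curve.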

"Bad" Feature

How do we achieve equally steep curves? I’m on it! First, let’s take a look at a slightly
modified example, which I am calling the "bad" dataset:

• I multiplied our feature (x) by 10, so it is in the range [0, 10] now, and renamed
it bad_x.

• But since I do not want the labels (y) to change, I divided the original true_w
parameter by 10 and renamed it bad_w—this way, both bad_w * bad_x and w *
x yield the same results.



true_b = 1
true_w = 2
N = 100

# Data Generation
np.random.seed(42)

# We divide w by 10
bad_w = true_w / 10
# And multiply x by 10
bad_x = np.random.rand(N, 1) * 10

# So, the net effect on y is zero - it is still
# the same as before
y = true_b + bad_w * bad_x + (.1 * np.random.randn(N, 1))

Then, I performed the same split as before for both original and bad datasets and
plotted the training sets side-by-side, as you can see below:

# Generates train and validation sets
# It uses the same train_idx and val_idx as before,
# but it applies to bad_x
bad_x_train, y_train = bad_x[train_idx], y[train_idx]
bad_x_val, y_val = bad_x[val_idx], y[val_idx]

Figure 0.13 - Same data, different scales for feature x



The only difference between the two plots is the scale of feature x. Its range was
[0, 1], now it is [0, 10]. The label y hasn’t changed, and I did not touch true_b.
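A quick sanity check (mine, not the book's) confirms that the relabeling leaves the products, and therefore the labels, unchanged:

```python
import numpy as np

true_w = 2
N = 100

np.random.seed(42)
x = np.random.rand(N, 1)   # original feature, range [0, 1]

bad_w = true_w / 10        # w divided by 10...
bad_x = x * 10             # ...feature multiplied by 10

# The products match, so the generated labels y are the same
print(np.allclose(bad_w * bad_x, true_w * x))  # True
```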

Does this simple scaling have any meaningful impact on our gradient descent?
Well, if it hadn’t, I wouldn’t be asking, right? Let’s compute a new loss
surface and compare it to the one we had before.

Figure 0.14 - Loss surface—before and after scaling feature x (Obs.: left plot looks a bit different
than Figure 0.6 because it is centered at the "after" minimum)

Look at the contour values of Figure 0.14: The dark blue line was 3.0, and now it is
50.0! For the same range of parameter values, loss values are much higher.

Let’s look at the cross-sections before and after we multiplied feature x by 10.



Figure 0.15 - Comparing cross-sections: before and after

What happened here? The red curve got much steeper (larger gradient), and thus
we must use a lower learning rate to safely descend along it.

More important, the difference in steepness between the red and the black
curves increased.

This is exactly what WE NEED TO AVOID!

Do you remember why?

Because the size of the learning rate is limited by the steepest curve!

How can we fix it? Well, we ruined it by scaling it 10x larger. Perhaps we can make
it better if we scale it in a different way.

Scaling / Standardizing / Normalizing

Different how? There is this beautiful thing called the StandardScaler, which
transforms a feature in such a way that it ends up with zero mean and unit
standard deviation.

How does it achieve that? First, it computes the mean and the standard deviation of
a given feature (x) using the training set (N points):



Equation 0.6 - Computing mean and standard deviation

mean(x) = (1/N) * sum(x_i)
std(x) = sqrt((1/N) * sum((x_i - mean(x))^2))

Then, it uses both values to scale the feature:

Equation 0.7 - Standardizing

scaled x_i = (x_i - mean(x)) / std(x)

If we were to recompute the mean and the standard deviation of the scaled
feature, we would get 0 and 1, respectively. This pre-processing step is commonly
referred to as normalization, although, technically, it should always be referred to as
standardization.
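Equations 0.6 and 0.7 amount to a couple of lines of Numpy. The sketch below (the variable names are my own) standardizes a feature by hand and checks the result:

```python
import numpy as np

np.random.seed(42)
feature = np.random.rand(100, 1) * 10   # a feature in the range [0, 10]

# Equation 0.6 - mean and standard deviation
mu = feature.mean()
sigma = feature.std()

# Equation 0.7 - standardizing
scaled_feature = (feature - mu) / sigma

print(scaled_feature.mean(), scaled_feature.std())  # ~0.0 and ~1.0
```

In practice you would compute mu and sigma on the training set only, exactly as the StandardScaler does, and reuse them to transform the validation and test sets.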

IMPORTANT: Pre-processing steps like the StandardScaler MUST be performed
AFTER the train-validation-test split; otherwise, you’ll be leaking
information from the validation and / or test sets to your model!

After using the training set only to fit the StandardScaler, you should use
its transform() method to apply the pre-processing step to all datasets:
training, validation, and test.



Zero Mean and Unit Standard Deviation

Let’s start with the unit standard deviation; that is, scaling the feature
values such that its standard deviation equals one. This is one of the most
important pre-processing steps, not only for the sake of improving the
performance of gradient descent, but for other techniques such as principal
component analysis (PCA) as well. The goal is to have all numerical features
in a similar scale, so the results are not affected by the original range of each
feature.

Think of two common features in a model: age and salary. While age usually
varies between 0 and 110, salaries can go from the low hundreds (say, 500)
to several thousand (say, 9,000). If we compute the corresponding standard
deviations, we may get values like 25 and 2,000, respectively. Thus, we need
to standardize both features to have them on equal footing.

And then there is the zero mean; that is, centering the feature at zero.
Deeper neural networks may suffer from a very serious condition called
vanishing gradients. Since the gradients are used to update the parameters,
smaller and smaller (that is, vanishing) gradients mean smaller and smaller
updates, up to the point of a standstill: The network simply stops learning.
One way to help the network to fight this condition is to center its inputs,
the features, at zero. We’ll get back to this later on, in the second volume of
the series, while discussing activation functions.

The code below will illustrate this well.

from sklearn.preprocessing import StandardScaler  # import shown for completeness

scaler = StandardScaler(with_mean=True, with_std=True)
# We use the TRAIN set ONLY to fit the scaler
scaler.fit(x_train)

# Now we can use the already fit scaler to TRANSFORM
# both TRAIN and VALIDATION sets
scaled_x_train = scaler.transform(x_train)
scaled_x_val = scaler.transform(x_val)

Notice that we are not regenerating the data—we are using the original
feature x as input for the StandardScaler and transforming it into a scaled
x. The labels (y) are left untouched.

Let’s plot the three of them—original, "bad", and scaled—side-by-side to
illustrate the differences.

Figure 0.16 - Same data, three different scales for feature x

Once again, the only difference between the plots is the scale of feature x. Its
range was originally [0, 1], then we made it [0, 10], and now the StandardScaler
made it [-1.5, 1.5].

OK, time to check the loss surface: To illustrate the differences, I am plotting the
three of them side-by-side: original, "bad", and scaled. It looks like Figure 0.17.

Figure 0.17 - Loss surfaces for different scales for feature x (Obs.: left and center plots look a bit
different than Figure 0.14 because they are centered at the "scaled" minimum)

BEAUTIFUL, isn’t it? The textbook definition of a bowl :-)

In practice, this is the best surface one could hope for: The cross-sections
are going to be similarly steep, and a good learning rate for one of them is
also good for the other.

Sure, in the real world, you’ll never get a pretty bowl like that. But our conclusion
still holds:

1. Always standardize (scale) your features.

2. DO NOT EVER FORGET #1!

Step 5 - Rinse and Repeat!


Now we use the updated parameters to go back to Step 1 and restart the process.

Definition of Epoch

An epoch is complete whenever every point in the training set (N) has
already been used in all steps: forward pass, computing loss, computing
gradients, and updating parameters.

During one epoch, we perform at least one update, but no more than N updates.

The number of updates (N/n) will depend on the type of gradient descent
being used:

• For batch (n = N) gradient descent, this is trivial, as it uses all points
for computing the loss—one epoch is the same as one update.

• For stochastic (n = 1) gradient descent, one epoch means N updates, since
every individual data point is used to perform an update.

• For mini-batch (of size n), one epoch has N/n updates, since a mini-batch
of n data points is used to perform an update.
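The update counts above are easy to check for our training set of N = 80 points:

```python
N = 80  # number of points in the training set

for n, kind in [(80, "batch"), (16, "mini-batch"), (1, "stochastic")]:
    # N // n updates happen during one epoch
    print(f"{kind:>10} (n = {n:2d}): {N // n} update(s) per epoch")
```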

Repeating this process over and over for many epochs is, in a nutshell, training a
model.

What happens if we run it over 1,000 epochs?



Figure 0.18 - Final model’s predictions

In the next chapter, we’ll put all these steps together and run it for 1,000 epochs, so
we’ll get to the parameters depicted in the figure above, b = 1.0235 and w = 1.9690.

"Why 1,000 epochs?"

No particular reason, but this is a fairly simple model, and we can afford to run it
over a large number of epochs. In more-complex models, though, a couple of dozen
epochs may be enough. We’ll discuss this a bit more in Chapter 1.

The Path of Gradient Descent

In Step 3, we have seen the loss surface and both random start and minimum
points.

Which path is gradient descent going to take to go from random start to a
minimum? How long will it take? Will it actually reach the minimum?

The answers to all these questions depend on many things, like the learning
rate, the shape of the loss surface, and the number of points we use to
compute the loss.

Depending on whether we use batch, mini-batch, or stochastic gradient
descent, the path is going to be more or less smooth, and it is likely to
reach the minimum in more or less time.

To illustrate the differences, I’ve generated paths over 100 epochs using
either 80 data points (batch), 16 data points (mini-batch), or a single data
point (stochastic) for computing the loss, as shown in the figure below.

Figure 0.19 - The paths of gradient descent (Obs.: random start is different from Figure 0.4)

You can see that the resulting parameters at the end of Epoch 1 differ greatly from
one another. This is a direct consequence of the number of updates happening
during one epoch, according to the batch size. In our example, for 100 epochs:

• 80 data points (batch): 1 update / epoch, totaling 100 updates

• 16 data points (mini-batch): 5 updates / epoch, totaling 500 updates

• 1 data point (stochastic): 80 updates / epoch, totaling 8,000 updates

So, for both center and right plots, the path between random start and Epoch 1
contains multiple updates, which are not depicted in the plot (otherwise it would
be very cluttered)—that’s why the line connecting two epochs is dashed, instead of
solid. In reality, there would be zig-zagging lines connecting every two epochs.

There are two things to notice:

• It should be no surprise that mini-batch gradient descent is able to get
closer to the minimum point (using the same number of epochs) since it
benefits from a larger number of updates than batch gradient descent.

• The stochastic gradient descent path is somewhat weird: It gets quite close to
the minimum point at the end of Epoch 1 already, but then it seems to fail to
actually reach it. But this is expected since it uses a single data point for each
update; it will never stabilize, forever hovering in the neighborhood of the
minimum point.

Clearly, there is a trade-off here: Either we have a stable and smooth trajectory, or
we move faster toward the minimum.



Recap
This finishes our journey through the inner workings of gradient descent. By now, I
hope you have developed better intuition about the many different aspects
involved in the process.

In time, with practice, you’ll observe the behaviors described here in your own
models. Make sure to try plenty of different combinations: mini-batch sizes,
learning rates, etc. This way, not only will your models learn, but so will you :-)

This is a (not so) short recap of everything we covered in this chapter:

• defining a simple linear regression model

• generating synthetic data for it

• performing a train-validation split on our dataset

• randomly initializing the parameters of our model

• performing a forward pass; that is, making predictions using our model

• computing the errors associated with our predictions

• aggregating the errors into a loss (mean squared error)

• learning that the number of points used to compute the loss defines the
kind of gradient descent we’re using: batch (all), mini-batch, or stochastic
(one)

• visualizing an example of a loss surface and using its cross-sections to
get the loss curves for individual parameters

• learning that a gradient is a partial derivative and it represents how
much the loss changes if one parameter changes a little bit

• computing the gradients for our model’s parameters using equations, code,
and geometry

• learning that larger gradients correspond to steeper loss curves

• learning that backpropagation is nothing more than "chained" gradient
descent

• using the gradients and a learning rate to update the parameters

• comparing the effects on the loss of using low, high, and very high
learning rates

• learning that loss curves for all parameters should be, ideally, similarly
steep

• visualizing the effects of using a feature with a larger range, making the
loss curve for the corresponding parameter much steeper

• using Scikit-Learn’s StandardScaler to bring a feature to a reasonable
range and thus making the loss surface more bowl-shaped and its
cross-sections similarly steep

• learning that preprocessing steps like scaling should be applied after the
train-validation split to prevent leakage

• figuring out that performing all steps (forward pass, loss, gradients, and
parameter update) makes one epoch

• visualizing the path of gradient descent over many epochs and realizing it
is heavily dependent on the kind of gradient descent used: batch,
mini-batch, or stochastic

• learning that there is a trade-off between the stable and smooth path of
batch gradient descent and the fast and somewhat chaotic path of stochastic
gradient descent, making the use of mini-batch gradient descent a good
compromise between the other two

You are now ready to put it all together and actually train a model using PyTorch!

[33] https://github.com/dvgodoy/PyTorchStepByStep/blob/master/Chapter00.ipynb
[34] https://colab.research.google.com/github/dvgodoy/PyTorchStepByStep/blob/master/Chapter00.ipynb
[35] https://en.wikipedia.org/wiki/Gradient_descent
[36] https://en.wikipedia.org/wiki/Gaussian_noise
[37] https://en.wikipedia.org/wiki/Chain_rule
[38] https://bit.ly/2BxCxTO



Chapter 1
A Simple Regression Problem
Spoilers
In this chapter, we will:

• briefly review the steps of gradient descent (optional)

• use gradient descent to implement a linear regression in Numpy

• create tensors in PyTorch (finally!)

• understand the difference between CPU and GPU tensors

• understand PyTorch’s main feature, autograd, used to perform automatic
differentiation

• visualize the dynamic computation graph

• create a loss function

• define an optimizer

• implement our own model class

• implement nested and sequential models, using PyTorch’s layers

• organize our code into three parts: data preparation, model configuration,
and model training

Jupyter Notebook
The Jupyter notebook corresponding to Chapter 1[39] is part of the official Deep
Learning with PyTorch Step-by-Step repository on GitHub. You can also run it
directly in Google Colab[40].

If you’re using a local installation, open your terminal or Anaconda prompt and
navigate to the PyTorchStepByStep folder you cloned from GitHub. Then, activate
the pytorchbook environment and run jupyter notebook:

$ conda activate pytorchbook

(pytorchbook)$ jupyter notebook

If you’re using Jupyter’s default settings, this link should open Chapter 1’s
notebook. If not, just click on Chapter01.ipynb on your Jupyter’s home page.

Imports

For the sake of organization, all libraries needed throughout the code used in any
given chapter are imported at its very beginning. For this chapter, we’ll need the
following imports:

import numpy as np
from sklearn.linear_model import LinearRegression

import torch
import torch.optim as optim
import torch.nn as nn
from torchviz import make_dot

A Simple Regression Problem


Most tutorials start with some nice and pretty image classification problem to
illustrate how to use PyTorch. It may seem cool, but I believe it distracts you from
the main goal: learning how PyTorch works.

For this reason, in this first example, I will stick with a simple and familiar problem:
a linear regression with a single feature x! It doesn’t get much simpler than that!

Equation 1.1 - Simple linear regression model

y = b + w * x + epsilon

It is also possible to think of it as the simplest neural network possible: one input,
one output, and no activation function (that is, linear).



Figure 1.1 - The simplest of all neural networks

If you have read Chapter 0, you can either choose to skip to the "Linear
Regression in Numpy" section or to use the next two sections as a review.

Data Generation
Let’s start generating some synthetic data. We start with a vector of 100 (N) points
for our feature (x) and create our labels (y) using b = 1, w = 2, and some Gaussian
noise[41] (epsilon).

Synthetic Data Generation


Data Generation

1 true_b = 1
2 true_w = 2
3 N = 100
4
5 # Data Generation
6 np.random.seed(42)
7 x = np.random.rand(N, 1)
8 epsilon = (.1 * np.random.randn(N, 1))
9 y = true_b + true_w * x + epsilon

Next, let’s split our synthetic data into train and validation sets, shuffling the array
of indices and using the first 80 shuffled points for training.

Notebook Cell 1.1 - Splitting synthetic dataset into train and validation sets for linear regression

1 # Shuffles the indices
2 idx = np.arange(N)
3 np.random.shuffle(idx)
4
5 # Uses first 80 random indices for train
6 train_idx = idx[:int(N*.8)]
7 # Uses the remaining indices for validation
8 val_idx = idx[int(N*.8):]
9
10 # Generates train and validation sets
11 x_train, y_train = x[train_idx], y[train_idx]
12 x_val, y_val = x[val_idx], y[val_idx]

Figure 1.2 - Synthetic data: train and validation sets

We know that b = 1, w = 2, but now let’s see how close we can get to the true
values by using gradient descent and the 80 points in the training set (for training,
N = 80).

Gradient Descent
I’ll cover the five basic steps you’ll need to go through to use gradient descent and
the corresponding Numpy code.



Step 0 - Random Initialization

For training a model, you need to randomly initialize the parameters / weights (we
have only two, b and w).

Step 0

# Step 0 - Initializes parameters "b" and "w" randomly
np.random.seed(42)
b = np.random.randn(1)
w = np.random.randn(1)

print(b, w)

Output

[0.49671415] [-0.1382643]

Step 1 - Compute Model’s Predictions

This is the forward pass; it simply computes the model’s predictions using the current
values of the parameters / weights. At the very beginning, we will be producing really
bad predictions, as we started with random values from Step 0.

Step 1

# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train

Step 2 - Compute the Loss

For a regression problem, the loss is given by the mean squared error (MSE); that
is, the average of all squared errors; that is, the average of all squared differences
between labels (y) and predictions (b + wx).

In the code below, we are using all data points of the training set to compute the
loss, so n = N = 80, meaning we are performing batch gradient descent.

Step 2

# Step 2 - Computing the loss
# We are using ALL data points, so this is BATCH gradient
# descent. How wrong is our model? That's the error!
error = (yhat - y_train)

# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()

print(loss)

Output

2.7421577700550976

Batch, Mini-batch, and Stochastic Gradient Descent

• If we use all points in the training set (n = N) to compute the loss, we
are performing a batch gradient descent.

• If we were to use a single point (n = 1) each time, it would be a
stochastic gradient descent.

• Anything else (n) in between 1 and N characterizes a mini-batch gradient
descent.

Step 3 - Compute the Gradients

A gradient is a partial derivative. Why partial? Because one computes it
with respect to (w.r.t.) a single parameter. We have two parameters, b and
w, so we must compute two partial derivatives.

A derivative tells you how much a given quantity changes when you slightly
vary some other quantity. In our case, how much does our MSE loss change
when we vary each of our two parameters separately?

Gradient = how much the loss changes if ONE parameter changes a little bit!
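For the record, here are the two gradients written out (my notation, consistent with the Step 3 code in this chapter); the error for point i is the difference between prediction and label:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(b + w x_i - y_i)^2

\frac{\partial \mathrm{MSE}}{\partial b}
  = \frac{2}{n}\sum_{i=1}^{n}(b + w x_i - y_i)
  = 2 \cdot \operatorname{mean}(\mathrm{error})

\frac{\partial \mathrm{MSE}}{\partial w}
  = \frac{2}{n}\sum_{i=1}^{n}x_i\,(b + w x_i - y_i)
  = 2 \cdot \operatorname{mean}(x \cdot \mathrm{error})
```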



Step 3

# Step 3 - Computes gradients for both "b" and "w" parameters
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()
print(b_grad, w_grad)

Output

-3.044811379650508 -1.8337537171510832

Step 4 - Update the Parameters

In the final step, we use the gradients to update the parameters. Since we are
trying to minimize our losses, we reverse the sign of the gradient for the update.

There is still another (hyper-)parameter to consider: the learning rate,
denoted by the Greek letter eta (that looks like the letter n), which is the
multiplicative factor that we need to apply to the gradient for the
parameter update.

"How do you choose a learning rate?"

That is a topic on its own and beyond the scope of this section as well.
We’ll get back to it in the second volume of the series.

In our example, let’s start with a value of 0.1 for the learning rate (which is a
relatively high value, as far as learning rates are concerned!).

Step 4

# Sets learning rate - this is "eta" ~ the "n"-like Greek letter
lr = 0.1
print(b, w)

# Step 4 - Updates parameters using gradients and
# the learning rate
b = b - lr * b_grad
w = w - lr * w_grad

print(b, w)

Output

[0.49671415] [-0.1382643]
[0.80119529] [0.04511107]

Step 5 - Rinse and Repeat!

Now we use the updated parameters to go back to Step 1 and restart the process.

Definition of Epoch

An epoch is complete whenever every point in the training set (N) has
already been used in all steps: forward pass, computing loss, computing
gradients, and updating parameters.

During one epoch, we perform at least one update, but no more than N updates.

The number of updates (N/n) will depend on the type of gradient descent
being used:

• For batch (n = N) gradient descent, this is trivial, as it uses all points
for computing the loss—one epoch is the same as one update.

• For stochastic (n = 1) gradient descent, one epoch means N updates, since
every individual data point is used to perform an update.

• For mini-batch (of size n), one epoch has N/n updates, since a mini-batch
of n data points is used to perform an update.

Repeating this process over and over for many epochs is, in a nutshell, training a
model.

Linear Regression in Numpy


It’s time to implement our linear regression model using gradient descent and
Numpy only.



"Wait a minute … I thought this book was about PyTorch!" Yes, it is, but
this serves two purposes: first, to introduce the structure of our task,
which will remain largely the same and, second, to show you the main pain
points so you can fully appreciate how much PyTorch makes your life easier
:-)

For training a model, there is a first initialization step (line numbers refer to
Notebook Cell 1.2 code below):

• Random initialization of parameters / weights (we have only two, b and
w)—lines 3 and 4

• Initialization of hyper-parameters (in our case, only learning rate and
number of epochs)—lines 9 and 11

Make sure to always initialize your random seed to ensure the
reproducibility of your results. As usual, the random seed is 42[42], the
(second) least random[43] of all random seeds one could possibly choose.

For each epoch, there are four training steps (line numbers refer to Notebook Cell
1.2 code below):

• Compute model’s predictions—this is the forward pass—line 15

• Compute the loss, using predictions and labels and the appropriate loss
function for the task at hand—lines 20 and 22

• Compute the gradients for every parameter—lines 25 and 26

• Update the parameters—lines 30 and 31

For now, we will be using batch gradient descent only, meaning, we’ll use all data
points for each one of the four steps above. It also means that going once through
all of the steps is already one epoch. Then, if we want to train our model over 1,000
epochs, we just need to add a single loop.

In Chapter 2, we’ll introduce mini-batch gradient descent, and then we’ll
have to include a second inner loop.



Notebook Cell 1.2 - Implementing gradient descent for linear regression using Numpy

1 # Step 0 - Initializes parameters "b" and "w" randomly
2 np.random.seed(42)
3 b = np.random.randn(1) ①
4 w = np.random.randn(1) ①
5
6 print(b, w)
7
8 # Sets learning rate - this is "eta" ~ the "n"-like Greek letter
9 lr = 0.1 ②
10 # Defines number of epochs
11 n_epochs = 1000 ②
12
13 for epoch in range(n_epochs):
14 # Step 1 - Computes model's predicted output - forward pass
15 yhat = b + w * x_train ③
16
17 # Step 2 - Computes the loss
18 # We are using ALL data points, so this is BATCH gradient
19 # descent. How wrong is our model? That's the error!
20 error = (yhat - y_train) ④
21 # It is a regression, so it computes mean squared error (MSE)
22 loss = (error ** 2).mean() ④
23
24 # Step 3 - Computes gradients for both "b" and "w" parameters
25 b_grad = 2 * error.mean() ⑤
26 w_grad = 2 * (x_train * error).mean() ⑤
27
28 # Step 4 - Updates parameters using gradients and
29 # the learning rate
30 b = b - lr * b_grad ⑥
31 w = w - lr * w_grad ⑥
32
33 print(b, w)

① Step 0: Random initialization of parameters / weights

② Initialization of hyper-parameters

③ Step 1: Forward pass

④ Step 2: Computing loss



⑤ Step 3: Computing gradients

⑥ Step 4: Updating parameters

Output

# b and w after initialization
[0.49671415] [-0.1382643]
# b and w after our gradient descent
[1.02354094] [1.96896411]

"Do we need to run it for 1,000 epochs? Shouldn’t it stop automatically
after getting close enough to the minimum loss?"

Good question: We don’t need to run it for 1,000 epochs. There are ways of
stopping it earlier, once the progress is considered negligible (for instance, if the
loss was barely reduced). These are called, most appropriately, early stopping
methods. For now, since our model is a very simple one, we can afford to train it for
1,000 epochs.
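Early stopping itself is easy to sketch. The snippet below is my own variation, not the book's code: the 1e-6 tolerance and the 10,000-epoch cap are arbitrary choices, and it regenerates data instead of reusing the train split. It bolts a simple loss-based check onto the Numpy loop and breaks out once an epoch barely reduces the loss:

```python
import numpy as np

# Generates data similar to this chapter's (no validation split here)
np.random.seed(42)
x_train = np.random.rand(80, 1)
y_train = 1 + 2 * x_train + (.1 * np.random.randn(80, 1))

# Step 0 - random initialization
np.random.seed(42)
b = np.random.randn(1)
w = np.random.randn(1)

lr = 0.1
tol = 1e-6           # minimum loss improvement we still call progress
last_loss = np.inf

for epoch in range(10000):
    yhat = b + w * x_train                     # Step 1 - forward pass
    error = yhat - y_train
    loss = (error ** 2).mean()                 # Step 2 - MSE loss
    if last_loss - loss < tol:                 # negligible progress: stop
        break
    last_loss = loss
    b = b - lr * 2 * error.mean()              # Steps 3 & 4 for b
    w = w - lr * 2 * (x_train * error).mean()  # Steps 3 & 4 for w

print(epoch, b, w)
```

The loop halts long before the cap, with b and w close to the values found by running a fixed 1,000 epochs.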

Figure 1.3 - Fully trained model’s predictions

Just to make sure we haven’t made any mistakes in our code, we can use Scikit-
Learn’s linear regression to fit the model and compare the coefficients.



# Sanity Check: do we get the same results as our
# gradient descent?
linr = LinearRegression()
linr.fit(x_train, y_train)
print(linr.intercept_, linr.coef_[0])

Output

# intercept and coef from Scikit-Learn
[1.02354075] [1.96896447]

They match up to six decimal places—we have a fully working implementation
of linear regression using Numpy.

Time to TORCH it!

PyTorch
First, we need to cover a few basic concepts that may throw you off-balance if you
don’t grasp them well enough before going full-force on modeling.

In deep learning, we see tensors everywhere. Well, Google’s framework is
called TensorFlow for a reason! What is a tensor, anyway?

Tensor

In Numpy, you may have an array that has three dimensions, right? That is,
technically speaking, a tensor.

A scalar (a single number) has zero dimensions, a vector has one dimension,
a matrix has two dimensions, and a tensor has three or more dimensions.
That’s it!

But, to keep things simple, it is commonplace to call vectors and matrices tensors as
well—so, from now on, everything is either a scalar or a tensor.



Figure 1.4 - Tensors are just higher-dimensional matrices - make sure to check this version out :-)

You can create tensors in PyTorch pretty much the same way you create arrays in
Numpy. Using tensor() you can create either a scalar or a tensor.

PyTorch’s tensors have equivalent functions to their Numpy counterparts,
like ones(), zeros(), rand(), randn(), and many more. In the example below,
we create one of each: scalar, vector, matrix, and tensor—or, saying it
differently, one scalar and three tensors.

scalar = torch.tensor(3.14159)
vector = torch.tensor([1, 2, 3])
matrix = torch.ones((2, 3), dtype=torch.float)
tensor = torch.randn((2, 3, 4), dtype=torch.float)

print(scalar)
print(vector)
print(matrix)
print(tensor)

Output

tensor(3.1416)
tensor([1, 2, 3])
tensor([[1., 1., 1.],
        [1., 1., 1.]])
tensor([[[-1.0658, -0.5675, -1.2903, -0.1136],
         [ 1.0344,  2.1910,  0.7926, -0.7065],
         [ 0.4552, -0.6728,  1.8786, -0.3248]],

        [[-0.7738,  1.3831,  1.4861, -0.7254],
         [ 0.1989, -1.0139,  1.5881, -1.2295],
         [-0.5338, -0.5548,  1.5385, -1.2971]]])

You can get the shape of a tensor using its size() method or its shape attribute.

print(tensor.size(), tensor.shape)

Output

torch.Size([2, 3, 4]) torch.Size([2, 3, 4])

All tensors have shapes, but scalars have "empty" shapes, since they are
dimensionless (or zero dimensions, if you prefer):

print(scalar.size(), scalar.shape)

Output

torch.Size([]) torch.Size([])

You can also reshape a tensor using its view() (preferred) or reshape() methods.



Beware: The view() method only returns a tensor with the desired shape that
shares the underlying data with the original tensor—it DOES NOT create a
new, independent, tensor!

The reshape() method may or may not create a copy! The reasons behind this
apparently weird behavior are beyond the scope of this section, but this
behavior is the reason why view() is preferred.

# We get a tensor with a different shape but it still is
# the SAME tensor
same_matrix = matrix.view(1, 6)
# If we change one of its elements...
same_matrix[0, 1] = 2.
# It changes both variables: matrix and same_matrix
print(matrix)
print(same_matrix)

Output

tensor([[1., 2., 1.],
        [1., 1., 1.]])
tensor([[1., 2., 1., 1., 1., 1.]])

If you want to copy all data, that is, duplicate the data in memory, you may use
either its new_tensor() or clone() methods.

# We can use "new_tensor" method to REALLY copy it into a new one
different_matrix = matrix.new_tensor(matrix.view(1, 6))
# Now, if we change one of its elements...
different_matrix[0, 1] = 3.
# The original tensor (matrix) is left untouched!
# But we get a "warning" from PyTorch telling us
# to use "clone()" instead!
print(matrix)
print(different_matrix)

Output

tensor([[1., 2., 1.],
        [1., 1., 1.]])
tensor([[1., 3., 1., 1., 1., 1.]])

Output

UserWarning: To copy construct from a tensor, it is
recommended to use sourceTensor.clone().detach() or
sourceTensor.clone().detach().requires_grad_(True),
rather than tensor.new_tensor(sourceTensor).
"""Entry point for launching an IPython kernel.

It seems that PyTorch prefers that we use clone()—together with
detach()—instead of new_tensor(). Both ways accomplish exactly the same
result, but the code below is deemed cleaner and more readable.

# Let's follow PyTorch's suggestion and use "clone" method
another_matrix = matrix.view(1, 6).clone().detach()
# Again, if we change one of its elements...
another_matrix[0, 1] = 4.
# The original tensor (matrix) is left untouched!
print(matrix)
print(another_matrix)

Output

tensor([[1., 2., 1.],
        [1., 1., 1.]])
tensor([[1., 4., 1., 1., 1., 1.]])

You’re probably asking yourself: "But, what about the detach()
method—what does it do?"

It removes the tensor from the computation graph, which probably raises more
questions than it answers, right? Don’t worry, we’ll get back to it later in this
chapter.



Loading Data, Devices, and CUDA

It is time to start converting our Numpy code to PyTorch: We’ll start with the
training data; that is, our x_train and y_train arrays.

"How do we go from Numpy’s arrays to PyTorch’s tensors?"

That’s what as_tensor() is good for (which works like from_numpy()).

This operation preserves the type of the array:

x_train_tensor = torch.as_tensor(x_train)
x_train.dtype, x_train_tensor.dtype

Output

(dtype('float64'), torch.float64)

You can also easily cast it to a different type, like a lower-precision (32-bit) float,
which will occupy less space in memory, using float():

float_tensor = x_train_tensor.float()
float_tensor.dtype

Output

torch.float32

IMPORTANT: Both as_tensor() and from_numpy() return a
tensor that shares the underlying data with the original Numpy
array. Similar to what happened when we used view() in the last
section, if you modify the original Numpy array, you’re modifying
the corresponding PyTorch tensor too, and vice-versa.

dummy_array = np.array([1, 2, 3])
dummy_tensor = torch.as_tensor(dummy_array)
# Modifies the numpy array
dummy_array[1] = 0
# Tensor gets modified too...
dummy_tensor

Output

tensor([1, 0, 3])

"What do I need as_tensor() for? Why can’t I just use
torch.tensor()?"

Well, you could … just keep in mind that torch.tensor() always makes a copy of
the data, instead of sharing the underlying data with the Numpy array.
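This difference is easy to verify—a quick sketch (the array and tensor names are just for illustration):

```python
import numpy as np
import torch

source = np.array([1.0, 2.0, 3.0])
shared = torch.as_tensor(source)  # shares memory with source
copied = torch.tensor(source)     # always makes a copy

source[0] = 42.0
# shared reflects the change; copied keeps the original value
print(shared[0].item(), copied[0].item())
```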

You can also perform the opposite operation, namely, transforming a PyTorch
tensor back to a Numpy array. That’s what numpy() is good for:

dummy_tensor.numpy()

Output

array([1, 0, 3])

So far, we have only created CPU tensors. What does it mean? It means the data in
the tensor is stored in the computer’s main memory and any operations performed
on it are going to be handled by its CPU (the central processing unit; for instance,
an Intel® Core™ i7 Processor). So, although the data is, technically speaking, in the
memory, we’re still calling this kind of tensor a CPU tensor.

"Is there any other kind of tensor?"

Yes, there is also a GPU tensor. A GPU (which stands for graphics processing unit)
is the processor of a graphics card. These tensors store their data in the graphics
card’s memory, and operations on top of them are performed by the GPU. For



more information on the differences between CPUs and GPUs, please refer to this
link[44].

If you have a graphics card from NVIDIA, you can use the power of its GPU to
speed up model training. PyTorch supports the use of these GPUs for model
training using CUDA (Compute Unified Device Architecture), which needs to be
previously installed and configured (please refer to the "Setup Guide" for more
information on this).

If you do have a GPU (and you managed to install CUDA), we’re getting to the part
where you get to use it with PyTorch. But, even if you do not have a GPU, you
should stick around in this section anyway. Why? First, you can use a free GPU
from Google Colab, and, second, you should always make your code GPU-ready;
that is, it should automatically run in a GPU, if one is available.

"How do I know if a GPU is available?"

PyTorch has your back once more—you can use cuda.is_available() to find out if
you have a GPU at your disposal and set your device accordingly. So, it is good
practice to figure this out at the top of your code:

Defining Your Device

device = 'cuda' if torch.cuda.is_available() else 'cpu'

So, if you don’t have a GPU, your device is called cpu. If you do have a GPU, your
device is called cuda or cuda:0. Why isn’t it called gpu, then? Don’t ask me… The
important thing is, your code will be able to always use the appropriate device.

"Why cuda:0? Are there others, like cuda:1, cuda:2 and so on?"

There may be if you are lucky enough to have multiple GPUs in your computer. Since
this is usually not the case, I am assuming you have either one GPU or none. So,
when we tell PyTorch to send a tensor to cuda without any numbering, it will send it
to the current CUDA device, which is device #0 by default.

If you are using someone else’s computer and you don’t know how many GPUs it
has, or which model they are, you can figure it out using cuda.device_count() and
cuda.get_device_name():

n_cudas = torch.cuda.device_count()
for i in range(n_cudas):
    print(torch.cuda.get_device_name(i))

Output

GeForce GTX 1060 6GB

In my case, I have only one GPU, and it is a GeForce GTX 1060 model with 6 GB RAM.

There is only one thing left to do: turn our tensor into a GPU tensor. That’s what
to() is good for. It sends a tensor to the specified device.

gpu_tensor = torch.as_tensor(x_train).to(device)
gpu_tensor[0]

Output - GPU

tensor([0.7713], device='cuda:0', dtype=torch.float64)

Output - CPU

tensor([0.7713], dtype=torch.float64)

In this case, there is no device information in the printed output because PyTorch
simply assumes the default (cpu).

"Should I use to(device), even if I am using CPU only?"

Yes, you should, because there is no cost in doing so. If you have only a CPU, your
tensor is already a CPU tensor, so nothing will happen. But if you share your code
with others on GitHub, whoever has a GPU will benefit from it.
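In fact, calling to() is a no-op whenever the tensor already lives on the requested device—PyTorch returns the very same object. A quick check:

```python
import torch

cpu_tensor = torch.zeros(3)
# No device (or dtype) change is needed, so to() returns the SAME object
same = cpu_tensor.to('cpu')
print(same is cpu_tensor)  # True: it was a no-op
```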

Let’s put it all together now and make our training data ready for PyTorch.



Notebook Cell 1.3 - Loading data: turning Numpy arrays into PyTorch tensors

1 device = 'cuda' if torch.cuda.is_available() else 'cpu'
2
3 # Our data was in Numpy arrays, but we need to transform them
4 # into PyTorch tensors and then send them to the
5 # chosen device
6 x_train_tensor = torch.as_tensor(x_train).float().to(device)
7 y_train_tensor = torch.as_tensor(y_train).float().to(device)

So, we defined a device, converted both Numpy arrays into PyTorch tensors, cast
them to floats, and sent them to the device. Let’s take a look at the types:

# Here we can see the difference - notice that .type() is more
# useful since it also tells us WHERE the tensor is (device)
print(type(x_train), type(x_train_tensor), x_train_tensor.type())

Output - GPU

<class 'numpy.ndarray'> <class 'torch.Tensor'>
torch.cuda.FloatTensor

Output - CPU

<class 'numpy.ndarray'> <class 'torch.Tensor'>
torch.FloatTensor

If you compare the types of both variables, you’ll get what you’d expect:
numpy.ndarray for the first one and torch.Tensor for the second one.

But where does the x_train_tensor "live"? Is it a CPU or a GPU tensor? You can’t
say, but if you use PyTorch’s type(), it will reveal its location
—torch.cuda.FloatTensor—a GPU tensor in this case (assuming the output was
generated using a GPU, of course).

There is one more thing to be aware of when using GPU tensors. Remember
numpy()? What if we want to turn a GPU tensor back into a Numpy array? We’ll get
an error:

back_to_numpy = x_train_tensor.numpy()

Output

TypeError: can't convert CUDA tensor to numpy. Use
Tensor.cpu() to copy the tensor to host memory first.

Unfortunately, Numpy cannot handle GPU tensors! You need to make them CPU
tensors first using cpu():

back_to_numpy = x_train_tensor.cpu().numpy()

So, to avoid this error, first use cpu() and then numpy(), even if you are using a
CPU. It follows the same principle of to(device): You can share your code with
others who may be using a GPU.
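If you find yourself doing this often, the whole chain can be wrapped in a small helper—not part of PyTorch, just a convenience sketch (the function name is made up):

```python
import torch

def to_numpy(t):
    # detach() drops the tensor from the computation graph (if any),
    # cpu() moves it to main memory (a no-op for CPU tensors),
    # and numpy() performs the actual conversion
    return t.detach().cpu().numpy()

x = torch.tensor([1.0, 2.0], requires_grad=True)
arr = to_numpy(x)  # works even for gradient-requiring tensors
```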

Creating Parameters

What distinguishes a tensor used for training data (or validation, or test)—like the
ones we’ve just created—from a tensor used as a (trainable) parameter / weight?

The latter requires the computation of its gradients, so we can update their values
(the parameters’ values, that is). That’s what the requires_grad=True argument is
good for. It tells PyTorch to compute gradients for us.

A tensor for a learnable parameter requires a gradient!

You may be tempted to create a simple tensor for a parameter and, later on, send it
to your chosen device, as we did with our data, right? Not so fast…



In the next few pages, I will present four chunks of code showing
different attempts at creating parameters.

The first three attempts are shown to build up to a solution. The
first one only works well if you never use a GPU. The second one
doesn’t work at all. The third one works, but it is too verbose.

The recommended way of creating parameters is the last:
Notebook Cell 1.4.

The first chunk of code below creates two tensors for our parameters, including
gradients and all. But they are CPU tensors, by default.

# FIRST
# Initializes parameters "b" and "w" randomly, ALMOST as we
# did in Numpy, since we want to apply gradient descent on
# these parameters we need to set REQUIRES_GRAD = TRUE
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float)
w = torch.randn(1, requires_grad=True, dtype=torch.float)
print(b, w)

Output

tensor([0.3367], requires_grad=True)
tensor([0.1288], requires_grad=True)

Never forget to set the seed to ensure reproducibility, just like
we did before while using Numpy. PyTorch’s equivalent is
torch.manual_seed().

"If I use the same seed in PyTorch as I used in Numpy (or, to put it
differently, if I use 42 everywhere), will I get the same numbers?"

Unfortunately, NO.

You’ll get the same numbers for the same seed in the same package. PyTorch
generates a number sequence that is different from the one generated by Numpy,
even if you use the same seed in both.
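Within a single package, though, seeding works as expected—a quick sanity check in PyTorch:

```python
import torch

torch.manual_seed(42)
first = torch.randn(3)
torch.manual_seed(42)
second = torch.randn(3)
# Same seed, same package, same device: identical numbers
print(torch.equal(first, second))  # True
```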

I am assuming you’d like to use your GPU (or the one from Google Colab), right? So
we need to send those tensors to the device. We can try the naive approach, the
one that worked well for sending the training data to the device. That’s our second
(and failed) attempt:

# SECOND
# But what if we want to run it on a GPU? We could just
# send them to device, right?
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)
w = torch.randn(1, requires_grad=True, dtype=torch.float).to(device)
print(b, w)
# Sorry, but NO! The to(device) "shadows" the gradient...

Output

tensor([0.3367], device='cuda:0', grad_fn=<CopyBackwards>)
tensor([0.1288], device='cuda:0', grad_fn=<CopyBackwards>)

We succeeded in sending them to another device, but we "lost" the gradients
somehow, since there is no more requires_grad=True (don’t bother with the weird
grad_fn). Clearly, we need to do better…

In the third chunk, we first send our tensors to the device and then use the
requires_grad_() method to set its requires_grad attribute to True in place.

In PyTorch, every method that ends with an underscore (_), like
the requires_grad_() method above, makes changes in-place,
meaning it modifies the underlying variable.



# THIRD
# We can create regular tensors and send them to
# the device (as we did with our data)
torch.manual_seed(42)
b = torch.randn(1, dtype=torch.float).to(device)
w = torch.randn(1, dtype=torch.float).to(device)
# and THEN set them as requiring gradients...
b.requires_grad_()
w.requires_grad_()
print(b, w)

Output

tensor([0.3367], device='cuda:0', requires_grad=True)
tensor([0.1288], device='cuda:0', requires_grad=True)

This approach worked fine; we managed to end up with gradient-requiring GPU
tensors for our parameters b and w. It seems a lot of work, though… Can we do
better still?

Yes, we can do better: We can assign tensors to a device at the moment of their
creation.

Notebook Cell 1.4 - Actually creating variables for the coefficients

# FINAL
# We can specify the device at the moment of creation
# RECOMMENDED!

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
print(b, w)

Output

tensor([0.1940], device='cuda:0', requires_grad=True)
tensor([0.1391], device='cuda:0', requires_grad=True)

Much easier, right?

Always assign tensors to a device at the moment of their
creation to avoid unexpected behaviors!

If you do not have a GPU, your outputs are going to be slightly different:

Output - CPU

tensor([0.3367], requires_grad=True)
tensor([0.1288], requires_grad=True)

"Why are they different, even if I am using the same seed?"

Similar to what happens when using the same seed in different packages (Numpy
and PyTorch), we also get different sequences of random numbers if PyTorch
generates them in different devices (CPU and GPU).

Now that we know how to create tensors that require gradients, let’s see how
PyTorch handles them. That’s the role of the…

Autograd
Autograd is PyTorch’s automatic differentiation package. Thanks to it, we don’t need
to worry about partial derivatives, chain rule, or anything like it.

backward

So, how do we tell PyTorch to do its thing and compute all gradients? That’s the
role of the backward() method. It will compute gradients for all (gradient-requiring)
tensors involved in the computation of a given variable.

Do you remember the starting point for computing the gradients? It was the loss,
as we computed its partial derivatives w.r.t. our parameters. Hence, we need to
invoke the backward() method from the corresponding Python variable:
loss.backward().

Notebook Cell 1.5 - Autograd in action!

# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train_tensor

# Step 2 - Computes the loss
# We are using ALL data points, so this is BATCH gradient
# descent. How wrong is our model? That's the error!
error = (yhat - y_train_tensor)
# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()

# Step 3 - Computes gradients for both "b" and "w" parameters
# No more manual computation of gradients!
# b_grad = 2 * error.mean()
# w_grad = 2 * (x_tensor * error).mean()
loss.backward() ①

① New "Step 3 - Computing Gradients" using backward()

Which tensors are going to be handled by the backward() method applied to the
loss?

• b
• w
• yhat
• error

We have set requires_grad=True to both b and w, so they are obviously included in
the list. We use them both to compute yhat, so it will also make it to the list. Then
we use yhat to compute the error, which is also added to the list.

Do you see the pattern here? If a tensor in the list is used to compute another
tensor, the latter will also be included in the list. Tracking these dependencies is
exactly what the dynamic computation graph is doing, as we’ll see shortly.

What about x_train_tensor and y_train_tensor? They are involved in the
computation too, but we created them as non-gradient-requiring tensors, so
backward() does not care about them.

print(error.requires_grad, yhat.requires_grad, \
      b.requires_grad, w.requires_grad)
print(y_train_tensor.requires_grad, x_train_tensor.requires_grad)

Output

True True True True
False False

grad

What about the actual values of the gradients? We can inspect them by looking at
the grad attribute of a tensor.

print(b.grad, w.grad)

Output

tensor([-3.3881], device='cuda:0')
tensor([-1.9439], device='cuda:0')

If you check the method’s documentation, it clearly states that gradients are
accumulated. What does that mean? It means that, if we run Notebook Cell 1.5's
code (Steps 1 to 3) twice and check the grad attribute afterward, we will end up
with:

Output

tensor([-6.7762], device='cuda:0')
tensor([-3.8878], device='cuda:0')

If you do not have a GPU, your outputs are going to be slightly different:

Output

tensor([-3.1125]) tensor([-1.8156])



Output

tensor([-6.2250]) tensor([-3.6313])

These gradients' values are exactly twice as much as they were before, as
expected!

OK, but that is actually a problem: We need to use the gradients corresponding to
the current loss to perform the parameter update. We should NOT use
accumulated gradients.

"If accumulating gradients is a problem, why does PyTorch do it by
default?"

It turns out this behavior can be useful to circumvent hardware limitations.

During the training of large models, the necessary number of data points in a mini-
batch may be too large to fit in memory (of the graphics card). How can one solve
this, other than buying more-expensive hardware?

One can split a mini-batch into "sub-mini-batches" (horrible name, I know, don’t
quote me on this!), compute the gradients for those "subs" and accumulate them to
achieve the same result as computing the gradients on the full mini-batch.

Sounds confusing? No worries, this is fairly advanced already and somewhat
outside of the scope of this book, but I thought this particular behavior of PyTorch
needed to be explained.
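Just to make the idea concrete, here is a rough sketch of gradient accumulation. The split into two halves and the loss scaling are illustrative choices, not part of our actual training loop:

```python
import torch

torch.manual_seed(42)
x = torch.randn(8, 1)
y = 1 + 2 * x

b = torch.zeros(1, requires_grad=True)
w = torch.zeros(1, requires_grad=True)

# Two "sub-mini-batches" covering the full mini-batch of 8 points
for x_sub, y_sub in [(x[:4], y[:4]), (x[4:], y[4:])]:
    yhat = b + w * x_sub
    # Scale each sub-batch loss so the SUM matches the full-batch mean
    loss = ((yhat - y_sub) ** 2).mean() / 2
    loss.backward()  # gradients ACCUMULATE in b.grad and w.grad

# b.grad and w.grad now hold the same gradients we would get
# from a single backward pass over all eight points at once
```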

Luckily, this is easy to solve!

zero_

Every time we use the gradients to update the parameters, we need to zero the
gradients afterward. And that’s what zero_() is good for.

# This code will be placed _after_ Step 4
# (updating the parameters)
b.grad.zero_(), w.grad.zero_()

Output

(tensor([0.], device='cuda:0'),
tensor([0.], device='cuda:0'))

What does the underscore (_) at the end of the method’s name
mean? Do you remember? If not, go back to the previous section
and find out.
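As a quick refresher, the convention holds across PyTorch’s API—compare add() with add_():

```python
import torch

t = torch.ones(2)
t.add(1)   # out-of-place: returns a new tensor, t stays the same
t.add_(1)  # in-place: modifies t itself
print(t)   # tensor([2., 2.])
```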

So, let’s ditch the manual computation of gradients and use both the backward()
and zero_() methods instead.

That’s it? Well, pretty much … but there is always a catch, and this time it has to do
with the update of the parameters.

Updating Parameters

"One does not simply update parameters…"

Boromir

Unfortunately, our Numpy's code for updating parameters is not enough. Why
not?! Let’s try it out, simply copying and pasting it (this is the first attempt), changing
it slightly (second attempt), and then asking PyTorch to back off (yes, it is PyTorch’s
fault!).

Notebook Cell 1.6 - Updating parameters

1 # Sets learning rate - this is "eta" ~ the "n"-like Greek letter
2 lr = 0.1
3
4 # Step 0 - Initializes parameters "b" and "w" randomly
5 torch.manual_seed(42)
6 b = torch.randn(1, requires_grad=True, \
7                 dtype=torch.float, device=device)
8 w = torch.randn(1, requires_grad=True, \
9                 dtype=torch.float, device=device)
10
11 # Defines number of epochs
12 n_epochs = 1000
13
14 for epoch in range(n_epochs):
15     # Step 1 - Computes model's predicted output - forward pass
16     yhat = b + w * x_train_tensor
17
18     # Step 2 - Computes the loss
19     # We are using ALL data points, so this is BATCH gradient
20     # descent. How wrong is our model? That's the error!
21     error = (yhat - y_train_tensor)
22     # It is a regression, so it computes mean squared error (MSE)
23     loss = (error ** 2).mean()
24
25     # Step 3 - Computes gradients for both "b" and "w"
26     # parameters. No more manual computation of gradients!
27     # b_grad = 2 * error.mean()
28     # w_grad = 2 * (x_tensor * error).mean()
29     # We just tell PyTorch to work its way BACKWARDS
30     # from the specified loss!
31     loss.backward()
32
33     # Step 4 - Updates parameters using gradients and
34     # the learning rate. But not so fast...
35     # FIRST ATTEMPT - just using the same code as before
36     # AttributeError: 'NoneType' object has no attribute 'zero_'
37     # b = b - lr * b.grad ①
38     # w = w - lr * w.grad ①
39     # print(b) ①
40
41     # SECOND ATTEMPT - using in-place Python assignment
42     # RuntimeError: a leaf Variable that requires grad
43     # has been used in an in-place operation.
44     # b -= lr * b.grad ②
45     # w -= lr * w.grad ②
46
47     # THIRD ATTEMPT - NO_GRAD for the win!
48     # We need to use NO_GRAD to keep the update out of
49     # the gradient computation. Why is that? It boils
50     # down to the DYNAMIC GRAPH that PyTorch uses...
51     with torch.no_grad(): ③
52         b -= lr * b.grad ③
53         w -= lr * w.grad ③
54
55     # PyTorch is "clingy" to its computed gradients; we
56     # need to tell it to let it go...
57     b.grad.zero_() ④
58     w.grad.zero_() ④
59
60 print(b, w)

① First Attempt: leads to an AttributeError

② Second Attempt: leads to a RuntimeError

③ Third Attempt: no_grad() solves the problem!

④ zero_() prevents gradient accumulation

In the first attempt, if we use the same update structure as in our Numpy code, we’ll
get the weird error below, but we can get a hint of what’s going on by looking at the
tensor itself. Once again, we "lost" the gradient while reassigning the update
results to our parameters. Thus, the grad attribute turns out to be None, and it
raises the error.

Output - First Attempt - Keeping the same code

tensor([0.7518], device='cuda:0', grad_fn=<SubBackward0>)
AttributeError: 'NoneType' object has no attribute 'zero_'

We then change it slightly, using a familiar in-place Python assignment in our
second attempt. And, once again, PyTorch complains about it and raises an error.

Output - Second Attempt - In-place assignment

RuntimeError: a leaf Variable that requires grad has been used in
an in-place operation.

Why?! It turns out to be a case of "too much of a good thing." The culprit is
PyTorch’s ability to build a dynamic computation graph from every Python
operation that involves any gradient-computing tensor or its dependencies.

We’ll go deeper into the inner workings of the dynamic computation graph in the
next section.

Time for our third attempt…



no_grad

So, how do we tell PyTorch to "back off" and let us update our parameters without
messing up its fancy dynamic computation graph? That’s what torch.no_grad() is
good for. It allows us to perform regular Python operations on tensors without
affecting PyTorch’s computation graph.
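A minimal sketch of the effect:

```python
import torch

w = torch.ones(1, requires_grad=True)
with torch.no_grad():
    doubled = w * 2  # computed normally, but NOT tracked by autograd
print(doubled.requires_grad)  # False: no graph was built here
```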

Finally, we managed to successfully run our model and get the resulting
parameters. Surely enough, they match the ones we got in our Numpy-only
implementation.

Output - Third Attempt - NO_GRAD for the win!

# THIRD ATTEMPT - NO_GRAD for the win!
tensor([1.0235], device='cuda:0', requires_grad=True)
tensor([1.9690], device='cuda:0', requires_grad=True)

Remember:

"One does not simply update parameters … without no_grad"

Boromir

It was true for going into Mordor, and it is also true for updating parameters.

It turns out, no_grad() has another use case other than allowing us to update
parameters; we’ll get back to it in Chapter 2 when dealing with a model’s
evaluation.

Dynamic Computation Graph


"Unfortunately, no one can be told what the dynamic computation
graph is. You have to see it for yourself."

Morpheus

How great was The Matrix? Right? Right? But, jokes aside, I want you to see the
graph for yourself too!

The PyTorchViz package and its make_dot(variable) method allow us to easily
visualize a graph associated with a given Python variable involved in the gradient
computation.

If you chose "Local Installation" in the "Setup Guide" and skipped
or had issues with Step 5 ("Install GraphViz software and
TorchViz package"), you will get an error when trying to visualize
the graphs using make_dot.

So, let’s stick with the bare minimum: two (gradient-computing) tensors for our
parameters, predictions, errors, and loss—these are Steps 0, 1, and 2.

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train_tensor
# Step 2 - Computes the loss
error = (yhat - y_train_tensor)
loss = (error ** 2).mean()
# We can try plotting the graph for any variable: yhat, error, loss
make_dot(yhat)

Running the code above will produce the graph below:

Figure 1.5 - Computation graph generated for yhat; Obs.: the corresponding variable names were
inserted manually



Let’s take a closer look at its components:

• blue boxes ((1)s): these boxes correspond to the tensors we use as
parameters, the ones we’re asking PyTorch to compute gradients for

• gray boxes (MulBackward0 and AddBackward0): Python operations that involve
gradient-computing tensors or their dependencies

• green box ((80, 1)): the tensor used as the starting point for the computation
of gradients (assuming the backward() method is called from the variable used
to visualize the graph)—they are computed from the bottom-up in a graph

Now, take a closer look at the gray box at the bottom of the graph: Two arrows are
pointing to it since it is adding up two variables, b and w*x. Seems obvious, right?

Then, look at the other gray box (MulBackward0) of the same graph: It is performing
a multiplication operation, namely, w*x. But there is only one arrow pointing to it!
The arrow comes from the blue box that corresponds to our parameter w.

"Why don’t we have a box for our data (x)?"

The answer is: We do not compute gradients for it!

So, even though there are more tensors involved in the operations performed by
the computation graph, it only shows gradient-computing tensors and their
dependencies.

What would happen to the computation graph if we set requires_grad to False for
our parameter b?

b_nograd = torch.randn(1, requires_grad=False, \
                       dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)

yhat = b_nograd + w * x_train_tensor

make_dot(yhat)



Figure 1.6 - Now parameter "b" does NOT have its gradient computed, but it is STILL used in
computation

Unsurprisingly, the blue box corresponding to parameter b is no more!

Simple enough: No gradients, no graph!

The best thing about the dynamic computation graph is that you can make it as
complex as you want it. You can even use control flow statements (e.g., if
statements) to control the flow of the gradients.

Figure 1.7 shows an example of this. And yes, I do know that the computation itself
is complete nonsense!

b = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
yhat = b + w * x_train_tensor
error = yhat - y_train_tensor
loss = (error ** 2).mean()
# this makes no sense!!
if loss > 0:
    yhat2 = w * x_train_tensor
    error2 = yhat2 - y_train_tensor
    # neither does this!!
    loss += error2.mean()
make_dot(loss)



Figure 1.7 - Complex (and nonsensical!) computation graph just to make a point

Even though the computation is nonsensical, you can clearly see the effect of
adding a control flow statement like if loss > 0: It branches the computation
graph into two parts. The right branch performs the computation inside the if
statement, which gets added to the result of the left branch in the end. Cool, right?
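Since the graph is rebuilt on every forward pass, gradients simply follow whichever branch actually ran—a small sketch (the function and its branches are made up for illustration):

```python
import torch

def forward(x, w):
    # The graph records only the operations that actually run
    if x.sum() > 0:
        return (w * x).sum()
    return (2 * w * x).sum()

w = torch.ones(3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])  # positive sum: first branch runs
forward(x, w).backward()
print(w.grad)  # equals x, the gradient of the branch that was taken
```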

Even though we are not building more-complex models like that in this book, this
small example illustrates very well PyTorch’s capabilities and how easily they can
be implemented in code.

Optimizer
So far, we’ve been manually updating the parameters using the computed
gradients. That’s probably fine for two parameters, but what if we had a whole lot
of them? We need to use one of PyTorch’s optimizers, like SGD, RMSprop, or
Adam.

There are many optimizers: SGD is the most basic of them, and
Adam is one of the most popular.

Different optimizers use different mechanics for updating the
parameters, but they all achieve the same goal through, literally,
different paths.

To see what I mean by this, check out this animated GIF[45]
developed by Alec Radford[46], available at Stanford’s "CS231n:
Convolutional Neural Networks for Visual Recognition"[47]
course. The animation shows a loss surface, just like the ones we
computed in Chapter 0, and the paths traversed by some
optimizers to achieve the minimum (represented by a star).

Remember, the choice of mini-batch size influences the path of
gradient descent, and so does the choice of an optimizer.

step / zero_grad

An optimizer takes the parameters we want to update, the learning rate we want
to use (and possibly many other hyper-parameters as well!), and performs the
updates through its step() method.

# Defines an SGD optimizer to update the parameters
optimizer = optim.SGD([b, w], lr=lr)

Besides, we also don’t need to zero the gradients one by one anymore. We just
invoke the optimizer’s zero_grad() method, and that’s it!

In the code below, we create a stochastic gradient descent (SGD) optimizer to update
our parameters b and w.

Don’t be fooled by the optimizer’s name: If we use all training
data at once for the update—as we are actually doing in the
code—the optimizer is performing a batch gradient descent,
despite its name.



Notebook Cell 1.7 - PyTorch’s optimizer in action—no more manual update of parameters!

1 # Sets learning rate - this is "eta" ~ the "n"-like Greek letter
2 lr = 0.1
3
4 # Step 0 - Initializes parameters "b" and "w" randomly
5 torch.manual_seed(42)
6 b = torch.randn(1, requires_grad=True, \
7                 dtype=torch.float, device=device)
8 w = torch.randn(1, requires_grad=True, \
9                 dtype=torch.float, device=device)
10
11 # Defines a SGD optimizer to update the parameters
12 optimizer = optim.SGD([b, w], lr=lr) ①
13
14 # Defines number of epochs
15 n_epochs = 1000
16
17 for epoch in range(n_epochs):
18     # Step 1 - Computes model's predicted output - forward pass
19     yhat = b + w * x_train_tensor
20
21     # Step 2 - Computes the loss
22     # We are using ALL data points, so this is BATCH gradient
23     # descent. How wrong is our model? That's the error!
24     error = (yhat - y_train_tensor)
25     # It is a regression, so it computes mean squared error (MSE)
26     loss = (error ** 2).mean()
27
28     # Step 3 - Computes gradients for both "b" and "w" parameters
29     loss.backward()
30
31     # Step 4 - Updates parameters using gradients and
32     # the learning rate. No more manual update!
33     # with torch.no_grad():
34     #     b -= lr * b.grad
35     #     w -= lr * w.grad
36     optimizer.step() ②
37
38     # No more telling Pytorch to let gradients go!
39     # b.grad.zero_()
40     # w.grad.zero_()
41     optimizer.zero_grad() ③
42
43 print(b, w)

① Defining an optimizer

② New "Step 4 - Updating Parameters" using the optimizer

③ New "gradient zeroing" using the optimizer

Let’s inspect our two parameters just to make sure everything is still working fine:

Output

tensor([1.0235], device='cuda:0', requires_grad=True)
tensor([1.9690], device='cuda:0', requires_grad=True)

Cool! We’ve optimized the optimization process :-) What’s left?

Loss
We now tackle the loss computation. As expected, PyTorch has us covered once
again. There are many loss functions to choose from, depending on the task at
hand. Since ours is a regression, we are using the mean squared error (MSE) as loss,
and thus we need PyTorch’s nn.MSELoss():

# Defines an MSE loss function
loss_fn = nn.MSELoss(reduction='mean')
loss_fn

Output

MSELoss()

Notice that nn.MSELoss() is NOT the loss function itself: We do not pass
predictions and labels to it! Instead, as you can see, it returns another function,
which we called loss_fn: That is the actual loss function. So, we can pass a
prediction and a label to it and get the corresponding loss value:



# This is a random example to illustrate the loss function
predictions = torch.tensor([0.5, 1.0])
labels = torch.tensor([2.0, 1.3])
loss_fn(predictions, labels)

Output

tensor(1.1700)

Moreover, you can also specify a reduction method to be applied;
that is, how do you want to aggregate the errors for individual
points? You can average them (reduction="mean") or simply sum
them up (reduction="sum"). In our example, we use the typical
mean reduction to compute MSE. If we had used sum as reduction,
we would actually be computing SSE (sum of squared errors).
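To make the difference concrete, here is a quick sketch comparing both reductions, using the same illustrative predictions and labels as above:

```python
import torch
import torch.nn as nn

predictions = torch.tensor([0.5, 1.0])
labels = torch.tensor([2.0, 1.3])

# Squared errors are (-1.5)^2 = 2.25 and (-0.3)^2 = 0.09
mse = nn.MSELoss(reduction='mean')(predictions, labels)  # (2.25 + 0.09) / 2
sse = nn.MSELoss(reduction='sum')(predictions, labels)   # 2.25 + 0.09

print(mse, sse)  # tensor(1.1700) tensor(2.3400)
```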

Technically speaking, nn.MSELoss() is a higher-order function.
If you're not familiar with the concept, I will explain it briefly in
Chapter 2.
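Without spoiling Chapter 2, the gist can be sketched with a toy function of our own. This is an illustration only, not how nn.MSELoss() is actually implemented:

```python
import torch

def make_mse_loss(reduction='mean'):
    # A higher-order function: it builds and RETURNS another function
    def loss_fn(predictions, labels):
        errors = (predictions - labels) ** 2
        return errors.mean() if reduction == 'mean' else errors.sum()
    return loss_fn

# Just like nn.MSELoss(), calling it gives us the actual loss function
toy_loss_fn = make_mse_loss(reduction='mean')
print(toy_loss_fn(torch.tensor([0.5, 1.0]), torch.tensor([2.0, 1.3])))
# tensor(1.1700)
```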

We then use the created loss function in the code below, at line 29, to compute the
loss, given our predictions and our labels:

Notebook Cell 1.8 - PyTorch’s loss in action: no more manual loss computation!

 1 # Sets learning rate - this is "eta" ~ the "n"-like
 2 # Greek letter
 3 lr = 0.1
 4
 5 # Step 0 - Initializes parameters "b" and "w" randomly
 6 torch.manual_seed(42)
 7 b = torch.randn(1, requires_grad=True, \
 8                 dtype=torch.float, device=device)
 9 w = torch.randn(1, requires_grad=True, \
10                 dtype=torch.float, device=device)
11
12 # Defines an SGD optimizer to update the parameters
13 optimizer = optim.SGD([b, w], lr=lr)
14
15 # Defines an MSE loss function
16 loss_fn = nn.MSELoss(reduction='mean') ①
17
18 # Defines number of epochs
19 n_epochs = 1000
20
21 for epoch in range(n_epochs):
22     # Step 1 - Computes model's predicted output - forward pass
23     yhat = b + w * x_train_tensor
24
25     # Step 2 - Computes the loss
26     # No more manual loss!
27     # error = (yhat - y_train_tensor)
28     # loss = (error ** 2).mean()
29     loss = loss_fn(yhat, y_train_tensor) ②
30
31     # Step 3 - Computes gradients for both "b" and "w" parameters
32     loss.backward()
33
34     # Step 4 - Updates parameters using gradients and
35     # the learning rate
36     optimizer.step()
37     optimizer.zero_grad()
38
39 print(b, w)



① Defining a loss function

② New "Step 2 - Computing Loss" using loss_fn()

Output

tensor([1.0235], device='cuda:0', requires_grad=True)


tensor([1.9690], device='cuda:0', requires_grad=True)

Let’s take a look at the loss value at the end of training…

loss

Output

tensor(0.0080, device='cuda:0', grad_fn=<MeanBackward0>)

What if we wanted to have it as a Numpy array? I guess we could just use numpy()
again, right? (And cpu() as well, since our loss is in the cuda device.)

loss.cpu().numpy()

Output

RuntimeError Traceback (most recent call last)
<ipython-input-43-58c76a7bac74> in <module>
----> 1 loss.cpu().numpy()

RuntimeError: Can't call numpy() on Variable that requires
grad. Use var.detach().numpy() instead.

What happened here? Unlike our data tensors, the loss tensor is actually computing
gradients; to use numpy(), we need to detach() the tensor from the computation
graph first:

loss.detach().cpu().numpy()

Output

array(0.00804466, dtype=float32)

This seems like a lot of work; there must be an easier way! And there is one,
indeed: We can use item(), for tensors with a single element, or tolist()
otherwise (it still returns a scalar if there is only one element, though).

print(loss.item(), loss.tolist())

Output

0.008044655434787273 0.008044655434787273

At this point, there’s only one piece of code left to change: the predictions. It is
then time to introduce PyTorch’s way of implementing a…

Model
In PyTorch, a model is represented by a regular Python class that inherits from the
Module class.

IMPORTANT: Are you comfortable with object-oriented
programming (OOP) concepts like classes, constructors, methods,
instances, and attributes?

If you're unsure about any of these terms, I'd strongly recommend
you follow tutorials like Real Python's "Object-Oriented
Programming (OOP) in Python 3"[48] and "Supercharge Your
Classes With Python super()"[49] before proceeding.

Having a good understanding of OOP is key to benefitting the
most from PyTorch's capabilities.

So, assuming you’re already comfortable with OOP, let’s dive into developing a
model in PyTorch.



The most fundamental methods a model class needs to implement are:

• __init__(self): It defines the parts that make up the model—in our case, two
parameters, b and w.

You are not limited to defining parameters, though. Models can
contain other models as their attributes as well, so you can easily
nest them. We'll see an example of this shortly as well.

Besides, do not forget to include super().__init__() to execute
the __init__() method of the parent class (nn.Module) before
your own.

• forward(self, x): It performs the actual computation; that is, it outputs a
prediction, given the input x.

It may seem weird but, whenever using your model to make
predictions, you should NOT call the forward(x) method!

You should call the whole model instead, as in model(x), to
perform a forward pass and output predictions.

The reason is, the call to the whole model involves extra steps,
namely, handling forward and backward hooks. If you don't use
hooks (and we don't use any right now), both calls are equivalent.

Hooks are a very useful mechanism that allows retrieving
intermediate values in deeper models. We'll get to them
eventually.
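As a teaser, here is a minimal sketch of a forward hook; the model, the hook function, and the captured dictionary below are made up for illustration:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 5), nn.Linear(5, 1))
captured = {}  # a place to stash intermediate values

def save_hidden(module, inputs, output):
    # A forward hook receives the module, its inputs, and its output
    captured['hidden'] = output.detach()

# Attach the hook to the first layer; calling model(x) triggers it
handle = model[0].register_forward_hook(save_hidden)
model(torch.randn(8, 3))
handle.remove()  # detach the hook once we're done with it

print(captured['hidden'].shape)  # torch.Size([8, 5])
```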

Let’s build a proper (yet simple) model for our regression task. It should look like
this:

Notebook Cell 1.9 - Building our "Manual" model, creating parameter by parameter!

 1 class ManualLinearRegression(nn.Module):
 2     def __init__(self):
 3         super().__init__()
 4         # To make "b" and "w" real parameters of the model,
 5         # we need to wrap them with nn.Parameter
 6         self.b = nn.Parameter(torch.randn(1,
 7                                           requires_grad=True,
 8                                           dtype=torch.float))
 9         self.w = nn.Parameter(torch.randn(1,
10                                           requires_grad=True,
11                                           dtype=torch.float))
12
13     def forward(self, x):
14         # Computes the outputs / predictions
15         return self.b + self.w * x

Parameters

In the __init__() method, we define our two parameters, b and w, using the
Parameter class, to tell PyTorch that these tensors, which are attributes of the
ManualLinearRegression class, should be considered parameters of the model the
class represents.

Why should we care about that? By doing so, we can use our model’s parameters()
method to retrieve an iterator over the model’s parameters, including parameters
of nested models. Then we can use it to feed our optimizer (instead of building a list
of parameters ourselves!).

torch.manual_seed(42)
# Creates a "dummy" instance of our ManualLinearRegression model
dummy = ManualLinearRegression()
list(dummy.parameters())



Output

[Parameter containing:
tensor([0.3367], requires_grad=True), Parameter containing:
tensor([0.1288], requires_grad=True)]
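For instance, we could create the optimizer for our dummy model like this (a small sketch that re-creates the class so the snippet stands on its own):

```python
import torch
import torch.nn as nn
import torch.optim as optim

class ManualLinearRegression(nn.Module):  # same model as above
    def __init__(self):
        super().__init__()
        self.b = nn.Parameter(torch.randn(1, dtype=torch.float))
        self.w = nn.Parameter(torch.randn(1, dtype=torch.float))

    def forward(self, x):
        return self.b + self.w * x

dummy = ManualLinearRegression()
# No need to build a list like [b, w] ourselves anymore
optimizer = optim.SGD(dummy.parameters(), lr=0.1)
print(len(optimizer.param_groups[0]['params']))  # 2
```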

state_dict

Moreover, we can get the current values of all parameters using our model’s
state_dict() method.

dummy.state_dict()

Output

OrderedDict([('b', tensor([0.3367])), ('w', tensor([0.1288]))])

The state_dict() of a given model is simply a Python dictionary that maps each
attribute / parameter to its corresponding tensor. But only learnable parameters
are included, as its purpose is to keep track of parameters that are going to be
updated by the optimizer.

By the way, the optimizer itself has a state_dict() too, which contains its internal
state, as well as other hyper-parameters. Let’s take a quick look at it:

optimizer.state_dict()

Output

{'state': {},
'param_groups': [{'lr': 0.1,
'momentum': 0,
'dampening': 0,
'weight_decay': 0,
'nesterov': False,
'params': [140535747664704, 140535747688560]}]}

"What do we need this for?"

It turns out, state dictionaries can also be used for checkpointing a model, as we will
see in Chapter 2.
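As a sneak peek, checkpointing boils down to saving (and later restoring) both state dictionaries. The snippet below is a rough sketch of that idea; the filename is arbitrary:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(1, 1))
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Bundle both state dicts and save them to disk
checkpoint = {'model': model.state_dict(),
              'optimizer': optimizer.state_dict()}
torch.save(checkpoint, 'checkpoint.pth')

# ...later on, load them back into freshly created objects
new_model = nn.Sequential(nn.Linear(1, 1))
new_optimizer = optim.SGD(new_model.parameters(), lr=0.1)
loaded = torch.load('checkpoint.pth')
new_model.load_state_dict(loaded['model'])
new_optimizer.load_state_dict(loaded['optimizer'])
```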

Device

IMPORTANT: We need to send our model to the same device
where the data is. If our data is made of GPU tensors, our model
must "live" inside the GPU as well.

If we were to send our dummy model to a device, it would look like this:

torch.manual_seed(42)
# Creates a "dummy" instance of our ManualLinearRegression model
# and sends it to the device
dummy = ManualLinearRegression().to(device)
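If you're ever unsure where a model currently "lives", you can check one of its parameters (a quick sketch; models don't have a .device attribute of their own):

```python
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Linear(1, 1).to(device)

# Inspect any parameter to find out where the model is
print(next(model.parameters()).device)  # cpu or cuda:0, depending on availability
```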

Forward Pass

The forward pass is the moment when the model makes predictions.

Remember: You should make predictions calling model(x).
DO NOT call model.forward(x)!

Otherwise, your model's hooks will not work (if you have them).

We can use all these handy methods to change our code, which should be looking
like this:



Notebook Cell 1.10 - PyTorch’s model in action: no more manual prediction / forward step!

 1 # Sets learning rate - this is "eta" ~ the "n"-like
 2 # Greek letter
 3 lr = 0.1
 4
 5 # Step 0 - Initializes parameters "b" and "w" randomly
 6 torch.manual_seed(42)
 7 # Now we can create a model and send it at once to the device
 8 model = ManualLinearRegression().to(device) ①
 9
10 # Defines an SGD optimizer to update the parameters
11 # (now retrieved directly from the model)
12 optimizer = optim.SGD(model.parameters(), lr=lr)
13
14 # Defines an MSE loss function
15 loss_fn = nn.MSELoss(reduction='mean')
16
17 # Defines number of epochs
18 n_epochs = 1000
19
20 for epoch in range(n_epochs):
21     model.train() # What is this?!? ②
22
23     # Step 1 - Computes model's predicted output - forward pass
24     # No more manual prediction!
25     yhat = model(x_train_tensor) ③
26
27     # Step 2 - Computes the loss
28     loss = loss_fn(yhat, y_train_tensor)
29
30     # Step 3 - Computes gradients for both "b" and "w" parameters
31     loss.backward()
32
33     # Step 4 - Updates parameters using gradients and
34     # the learning rate
35     optimizer.step()
36     optimizer.zero_grad()
37
38 # We can also inspect its parameters using its state_dict
39 print(model.state_dict())

① Instantiating a model

② What IS this?!?

③ New "Step 1 - Forward Pass" using a model

Now, the printed statements will look like this—final values for parameters b and w
are still the same, so everything is OK :-)

Output

OrderedDict([('b', tensor([1.0235], device='cuda:0')),
             ('w', tensor([1.9690], device='cuda:0'))])

train

I hope you noticed one particular statement in the code (line 21), to which I
assigned a comment "What is this?!?"—model.train().

In PyTorch, models have a train() method, which, somewhat
disappointingly, does NOT perform a training step. Its only
purpose is to set the model to training mode.

Why is this important? Some models may use mechanisms like
Dropout, for instance, which have distinct behaviors during
training and evaluation phases.

It is good practice to call model.train() in the training loop. It is also possible to set
a model to evaluation mode, but this is a topic for the next chapter.
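A quick sketch with Dropout (which we'll only meet properly later) makes the difference between the two modes visible:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
dropout = nn.Dropout(p=0.5)
points = torch.ones(10)

dropout.train()         # training mode: each value is zeroed with prob. 0.5
print(dropout(points))  # typically a mix of 0s and 2s (survivors scaled by 1/(1-p))

dropout.eval()          # evaluation mode: dropout does nothing
print(dropout(points))  # all ones again
```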

Nested Models

In our model, we manually created two parameters to perform a linear regression.


What if, instead of defining individual parameters, we use PyTorch’s Linear model?

We are implementing a single-feature linear regression, one input and one output, so
the corresponding linear model would look like this:

linear = nn.Linear(1, 1)
linear



Output

Linear(in_features=1, out_features=1, bias=True)

Do we still have our b and w parameters? Sure, we do:

linear.state_dict()

Output

OrderedDict([('weight', tensor([[-0.2191]])),
('bias', tensor([0.2018]))])

So, our former parameter b is the bias, and our former parameter w is the weight
(your values will be different since I haven’t set up a random seed for this example).

Now, let’s use PyTorch’s Linear model as an attribute of our own, thus creating a
nested model.


Even though this clearly is a contrived example, since we are pretty much wrapping
the underlying model without adding anything useful (or anything at all!) to it, it
illustrates the concept well.

Notebook Cell 1.11 - Building a model using PyTorch’s Linear model

class MyLinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        # Instead of our custom parameters, we use a Linear model
        # with a single input and a single output
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        # Now it only makes a call to the nested model
        return self.linear(x)

In the __init__() method, we create an attribute that contains our nested Linear
model.

In the forward() method, we call the nested model itself to perform the forward
pass (notice, we are not calling self.linear.forward(x)!).

Now, if we call the parameters() method of this model, PyTorch will figure out the
parameters of its attributes recursively.

torch.manual_seed(42)
dummy = MyLinearRegression().to(device)
list(dummy.parameters())

Output

[Parameter containing:
tensor([[0.7645]], device='cuda:0', requires_grad=True),
Parameter containing:
tensor([0.8300], device='cuda:0', requires_grad=True)]

You can also add extra Linear attributes, and, even if you don’t use them at all in
the forward pass, they will still be listed under parameters().

If you prefer, you can also use state_dict() to get the parameter values, together
with their names:

dummy.state_dict()

Output

OrderedDict([('linear.weight',
tensor([[0.7645]], device='cuda:0')),
('linear.bias',
tensor([0.8300], device='cuda:0'))])

Notice that both bias and weight have a prefix with the attribute name: linear, from
the self.linear in the __init__() method.
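If you only need the names (and, say, shapes), named_parameters() gives you the same prefixed names as an iterator; the class is re-created here so the snippet stands alone:

```python
import torch
import torch.nn as nn

class MyLinearRegression(nn.Module):  # same nested model as above
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

dummy = MyLinearRegression()
for name, param in dummy.named_parameters():
    print(name, tuple(param.shape))
# linear.weight (1, 1)
# linear.bias (1,)
```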



Sequential Models

Our model was simple enough. You may be thinking: "Why even bother to build a
class for it?!" Well, you have a point…

For straightforward models that use a series of built-in PyTorch models (like
Linear), where the output of one is sequentially fed as an input to the next, we can
use a, er … Sequential model :-)

In our case, we would build a sequential model with a single argument; that is, the
Linear model we used to train our linear regression. The model would look like
this:

Notebook Cell 1.12 - Building a model using PyTorch’s Sequential model

1 torch.manual_seed(42)
2 # Alternatively, you can use a Sequential model
3 model = nn.Sequential(nn.Linear(1, 1)).to(device)
4
5 model.state_dict()

Output

OrderedDict([('0.weight', tensor([[0.7645]], device='cuda:0')),
             ('0.bias', tensor([0.8300], device='cuda:0'))])

Simple enough, right?

We’ve been talking about models inside other models. This may get confusing real
quick, so let’s follow convention and call any internal model a layer.

Layers

A Linear model can be seen as a layer in a neural network.

Figure 1.8 - Layers of a neural network

In the figure above, the hidden layer would be nn.Linear(3, 5) (since it takes
three inputs—from the input layer—and generates five outputs), and the output
layer would be nn.Linear(5, 1) (since it takes five inputs—the outputs from the
hidden layer—and generates a single output).

If we use Sequential() to build it, it looks like this:

torch.manual_seed(42)
# Building the model from the figure above
model = nn.Sequential(nn.Linear(3, 5), nn.Linear(5, 1)).to(device)

model.state_dict()



Output

OrderedDict([
('0.weight',
tensor([[ 0.4414, 0.4792, -0.1353],
[ 0.5304, -0.1265, 0.1165],
[-0.2811, 0.3391, 0.5090],
[-0.4236, 0.5018, 0.1081],
[ 0.4266, 0.0782, 0.2784]],
device='cuda:0')),
('0.bias',
tensor([-0.0815, 0.4451, 0.0853, -0.2695, 0.1472],
device='cuda:0')),
('1.weight',
tensor([[-0.2060, -0.0524, -0.1816, 0.2967, -0.3530]],
device='cuda:0')),
('1.bias',
tensor([-0.2062], device='cuda:0'))])

Since this sequential model does not have attribute names, state_dict() uses
numeric prefixes.

You can also use a model’s add_module() method to name the layers:

torch.manual_seed(42)
# Building the model from the figure above
model = nn.Sequential()
model.add_module('layer1', nn.Linear(3, 5))
model.add_module('layer2', nn.Linear(5, 1))
model.to(device)

Output

Sequential(
(layer1): Linear(in_features=3, out_features=5, bias=True)
(layer2): Linear(in_features=5, out_features=1, bias=True)
)
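Alternatively (an equivalent sketch), you can pass an OrderedDict of named layers straight to Sequential's constructor:

```python
from collections import OrderedDict

import torch
import torch.nn as nn

torch.manual_seed(42)
# Same two-layer model, with layer names given upfront
model = nn.Sequential(OrderedDict([
    ('layer1', nn.Linear(3, 5)),
    ('layer2', nn.Linear(5, 1)),
]))

print(list(model.state_dict().keys()))
# ['layer1.weight', 'layer1.bias', 'layer2.weight', 'layer2.bias']
```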

There are MANY different layers that can be used in PyTorch:

• Convolution Layers
• Pooling Layers
• Padding Layers
• Non-linear Activations
• Normalization Layers
• Recurrent Layers
• Transformer Layers
• Linear Layers
• Dropout Layers
• Sparse Layers (embeddings)
• Vision Layers
• DataParallel Layers (multi-GPU)
• Flatten Layer

So far, we have just used a Linear layer. In the next volume of the series, we’ll use
many others, like convolution, pooling, padding, flatten, dropout, and non-linear
activations.

Putting It All Together


We’ve covered a lot of ground so far, from coding a linear regression in Numpy
using gradient descent to transforming it into a PyTorch model, step-by-step.

It is time to put it all together and organize our code into three fundamental parts,
namely:

• data preparation (not data generation!)
• model configuration
• model training

Let’s tackle these three parts, in order.



Data Preparation

There hasn’t been much data preparation up to this point, to be honest. After
generating our data points in Notebook Cell 1.1, the only preparation step
performed so far has been transforming Numpy arrays into PyTorch tensors, as in
Notebook Cell 1.3, which is reproduced below:

Define - Data Preparation V0

1 %%writefile data_preparation/v0.py
2
3 device = 'cuda' if torch.cuda.is_available() else 'cpu'
4
5 # Our data was in Numpy arrays, but we need to transform them
6 # into PyTorch's Tensors and then send them to the
7 # chosen device
8 x_train_tensor = torch.as_tensor(x_train).float().to(device)
9 y_train_tensor = torch.as_tensor(y_train).float().to(device)

Run - Data Preparation V0

%run -i data_preparation/v0.py

This part will get much more interesting in the next chapter when we get to use
Dataset and DataLoader classes :-)

"What's the purpose of saving cells to these files?"

We know we have to run the full sequence to train a model: data preparation,
model configuration, and model training. In Chapter 2, we’ll gradually improve each
of these parts, versioning them inside each corresponding folder. So, saving them to
files allows us to run a full sequence using different versions without having to
duplicate code.

Let’s say we start improving model configuration (and we will do exactly that in
Chapter 2), but the other two parts are still the same; how do we run the full
sequence?



We use magic, just like that:

%run -i data_preparation/v0.py
%run -i model_configuration/v1.py
%run -i model_training/v0.py

Since we’re using the -i option, it works exactly as if we had copied the code from
the files into a cell and executed it.

Jupyter’s Magic Commands

You probably noticed the somewhat unusual %%writefile and %run
commands above. These are built-in magic commands.[50] A magic is a kind of
shortcut that extends a notebook's capabilities.

We are using the following two magics to better organize our code:

• %%writefile[51]: As its name says, it writes the contents of the cell to a
file, but it does not run it, so we need to use yet another magic.

• %run[52]: It runs the named file inside the notebook as a program—but
independent of the rest of the notebook, so we need to use the -i
option to make all variables available, from both the notebook and the
file (technically speaking, the file is executed in IPython's namespace).

In a nutshell, a cell containing one of our three fundamental parts will be
written to a versioned file inside the folder corresponding to that part.

In the example above, we write the cell to the data_preparation folder,
name it v0.py, and then execute it using the %run -i magic.

Model Configuration

We have seen plenty of this part: from defining parameters b and w manually, then
wrapping them up using the Module class, to using layers in a Sequential model.
We have also defined a loss function and an optimizer for our particular linear
regression model.

For the purpose of organizing our code, we’ll include the following elements in the
model configuration part:



• a model
• a loss function (which needs to be chosen according to your model)
• an optimizer (although some people may disagree with this choice, it makes it
easier to further organize the code)

Most of the corresponding code can be found in Notebook Cell 1.10, lines 1-15, but
we’ll replace the ManualLinearRegression model with the Sequential model from
Notebook Cell 1.12:

Define - Model Configuration V0

1 %%writefile model_configuration/v0.py
2
3 # This is redundant now, but it won't be when we introduce
4 # Datasets...
5 device = 'cuda' if torch.cuda.is_available() else 'cpu'
6
7 # Sets learning rate - this is "eta" ~ the "n"-like Greek letter
8 lr = 0.1
9
10 torch.manual_seed(42)
11 # Now we can create a model and send it at once to the device
12 model = nn.Sequential(nn.Linear(1, 1)).to(device)
13
14 # Defines an SGD optimizer to update the parameters
15 # (now retrieved directly from the model)
16 optimizer = optim.SGD(model.parameters(), lr=lr)
17
18 # Defines an MSE loss function
19 loss_fn = nn.MSELoss(reduction='mean')

Run - Model Configuration V0

%run -i model_configuration/v0.py

Model Training

This is the last part, where the actual training takes place. It loops over the gradient
descent steps we saw at the beginning of this chapter:



• Step 1: compute model’s predictions
• Step 2: compute the loss
• Step 3: compute the gradients
• Step 4: update the parameters

This sequence is repeated over and over until the number of epochs is reached.
The corresponding code for this part also comes from Notebook Cell 1.10, lines
17-36.

"What happened to the random initialization step?"

Since we are not manually creating parameters anymore, the initialization is
handled inside each layer during model creation.
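That's also why torch.manual_seed(42) comes before model creation in the configuration script: each layer draws its initial weights at creation time, so the same seed reproduces the same starting point. A quick sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
layer_a = nn.Linear(1, 1)  # weights and bias initialized right here

torch.manual_seed(42)
layer_b = nn.Linear(1, 1)  # same seed, same random initialization

print(torch.equal(layer_a.weight, layer_b.weight))  # True
```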

Define - Model Training V0

 1 %%writefile model_training/v0.py
 2
 3 # Defines number of epochs
 4 n_epochs = 1000
 5
 6 for epoch in range(n_epochs):
 7     # Sets model to TRAIN mode
 8     model.train()
 9
10     # Step 1 - Computes model's predicted output - forward pass
11     yhat = model(x_train_tensor)
12
13     # Step 2 - Computes the loss
14     loss = loss_fn(yhat, y_train_tensor)
15
16     # Step 3 - Computes gradients for both "b" and "w" parameters
17     loss.backward()
18
19     # Step 4 - Updates parameters using gradients and
20     # the learning rate
21     optimizer.step()
22     optimizer.zero_grad()



Run - Model Training V0

%run -i model_training/v0.py

One last check to make sure we have everything right:

print(model.state_dict())

Output

OrderedDict([('0.weight', tensor([[1.9690]], device='cuda:0')),
             ('0.bias', tensor([1.0235], device='cuda:0'))])

Now, take a close, hard look at the code inside the training loop.

Ready? I have a question for you then…

"Would this code change if we were using a different optimizer, or
loss, or even model?"

Before I give you the answer, let me address something else that may be on your
mind: "What is the point of all this?"

Well, in the next chapter we’ll get fancier, using more of PyTorch’s classes (like
Dataset and DataLoader) to further refine our data preparation step, and we’ll also
try to reduce boilerplate code to a minimum. So, splitting our code into three
logical parts will allow us to better handle these improvements.

And here is the answer: NO, the code inside the loop would not change.

I guess you figured out which boilerplate I was referring to, right?



Recap
First of all, congratulations are in order: You have successfully implemented a fully
functioning model and training loop in PyTorch!

We have covered a lot of ground in this first chapter:

• implementing a linear regression in Numpy using gradient descent

• creating tensors in PyTorch, sending them to a device, and making parameters
out of them

• understanding PyTorch's main feature, autograd, to perform automatic
differentiation using its associated properties and methods, like backward(),
grad, zero_(), and no_grad()

• visualizing the dynamic computation graph associated with a sequence of
operations

• creating an optimizer to simultaneously update multiple parameters, using its
step() and zero_grad() methods

• creating a loss function using PyTorch's corresponding higher-order function
(more on that topic in the next chapter)

• understanding PyTorch's Module class and creating your own models,
implementing __init__() and forward() methods, and making use of its
built-in parameters() and state_dict() methods

• transforming the original Numpy implementation into a PyTorch one using the
elements above

• realizing the importance of including model.train() inside the training loop
(never forget that!)

• implementing nested and sequential models using PyTorch's layers

• putting it all together into neatly organized code divided into three distinct
parts: data preparation, model configuration, and model training

You are now ready for the next chapter. We’ll see more of PyTorch’s capabilities,
and we’ll further develop our training loop so it can be used for different problems
and models. You’ll be building your own, small draft of a library for training deep
learning models.

[39] https://github.com/dvgodoy/PyTorchStepByStep/blob/master/Chapter01.ipynb
[40] https://colab.research.google.com/github/dvgodoy/PyTorchStepByStep/blob/master/Chapter01.ipynb



[41] https://en.wikipedia.org/wiki/Gaussian_noise
[42] https://bit.ly/2XZXjnk
[43] https://bit.ly/3fjCSHR
[44] https://bit.ly/2Y0lhPn
[45] https://bit.ly/2UDXDWM
[46] https://twitter.com/alecrad
[47] http://cs231n.stanford.edu/
[48] https://realpython.com/python3-object-oriented-programming/
[49] https://realpython.com/python-super/
[50] https://ipython.readthedocs.io/en/stable/interactive/magics.html
[51] https://bit.ly/30GH0vO
[52] https://bit.ly/3g1eQCm
