03_pytorch_computer_vision

The document discusses computer vision and convolutional neural networks (CNNs), outlining various problems such as binary classification, multiclass classification, object detection, and segmentation. It covers the architecture of CNNs using PyTorch, including data handling, model creation, training, and evaluation. Additionally, it addresses concepts like overfitting, data augmentation, and popular architectures in computer vision.

Uploaded by

Saksham

Computer Vision & Convolutional Neural Networks with PyTorch
Where can you get help?
“If in doubt, run the code”

• Follow along with the code


• Try it for yourself
• Press SHIFT + CMD + SPACE to read the docstring
• Search for it
• Try again
• Ask

https://www.github.com/mrdbourke/pytorch-deep-learning/discussions
“What is a computer vision problem?”
Example computer vision problems
• “Is this a photo of steak or pizza?”: Binary classification (one thing or another)
• “Is this a photo of sushi, steak or pizza?”: Multiclass classification (more than one thing or another)
• “Where’s the thing we’re looking for?”: Object detection
• “What are the different sections in this image?”: Segmentation

Source: On-device Panoptic Segmentation for Camera Using Transformers.
Tesla Computer Vision

Source: Tesla AI Day Video (49:49). PS see 2:01:31 of the same video for surprise ;)
Tesla Computer Vision

Source: AI Drivr YouTube channel.


What we’re going to cover
(broadly)
• Getting a vision dataset to work with using torchvision.datasets

• Architecture of a convolutional neural network (CNN) with PyTorch

• An end-to-end multi-class image classification problem

• Steps in modelling with CNNs in PyTorch

• Creating a CNN model with PyTorch

• Picking a loss and optimizer

• Training a PyTorch computer vision model

• Evaluating a model

How:
👩🍳 👩🔬
(we’ll be cooking up lots of code!)
Computer vision inputs and outputs
Inputs: an image with W = 224, H = 224, C = 3 (c = colour channels, R, G, B)
Numerical encoding (normalized pixel values):
[[0.31, 0.62, 0.44…],
[0.92, 0.03, 0.27…],
[0.25, 0.78, 0.07…],
…,
Model: this is often a convolutional neural network (CNN)! (often already exists, if not, you can build one)
Predicted output (comes from looking at lots of these):
🍣 🥩 🍕
[[0.97, 0.00, 0.03],
[0.81, 0.14, 0.05],
[0.03, 0.07, 0.90],
…,
Actual output: Sushi 🍣, Steak 🥩, Pizza 🍕
Input and output shapes
(for an image classification example)
Input: a 224 x 224 image (gets represented as a tensor)
[[0.31, 0.62, 0.44…],
[0.92, 0.03, 0.27…],
[0.25, 0.78, 0.07…],
…,
[batch_size, width, height, colour_channels]
Shape = [None, 224, 224, 3]
or
Shape = [32, 224, 224, 3] (32 is a very common batch size)
We’re going to be building CNNs to do this part!
Output: 🍣 🥩 🍕
[0.00, 0.97, 0.03] (prediction probabilities)
Shape = [3]
These will vary depending on the problem you’re working on.
Input and output shapes
Input: a 28 x 28 image (gets represented as a tensor)
[[0.00, 0.62, 0.44…],
[0.00, 0.03, 0.27…],
[0.01, 0.78, 0.07…],
…,
(colour channels last) [batch_size, height, width, colour_channels] (NHWC)
or
(colour channels first) [batch_size, colour_channels, height, width] (NCHW)
Shape = [None, 28, 28, 1] (NHWC)
Shape = [None, 1, 28, 28] (NCHW)
or
Shape = [32, 28, 28, 1] (32 is a very common batch size)
Output: 🥾 👕 👖…
[0.00, 0.97, …] (prediction probabilities)
Shape = [10]
These will vary depending on the problem you’re working on.
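A minimal sketch of the two layouts above, assuming random values as a stand-in for real normalized pixel data. The variable names are illustrative and not from the course code. It builds a channels-last (NHWC) batch and rearranges it to channels-first (NCHW), the layout layers such as torch.nn.Conv2d expect.

```python
import torch

# A fake batch of 32 grayscale 28x28 images in channels-last (NHWC) layout.
# Random values in [0, 1) stand in for normalized pixel values (an assumption).
batch_nhwc = torch.rand(32, 28, 28, 1)

# Move the colour channel axis from position 3 to position 1 to get NCHW.
batch_nchw = batch_nhwc.permute(0, 3, 1, 2)

print(batch_nhwc.shape)  # torch.Size([32, 28, 28, 1])
print(batch_nchw.shape)  # torch.Size([32, 1, 28, 28])
```

Only the axis order changes; the pixel values themselves are untouched.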
“What is a convolutional neural network (CNN)?”
Let’s code!
FashionMNIST
“What type of clothing is in this image?”
Multiclass classification (more than one thing or another)

torchvision.datasets.FashionMNIST
FashionMNIST: Batched
torchvision.datasets.FashionMNIST -> torch.utils.data.DataLoader
• batch_size=32 (32 samples per batch)
• shuffle=True (samples all mixed up)
• Number of batches = num samples / batch_size
(Figure: a grid with one row per batch, Batch 0, 1, 2, 3, 4, …, and one column per sample, Sample 0, 1, 2, 3, 4, 5, …, 32.)
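A small sketch of batching with torch.utils.data.DataLoader. To avoid a download it uses a TensorDataset of random tensors as a stand-in for torchvision.datasets.FashionMNIST; the 1,000-sample size is an assumption for illustration. When the number of samples does not divide evenly by batch_size, the DataLoader rounds the batch count up and the last batch is smaller.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (an assumption): 1,000 random 1x28x28 "images" with labels 0-9.
images = torch.rand(1000, 1, 28, 28)
labels = torch.randint(low=0, high=10, size=(1000,))
dataset = TensorDataset(images, labels)

BATCH_SIZE = 32
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

# Number of batches = num samples / batch_size, rounded up.
num_batches = len(dataloader)
expected_batches = math.ceil(len(dataset) / BATCH_SIZE)

# Pull out one batch and check its shape.
batch_images, batch_labels = next(iter(dataloader))
print(num_batches, expected_batches)  # 32 32
print(batch_images.shape)             # torch.Size([32, 1, 28, 28])
print(batch_labels.shape)             # torch.Size([32])
```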
(typical)* Architecture of a CNN
(what we’re working towards building)

Steak 🥩
Pizza 🍕
Sushi 🍣

*Note: there is an almost unlimited number of ways you could stack together a convolutional neural network; this slide demonstrates only one.
Typical architecture of a CNN
(coloured block edition)
Simple CNN

Deeper CNN
CNN Explainer model
Input layer Conv2d layers ReLU activation layers Pooling layers Output layer

Source: CNN Explainer website, architecture is known as TinyVGG.
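One possible way to stack the layer types named above (Conv2d, ReLU, pooling, output) in PyTorch. This is a sketch in the spirit of TinyVGG, not the exact CNN Explainer model: the class name, hidden_units=10, padding=1 and the 28x28 grayscale input are all assumptions made for illustration.

```python
import torch
from torch import nn


class TinyVGGSketch(nn.Module):
    """Two convolutional blocks followed by a linear output layer."""

    def __init__(self, in_channels: int, hidden_units: int, num_classes: int):
        super().__init__()
        self.block_1 = nn.Sequential(
            nn.Conv2d(in_channels, hidden_units, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # halves height and width: 28 -> 14
        )
        self.block_2 = nn.Sequential(
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # halves again: 14 -> 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # Assumes 28x28 inputs, so feature maps are 7x7 after two poolings.
            nn.Linear(hidden_units * 7 * 7, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.block_2(self.block_1(x)))


model = TinyVGGSketch(in_channels=1, hidden_units=10, num_classes=10)
logits = model(torch.rand(32, 1, 28, 28))  # a fake NCHW batch
print(logits.shape)  # torch.Size([32, 10])
```

The in_features of the final Linear layer depends on the input image size, so a different image size needs a different value there.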


Breakdown of torch.nn.Conv2d layer
Example code: torch.nn.Conv2d(in_channels=3, out_channels=10, kernel_size=(3, 3), stride=(1, 1), padding=0)
Example 2 (same as above): torch.nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3, stride=1, padding=0)

Hyperparameter name | What does it do? | Typical values
in_channels | Defines the number of input channels of the input data. | 1 (grayscale), 3 (RGB color images)
out_channels | Defines the number of output channels of the layer (could also be called hidden units). | 10, 128, 256, 512
kernel_size (also referred to as filter size) | Determines the shape of the kernel (sliding window) over the input. | 3, 5, 7 (lower values learn smaller features, higher values learn larger features)
stride | The number of steps a filter takes across an image at a time (e.g. if stride=1, a filter moves across an image 1 pixel at a time). | 1 (default), 2
padding | Pads the target tensor with zeroes (if “same”) to preserve input shape. Or leaves the target tensor as is (if “valid”), lowering output shape. | 0, 1, “same”, “valid”

📖 Resource: For an interactive demonstration of the above hyperparameters, see the CNN Explainer website.
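To see how kernel_size, stride and padding change the output shape, here is a small helper using the output-size formula from the torch.nn.Conv2d documentation, assuming dilation=1. The function name is made up for this sketch.

```python
def conv2d_output_size(input_size: int, kernel_size: int,
                       stride: int = 1, padding: int = 0) -> int:
    """Output height (or width) of a 2D convolution, assuming dilation=1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# A 64-pixel-wide input with a 3x3 kernel:
print(conv2d_output_size(64, kernel_size=3, stride=1, padding=0))  # 62 (shrinks)
print(conv2d_output_size(64, kernel_size=3, stride=1, padding=1))  # 64 (preserved)
print(conv2d_output_size(64, kernel_size=3, stride=2, padding=1))  # 32 (halved)
```

With kernel_size=3 and stride=1, padding=1 behaves like “same” and padding=0 behaves like “valid”.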
Breakdown of torch.nn.Conv2d layer (Visually)

📖 Resource: For an interactive demonstration of the above hyperparameters, see the CNN Explainer website.
FashionMNIST -> CNN
Inputs -> Numerical encoding -> Layers learn numerical representation -> Output layer outputs predictions

Numerical encoding:
[[0.00, 0.62, 0.44…],
[0.00, 0.03, 0.27…],
[0.01, 0.78, 0.07…],
[0.21, 0.34, 0.00…],
[0.91, 0.66, 0.81…],
[0.90, 0.55, 0.99…],
[0.00, 0.22, 0.57…],
…,

Output layer outputs predictions: 🥾 👕 👖 👡 👗
(keep going until number of classes is fulfilled)
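A sketch of how an output layer's raw numbers become a prediction. It assumes the model returns one raw score (logit) per class; the logits below are made up, and only three classes are used to keep it short.

```python
import torch

# Hypothetical raw model outputs (logits) for 2 images and 3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

# Softmax turns each row of logits into prediction probabilities.
pred_probs = torch.softmax(logits, dim=1)

# The predicted class is the index with the highest probability.
pred_labels = pred_probs.argmax(dim=1)

print(pred_probs.sum(dim=1))  # each row sums to (approximately) 1
print(pred_labels)            # tensor([0, 2])
```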
torchvision.transforms
torch.utils.data.Dataset
torch.save
torch.utils.data.DataLoader torchmetrics torch.load

torch.optim torch.nn torch.utils.tensorboard


torch.nn.Module
torchvision.models

See more: https://pytorch.org/tutorials/beginner/ptcheat.html
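A rough sketch of how several of the modules above fit together in a training loop: torch.utils.data for batching, torch.nn for the model and loss, torch.optim for the optimizer. The data is random stand-in data and the model is deliberately tiny, so the loss values mean nothing; the learning rate and epoch count are arbitrary assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Stand-in data (an assumption): 64 random 1x28x28 images with labels 0-9.
dataset = TensorDataset(torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,)))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# A deliberately tiny model: flatten each image, then one linear layer.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loss_fn = nn.CrossEntropyLoss()  # suits multiclass classification
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(3):
    model.train()
    running_loss = 0.0
    for X, y in dataloader:
        logits = model(X)          # 1. forward pass
        loss = loss_fn(logits, y)  # 2. calculate the loss
        optimizer.zero_grad()      # 3. reset gradients from the last step
        loss.backward()            # 4. backpropagation
        optimizer.step()           # 5. update the parameters
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(dataloader))

print(epoch_losses)  # one average loss value per epoch
```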


What is overfitting?
Overfitting: when a model over-learns patterns in a particular dataset and isn’t able to generalise to unseen data.

For example, a student who studies the course materials too hard and then isn’t able to perform well on the final exam. Or tries to put their knowledge into practice at the workplace and finds what they learned has nothing to do with the real world.

Underfitting | Balanced (goldilocks zone) | Overfitting
Improving a model (from a model’s perspective)

Common ways to improve a deep model:
• Adding layers
• Increase the number of hidden units
• Change/add activation functions
• Change the optimization function
• Change the learning rate
• Fitting for longer
(because you can alter each of these, they’re hyperparameters)

(Figure: a smaller model next to a larger model.)
Improving a model (from a data perspective)

Method to improve a model (reduce overfitting) | What does it do?
More data | Gives a model more of a chance to learn patterns between samples (e.g. if a model is performing poorly on images of pizza, show it more images of pizza).
Data augmentation | Increase the diversity of your training dataset without collecting more data (e.g. take your photos of pizza and randomly rotate them 30°). Increased diversity forces a model to learn more generalisable patterns.
Better data | Not all data samples are created equally. Removing poor samples from or adding better samples to your dataset can improve your model’s performance.
Use transfer learning | Take a model’s pre-learned patterns from one problem and tweak them to suit your own problem. For example, take a model trained on pictures of cars to recognise pictures of trucks.
What is data augmentation?
Looking at the same image but from different perspective(s)*.

Original | Rotate | Shift | Zoom

*Note: There are many more kinds of data augmentation, such as cropping, replacing and shearing. This slide only demonstrates a few.
Popular & useful computer vision architectures: see torchvision.models

Architecture | Release Date | Paper | Use in PyTorch | When to use
ResNet (residual networks) | 2015 | https://arxiv.org/abs/1512.03385 | torchvision.models.resnet… | A good backbone for many computer vision problems
EfficientNet(s) | 2019 | https://arxiv.org/abs/1905.11946 | torchvision.models.efficientnet… | Typically now better than ResNets for computer vision
Vision Transformer (ViT) | 2020 | https://arxiv.org/abs/2010.11929 | torchvision.models.vit_… | Transformer architecture applied to vision
MobileNet(s) | 2017 | https://arxiv.org/abs/1704.04861 | torchvision.models.mobilenet… | Lightweight architecture suitable for devices with less computing power
The machine learning explorer’s
motto
“Visualize, visualize, visualize”
• Data
• Model
• Training
• Predictions
It’s a good idea to visualize these as often as possible.
The machine learning practitioner’s
motto

“Experiment, experiment, experiment”

👩🍳 👩🔬
(try lots of things and see what tastes good)
