1 Deep Learning Primer
Learning Objectives
You will be able to:
AI / Machine Learning / Deep Learning
Machine Learning vs. Deep Learning:
Data size: ML performs reasonably well on small/medium data; DL needs a large amount of data for reasonable performance (see next slide for the graph)
Scaling: ML doesn't scale well with large amounts of data; DL scales well with large amounts of data
Compute power: ML doesn't need a lot of compute (works well on single machines); DL needs a lot of compute power (usually runs on clusters)
CPU/GPU: ML is mostly CPU-bound; DL can utilize GPUs for certain computations (massive matrix operations)
Feature engineering: In ML, features need to be specified manually by experts; DL can learn high-level features from data automatically
Execution time: ML training usually takes seconds, minutes, or hours; DL training takes a lot longer (days)
A Deep Learning Example: Image Recognition
History, 1943: The McCulloch-Pitts Neural Model
Perceptron (Single-Layer Perceptron)
The Mark I Perceptron
A Very Simple Perceptron
Computes x · w = w1·x1 + w2·x2 + ... + wn·xn (where n is the number of inputs) and outputs 1 if the weighted sum exceeds a threshold, 0 otherwise
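To make the x · w computation concrete, here is a minimal perceptron sketch in Python with NumPy; the inputs, weights, and bias below are made-up values for illustration, not from the slides.

import numpy as np

def perceptron(x, w, b):
    """A single perceptron: weighted sum of its n inputs, then a step activation."""
    weighted_sum = np.dot(x, w) + b      # x . w plus a bias (threshold) term
    return 1 if weighted_sum > 0 else 0  # fire (1) or don't (0)

# Made-up example with n = 3 inputs
x = np.array([1.0, 0.0, 1.0])    # inputs
w = np.array([0.5, -0.6, 0.4])   # one weight per input
b = -0.2                         # bias

print(perceptron(x, w, b))       # 0.5 + 0.4 - 0.2 = 0.7 > 0, so output is 1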
Image Recognition
ResNet (from 2015), with 152 layers
A CNN (Convolutional Neural Network) from Microsoft that won the ILSVRC 2015 competition
AlexNet (8 layers)
Won the ImageNet Challenge (ILSVRC) in 2012; the underlying ImageNet dataset contains millions of images in roughly 20,000 categories, and the challenge is run on a 1,000-category subset
Machine Translation
Google Translate* uses Deep Neural Networks (DNN) to translate between
languages (English to French, etc.) with very high accuracy
Examples of Deep Neural Networks
Reinforcement Learning
Bots learning to play games automatically (at superhuman levels!)
Activation functions
Determine the output that one layer's nodes propagate forward to the next layer
Allow us to create different output values
Allow us to model non-linear functions
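As a minimal sketch (made-up weights and inputs, not from the slides), this is how an activation function sits between a layer's weighted sums and the values passed on to the next layer:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)    # a non-linear activation applied element-wise

# Made-up layer: 3 inputs feeding 2 nodes
x = np.array([0.5, -1.0, 2.0])
W = np.array([[ 0.1,  0.4],
              [-0.3,  0.2],
              [ 0.5, -0.6]])
b = np.array([0.0, 0.1])

z = x @ W + b    # linear part: each node's weighted sum
a = relu(z)      # activation: this is what gets propagated to the next layer

print(z)   # [ 1.35 -1.1 ]
print(a)   # [ 1.35  0.  ]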
Activation Function: Linear
Very simple
Used for linear regression
Output = weight × input + intercept
Y = aX + b
Activation Function: Sigmoid*
Sigmoid* and tanh both suffer from the Vanishing Gradient Problem:
The derivative of a sigmoid is at most 0.25 (sigmoid(x) = 1 / (1 + e^-x), so its derivative sigmoid(x)·(1 − sigmoid(x)) peaks at 0.25)
As we back-propagate through many layers, the gradient is multiplied by these small derivatives again and again, becomes smaller and smaller, and eventually 'vanishes' (too small to drive learning)
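A quick numeric sketch of the problem (the 20-layer depth is a made-up example): even in the best case, each sigmoid layer multiplies the backpropagated gradient by at most 0.25, so it shrinks geometrically with depth.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)    # peaks at 0.25 when x = 0

grad = 1.0
for layer in range(20):                  # a made-up 20-layer network
    grad *= sigmoid_derivative(0.0)      # best case: 0.25 per layer
print(grad)                              # 0.25**20 ~ 9.1e-13 -- effectively vanished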
ReLU to the Rescue!
ReLU(x) = max(0, x): its derivative is 1 for any positive input, so gradients are not repeatedly shrunk as they pass back through many layers
Digit (outcome):  0    1    2    3    4    5    6    7    8    9
Probability:     0.9  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.1
Activation Functions - Comparison
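For a side-by-side comparison, here is a small sketch (the sample inputs are made up) that evaluates the activations discussed above on the same values:

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # made-up sample inputs

activations = {
    "linear":  x,                          # identity: y = x
    "sigmoid": 1.0 / (1.0 + np.exp(-x)),   # squashes into (0, 1)
    "tanh":    np.tanh(x),                 # squashes into (-1, 1)
    "relu":    np.maximum(0.0, x),         # 0 for negatives, identity otherwise
}

for name, values in activations.items():
    print(f"{name:8s}", np.round(values, 3))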
Loss Functions
Loss Functions for Regression: Mean Absolute Error Loss (MAE)
Averages the 'absolute error' across all data points: MAE = (1/n) · Σ |yᵢ − ŷᵢ|
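A minimal sketch of MAE in Python/NumPy (MSE shown alongside for comparison); the target and prediction values are made up for illustration:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # made-up targets
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # made-up predictions

mae = np.mean(np.abs(y_true - y_pred))     # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)      # mean squared error, for comparison

print(mae)   # 0.5
print(mse)   # 0.375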
Loss Functions for Regression: Takeaway
Both 'Mean Squared Error' (MSE) and 'Mean Absolute Error' (MAE) are widely used
And they perform pretty well in most scenarios
If the inputs span a large range (X1 = 1 to 10, X2 = 1,000 to 1,000,000), consider normalizing them first
Other options to consider (sketched below):
Mean squared log error loss (MSLE)
Mean absolute percentage error loss (MAPE)
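Minimal sketches of the two alternatives named above, with made-up values; note that MSLE assumes non-negative targets and predictions, and MAPE assumes no zero targets:

import numpy as np

y_true = np.array([10.0, 100.0, 1000.0])    # made-up targets spanning a wide range
y_pred = np.array([12.0,  90.0, 1100.0])    # made-up predictions

# Mean squared log error: penalizes relative (ratio) error rather than absolute error
msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# Mean absolute percentage error: average error as a percentage of the target
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

print(msle)   # ~0.016
print(mape)   # ~13.3 (percent)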
Loss Function for Classification
Loss Functions for Classification: Hinge Loss
Hinge loss is used heavily when the network does hard binary classification: loss = max(0, 1 − y·ŷ), with the true labels y encoded as -1 or +1
Can be extended to 'multiclass classification'
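A minimal sketch of binary hinge loss (the labels and raw scores below are made up); it assumes true labels encoded as -1/+1:

import numpy as np

y_true  = np.array([ 1.0, -1.0,  1.0, -1.0])    # labels encoded as -1 / +1
y_score = np.array([ 0.8, -0.5, -0.2,  0.3])    # made-up raw model scores

# Zero loss when a score is on the correct side of the margin (y * score >= 1),
# and a linearly growing penalty otherwise.
hinge = np.mean(np.maximum(0.0, 1.0 - y_true * y_score))

print(hinge)   # per-sample losses 0.2, 0.5, 1.2, 1.3 -> mean 0.8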
Loss Function for Classification: Logistic Loss
Example predicted probabilities over the 10 digit classes:
Digit (outcome):  0     1    2    3    4    5    6    7    8     9
Probability:     0.75  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.15  0.10
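Logistic (cross-entropy) loss scores such a prediction by how much probability it puts on the true class. A minimal sketch using the probabilities from the table above, assuming the true digit is 0:

import numpy as np

probs = np.array([0.75, 0, 0, 0, 0, 0, 0, 0, 0.15, 0.10])   # predicted distribution from the table
true_digit = 0                                              # assumed true class for this example

loss = -np.log(probs[true_digit])   # log loss for a single example
print(loss)                         # -log(0.75) ~ 0.288

# A more confident correct prediction is rewarded with a smaller loss:
print(-np.log(0.99))                # ~ 0.010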
Hyperparameters
We learned about:
Deep Learning vs. Machine Learning
How Neural Nets have evolved
Activation functions
Loss functions
Hyperparameter tuning
Recommended Resources
https://software.intel.com/en-us/ai-academy/basics