Machine Learning A-Z Course Downloadable Slides v1.5


Machine Learning A-Z
Course Slides

© SuperDataScience
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
Welcome to the course!

Dear student,

Welcome to the “Machine Learning A-Z” course brought to you by SuperDataScience. We are
super-excited to have you on board! In this class you will learn many interesting and useful
concepts while having lots of fun.

These slides may be updated from time to time. If this happens, you will be able to find them in
the course materials repository with a new version indicated in the filename.

We kindly ask that you use these slides only for the purpose of supporting your own learning
journey and we look forward to seeing you inside the class!

Enjoy machine learning,


Kirill & Hadelin

PS: if you are not yet enrolled in the course, you can find it here.

Who Are Your Instructors?

• Hello! My name is Kirill Eremenko
• I have a bachelor’s degree in Physics & Maths and a background in Data Analytics consulting
• I used to host the SuperDataScience podcast

• Hi there! My name is Hadelin de Ponteves
• I have a master’s in Machine Learning and I used to do Reinforcement Learning at Google
• I wrote a research paper on Machine Learning

We’ve been teaching online together since 2016 and over 1 Million students have enrolled in our
Machine Learning and Data Science courses. You can be confident that you are in good hands!

Data Preprocessing
The Machine Learning Process
Data Pre-Processing
• Import the data
• Clean the data
• Split into training & test sets
• Feature Scaling

Modelling
• Build the model
• Train the model
• Make predictions

Evaluation
• Calculate performance metrics
• Make a verdict

Training Set & Test Set

Train (80%): fit the model, e.g. ŷ = b0 + b1 X1 + b2 X2

Test (20%): compare predicted values ŷ vs. actual values y

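The 80/20 split can be sketched in plain Python (the function name, seed, and toy data here are illustrative — in practice a library routine such as scikit-learn's `train_test_split` does this):

```python
import random

def split_train_test(X, y, test_size=0.2, seed=42):
    # Shuffle indices reproducibly, then carve off the test fraction
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(X) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])

X = [[i] for i in range(10)]   # 10 observations, one feature each
y = list(range(10))
X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))  # 8 2
```

The model only ever sees the training set; the test set stands in for future, unseen data.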
Feature Scaling

Normalization: X′ = (X − Xmin) / (Xmax − Xmin)   → values in [0 ; 1]

Standardization: X′ = (X − μ) / σ   → values mostly in [−3 ; +3]

Feature Scaling

Salary      Age
$70,000     45 yrs
$60,000     44 yrs
$52,000     40 yrs

(Row-to-row differences: $10,000 vs 1 yr, $8,000 vs 4 yrs — the two features live on very different scales)

Feature Scaling

Normalization: X′ = (X − Xmin) / (Xmax − Xmin)   → values in [0 ; 1]

Feature Scaling

$70,000     45 yrs
$60,000     44 yrs
$52,000     40 yrs

[Figure: after normalization both columns land in [0 ; 1] — e.g. $60,000 → (60,000 − 52,000) / (70,000 − 52,000) = 0.444 and 44 yrs → (44 − 40) / (45 − 40) = 0.8 — so the features are now directly comparable]
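Both scaling formulas are easy to apply directly; a minimal sketch (function names are ours, not from the slides):

```python
def normalize(xs):
    # Min-max normalization: X' = (X - Xmin) / (Xmax - Xmin), lands in [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    # Standardization: X' = (X - mean) / (population standard deviation)
    mu = sum(xs) / len(xs)
    sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sigma for x in xs]

ages = [45, 44, 40]
print(normalize(ages))  # [1.0, 0.8, 0.0]
```

In practice the scaler is fit on the training set only and then applied to the test set with the same parameters, so no test-set information leaks into training.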
Regression
Simple Linear Regression

ŷ = b0 + b1 X1

ŷ — dependent variable
X1 — independent variable
b0 — y-intercept (constant)
b1 — slope coefficient

Simple Linear Regression
[Figure: potato yield y (tonnes) vs nitrogen fertilizer X1 (kg); each point represents a separate harvest]

ŷ = b0 + b1 X1
Potatoes [t] = b0 + b1 × Fertilizer [kg]

b0 = 8 [t]   (the yield with no fertilizer)
b1 = 3 [t/kg]   (+3 t of potatoes per +1 kg of fertilizer)

Ordinary Least Squares
Simple Linear Regression

Ordinary Least Squares:

[Figure: the fitted line through the potato data; for each harvest, the residual is the vertical distance from the point yᵢ to the line ŷᵢ]

residual: εᵢ = yᵢ − ŷᵢ

ŷ = b0 + b1 X1, with b0, b1 such that:

SUM(yᵢ − ŷᵢ)² is minimized

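For one predictor, the minimizing b0 and b1 have a well-known closed form, so the fit can be done by hand (the toy harvest data below is ours, constructed to lie exactly on ŷ = 8 + 3 X1):

```python
def fit_simple_ols(x, y):
    # Closed-form OLS solution for y^ = b0 + b1*x:
    #   b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    #   b0 = y_bar - b1 * x_bar
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical harvests lying exactly on Potatoes = 8 + 3 * Fertilizer
fertilizer_kg = [0, 1, 2, 3]
potatoes_t = [8, 11, 14, 17]
b0, b1 = fit_simple_ols(fertilizer_kg, potatoes_t)
print(b0, b1)  # 8.0 3.0
```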
Multiple Linear Regression

ŷ = b0 + b1 X1 + b2 X2 + … + bn Xn

ŷ — dependent variable
X1 … Xn — independent variables
b0 — y-intercept (constant)
b1 … bn — slope coefficients

Multiple Linear Regression

Potatoes [t] = 8 [t] + 3 [t/kg] × Fertilizer [kg] − 0.54 [t/°C] × AvgTemp [°C] + 0.04 [t/mm] × Rain [mm]

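With several predictors the coefficients are usually found numerically; a small sketch with NumPy's least-squares solver (the data here is hypothetical, generated exactly from known coefficients so the solver should recover them):

```python
import numpy as np

# Hypothetical data generated exactly from y = 8 + 3*x1 - 0.5*x2,
# so least squares should recover those coefficients
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0]])  # leading column of ones carries the intercept b0
y = 8 + 3 * X[:, 1] - 0.5 * X[:, 2]

# Solve min ||X b - y||^2 for b = (b0, b1, b2)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 6))
```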
Additional Reading

The Application of Multiple Linear Regression and


Artificial Neural Network Models for Yield Prediction
of Very Early Potato Cultivars before Harvest

Magdalena Piekutowska et al. (2021)

Link:

https://www.mdpi.com/2073-4395/11/5/885

R Squared

R Squared
[Figure: the fitted regression line ŷ vs the flat average line y_avg through the same data]

SS_res = SUM(yᵢ − ŷᵢ)²        SS_tot = SUM(yᵢ − y_avg)²

R² = 1 − SS_res / SS_tot

Rule of thumb (for our tutorials)*:
1.0  = Perfect fit (suspicious)
~0.9 = Very good
<0.7 = Not great
<0.4 = Terrible
<0   = Model makes no sense for this data

*This is highly dependent on the context

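The R² formula translates directly into code (toy numbers are ours):

```python
def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot
    y_avg = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_avg) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

y_actual = [8, 11, 14, 17]
y_predicted = [9, 10, 15, 16]
print(round(r_squared(y_actual, y_predicted), 3))  # 0.911
```

A perfect fit gives exactly 1.0, and a model doing worse than simply predicting the average gives a negative value.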
Adjusted R Squared
R² = 1 − SS_res / SS_tot        (R² — goodness of fit; greater is better)

Problem: add another predictor, e.g. ŷ = b0 + b1 X1 + b2 X2 + b3 X3
SS_tot doesn’t change
SS_res will decrease or stay the same (because Ordinary Least Squares drives SS_res to a minimum)
So R² never decreases when variables are added — even useless ones.

Solution:

Adj R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)

k — number of independent variables
n — sample size

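The adjustment is a one-liner; note how the same R² is penalized more as k grows (the example values are ours):

```python
def adjusted_r_squared(r2, n, k):
    # Penalize R^2 for the number of predictors k, given sample size n
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R^2, but more predictors -> lower adjusted R^2
print(round(adjusted_r_squared(0.9, n=50, k=3), 4))   # 0.8935
print(round(adjusted_r_squared(0.9, n=50, k=10), 4))  # 0.8744
```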
Assumptions of Linear Regression
Anscombe's quartet (1973):

Assumptions of Linear Regression

1. Linearity — linear relationship between Y and each X
2. Homoscedasticity — equal variance
3. Multivariate Normality — normality of error distribution
4. Independence — of observations; includes “no autocorrelation”
5. Lack of Multicollinearity — predictors are not correlated with each other
6. The Outlier Check — not an assumption, but an “extra”

Bonus

Download the Assumptions poster at:

superdatascience.com/assumptions

Additional Reading

Verifying the Assumptions of Linear Regression


in Python and R

Eryk Lewinson (2019)

Link:

towardsdatascience.com/verifying-the-assumptions-of-linear-regression-in-python-and-r-f4cd2907d4c0

Profit R&D Spend Admin Marketing State

192,261.83 165,349.20 136,897.80 471,784.10 New York


191,792.06 162,597.70 151,377.59 443,898.53 California
191,050.39 153,441.51 101,145.55 407,934.54 California
182,901.99 144,372.41 118,671.85 383,199.62 New York
166,187.94 142,107.34 91,391.77 366,168.42 California

y = b0 + b1*x1 + b2*x2 + b3*x3 + ???

Dummy Variables

Profit R&D Spend Admin Marketing State New York California

192,261.83 165,349.20 136,897.80 471,784.10 New York 1 0


191,792.06 162,597.70 151,377.59 443,898.53 California 0 1
191,050.39 153,441.51 101,145.55 407,934.54 California 0 1
182,901.99 144,372.41 118,671.85 383,199.62 New York 1 0
166,187.94 142,107.34 91,391.77 366,168.42 California 0 1

y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*D1

Dummy Variables

Profit R&D Spend Admin Marketing State New York California

192,261.83 165,349.20 136,897.80 471,784.10 New York 1 0


191,792.06 162,597.70 151,377.59 443,898.53 California 0 1
191,050.39 153,441.51 101,145.55 407,934.54 California 0 1
182,901.99 144,372.41 118,671.85 383,199.62 New York 1 0

D2 = 1 − D1
166,187.94 142,107.34 91,391.77 366,168.42 California 0 1

y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*D1 + b5*D2

Dummy Variables

Profit R&D Spend Admin Marketing State New York California

192,261.83 165,349.20 136,897.80 471,784.10 New York 1 0


191,792.06 162,597.70 151,377.59 443,898.53 California 0 1
191,050.39 153,441.51 101,145.55 407,934.54 California 0 1
182,901.99 144,372.41 118,671.85 383,199.62 New York 1 0
166,187.94 142,107.34 91,391.77 366,168.42 California 0 1

y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*D1 + b5*D2

Always omit one dummy variable
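One way to build the dummy columns while omitting one of them (the helper name is ours; libraries such as pandas do this with a `drop_first` option):

```python
def one_hot_drop_first(values):
    # Encode a categorical column as dummy variables, omitting the first
    # category to avoid the dummy variable trap (since D2 = 1 - D1, keeping
    # both would make the dummies perfectly collinear with the intercept)
    categories = sorted(set(values))
    kept = categories[1:]  # always omit one dummy variable
    return [[1 if v == c else 0 for c in kept] for v in values]

states = ["New York", "California", "California", "New York", "California"]
print(one_hot_drop_first(states))  # [[1], [0], [0], [1], [0]]
```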
Why?

[Figure: y plotted against candidate predictors X1–X7]
5 methods of building models:
1. All-in
2. Backward Elimination      ⎫
3. Forward Selection         ⎬  Stepwise Regression
4. Bidirectional Elimination ⎭
5. Score Comparison

“All-in” – cases:
• Prior knowledge; OR
• You have to; OR
• Preparing for Backward Elimination

Backward Elimination
STEP 1: Select a significance level to stay in the model (e.g. SL = 0.05)

STEP 2: Fit the full model with all possible predictors

STEP 3: Consider the predictor with the highest P-value. If P > SL, go to STEP 4, otherwise go to FIN

STEP 4: Remove the predictor

STEP 5: Fit the model without this variable, then return to STEP 3


FIN: Your Model Is Ready

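The loop in STEPS 3–5 can be sketched as follows. Note the p-value provider here is a toy stand-in with fixed values — in a real run you would refit OLS (e.g. with statsmodels) after every removal and read the p-values off the new fit:

```python
def backward_elimination(predictors, p_values_for, sl=0.05):
    # STEPS 3-5 as a loop: drop the highest-p-value predictor while it
    # exceeds the significance level SL, refitting after each removal.
    kept = list(predictors)
    while kept:
        pvals = p_values_for(kept)           # stand-in for refitting OLS
        worst = max(kept, key=lambda p: pvals[p])
        if pvals[worst] > sl:
            kept.remove(worst)               # STEP 4: remove, STEP 5: refit
        else:
            break                            # all remaining significant: FIN
    return kept

# Toy stand-in: fixed p-values per predictor (a real refit would change them)
fixed = {"R&D Spend": 0.001, "Admin": 0.6, "Marketing": 0.07}
kept = backward_elimination(list(fixed), lambda ps: {p: fixed[p] for p in ps})
print(kept)  # ['R&D Spend']
```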
Forward Selection
STEP 1: Select a significance level to enter the model (e.g. SL = 0.05)

STEP 2: Fit all simple regression models y ~ xₙ. Select the one with the lowest P-value

STEP 3: Keep this variable and fit all possible models with one extra predictor added to the one(s) you
already have

STEP 4: Consider the predictor with the lowest P-value. If P < SL, go to STEP 3, otherwise go to FIN

FIN: Keep the previous model

Bidirectional Elimination
STEP 1: Select a significance level to enter and to stay in the model
e.g.: SLENTER = 0.05, SLSTAY = 0.05

STEP 2: Perform the next step of Forward Selection (new variables must have: P < SLENTER to enter)

STEP 3: Perform ALL steps of Backward Elimination (old variables must have P < SLSTAY to stay)

STEP 4: Repeat STEPS 2 and 3 until no new variables can enter and no old variables can exit

FIN: Your Model Is Ready

All Possible Models
STEP 1: Select a criterion of goodness of fit (e.g. Akaike criterion)

STEP 2: Construct All Possible Regression Models: 2^N − 1 total combinations

STEP 3: Select the one with the best criterion

FIN: Your Model Is Ready

Example: 10 columns means 2^10 − 1 = 1,023 models

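The 2^N − 1 count is just the number of non-empty predictor subsets, which is easy to verify:

```python
from itertools import combinations

def all_possible_models(columns):
    # Every non-empty subset of the predictors: 2^N - 1 candidate models
    models = []
    for r in range(1, len(columns) + 1):
        models.extend(combinations(columns, r))
    return models

cols = [f"X{i}" for i in range(1, 11)]  # 10 columns
print(len(all_possible_models(cols)))  # 1023
```

This exponential blow-up is why exhaustive score comparison is rarely practical beyond a handful of columns.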
In this section we learned:
1. How to create dummies for categorical IVs
2. How to avoid the dummy variable trap
3. Backward, Forward, Bidirectional, All Possible
4. We actually built a model. Step-By-Step!!
5. How to use adjusted R-squared in modelling
6. How to interpret coefficients of a MLR

Simple Linear Regression

Multiple Linear Regression

Polynomial Linear Regression

[Figures: y plotted against x1 — a straight-line fit vs. a curved (polynomial) fit]
Polynomial Linear Regression

ŷ = b0 + b1 X1 + b2 X1² + … + bn X1ⁿ
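The trick is that the model stays linear in its coefficients — only the inputs are expanded into powers. A minimal sketch of that expansion (function name is ours; scikit-learn's `PolynomialFeatures` does the same job):

```python
def polynomial_features(x, degree):
    # Expand one feature x into [x, x^2, ..., x^degree]; the model is still
    # linear in the coefficients b1..bn, hence "polynomial *linear* regression"
    return [x ** d for d in range(1, degree + 1)]

print(polynomial_features(2.0, 4))  # [2.0, 4.0, 8.0, 16.0]
```

After this expansion the multiple linear regression machinery applies unchanged.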
Support Vector Regression (SVR)

Vladimir Vapnik (1992)
Ordinary Least Squares ε-Insensitive Tube
Ordinary Least Squares objective: SUM(y − ŷ)² → min

[Figure: the ε-insensitive tube — a band of width ε on either side of the regression line; errors inside the tube are ignored]

(1/2) ‖w‖² + C Σᵢ₌₁ᵐ (ξᵢ + ξᵢ*) → min

Ordinary Least Squares: SUM(y − ŷ)² → min     vs.     ε-Insensitive Tube with Slack Variables ξᵢ and ξᵢ*

[Figure: points above the tube incur slack ξᵢ, points below incur ξᵢ*]

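The ε-insensitive idea is easiest to see in the loss itself — errors inside the tube are free, and only the overshoot (the slack) is paid for. A small sketch (function name and numbers are ours):

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    # Errors inside the epsilon-tube cost nothing; outside it, the cost is
    # the slack xi = |error| - epsilon (how far the point sticks out)
    return sum(max(0.0, abs(yt - yp) - epsilon)
               for yt, yp in zip(y_true, y_pred))

y_true = [1.0, 2.0, 3.0]
y_pred = [1.2, 2.0, 4.0]  # first error (0.2) is inside the tube, last (1.0) is not
print(epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5))  # 0.5
```

The full SVR objective additionally shrinks ‖w‖², trading flatness of the fit against the total slack via the constant C.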
Additional Reading:

Chapter 4 – Support Vector Regression


(from: Efficient Learning Machines:
Theories, Concepts, and Applications for
Engineers and System Designers)

By Mariette Awad & Rahul Khanna (2015)

Link:

https://core.ac.uk/download/pdf/81523322.pdf

[Figure: non-linear SVR — an ε-tube around a curved fit]
Section on SVM:
• SVM Intuition

Section on Kernel SVM:
• Kernel SVM Intuition
• Mapping to a higher dimension
• The Kernel Trick
• Types of Kernel Functions
• Non-linear Kernel SVR

Image source: http://www.cs.toronto.edu/~duvenaud/cookbook/index.html

Decision Tree Regression

[Figure: scatter of the data in the (X1, X2) plane, carved into regions — Split 1 at X1 = 20, Split 2 at X2 = 170, Split 3 at X2 = 200, Split 4 at X1 = 40]

[Figure: Split 1 divides the plane at X1 = 20]
Tree so far:

X1 < 20  (Yes / No)
[Figure: Split 2 at X2 = 170 subdivides the X1 ≥ 20 region]
Tree so far:

X1 < 20
  No: X2 < 170
X1 < 20
  Yes: X2 < 200
  No:  X2 < 170
X1 < 20
  Yes: X2 < 200
  No:  X2 < 170
    Yes: X1 < 40
[Figure: each region is assigned the average y of its training points — 300.5, 65.7, −64.1, 0.7, and 1023]
X1 < 20
  Yes: X2 < 200
    Yes → 300.5
    No  → 65.7
  No: X2 < 170
    Yes: X1 < 40
      Yes → −64.1
      No  → 0.7
    No → 1023
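Prediction with the fitted tree is just a walk down the splits; the tree from the slides translates directly into nested conditionals:

```python
def predict(x1, x2):
    # Walk the splits of the fitted tree from the slides; each leaf returns
    # the average y of the training points that fall in its region
    if x1 < 20:
        return 300.5 if x2 < 200 else 65.7
    if x2 < 170:
        return -64.1 if x1 < 40 else 0.7
    return 1023

print(predict(10, 150))  # 300.5
print(predict(30, 100))  # -64.1
print(predict(50, 300))  # 1023
```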
Random Forest Regression
STEP 1: Pick at random K data points from the Training set.

STEP 2: Build the Decision Tree associated to these K data points.

STEP 3: Choose the number Ntree of trees you want to build and repeat STEPS 1 & 2

STEP 4: For a new data point, make each one of your Ntree trees predict the value of Y
for the data point in question, and assign the new data point the average across all of the
predicted Y values.

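The steps above can be sketched as an ensemble of stand-in "trees". To keep the sketch self-contained, each stub tree simply predicts the mean y of its K sampled points — a real forest (e.g. scikit-learn's `RandomForestRegressor`) would fit genuine decision trees instead — but the bootstrap sampling and STEP 4's averaging are shown faithfully:

```python
import random

def make_stub_tree(X, y, k, rng):
    # STEPS 1-2: pick K random training points and "fit" a tree to them.
    # Stand-in tree: it just predicts the mean y of its K points; a real
    # decision tree would split on the features instead.
    idx = [rng.randrange(len(X)) for _ in range(k)]
    mean_y = sum(y[i] for i in idx) / k
    return lambda x: mean_y

def forest_predict(trees, x):
    # STEP 4: average the predictions of all Ntree trees
    return sum(tree(x) for tree in trees) / len(trees)

rng = random.Random(42)
X = [[i] for i in range(100)]
y = [float(i) for i in range(100)]
trees = [make_stub_tree(X, y, k=20, rng=rng) for _ in range(50)]  # STEP 3: Ntree = 50
print(round(forest_predict(trees, [50]), 1))  # near the overall mean of ~49.5
```

Averaging many noisy trees is the point: individual trees vary a lot, but their mean is far more stable.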
Classification

What is Classification?
Classification: a Machine Learning technique to identify the
category of new observations based on training data.

[Figures: customers likely to stay vs. likely to leave; photos of dogs vs. cats]

Logistic Regression
Logistic regression: predict a categorical dependent variable (e.g. will purchase health insurance: Yes / No) from a number of independent variables.

ln(p / (1 − p)) = b0 + b1 X1

[Figure: probability of taking up the offer vs Age (X1) — e.g. 81% ≥ 50% → predicted YES; 42% < 50% → predicted NO]

Logistic Regression

Will purchase health insurance (Yes / No), from: Age (X1), Income (X2), Level of Education (X3), Family or Single (X4)

ln(p / (1 − p)) = b0 + b1 X1 + b2 X2 + b3 X3 + b4 X4

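Solving the log-odds equation for p gives the familiar sigmoid, which is all that is needed to turn coefficients into probabilities and then into Yes/No predictions (the coefficients below are hypothetical, chosen for illustration):

```python
import math

def predict_probability(x, b0, b1):
    # Invert ln(p / (1 - p)) = b0 + b1*x  =>  p = 1 / (1 + e^-(b0 + b1*x))
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def classify(x, b0, b1, threshold=0.5):
    # Project the probability onto Yes/No at the 50% line
    return "YES" if predict_probability(x, b0, b1) >= threshold else "NO"

# Hypothetical coefficients: the chance of taking the offer rises with age
b0, b1 = -8.0, 0.2
print(round(predict_probability(45, b0, b1), 2))   # 0.73
print(classify(35, b0, b1), classify(60, b0, b1))  # NO YES
```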
Maximum Likelihood
[Figure: the fitted curve assigns each observation a probability — YES observations contribute p̂ᵢ to the likelihood, NO observations contribute (1 − p̂ᵢ)]

Likelihood = 0.03 x 0.54 x 0.92 x 0.95 x 0.98 x (1 – 0.01) x (1 – 0.04) x (1 – 0.10) x (1 – 0.58) x (1 – 0.96)

Likelihood = 0.00019939

Maximum Likelihood

Candidate curves and their likelihoods:

Likelihood = 0.00007418
Likelihood = 0.00012845
Likelihood = 0.00016553
Likelihood = 0.00019939

Best Curve ⇐ Maximum Likelihood (pick the curve whose likelihood is highest)

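The likelihood product from the slide can be checked directly in code:

```python
def likelihood(probs, outcomes):
    # Product over observations: p_i if the person said YES, (1 - p_i) if NO
    result = 1.0
    for p, said_yes in zip(probs, outcomes):
        result *= p if said_yes else (1 - p)
    return result

# Predicted probabilities from the slide, with the observed YES/NO labels
probs    = [0.03, 0.54, 0.92, 0.95, 0.98, 0.01, 0.04, 0.10, 0.58, 0.96]
outcomes = [True, True, True, True, True, False, False, False, False, False]
print(round(likelihood(probs, outcomes), 8))  # 0.00019939
```

Because such products of many small numbers underflow quickly, real implementations maximize the log-likelihood (a sum of logs) instead.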
K-Nearest Neighbors (K-NN)

Before K-NN: a new data point lies between Category 1 and Category 2.
After K-NN: the new data point is assigned to Category 1.

STEP 1: Choose the number K of neighbors

STEP 2: Take the K nearest neighbors of the new data point, according to the Euclidean distance

STEP 3: Among these K neighbors, count the number of data points in each category

STEP 4: Assign the new data point to the category where you counted the most neighbors

Your Model is Ready

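The four steps fit in a few lines of plain Python (the toy training points are ours):

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=5):
    # STEP 2: find the K nearest neighbors by Euclidean distance
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    neighbors = sorted(train, key=lambda item: dist(item[0], new_point))[:k]
    # STEPS 3-4: count the categories among them, assign the majority
    counts = Counter(category for _, category in neighbors)
    return counts.most_common(1)[0][0]

train = [((1, 1), "Category 1"), ((1, 2), "Category 1"), ((2, 1), "Category 1"),
         ((8, 8), "Category 2"), ((8, 9), "Category 2"), ((9, 8), "Category 2")]
print(knn_classify(train, (2, 2), k=5))  # Category 1 (3 neighbors vs 2)
```

Since the vote is distance-based, feature scaling matters a great deal for K-NN.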
STEP 1: Choose the number K of neighbors: K = 5

[Figure: the new data point among the Category 1 and Category 2 points]

Euclidean distance between P1(x1, y1) and P2(x2, y2):

d(P1, P2) = √((x2 − x1)² + (y2 − y1)²)
[Figure: among the K = 5 nearest neighbors — Category 1: 3 neighbors, Category 2: 2 neighbors → the new data point is assigned to Category 1]
Support Vector Machine (SVM)

[Figure: two classes in the (x1, x2) plane — which line best separates them?]
Maximum Margin

[Figure: the support vectors are the data points closest to the separating line]

Maximum Margin

[Figure: the Maximum Margin Hyperplane (Maximum Margin Classifier) sits midway between the Positive Hyperplane and the Negative Hyperplane, which pass through the Support Vectors]

Linearly Separable vs. Not Linearly Separable

[Figure: on the left a straight line separates the classes; on the right no straight line can]

A 1D example: map the points through f = x − 5, then through f = (x − 5)²

[Figures: on the x1 line the two classes cannot be separated by a single point; after squaring, a horizontal line separates them]
2D Space → Mapping Function → 3D Space (new z dimension)

[Figure: after mapping, a hyperplane separates the two classes in 3D]
3D Space → Projection → 2D Space

[Figure: the separating hyperplane projects back to a non-linear separator in the original 2D space]
Mapping to a Higher Dimensional Space
can be highly compute-intensive

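The f = (x − 5)² example can be made concrete. The data below is hypothetical (one class sits between the other's points on the line, so no single cut point works), and `map_through` is our own helper name:

```python
def map_through(points, f):
    # Add a second dimension computed from the first: (x, f(x))
    return [(x, f(x)) for x in points]

# Hypothetical 1D data: the red class sits between the green points
greens = [1, 2, 8, 9]
reds = [4, 5, 6]

f = lambda x: (x - 5) ** 2  # the squaring step from the slides
print(max(fx for _, fx in map_through(reds, f)))    # 1
print(min(fx for _, fx in map_through(greens, f)))  # 9
# Every red point now lies below f = 5 and every green point above it:
# a horizontal line separates the classes in the new dimension.
```

The kernel trick that follows achieves the same separating power without ever computing the mapped coordinates explicitly.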
The Kernel Trick

The Gaussian RBF kernel: K(x⃗, l⃗ⁱ) = e^(−‖x⃗ − l⃗ⁱ‖² / (2σ²))

[Figure: the RBF kernel surface — value 1 at the landmark l⃗ⁱ, decaying towards 0 with distance]

Image source: http://www.cs.toronto.edu/~duvenaud/cookbook/index.html

[Figure: a landmark l⃗ placed in the 2D space — K(x⃗, l⃗) measures how close each point x⃗ is to the landmark]
With two landmarks (simplified formula): K(x⃗, l⃗¹) + K(x⃗, l⃗²)

Green when: K(x⃗, l⃗¹) + K(x⃗, l⃗²) > 0
Red when:   K(x⃗, l⃗¹) + K(x⃗, l⃗²) = 0
Types of Kernel Functions

Gaussian RBF Kernel:  K(x⃗, l⃗ⁱ) = e^(−‖x⃗ − l⃗ⁱ‖² / (2σ²))

Sigmoid Kernel:       K(X, Y) = tanh(γ · Xᵀ Y + r)

Polynomial Kernel:    K(X, Y) = (γ · Xᵀ Y + r)ᵈ, γ > 0

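The Gaussian RBF kernel is just a few lines to evaluate — 1 exactly at the landmark, falling towards 0 with distance, with σ controlling how fast:

```python
import math

def rbf_kernel(x, landmark, sigma=1.0):
    # K(x, l) = exp(-||x - l||^2 / (2 * sigma^2)):
    # equals 1 at the landmark and decays towards 0 with distance
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(rbf_kernel((0.0, 0.0), (0.0, 0.0)))            # 1.0
print(round(rbf_kernel((1.0, 0.0), (0.0, 0.0)), 4))  # 0.6065
print(round(rbf_kernel((3.0, 0.0), (0.0, 0.0)), 4))  # 0.0111
```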
Section on SVR:
• SVR Intuition

Section on SVM:
• SVM Intuition

Section on Kernel SVM:
• Kernel SVM Intuition
• Mapping to a higher dimension
• The Kernel Trick
• Types of Kernel Functions
• Non-linear Kernel SVR

Image source: http://www.cs.toronto.edu/~duvenaud/cookbook/index.html
[Figures: Gaussian RBF kernels K(x⃗, l⃗ⁱ) = e^(−‖x⃗ − l⃗ⁱ‖² / (2σ²)) centred on the data points combine into the non-linear SVR prediction curve]
Bayes’ Theorem

[Figure: wrenches coming off machine m1 and machine m2]
Bayes’ Theorem:

P(A|B) = P(B|A) × P(A) / P(B)
Naive Bayes

[Figure: Salary (X2) vs Age (X1) — Category 1: Walks, Category 2: Drives]
[Scatter plot: a new data point (Age, Salary) to be classified as Walks or Drives]
P(Walks|X) = P(X|Walks) * P(Walks) / P(X)

#1 Prior Probability: P(Walks)
#2 Marginal Likelihood: P(X)
#3 Likelihood: P(X|Walks)
#4 Posterior Probability: P(Walks|X)
P(Drives|X) = P(X|Drives) * P(Drives) / P(X)

(#1 Prior Probability: P(Drives); #2 Marginal Likelihood: P(X); #3 Likelihood: P(X|Drives); #4 Posterior Probability: P(Drives|X))
P(Walks|X) vs. P(Drives|X)
P(Walks) = Number of Walkers / Total Observations = 10 / 30
P(X) = Number of Similar Observations / Total Observations = 4 / 30
P(X|Walks) = Number of Similar Observations Among those who Walk / Total Number of Walkers = 3 / 10
P(Walks|X) = (3/10 * 10/30) / (4/30) = 0.75
P(Drives|X) = (1/20 * 20/30) / (4/30) = 0.25
0.75 vs. 0.25

0.75 > 0.25

P(Walks|X) > P(Drives|X)
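The whole calculation can be reproduced in a few lines of Python, using the counts read off the scatter plot (30 observations: 10 walkers, 20 drivers, and 4 observations similar to X, of which 3 walk and 1 drives):

```python
# Naive Bayes posteriors for the new data point
p_walks = 10 / 30            # prior: P(Walks)
p_drives = 20 / 30           # prior: P(Drives)
p_x = 4 / 30                 # marginal likelihood: P(X)
p_x_given_walks = 3 / 10     # likelihood: P(X|Walks)
p_x_given_drives = 1 / 20    # likelihood: P(X|Drives)

posterior_walks = p_x_given_walks * p_walks / p_x
posterior_drives = p_x_given_drives * p_drives / p_x

print(round(posterior_walks, 2), round(posterior_drives, 2))  # 0.75 0.25
```

Since P(Walks|X) > P(Drives|X), the new data point is classified as Walks.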
[Scatter plot: the new data point is assigned to the Walks category]
P(Drives) = Number of Drivers / Total Observations = 20 / 30
P(X|Drives) = Number of Similar Observations Among those who Drive / Total Number of Drivers = 1 / 20
P(X) = Number of Similar Observations / Total Observations = 4 / 30

NOTE: P(X) is the same in both calculations.
P(Drives|X) vs. P(Walks|X)
P(X|Walks) * P(Walks) / P(X)  vs.  P(X|Drives) * P(Drives) / P(X)
[Scatter plot of x1 vs x2, split step by step:
Split 1: x2 = 60
Split 2: x1 = 50
Split 3: x1 = 70
Split 4: x2 = 20]
[Tree after Split 1: root node "X2 < 60" with Yes / No branches]
[Tree after Split 2: the "No" branch of "X2 < 60" splits on "X1 < 50"]
X2 < 60
├── Yes → X1 < 70 (Yes / No)
└── No → X1 < 50 (Yes / No)
X2 < 60
├── Yes → X1 < 70
│   ├── Yes → leaf
│   └── No → X2 < 20 (Yes / No leaves)
└── No → X1 < 50 (Yes / No leaves)
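The finished tree can be hand-coded as nested conditionals. The thresholds (60, 50, 70, 20) come from the splits above; the leaf labels are placeholders, since the slides do not name the terminal classes, and the placement of the X2 < 20 split is read off the scatter plot:

```python
def predict(x1, x2):
    """Hand-coded version of the decision tree built from the four splits."""
    if x2 < 60:
        if x1 < 70:
            return "leaf 1"
        # region x2 < 60 and x1 >= 70 is split once more on x2 < 20
        return "leaf 2" if x2 < 20 else "leaf 3"
    # region x2 >= 60 is split on x1 < 50
    return "leaf 4" if x1 < 50 else "leaf 5"

print(predict(30, 30))  # x2 < 60 and x1 < 70 -> leaf 1
```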
STEP 1: Pick at random K data points from the Training set.

STEP 2: Build the Decision Tree associated to these K data points.

STEP 3: Choose the number Ntree of trees you want to build and repeat STEPS 1 & 2.

STEP 4: For a new data point, make each one of your Ntree trees predict the category to which the data point belongs, and assign the new data point to the category that wins the majority vote.
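The four steps can be sketched in plain Python. To stay self-contained, each "tree" here is only a depth-1 stump that splits at the mean of its bootstrap sample — a real implementation would grow full decision trees — but the random sampling (STEPS 1–3) and majority vote (STEP 4) follow the scheme above:

```python
import random
from collections import Counter

def build_stump(sample):
    # Minimal stand-in for a tree: split at the mean x of the sample,
    # label each side with the majority class on that side.
    t = sum(x for x, _ in sample) / len(sample)
    left = [y for x, y in sample if x < t] or [y for _, y in sample]
    right = [y for x, y in sample if x >= t] or [y for _, y in sample]
    lmaj = Counter(left).most_common(1)[0][0]
    rmaj = Counter(right).most_common(1)[0][0]
    return lambda x: lmaj if x < t else rmaj

def random_forest(data, k=5, n_trees=25, seed=0):
    rng = random.Random(seed)
    # STEPS 1-3: build n_trees stumps, each on K random data points
    return [build_stump([rng.choice(data) for _ in range(k)])
            for _ in range(n_trees)]

def forest_predict(trees, x):
    # STEP 4: each tree votes, the majority wins
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

data = [(1, "red"), (2, "red"), (3, "red"),
        (7, "blue"), (8, "blue"), (9, "blue")]
trees = random_forest(data)
print(forest_predict(trees, 2), forest_predict(trees, 8))
```

Averaging many trees built on different random subsets is what makes the ensemble more robust than any single tree.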
[Plot: logistic regression — the probability p̂ follows an S-shaped curve in X (e.g. age 20–50); actual DV values y sit at 0 and 1, and predicted values ŷ are obtained by projecting p̂ onto 0 (below the 0.5 line) or 1 (above it)]
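A small sketch of how the curve turns a probability into a 0/1 prediction. The intercept and slope below are made-up values for illustration, not fitted coefficients from the course:

```python
import math

b0, b1 = -4.0, 0.1  # hypothetical coefficients for illustration

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict(x):
    p_hat = sigmoid(b0 + b1 * x)       # predicted probability p-hat
    y_hat = 1 if p_hat >= 0.5 else 0   # project onto 0/1 at the 0.5 line
    return p_hat, y_hat

for age in (20, 30, 40, 50):
    print(age, predict(age))
```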
[Plot: four example observations #1–#4 and their fitted probabilities p̂]
[Same plot: a False Positive (Type I Error) is an observation predicted as 1 that is actually 0; a False Negative (Type II Error) is an observation predicted as 0 that is actually 1]
Confusion Matrix & Accuracy
Confusion Matrix & Accuracy

                    Prediction
                    NEG                              POS
Actual   NEG        TRUE NEG                         FALSE POS (Type I Error)
         POS        FALSE NEG (Type II Error)        TRUE POS

Image source: nature.com
Confusion Matrix & Accuracy

                Prediction
                NEG     POS
Actual   NEG     43      12
         POS      4      41

Accuracy Rate: AR = Correct / Total = (TN + TP) / Total = 84 / 100 = 84%
Error Rate:    ER = Incorrect / Total = (FP + FN) / Total = 16 / 100 = 16%

(Type II Errors = False Negatives; Type I Errors = False Positives)
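The same numbers in code:

```python
# Counts from the confusion matrix above
tn, fp = 43, 12   # actual NEG row: true negatives, false positives
fn, tp = 4, 41    # actual POS row: false negatives, true positives

total = tn + fp + fn + tp
accuracy_rate = (tn + tp) / total   # 84 / 100
error_rate = (fp + fn) / total      # 16 / 100
print(accuracy_rate, error_rate)    # 0.84 0.16
```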
Additional Reading

"Understanding the Confusion Matrix from Scikit-learn"
Samarth Agrawal (2021)
Link: https://towardsdatascience.com/understanding-the-confusion-matrix-from-scikit-learn-c51d88929c79
Scenario 1: Accuracy Rate = Correct / Total → AR = 9,800 / 10,000 = 98%

                 ŷ (Predicted DV)
                  0        1
y (Actual)   0   9,700    150
             1      50    100
Scenario 2 — the model is replaced by one that always predicts 0:

                 ŷ (Predicted DV)
                  0        1
y (Actual)   0   9,850      0
             1     150      0

Accuracy Rate = Correct / Total → AR = 9,850 / 10,000 = 98.5%, higher than Scenario 1 even though this "model" never predicts the positive class.
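The accuracy paradox in code — the model that never predicts class 1 scores higher on accuracy while being useless:

```python
def accuracy(tn, fp, fn, tp):
    # fraction of correct predictions out of all predictions
    return (tn + tp) / (tn + fp + fn + tp)

scenario_1 = accuracy(tn=9_700, fp=150, fn=50, tp=100)  # a real model
scenario_2 = accuracy(tn=9_850, fp=0, fn=150, tp=0)     # always predicts 0
print(scenario_1, scenario_2)  # 0.98 0.985
```

This is why accuracy alone can be misleading on imbalanced data, and why the CAP curve in the next section is useful.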
[CAP analysis setup: Purchased (up to 10,000) vs Total Contacted (up to 100,000) — contacting customers at random yields purchases in direct proportion to the number contacted]
[CAP curves: Purchased % vs Total Contacted % — the Crystal Ball (perfect) model rises fastest, followed by a Good Model, a Poor Model, and the Random straight line]
Note:

CAP = Cumulative Accuracy Profile
ROC = Receiver Operating Characteristic
[CAP analysis: aR = area between the Good Model CAP and the Random Model line; aP = area between the Perfect Model CAP and the Random Model line]

Accuracy Ratio: AR = aR / aP (between 0 and 1; the closer to 1, the better the model)
Rule of thumb — take X = the Purchased % reached at Total Contacted = 50%:

90% < X < 100%   Too Good
80% < X < 90%    Very Good
70% < X < 80%    Good
60% < X < 70%    Poor
X < 60%          Rubbish
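A sketch of computing AR numerically with the trapezoidal rule. The model curve values below are made up for illustration; with an overall purchase rate of 10%, the perfect model reaches 100% after contacting the first 10% of customers:

```python
def area(xs, ys):
    # trapezoidal rule for the area under a piecewise-linear curve
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))

contacted = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # fraction of customers contacted
model = [0.0, 0.5, 0.8, 0.9, 0.95, 1.0]      # illustrative CAP of a good model
random_line = contacted                       # the random-selection diagonal
purchase_rate = 0.1
perfect = [min(x / purchase_rate, 1.0) for x in contacted]

aR = area(contacted, model) - area(contacted, random_line)
aP = area(contacted, perfect) - area(contacted, random_line)
print(round(aR / aP, 3))
```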
Clustering

What is Clustering?

Clustering – grouping unlabelled data
Supervised Learning (e.g. Regression, Classification) vs. Unsupervised Learning (e.g. Clustering)

Image source: mdpi.com/2073-8994/10/12/734
[Scatter plot: Spending Score vs Annual Income $ — clustering reveals groups in the unlabelled data]
K-Means
Clustering
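A minimal sketch of the K-Means algorithm on 2-D points, assuming Euclidean distance: pick K random points as initial centroids, assign each point to its nearest centroid, move each centroid to the mean of its cluster, and repeat until nothing changes:

```python
import random

def kmeans(points, k, iters=50, seed=1):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)               # initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                            # assignment step
            i = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                            + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else centroids[j]               # keep empty clusters put
               for j, c in enumerate(clusters)]
        if new == centroids:                        # converged
            break
        centroids = new
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
print(sorted(centroids))
```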

The Elbow
Method

The Elbow Method

Within Cluster Sum of Squares:

WCSS = Σ over clusters c, Σ over points Pi in cluster c, distance(Pi, centroid of c)²
[Plot: WCSS vs number of clusters — WCSS always decreases as clusters are added; the "elbow" of the curve indicates the optimal number of clusters]
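The Within Cluster Sum of Squares can be computed directly. Below it is evaluated on a 1-D toy dataset for hand-picked centroid sets (illustrative, not fitted), showing why WCSS keeps dropping and where the elbow appears:

```python
def wcss(points, centroids):
    # each point contributes its squared distance to the nearest centroid
    return sum(min((x - c) ** 2 for c in centroids) for x in points)

points = [1, 2, 3, 10, 11, 12, 20, 21, 22]       # three obvious groups
for k, cents in enumerate([[11], [2, 16], [2, 11, 21]], start=1):
    print(k, wcss(points, cents))                # the drop flattens after K = 3
```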
K-Means++

[Running K-Means twice on the same data can produce different results depending on the random initialisation of the centroids — the motivation for K-Means++]
K-Means++

K-Means++ Initialization Algorithm:

Step 1: Choose the first centroid at random among the data points

Step 2: For each of the remaining data points, compute the distance (D) to the nearest of the already selected centroids

Step 3: Choose the next centroid among the remaining data points using weighted random selection, weighted by D²

Step 4: Repeat Steps 2 and 3 until all k centroids have been selected

Step 5: Proceed with standard k-means clustering
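These steps translate almost line-for-line into Python (2-D points; `random.choices` performs the D²-weighted selection, and a point already chosen gets weight 0, so only remaining points can be picked):

```python
import random

def kmeans_pp_init(points, k, seed=0):
    rng = random.Random(seed)
    centroids = [rng.choice(points)]                       # Step 1
    while len(centroids) < k:                              # Step 4
        # Step 2: squared distance to the nearest already-chosen centroid
        d2 = [min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
              for x, y in points]
        # Step 3: weighted random selection, weights proportional to D^2
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids  # Step 5: hand these to standard k-means

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(kmeans_pp_init(pts, 2))
```

Weighting by D² makes far-apart points much more likely to become centroids, which is what avoids the random initialisation trap.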
Hierarchical Clustering produces the same kind of result as K-Means, but arrives at it through a different process.

[Diagram: Before HC → HC → After HC]
STEP 1: Make each data point a single-point cluster → that forms N clusters

STEP 2: Take the two closest data points and make them one cluster → that forms N−1 clusters

STEP 3: Take the two closest clusters and make them one cluster → that forms N−2 clusters

STEP 4: Repeat STEP 3 until there is only one cluster

FIN
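A sketch of these steps in Python, using the distance between cluster centroids as the cluster-distance measure (one of several valid options), and stopping once a requested number of clusters remains:

```python
def centroid(cluster):
    xs = [x for x, _ in cluster]
    ys = [y for _, y in cluster]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def agglomerative(points, n_clusters=1):
    clusters = [[p] for p in points]          # STEP 1: N single-point clusters
    while len(clusters) > n_clusters:         # STEP 4: repeat until done
        best = None
        for i in range(len(clusters)):        # STEPS 2-3: find the two closest
            for j in range(i + 1, len(clusters)):
                ci, cj = centroid(clusters[i]), centroid(clusters[j])
                d = (ci[0] - cj[0]) ** 2 + (ci[1] - cj[1]) ** 2
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]            # ...and merge them
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
print(agglomerative(pts, n_clusters=3))
```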
Euclidean Distance between two points P1(x1, y1) and P2(x2, y2):

distance = √((x2 − x1)² + (y2 − y1)²)
Distance Between Two Clusters:

• Option 1: Closest Points
• Option 2: Furthest Points
• Option 3: Average Distance
• Option 4: Distance Between Centroids
Consider the following dataset of N = 6 data points.

STEP 1: Make each data point a single-point cluster → that forms 6 clusters
STEP 2: Take the two closest data points and make them one cluster → that forms 5 clusters
STEP 3: Take the two closest clusters and make them one cluster → that forms 4 clusters
STEP 4: Repeat STEP 3 until there is only one cluster

FIN
[Dendrogram construction, step by step: points P1–P6 on the horizontal axis; the vertical axis records the dissimilarity (Euclidean distance) at which clusters are merged, until all points form one cluster]
[Setting a dissimilarity threshold on the dendrogram fixes the number of clusters: depending on where the horizontal cut is made, the same dendrogram yields 2, 4, or 6 clusters. A common heuristic: cut across the largest vertical distance that no extended horizontal line crosses — here that gives 2 clusters]
[Second example with nine points P1–P9: cutting across the largest vertical distance gives 3 clusters]
People who bought also bought …
User ID    Movies liked
46578      Movie1, Movie2, Movie3, Movie4
98989      Movie1, Movie2
71527      Movie1, Movie2, Movie4
78981      Movie1, Movie2
89192      Movie2, Movie4
61557      Movie1, Movie3

Potential Rules:
Movie1 → Movie2
Movie2 → Movie4
Movie1 → Movie3
Transaction ID    Products purchased
46578             Burgers, French Fries, Vegetables
98989             Burgers, French Fries, Ketchup
71527             Vegetables, Fruits
78981             Pasta, Fruits, Butter, Vegetables
89192             Burgers, Pasta, French Fries
61557             Fruits, Orange Juice, Vegetables
87923             Burgers, French Fries, Ketchup, Mayo

Potential Rules:
Burgers → French Fries
Vegetables → Fruits
Burgers, French Fries → Ketchup
Market Basket Optimisation / Movie Recommendation — the three measures, with illustrative numbers:

Support — e.g. 10 of 100 users watched M2: Support = 10 / 100 = 10%

Confidence — e.g. 7 of the 40 users who watched M1 also watched M2: Confidence = 7 / 40 = 17.5%

Lift = Confidence / Support = 17.5% / 10% = 1.75
Step 1: Set a minimum support and confidence

Step 2: Take all the subsets in transactions having higher support than minimum support

Step 3: Take all the rules of these subsets having higher confidence than minimum confidence

Step 4: Sort the rules by decreasing lift
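The three measures in code, using the illustrative counts from the lecture (100 users in total, 10 watched M2, 40 watched M1, 7 watched both — example numbers, not real data):

```python
total_users = 100
watched_m2 = 10        # users who watched M2
watched_m1 = 40        # users who watched M1
watched_both = 7       # users who watched both M1 and M2

support = watched_m2 / total_users        # 10 / 100 = 10%
confidence = watched_both / watched_m1    # 7 / 40 = 17.5%
lift = confidence / support               # 17.5% / 10% = 1.75
print(support, confidence, round(lift, 2))
```

A lift above 1 means watching M1 makes watching M2 more likely than it is for a random user, which is why rules are ranked by decreasing lift.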
Market Basket Optimisation / Movie Recommendation: Eclat uses only the Support measure.
Step 1: Set a minimum support

Step 2: Take all the subsets in transactions having higher support than minimum support

Step 3: Sort these subsets by decreasing support
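A sketch of Eclat on the market-basket table shown earlier in this section: compute the support of every item pair, keep the ones above a minimum support, and sort by decreasing support:

```python
from itertools import combinations

transactions = [
    {"Burgers", "French Fries", "Vegetables"},
    {"Burgers", "French Fries", "Ketchup"},
    {"Vegetables", "Fruits"},
    {"Pasta", "Fruits", "Butter", "Vegetables"},
    {"Burgers", "Pasta", "French Fries"},
    {"Fruits", "Orange Juice", "Vegetables"},
    {"Burgers", "French Fries", "Ketchup", "Mayo"},
]

def support(itemset):
    # fraction of transactions that contain every item in the set
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
min_support = 0.3                               # Step 1
frequent = [(pair, support(set(pair)))          # Step 2
            for pair in combinations(items, 2)
            if support(set(pair)) >= min_support]
frequent.sort(key=lambda ps: -ps[1])            # Step 3
print(frequent)
```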
[Illustration: five options (e.g. slot machines or ad designs), each with its own unknown reward distribution D1–D5 — the multi-armed bandit problem]

Examples used for educational purposes. No affiliation with Coca-Cola.
D5
D4
D3
D2
D1
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
D5
D4
D3
D2
D1
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
D5
D4
D3
D2
D1
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
D5
D4
D3
D2
D1
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
D5
D4
D3
D2
D1
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
D5
D4
D3
D2
D1
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
D5
D4
D3
D2
D1
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
D5
D4
D3
D2
D1
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
D5
D4
D3
D2
D1
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
D5
D4
D3
D2
D1
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
D5
D4
D3
D2
D1
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
[Animation frames: five bandit machines with return distributions D1–D5, refined round by round]
Thompson Sampling intuition:
• We construct distributions of where we think the μ* values (expected returns) will be
• I.e. we are NOT trying to guess the return distributions behind the machines
• In effect, we've generated our own bandit configuration, which we refine with each new round
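The intuition above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: it assumes Bernoulli bandits with made-up conversion rates, keeps one Beta distribution per machine, and each round plays the machine whose sampled draw is highest.

```python
import random

# Assumed setup: three Bernoulli bandits with hypothetical conversion rates.
# The algorithm never sees these rates, only the 0/1 returns it observes.
true_rates = [0.15, 0.05, 0.25]
n_machines = len(true_rates)
wins = [0] * n_machines     # observed 1-returns per machine
losses = [0] * n_machines   # observed 0-returns per machine

random.seed(42)
selections = []
for _ in range(2000):
    # Our own "generated bandit configuration": a Beta distribution per
    # machine over where its expected return might be. Sample each one.
    draws = [random.betavariate(wins[i] + 1, losses[i] + 1)
             for i in range(n_machines)]
    machine = draws.index(max(draws))
    selections.append(machine)
    # New round: observe the return and refine that machine's distribution.
    if random.random() < true_rates[machine]:
        wins[machine] += 1
    else:
        losses[machine] += 1
```

Over the 2000 rounds, the machine with the best true rate ends up being selected far more often than the others, even though the algorithm only ever reasoned about its own sampled beliefs.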
Upper Confidence Bound:
• Deterministic
• Requires an update at every round

Thompson Sampling:
• Probabilistic
• Can accommodate delayed feedback
• Better empirical evidence
Here’s what we will learn:
• Types of Natural Language Processing
• Classical vs Deep Learning Models
• End-to-end Deep Learning Models
• Bag-Of-Words

• Note: Seq2Seq and Chatbots are outside the scope of this course
[Diagram: Natural Language Processing and Deep Learning overlap to form DNLP; Seq2Seq is a subset of DNLP]
Some examples:
1. If / Else Rules (Chatbot)
2. Audio frequency components analysis (Speech Recognition)
3. Bag-of-words model (Classification)
4. CNN for text Recognition (Classification)
5. Seq2Seq (many applications)

Example training data for classification:

Comment → Pass/Fail
Great job! → 1
Amazing work. → 1
Well done. → 1
Very well written. → 1
Poor effort. → 0
Could have done better. → 0
Try harder next time. → 0

[Diagram: Seq2Seq encoder-decoder. The encoder reads "Hello Kirill, Checking if you are back ... EOS" (states h0–h3); the decoder generates "Yes I'm back EOS" (states g0–g2). Image Source: www.wildml.com]
[Diagram: End-to-end Deep Learning Models sit inside the intersection of NLP and DL]
The Bag-of-Words model starts with a vector of zeros:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... , 0]
(20,000 elements long: one position per word in the vocabulary, e.g. "if", "badminton", "table")

The first positions are reserved for special words: SOS (start of sentence) and EOS (end of sentence).
Hello Kirill, Checking if you are back to Oz. Let me know if you are around … Cheers, V

Counting how many times each vocabulary word occurs in the email fills in the vector:

[1, 1, 0, 0, 1, 0, 2, 0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 0, 1, 0, 0, ... , 3]
(20,000 elements long)
Training Data:
Hey mate, have you read about Hinton's capsule networks?
Did you like that recipe I sent you last week?
Hi Kirill, are you coming to dinner tonight?
Dear Kirill, would you like to service your car with us again?
Are you coming to Australia in December?

Each training email is converted to its own 20,000-element count vector:
[1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, ... , 2]
[1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 1, 0, 0, ... , 0]
[1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, ... , 1]
[1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, ... , 1]
[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, ... , 1]

Image Source: www.helloacm.com
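The vector-building step can be sketched as follows. This is a minimal illustration with a tiny hypothetical vocabulary (the course uses 20,000 words); `bag_of_words` is a helper name chosen here, not a course function.

```python
# A minimal bag-of-words sketch: one count per vocabulary position.
vocabulary = ["hello", "kirill", "you", "are", "coming", "to"]

def bag_of_words(text, vocab):
    # Lowercase, strip punctuation, then count each vocabulary word.
    words = [w.strip(".,?!…") for w in text.lower().split()]
    return [words.count(word) for word in vocab]

email = "Hi Kirill, are you coming to dinner tonight?"
print(bag_of_words(email, vocabulary))  # [0, 1, 1, 1, 1, 1]
```

Each training email would be converted the same way, and the resulting count vectors fed into a classifier.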
[Charts: the growth of computing power on a log scale, with data points at 1956, 1980 (2x) and 2017 (25,600x). Sources: mkomo.com, nature.com, Time Magazine]
Geoffrey Hinton
Image Source: www.austincc.edu
[Diagram: a simple neural network, with an Input Layer (input values 1–3), a Hidden Layer, and an Output Layer (output value). Deep networks stack multiple Hidden Layers between the Input and Output Layers]
Supervised:
• Artificial Neural Networks: used for Regression & Classification
• Convolutional Neural Networks: used for Computer Vision
• Recurrent Neural Networks: used for Time Series Analysis

Unsupervised:
• Self-Organizing Maps: used for Feature Detection
• Deep Boltzmann Machines: used for Recommendation Systems
• AutoEncoders: used for Recommendation Systems
What we will learn in this section:
• The Neuron
• The Activation Function
• How do Neural Networks work? (example)
• How do Neural Networks learn?
• Gradient Descent
• Stochastic Gradient Descent
• Backpropagation
[Image: a biological neuron, with dendrites (receivers), the neuron body, and the axon (transmitter). Image Sources: www.austincc.edu, Wikipedia]
The artificial neuron (node) mimics this: input signals 1…m arrive at the neuron, and the neuron emits an output signal.
[Diagram: the neuron. Input values X1, X2, …, Xm connect to the neuron via synapses; the neuron produces an output value y]

• The inputs X1…Xm are the independent variables for one observation
• They should be standardized (mean 0, variance 1), or sometimes normalized, before being fed into the network
Additional Reading:
"Efficient BackProp" by Yann LeCun et al. (1998)
Link: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
The output value y can be:
• Continuous (e.g. price)
• Binary (e.g. will exit yes/no)
• Categorical: in that case there are several output values y1, y2, …, yp (dummy variables, one per category)
Important: the input values X1…Xm and the output value y all refer to the same single observation, i.e. one row of the dataset.
Each synapse carries a weight: w1, w2, …, wm. The weights are what the network learns. What happens inside the neuron?

1st step: take the weighted sum of the inputs, Σi wi·xi
2nd step: apply the activation function φ to that sum: φ(Σi wi·xi)
3rd step: pass the resulting signal y on down the network
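The three steps can be sketched directly in code. The numbers below are illustrative, and a sigmoid is assumed as the activation function:

```python
import math

def neuron(inputs, weights):
    # 1st step: take the weighted sum of the inputs.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # 2nd step: apply the activation function (sigmoid here) to the sum.
    activation = 1.0 / (1.0 + math.exp(-weighted_sum))
    # 3rd step: pass the signal on as the output value y.
    return activation

y = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1])
print(y)  # sigmoid(0.3) ≈ 0.574
```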
The Activation Function: four common options.

• Threshold Function: φ(x) = 1 if x ≥ 0, otherwise 0 (a hard yes/no, jumping from 0 to 1)
• Sigmoid: φ(x) = 1 / (1 + e^(−x)) (smooth, outputs between 0 and 1)
• Rectifier (ReLU): φ(x) = max(x, 0) (one of the most used in practice)
• Hyperbolic Tangent (tanh): φ(x) = (1 − e^(−2x)) / (1 + e^(−2x)) (outputs between −1 and 1)
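The four activation functions above, written out as a plain-Python reference sketch:

```python
import math

def threshold(x):
    return 1.0 if x >= 0 else 0.0            # hard yes/no

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))        # smooth, between 0 and 1

def rectifier(x):
    return max(x, 0.0)                       # ReLU: clips negatives to 0

def tanh(x):
    # Same curve as math.tanh, written from the formula above.
    return (1 - math.exp(-2 * x)) / (1 + math.exp(-2 * x))

print(threshold(-2.0), sigmoid(0.0), rectifier(-2.0), tanh(0.0))
```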
Additional Reading:
"Deep Sparse Rectifier Neural Networks" by Xavier Glorot et al. (2011)
Link: http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf
Which activation function to pick? Assuming the dependent variable is binary (y = 0 or 1):

• If threshold activation function: the output is directly the predicted class (0 or 1)
• If sigmoid activation function: the output is the probability that y = 1
[Diagram: Input Layer (X1, X2, …, Xm) → Hidden Layer → Output Layer (y). Different layers can use different activation functions, e.g. a rectifier in the hidden layer and a sigmoid in the output layer]
How do Neural Networks work? Example: predicting house prices.

Input Layer: Area (feet²) X1, Bedrooms X2, Distance to city (miles) X3, Age X4
Output Layer: y = Price

With no hidden layer, the network is just a weighted sum:
Price = w1·x1 + w2·x2 + w3·x3 + w4·x4
[Diagram: the same four inputs feeding a Hidden Layer, then the Output Layer (y = Price). Each hidden neuron picks up on its own combination of inputs (one might combine Area and Distance to city, another Bedrooms and Age), and together these intermediate features give the network its extra power and flexibility]
How do Neural Networks learn?

[Diagram: the perceptron. Inputs X1…Xm with weights w1…wm produce the output value ŷ, which is compared against the actual value y]

The cost function measures the error of the prediction:

C = ½(ŷ − y)²

The cost is fed back into the network, the weights w1, …, wm get updated, and the observation is run through again, repeating until the cost is minimized.
When training on a full dataset, every row is fed through the same network (one shared set of weights). The total cost sums over all rows:

C = Σ ½(ŷ − y)²

After the whole dataset has gone through (one epoch), we adjust the weights w1, w2, …, wm and run it again.
Additional Reading:
"A list of cost functions used in neural networks, alongside applications", CrossValidated (2015)
Link: http://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications
Gradient Descent. For a single weight w1, the cost C = ½(ŷ − y)² plotted against ŷ is a parabola; the best weight sits at the bottom of the curve, where C is minimal. Gradient descent finds it by repeatedly stepping downhill along the slope of the curve.
Why not just try every weight combination? Even the small house-price network above (4 inputs, a 5-neuron hidden layer, 1 output) already has 4×5 + 5 = 25 weights to tune.
Trying just 1,000 values per weight gives:

1,000 × 1,000 × … × 1,000 = 1,000^25 = 10^75 combinations

Sunway TaihuLight, the world's fastest supercomputer at the time: 93 PFLOPS = 93 × 10^15 operations per second

10^75 / (93 × 10^15) ≈ 1.08 × 10^58 seconds ≈ 3.4 × 10^50 years

Brute force is hopeless: this is the curse of dimensionality.

Image Source: neuralnetworksanddeeplearning.com
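The slide's arithmetic can be sanity-checked in a couple of lines:

```python
# 1,000 candidate values per weight, 25 weights, evaluated on a
# 93-PFLOPS machine (93 * 10**15 operations per second).
combinations = 1000 ** 25                  # = 10**75
flops = 93e15
seconds = combinations / flops             # about 1.08 * 10**58 seconds
years = seconds / (60 * 60 * 24 * 365)     # about 3.4 * 10**50 years
print(f"{seconds:.2e} s = {years:.2e} years")
```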
[Chart: gradient descent on C = ½(ŷ − y)². From the starting point, take steps in the downhill direction of the slope until reaching the minimum]
[Charts: plain (batch) gradient descent requires a convex cost function. With a non-convex curve it can get stuck in a local minimum instead of the global best]
Batch Gradient Descent: run the whole training set through the network, compute the summed cost C = Σ ½(ŷ − y)², and update the weights once per epoch.

Stochastic Gradient Descent: update the weights after every single row ("Upd w's" after each observation). The updates are noisier, but this helps avoid local minima and it typically converges faster.
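Stochastic gradient descent can be sketched on the simplest possible case: one linear neuron, ŷ = w·x + b, learning a made-up relationship y = 2x + 1. The data and hyperparameters below are illustrative; the weights are updated after every single row, as described above.

```python
import random

# Toy noiseless dataset for y = 2x + 1.
random.seed(0)
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

w, b = random.random(), random.random()   # random initial weights
lr = 0.1                                  # learning rate

for epoch in range(200):
    random.shuffle(data)                  # visit the rows in random order
    for x, y in data:
        y_hat = w * x + b
        error = y_hat - y                 # derivative of C = 1/2*(y_hat - y)**2 w.r.t. y_hat
        # Step each weight downhill along its gradient, one row at a time.
        w -= lr * error * x
        b -= lr * error

print(round(w, 3), round(b, 3))  # converges toward w = 2, b = 1
```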
Additional Reading:
"A Neural Network in 13 lines of Python (Part 2 - Gradient Descent)" by Andrew Trask (2015)
Link: https://iamtrask.github.io/2015/07/27/python-network-part2/
Additional Reading:
"Neural Networks and Deep Learning" by Michael Nielsen (2015)
Link: http://neuralnetworksanddeeplearning.com/chap2.html
Forward Propagation: information enters the input layer and flows left to right to produce the prediction ŷ.
Backpropagation: the error flows right to left, and all the weights are adjusted simultaneously according to how much each one contributed to the error.

Image Source: neuralnetworksanddeeplearning.com
Additional Reading:
"Neural Networks and Deep Learning" by Michael Nielsen (2015)
Link: http://neuralnetworksanddeeplearning.com/chap2.html
STEP 1: Randomly initialise the weights to small numbers close to 0 (but not 0).

STEP 2: Input the first observation of your dataset in the input layer, each feature in one input node.

STEP 3: Forward-Propagation: from left to right, the neurons are activated in a way that the impact of each
neuron's activation is limited by the weights. Propagate the activations until getting the predicted result ŷ.

STEP 4: Compare the predicted result to the actual result. Measure the generated error.

STEP 5: Back-Propagation: from right to left, the error is back-propagated. Update the weights according to
how much they are responsible for the error. The learning rate decides by how much we update the
weights.

STEP 6: Repeat Steps 1 to 5 and update the weights after each observation (Reinforcement Learning). Or:
Repeat Steps 1 to 5 but update the weights only after a batch of observations (Batch Learning).

STEP 7: When the whole training set has passed through the ANN, that makes an epoch. Redo more epochs.
NOT FOR DISTRIBUTION © SUPERDATASCIENCE www.superdatascience.com
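The seven steps above can be sketched end to end on a tiny network. This is an illustration only, assuming a 2-input, 3-hidden, 1-output network with sigmoid activations trained on the OR function (the course builds its networks with a library instead):

```python
import math
import random

random.seed(1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# STEP 1: randomly initialise the weights to small numbers close to 0.
# A constant 1.0 is appended to each input to act as a bias term.
w_hidden = [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)]
w_out = [random.uniform(-0.1, 0.1) for _ in range(3)]

data = [([0.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 1),
        ([1.0, 0.0, 1.0], 1), ([1.0, 1.0, 1.0], 1)]
lr = 0.5  # learning rate: decides by how much we update the weights

def forward(x):
    # STEPS 2-3: input one observation; activations flow left to right.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    return hidden, sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

def total_cost():
    return sum(0.5 * (forward(x)[1] - y) ** 2 for x, y in data)

cost_before = total_cost()
for _ in range(2000):                 # STEP 7: redo more epochs
    for x, y in data:                 # STEP 6: update after each observation
        hidden, y_hat = forward(x)
        # STEP 4: measure the generated error at the output.
        delta_out = (y_hat - y) * y_hat * (1 - y_hat)
        # STEP 5: back-propagate, updating weights by their share of the error.
        for j, h in enumerate(hidden):
            delta_h = delta_out * w_out[j] * h * (1 - h)
            w_out[j] -= lr * delta_out * h
            for i in range(3):
                w_hidden[j][i] -= lr * delta_h * x[i]

cost_after = total_cost()
print(cost_before, "->", cost_after)
```

The summed cost drops sharply over the epochs as backpropagation shares the error out across both layers of weights.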
What we will learn in this section:
• What are Convolutional Neural Networks?
• Step 1 - Convolution Operation
• Step 1(b) - ReLU Layer
• Step 2 - Pooling
• Step 3 - Flattening
• Step 4 - Full Connection
• Summary
• EXTRA: Softmax & Cross-Entropy
Image Source: a talk by Geoffrey Hinton
Source: Google Trends
Yann LeCun
Facebook, Google
Input Image → CNN → Output: a Label (the image class)

E.g. the same face with different expressions: one photo → CNN → "Happy"; another → CNN → "Sad".
B/W Image (2×2 px): a 2d array, one value per pixel (Pixel 1 … Pixel 4).

Colored Image (2×2 px): a 3d array, where each pixel has three values, one per channel (Red, Green, Blue).
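As a quick sketch of the two representations (the pixel values here are made up):

```python
# A 2x2 black-and-white image: one intensity value per pixel -> 2d array.
bw_image = [
    [0, 255],
    [128, 64],
]

# A 2x2 colored image: three values (Red, Green, Blue) per pixel -> 3d array.
color_image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

rows, cols = len(bw_image), len(bw_image[0])
channels = len(color_image[0][0])
print(rows, cols, channels)  # 2 2 3
```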
[Example: a small smiley-face image encoded as a 2d array of 0s and 1s, one value per pixel]
STEP 1: Convolution

STEP 2: Max Pooling

STEP 3: Flattening

STEP 4: Full Connection
Additional Reading:
"Gradient-Based Learning Applied to Document Recognition" by Yann LeCun et al. (1998)
Link: http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
Additional Reading:
"Introduction to Convolutional Neural Networks" by Jianxin Wu (2017)
Link: http://cs.nju.edu.cn/wujx/paper/CNN.pdf
The Convolution Operation: slide the Feature Detector (also called a filter or kernel) across the Input Image one stride at a time; at each position, multiply the overlapping cells element-wise and add them up. Each sum becomes one cell of the Feature Map.

Input Image (7×7):
0 0 0 0 0 0 0
0 1 0 0 0 1 0
0 0 0 0 0 0 0
0 0 0 1 0 0 0
0 1 0 0 0 1 0
0 0 1 1 1 0 0
0 0 0 0 0 0 0

Feature Detector (3×3):
0 0 1
1 0 0
0 1 1

Feature Map (5×5), with a stride of 1:
0 1 0 0 0
0 1 1 1 0
1 0 0 2 1
1 4 2 1 0
0 0 1 2 1
We create many feature maps to obtain our first convolution layer: each
feature detector produces its own feature map, so the convolutional layer
is a stack of feature maps.

Input Image  →  Convolutional Layer (Feature Maps)
Examples of classic convolution kernels (Image Source: docs.gimp.org/en/plug-in-convmatrix.html):

• Sharpen
• Blur
• Edge Enhance
• Edge Detect
• Emboss
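The same sliding-window convolution drives all of these image effects. As a sketch, here is one common sharpen kernel applied by hand in NumPy (the kernel values are a standard choice for illustration, not taken from the slides):

```python
import numpy as np

# One common 3x3 sharpen kernel: centre boosted, neighbours subtracted.
sharpen = np.array([
    [ 0, -1,  0],
    [-1,  5, -1],
    [ 0, -1,  0],
])

def apply_kernel(img, kernel):
    # Valid convolution: stride 1, no padding.
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# On a flat region the kernel changes nothing (its weights sum to 1)...
flat = np.full((4, 4), 7.0)
print(apply_kernel(flat, sharpen))   # all 7s

# ...but it amplifies contrast around an edge.
edge = np.array([[0, 0, 10, 10]] * 4, dtype=float)
print(apply_kernel(edge, sharpen))
```

Blur, emboss, and edge-detect effects work the same way with different kernel weights.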
Image Source: leonardoaraujosantos.gitbooks.io
Feature Maps + Rectifier

The Rectified Linear Unit (ReLU), y = max(0, x), is applied to every value in
each feature map of the convolutional layer. It replaces negative values with 0,
increasing non-linearity in the network.

Input Image  →  Convolutional Layer (Feature Maps)  →  Rectifier
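In code, applying the Rectifier to a feature map is a one-liner; a minimal sketch (the example values are made up):

```python
import numpy as np

# ReLU zeroes out the negative values in a feature map, element-wise.
def relu(x):
    return np.maximum(0, x)

feature_map = np.array([
    [ 2, -1,  0],
    [-3,  4, -2],
])
print(relu(feature_map))   # negatives become 0, positives pass through
```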
Image Source: http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf
Additional Reading:

Understanding Convolutional Neural Networks with a Mathematical Model

By C.-C. Jay Kuo (2016)

Link: https://arxiv.org/pdf/1609.04112.pdf
Additional Reading:

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

By Kaiming He et al. (2015)

Link: https://arxiv.org/pdf/1502.01852.pdf
Image Source: Wikipedia
Feature Map              Pooled Feature Map

0 1 0 0 0
0 1 1 1 0                     1 1 0
1 0 1 2 1    Max Pooling →    4 2 1
1 4 2 1 0                     0 2 1
0 0 1 2 1

A 2×2 box slides across the feature map with a stride of 2, and only the
maximum value inside each box is kept. This preserves the strongest feature
responses while reducing the size of the map and adding spatial invariance.
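The max pooling step can be sketched in NumPy as a check of the numbers above; the helper name `max_pool` is our own, and the window is clipped at the borders so the 5×5 map pools down to 3×3:

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """2x2 max pooling with stride 2; partial windows at the
    borders are allowed (clipped by the array slicing)."""
    h, w = fmap.shape
    out_h = -(-h // stride)   # ceiling division to cover the edges
    out_w = -(-w // stride)
    out = np.zeros((out_h, out_w), dtype=fmap.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out

feature_map = np.array([
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 1, 2, 1],
    [1, 4, 2, 1, 0],
    [0, 0, 1, 2, 1],
])
print(max_pool(feature_map))
```

Running this reproduces the pooled feature map from the slides.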
Additional Reading:

Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition

By Dominik Scherer et al. (2010)

Link: http://ais.uni-bonn.de/papers/icann2010_maxpool.pdf
Input Image  →(Convolution)→  Convolutional Layer  →(Pooling)→  Pooling Layer
Image Source: scs.ryerson.ca/~aharley/vis/conv/flat.html
Pooled Feature Map          Flattening

                                 1
1 1 0                            1
4 2 1              →             0
0 2 1                            4
                                 2
                                 1
                                 0
                                 2
                                 1

The pooled feature map is flattened row by row into a single column vector.
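In NumPy, flattening the pooled feature map row by row is a single call:

```python
import numpy as np

pooled = np.array([
    [1, 1, 0],
    [4, 2, 1],
    [0, 2, 1],
])

# Row-by-row flattening turns the 3x3 pooled map into a 9-element vector,
# ready to be fed into a fully connected network.
vector = pooled.flatten()
print(vector)   # [1 1 0 4 2 1 0 2 1]
```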
Pooling Layer  →(Flattening)→  Input layer of a future ANN

All of the pooled feature maps are flattened and concatenated into one long
vector, which becomes the input layer of a future ANN.
Input Image  →(Convolution)→  Convolutional Layer  →(Pooling)→  Pooling Layer  →(Flattening)→  Input layer of a future ANN
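Chaining the steps gives the whole forward pass up to the ANN input. A minimal NumPy sketch on the 7×7 image and 3×3 detector from the slides (the fully connected weights here are random placeholders, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_valid(img, k):
    kh, kw = k.shape
    return np.array([[np.sum(img[i:i + kh, j:j + kw] * k)
                      for j in range(img.shape[1] - kw + 1)]
                     for i in range(img.shape[0] - kh + 1)])

def relu(x):
    return np.maximum(0, x)

def max_pool(x, s=2):
    oh, ow = -(-x.shape[0] // s), -(-x.shape[1] // s)
    return np.array([[x[i * s:i * s + s, j * s:j * s + s].max()
                      for j in range(ow)]
                     for i in range(oh)])

image = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
])
detector = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 1]])

x = conv_valid(image, detector)   # convolution -> 5x5 feature map
x = relu(x)                       # rectifier
x = max_pool(x)                   # pooling -> 3x3
x = x.flatten()                   # flattening -> 9-vector
w = rng.normal(size=(9,))         # placeholder fully connected weights
score = float(x @ w)              # single output value
print(x, score)
```

In the full course implementation this last dot product is replaced by trained fully connected layers (e.g. Keras `Dense` layers), but the data flow is the same.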
X1, X2, …, Xm

Input Layer  →  Fully Connected Layer  →  Output Layer (output value)

The flattened vector (X1 … Xm) becomes the input layer of a fully connected
ANN, which combines the extracted features into the final output value.
Dog vs Cat example: during training, each neuron in the last fully connected
layer learns which features it responds to, and the two output neurons (Dog
and Cat) learn which of those neurons to listen to. Neurons with high
activations (e.g. 0.9, 1) that fire on dog-like features effectively vote for
the Dog output, while low activations (e.g. 0.1, 0.2) contribute little. The
network then produces probabilities for the two classes, such as 0.95 vs 0.05
for a clear-cut image or 0.79 vs 0.21 for a more ambiguous one.
Image Source: a talk by Geoffrey Hinton
Additional Reading:

The 9 Deep Learning Papers You Need To Know About (Understanding CNNs Part 3)

By Adit Deshpande (2016)

Link: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html
Softmax: the Dog and Cat output neurons produce raw scores z1 and z2, which do
not by themselves sum to 1. The softmax function squashes these scores into
class probabilities that do sum to 1 (e.g. 0.95 for Dog and 0.05 for Cat).
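A minimal NumPy sketch of softmax applied to two raw scores (the z values here are made-up examples, not taken from the slides):

```python
import numpy as np

# Softmax squashes raw output scores into probabilities that sum to 1.
def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0])       # raw scores z1, z2 for the two classes
p = softmax(z)
print(p, p.sum())               # probabilities summing to 1
```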
Cross-Entropy: the predicted probabilities (e.g. Dog 0.9, Cat 0.1) are compared
against the one-hot label (Dog 1, Cat 0) to measure how wrong the prediction is.
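The quantity being computed is the standard cross-entropy between the one-hot label distribution $p$ and the predicted distribution $q$:

```latex
H(p, q) = -\sum_{i} p_i \log q_i
```

For the prediction above, with $p = (1, 0)$ and $q = (0.9, 0.1)$, this gives $H = -\log 0.9 \approx 0.105$: only the probability assigned to the true class is penalised.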
              NN 1                             NN 2
Row   ŷ Dog  ŷ Cat  Dog  Cat       Row   ŷ Dog  ŷ Cat  Dog  Cat
#1     0.9    0.1    1    0         #1    0.6    0.4    1    0
#2     0.1    0.9    0    1         #2    0.3    0.7    0    1
#3     0.4    0.6    1    0         #3    0.1    0.9    1    0

Classification Error:   1/3 = 0.33       1/3 = 0.33
Mean Squared Error:     0.25             0.71
Cross-Entropy:          0.38             1.06
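The three metrics in the comparison can be reproduced directly from the table's predictions and one-hot labels; a minimal NumPy sketch (the function names are our own):

```python
import numpy as np

# Predictions and one-hot labels for the two networks from the table.
y_hat_nn1 = np.array([[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]])
y_hat_nn2 = np.array([[0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])
labels    = np.array([[1, 0],     [0, 1],     [1, 0]])

def classification_error(y_hat, y):
    # Fraction of rows where the predicted class differs from the label.
    return np.mean(np.argmax(y_hat, axis=1) != np.argmax(y, axis=1))

def mean_squared_error(y_hat, y):
    # Sum of squared errors per row, averaged over the rows.
    return np.mean(np.sum((y_hat - y) ** 2, axis=1))

def cross_entropy(y_hat, y):
    # -sum(p * log q) per row (natural log), averaged over the rows.
    return np.mean(-np.sum(y * np.log(y_hat), axis=1))

for name, y_hat in [("NN 1", y_hat_nn1), ("NN 2", y_hat_nn2)]:
    print(name,
          round(classification_error(y_hat, labels), 2),
          round(mean_squared_error(y_hat, labels), 2),
          round(cross_entropy(y_hat, labels), 2))
```

Both networks share the same classification error (0.33), yet MSE and especially cross-entropy show that NN 1's probabilities are much better calibrated, which is why cross-entropy is the preferred loss for classification.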
Additional Reading:

A Friendly Introduction to Cross-Entropy Loss

By Rob DiPietro (2016)

Link: https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
Additional Reading:

How to Implement a Neural Network: Intermezzo 2

By Peter Roelants (2016)

Link: http://peterroelants.github.io/posts/neural_network_implementation_intermezzo02/