Intro to Neural Networks
Explained for beginners
Sajjad Murtaza
Kashif
Mustafa
AI Sciences Instructor
@AISciencesLearn
Problem
[Scatter plot: Academic Marks vs Marks in Test for past hiring decisions (criteria)]
Problem
[Scatter plot: the same hiring data with a new candidate marked "?" whose decision is unknown]
Linear Equation
[Scatter plot: Academic Marks vs Marks in Test with a linear decision boundary]
• 3x + y − b = 0
• 3·(marks in test) + (academic marks) − b = 0
• If the score is > 0, i.e. positive: Accept
• If the score is < 0, i.e. negative: Reject
Linear Equation Vectorized Form
[Scatter plot: the same data and linear boundary]
w1x1 + w2x2 + b = 0
If:
• W = (w1, w2)
• x = (x1, x2)
• y = label: 0 if False, 1 if True
Then:
• Wx + b = 0
• ŷ = 1 if Wx + b > 0
• ŷ = 0 if Wx + b < 0
Higher Dimensional Space
w1X + w2Y + w3Z + b = 0
w1x1 + w2x2 + w3x3 + b = 0
• Wx + b = 0
• ŷ = 1 if Wx + b > 0
• ŷ = 0 if Wx + b < 0
N Dimensional Space

Employee   Test   Academia   …   Feature N
Emp 1       5        2       …      8
Emp 2       8        7       …      9
…           …        …       …      …
Emp n       6        5       …      7

w1x1 + w2x2 + … + wnxn + b = 0
• W = (w1, w2, …, wn)
• x = (x1, x2, …, xn)
• Wx + b = 0
• ŷ = 1 if Wx + b > 0
• ŷ = 0 if Wx + b < 0
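A minimal sketch of this prediction rule in Python (NumPy assumed; the weights, bias, and feature values below are made-up illustrations, not values from the slides):

```python
import numpy as np

# Hypothetical weights and bias for n = 3 features
W = np.array([2.0, 3.0, 0.5])
b = -7.0

def predict(x, W, b):
    """Return y-hat: 1 if Wx + b > 0, else 0."""
    score = np.dot(W, x) + b
    return 1 if score > 0 else 0

# One employee's feature vector x = (x1, x2, ..., xn)
emp = np.array([5.0, 2.0, 8.0])
print(predict(emp, W, b))  # 1 -> accept, 0 -> reject
```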
Perceptron
[Diagram: inputs x1 and x2 with weights w1 and w2, plus a constant input 1 with weight b, feed a single node]
• The node computes Wx + b
• Output 1 if Wx + b > 0
• Output 0 if Wx + b < 0
Perceptron
[Diagram: the same node generalized to inputs x1 … xn with weights w1 … wn and bias b; the output is a Yes / No decision]
• Wx + b = 0
• Output 1 (Yes) if Wx + b > 0
• Output 0 (No) if Wx + b < 0
Human Brain
[Diagram: the same perceptron shown as an analogy to a neuron in the human brain]
Using Logical Gates for Perceptron: AND Gate

P1   P2   P1 AND P2
1    1        1
1    0        0
0    1        0
0    0        0

[Diagram: Perceptron 1 (output 1/0) and Perceptron 2 (output 1/0) feed an AND perceptron]
Using Logical Gates for Perceptron: OR Gate

P1   P2   P1 OR P2
1    1       1
1    0       1
0    1       1
0    0       0

[Diagram: Perceptron 1 (output 1/0) and Perceptron 2 (output 1/0) feed an OR perceptron]
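A small sketch of how these gates can be realized with the perceptron rule above; the particular weights and biases (1, 1, −1.5 for AND and 1, 1, −0.5 for OR) are one common choice, not the only one:

```python
def perceptron(x1, x2, w1, w2, b):
    """Step-function perceptron: 1 if w1*x1 + w2*x2 + b > 0, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def AND(p1, p2):
    # Fires only when both inputs are 1
    return perceptron(p1, p2, 1, 1, -1.5)

def OR(p1, p2):
    # Fires when at least one input is 1
    return perceptron(p1, p2, 1, 1, -0.5)

for p1, p2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(p1, p2, AND(p1, p2), OR(p1, p2))
```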
Perceptron’s Training
When a negative point is positively labeled, we subtract.
• Line equation: 2x1 + 3x2 − 7 = 0
• Wrong point: (3, 4)

    2    3   −7
−   3    4    1    (subtract)
   −1   −1   −8

[Plot: the line 2x1 + 3x2 − 7 = 0 in the x1–x2 plane, with the region 2x1 + 3x2 − 7 > 0 on one side, 2x1 + 3x2 − 7 < 0 on the other, and the misclassified point (3, 4)]

Such a rapid change may misclassify other points.
Perceptron’s Training
When a negative point is positively labeled, we subtract, scaled by the learning rate.
• Line equation: 2x1 + 3x2 − 7 = 0
• Wrong point: (3, 4)
• Learning rate: 0.1 (a value between 0 and 1)

    2          3         −7
−   3(0.1)     4(0.1)     1(0.1)    (subtract)
    1.7        2.6       −7.1

New line: 1.7x1 + 2.6x2 − 7.1 = 0
Perceptron’s Training
When a positive point is negatively labeled, we add, scaled by the learning rate.
• Line equation: 2x1 + 3x2 − 7 = 0
• Wrong point: (4, 1)
• Learning rate: 0.1 (a value between 0 and 1)

    2          3         −7
+   4(0.1)     1(0.1)     1(0.1)    (add)
    2.4        3.1       −6.9

New line: 2.4x1 + 3.1x2 − 6.9 = 0
Perceptron Algorithm
• Start with random weights
• Loop over all points:
  • If the point is classified correctly, move on
  • If the point is misclassified:
    • If the prediction is 0 (it should have been 1), change wi to wi + αxi
    • If the prediction is 1 (it should have been 0), change wi to wi − αxi
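A minimal sketch of this loop in Python; the training data, labels, and learning rate are illustrative placeholders, and the bias is updated the same way as the weights by treating it as a weight on a constant input of 1 (as in the earlier diagrams):

```python
import random

def step(score):
    return 1 if score > 0 else 0

def train_perceptron(X, y, learn_rate=0.1, epochs=25):
    """X: list of feature tuples, y: list of 0/1 labels."""
    n = len(X[0])
    w = [random.uniform(-1, 1) for _ in range(n)]  # random initial weights
    b = random.uniform(-1, 1)                      # random initial bias
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = step(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if pred == yi:
                continue                           # classified correctly: move on
            if pred == 0:                          # should have been 1: add
                w = [wj + learn_rate * xj for wj, xj in zip(w, xi)]
                b += learn_rate
            else:                                  # should have been 0: subtract
                w = [wj - learn_rate * xj for wj, xj in zip(w, xi)]
                b -= learn_rate
    return w, b

# Toy data: accept (1) when both marks are high
X = [(7, 8), (6.5, 9), (3, 4), (4, 2)]
y = [1, 1, 0, 0]
print(train_perceptron(X, y))
```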
Problem With “Linear Solutions”
We don’t want to hire an employee with such a bad academic record.
[Scatter plot: Academic Marks vs Marks in Test; the single straight line accepts a candidate with very low academic marks]
Possible Solution to the Problem
[Scatter plot: the same data with a modified, non-linear decision boundary]
Error Function
Error : Height
[Figure: descending a mountain; the error plays the role of height]

Error = 2
[Figure: a linear boundary with two misclassified points, so the count-based error is 2]

Discrete vs Continuous
Error : Height

Log Loss Error Function
Error = 0.1 + 0.1 + 0.1 + 5 + 0.1 + 0.1 + 0.1 + 5
[Figure: each correctly classified point contributes a small penalty, each misclassified point a large one]
Activation Function
Step Function
[Diagram: perceptron with inputs x1 … xn, weights w1 … wn, and bias b]
• Wx + b = 0
• Output 1 if Wx + b > 0
• Output 0 if Wx + b < 0
[Plot: the step function, jumping from 0 to 1]
Step Function
[Diagram: the same perceptron, but the outputs shown are continuous values such as 0.6, 0.3, and 0.1 rather than a hard 0 or 1]
Multi-Class Classification
Chances of Rain:
• 60% (0.6) → YES
• 40% (0.4) → NO
Multi-Class Classification
P(Car) = 0.67,  P(Bike) = 0.24,  P(Bi-Cycle) = 0.09
Score(Car) = 2,  Score(Bike) = 1,  Score(Bi-Cycle) = 0
Naive normalization: 2 / (2 + 1 + 0),  1 / (2 + 1 + 0),  0 / (2 + 1 + 0)
Problem: Negative Numbers
Softmax Function
P(Car) = 0.67,  P(Bike) = 0.24,  P(Bi-Cycle) = 0.09
Score(Car) = 2,  Score(Bike) = 1,  Score(Bi-Cycle) = 0
Instead of 2 / (2 + 1 + 0), 1 / (2 + 1 + 0), 0 / (2 + 1 + 0), exponentiate first:
P(Car) = e² / (e² + e¹ + e⁰),  P(Bike) = e¹ / (e² + e¹ + e⁰),  P(Bi-Cycle) = e⁰ / (e² + e¹ + e⁰)
Quiz
Write a function in Python that receives a list of numbers and returns a list containing the Softmax value of every number.
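One possible answer, sketched with NumPy (the function name and the use of NumPy are choices for illustration, not requirements of the quiz):

```python
import numpy as np

def softmax(scores):
    """Return the softmax of a list of numbers as a list of probabilities."""
    exps = np.exp(np.array(scores, dtype=float))
    return list(exps / exps.sum())

print(softmax([2, 1, 0]))  # roughly [0.665, 0.245, 0.090], matching the slide
```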
One Hot Encoding
Integer encoding: each vehicle gets a single value (e.g., 2).

Vehicle      Car   Bike   Bi-Cycle
Car           1     0        0
Bike          0     1        0
Bi-Cycle      0     0        1
Maximum Likelihood
[Figure: four points classified by Model A and Model B; each point is annotated with the probability each model assigns to its true color (green or red)]
Model A: 0.9 × 0.4 × 0.3 × 0.8 = 0.0864
Model B: 0.6 × 0.8 × 0.7 × 0.9 = 0.3024
Error vs Probability
[Figure: the relationship between error and probability]
Maximizing Probabilities
[Figure: the same two models and per-point probabilities as on the Maximum Likelihood slide]
Model A: 0.9 × 0.4 × 0.3 × 0.8 = 0.0864
Model B: 0.6 × 0.8 × 0.7 × 0.9 = 0.3024
Products are bad, sums are good.
Quiz
Which function can be used to replace products with sums?
A. Sin
B. Cos
C. Exp
D. Log

Log(ab) = Log(a) + Log(b)
Cross-Entropy
The goal is to minimize the Cross-Entropy.

Model A: 0.9 × 0.4 × 0.3 × 0.8 = 0.0864
Log(0.9) + Log(0.4) + Log(0.3) + Log(0.8) = −1.06348625752
−Log(0.9) − Log(0.4) − Log(0.3) − Log(0.8) = 1.06348625752

Model B: 0.6 × 0.8 × 0.7 × 0.9 = 0.3024
Log(0.6) + Log(0.8) + Log(0.7) + Log(0.9) = −0.51941821317
−Log(0.6) − Log(0.8) − Log(0.7) − Log(0.9) = 0.51941821317
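A quick sketch that reproduces these numbers; note the slide's values come out with base-10 logarithms, so math.log10 is used here:

```python
import math

def cross_entropy(probs):
    """Negative sum of base-10 logs of the probabilities of the correct labels."""
    return -sum(math.log10(p) for p in probs)

print(cross_entropy([0.9, 0.4, 0.3, 0.8]))  # ~1.0635 (Model A)
print(cross_entropy([0.6, 0.8, 0.7, 0.9]))  # ~0.5194 (Model B)
```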
Events vs Probability
[Figure: events and their probabilities, linked by the Cross-Entropy]
Cross-Entropy Formulation

                         Selected     NOT Selected
Probability              P1 = 0.8     P2 = 0.1
Probability of outcome   p1 = 0.8     1 − p2 = 0.9
Label                    Y1 = 1       Y2 = 0

Cross-Entropy = −ln(0.8) − ln(0.9)
• Y: 1 if selected
• Y: 0 if NOT selected
Cross-Entropy for Multi-class

Vehicle      G1     G2     G3
Car          0.8    0.3    0.2
Bike         0.1    0.1    0.6
Bi-Cycle     0.1    0.6    0.2

(entries pij: the probability of vehicle i in group j; each column sums to 1)
Quiz
What is the relation between cross-entropy and probability?
A. Directly proportional
B. Inversely proportional
Minimizing the Error Function
Gradient Descent
Error : Height
[Figure: descending the error surface like walking down a mountain]
Convex Functions
• The curve is shaped like a bowl
• Derivatives are possible
Derivatives
Derivatives are also called slopes.
f(a) = 2a
• When a = 1, f(a) = 2
• When a = 5, f(a) = 10
• When a = 5.001, f(a) = 10.002
Slope = height / width = 0.002 / 0.001 = 2
[Plot: the line f(a) = 2a]
How Gradient Descent Works
Gradient Step
• wi′ = wi − learningRate · (∂E/∂wi)
• wi′ = wi − learningRate · (−(y − ŷ)xi)
• wi′ = wi + learningRate · (y − ŷ)xi
• b′ = b + learningRate · (y − ŷ)
Logistic Regression Algorithm
• Start with random weights: w1, w2, …, wn, b
• For every point (x1, x2, …, xn):
  • Update W′
  • Update b′
• Repeat until the error is small
Perceptron Algorithm?
To Do
• Sigmoid activation function: σ(x) = 1 / (1 + e^(−x))
• Output (prediction) formula: ŷ = σ(w1x1 + w2x2 + b)
• Error function: Error(y, ŷ) = −y·log(ŷ) − (1 − y)·log(1 − ŷ)
• The function that updates the weights:
  • wi ⟶ wi + α(y − ŷ)xi
  • b ⟶ b + α(y − ŷ)
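A minimal sketch of these four pieces in Python (NumPy assumed; the function names are illustrative choices):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def output_formula(features, weights, bias):
    # y-hat = sigma(w1*x1 + w2*x2 + ... + b)
    return sigmoid(np.dot(features, weights) + bias)

def error_formula(y, output):
    # Error(y, y-hat) = -y*log(y-hat) - (1 - y)*log(1 - y-hat)
    return -y * np.log(output) - (1 - y) * np.log(1 - output)

def update_weights(x, y, weights, bias, learnrate):
    # wi -> wi + alpha*(y - y-hat)*xi ;  b -> b + alpha*(y - y-hat)
    output = output_formula(x, weights, bias)
    weights = weights + learnrate * (y - output) * x
    bias = bias + learnrate * (y - output)
    return weights, bias
```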
Perceptron VS Gradient Descent

Gradient Descent:
• Start with random weights: w1, w2, …, wn, b
• For every point (x1, x2, …, xn):
  • Update W′
  • Update b′
• Repeat until the error is small

Perceptron Algorithm:
• Start with random weights
• Loop over all points:
  • If the point is classified correctly, move on
  • If the point is misclassified:
    • If the prediction is 0, change wi to wi + αxi
    • If the prediction is 1, change wi to wi − αxi
Problem With “Linear Solutions”
We don’t want to hire an employee with such a bad academic record.
[Scatter plot: Academic Marks vs Marks in Test; the single straight line accepts a candidate with very low academic marks]
Possible Solution to the Problem
[Scatter plot: the same data with a modified, non-linear decision boundary]
Non-Linear Boundaries
[Figure: two linear models combined into one non-linear model]
• First model output: 0.7; second model output: 0.8
• 0.7 + 0.8 = 1.5
• Sigmoid(1.5) = 0.82

Weighted Sums
• First model output 0.7 with weight 6: 6 × 0.7 = 4.2
• Second model output 0.8 with weight 4: 4 × 0.8 = 3.2
• Adding a bias of −3: 4.2 + 3.2 − 3 = 4.4
• Sigmoid(4.4) = 0.98
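A short sketch of this combination step; the two model outputs (0.7 and 0.8), the weights 6 and 4, and the bias −3 are taken straight from the slide:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

out1, out2 = 0.7, 0.8   # outputs of the two linear models

# Plain sum of the two outputs
print(sigmoid(out1 + out2))                  # sigmoid(1.5) ~ 0.82

# Weighted sum with weights 6 and 4 and bias -3
print(sigmoid(6 * out1 + 4 * out2 - 3))      # sigmoid(4.4) ~ 0.988
```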
Neural Networks
[Diagram: the two linear models drawn as a small network]
• First linear model: 5x1 − 2x2 + 8
• Second linear model: 7x1 − 3x2 − 1
• Inputs x1 and x2 feed both models; the two model outputs are then combined with weights 7 and 5 and bias −6
Adding Bias
[Diagram: the same network drawn with explicit bias nodes, a constant input of 1 feeding each layer]
Architecture
Input Layer → Hidden Layer → Output Layer
[Diagram: x1, x2 and a bias node form the input layer; the two linear models form the hidden layer; the node that combines them is the output layer]
DEEP Neural Network
[Diagram: the same idea with several hidden layers stacked; each layer has its own bias node]
Multi-Class Classification
[Diagram: the same deep network, now with several output nodes, one per class]
Feed Forward
[Diagram: inputs x1 and x2 connected to the hidden layer through the weights w11, w12, w21, w22, w31, w32]
Feed Forward
ŷ = σ( W(2) · σ( W(1) · x ) )
[Diagram: x1, x2 and a bias node feed the hidden layer through the weights W(1)11, W(1)12, W(1)21, W(1)22, W(1)31, W(1)32; the hidden layer and a bias node feed the output through the weights W(2)]
For a deeper network:
ŷ = σ ∘ W(4) ∘ σ ∘ W(3) ∘ σ ∘ W(2) ∘ σ ∘ W(1) (x)
DNN Feed Forward
[Diagram: a deep network with weight matrices W1, W2, W3, W4 between successive layers; each layer also has a bias node]
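A minimal sketch of the feed-forward computation ŷ = σ(W(2)·σ(W(1)·x)) in NumPy; the weight values reuse the earlier example (5x1 − 2x2 + 8 and 7x1 − 3x2 − 1 combined with 7, 5, −6), and the input point is illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer 1: two hidden nodes computing 5*x1 - 2*x2 + 8 and 7*x1 - 3*x2 - 1
W1 = np.array([[5.0, -2.0],
               [7.0, -3.0]])
b1 = np.array([8.0, -1.0])

# Layer 2: combine the two hidden outputs with weights 7 and 5 and bias -6
W2 = np.array([7.0, 5.0])
b2 = -6.0

def feed_forward(x):
    hidden = sigmoid(W1 @ x + b1)        # sigma(W1 x + b1)
    return sigmoid(W2 @ hidden + b2)     # sigma(W2 sigma(W1 x + b1) + b2)

print(feed_forward(np.array([1.0, 2.0])))  # y-hat for the point (1, 2)
```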
Deep Learning Algorithm
• Do a feedforward operation.
• Compare the output of the model with the desired output.
• Calculate the error.
• Run the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
• Use this to update the weights and get a better model.
• Continue until the model is good.
Back Propagation
[Diagram: the same two-layer network as in the feed-forward slides; the error now flows backwards through the weights]
Back Propagation: Prediction
[Diagram: a single unit with inputs x1 … xn, weights w1 … wn, and a bias]
ŷ = σ ∘ W(4) ∘ σ ∘ W(3) ∘ σ ∘ W(2) ∘ σ ∘ W(1) (x)
Error Function: E(W), the cross-entropy error defined earlier
Gradient of the Error Function: the vector of partial derivatives of E(W) with respect to every weight
Back Propagation in Deep Net
[Diagram: the deep network with weight matrices W1 … W4; the error function E(W) and its gradient are computed over all of these weights]
Chain Rule
[Diagram: the deep network; composing layers means composing functions]
x → A → B, where A = f(x) and B = g(f(x)) = (g ∘ f)(x)
By the chain rule, dB/dx = g′(f(x)) · f′(x)
Back Propagation
[Diagram: x1, x2 and a bias feed hidden nodes h1 and h2 through the weights W(1)11 … W(1)32; the hidden nodes and a bias feed the output ŷ through the weights W(2); the error function E(W) is computed at the output]
Back Propagation
h = W(2)11 σ(h1) + W(2)21 σ(h2) + W(2)31
[Diagram: hidden outputs h1, h2 and a bias node feed the output node; a second diagram shows a single node with inputs x0 … x5 and weights w0 … w5]
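A compact sketch of backpropagation for a network with one hidden layer and a single sigmoid output, using the gradient-step rule from earlier; the layer sizes, training point, and learning rate are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2)) * 0.5   # input (2 features) -> hidden (3 nodes)
b1 = np.zeros(3)
W2 = rng.normal(size=3) * 0.5        # hidden -> single output
b2 = 0.0
learnrate = 0.1

x = np.array([0.5, -1.2])            # one training point
y = 1.0                              # its label

# Feed forward
h = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ h + b2)

# Backpropagate the error (cross-entropy + sigmoid gives the (y - y_hat) term)
out_error = y - y_hat
hidden_error = out_error * W2 * h * (1 - h)   # chain rule through sigma

# Gradient step for every weight and bias
W2 += learnrate * out_error * h
b2 += learnrate * out_error
W1 += learnrate * np.outer(hidden_error, x)
b1 += learnrate * hidden_error
```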
Optimizations
Underfitting & Overfitting
[Two scatter plots of Academic Marks vs Marks in Test: one with an underfit (too simple) boundary, one with an overfit (too complex) boundary]
Early Stopping
[Four plots of the decision boundary after Epoch 1, Epoch 10, Epoch 100, and Epoch 1000]
[Plot: error versus epochs, moving from underfitting through an elbow into overfitting; training is stopped at the elbow]
Quiz
Two points: (1, 1) and (−1, −1). Which of these two models is better?
• x1 + x2
• 10x1 + 10x2
Regularization
ŷ = σ(w1x1 + w2x2)
[Plot: the line x1 + x2 = 0 separating the points (1, 1) and (−1, −1)]
• Model x1 + x2:
  σ(1 + 1) = 0.88
  σ(−1 − 1) = 0.12
• Model 10x1 + 10x2:
  σ(10 + 10) = 0.9999999979
  σ(−10 − 10) = 0.0000000021
Regularization
Problem: large coefficients lead to overfitting.
[Figure: the model x1 + x2 compared with the model 10x1 + 10x2]
Regularization: Solution
Large coefficients lead to overfitting, so penalize large weights (w1, w2, …, wn):
• L1: Error Function = Error + λ(|w1| + … + |wn|)
• L2: Error Function = Error + λ(w1² + … + wn²)
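A small sketch of the two penalties added to an existing error value; lam (the regularization strength λ), the base error, and the weights are illustrative:

```python
import numpy as np

def l1_penalty(weights, lam):
    # lambda * (|w1| + ... + |wn|)
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    # lambda * (w1^2 + ... + wn^2)
    return lam * np.sum(weights ** 2)

weights = np.array([0.3, 0.9, 0.5, -0.1, 0.2, 0.2])
base_error = 0.42   # whatever the unregularized error happens to be
lam = 0.01

print(base_error + l1_penalty(weights, lam))
print(base_error + l2_penalty(weights, lam))
```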
Usages

L1 Regularization                                L2 Regularization
Pushes weights toward sparse vectors,            Pushes weights toward small, non-zero values,
e.g. (1, 0, 1, 1, 0, 0)                          e.g. (0.3, 0.9, 0.5, −0.1, 0.2, 0.2)
Good for feature selection                       Good for training models
Dropout
Dropout = 0.2
[Diagram: the deep network with some nodes randomly switched off during each pass]
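A tiny sketch of what a dropout rate of 0.2 means during training: each node's output is kept with probability 0.8 and zeroed with probability 0.2 (the layer activations below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
activations = np.array([0.7, 0.2, 0.9, 0.5, 0.4])
drop_rate = 0.2

# Keep each activation with probability 1 - drop_rate, zero it otherwise
mask = rng.random(activations.shape) > drop_rate
print(activations * mask)
```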
Local Minima Problem
[Figure: an error surface with several local minima and one global minimum; gradient descent can get stuck in a local minimum]

Local Minima Solution: Random Restart
[Figure: starting gradient descent from several random points increases the chance of reaching the global minimum]
Vanishing Gradient Problem
The product of small numbers is a very small number.
[Diagram: the two-layer network; by the chain rule the gradient at the early weights is a product of many small terms, so it can vanish]
Vanishing Gradient: Solution
• Activation Function: Tanh
• Activation Function: ReLU

Summary: Activation Functions
[Figure: the activation functions covered in this section, side by side]
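A short sketch of the activation functions mentioned in this section, for comparison (NumPy assumed):

```python
import numpy as np

def step(x):
    return np.where(x > 0, 1.0, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    # Keeps positive inputs unchanged and zeroes out negative ones,
    # so the gradient does not shrink toward zero for positive inputs
    return np.maximum(0.0, x)

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (step, sigmoid, tanh, relu):
    print(f.__name__, f(xs))
```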
Final Project
Predicting Species of Iris
Architecture
[Diagram: four inputs x1 … x4 plus bias nodes, hidden layers with weights W1, W2, W3, and three outputs P(A), P(B), P(C), one per species]
Intro to Instructor
• I am Sajjad M.
• I have over 5 years of teaching experience at university.
• I have also worked on many industrial projects.
• My areas of interest are Data Science and Deep Learning.
• I have been working with Python for over 7 years now.
• Here is my email address: [email protected]
Intro to Course
• Introducing the Problem
• Initial Solution
• N-Dimensional Space
• Perceptron vs Human Brain
• Perceptron Training
• Linear Solutions
• Non-Linear Solutions
• Error Functions
• Sigmoid
• Logistic Regression
• Multiclass Classification
• Softmax
• One Hot Encoding
• Cross Entropy
• Gradient Descent
• Deep Neural Networks
• Feed Forward
• Back Propagation
• Optimizations
• Underfitting/Overfitting
• Early Stopping
• Regularization
• Dropout
• Vanishing Gradient
• Final Project
Website : www.aisciences.io