0% found this document useful (0 votes)

118 views14 pages

DeepLearning Introduction

Deep learning is rising due to increased data, computation power, and algorithms. Neural networks can perform tasks like image classification. Logistic regression is used for binary classification problems by outputting a probability using the sigmoid function. The cost function measures error and is minimized with gradient descent, which computes partial derivatives to adjust weights in the correct direction. This allows neural networks to learn complex patterns from large amounts of data.

Uploaded by

Heny Dave

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

118 views14 pages

DeepLearning Introduction

Uploaded by

Heny Dave

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Introduction to deep learning

Be able to explain the major trends driving the rise of deep learning, and understand where and how it is applied today.

What is a (Neural Network) NN?

Single neuron == linear regression
Simple NN graph:

1/16
Image taken from [Link]
RELU stands for rectified linear unit is the most popular activation function right now that makes deep NNs train faster
now.
Hidden layers predicts connection between inputs automatically, thats what deep learning is good at.
Deep NN consists of more hidden layers (Deeper layers)

Image taken from [Link]

Each Input will be connected to the hidden layer and the NN will decide the connections.
Supervised learning means we have the (X,Y) and we need to get the function that maps X to Y.

Supervised learning with neural networks

Different types of neural networks for supervised learning which includes:
CNN or convolutional neural networks (Useful in computer vision)
RNN or Recurrent neural networks (Useful in Speech recognition or NLP)
Standard NN (Useful for Structured data)
Hybrid/custom NN or a Collection of NNs types
Structured data is like the databases and tables.

Unstructured data is like images, video, audio, and text.

Structured data gives more money because companies relies on prediction on its big data.

Why is deep learning taking off?

[Link] 2/16
%20Neural%20Networks%20and%20Deep%20Learning
Deep learning is taking off for 3 reasons:

i. Data:
Using this image we can conclude:

For small data NN can perform as Linear regression or SVM (Support vector machine)
For big data a small NN is better that SVM

For big data a big NN is better that a medium NN is better that small NN.
Hopefully we have a lot of data because the world is using the computer a little bit more
Mobiles
IOT (Internet of things)
ii. Computation:
GPUs.
Powerful CPUs.
Distributed computing.

ASICs
iii. Algorithm:
a. Creative algorithms has appeared that changed the way NN works.
For example using RELU function is so much better than using SIGMOID function in training a NN because it
helps with the vanishing gradient problem.

Neural Networks Basics

Learn to set up a machine learning problem with a neural network mindset. Learn to use vectorization to speed up your
models.

Binary classification
Mainly he is talking about how to do a logistic regression to make a binary classifier.

Image taken from [Link]

[Link] 3/16
%20Neural%20Networks%20and%20Deep%20Learning
He talked about an example of knowing if the current image contains a cat or not.
Here are some notations:
M is the number of training vectors

Nx is the size of the input vector

Ny is the size of the output vector

X(1) is the first input vector

Y(1) is the first output vector

X = [x(1) x(2).. x(M)]

Y = (y(1) y(2).. y(M))

We will use python in this course.

In NumPy we can make matrices and make operations on them in a fast and reliable time.

Logistic regression
Algorithm is used for classification algorithm of 2 classes.
Equations:
Simple equation: y = wx + b
If x is a vector: y = w(transpose)x + b
If we need y to be in between 0 and 1 (probability): y = sigmoid(w(transpose)x + b)
In some notations this might be used: y = sigmoid(w(transpose)x)
While b is w0 of w and we add x0 = 1 . but we won't use this notation in the course (Andrew said that the first
notation is better).
In binary classification Y has to be between 0 and 1 .
In the last equation w is a vector of Nx and b is a real number

Logistic regression cost function

First loss function would be the square root error: L(y',y) = 1/2 (y' - y)^2
But we won't use this notation because it leads us to optimization problem which is non convex, means it contains
local optimum points.
This is the function that we will use: L(y',y) = - (y*log(y') + (1-y)*log(1-y'))
To explain the last function lets see:
if y = 1 ==> L(y',1) = -log(y') ==> we want y' to be the largest ==> y ' biggest value is 1
if y = 0 ==> L(y',0) = -log(1-y') ==> we want 1-y' to be the largest ==> y' to be smaller as possible
because it can only has 1 value.
Then the Cost function will be: J(w,b) = (1/m) * Sum(L(y'[i],y[i]))
The loss function computes the error for a single training example; the cost function is the average of the loss functions of
the entire training set.

Gradient Descent
We want to predict w and b that minimize the cost function.

Our cost function is convex.

First we initialize w and b to 0,0 or initialize them to a random value in the convex function and then try to improve the
values the reach minimum value.

In Logistic regression people always use 0,0 instead of random.

The gradient decent algorithm repeats: w = w - alpha * dw where alpha is the learning rate and dw is the derivative of
w (Change to w ) The derivative is also the slope of w

Looks like greedy algorithms. the derivative give us the direction to improve our parameters.

[Link] 4/16
%20Neural%20Networks%20and%20Deep%20Learning
The actual equations we will implement:

w = w - alpha * d(J(w,b) / dw) (how much the function slopes in the w direction)

b = b - alpha * d(J(w,b) / db) (how much the function slopes in the d direction)

Derivatives
We will talk about some of required calculus.
You don't need to be a calculus geek to master deep learning but you'll need some skills from it.
Derivative of a linear line is its slope.
ex. f(a) = 3a d(f(a))/d(a) = 3
if a = 2 then f(a) = 6
if we move a a little bit a = 2.001 then f(a) = 6.003 means that we multiplied the derivative (Slope) to the moved
area and added it to the last result.

More Derivatives examples

f(a) = a^2 ==> d(f(a))/d(a) = 2a
a = 2 ==> f(a) = 4

a = 2.0001 ==> f(a) = 4.0004 approx.

f(a) = a^3 ==> d(f(a))/d(a) = 3a^2

f(a) = log(a) ==> d(f(a))/d(a) = 1/a

To conclude, Derivative is the slope and slope is different in different points in the function thats why the derivative is a
function.

Computation graph
Its a graph that organizes the computation from left to right.

Derivatives with a Computation Graph

Calculus chain rule says: If x -> y -> z (x effect y and y effects z) Then d(z)/d(x) = d(z)/d(y) * d(y)/d(x)
The video illustrates a big example.
We compute the derivatives on a graph from right to left and it will be a lot more easier.
dvar means the derivatives of a final output variable with respect to various intermediate quantities.

Logistic Regression Gradient Descent

In the video he discussed the derivatives of gradient decent example for one sample with two features x1 and x2 .

Gradient Descent on m Examples

Lets say we have these variables:

X1 Feature
X2 Feature
W1 Weight of the first feature.
W2 Weight of the second feature.
B Logistic Regression parameter.
M Number of training examples
Y(i) Expected output of i

So we have:

6/16
Then from right to left we will calculate derivations compared to the result:

d(a) = d(l)/d(a) = -(y/a) + ((1-y)/(1-a))

d(z) = d(l)/d(z) = a - y
d(W1) = X1 * d(z)
d(W2) = X2 * d(z)
d(B) = d(z)

From the above we can conclude the logistic regression pseudo code:

J = 0; dw1 = 0; dw2 =0; db = 0; # Devs.

w1 = 0; w2 = 0; b=0; # Weights
for i = 1 to m
# Forward pass
z(i) = W1*x1(i) + W2*x2(i) + b
a(i) = Sigmoid(z(i))
J += (Y(i)*log(a(i)) + (1-Y(i))*log(1-a(i)))

# Backward pass
dz(i) = a(i) - Y(i)
dw1 += dz(i) * x1(i)
dw2 += dz(i) * x2(i)
db += dz(i)
J /= m
dw1/= m
dw2/= m
db/= m

# Gradient descent
w1 = w1 - alpa * dw1
w2 = w2 - alpa * dw2
b = b - alpa * db

The above code should run for some iterations to minimize error.

So there will be two inner loops to implement the logistic regression.

Vectorization is so important on deep learning to reduce loops. In the last code we can make the whole loop in one step
using vectorization!

Vectorization
Deep learning shines when the dataset are big. However for loops will make you wait a lot for a result. Thats why we need
vectorization to get rid of some of our for loops.
NumPy library (dot) function is using vectorization by default.
The vectorization can be done on CPU or GPU thought the SIMD operation. But its faster on GPU.
Whenever possible avoid for loops.
Most of the NumPy library methods are vectorized version.

Vectorizing Logistic Regression

We will implement Logistic Regression using one for loop then without any for loop.

As an input we have a matrix X and its [Nx, m] and a matrix Y and its [Ny, m] .

We will then compute at instance [z1,z2...zm] = W' * X + [b,b,...b] . This can be written in python as:

Z = [Link](W.T,X) + b # Vectorization, then broadcasting, Z shape is (1, m)

A = 1 / 1 + [Link](-Z) # Vectorization, A shape is (1, m)

Vectorizing Logistic Regression's Gradient Output:

7/16
dz = A - Y # Vectorization, dz shape is (1, m)
dw = [Link](X, dz.T) / m # Vectorization, dw shape is (Nx, 1)
db = [Link]() / m # Vectorization, dz shape is (1, 1)

Notes on Python and NumPy

In NumPy, [Link](axis = 0) sums the columns while [Link](axis = 1) sums the rows. In

NumPy, [Link](1,4) changes the shape of the matrix by broadcasting the values.

Reshape is cheap in calculations so put it everywhere you're not sure about the calculations.

Broadcasting works when you do a matrix operation with matrices that doesn't match for the operation, in this case
NumPy automatically makes the shapes ready for the operation by broadcasting the values.

In general principle of broadcasting. If you have an (m,n) matrix and you add(+) or subtract(-) or multiply(*) or divide(/)
with a (1,n) matrix, then this will copy it m times into an (m,n) matrix. The same with if you use those operations with a (m ,
1) matrix, then this will copy it n times into (m, n) matrix. And then apply the addition, subtraction, and multiplication of
division element wise.

Some tricks to eliminate all the strange bugs in the code:

If you didn't specify the shape of a vector, it will take a shape of (m,) and the transpose operation won't work. You
have to reshape it to (m, 1)
Try to not use the rank one matrix in ANN
Don't hesitate to use assert([Link] == (5,1)) to check if your matrix shape is the required one.
If you've found a rank one matrix try to run reshape on it.

Jupyter / IPython notebooks are so useful library in python that makes it easy to integrate code and document at the same
time. It runs in the browser and doesn't need an IDE to run.

To open Jupyter Notebook, open the command line and call: jupyter-notebook It should be installed to work.

To Compute the derivative of Sigmoid:

s = sigmoid(x)
ds = s * (1 - s) # derivative using calculus

To make an image of (width,height,depth) be a vector, use this:

v = [Link]([Link][0][Link][1][Link][2],1) #reshapes the image.

Gradient descent converges faster after normalization of the input matrices.

General Notes
The main steps for building a Neural Network are:
Define the model structure (such as number of input features and outputs)
Initialize the model's parameters.
Loop.
Calculate current loss (forward propagation)
Calculate current gradient (backward propagation)
Update parameters (gradient descent)

Preprocessing the dataset is important.

Tuning the learning rate (which is an example of a "hyperparameter") can make a big difference to the algorithm.
Shallow neural networks

Learn to build a neural network with one hidden layer, using forward propagation and backpropagation.

Neural Networks Overview

In logistic regression we had:

X1 \
X2 ==> z = XW + B ==> a = Sigmoid(z) ==> l(a,Y)
X3 /

In neural networks with one layer we will have:

X1 \
X2 => z1 = XW1 + B1 => a1 = Sigmoid(z1) => z2 = a1W2 + B2 => a2 = Sigmoid(z2) => l(a2,Y)
X3 /

X is the input vector (X1, X2, X3) , and Y is the output variable (1x1)

NN is stack of logistic regression objects.

Neural Network Representation

We will define the neural networks that has one hidden layer.
NN contains of input layers, hidden layers, output layers.

Hidden layer means we cant see that layers in the training set.
a0 = x (the input layer)

a1 will represent the activation of the hidden neurons.

a2 will represent the output layer.

We are talking about 2 layers NN. The input layer isn't counted.

Computing a Neural Network's Output

Equations of Hidden layers:

Here are some informations about the last image:

noOfHiddenNeurons = 4

Nx = 3

Shapes of the variables:

W1 is the matrix of the first hidden layer, it has a shape of (noOfHiddenNeurons,nx)

b1 is the matrix of the first hidden layer, it has a shape of (noOfHiddenNeurons,1)

z1 is the result of the equation z1 = W1*X + b , it has a shape of (noOfHiddenNeurons,1) a1

is the result of the equation a1 = sigmoid(z1) , it has a shape of (noOfHiddenNeurons,1) W2

is the matrix of the second hidden layer, it has a shape of (1,noOfHiddenNeurons)
b2 is the matrix of the second hidden layer, it has a shape of (1,1)
z2 is the result of the equation z2 = W2*a1 + b , it has a shape of (1,1)

a2 is the result of the equation a2 = sigmoid(z2) , it has a shape of (1,1)

Vectorizing across multiple examples

Pseudo code for forward propagation for the 2 layers NN:

9/16
for i = 1 to m
z[1, i] = W1*x[i] + b1 # shape of z[1, i] is (noOfHiddenNeurons,1)
a[1, i] = sigmoid(z[1, i]) # shape of a[1, i] is (noOfHiddenNeurons,1)
z[2, i] = W2*a[1, i] + b2 # shape of z[2, i] is (1,1)
a[2, i] = sigmoid(z[2, i]) # shape of a[2, i] is (1,1)

Lets say we have X on shape (Nx,m) . So the new pseudo code:

Z1 = W1X + b1 # shape of Z1 (noOfHiddenNeurons,m)

A1 = sigmoid(Z1) # shape of A1 (noOfHiddenNeurons,m)
Z2 = W2A1 + b2 # shape of Z2 is (1,m)
A2 = sigmoid(Z2) # shape of A2 is (1,m)

If you notice always m is the number of columns.

In the last example we can call X = A0 . So the previous step can be rewritten as:

Z1 = W1A0 + b1 # shape of Z1 (noOfHiddenNeurons,m)

A1 = sigmoid(Z1) # shape of A1 (noOfHiddenNeurons,m)
Z2 = W2A1 + b2 # shape of Z2 is (1,m)
A2 = sigmoid(Z2) # shape of A2 is (1,m)

Activation functions
So far we are using sigmoid, but in some cases other functions can be a lot better.
Sigmoid can lead us to gradient decent problem where the updates are so low.

Sigmoid activation function range is [0,1] A = 1 / (1 + [Link](-z)) # Where z is the input matrix
Tanh activation function range is [-1,1] (Shifted version of sigmoid function)

In NumPy we can implement Tanh using one of these methods: A = ([Link](z) - [Link](-z)) / ([Link](z) +
[Link](-z)) # Where z is the input matrix

Or A = [Link](z) # Where z is the input matrix

It turns out that the tanh activation usually works better than sigmoid activation function for hidden units because the
mean of its output is closer to zero, and so it centers the data better for the next layer.
Sigmoid or Tanh function disadvantage is that if the input is too small or too high, the slope will be near zero which will
cause us the gradient decent problem.
One of the popular activation functions that solved the slow gradient decent is the RELU function. RELU = max(0,z) # so
if z is negative the slope is 0 and if z is positive the slope remains linear.
So here is some basic rule for choosing activation functions, if your classification is between 0 and 1, use the output
activation as sigmoid and the others as RELU.
Leaky RELU activation function different of RELU is that if the input is negative the slope will be so small. It works as RELU
but most people uses RELU. Leaky_RELU = max(0.01z,z) #the 0.01 can be a parameter for your algorithm.
In NN you will decide a lot of choices like:
No of hidden layers.
No of neurons in each hidden layer.
Learning rate. (The most important parameter)
Activation functions.

And others..
It turns out there are no guide lines for that. You should try all activation functions for example.

Why do you need non-linear activation functions?

If we removed the activation function from our algorithm that can be called linear activation function.
Linear activation function will output linear activations
Whatever hidden layers you add, the activation will be always linear like logistic regression (So its useless in a lot of
complex problems)
You might use linear activation function in one place - in the output layer if the output is real numbers (regression
problem). But even in this case if the output value is non-negative you could use RELU instead.

Derivatives of activation functions

Derivation of Sigmoid activation function:

g(z) = 1 / (1 + [Link](-z))
g'(z) = (1 / (1 + [Link](-z))) * (1 - (1 / (1 + [Link](-z))))
g'(z) = g(z) * (1 - g(z))

Derivation of Tanh activation function:

g(z) = (e^z - e^-z) / (e^z + e^-z)

g'(z) = 1 - [Link](z)^2 = 1 - g(z)^2

Derivation of RELU activation function:

g(z) = [Link](0,z)
g'(z) = { 0 if z < 0
1 if z >= 0 }

Derivation of leaky RELU activation function:

g(z) = [Link](0.01 * z, z)
g'(z) = { 0.01 if z < 0
1 if z >= 0 }

Gradient descent for Neural Networks

In this section we will have the full back propagation of the neural network (Just the equations with no explanations).

Gradient descent algorithm:

NN parameters:

n[0] = Nx
n[1] = NoOfHiddenNeurons
n[2] = NoOfOutputNeurons = 1

W1 shape is (n[1],n[0])

b1 shape is (n[1],1)
W2 shape is (n[2],n[1])

b2 shape is (n[2],1)

Cost function I = I(W1, b1, W2, b2) = (1/m) * Sum(L(Y,A2))

Then Gradient descent:

Repeat:
Compute predictions (y'[i], i = 0,...m)
Get derivatives: dW1, db1, dW2, db2
Update: W1 = W1 - LearningRate * dW1
b1 = b1 - LearningRate * db1

12/16
W2 = W2 - LearningRate * dW2
b2 = b2 - LearningRate * db2

Forward propagation:

Z1 = W1A0 + b1 # A0 is X
A1 = g1(Z1)
Z2 = W2A1 + b2
A2 = Sigmoid(Z2) # Sigmoid because the output is between 0 and 1

Backpropagation (derivations):

dZ2 = A2 - Y # derivative of cost function we used * derivative of the sigmoid function

dW2 = (dZ2 * A1.T) / m
db2 = Sum(dZ2) / m
dZ1 = (W2.T * dZ2) * g'1(Z1) # element wise product (*)
dW1 = (dZ1 * A0.T) / m # A0 = X
db1 = Sum(dZ1) / m
# Hint there are transposes with multiplication because to keep dimensions correct

How we derived the 6 equations of the backpropagation:

Random Initialization
In logistic regression it wasn't important to initialize the weights randomly, while in NN we have to initialize them
randomly.

If we initialize all the weights with zeros in NN it won't work (initializing bias with zero is OK):

all hidden units will be completely identical (symmetric) - compute exactly the same function
on each gradient descent iteration all the hidden units will always update the same

To solve this we initialize the W's with a small random numbers:

W1 = [Link]((2,2)) * 0.01 # 0.01 to make it small enough

b1 = [Link]((2,1)) # its ok to have b as zero, it won't get us to the symmetry breaking
problem

We need small values because in sigmoid (or tanh), for example, if the weight is too large you are more likely to end up
even at the very start of training with very large values of Z. Which causes your tanh or your sigmoid activation function to
be saturated, thus slowing down learning. If you don't have any sigmoid or tanh activation functions throughout your
neural network, this is less of an issue.

Constant 0.01 is alright for 1 hidden layer networks, but if the NN is deep this number can be changed but it will always be
a small number.

Deep Neural Networks

Understand the key computations underlying deep learning, use them to build and train deep neural networks, and apply
it to computer vision.

Deep L-layer neural network

Shallow NN is a NN with one or two layers.
Deep NN is a NN with three or more layers.

We will use the notation L to denote the number of layers in a NN.

12/16
n[l] is the number of neurons in a specific layer l .

n[0] denotes the number of neurons input layer. n[L] denotes the number of neurons in output layer.

g[l] is the activation function.

a[l] = g[l](z[l])
w[l] weights is used for z[l]

x = a[0] , a[l] = y'

These were the notation we will use for deep neural network.
So we have:
A vector n of shape (1, NoOfLayers+1)
A vector g of shape (1, NoOfLayers)
A list of different shapes w based on the number of neurons on the previous and the current layer.
A list of different shapes b based on the number of neurons on the current layer.

Forward Propagation in a Deep Network

Forward propagation general rule for one input:

z[l] = W[l]a[l-1] + b[l]

a[l] = g[l](a[l])

Forward propagation general rule for m inputs:

Z[l] = W[l]A[l-1] + B[l]

A[l] = g[l](A[l])

We can't compute the whole layers forward propagation without a for loop so its OK to have a for loop here.

The dimensions of the matrices are so important you need to figure it out.

Getting your matrix dimensions right

The best way to debug your matrices dimensions is by a pencil and paper.
Dimension of W is (n[l],n[l-1]) . Can be thought by right to left.

Dimension of b is (n[l],1)
dw has the same shape as W , while db is the same shape as b

Dimension of Z[l], A[l] , dZ[l] , and dA[l] is (n[l],m)

Why deep representations?

Why deep NN works well, we will discuss this question in this section.
Deep NN makes relations with data from simpler to complex. In each layer it tries to make a relation with the previous
layer. E.g.:
a. Face recognition application:
Image ==> Edges ==> Face parts ==> Faces ==> desired face
b. Audio recognition application:
Audio ==> Low level sound features like (sss,bb) ==> Phonemes ==> Words ==> Sentences
Neural Researchers think that deep neural networks "think" like brains (simple ==> complex)
Circuit theory and deep learning:

When starting on an application don't start directly by dozens of hidden layers. Try the simplest solutions (e.g. Logistic
Regression), then try the shallow neural network and so on.

Building blocks of deep neural networks

13/16
Forward and back propagation for a layer l:

Deep NN blocks:

Forward and Backward Propagation

Pseudo code for forward propagation for layer l:

Input A[l-1]
Z[l] = W[l]A[l-1] + b[l]
A[l] = g[l](Z[l])
Output A[l], cache(Z[l])

Pseudo code for back propagation for layer l:

Input da[l], Caches

dZ[l] = dA[l] * g'[l](Z[l])
dW[l] = (dZ[l]A[l-1].T) / m
db[l] = sum(dZ[l])/m # Dont forget axis=1, keepdims=True
dA[l-1] = w[l].T * dZ[l] # The multiplication here are a dot product.
Output dA[l-1], dW[l], db[l]

If we have used our loss function then:

dA[L] = (-(y/a) + ((1-y)/(1-a)))

Parameters vs Hyperparameters
Main parameters of the NN is W and b
Hyper parameters (parameters that control the algorithm) are like:
Learning rate.
Number of iteration.
Number of hidden layers L .
Number of hidden units n .

Choice of activation functions.

You have to try values yourself of hyper parameters.
In the earlier days of DL and ML learning rate was often called a parameter, but it really is (and now everybody call it) a
hyperparameter.
On the next course we will see how to optimize hyperparameters.

What does this have to do with the brain

The analogy that "It is like the brain" has become really an oversimplified explanation.
There is a very simplistic analogy between a single logistic unit and a single neuron in the brain.
No human today understand how a human brain neuron works.

No human today know exactly how many neurons on the brain.

Deep learning in Andrew's opinion is very good at learning very flexible, complex functions to learn X to Y mappings, to
learn input-output mappings (supervised learning).
The field of computer vision has taken a bit more inspiration from the human brains then other disciplines that also apply
deep learning.
NN is a small representation of how brain work. The most near model of human brain is in the computer vision (CNN)

Deep Learning
100% (4)
Deep Learning
100 pages
Deep Learning 1754476648
No ratings yet
Deep Learning 1754476648
100 pages
Deep Learning Guide
No ratings yet
Deep Learning Guide
3 pages
Deep Learning Andrew NG
100% (4)
Deep Learning Andrew NG
173 pages
Neural Network Training
No ratings yet
Neural Network Training
73 pages
13 - Neural Network (Perceptrons)
No ratings yet
13 - Neural Network (Perceptrons)
31 pages
Neural Network (Perceptrons)
No ratings yet
Neural Network (Perceptrons)
31 pages
Neural Networks & Backpropagation
No ratings yet
Neural Networks & Backpropagation
77 pages
Deep Learning and Logistic Regression Guide
No ratings yet
Deep Learning and Logistic Regression Guide
42 pages
Neural Networks Skimmed - Ipynb - Colab
No ratings yet
Neural Networks Skimmed - Ipynb - Colab
8 pages
Deep Learning for Beginners
No ratings yet
Deep Learning for Beginners
42 pages
Logistic Regression
No ratings yet
Logistic Regression
51 pages
Lecture 09 Slides - After
No ratings yet
Lecture 09 Slides - After
57 pages
W2 Ann
No ratings yet
W2 Ann
12 pages
CS460 - Deep Learning - W02 & W03
No ratings yet
CS460 - Deep Learning - W02 & W03
44 pages
DL Notes
No ratings yet
DL Notes
652 pages
03-Lecture Notes-Mid
No ratings yet
03-Lecture Notes-Mid
23 pages
Computation Graphs
No ratings yet
Computation Graphs
56 pages
DeepLearning Recap
No ratings yet
DeepLearning Recap
104 pages
Lec06 Derivatives
No ratings yet
Lec06 Derivatives
22 pages
Neural Networks Optional
No ratings yet
Neural Networks Optional
96 pages
Neural Networks for Beginners
No ratings yet
Neural Networks for Beginners
79 pages
Lecture 8 - Logistic Regression
No ratings yet
Lecture 8 - Logistic Regression
58 pages
Computation Graph - 1
No ratings yet
Computation Graph - 1
65 pages
Neural Network
No ratings yet
Neural Network
14 pages
Intro To DL
No ratings yet
Intro To DL
28 pages
Neural Networks and Deep Learning
No ratings yet
Neural Networks and Deep Learning
22 pages
Lec 02 Computation Graphs
No ratings yet
Lec 02 Computation Graphs
64 pages
Deep Learning for Beginners
100% (1)
Deep Learning for Beginners
87 pages
Deep Learning 1
No ratings yet
Deep Learning 1
48 pages
cs188 Fa24 Lec24
No ratings yet
cs188 Fa24 Lec24
46 pages
Neural Networks Essay Feranmi Dere
No ratings yet
Neural Networks Essay Feranmi Dere
7 pages
Tut 01
No ratings yet
Tut 01
39 pages
Neural Network Calculation Walkthrough
No ratings yet
Neural Network Calculation Walkthrough
69 pages
L3 Backpropagation
No ratings yet
L3 Backpropagation
61 pages
Introduction to Deep Learning Techniques
No ratings yet
Introduction to Deep Learning Techniques
299 pages
Mathematics of Deep Learning: Lecture 1-Introduction and The Universality of Depth 1 Nets
No ratings yet
Mathematics of Deep Learning: Lecture 1-Introduction and The Universality of Depth 1 Nets
12 pages
1.1 Introduction
No ratings yet
1.1 Introduction
73 pages
HODL Lec 2 Training NNs Intro TF
No ratings yet
HODL Lec 2 Training NNs Intro TF
83 pages
Ai - W7L13
No ratings yet
Ai - W7L13
46 pages
Deep Learning Course Overview
No ratings yet
Deep Learning Course Overview
298 pages
Deep Feedforward Neural Networks Guide
No ratings yet
Deep Feedforward Neural Networks Guide
97 pages
CI DeepLearningFundamentals
No ratings yet
CI DeepLearningFundamentals
45 pages
Machine Learning
No ratings yet
Machine Learning
4 pages
Advanced Machine Learning (AML)
No ratings yet
Advanced Machine Learning (AML)
70 pages
Week 1 - Artificial Neural Networks - Part I - Justin
No ratings yet
Week 1 - Artificial Neural Networks - Part I - Justin
56 pages
Neural Networks: Gradients & Backpropagation
No ratings yet
Neural Networks: Gradients & Backpropagation
83 pages
Unit II
No ratings yet
Unit II
12 pages
Neural Networks: 10-601B Introduction To Machine Learning
No ratings yet
Neural Networks: 10-601B Introduction To Machine Learning
78 pages
DNN - M2 - Deep Feedforward NN 23dec
No ratings yet
DNN - M2 - Deep Feedforward NN 23dec
97 pages
Machine Learning Concepts Explained
No ratings yet
Machine Learning Concepts Explained
17 pages
AML - Lecture 3 Logistic Regression. Neural Networks
No ratings yet
AML - Lecture 3 Logistic Regression. Neural Networks
59 pages
Lesson 3 Artificial Neural Network
No ratings yet
Lesson 3 Artificial Neural Network
77 pages
The Deep Learning Revolution: Introductory Overview Lecture
No ratings yet
The Deep Learning Revolution: Introductory Overview Lecture
35 pages
Lecture 5
No ratings yet
Lecture 5
34 pages
Germanene Nanoribbon Insights
No ratings yet
Germanene Nanoribbon Insights
11 pages
Nickel
No ratings yet
Nickel
46 pages
Lecture Note
No ratings yet
Lecture Note
2 pages
Advantages: Gas Turbines
No ratings yet
Advantages: Gas Turbines
5 pages
McDonald's Supply Chain
No ratings yet
McDonald's Supply Chain
21 pages
PCAB Licensed Contractors 2017-2018
100% (3)
PCAB Licensed Contractors 2017-2018
525 pages
MRB Annual Planner 2025
No ratings yet
MRB Annual Planner 2025
3 pages
DRFM Jammers For Electronic Attack White Paper
No ratings yet
DRFM Jammers For Electronic Attack White Paper
5 pages
Brownlee. The Theory of Employment and Stabilization Policy
No ratings yet
Brownlee. The Theory of Employment and Stabilization Policy
8 pages
Sustainable Construction Costs & Benefits
No ratings yet
Sustainable Construction Costs & Benefits
7 pages
Max 7032
No ratings yet
Max 7032
32 pages
4050 Insurance
No ratings yet
4050 Insurance
3 pages
Weekly Operational Activities
No ratings yet
Weekly Operational Activities
4 pages
Storage and Handling of Workplace Dangerous Goods: National Code
No ratings yet
Storage and Handling of Workplace Dangerous Goods: National Code
124 pages
342 Partners Available For Ticketing in The United Arab Emirates
No ratings yet
342 Partners Available For Ticketing in The United Arab Emirates
2 pages
Batman - C2C Crochet Blanket Pattern - PrettyThingsByKatja
100% (1)
Batman - C2C Crochet Blanket Pattern - PrettyThingsByKatja
28 pages
Summons and Warrant of Arrest.
No ratings yet
Summons and Warrant of Arrest.
7 pages
Create Performance Task
No ratings yet
Create Performance Task
5 pages
React Tutorial - Beginner To Advanced
No ratings yet
React Tutorial - Beginner To Advanced
30 pages
Aritifical inTeilLiGence
No ratings yet
Aritifical inTeilLiGence
24 pages
Written Requesting Judicial Assistance
No ratings yet
Written Requesting Judicial Assistance
2 pages
Exclusive OR/Exclusive NOR (XOR/XNOR) : XOR/XNOR Truth Table Xor B A Xnor B 0 0 0 1 0 1 1 0 1 0 1 0 1 1 0 1
100% (1)
Exclusive OR/Exclusive NOR (XOR/XNOR) : XOR/XNOR Truth Table Xor B A Xnor B 0 0 0 1 0 1 1 0 1 0 1 0 1 1 0 1
30 pages
Cisco Industrial-Security-Cvd-So
No ratings yet
Cisco Industrial-Security-Cvd-So
8 pages
Types and Selection of Chillers
No ratings yet
Types and Selection of Chillers
11 pages
GDG-450-2KW-250rpm-GREEF NEW ENERGY
No ratings yet
GDG-450-2KW-250rpm-GREEF NEW ENERGY
4 pages
A Low Power Asynchronous Viterbi Decoder Using LEDR Encoding
No ratings yet
A Low Power Asynchronous Viterbi Decoder Using LEDR Encoding
6 pages
BiDi-Awning Installation Guide
No ratings yet
BiDi-Awning Installation Guide
12 pages
Welding Inspection and Quality Control Guide
No ratings yet
Welding Inspection and Quality Control Guide
2 pages
Educ 202 Slides Template
No ratings yet
Educ 202 Slides Template
6 pages
Setup Tallyprime For GST Composition
No ratings yet
Setup Tallyprime For GST Composition
18 pages

DeepLearning Introduction

Uploaded by

DeepLearning Introduction

Uploaded by

Introduction to deep learning

What is a (Neural Network) NN?

Image taken from [Link]

Supervised learning with neural networks

Unstructured data is like images, video, audio, and text.

Why is deep learning taking off?

Neural Networks Basics

Image taken from [Link]

Nx is the size of the input vector

X(1) is the first input vector

Y(1) is the first output vector

X = [x(1) x(2).. x(M)]

Y = (y(1) y(2).. y(M))

We will use python in this course.

Logistic regression cost function

Our cost function is convex.

In Logistic regression people always use 0,0 instead of random.

More Derivatives examples

a = 2.0001 ==> f(a) = 4.0004 approx.

f(a) = log(a) ==> d(f(a))/d(a) = 1/a

Derivatives with a Computation Graph

Logistic Regression Gradient Descent

Gradient Descent on m Examples

d(a) = d(l)/d(a) = -(y/a) + ((1-y)/(1-a))

J = 0; dw1 = 0; dw2 =0; db = 0; # Devs.

So there will be two inner loops to implement the logistic regression.

Vectorizing Logistic Regression

Z = [Link](W.T,X) + b # Vectorization, then broadcasting, Z shape is (1, m)

Vectorizing Logistic Regression's Gradient Output:

Notes on Python and NumPy

Some tricks to eliminate all the strange bugs in the code:

To Compute the derivative of Sigmoid:

To make an image of (width,height,depth) be a vector, use this:

v = [Link]([Link][0]*[Link][1]*[Link][2],1) #reshapes the image.

Gradient descent converges faster after normalization of the input matrices.

Preprocessing the dataset is important.

Neural Networks Overview

In neural networks with one layer we will have:

NN is stack of logistic regression objects.

Neural Network Representation

a1 will represent the activation of the hidden neurons.

a2 will represent the output layer.

Computing a Neural Network's Output

Here are some informations about the last image:

Shapes of the variables:

b1 is the matrix of the first hidden layer, it has a shape of (noOfHiddenNeurons,1)

is the result of the equation a1 = sigmoid(z1) , it has a shape of (noOfHiddenNeurons,1) W2

a2 is the result of the equation a2 = sigmoid(z2) , it has a shape of (1,1)

Vectorizing across multiple examples

Lets say we have X on shape (Nx,m) . So the new pseudo code:

Z1 = W1X + b1 # shape of Z1 (noOfHiddenNeurons,m)

If you notice always m is the number of columns.

Z1 = W1A0 + b1 # shape of Z1 (noOfHiddenNeurons,m)

Or A = [Link](z) # Where z is the input matrix

Why do you need non-linear activation functions?

Derivatives of activation functions

Derivation of Tanh activation function:

g(z) = (e^z - e^-z) / (e^z + e^-z)

Derivation of RELU activation function:

Derivation of leaky RELU activation function:

Gradient descent for Neural Networks

Gradient descent algorithm:

Cost function I = I(W1, b1, W2, b2) = (1/m) * Sum(L(Y,A2))

Then Gradient descent:

dZ2 = A2 - Y # derivative of cost function we used * derivative of the sigmoid function

How we derived the 6 equations of the backpropagation:

To solve this we initialize the W's with a small random numbers:

W1 = [Link]((2,2)) * 0.01 # 0.01 to make it small enough

Deep Neural Networks

Deep L-layer neural network

We will use the notation L to denote the number of layers in a NN.

g[l] is the activation function.

x = a[0] , a[l] = y'

Forward Propagation in a Deep Network

z[l] = W[l]a[l-1] + b[l]

Forward propagation general rule for m inputs:

v = [Link]([Link][0][Link][1][Link][2],1) #reshapes the image.