1.1 Introduction: Deep Learning

Introduction
• Revisiting Basics
• The Neural Network

Building Intelligent Machines
• robotic assistants to clean our homes
• cars that drive themselves
• microscopes that automatically detect diseases

• Requires us to solve some of the most complex computational problems

Limits of Traditional Computer Programs?
What is hard for them to do?
• object recognition
• speech comprehension
• automated translation
Machine Learning Mechanics
• Deep learning is a subset of a more general field of artificial intelligence called machine learning.
• In machine learning, instead of teaching a computer a massive list of rules to solve the problem, we give it a model with which it can evaluate examples, and a small set of instructions to modify the model when it makes a mistake.
• We expect that, over time, a well-suited model would be able to solve the problem extremely accurately.
Machine Learning Mechanics
• Let's define our model to be a function h(x, θ).
• The input x is an example expressed in vector form.
▫ For example, if x were a grayscale image, the vector's components would be pixel intensities at each position.
• The input θ is a vector of the parameters that our model uses.
• Our machine learning program tries to perfect the values of these parameters as it is exposed to more and more examples.
Example: TPS Activity
• Predict exam performance based on the number of hours of sleep we get and the number of hours we study the previous day.
Solution
• Collect data: for each data point x = [x1 x2]T, record the number of hours of sleep we got (x1) and the number of hours we spent studying (x2).
• Goal: learn a model h(x, θ) with parameter vector θ = [θ0 θ1 θ2]T such that h(x, θ) predicts −1 (fail) when xT · [θ1 θ2]T + θ0 < 0, and 1 (pass) otherwise.
• It then turns out that, by selecting θ = [−24 3 4]T, our machine learning model makes the correct prediction on every data point.
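A minimal sketch of this classifier in numpy; the ±1 pass/fail encoding follows the formulation above:

Sample Code
import numpy as np

# Linear classifier h(x, theta) for the exam example; the +1/-1
# pass/fail encoding is an illustrative assumption.
def h(x, theta):
    # predicts 1 when theta1*x1 + theta2*x2 + theta0 >= 0, else -1
    return 1 if theta[1] * x[0] + theta[2] * x[1] + theta[0] >= 0 else -1

theta = np.array([-24, 3, 4])
print(h(np.array([4, 4]), theta))  # 3*4 + 4*4 - 24 = 4  -> 1 (pass)
print(h(np.array([2, 3]), theta))  # 3*2 + 4*3 - 24 = -6 -> -1 (fail)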
Solution
• An optimal parameter vector θ positions the classifier so that it makes as many correct predictions as possible.
Optimization
• How do we even come up with an optimal value for the parameter vector θ in the first place?
• An optimizer aims to maximize the performance of a machine learning model by iteratively tweaking its parameters until the error is minimized.
Example

y = f(WTx + b)
Limitations of the Linear Perceptron
• But these situations are only the tip of the iceberg.
• More complex problems, such as object recognition and text analysis, remain.
• Data becomes extremely high dimensional, and the relationships we want to capture become highly nonlinear.
• So? Deep Learning
The Neuron
• The neuron receives its inputs along antennae-like structures called dendrites.
• Each of these incoming connections is dynamically strengthened or weakened based on how often it is used.
• Inputs are summed together in the cell body.
The Neuron: McCulloch-Pitts Model

y = f(WTx + b)
Feed Forward Neural Networks
Sigmoid, Tanh, and ReLU Neurons
• f(z) = 1/(1 + e−z)
• Intuitively, this means that when the logit (z) is very small, the output of a logistic neuron is very close to 0.
• When the logit is very large, the output of the logistic neuron is close to 1.
Sample Code
import numpy as np

mmatrix = np.array([[1, 2, 3], [4, 5, 6]])
print(mmatrix)

def sigmoid(X):
    return 1 / (1 + np.exp(-X))

sigmoid(mmatrix)

• Output:
array([[0.73105858, 0.88079708, 0.95257413],
       [0.98201379, 0.99330715, 0.99752738]])
Tanh Neuron
• Tanh neurons use a similar kind of S-shaped nonlinearity.
• The output of a tanh neuron ranges from −1 to 1.

f(x) = 2/(1 + e−2x) − 1
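A sketch mirroring the earlier sigmoid sample, reusing the same mmatrix:

Sample Code
import numpy as np

def tanh(X):
    # equivalent to np.tanh(X); outputs range from -1 to 1
    return 2 / (1 + np.exp(-2 * X)) - 1

mmatrix = np.array([[1, 2, 3], [4, 5, 6]])
tanh(mmatrix)

• Output:
array([[0.76159416, 0.96402758, 0.99505475],
       [0.9993293 , 0.9999092 , 0.99998771]])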
Comparison Sigmoid & Tanh
ReLU Neuron
• f(z) = max(0, z)

def relu(X):
    return np.maximum(0, X)

relu(mmatrix)
ReLU
• The main advantages of the ReLU activation function are as
follows:
• Sparsity:
▫ ReLU can introduce sparsity in the network by setting
negative values to zero. This means that only a subset of the
neurons is activated, which can lead to more efficient
computation and memory usage.
• Simplicity:
▫ ReLU is a simple and computationally efficient activation
function, as it involves only a single non-linear operation.
• Avoiding the vanishing gradient problem:
▫ Unlike activation functions such as sigmoid or tanh, ReLU
does not saturate for positive inputs.
▫ This property helps mitigate the vanishing gradient
problem, which can occur when the gradients become very small as they propagate back through many layers.
Softmax Output Layers
• We want the output vector to be a probability distribution over a set of mutually exclusive labels.
• A probability distribution gives us a better idea of our confidence in predictions:
[p0 p1 p2 p3 . . . p9], with Σi pi = 1
• This is achieved by using a special output layer called a softmax layer.
Softmax Output Layers
• The output of a neuron in a softmax layer depends on the outputs of all the other neurons in its layer.
• We require the sum of all the outputs to be equal to 1.
• Letting zi be the logit of the ith softmax neuron, we can achieve this normalization by setting its output to:
yi = e^zi / Σj e^zj
Sample Code
import numpy as np

def softmax(x):
    """Applies softmax to an input vector x."""
    e_x = np.exp(x)
    return e_x / e_x.sum()

x = np.array([1, 0, 3, 5])
y = softmax(x)
print(y)

• Output:
[0.01578405 0.00580663 0.11662925 0.86178007]


The Fast Food Problem
• Every single day, we purchase a
restaurant meal consisting of burgers,
fries, and sodas. We buy some number of
servings for each item. We want to be
able to predict how much a meal is going
to cost us, but the items don’t have price
tags. The only thing the cashier will tell
us is the total price of the meal. We want
to train a single linear neuron to solve
this problem. How do we do it?
The Fast Food Problem
Possible Solution
• Be intelligent about picking our training cases.
• E.g., for one meal we could buy only a single serving of burgers, for another only a single serving of fries, and for the last meal only a single serving of soda.
• In general, intelligently selecting training examples is a very good idea.
• By engineering a clever training set, you can make your neural network a lot more effective.
More General Approach
• Assume a large set of training examples.
• Calculate what the neural network will output on the ith training example using the simple formula.
• Train the neuron for optimal weights by minimizing the error:
E = ½ Σi (t(i) − y(i))2
• What if E = 0?
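A minimal sketch of this approach in numpy; the prices and meal counts below are invented purely for illustration:

Sample Code
import numpy as np

# Hypothetical data: each row of X counts servings of [burgers, fries, soda];
# the "true" prices are assumed here only to generate the meal totals t.
rng = np.random.default_rng(0)
true_prices = np.array([3.5, 1.5, 1.0])
X = rng.integers(0, 5, size=(100, 3)).astype(float)
t = X @ true_prices

w = np.zeros(3)                      # the weights (prices) we want to learn
eps = 0.01                           # learning rate
for _ in range(500):
    y = X @ w                        # predicted totals y(i)
    grad = -(t - y) @ X / len(X)     # gradient of E = 1/2 * sum (t(i) - y(i))^2, averaged
    w -= eps * grad
print(w.round(2))                    # approaches [3.5, 1.5, 1.0]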
Gradient Descent
• Let's say our linear neuron has only two inputs (and hence two weights, w1 and w2).
• Imagine a three-dimensional space where the horizontal dimensions correspond to the weights w1 and w2, and the vertical dimension corresponds to the value of the error function E.
Gradient Descent
• Visualize the error surface as a set of elliptical contours.
• The minimum error is at the center of the ellipses.
• Contours correspond to settings of w1 and w2 that evaluate to the same value of E.
• The closer the contours are to each other, the steeper the slope.
• The direction of steepest descent is always perpendicular to the contours. This direction is expressed as a vector known as the gradient.
The Delta Rule and Learning Rates
• In practice, at each step of moving perpendicular to
the contour, we need to determine how far we want
to walk before recalculating our new direction.
• This distance needs to depend on the steepness of
the surface. Why?
• The closer we are to the minimum, the shorter we
want to step forward
• We know we are close to the minimum, because the
surface is a lot flatter, so we can use the steepness as
an indicator of how close we are to the minimum
• Learning rate, ε
Example GD
• Let's take a simple quadratic function.
• Because it is a univariate function, its gradient is simply its derivative.
• With a learning rate of 0.1 and a starting point of x = 9, we can easily calculate each step by hand. Let's do it for the first 3 steps, comparing the GD algorithm for learning rates of 0.1 and 0.8 (see the sketch below).
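A sketch assuming the quadratic f(x) = x², whose gradient is 2x (the slide's exact function is an assumption here):

Sample Code
# Gradient descent on an assumed quadratic f(x) = x**2, with gradient 2x.
def gd(lr, x=9.0, steps=3):
    for i in range(steps):
        x = x - lr * 2 * x           # x_new = x - lr * f'(x)
        print(f"lr={lr}: step {i + 1}, x = {x:.3f}")

gd(0.1)  # 7.200, 5.760, 4.608   -> smooth, monotone convergence
gd(0.8)  # -5.400, 3.240, -1.944 -> overshoots and oscillates, yet still converges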
Vanishing Gradient Problem
• Activation functions like the sigmoid squash a large input space into a small output range between 0 and 1.
• A large change in the input of the sigmoid function can therefore cause only a small change in the output.
Vanishing Gradient Problem
• The maximum of the sigmoid's derivative is 1/4, and the function horizontally asymptotes at 0.
• In other words, the output of the derivative of the sigmoid function is always between 0 and 1/4.
• Mathematically, it ranges over (0, 1/4].
Vanishing Gradient Problem
• By multiplying these derivatives together, we are multiplying values in the range (0, 1/4].
• Any two numbers between 0 and 1 multiplied with each other will simply result in a smaller value; for example, 1/3 × 1/3 = 1/9.
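A quick numerical sketch of this compounding effect (not tied to any particular network):

Sample Code
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) peaks at z = 0 with value 1/4.
# Multiplying one such factor per layer shrinks the gradient geometrically.
d = sigmoid(0.0) * (1 - sigmoid(0.0))   # 0.25, the largest possible factor
for depth in [1, 5, 10, 20]:
    print(depth, d ** depth)            # 0.25, ~9.8e-04, ~9.5e-07, ~9.1e-13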
Gradient Descent with Sigmoidal Neurons
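As a sketch of the standard chain-rule computation: for a sigmoidal neuron with output y(i) = σ(z(i)), logit z(i) = Σk wk xk(i), and squared error E = ½ Σi (t(i) − y(i))2, we have

∂E/∂wk = Σi (∂E/∂y(i)) (∂y(i)/∂z(i)) (∂z(i)/∂wk) = Σi xk(i) y(i) (1 − y(i)) (y(i) − t(i))

using the fact that σ′(z) = σ(z)(1 − σ(z)). The gradient descent update therefore becomes

Δwk = ε Σi xk(i) y(i) (1 − y(i)) (t(i) − y(i))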
Backpropagation Algorithm
We get a certain loss at the output, and we try to figure out who is responsible for this loss.
So we talk to the output layer and say, "Hey! You are not producing the desired output, better take responsibility."
The output layer says, "Well, I take responsibility for my part, but please understand that I am only as good as the hidden layers and weights below me."
After all, the output is computed from the hidden activations and the weights feeding into the output layer, so the blame must be shared with them.
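A minimal numpy sketch of this blame assignment for an assumed 2-3-1 sigmoid network with squared error; the data and layer sizes are invented for illustration:

Sample Code
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Assumed toy setup: 4 examples, a 2-3-1 network, squared error.
rng = np.random.default_rng(1)
X = rng.random((4, 2)); t = rng.random((4, 1))
W1 = rng.random((2, 3)); W2 = rng.random((3, 1))

h = sigmoid(X @ W1)                        # forward pass: hidden layer
y = sigmoid(h @ W2)                        # forward pass: output layer

delta2 = (y - t) * y * (1 - y)             # output layer's share of the blame
delta1 = (delta2 @ W2.T) * h * (1 - h)     # blame passed back to the hidden layer
W2 -= 0.1 * h.T @ delta2                   # gradient step on each weight matrix
W1 -= 0.1 * X.T @ delta1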
Stochastic Gradient Descent
• Stochastic gradient descent is an extension of gradient descent.
• A recurring problem in machine learning is that large training sets are necessary for good generalization.
• However, large training sets are also more computationally expensive.
• The negative conditional log-likelihood of the training data can be written as:
J(θ) = (1/m) Σi=1..m L(x(i), y(i), θ), where L(x, y, θ) = −log p(y | x; θ) is the per-example loss
Stochastic Gradient Descent
• For additive cost functions, gradient descent requires computing:
∇θ J(θ) = (1/m) Σi=1..m ∇θ L(x(i), y(i), θ)
• The computational cost of this operation is O(m).
• As the training set size grows to billions of examples, the time to take a single gradient step becomes prohibitively long.
Stochastic Gradient Descent
• Specifically, on each step of the algorithm, we can sample a minibatch of examples B = {x(1), . . . , x(m′)} drawn uniformly from the training set.
• The minibatch size m′ is typically chosen to be a relatively small number of examples.
• The estimate of the gradient is formed as:
g = (1/m′) ∇θ Σi=1..m′ L(x(i), y(i), θ)
using examples from the minibatch B.
• The stochastic gradient descent algorithm then follows the estimated gradient downhill: θ ← θ − εg, where ε is the learning rate.
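A sketch of minibatch SGD for a linear model; the synthetic data, learning rate, and batch size are illustrative assumptions:

Sample Code
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))                      # synthetic training set
theta_true = np.array([1.0, -2.0, 0.5])
t = X @ theta_true + rng.normal(scale=0.1, size=10000)

theta = np.zeros(3)
eps, m_prime = 0.05, 32                              # learning rate, minibatch size m'
for step in range(2000):
    idx = rng.integers(0, len(X), m_prime)           # sample minibatch B uniformly
    Xb, tb = X[idx], t[idx]
    g = Xb.T @ (Xb @ theta - tb) / m_prime           # gradient estimate from B
    theta -= eps * g                                 # follow the estimate downhill
print(theta.round(2))                                # approaches [1.0, -2.0, 0.5]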
Mini Batch
Figure: SGD vs. mini-batch gradient descent (batch size = 10)
Capacity, Overfitting and Underfitting
• The central challenge in machine learning is
that we must perform well on new, previously
unseen inputs.
• The ability to perform well on previously
unobserved inputs is called generalization
• The factors determining how well a machine
learning algorithm will perform are its ability
to:
▫ Make the training error small.
▫ Make the gap between training and test error
small.
Capacity, Overfitting and Underfitting
• Underfitting occurs when the model is not able to
obtain a sufficiently low error value on the training set.
• Overfitting occurs when the gap between the training
error and test error is too large.
• We can control whether a model is more likely to
overfit or underfit by altering its capacity.
• A model's capacity is its ability to fit a wide variety of
functions.
• Models with low capacity may struggle to fit the
training set.
• Models with high capacity can overfit by memorizing
properties of the training set
Preventing Overfitting in Deep Neural Networks
• Regularization modifies the objective function that we minimize by adding additional terms that penalize large weights.
• We change the objective function so that it becomes Error + λ f(θ), where f(θ) grows larger as the components of θ grow larger, and λ is the regularization strength.
• The most common type of regularization in machine learning is L2 regularization.
• L2 regularization is also commonly referred to as weight decay.
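A sketch of how the L2 penalty enters the objective and its gradient; the function and parameter names here are illustrative:

Sample Code
import numpy as np

# Error + lambda * f(theta), with f(theta) = sum of squared weights.
def l2_regularized(data_loss, data_grad, w, lam=0.01):
    loss = data_loss + lam * np.sum(w ** 2)   # penalized objective
    grad = data_grad + 2 * lam * w            # extra term shrinks weights each step
    return loss, grad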
Preventing Overfitting in Deep Neural Networks
• Another common type of regularization is L1 regularization: we add the term λ|w| for every weight w in the neural network.
• L1 regularization has the property of leading the weight vectors to become sparse during optimization (i.e., very close to exactly zero).
• Neurons with L1 regularization end up using only a small subset of their most important inputs.
• L1 regularization is very useful when you want to understand exactly which features are contributing to a decision.
Preventing Overfitting in Deep Neural Networks
• Another approach is dropout.
• While training, dropout is implemented by keeping a neuron active only with some probability p (a hyperparameter), and setting it to zero otherwise.
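A minimal sketch of (inverted) dropout applied to a layer's activations:

Sample Code
import numpy as np

def dropout(activations, p=0.5, training=True):
    # Keep each neuron with probability p; rescale by 1/p so the
    # expected activation is unchanged. At test time, do nothing.
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < p) / p
    return activations * mask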
Challenges Motivating Deep Learning
• Simple machine learning algorithms work very well on a wide variety of important problems.
• However, they have not succeeded in solving the central problems in AI, such as recognizing speech or recognizing objects.
• The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.
• High-dimensional spaces also impose high computational costs.
Challenges Motivating Deep Learning
• The Curse of Dimensionality
▫ Many machine learning problems become
exceedingly difficult when the number of
dimensions in the data is high.
▫ This phenomenon is known as the curse of
dimensionality
Challenges Motivating Deep Learning
• Local Constancy
▫ Local constancy is a concept related to the assumption
that data samples that are close to each other in the input
space should have similar output predictions.
▫ In other words, if two data points are similar in their
features or attributes, the model's output for these points
should also be similar
• Smoothness Regularization
▫ Smoothness regularization is a technique used in
machine learning models, including deep learning
models, to encourage smooth transitions in the
predictions across the input space.
▫ The objective of smoothness regularization is to penalize models for producing sharp, erratic, or noisy predictions, which can lead to overfitting and poor generalization on unseen data.
Challenges Motivating Deep Learning
• Manifold Learning
▫ Manifold learning in deep learning refers to the
use of deep neural networks to learn low-
dimensional representations (manifolds) of high-
dimensional data.
▫ The key idea behind manifold learning is that the
data often lies on a lower-dimensional manifold
within the high-dimensional input space.
▫ By discovering this underlying manifold, deep
learning models can extract meaningful and
compact representations that capture the essential
structure of the data.
Tensorflow 2.0
• TensorFlow is widely used as a machine learning
implementation library.
• It was created by Google as part of the Google
Brain project
• Later made available as an open source product
Tensorflow 2.0
• Tensors are the building blocks of TensorFlow, as all computations are done using tensors.
• A tensor is a generalization of vectors and matrices to potentially higher dimensions. Internally, TensorFlow represents tensors as n-dimensional arrays of base datatypes.
• Tensors can be of two types: constant or variable.
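A short sketch of the two tensor types:

Sample Code
import tensorflow as tf

c = tf.constant([[1., 2.], [3., 4.]])   # constant tensor: immutable
v = tf.Variable(tf.zeros((2, 2)))       # variable tensor: holds mutable state
v.assign_add(c)                         # variables can be updated in place
print(v.numpy())                        # [[1. 2.] [3. 4.]]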
Tensorflow 2.0
1. TensorFlow 2.0 doesn't require the graph definition.
2. TensorFlow 2.0 doesn't require session execution.
3. TensorFlow 2.0 doesn't make it mandatory to initialize variables.
4. TensorFlow 2.0 doesn't require variable sharing via scopes.
Example
g = tf.Graph() a = tf.constant([[10,10],
with g.as_default(): [11.,1.]])
a = tf.constant([[10,10],[11.,1.]])x = tf.constant([[1.,0.],[0.,1.]])
x = tf.constant([[1.,0.],[0.,1.]]) b = tf.Variable(12.)
b = tf.Variable(12.) y = tf.matmul(a, x) + b
y = tf.matmul(a, x) + b
print(y.numpy())
init_op =
tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init_op)
print(sess.run(y))
Summary
• Machine Learning Basics
• Neuron
• Feed forward Network
• Gradient Descent
• Backpropagation Algorithm
• Challenges
• Tensorflow 2.0
