Domnic Object Detection Basics

The document provides an overview of deep learning and neural networks, focusing on the perceptron, multilayer perceptrons, and convolutional neural networks. It explains the learning process of perceptrons, the architecture of neural networks, and various hyperparameters that influence training. Additionally, it discusses optimization algorithms, including gradient descent, and the importance of convolutional layers for processing images.

Uploaded by

saisuraj1510
Deep Learning and Neural Networks - Basics
What is a perceptron?
In the perceptron diagram:
● Input vector: The feature vector that is fed
to the neuron. It is usually denoted with an
uppercase X to represent a vector of inputs
(x1, x2, . . ., xn).
● Weights vector: Each input xi is assigned a weight
value wi that represents its importance in
distinguishing between different input
datapoints.
● Neuron function: The calculations performed
within the neuron to modulate the input
signals: the weighted sum and step activation
function.
● Output: Controlled by the type of activation
function you choose for your network.
How does the perceptron learn?
The perceptron uses trial and error to learn from its mistakes. It uses
the weights as knobs by tuning their values up and down until the
network is trained.
The perceptron’s learning logic goes like this:
● The neuron calculates the weighted sum and applies the activation
function to make a prediction ŷ. This is called the feedforward
process:

○ ŷ = activation(Σxi · wi + b)
● It compares the output prediction with the correct label to
calculate the error:

○ error = y - ŷ
● It then updates the weights. If the prediction is too high, it
adjusts the weights to make a lower prediction the next time, and
vice versa.
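The learning logic above can be sketched in a few lines of NumPy. This is a minimal illustration, not the slides' own code: the function names, the learning rate `lr`, the epoch count, and the logical-AND toy dataset are all assumptions chosen for the example, and the step function is assumed to output 1 when its input is at least 0.

```python
import numpy as np

def step(z):
    # Step activation (assumed convention): 1 if z >= 0, else 0.
    return np.where(z >= 0, 1, 0)

def train_perceptron(X, y, lr=1.0, epochs=20):
    # Hypothetical helper: lr and epochs are illustrative defaults.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            y_hat = step(np.dot(xi, w) + b)   # feedforward
            error = target - y_hat            # error = y - y_hat
            w += lr * error * xi              # raise or lower the weights
            b += lr * error
    return w, b

# Toy linearly separable dataset: logical AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
predictions = step(X @ w + b)
```

When the error is positive (prediction too low) the weights of the active inputs go up, and when it is negative they go down, which is the "knob tuning" described above.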
Multilayer perceptrons
● Linear datasets--The data can be split with a
single straight line.
● Nonlinear datasets--The data cannot be split
with a single straight line. We need more
than one line to form a shape that splits the
data.
● A single perceptron works well with simple datasets that can be
separated by a line.
● To split a nonlinear dataset, we need more than one line. This means
we need an architecture that uses tens or hundreds of
neurons in our neural network.
Multilayer perceptron architecture
The main components of the neural network architecture are:

● Input layer --Contains the feature vector.

● Hidden layers --The neurons are stacked on top of each other in hidden layers. They are
called “hidden” layers because we do not directly observe their inputs or outputs.

● Weight connections (edges) --Weights are assigned to each connection between the nodes to
reflect the importance of their influence on the final output prediction.

● Output layer -- Depending on the setup of the neural network, the final output may be a
real-valued output (regression problem) or a set of probabilities (classification problem).
This is determined by the type of activation function we use in the neurons in the output
layer
How many layers, and how many nodes in each layer?

● A network can have one or more hidden layers. Each
layer has one or more neurons.
● When we have two or more hidden layers, we call this a
deep neural network.
● The general rule is this: the deeper your network is,
the better it fits the training data. But a network that is
too deep may fail to generalize when you show it new data
(overfitting); it also becomes more computationally expensive.
● If the network is performing poorly on the training data
(underfitting), add more layers.
Fully connected layers
If each node in a layer is connected to all
nodes in the previous layer, the network is called a
fully connected network.
Neural network hyperparameters:
● Number of hidden layers --Too many layers can let the network memorize the
training data instead of learning its features.
● Activation function --There are many types of activation functions, the most
popular being ReLU and softmax. Use ReLU activation in the hidden layers and
softmax for the output layer (recommended for classification).
● Error function --Mean square error is common for regression problems, and
cross-entropy is common for classification problems.
● Optimizer --Optimization algorithms are used to find the optimum weight values
that minimize the error (batch gradient descent, stochastic gradient descent,
mini-batch gradient descent, Adam, and RMSprop).
● Batch size --The number of training samples given to the
network before each parameter update.
● Number of epochs --The number of times the entire training dataset is shown to
the network while training.
● Learning rate --One of the optimizer’s input parameters that we tune.
Theoretically, a very small learning rate is guaranteed to reach the
minimum error (if you train for an infinite time). A learning rate that is too
big speeds up the learning but is not guaranteed to find the minimum error.
Activation functions (Self Study)
Heaviside step function
Sigmoid/logistic function
Softmax function
Hyperbolic tangent function (tanh)
Rectified linear unit f (x) = max(0, x)
Leaky ReLU f (x) = max(0.01x, x)
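For self-study, the activation functions listed above can be written directly in NumPy. This is a minimal sketch: the function names are illustrative, the Heaviside convention at exactly 0 (output 1) is an assumption, and the softmax subtracts the maximum only for numerical stability.

```python
import numpy as np

def heaviside(z):
    # Step function; returning 1 at z == 0 is an assumed convention.
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    # Squashes any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Turns a vector of scores into probabilities that sum to 1.
    shifted = z - np.max(z)          # stability trick, does not change the result
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def tanh(z):
    # Squashes any real value into (-1, 1).
    return np.tanh(z)

def relu(z):
    # f(x) = max(0, x)
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # f(x) = max(0.01x, x)
    return np.maximum(slope * z, z)

z = np.array([-2.0, 0.0, 3.0])
relu_out = relu(z)
leaky_out = leaky_relu(z)
probs = softmax(z)
```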
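Feed Forward Network and Process
Feedforward calculations (figure: activations computed layer by layer, for layer 2 and layer 3)

Simplified representation of this matrix formula, where each layer applies its
weights and then the activation σ to the output of the previous layer:

ŷ = σ(W(3) · σ(W(2) · σ(W(1) · x)))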
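The layer-by-layer computation can be sketched as a loop over weight matrices. This is an illustrative example only: the layer sizes, the random weights, the sigmoid choice for σ, and the bias vectors (which the simplified formula omits) are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases):
    # Apply a = sigma(W a + b) once per layer, starting from the input x.
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 4, 1]   # hypothetical: 3 inputs, two hidden layers, 1 output
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = np.array([0.5, -1.0, 2.0])
y_hat = feedforward(x, weights, biases)
```

With three weight matrices the loop computes exactly the nested expression σ(W(3) · σ(W(2) · σ(W(1) · x))), plus the (here zero) biases.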
Error functions
Mean square error: Mean squared error (MSE) is
commonly used in regression problems:

E = (1/n) · Σ (ŷi - yi)², summed over the n training examples

Cross-entropy: Cross-entropy is commonly used in classification
problems because it quantifies the difference between two probability
distributions:

E = - Σ yi · log(pi), summed over the m classes

where (y) is the target probability, (p) is the predicted probability, and (m) is
the number of classes.

               P(cat)  P(dog)  P(fish)
Target (y)      0.0     1.0     0.0
Predicted (p)   0.2     0.3     0.5

E = - (0.0 * log(0.2) + 1.0 * log(0.3) + 0.0 * log(0.5)) = 1.2 (natural logarithm)

To calculate the cross-entropy error across all the training examples (n), we
use this general formula:

E = - Σi Σj yij · log(pij), with i = 1..n (examples) and j = 1..m (classes)
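As a quick check of the cat/dog/fish arithmetic above, both error functions can be computed with NumPy. The function names and the small clipping constant `eps` (used only to avoid log(0)) are assumptions for this sketch.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between prediction and target.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(y_true, p_pred, eps=1e-12):
    # -sum(y * log(p)) over the classes, using the natural logarithm.
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)
    return -np.sum(y_true * np.log(p_pred))

# Target: dog. Predicted: P(cat)=0.2, P(dog)=0.3, P(fish)=0.5.
example = cross_entropy([0.0, 1.0, 0.0], [0.2, 0.3, 0.5])
print(round(example, 1))  # 1.2
```

Only the term for the true class contributes, so the error is just -log(0.3), which rounds to 1.2.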
Error functions and Weights
The network learns by adjusting its weights.
When we plot the error function with
respect to a weight, we get this type of
graph.
We initialize the network with random
weights. The weight lies somewhere on
this curve, and our mission is to
make it descend this curve to its
optimal value with the minimum error.
The process of finding the goal
weights of the neural network happens
by adjusting the weight values in an
iterative process using an
optimization algorithm.
What is optimization?
Changing the parameters to minimize (or maximize) a value is called
optimization.
In neural networks, optimizing the error function means updating the weights
and biases until we find the optimal weights, or the best values for the
weights to produce the minimum error.

error = |ŷ - y |

= |(w · x) - y |
Why do we need an optimization algorithm?

• To avoid brute-forcing through a lot of weight values
until we get the minimum error.
• Most popular optimization algorithm for neural
networks: gradient descent
• Gradient descent variations: batch gradient
descent (BGD), stochastic gradient descent (SGD), and
mini-batch GD (MB-GD).
Batch gradient descent

Gradient descent simply means updating the weights
iteratively to descend the slope of the error curve
until we get to the point with minimum error.

In order to descend the error mountain, we need to
determine two things for each step:
● The step direction (gradient)
● The step size (learning rate)

By multiplying the direction (derivative) by the step size
(learning rate), we get the change of the weight for each step:

wnext-step = wcurrent + Δw

where Δw = -(learning rate) · dE/dw; the minus sign moves the weight downhill.
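A minimal sketch of this update rule, assuming a one-weight linear model ŷ = w · x and a mean squared error (both chosen here for illustration; the learning rate, step count, and toy data are also assumptions):

```python
import numpy as np

def batch_gradient_descent(x, y, lr=0.01, steps=100):
    # Fit y_hat = w * x by descending the mean squared error.
    w = 0.0
    for _ in range(steps):
        y_hat = w * x
        # Gradient of mean((y_hat - y)^2) with respect to w, using ALL samples.
        grad = 2.0 * np.mean((y_hat - y) * x)
        delta_w = -lr * grad      # step size times (negative) direction
        w = w + delta_w           # w_next = w_current + delta_w
    return w

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                       # toy data generated with the true weight 2
w_fit = batch_gradient_descent(x, y)
```

Every step uses the whole dataset to compute one gradient, which is what makes this the "batch" variant.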
Pitfalls of batch gradient descent
Gradient descent is a very powerful
algorithm to get to the minimum error.
But it has two major pitfalls.

First, not all cost functions look
like simple bowls. There may be
holes, ridges, and all sorts of
irregular terrain that make reaching
the minimum error very difficult.

Second, batch gradient descent uses the
entire training set to compute the
gradients at every step. That is
computationally very expensive and slow.
This algorithm is called batch
gradient descent because it uses the
entire training data in one batch.

One possible approach to solving these two
problems is stochastic gradient descent.
Stochastic gradient descent
In stochastic gradient descent, the
algorithm randomly selects data points
and goes through the gradient descent
one data point at a time (figure).

Because each step uses a different data point,
the updates are noisy: the weights effectively
explore many different starting points and descend
several of the mountains to reach their local
minima.

The lowest of these local minima is the best
candidate for the global minimum, although
finding it is not guaranteed.
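For comparison with batch gradient descent, here is a minimal sketch of the stochastic variant on a one-weight linear model ŷ = w · x with a squared error. The model, learning rate, epoch count, seed, and toy data are assumptions for illustration.

```python
import numpy as np

def stochastic_gradient_descent(x, y, lr=0.01, epochs=50, seed=0):
    # Update the weight after EVERY sample, visiting samples in random order.
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            # Gradient of (w * x_i - y_i)^2 for a single data point.
            grad = 2.0 * (w * x[i] - y[i]) * x[i]
            w -= lr * grad
    return w

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                       # toy data generated with the true weight 2
w_sgd = stochastic_gradient_descent(x, y)
```

Each update is cheap because it touches one data point, at the cost of a noisier path toward the minimum. Mini-batch gradient descent sits in between by averaging the gradient over a small group of samples.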
Batch gradient descent Vs. Stochastic gradient descent
Too many weights!
Neural networks are densely connected.

But is this really what we want when
processing images?

Would rather have sparse connections:
Fewer weights
Nearby regions - related
Far apart - not related

How can we do this?

Convolutions!
Just weighted sums of
small areas in image

Weight sharing in different
locations in image
Convolutional neural networks
Use convolutions instead of dense connections
to process images

Takes advantage of structure in our data!

Imposes an assumption on our model:
Nearby pixels are related, far apart ones
are less related.

What does this do to our bias/variance?


Convolutional Layer
Input: an image
Processing: convolution with multiple filters
Output: an image, # channels = # filters

Output still weighted sum of input (w/
activation)
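A naive sketch of such a layer, written with explicit loops for readability rather than speed. The (channels, height, width) array layout, the ReLU activation, the bias terms, and the function name are assumptions for this example; stride is 1 and there is no padding.

```python
import numpy as np

def conv_layer(image, filters, biases):
    # image: (C, H, W); filters: (F, C, k, k); biases: (F,)
    num_filters, channels, k, _ = filters.shape
    _, height, width = image.shape
    out_h, out_w = height - k + 1, width - k + 1
    output = np.zeros((num_filters, out_h, out_w))
    for f in range(num_filters):
        for i in range(out_h):
            for j in range(out_w):
                # Weighted sum of a small k x k area across all input channels.
                patch = image[:, i:i + k, j:j + k]
                output[f, i, j] = np.sum(patch * filters[f]) + biases[f]
    return np.maximum(0.0, output)   # ReLU activation (assumed)

image = np.arange(2 * 5 * 5, dtype=float).reshape(2, 5, 5)   # 2-channel 5x5 image
filters = np.ones((3, 2, 3, 3)) / 18.0                        # 3 averaging filters
biases = np.zeros(3)
feature_maps = conv_layer(image, filters, biases)
```

The same filter weights are reused at every location (weight sharing), and the output has one channel per filter: here 3 channels of size 3 x 3.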
Kernel size
How big the filter for a layer is

Typically between 1 x 1 and 11 x 11 (most common: 1 x 1
and 3 x 3)

1 x 1 is just a linear combination of channels
in the previous image (no spatial processing)

Filters usually have the same
number of channels as the
input image.
Padding
Convolutions have problems on edges

Do nothing: output a little smaller than input

Pad: add extra pixels on edge

Most common: zero pad to preserve image size


Stride
How far to move filter between applications

We’ve done stride 1 convolutions up until now,
which approximately preserves image size

Could move filter further, downsample image

Stride of 2: downsamples by factor of 2
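The effect of kernel size, padding, and stride on the output size follows the standard formula floor((n + 2p - k) / s) + 1. A small helper (the name is hypothetical) makes the cases above concrete:

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    # Number of filter positions along one dimension.
    return (input_size + 2 * padding - kernel_size) // stride + 1

no_padding = conv_output_size(32, 3)                        # a little smaller
zero_padded = conv_output_size(32, 3, padding=1)            # size preserved
strided = conv_output_size(32, 3, padding=1, stride=2)      # downsampled by 2
print(no_padding, zero_padded, strided)  # 30 32 16
```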


Implement using matrices!
We want convolution to be matrix operations
because matrices are fast. How?
Im2col: rearrange image
Take spatial blocks of image and put them into
columns of a matrix.

Im2col handles kernel size, stride,
padding, etc.

Makes matrix with all relevant
pixels in the image.
Multiple channel image stacked by channel.

Now we just multiply by
our filter matrix to do
convolutions.

Can calculate our weight
updates as well by
multiplying the delta of
a layer by the input.

Finally, can calculate
backpropagated delta by
multiplying the current
delta by weights and
doing the reverse of
im2col.
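The forward pass of this idea can be sketched as follows. This is one possible layout, not necessarily the one used in the slides: the (channels, height, width) array order, the column ordering, and the function names are assumptions, and only the forward convolution (not the weight update or the reverse operation) is shown.

```python
import numpy as np

def im2col(image, k, stride=1, padding=0):
    # image: (C, H, W) -> matrix of shape (C*k*k, out_h*out_w),
    # one column per filter position, channels stacked within each column.
    channels, height, width = image.shape
    padded = np.pad(image, ((0, 0), (padding, padding), (padding, padding)))
    out_h = (height + 2 * padding - k) // stride + 1
    out_w = (width + 2 * padding - k) // stride + 1
    cols = np.zeros((channels * k * k, out_h * out_w))
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = padded[:, top:top + k, left:left + k]
            cols[:, col] = patch.reshape(-1)
            col += 1
    return cols, out_h, out_w

def conv_via_im2col(image, filters, stride=1, padding=0):
    # filters: (F, C, k, k) flattened to an (F, C*k*k) filter matrix.
    num_filters, _, k, _ = filters.shape
    cols, out_h, out_w = im2col(image, k, stride, padding)
    filter_matrix = filters.reshape(num_filters, -1)
    return (filter_matrix @ cols).reshape(num_filters, out_h, out_w)

rng = np.random.default_rng(0)
image = rng.normal(size=(2, 5, 5))
filters = rng.normal(size=(3, 2, 3, 3))
out = conv_via_im2col(image, filters, stride=2, padding=1)
```

All the bookkeeping for kernel size, stride, and padding lives in `im2col`; the convolution itself is a single matrix multiplication.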
Images are BIG
Even a 256 x 256 image has about 65,000 pixels
(nearly 200,000 values with three color channels),
and that’s considered a small image!

Convolution:

Aggregates information, so maybe we don’t need all
of the image; we can subsample without throwing
away useful information.
Pooling Layer
Input: an image
Processing: pool pixel values over region
Output: an image, shrunk by a factor of the
stride

Hyperparameters:
What kind of pooling? Average (mean), max, min
How big of stride? Controls downsampling
How big of region? Usually not much bigger than stride

Most common: 2x2 or 3x3 maxpooling, stride of 2


Maxpooling Layer, 2x2 stride 2
Input (8 x 8):
-7 6 -1 3 9 9 6 -9
3 -8 0 7 10 8 -3 10
-4 2 -6 4 -7 5 5 7
-3 -9 1 8 -8 9 -1 -5
-7 10 -9 -5 9 -8 -7 10
-5 5 9 4 10 -8 7 6
-3 8 0 2 2 -3 -2 5
4 -6 7 -3 1 4 10 0

Output (4 x 4), the maximum of each 2x2 block:
6 7 10 10
2 8 9 7
10 9 10 10
8 7 4 10
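The 8 x 8 example above can be reproduced with a short max-pooling sketch. The function name and the single-channel 2-D input are assumptions made to keep the example small.

```python
import numpy as np

def max_pool(image, size=2, stride=2):
    # Take the maximum over each size x size region, moving by `stride`.
    height, width = image.shape
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    output = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            output[i, j] = image[top:top + size, left:left + size].max()
    return output

grid = np.array([
    [-7,  6, -1,  3,  9,  9,  6, -9],
    [ 3, -8,  0,  7, 10,  8, -3, 10],
    [-4,  2, -6,  4, -7,  5,  5,  7],
    [-3, -9,  1,  8, -8,  9, -1, -5],
    [-7, 10, -9, -5,  9, -8, -7, 10],
    [-5,  5,  9,  4, 10, -8,  7,  6],
    [-3,  8,  0,  2,  2, -3, -2,  5],
    [ 4, -6,  7, -3,  1,  4, 10,  0],
])
pooled = max_pool(grid)
print(pooled)
# [[ 6  7 10 10]
#  [ 2  8  9  7]
#  [10  9 10 10]
#  [ 8  7  4 10]]
```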
(Fully) Connected Layer
The standard neural network layer where every
input neuron connects to every output neuron

Often used to go from image feature map ->
final output or map image features to a single
vector

Eliminates spatial information


Convnet Building Blocks
Convolutional layers:
Connections are convolutions
Used to extract features

Pooling layers:
Used to downsample feature maps, make processing more
efficient
Most common: maxpool, avgpool sometimes used at end

Connected layers:
Often used as last layer, to map image features -> prediction
No spatial information
Inefficient: lots of weights, no weight sharing
Convnet for Image Classification
Object Localization
● But for localization, to get the bounding
box, we need 4 outputs per class.

● Idea: modify the last layer to regress
(predict) the four bounding-box values.
Object Localization Network
Bounding Box Regression
