DLVS Unit 1
Figure 2a: The human vision system uses the eye and brain to sense and interpret an image.
a. Human Vision Systems:
The human vision system interprets the image of dogs shown in figure 2a: we look at it and directly understand that the image consists of a bunch of dogs (three, to be specific). Classifying and detecting the objects in this image comes naturally to us because we have been trained over the years to identify dogs.
b. AI Vision Systems:
To replicate human vision, two key components are required: a sensing device that mimics
the eye's functionality, and a sophisticated algorithm that replicates the brain's ability to
interpret and categorize visual content, as shown in figure 2b.
Figure 2b: The components of the computer vision system are a sensing device and an interpreting device.
For example, the main goal of an autonomous vehicle (AV) vision system is to allow the car to understand the environment around it and move from point A to point B safely and in a timely manner. To fulfill this goal, vehicles are equipped with cameras and sensors that detect 360 degrees of movement (pedestrians, cyclists, vehicles, roadwork, and other objects) from up to three football fields away.
Examples of sensing devices used in self-driving cars to perceive the surrounding area:
• LIDAR, a radar-like technique, uses invisible pulses of light to create a high-resolution 3D map of the surrounding area.
• Cameras can see street signs and road markings but cannot measure distance.
• Radar can measure distance and velocity but cannot see in fine detail.
Medical diagnosis applications use X-rays or CT scans as sensing devices.
b. Interpreting devices
Interpreting devices in vision systems take the output image from the sensing device and learn features and patterns to identify objects. In other words, we need to build a brain, and that is the role of ANNs and CNNs.
Artificial neural networks (ANNs), shown in Fig. 3a, are built on an analogy with biological neurons: the main processing element is a neuron with input signals (x1, x2, …, xn) and an output. Connecting millions of these neurons stacked in layers, where each neuron is connected to thousands of other neurons, yields learning behaviour.
Building a multilayer neural network is called deep learning, as shown in Fig. 3a. DL methods learn representations through a sequence of transformations of the data across layers of neurons. The main DL architectures are ANNs and convolutional neural networks (CNNs).
AI and DL have achieved top performance on many complex visual perception tasks like image or video search, captioning, classification, and object detection.
a. Image classification is the task of assigning to an image a label from a predefined set
of categories. A Convolutional neural network is a neural network type that truly shines in
processing and classifying images in many different applications:
Lung cancer diagnosis: Lung cancer is a growing problem. The main reason lung cancer is so dangerous is that when it is diagnosed, it is usually in the middle or late stages. When diagnosing lung cancer, doctors typically use their eyes to examine CT scan images, looking for small nodules in the lungs. In the early stages, the nodules are usually very small and hard to spot (figure 4a).
Figure 4b: Vision systems can detect traffic signs with very high performance.
c. Object detection and localization:
Localization involves determining the object's position and outlining it with a bounding box.
On the other hand, object detection goes further by identifying and categorizing all objects within the image. Each object is assigned a class label and enclosed with a bounding box. To do that, we can build object detection systems like YOLO (You Only Look Once), SSD (single-shot detector), and Faster R-CNN, which not only classify images but can also locate and detect each object in images that contain multiple objects, as shown in figure 4c.
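A minimal sketch (not from the source text) of running one of the detector families named above, Faster R-CNN, using torchvision's pretrained model. The image file name and the 0.5 score threshold are illustrative assumptions, and the exact weights argument can differ between torchvision versions.

# Minimal object-detection sketch using torchvision's pretrained Faster R-CNN.
# "street_scene.jpg" and the 0.5 score threshold are placeholders for illustration.
import torch
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from PIL import Image

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")   # pretrained COCO weights
model.eval()

img = Image.open("street_scene.jpg").convert("RGB")  # placeholder image
tensor = transforms.ToTensor()(img)                   # [C, H, W], values in [0, 1]

with torch.no_grad():
    predictions = model([tensor])[0]   # dict with 'boxes', 'labels', 'scores'

# Keep only confident detections; each box is [x1, y1, x2, y2] with a class label.
for box, label, score in zip(predictions["boxes"], predictions["labels"], predictions["scores"]):
    if score > 0.5:
        print(label.item(), round(score.item(), 2), box.tolist())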
Figure 4d: Style transfer onto the original image, producing a piece of art that feels
e. Creating images
Generative adversarial networks (GANs) are a new class of DL models that can imagine new things. GANs are sophisticated DL models that generate stunningly accurate synthesized images of objects, people, and places, among other things. If you give them a set of images, they can make entirely new, realistic-looking images. For example, StackGAN is one of the GAN architecture variations that can use a textual description of an object to generate a high-resolution image of the object matching that description. This is not just running an image search on a database; these "photos" have never been seen before and are totally imaginary (figure 4e).
AI-generated artwork featuring a fictional person named Edmond de Belamy sold for
$432,500. The artwork was created by a team of three 25-year-old French students using
GANs. The network was trained on a dataset of 15,000 portraits painted between the
fourteenth and twentieth centuries, and then it created one of its own. The team printed the
image, framed it, and signed it with part of a GAN algorithm.
f. Face recognition:
Face recognition (FR) allows us to exactly identify or tag an image of a person. Day-to-day
applications include searching for celebrities on the web and auto-tagging friends and family
in images. Face recognition is a form of fine-grained classification.
A face recognition system operates in two modes:
Face identification involves one-to-many matches that compare a query face image against
all the template images in the database to determine the identity of the query face. Another
face recognition scenario involves a watchlist check by city authorities, where a query face is
matched to a list of suspects (one-to-few matches).
Face verification involves a one-to-one match that compares a query face image against a
template face image whose identity is being claimed (figure below)
Fig. 5b: Predict the probability of the motorcycle object from the motorcycle, car, and dog classes
i) Grayscale Image:
Grayscale images are two-dimensional and lie on the x-y plane. A grayscale image has pixel values ranging from 0 to 255. Since the pixel value represents the intensity of light, the value 0 represents very dark pixels (black), 255 is very bright (white), and the values in between represent intermediate shades of gray (Fig. 5c).
Example: a motorcycle and its grid of pixels.
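To make the pixel-grid idea concrete, the short sketch below (an illustration, with "motorcycle.jpg" as a placeholder file name) loads an image in grayscale mode and inspects the 2D array of intensity values:

# A grayscale image is just a 2D grid of intensity values in [0, 255].
import numpy as np
from PIL import Image

img = Image.open("motorcycle.jpg").convert("L")   # "L" mode = 8-bit grayscale
pixels = np.array(img)                             # 2D array on the x-y plane

print(pixels.shape)                  # (height, width) - no color channel
print(pixels.min(), pixels.max())    # values between 0 (black) and 255 (white)
print(pixels[:5, :5])                # top-left 5x5 corner of the pixel grid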
b. Image Pre-processing:
Image data pre-processing converts image data into a form that machine learning algorithms can work with. It is often used to increase a model's accuracy as well as reduce its complexity. The techniques used to pre-process image data include image resizing, converting images to grayscale, and image augmentation. When the images exist in different formats (natural, fake, grayscale, etc.), we need to standardize them before feeding them into a neural network.
The important steps in image pre-processing techniques are
• Grayscale conversion
• Normalization
• Data Augmentation
• Image standardization
i) Grayscale conversion simply converts images from color to black and white, as shown in Fig. 5f. It is normally used to reduce the amount of pixel data that needs to be processed and to reduce computational complexity in machine learning algorithms.
This may not be a good approach for applications that depend on color information, because information is lost in the conversion. Since most pictures don't need color to be recognized, it is wise to use grayscale, which reduces the data per pixel (one channel instead of three) and thus the computations required.
Converting images to grayscale might not always be practical for some problems. Examples where it would be impractical to use grayscale include traffic lights, healthcare diagnosis, autonomous vehicles, agriculture, etc. A good way to decide whether to use it is to ask whether a human could identify the object in the image without color.
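As an illustrative sketch of the conversion itself (the file name is a placeholder; the ITU-R BT.601 luma weights used below are one common choice), the three RGB channels can be collapsed into a single intensity channel like this:

# Grayscale conversion: collapse the three RGB channels into one intensity channel.
import numpy as np
from PIL import Image

rgb = np.array(Image.open("photo.jpg").convert("RGB"), dtype=np.float32)  # (H, W, 3)

# Weighted sum of the R, G, B channels (ITU-R BT.601 luma weights).
gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
gray = gray.astype(np.uint8)                                              # (H, W)

print(rgb.shape, "->", gray.shape)   # one value per pixel instead of three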
ii) Normalization
Image normalization is a typical process in image processing that changes the range of pixel intensity values.
For example, when we normalize an input image (grayscale or RGB), its pixel values are stretched to span the full range of the scale, from 0 to 255; as a result, very dark images become clearer.
The linear normalization of a digital image is performed according to the formula
Output_channel = 255 * (Input_channel - min) / (max-min)
For a grayscale image, normalization is applied to the single channel; for a color image, it is applied to each of the three RGB channels.
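A small sketch of this formula applied per channel (the arrays below are random placeholders, not images from the text):

# Linear (min-max) normalization following the formula above:
# Output_channel = 255 * (Input_channel - min) / (max - min)
import numpy as np

def normalize_channel(channel: np.ndarray) -> np.ndarray:
    """Stretch one channel's pixel values to the full 0-255 range."""
    c = channel.astype(np.float32)
    c_min, c_max = c.min(), c.max()
    if c_max == c_min:                       # flat channel: nothing to stretch
        return channel.astype(np.uint8)
    stretched = 255.0 * (c - c_min) / (c_max - c_min)
    return stretched.astype(np.uint8)

# Grayscale: one channel. A dark image (values 40-90) becomes full-range 0-255.
dark = np.random.randint(40, 90, size=(4, 4), dtype=np.uint8)
print(normalize_channel(dark))

# Color: apply the same formula to each of the 3 RGB channels.
rgb = np.random.randint(40, 90, size=(4, 4, 3), dtype=np.uint8)
normalized = np.stack([normalize_channel(rgb[..., i]) for i in range(3)], axis=-1)
print(normalized.shape)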
Examples:
Example 1: The original image is too dark and becomes clearer after the normalization process.
Example 2: The original image is very bright and gains better contrast after the normalization process.
"No free lunch" is a phrase introduced by David Wolpert and William Macready in "No Free Lunch Theorems for Optimization" (IEEE Transactions on Evolutionary Computation 1, 67).
• When working on ML projects, you will need to make many choices like building
your neural network architecture, tuning hyperparameters, and applying the
appropriate data preprocessing techniques. While there are some rule-of-thumb
approaches to tackle certain problems, there is really no single recipe that is
guaranteed to work well in all situations.
• You must make certain assumptions about the dataset and the problem you are trying
to solve. For some datasets, it is best to convert the colored images to grayscale, while
for other datasets, you might need to keep or adjust the color images.
• The good news is that, unlike traditional machine learning, DL algorithms require minimal data preprocessing because, as you will see soon, neural networks do most of the heavy lifting in processing an image and extracting features.
C. Feature extraction
Features are parts or patterns of an object in an image that help to identify it. The entire DL model works around the idea of extracting useful features that clearly define the objects in the image.
Raw data (an image) is transformed into a feature vector using a learning algorithm, which learns the characteristics of the object.
For example, a square has 4 corners and 4 edges; these can be called features of the square, and they help us humans identify it as a square. Features include properties like corners, edges, regions of interest points, ridges, etc.
Example: when we feed the raw input image of a motorcycle into a feature extraction algorithm, the algorithm produces a vector that contains a list of features, as shown in the figure below. This feature vector is a 1D array that makes a robust representation of the object.
The process relies on domain knowledge (or partnering with domain experts) to extract features that make ML algorithms work better. The produced features are then fed to a classifier like a support vector machine (SVM) or AdaBoost to predict the output (Fig. 5l).
Fig. 5l: Traditional machine learning algorithms require handcrafted feature extraction.
Fig. 5m: A DNN passes the input image through its layers to automatically extract features.
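As a hedged sketch of the traditional pipeline in Fig. 5l (handcrafted features plus a classifier), the snippet below uses HOG features from scikit-image and an SVM from scikit-learn on randomly generated placeholder images; real motorcycle/car/dog images and labels would take their place in practice:

# Traditional pipeline: handcrafted feature extraction (HOG) + classifier (SVM).
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

rng = np.random.default_rng(0)
images = rng.random((20, 64, 64))        # 20 fake 64x64 grayscale images
labels = np.array([0, 1] * 10)           # 2 fake classes

# Step 1: feature extraction - each image becomes a 1D feature vector.
features = np.array([hog(img, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
                     for img in images])

# Step 2: feed the feature vectors to a classifier (SVM) to predict the output.
clf = SVC(kernel="linear").fit(features, labels)
print(clf.predict(features[:3]))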
D. Classification model:
The feature vectors are fed into a classification model that predicts the class of the image.
• First you see a wheel feature; could this be a car, a motorcycle, or a dog? Clearly it is
not a dog, because dogs don’t have wheels (at least, normal dogs, not robots). Then this
could be an image of a car or a motorcycle.
• You move on to the next feature, the headlights. There is a higher probability that this
is a motorcycle than a car.
• The next feature is rear mudguards—again, there is a higher probability that it is a
motorcycle.
• The object has only two wheels; this is closer to a motorcycle.
• And you keep going through all the features like the body shape, pedal, and so on, until
you arrive at a best guess of the object in the image.
The output of this process is a probability for each class: the dog has the lowest probability, 1%, whereas there is an 85% probability that this is a motorcycle. Although the model was able to predict the right class with the highest probability, it still shows a little confusion between cars and motorcycles, since it predicted a 14% chance of car. Since it is a motorcycle, we can say that our ML classification algorithm predicted the correct class with 85% confidence.
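A classifier typically produces these per-class probabilities with a softmax over raw scores. The score values below are made up, chosen only so the output roughly matches the 85%/14%/1% example above:

# How a classifier turns raw scores into per-class probabilities (softmax).
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    exp = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exp / exp.sum()

classes = ["motorcycle", "car", "dog"]
scores = np.array([4.5, 2.7, 0.05])       # hypothetical raw scores from the model
probs = softmax(scores)

for c, p in zip(classes, probs):
    print(f"{c}: {p:.0%}")
print("prediction:", classes[int(np.argmax(probs))])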
Different approaches can be used to improve the accuracy of the model:
• acquire more training images
• apply more processing to remove noise
• extract better features
• change the classifier algorithm and tune some hyperparameters
• allow more training time
6. Classifier learning algorithm:
The classification task can be done in two ways:
• By traditional ML algorithms like SVMs, which might give good results for some problems.
• By deep learning networks.
Deep learning flow:
Fig. 6b: Plot of the best fit between age and height
The perceptron uses trial and error to learn from its mistakes. It uses the weights as knobs by
tuning their values up and down until the network is trained. That is, the weights are tuned up and down during the learning process to optimize the value of the loss function.
Steps of perceptron learning
1 The neuron calculates the weighted sum and applies the activation function to make a prediction ŷ. This is called the feedforward process.
2 It compares the output prediction with the correct label to calculate the error.
3 It then updates the weights. If the prediction is too high, it adjusts the weights to make a lower prediction the next time, and vice versa.
4 Repeat!
This process is repeated many times, and the neuron continues to update the weights to improve its predictions until step 2 produces a very small error (close to zero), which means
the neuron’s prediction is very close to the correct value. At this point, we can stop the
training and save the weight values that yielded the best results to apply to future cases where
the outcome is unknown.
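A minimal perceptron training loop following these four steps is sketched below; the tiny two-feature dataset and the learning rate are illustrative assumptions, not values from the text:

# Minimal perceptron training loop (feedforward, compare, update, repeat).
import numpy as np

X = np.array([[0.2, 0.7], [0.9, 0.1], [0.8, 0.8], [0.1, 0.2]])  # inputs (2 features)
y = np.array([1, 0, 1, 0])                                       # correct labels

w = np.zeros(2)   # weights act as knobs tuned during learning
b = 0.0           # bias
lr = 0.1          # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        z = np.dot(w, xi) + b            # step 1: weighted sum
        y_hat = 1 if z > 0 else 0        # step 1: activation (step function)
        error = target - y_hat           # step 2: compare with the label
        w += lr * error * xi             # step 3: tune the weights up or down
        b += lr * error
        # step 4: repeat until the error is (close to) zero

print("learned weights:", w, "bias:", b)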
Problem:
A neural network takes binary-valued inputs x1, x2 ∈ {0, 1}, and the activation function is the threshold function h(x) = 1 if x > 0, else 0. Find which logical operation is computed by the neural network function f(x) = 8x1 + 8x2 - 4. (Hint: OR, AND, NAND, NOR, XOR)
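A quick way to check the answer is to evaluate f over all four binary input combinations and compare the outputs with the truth tables of the hinted operations (sketch below):

# Evaluate f(x) = 8*x1 + 8*x2 - 4 with the threshold activation over all binary inputs.
def h(x):
    return 1 if x > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", h(8 * x1 + 8 * x2 - 4))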
C. How to Train a perceptron model to predict whether a player will be accepted into
the college squad?
The first step is to collect all the data from previous years and train the perceptron to predict whether players will be accepted based on only two features (height and age). The trained perceptron will find the best weights and bias values to produce the straight line that best separates the accepted from the non-accepted players (best fit).
The line has this equation: z = height · w1 + age · w2 + b. After training is complete on the training data, we can start using the perceptron to make predictions for new players. When we get a player who is 150 cm in height and 12 years old, we compute the previous equation with the values (150, 12), as shown in Fig. 6c.
Fig. 6c: Plot of the best fit between age and height
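A small sketch of this prediction step; the weight and bias values are hypothetical stand-ins for whatever values training actually produced:

# Predict acceptance for a new player (150 cm, 12 years old) with a trained perceptron.
w1, w2, b = 0.04, 0.5, -10.0          # hypothetical trained parameters

height, age = 150, 12
z = height * w1 + age * w2 + b         # z = height*w1 + age*w2 + b
prediction = 1 if z > 0 else 0         # step activation: 1 = accepted, 0 = not accepted
print(z, prediction)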
The single perceptron works fine here because our data was linearly separable, meaning the training data can be separated by a straight line. But real-world data isn't always that simple. What happens when we have a more complex dataset that cannot be separated by a straight line (a nonlinear dataset, fig 7.a)? We use multilayer perceptrons.
Hidden Layer
The hidden layer is the heart of an artificial neural network; this is where most of the network's computation happens. The edges entering the hidden layer carry weights that are multiplied by the node values, and the layer applies an activation function. There can be one or more hidden layers in the model. The number of hidden-layer nodes must be chosen carefully: too few nodes make the model unable to work efficiently with complex data, while too many nodes result in an overfitting problem.
Output Layer
This layer gives the estimated output of the neural network, as shown in fig 7.d. The number of nodes in the output layer depends on the type of problem: for a single target variable, use one node; for an N-class classification problem, the ANN uses N nodes in the output layer.
Problem: For training a binary classification model with three independent variables, you choose to use neural networks. You apply one hidden layer with three neurons. How many parameters need to be estimated? (Consider the bias term as a parameter.)
Therefore 73 weights are required to design the simple MLP network. The values of these weights are randomly initialized, and then the network performs feedforward and backpropagation to learn the values of the weights that best fit the model to the training data.
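As a sanity-check sketch, the parameter count of any fully connected network can be computed by summing, for each pair of consecutive layers, the weights plus the biases. The layer sizes passed in below follow the problem statement (3 inputs, one hidden layer of 3, 1 output) and should be replaced with the sizes of the network shown in the referenced figure:

# Count parameters of a fully connected network: (n_in * n_out) weights + n_out biases
# for each pair of consecutive layers.
def count_parameters(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out   # weights + biases
    return total

print(count_parameters([3, 3, 1]))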
ii) Advantages and Disadvantages of Activation Functions are shown in Fig 8.b
Sigmoid
Advantages:
• Has a well-defined nonzero derivative everywhere, unlike the step function, which allows the optimizer to make some progress at every step during training.
Disadvantages:
• Suffers from the vanishing gradient problem.
• Slow due to its exponential operation, and not zero-centric.

Tanh
Advantages:
• Derivatives are steeper compared to sigmoid.
• Output is zero-centric.
Disadvantages:
• Suffers from the vanishing gradient problem.

ReLU
Advantages:
• Rectifies the vanishing gradient problem.
• Less computationally expensive compared to tanh and sigmoid due to its simpler mathematical operations.
Disadvantages:
• Suffers from the dying ReLU problem for negative inputs: weight updates stop and no information is passed to the next layers.

Leaky ReLU
Advantages:
• Fixes the dying ReLU problem by introducing a small negative slope.
Disadvantages:
• Because it is piecewise linear, it cannot handle complex classification on its own.
• Suffers from the vanishing gradient problem.

ELU
Advantages:
• Unlike ReLU, ELU can produce negative outputs.
• Fixes both the dying ReLU and vanishing gradient problems.
Disadvantages:
• For x > 0, it can blow up the activation with an output range of [0, ∞).
• Computationally intensive.

Softmax
Advantages:
• Helps with multiclass prediction.
Disadvantages:
• Usually used only in the last layer.

Fig 8.b: Advantages and disadvantages of activation functions
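The sketch below gives plain NumPy versions of the activation functions compared in Fig. 8b, for reference; it is an illustration rather than a library implementation:

# NumPy sketches of the activation functions compared in Fig. 8b.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)                      # zero-centric output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # zero for negative inputs (dying ReLU risk)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope instead of zero

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))   # can output negatives

def softmax(x):
    e = np.exp(x - np.max(x))              # subtract max for numerical stability
    return e / e.sum()                     # probabilities over classes (last layer)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), relu(x), softmax(x), sep="\n")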
In short, the forward pass is the set of calculations through the layers that produces a prediction.
Let's consider a simple three-layer neural network, as shown in Fig. 9a, and its components:
Layers—This network consists of an input layer with three input features and three hidden layers with 3, 4, and 1 neurons respectively.
Weights and biases (w, b)—The edges between nodes are assigned random weights denoted
as Wab(n) , where (n) indicates the layer number and (ab) indicates the weighted edge
connecting the ath neuron in layer (n) to the bth neuron in the previous layer (n – 1). For
example, W23(2) is the weight that connects the second node in layer 2 to the third node in
layer 1 (a22 to a13). (Note that you can see different denotations of Wab(n) in other DL
literature, which is fine as long as you follow one convention for your entire network.) The
biases are treated similarly to weights because they are randomly initialized, and their values
are learned during the training process. So, for convenience, from this point forward we are
going to represent the biases with the same notation that we gave for the weights (w). In DL
literature, you will mostly find all weights and biases represented as (w) for simplicity.
Activation functions σ(x)—In this example, we are using the sigmoid function σ(x) as an
activation function.
Node values (a)—We will calculate the weighted sum, apply the activation function, and
assign this value to the node amn, where n is the layer number and m is the node index in the
layer. For example, a23 means node number 2 in layer 3.
Feedforward calculations:
Step 1: Start the feedforward calculations.
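As an illustrative sketch of these calculations, the snippet below runs a feedforward pass through the network described above (3 inputs, hidden layers of 3, 4, and 1 neurons, sigmoid activations) with randomly initialized weights and a made-up input; the worked numbers in the figure would replace the random values in practice:

# Feedforward pass: weighted sum followed by the sigmoid activation at every layer.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
layer_sizes = [3, 3, 4, 1]                    # input -> 3 -> 4 -> 1
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in layer_sizes[1:]]

x = np.array([0.2, 0.5, 0.9])                 # three input features (made up)
a = x
for W, b in zip(weights, biases):
    a = sigmoid(W @ a + b)                    # weighted sum, then activation

print("prediction y_hat:", a)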
b. Error Function:
The error function is a measure of how “wrong” the neural network prediction is with respect
to the expected output (the label). It quantifies how far we are from the correct solution.
For example, if we have a high loss, then our model is not doing a good job. The smaller the loss,
the better the job the model is doing. The larger the loss, the more our model needs to be
trained to increase its accuracy.
The error, in its simplest form, is calculated by comparing the prediction ŷ and the actual label y.
A visualization of the loss functions of two separate models plotted over time is shown in figure 9.b. You can see that model #1 is doing a better job of minimizing error, whereas model #2 starts off better until epoch 6 and then plateaus.
Different loss functions will give different errors for the same prediction and thus have a considerable effect on the performance of the model. The two most commonly used loss functions are mean squared error (and its variations), usually used for regression problems, and cross-entropy, used for classification problems.
a. Mean squared error (MSE) is commonly used in regression problems that require the output to be a real value (like house pricing). Instead of just comparing the prediction output with the label (ŷi – yi), the error is squared and averaged over the number of data points N, as in this equation:
E(W, b) = (1/N) Σ (ŷi – yi)²
MSE is a good choice for a few reasons. The square ensures the error is always positive, and
larger errors are penalized more than smaller errors. MSE is quite sensitive to outliers, since it
squares the error value. This might not be an issue for the specific problem that you are
solving. In fact, this sensitivity to outliers might be beneficial in some cases. For example, if
you are predicting a stock price, you would want to take outliers into account, and sensitivity
to outliers would be a good thing. In other scenarios, you wouldn’t want to build a model that
is skewed by outliers, such as predicting a house price in a city. In that case, you are more interested in the median and less in the mean. Mean absolute error (MAE) was developed just for this purpose: it averages the absolute error over the entire dataset without taking the square of the error.
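A short numeric sketch of both losses on made-up values:

# Mean squared error vs. mean absolute error on a small set of predictions.
# MSE squares each error (punishing outliers more); MAE averages the absolute errors.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = np.mean((y_pred - y_true) ** 2)
mae = np.mean(np.abs(y_pred - y_true))
print("MSE:", mse, "MAE:", mae)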
Problem: Calculate the feedforward error between the actual and predicted values.
We can interpret this “true” distribution to mean that the training instance has 0% probability
of being class A, 100% probability of being class B, and 0% probability of being class C.
Now, suppose our machine learning algorithm predicts the following probability distribution:
How close is the predicted distribution to the true distribution? That is what the cross-entropy loss function determines. We can use this formula:
CrossEntropy(y, p) = – Σ (from i = 1 to m) yi · log(pi)
where (y) is the target probability, (p) is the predicted probability, and (m) is the number of classes. The sum is taken over the three classes; here the loss is 1.2.
To calculate the cross-entropy error across all the training examples (n), we sum this quantity over every example in the dataset.
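A numeric sketch of the single-example formula; the predicted probabilities below are illustrative, chosen so that the probability assigned to the correct class is 0.3 and the loss comes out to about 1.2:

# Cross-entropy between a one-hot "true" distribution and a predicted distribution.
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])      # 100% probability of the correct class
y_pred = np.array([0.4, 0.3, 0.3])      # hypothetical model output

loss = -np.sum(y_true * np.log(y_pred))
print(loss)   # -log(0.3) ≈ 1.20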
C. Optimization:
In neural networks, optimizing the error function means updating the weights and biases until
we find the optimal weights, or the best values for the weights to produce the minimum error.
Optimization algorithms are responsible for reducing the loss and providing the most accurate results possible. The weights are initialized using some initialization strategy and are updated in each epoch according to the update equation. The best results are achieved using an optimization strategy or algorithm called an optimizer.
The different types of optimizers used to minimize the loss function are
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSprop
9. Adam
Gradient descent simply means updating the weights iteratively to descend the slope of the
error curve until we get to the point with minimum error as shown in 2D visualization below.
Fig 9.e: Gradient descent
The GD hyperparameters are the learning rate and the batch size. To visualize how gradient descent works, we plot the error function in a 3D graph (figure below) and go through the process step by step. The random initial weight (starting weight) is at point A, and our goal is to descend this error mountain to the w1 and w2 weight values that produce the minimum error value. The way we do that is by taking a series of steps down the curve until we reach the minimum error. In order to descend the error mountain, we need to determine two things for each step: the step direction (gradient) and the step size (learning rate) (Fig 9.e).
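An illustrative gradient descent sketch on a simple, made-up error surface E(w1, w2) = w1² + w2², showing the repeated downhill steps (direction from the gradient, size from the learning rate):

# Gradient descent: start at a random point A and step downhill toward the minimum.
import numpy as np

def error(w):
    return w[0] ** 2 + w[1] ** 2

def gradient(w):
    return np.array([2 * w[0], 2 * w[1]])   # dE/dw1, dE/dw2

w = np.array([3.0, -4.0])    # point A: random initial weights
learning_rate = 0.1          # step size

for step in range(50):
    w = w - learning_rate * gradient(w)     # step in the direction of steepest descent

print("final weights:", w, "error:", error(w))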
iii) Mini-batch GD is a compromise between batch and stochastic, using neither all the data
nor a single instance. Instead, it takes a group of training instances (called a mini-batch),
computes the gradient on them and updates the weights, and then repeats until it processes all
the training data. In most cases, MB-GD is a good starting point.
D. What is Backpropagation?
Backpropagation is the core of how neural networks learn. Up until this point, you learned
that training a neural network typically happens by the repetition of the following three steps:
• Feedforward: get the linear combination (weighted sum) and apply the activation function to get the output prediction (ŷ):
ŷ = σ(W(3) · σ(W(2) · σ(W(1) · x)))
• Compare the prediction with the label to calculate the error or loss function:
E(W, b) = |ŷi – yi|
• Use a gradient descent optimization algorithm to compute the Δw that optimizes the
error function:
Δwi = –α dE/dwi
• Backpropagate the Δw through the network to update the weights:
Backpropagation, or the backward pass, means propagating derivatives of the error with respect to each specific weight, dE/dwi, from the last layer (output) back to the first layer (inputs) to adjust the weights. By propagating the change in weights Δw backward from the prediction node (ŷ) all the way through the hidden layers and back to the input layer, the weights get updated:
w_next-step = w_current + Δw
This will take the error one step down the error mountain. Then the cycle starts again (steps 1
to 3) to update the weights and take the error another step down, until we get to the minimum
error.
Backpropagation might sound clearer when we have only one weight. We simply adjust the
weight by adding Δw to the old weight: w_new = w – α dE/dw.
But it gets complicated when we have a multilayer perceptron (MLP) network with many
weight variables. To make this clearer, consider the scenario in figure 2.35.
How do we compute the change of the total error with respect to a weight such as w13, that is, dE/dw13? In other words, how much will the total error change when we change the parameter w13? We can compute this by applying the derivative rules to the error function. That is straightforward for a weight like w21 that is directly connected to the error function. But to compute the derivatives of the total error with respect to the weights all the way back to the input, we need a calculus rule called the chain rule.
Figure 10.i shows how backpropagation uses the chain rule to flow the gradients in the
backward direction through the network. Let’s apply the chain rule to calculate the derivative
of the error with respect to the third weight on the first input w1,3 (1) , where the (1) means
layer 1, and w1,3 means node number 1 and weight number 3:
Chain rule:
The error backpropagated to the edge w1,3(1) = effect of error on edge 4 · effect on edge 3 · effect on edge 2 · effect on target edge.
Thus, the backpropagation technique is used by neural networks to update the weights to
solve the best fit problem.
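To make the chain-rule flow concrete, here is a tiny illustrative network (one input, two sigmoid neurons in series) where the error derivative for the first weight is assembled by multiplying local derivatives backward through the network; all values are made up:

# Backpropagation on a tiny network: x -> sigmoid(w1*x) -> sigmoid(w2*a1) -> error.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 1.0                # input and label
w1, w2, lr = 0.8, -0.4, 0.5    # initial weights and learning rate

for epoch in range(100):
    # Forward pass (feedforward)
    a1 = sigmoid(w1 * x)           # hidden activation
    y_hat = sigmoid(w2 * a1)       # prediction
    E = 0.5 * (y_hat - y) ** 2     # squared error

    # Backward pass (chain rule): multiply local derivatives from the output back.
    dE_dyhat = (y_hat - y)
    dyhat_dz2 = y_hat * (1 - y_hat)
    dE_dw2 = dE_dyhat * dyhat_dz2 * a1
    dE_dw1 = dE_dyhat * dyhat_dz2 * w2 * a1 * (1 - a1) * x

    # Weight update: w_next = w_current - lr * dE/dw  (i.e. w_current + Δw)
    w2 -= lr * dE_dw2
    w1 -= lr * dE_dw1

print("trained weights:", w1, w2, "final error:", E)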