Unit 1 - Deep Learning
Probability Theory (Part B)
First, the laws of probability tell us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory. Second, we can use probability and statistics to theoretically analyze the behavior of proposed AI systems. Probability theory is a fundamental tool of many disciplines of science and engineering. Probability theory allows us to make uncertain statements and to reason in the presence of uncertainty, while information theory allows us to quantify the amount of uncertainty in a probability distribution. There are three possible sources of uncertainty:
1. Inherent stochasticity in the system being modeled. For example, most interpretations of
quantum mechanics describe the dynamics of subatomic particles as being probabilistic. We can
also create theoretical scenarios that we postulate to have random dynamics, such as a
hypothetical card game where we assume that the cards are truly shuffled into a random order.
2. Incomplete observability. Even deterministic systems can appear stochastic when we cannot
observe all of the variables that drive the behavior of the system. For example, in the Monty Hall
problem, a game show contestant is asked to choose between three doors and wins a prize held
behind the chosen door. Two doors lead to a goat while a third leads to a car. The outcome given
the contestant’s choice is deterministic, but from the contestant’s point of view, the outcome is
uncertain.
3. Incomplete modeling. When we use a model that must discard some of the information we
have observed, the discarded information results in uncertainty in the model’s predictions. For
example, suppose we build a robot that can exactly observe the location of every object around
it. If the robot discretizes space when predicting the future location of these objects, then the
discretization makes the robot immediately become uncertain about the precise position of
objects: each object could be anywhere within the discrete cell that it was observed to occupy. In
many cases, it is more practical to use a simple but uncertain rule rather than a complex but
certain one, even if the true rule is deterministic and our modeling system has the fidelity to
accommodate a complex rule. For example, the simple rule “Most birds fly” is cheap to develop
and is broadly useful, while a rule of the form, “Birds fly, except for very young birds that have not
yet learned to fly, sick or injured birds that have lost the ability to fly, flightless species of birds
including the cassowary, ostrich and kiwi. . .” is expensive to develop, maintain and communicate.
Random Variables
A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lowercase letter in plain typeface, and the values it can take on with lowercase script letters. For example, x1 and x2 are both possible values that the random variable x can take on. For vector-valued variables, the random variable is conventionally written in bold type (x) and one of its values in italic (x). On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states are. Random variables may be discrete or continuous. A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily the integers; they can also just be named states that are not considered to have any numerical value. A continuous random variable is associated with a real value.
Probability Distributions
A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital P. Often we associate each random variable with a different probability mass function; the PMF maps from a state of a random variable to the probability of that random variable taking on that state. A probability distribution over continuous variables is instead described using a probability density function (PDF). A probability density function p(x) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.
1. Marginal probability:
Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution. For example, suppose we have discrete random variables x and y, and we know P(x, y). We can find P(x) with the sum rule:

∀x ∈ x, P(x = x) = Σ_y P(x = x, y = y).

The name “marginal probability” comes from the process of computing marginal probabilities on paper. When the values of P(x, y) are written in a grid with different values of x in rows and different values of y in columns, it is natural to sum across a row of the grid, then write P(x) in the margin of the paper just to the right of the row. For continuous variables, we need to use integration instead of summation:

p(x) = ∫ p(x, y) dy.
2. Conditional probability:
In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that y = y given x = x as P(y = y | x = x). This conditional probability can be computed with the formula

P(y = y | x = x) = P(y = y, x = x) / P(x = x).

The conditional probability is only defined when P(x = x) > 0. We cannot compute the conditional probability conditioned on an event that never happens. It is important not to confuse conditional probability with computing what would happen if some action were undertaken. The conditional probability that a person is from Germany given that they speak German is quite high, but if a randomly selected person is taught to speak German, their country of origin does not change. Computing the consequences of an action is called making an intervention query.

3. The chain rule of conditional probabilities:
Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable:

P(x(1), . . . , x(n)) = P(x(1)) ∏_{i=2}^{n} P(x(i) | x(1), . . . , x(i−1)).

This observation is known as the chain rule or product rule of probability. It follows immediately from the definition of conditional probability given above.
Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:

∀x ∈ x, y ∈ y: p(x = x, y = y) = p(x = x) p(y = y).

Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:

∀x ∈ x, y ∈ y, z ∈ z: p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | z = z).

We can denote independence and conditional independence with compact notation: x⊥y means that x and y are independent, while x⊥y | z means that x and y are conditionally independent given z.
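These definitions can be checked numerically. Below is a minimal Python sketch, assuming a hypothetical joint distribution P(x, y) over two binary variables (the numbers are purely illustrative):

import numpy as np

# Hypothetical joint distribution P(x, y): rows index values of x,
# columns index values of y, matching the grid described above.
P_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
assert np.isclose(P_xy.sum(), 1.0)      # a valid distribution sums to 1

# Marginal probability: sum across each row to obtain P(x) "in the margin".
P_x = P_xy.sum(axis=1)                  # P(x) = sum over y of P(x, y)
P_y = P_xy.sum(axis=0)                  # P(y) = sum over x of P(x, y)

# Conditional probability: P(y | x) = P(x, y) / P(x), defined when P(x) > 0.
P_y_given_x = P_xy / P_x[:, None]

# Independence check: x and y are independent iff P(x, y) = P(x) P(y).
print(np.allclose(P_xy, np.outer(P_x, P_y)))   # False for this table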
-----------------------------------------------------------------------------------------------------------------------
Neural networks are a type of function that connects inputs with outputs. In theory, neural networks should be able to approximate any function, no matter how complex it is.
Supervised learning entails learning a function that maps a given X to a specified Y and then utilising that function to determine the proper Y for a fresh X. If that’s the case, how do neural networks differ from typical machine learning methods? The answer is inductive bias. The term may sound new, but it is nothing more than the set of assumptions we make about the relationship between X and Y before applying a machine learning model.
A linear relationship between X and Y is the inductive bias of linear regression. As a result, it fits
the data to a line or a hyperplane.
However, depending on the function’s complexity, we may need to manually set the number of
neurons in each layer and the total number of layers in the network. This is usually accomplished
through trial and error methods as well as experience. As a result, these parameters are referred
to as hyperparameters.
A weight is assigned to each input to an artificial neuron. First, the inputs are multiplied by their weights, and then a bias is added to the result; this is called the weighted sum. After that, the weighted sum is passed through an activation function, which is a non-linear function.
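As a minimal sketch of this computation in Python (the input values, weights, and the choice of tanh are illustrative):

import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias    # the weighted sum plus bias
    return np.tanh(z)                     # non-linear activation function

x = np.array([0.5, -1.2, 3.0])            # three example inputs
w = np.array([0.4, 0.1, -0.6])            # one weight per input
b = 0.2                                   # bias term
print(neuron(x, w, b))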
The first layer is the input layer, which appears to have six neurons but is really just the data that is sent into the neural network. The output layer is the final layer. The dataset and the type of problem determine the number of neurons in the final layer and the first layer. The number of hidden layers and the number of neurons in each hidden layer are determined by trial and error.

All of the inputs from the previous layer are connected to the first neuron of the first hidden layer. The second neuron in the first hidden layer is connected to all of the preceding layer’s inputs, and so forth for all of the first hidden layer’s neurons. The outputs of the previous hidden layer are regarded as inputs for the neurons in the second hidden layer, and each of these neurons is connected to all of the preceding neurons.
The neural network can compare the outputs of its nodes with the desired values using a property known as the delta rule, allowing the network to alter its weights through training to create more accurate output values. This training and learning procedure is carried out by gradient descent. The technique for updating weights in multi-layered perceptrons is virtually the same; however, the process is referred to as back-propagation. In such circumstances, the output values provided by the final layer are used to adjust each hidden layer inside the network.
Let us create sample weights to be applied at the input layer and at the first and second hidden layers:
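The weight-creation code itself is not reproduced in these notes; the following is a plausible reconstruction consistent with the forward-propagation code below (six input features and six neurons in each hidden layer; the output size of two is an assumption):

import numpy as np

np.random.seed(0)                                       # for reproducibility
n_input, n_hidden1, n_hidden2, n_output = 6, 6, 6, 2    # output size assumed

# Small random weights and zero biases for each layer
W1 = np.random.randn(n_input, n_hidden1) * 0.01
b1 = np.zeros((1, n_hidden1))
W2 = np.random.randn(n_hidden1, n_hidden2) * 0.01
b2 = np.zeros((1, n_hidden2))
W3 = np.random.randn(n_hidden2, n_output) * 0.01
b3 = np.zeros((1, n_output))

X = np.random.randn(5, n_input)                         # a dummy batch of 5 examples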
Python code implementation for the propagation of the input signal through the different layers towards the output layer:
# Forward propagation of input signals to the 6 neurons in the first hidden
# layer; the activation is calculated with the tanh function
z1 = X.dot(W1) + b1
a1 = np.tanh(z1)

# Forward propagation of activation signals from the first hidden layer to
# the 6 neurons in the second hidden layer; the activation is again tanh
z2 = a1.dot(W2) + b2
a2 = np.tanh(z2)

# Forward propagation of activation signals from the second hidden layer to
# the output layer
z3 = a2.dot(W3) + b3
We’ve seen how the computation works so far. But the major purpose of this section is to explain why this strategy works. Neural networks should theoretically be able to approximate any continuous function, no matter how complex or non-linear it is.
When two or more linear objects, such as a line, plane, or hyperplane, are combined, the
outcome is also a linear object: line, plane, or hyperplane. No matter how many of these linear
things we add, we’ll still end up with a linear object.
However, this is not the case when adding non-linear objects. When two separate curves are
combined, the result is likely to be a more complex curve.
We’re introducing non-linearity at every layer using these activation functions: rather than just adding linear objects like hyperplanes, we are adding non-linear objects, or hyper-curves. In other words, we’re applying a non-linear function to an already non-linear object. If neural networks didn’t have an activation function, they’d just be one huge linear unit that a single linear regression model could easily replace:

a = m*x + d
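A quick numerical check of this claim (the layer shapes are illustrative): two stacked linear layers with no activation between them are exactly equivalent to a single linear layer.

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))

# Two stacked linear layers with no activation in between...
W1, W2 = rng.standard_normal((3, 5)), rng.standard_normal((5, 2))
b1, b2 = rng.standard_normal(5), rng.standard_normal(2)
two_layers = (X @ W1 + b1) @ W2 + b2

# ...collapse into one linear layer with combined weights and bias.
W, b = W1 @ W2, b1 @ W2 + b2
print(np.allclose(two_layers, X @ W + b))       # True: still linear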
Feed-forward refers to a neural network architecture where information flows in one direction, from input to output, with no feedback loops. A deep feed-forward network, commonly known as a deep neural network, consists of multiple hidden layers between the input and output layers, enabling the network to learn complex hierarchical features and patterns and enhancing its ability to model intricate relationships in data.
--------------------------------------------------------------------------------------------------------------------------
Stochastic gradient descent is a very popular and common algorithm used in various machine learning methods; most importantly, it forms the basis of training neural networks. This section explains it in detail, yet in simple terms. It is worth reviewing linear regression before proceeding.
Gradient, in plain terms, means the slope or slant of a surface. So gradient descent literally means descending a slope to reach the lowest point on that surface. Let us imagine a two-dimensional graph, such as the parabola in the figure below.
In the above graph, the lowest point on the parabola occurs at x = 1. The objective of gradient
descent algorithm is to find the value of “x” such that “y” is minimum. “y” here is termed as the
objective function that the gradient descent algorithm operates upon, to descend to the lowest
point.
The objective of regression is to minimize the sum of squared residuals. We know that a function reaches its minimum value when the slope is equal to 0. By using this technique, we solved the linear regression problem and learnt the weight vector. The same problem can also be solved by the gradient descent technique.
“Gradient descent is an iterative algorithm, that starts from a random point on a function and
travels down its slope in steps until it reaches the lowest point of that function.”
This algorithm is useful in cases where the optimal points cannot be found by equating the slope
of the function to 0. In the case of linear regression, you can mentally map the sum of squared
residuals as the function “y” and the weight vector as “x” in the parabola above.
This is the crux of the algorithm. The general idea is to start with a random point (in our parabola
example start with a random “x”) and find a way to update this point with each iteration such
that we descend the slope.
1. Find the slope of the objective function with respect to each parameter/feature; in other words, compute the gradient of the function. (To clarify, in the parabola example, differentiate “y” with respect to “x”. If we had more features like x1, x2 etc., we would take the partial derivative of “y” with respect to each of the features.)
2. Pick a random initial value for the parameters.
3. Update the gradient function by plugging in the current parameter values.
4. Calculate the step size for each feature as: step size = gradient * learning rate.
5. Calculate the new parameters as: new params = old params − step size.
6. Repeat steps 3 to 5 until the gradient is almost 0, as shown in the sketch after this list.
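Here is a minimal sketch of these steps in Python, assuming the parabola in the figure is y = (x − 1)² (the exact function behind the figure is an assumption):

def dy_dx(x):
    return 2 * (x - 1)          # step 1: gradient of y with respect to x

x = 5.0                          # step 2: a random initial value
learning_rate = 0.01
for _ in range(1000):            # step 6: repeat until (almost) converged
    gradient = dy_dx(x)          # step 3: plug the current x into the gradient
    step_size = gradient * learning_rate    # step 4
    x = x - step_size            # step 5
print(x)                         # approaches the minimum at x = 1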
The “learning rate” mentioned above is a flexible parameter which heavily influences the convergence of the algorithm. Larger learning rates make the algorithm take huge steps down the slope, and it might jump across the minimum point, thereby missing it. So, it is always good to stick to a low learning rate such as 0.01. It can also be shown mathematically that the gradient descent algorithm takes larger steps down the slope when the starting point is far above the minimum, and smaller and smaller steps as it gets closer, so that it is quick at first yet careful not to miss the destination.
There are a few downsides of the gradient descent algorithm. We need to take a closer look at
the amount of computation we make for each iteration of the algorithm.
Say we have 10,000 data points and 10 features. The sum of squared residuals consists of as many terms as there are data points, so 10,000 terms in our case. We need to compute the derivative of this function with respect to each of the features, so in effect we will be doing 10,000 * 10 = 100,000 computations per iteration. It is common to take 1,000 iterations, so in effect we have 100,000 * 1,000 = 100,000,000 computations to complete the algorithm. That is a substantial overhead, and hence gradient descent is slow on huge datasets.
The cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function. For example, the negative conditional log-likelihood of the training data can be written as

J(θ) = (1/m) Σ_{i=1}^{m} L(x(i), y(i), θ),

where L is the per-example loss L(x, y, θ) = −log p(y | x; θ). Gradient descent then requires computing the gradient

∇_θ J(θ) = (1/m) Σ_{i=1}^{m} ∇_θ L(x(i), y(i), θ).
The computational cost of this operation is O(m). As the training set size grows to billions of examples, the time to take a single gradient step becomes prohibitively long. The insight of SGD is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of examples B = {x(1), . . . , x(m′)} drawn uniformly from the training set. The minibatch size m′ is typically chosen to be a relatively small number of examples, ranging from one to a few hundred. Crucially, m′ is usually held fixed as the training set size m grows. We may fit a training set with billions of examples using updates computed on only a hundred examples. The estimate of the gradient is formed as

g = (1/m′) ∇_θ Σ_{i=1}^{m′} L(x(i), y(i), θ)

using examples from the minibatch B. The stochastic gradient descent algorithm then follows the estimated gradient downhill:

θ ← θ − εg,

where ε is the learning rate.
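A minimal numpy sketch of this procedure, using linear regression with squared loss as an assumed example problem (the sizes and the learning rate are illustrative):

import numpy as np

rng = np.random.default_rng(0)
m, n_features, m_prime = 10_000, 10, 100      # training set size, minibatch size m'
X = rng.standard_normal((m, n_features))
true_w = rng.standard_normal(n_features)
y = X @ true_w + 0.1 * rng.standard_normal(m)

w = np.zeros(n_features)
epsilon = 0.01                                # learning rate
for step in range(1000):
    idx = rng.integers(0, m, size=m_prime)    # sample a minibatch B uniformly
    Xb, yb = X[idx], y[idx]
    g = (2.0 / m_prime) * Xb.T @ (Xb @ w - yb)   # minibatch gradient estimate
    w = w - epsilon * g                       # follow the estimate downhill
print(np.allclose(w, true_w, atol=0.1))       # w should approach true_w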
--------------------------------------------------------------------------------------------------------------------------
Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x. We usually phrase most optimization problems in terms of minimizing f(x). Maximization may be accomplished via a minimization algorithm by minimizing −f(x). The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. Here we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms.
We assume the reader is already familiar with calculus, but provide a brief review of how calculus concepts relate to optimization here. Suppose we have a function y = f(x), where both x and y are real numbers. The derivative of this function is denoted f′(x) or dy/dx. The derivative gives the slope of f(x) at the point x. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output:

f(x + ε) ≈ f(x) + εf′(x).

The derivative is therefore useful for minimizing a function because it tells us how to change x in order to make a small improvement in y. For example, we know that f(x − ε sign(f′(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative. When f′(x) = 0, the derivative provides no information about which direction to move. Points where f′(x) = 0 are known as critical points or stationary points. A local minimum is a point where f(x) is lower than at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point where f(x) is higher than at all neighboring points, so it is not possible to increase f(x) by making infinitesimal steps.
A point that obtains the absolute lowest value of f (x) is a global minimum. It is possible for there
to be only one global minimum or multiple global minima of the function. It is also possible for
there to be local minima that are not globally optimal. In the context of deep learning, we optimize
functions that may have many local minima that are not optimal, and many saddle points
surrounded by very flat regions. All of this makes optimization very difficult, especially when the
input to the function is multidimensional. We therefore usually settle for finding a value of f that
is very low, but not necessarily minimal in any formal sense.
-------------------------------------------------------------------------------------------------------------------------
Learning Algorithm:
A machine learning algorithm is an algorithm that is able to learn from data. “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” One can imagine a very wide variety of experiences E, tasks T, and performance measures P, and we do not attempt here to provide a formal definition of what may be used for each of these entities. Instead, the following sections provide intuitive descriptions and examples of the different kinds of tasks, performance measures and experiences that can be used to construct machine learning algorithms.
The Task, T
Machine learning allows us to tackle tasks that are too difficult to solve with fixed programs
written and designed by human beings. From a scientific and philosophical point of view, machine
learning is interesting because developing our understanding of machine learning entails
developing our understanding of the principles that underlie intelligence. In this relatively formal
definition of the word “task,” the process of learning itself is not the task. Learning is our means
of attaining the ability to perform the task. For example, if we want a robot to be able to walk, then walking is the task. We could program the robot to learn to walk, or we could attempt to write a program that specifies how to walk directly.
Machine learning tasks are usually described in terms of how the machine learning system should
process an example. An example is a collection of features that have been quantitatively
measured from some object or event that we want the machine learning system to process. We
typically represent an example as a vector x ∈ Rn where each entry xi of the vector is another
feature. For example, the features of an image are usually the values of the pixels in the image.
1. Classification: In this type of task, the computer program is asked to specify which of k
categories some input belongs to. To solve this task, the learning algorithm is usually asked
to produce a function f : Rn → {1, . . . , k}. When y = f (x), the model assigns an input
described by vector x to a category identified by numeric code y. There are other variants
of the classification task, for example, where f outputs a probability distribution over
classes. An example of a classification task is object recognition, where the input is an image
(usually described as a set of pixel brightness values), and the output is a numeric code
identifying the object in the image.
2. Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. In order to solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions. Each function corresponds to classifying x with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all of the relevant variables, then solve the classification task by marginalizing out the missing variables. With n input variables, we can now obtain all 2^n different classification functions needed for each possible set of missing inputs, but we only need to learn a single function describing the joint probability distribution.
3. Regression : In this type of task, the computer program is asked to predict a numerical value
given some input. To solve this task, the learning algorithm is asked to output a function f :
Rn → R. This type of task is similar to classification, except that the format of output is
different. An example of a regression task is the prediction of the expected claim amount
that an insured person will make (used to set insurance premiums), or the prediction of
future prices of securities. These kinds of predictions are also used for algorithmic trading.
4. Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format). Google Street View uses deep learning to process address numbers in this way (Goodfellow et al., 2014d). Another example is speech recognition, where the computer program is provided an audio waveform and emits a sequence of characters or word ID codes describing the words that were spoken in the audio recording. Deep learning is a crucial component of modern speech recognition systems.
5. Machine translation: In a machine translation task, the input already consists of a sequence of
symbols in some language, and the computer program must convert this into a sequence of
symbols in another language. This is commonly applied to natural languages, such as
translating from English to French. Deep learning has recently begun to have an important
impact on this kind of task.
6. Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category, and subsumes the transcription and translation tasks described above, but also many other tasks. One example is parsing: mapping a natural language sentence into a tree that describes its grammatical structure by tagging nodes of the tree as being verbs, nouns, or adverbs, and so on. See Collobert (2011) for an example of deep learning applied to a parsing task. Another example is pixel-wise segmentation of images, where the computer program assigns every pixel in an image to a specific category.
7. Anomaly detection: In this type of task, the computer program sifts through a set of events or objects, and flags some of them as being unusual or atypical. An example of an anomaly detection task is credit card fraud detection. By modeling your purchasing habits,
a credit card company can detect misuse of your cards. If a thief steals your credit card or
credit card information, the thief’s purchases will often come from a different probability
distribution over purchase types than your own. The credit card company can prevent fraud
by placing a hold on an account as soon as that card has been used for an uncharacteristic
purchase.
8. Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data. Synthesis and sampling via machine learning can be useful for media applications where it can be expensive or boring for an artist to generate large volumes of content by hand. For example, video games can automatically generate textures for large objects or landscapes. In some cases, we want the sampling or synthesis procedure to generate some specific kind of output given the input. For example, in a speech synthesis task, we provide a written sentence and ask the program to emit an audio waveform containing a spoken version of that sentence. This is a kind of structured output task, but with the added qualification that there is no single correct output for each input, and we explicitly desire a large amount of variation in the output, in order for the output to seem more natural and realistic.
9. Imputation of missing values: In this type of task, the machine learning algorithm is given a new example x ∈ Rn, but with some entries xi of x missing. The algorithm must provide a prediction of the values of the missing entries.
10. Denoising: In this type of task, the machine learning algorithm is given as input a corrupted example x̃ ∈ Rn obtained by an unknown corruption process from a clean example x ∈ Rn. The learner must predict the clean example x from its corrupted version x̃, or more generally predict the conditional probability distribution p(x | x̃).
--------------------------------------------------------------------------------------------------------------------
Challenges Motivating Deep Learning
The simple machine learning algorithms described above work well on a wide variety of important problems. They have not succeeded, however, in solving the central problems in AI, such as recognizing speech or recognizing objects.
The development of deep learning was motivated in part by the failure of traditional algorithms
to generalize well on such AI tasks. This section is about how the challenge of generalizing to new
examples becomes exponentially more difficult when working with high-dimensional data, and
how the mechanisms used to achieve generalization in traditional machine learning are
insufficient to learn complicated functions in high-dimensional spaces. Such spaces also often
impose high computational costs. Deep learning was designed to overcome these and other
obstacles.
Many machine learning problems become exceedingly difficult when the number of dimensions
in the data is high. This phenomenon is known as the curse of dimensionality. Of particular
concern is that the number of possible distinct configurations of a set of variables increases
exponentially as the number of variables increases. The curse of dimensionality arises in many
places in computer science, especially in machine learning.
A statistical challenge arises because the number of possible configurations of x is much larger
than the number of training examples. To understand the issue, let us consider that the input space is organized into a grid. We can describe low-dimensional space with a small
number of grid cells that are mostly occupied by the data. When generalizing to a new data point,
we can usually tell what to do simply by inspecting the training examples that lie in the same cell
as the new input. For example, if estimating the probability density at some point x, we can just
return the number of training examples in the same unit volume cell as x, divided by the total
number of training examples.
Local Constancy and Smoothness Regularization
To generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. We have seen
these priors incorporated as explicit beliefs in the form of probability distributions over
parameters of the model. More informally, we may also discuss prior beliefs as directly influencing
the function itself and influencing the parameters only indirectly, as a result of the relationship
between the parameters and the function. Additionally, we informally discuss prior beliefs as
being expressed implicitly by choosing algorithms that are biased toward choosing some class of
functions over another, even though these biases may not be expressed (or even be possible to
express) in terms of a probability distribution representing our degree of belief in various
functions.
Among the most widely used of these implicit “priors” is the smoothness prior, or local constancy
prior. This prior states that the function we learn should not change very much within a small
region.
There are many different ways to implicitly or explicitly express a prior belief that the learned function should be smooth or locally constant. All these different methods are designed to encourage the learning process to learn a function f∗ that satisfies the condition

f∗(x) ≈ f∗(x + ε)

for most configurations x and small change ε. In other words, if we know a good answer for an input x (for example, if x is a labeled training example), then that answer is probably good in the neighborhood of x. If we have several good answers in some neighborhood, we would combine them (by some form of averaging or interpolation) to produce an answer that agrees with as many of them as much as possible.
Manifold Learning (Part A)
An important concept underlying many ideas in machine learning is that of a manifold. A manifold is a connected region. Mathematically, it is a set of points associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space. In everyday life, we experience the surface of the world as a 2-D plane, but it is in fact a spherical manifold in 3-D space.

The first observation in favor of the manifold hypothesis is that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated. Uniform noise essentially never resembles structured inputs from these domains.
-----------------------------------------------------------------------------------------------------------------------------
WHAT IS REGULARIZATION? (Part B)
A bias vs. variance tradeoff graph sheds a bit more light on the nuances of this topic and its demarcation. Fitting F2, our ML model, onto F1, our true data generation process, is almost like fitting a square-shaped toy into a round hole: we can only hope for a close approximation. In practical deep learning training and scenarios, we mostly find that the best-fitting model (in the sense of least generalization error) is often a large model that has been regularized appropriately.
We will now dive deep to study one type of regularization technique that helps to create a large but deeply regularized model using parameter-based penalties. Under this kind of regularization technique, the capacity of models like neural networks, linear regression, or logistic regression is limited by adding a parameter norm penalty Ω(θ) to the objective function J:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ),

where α ∈ [0, ∞) is a hyperparameter that weights the contribution of the norm penalty term Ω relative to the standard objective function J.
This type of regularization penalizes only the weights of the affine transformation at each layer of the network and leaves the biases unregularized. This is done with the notion in mind that it typically requires less data to fit the biases than the weights.
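As a small sketch of this idea, the following function computes the penalized objective J̃ = J + αΩ(w) for a linear model, using mean squared error as an assumed standard objective; note that the bias b is deliberately excluded from the penalty:

import numpy as np

def penalized_objective(w, b, X, y, alpha, norm="l2"):
    # J~ = J + alpha * Omega(w); the bias b is left unregularized
    residual = X @ w + b - y
    J = np.mean(residual ** 2)            # standard objective (MSE, assumed)
    if norm == "l1":
        omega = np.sum(np.abs(w))         # L1 penalty: sum of absolute weights
    else:
        omega = 0.5 * np.sum(w ** 2)      # L2 penalty: half sum of squared weights
    return J + alpha * omega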
L1 Parameter Regularization:
Lasso shrinks the less important features’ coefficients to zero, thus removing some features altogether. So, this works well for feature selection in case we have a huge number of features.

The L1 regularizer basically looks for parameter vectors that minimize the norm of the parameter vector (the length of the vector). This is essentially the problem of how to optimize the parameters of a single neuron, a single-layer neural network in general, and a single-layer feed-forward neural network in particular. In effect, it tries to keep as small a parameter vector as possible. A key point that should be noted for L1 regularization is that it can be sensitive to outliers. To understand this, let us go through the following example of what it means when an algorithm is said to be sensitive to outliers:
1. Suppose we are trying to classify images of birds of different species, and we have a neural network with a few hundred parameters.
2. We find an image of a bird of one species, which we have no reason to believe is of any different species from all the others.
3. We add this image to the training set and try to train the neural network. This is like throwing an outlier into the mix: the training procedure picks the point up at the edge of the feature space, but by the time the separating hyperplane has been fit, the point lies quite far from the plane and is hence an outlier.
Along with shrinking coefficients, the lasso performs feature selection as well (remember the ‘selection’ in the lasso full form: Least Absolute Shrinkage and Selection Operator). Because some of the coefficients become exactly zero, the corresponding features are excluded from the model.
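To illustrate this sparsity, here is a minimal sketch using scikit-learn’s Lasso on synthetic data in which only two of ten features actually matter (the data and the value of alpha are illustrative):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)    # most coefficients are driven exactly to zero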
L2 Parameter Regularization:
Ridge (L2) regularization adds the sum of the squared weights, scaled by a hyperparameter lambda, to the objective. Ordinary Least Squares, or OLS, is a statistical model which also helps us in identifying the more significant features that have a heavy influence on the output. But if lambda is very large, then it will add too much weight to the penalty, and it will lead to under-fitting. Important points to be considered about L2 are listed below:
1. Ridge regularization forces the weights to be small but does not make them
zero and does not give the sparse solution.
2. Ridge is not robust to outliers as square terms blow up the error differences of
the outliers, and the regularization term tries to fix it by penalizing the weights.
3. Ridge regression performs better when all the input features influence the output and all the weights are of roughly equal size.
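For contrast with the lasso sketch above, the same synthetic setup can be fit with scikit-learn’s Ridge; the coefficients shrink but none become exactly zero (data and alpha are again illustrative):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(200)

ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)    # small but non-zero weights: shrinkage without sparsity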
-----------------------------------------------------------------------------------------------------------------------------
WHAT IS OVERFITTING? (Part B)
When a model performs very well for training data but has poor performance with
test data (new data), it is known as overfitting. In this case, the machine learning
model learns the details and noise in the training data such that it negatively affects
the performance of the model on test data. Overfitting can happen due to low bias
and high variance.
Reasons for overfitting:
• Data used for training is not cleaned and contains noise (garbage values) in it
• The model has a high variance
• The size of the training dataset used is not enough
• The model is too complex
WHAT IS UNDERFITTING? (Part B)
When a model has not learned the patterns in the training data well and is unable
to generalize well on the new data, it is known as underfitting. An underfit model
has poor performance on the training data and will result in unreliable predictions.
Underfitting occurs due to high bias and low variance.
Reasons for underfitting:
• Data used for training is not cleaned and contains noise (garbage values) in it
• The model has a high bias
• The size of the training dataset used is not enough
• The model is too simple
Now that you have understood what overfitting and underfitting are, let’s see what a good fit model looks like.
Example:
In the above plot, x is the independent variable, and y is the dependent variable. The objective is to fit a regression line y = mx + c to the data. This line (the model) is then used to predict the y-value for unseen values of x. Here, m is the slope and c is the intercept of the line. These two parameters (m and c) are estimated by fitting a straight line to the data by minimizing the RMSE (root mean squared error). Hence, these parameters are called the model parameters.
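As a minimal sketch, the parameters m and c can be estimated by least squares with numpy (the data points here are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, c = np.polyfit(x, y, deg=1)    # fit the line y = m*x + c by least squares
print(m, c)
print(m * 6.0 + c)                # predict y for an unseen value of x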
Example:
In the above plot, the x-axis represents the number of epochs and the y-axis represents the accuracy. We can see that after a certain point, as the number of epochs grows, the training accuracy keeps increasing but the validation and test accuracy start decreasing. This is a case of overfitting. Here the number of epochs is a hyperparameter and is set manually. Setting this number to a small value may cause underfitting, and a high value may cause overfitting.
Model parameters are required for making predictions; hyperparameters are required for estimating the model parameters.
1. Model Parameters: These are the parameters in the model that must be determined using the training data set. These are the fitted parameters. For example, in linear regression the prediction takes the form ŷ = w_0 + w_1 x_1 + w_2 x_2 + . . . + w_m x_m, where X is the predictor matrix and w are the weights. Here w_0, w_1, w_2, . . ., w_m are the model parameters. If the model uses the gradient descent algorithm to minimize the objective function in order to determine the weights w_0, w_1, w_2, . . ., w_m, then we can have an optimizer such as GradientDescent(eta, n_iter). Here eta (the learning rate) and n_iter (the number of iterations) are the hyperparameters that would have to be adjusted in order to obtain the best values for the model parameters.
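To make the distinction concrete, here is a sketch of an optimizer in the spirit of the hypothetical GradientDescent(eta, n_iter) above; the class name and signature come from the illustration in the text, not from a real library:

import numpy as np

class GradientDescent:
    # Hypothetical optimizer: eta and n_iter are hyperparameters set by hand.
    def __init__(self, eta=0.01, n_iter=1000):
        self.eta = eta                # learning rate
        self.n_iter = n_iter          # number of iterations

    def fit(self, X, y):
        X = np.c_[np.ones(len(X)), X]        # column of ones for the intercept w_0
        self.w_ = np.zeros(X.shape[1])       # model parameters w_0, ..., w_m
        for _ in range(self.n_iter):
            gradient = (2.0 / len(y)) * X.T @ (X @ self.w_ - y)
            self.w_ -= self.eta * gradient   # parameters are learned from the data
        return self

The hyperparameters eta and n_iter are chosen by the practitioner, while the entries of w_ are fitted from the training data.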