
Unit 1 - Deep Learning


RAJALAKSHMI INSTITUTE OF TECHNOLOGY

Kuthambakkam, Chennai

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

UNIT I DEEP NETWORKS BASICS

Linear Algebra: Scalars -- Vectors -- Matrices and tensors; Probability Distributions --
Gradient-based Optimization -- Machine Learning Basics: Capacity -- Overfitting and underfitting
-- Hyperparameters and validation sets -- Estimators -- Bias and variance -- Stochastic gradient
descent -- Challenges motivating deep learning; Deep Networks: Deep feedforward networks;
Regularization -- Optimization.

Probability Theory (Part B)

Probability theory is a mathematical framework for representing uncertain statements. It
provides a means of quantifying uncertainty and axioms for deriving new uncertain statements.
In artificial intelligence applications, we use probability theory in two major ways.

First, the laws of probability tell us how AI systems should reason, so we design our algorithms to
compute or approximate various expressions derived using probability theory.

Second, we can use probability and statistics to theoretically analyze the behavior of proposed AI
systems. Probability theory is a fundamental tool of many disciplines of science and engineering.
While probability theory allows us to make uncertain statements and reason in the presence
of uncertainty, information theory allows us to quantify the amount of uncertainty in a probability
distribution.

There are three possible sources of uncertainty:

1. Inherent stochasticity in the system being modeled. For example, most interpretations of
quantum mechanics describe the dynamics of subatomic particles as being probabilistic. We can
also create theoretical scenarios that we postulate to have random dynamics, such as a
hypothetical card game where we assume that the cards are truly shuffled into a random order.

2. Incomplete observability. Even deterministic systems can appear stochastic when we cannot
observe all of the variables that drive the behavior of the system. For example, in the Monty Hall
problem, a game show contestant is asked to choose between three doors and wins a prize held
behind the chosen door. Two doors lead to a goat while a third leads to a car. The outcome given
the contestant’s choice is deterministic, but from the contestant’s point of view, the outcome is
uncertain.

3. Incomplete modeling. When we use a model that must discard some of the information we
have observed, the discarded information results in uncertainty in the model’s predictions. For
example, suppose we build a robot that can exactly observe the location of every object around
it. If the robot discretizes space when predicting the future location of these objects, then the
discretization makes the robot immediately become uncertain about the precise position of
objects: each object could be anywhere within the discrete cell that it was observed to occupy. In
many cases, it is more practical to use a simple but uncertain rule rather than a complex but
certain one, even if the true rule is deterministic and our modeling system has the fidelity to
accommodate a complex rule. For example, the simple rule “Most birds fly” is cheap to develop
and is broadly useful, while a rule of the form, “Birds fly, except for very young birds that have not
yet learned to fly, sick or injured birds that have lost the ability to fly, flightless species of birds
including the cassowary, ostrich and kiwi. . .” is expensive to develop, maintain and communicate.

Random Variables

A random variable is a variable that can take on different values randomly. We typically denote
the random variable itself with a lower case letter in plain typeface, and the values it can take on
with lower case script letters. For example, $x_1$ and $x_2$ are both possible values that the random
variable $\mathrm{x}$ can take on. For vector-valued variables, we would write the random variable as
$\mathbf{x}$ and one of its values as $\boldsymbol{x}$. On its own, a random variable is just a
description of the states that are possible; it must be coupled with a probability distribution that
specifies how likely each of these states is.

Random variables may be discrete or continuous. A discrete random variable is one that has a finite
or countably infinite number of states. Note that these states are not necessarily the integers; they
can also just be named states that are not considered to have any numerical value. A continuous
random variable is associated with a real value.

Probability Distributions

A probability distribution is a description of how likely a random variable or set of random
variables is to take on each of its possible states. The way we describe probability distributions
depends on whether the variables are discrete or continuous.

Discrete Variables and Probability Mass Functions:

A probability distribution over discrete variables may be described using a probability mass
function (PMF). We typically denote probability mass functions with a capital P. Often we
associate each random variable with a different probability mass function.

When working with continuous random variables, we describe probability distributions using a
probability density function (PDF) rather than a probability mass function. To be a probability
density function, a function p must satisfy the following properties:

• The domain of p must be the set of all possible states of x.

• ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.

• $\int p(x)\,dx = 1$.

A probability density function p(x) does not give the probability of a specific state directly;
instead, the probability of landing inside an infinitesimal region with volume δx is given by
p(x)δx.

The different types of probability are

1. Marginal probability:

Sometimes we know the probability distribution over a set of variables and we want to know the
probability distribution over just a subset of them. The probability distribution over the subset is
known as the marginal probability distribution.

For example, suppose we have discrete random variables x and y, and we know P(x, y). We can find
P(x) with the sum rule:

$$\forall x \in \mathrm{x},\quad P(\mathrm{x} = x) = \sum_{y} P(\mathrm{x} = x, \mathrm{y} = y).$$

The name “marginal probability” comes from the process of computing marginal probabilities on
paper. When the values of P(x, y) are written in a grid with different values of x in rows and
different values of y in columns, it is natural to sum across a row of the grid, then write P(x) in the
margin of the paper just to the right of the row. For continuous variables, we need to use
integration instead of summation:

$$p(x) = \int p(x, y)\,dy.$$
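
As a rough illustration of the sum rule for discrete variables, the sketch below stores a joint distribution as a grid and sums across rows and columns. The table values are made up for illustration, and the names `joint`, `p_x` and `p_y` are not from the notes.

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index values of x, columns index values of y.
joint = np.array([
    [0.10, 0.20, 0.10],   # x = 0
    [0.25, 0.05, 0.30],   # x = 1
])

# Sum rule: summing across each row marginalizes out y and gives P(x),
# the numbers one would write in the right-hand margin of the grid.
p_x = joint.sum(axis=1)

# Summing down each column marginalizes out x and gives P(y).
p_y = joint.sum(axis=0)

print("P(x):", p_x)
print("P(y):", p_y)
```

Both marginals sum to 1 because the joint table itself sums to 1.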



2. Conditional probability:

In many cases, we are interested in the probability of some event, given that some other event
has happened. This is called a conditional probability. We denote the conditional probability that
$\mathrm{y} = y$ given $\mathrm{x} = x$ as $P(\mathrm{y} = y \mid \mathrm{x} = x)$. This conditional
probability can be computed with the formula

$$P(\mathrm{y} = y \mid \mathrm{x} = x) = \frac{P(\mathrm{y} = y, \mathrm{x} = x)}{P(\mathrm{x} = x)}.$$

The conditional probability is only defined when $P(\mathrm{x} = x) > 0$. We cannot compute the
conditional probability conditioned on an event that never happens. It is important not to confuse
conditional probability with computing what would happen if some action were undertaken. The
conditional probability that a person is from Germany given that they speak German is quite high,
but if a randomly selected person is taught to speak German, their country of origin does not change.
Computing the consequences of an action is called making an intervention query.
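
A minimal sketch of the conditional probability computation, assuming the joint distribution is available as a small table (the values and the helper name `conditional_y_given_x` are hypothetical). It divides one row of the joint by the marginal of the conditioning event and refuses to condition on an event with zero probability.

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x, columns index y.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.25, 0.05, 0.30],
])

def conditional_y_given_x(joint, x_index):
    """Return the vector P(y | x = x_index) computed from a joint table."""
    p_x = joint[x_index].sum()          # P(x = x), by the sum rule
    if p_x <= 0:
        raise ValueError("P(x = x) must be positive for the conditional to be defined")
    return joint[x_index] / p_x         # P(y, x) / P(x)

p_y_given_x0 = conditional_y_given_x(joint, 0)
print(p_y_given_x0)
```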

3. Chain Rule of Conditional Probability:

Any joint probability distribution over many random variables may be decomposed into
conditional distributions over only one variable:

$$P(\mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(n)}) = P(\mathrm{x}^{(1)}) \prod_{i=2}^{n} P(\mathrm{x}^{(i)} \mid \mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(i-1)}).$$

This observation is known as the chain rule or product rule of probability. It follows immediately
from the definition of conditional probability.
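
The chain rule can be checked numerically. The sketch below assumes three discrete variables a, b and c with a randomly generated joint table (purely illustrative) and verifies that P(a, b, c) = P(a | b, c) P(b | c) P(c).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution P(a, b, c) over 2 x 3 x 4 states.
joint = rng.random((2, 3, 4))
joint /= joint.sum()

p_c = joint.sum(axis=(0, 1))        # P(c), shape (4,)
p_bc = joint.sum(axis=0)            # P(b, c), shape (3, 4)
p_b_given_c = p_bc / p_c            # P(b | c)
p_a_given_bc = joint / p_bc         # P(a | b, c)

# Multiply the factors back together; broadcasting aligns the trailing axes.
reconstructed = p_a_given_bc * p_b_given_c * p_c

print(np.allclose(reconstructed, joint))
```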

Independence and Conditional Independence

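Two random variables x and y are independent if their probability distribution can be expressed
as a product of two factors, one involving only x and one involving only y:

$$\forall x \in \mathrm{x},\, y \in \mathrm{y},\quad p(\mathrm{x} = x, \mathrm{y} = y) = p(\mathrm{x} = x)\, p(\mathrm{y} = y).$$

Two random variables x and y are conditionally independent given a random variable z if the
conditional probability distribution over x and y factorizes in this way for every value of z:

$$\forall x \in \mathrm{x},\, y \in \mathrm{y},\, z \in \mathrm{z},\quad p(\mathrm{x} = x, \mathrm{y} = y \mid \mathrm{z} = z) = p(\mathrm{x} = x \mid \mathrm{z} = z)\, p(\mathrm{y} = y \mid \mathrm{z} = z).$$

We can denote independence and conditional independence with compact notation: x⊥y means
that x and y are independent, while x⊥y | z means that x and y are conditionally independent
given z.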
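
One way to test independence for a small discrete joint table is to compare it with the outer product of its marginals. This is only a sketch; the function name `is_independent` and both example tables are assumptions made for illustration.

```python
import numpy as np

def is_independent(joint, tol=1e-9):
    """Check whether P(x, y) equals P(x) * P(y) for every pair of states."""
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(p_x, p_y), atol=tol))

# Built as a product of two factors, so x and y are independent by construction.
independent_joint = np.outer([0.3, 0.7], [0.2, 0.5, 0.3])

# Mass concentrated on the diagonal: knowing x tells us a lot about y.
dependent_joint = np.array([[0.4, 0.1],
                            [0.1, 0.4]])

print(is_independent(independent_joint))
print(is_independent(dependent_joint))
```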

-----------------------------------------------------------------------------------------------------------------------

Feed-Forward Network in Deep Learning (Part B)

Neural networks are a type of function that maps inputs to outputs. In theory, a sufficiently
large neural network can approximate any continuous function, no matter how complex it is.

Supervised learning in general, however, means learning a function that maps a given X to a
specified Y and then using that function to predict the proper Y for a new X. If that is the case,
how do neural networks differ from typical machine learning methods? The answer is inductive bias.
The term may be unfamiliar, but it is nothing more than the assumptions we make about the
relationship between X and Y before fitting a machine learning model to the data.

The inductive bias of linear regression is that the relationship between X and Y is linear. As a
result, it fits the data to a line or a hyperplane. Neural networks make much weaker assumptions
about the form of this relationship, which is why they can fit a wide range of functions.

However, depending on the function’s complexity, we may need to manually set the number of
neurons in each layer and the total number of layers in the network. This is usually accomplished
through trial and error as well as experience. Because these values are chosen by hand rather than
learned from the data, they are referred to as hyperparameters.

Neural Network Architecture and Operation


Before we look at why neural networks work, it’s important to understand what neural
networks do. Before we can grasp the design of a neural network, we must first understand
what a neuron does.

A weight is assigned to each input to an artificial neuron. First, the inputs are multiplied by
their weights and summed, and then a bias is added to the result. This is called the weighted sum.
After that, the weighted sum is passed through an activation function, which is a non-linear
function.
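
The computation performed by a single neuron can be sketched in a few lines. The input values, weights and bias below are arbitrary, and tanh is chosen as the activation only because it is the one used later in these notes.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a tanh activation."""
    weighted_sum = np.dot(inputs, weights) + bias
    return np.tanh(weighted_sum)

x = np.array([0.5, -1.0, 2.0])    # hypothetical inputs
w = np.array([0.4, 0.3, -0.2])    # one weight per input
b = 0.1                           # bias

output = neuron(x, w, b)
print(output)
```

Here the weighted sum is 0.2 - 0.3 - 0.4 + 0.1 = -0.4, so the neuron outputs tanh(-0.4).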

The first layer is the input layer, which appears to have six neurons but is simply the data that
is fed into the neural network. The output layer is the final layer. The dataset and the type of
problem determine the number of neurons in the first and final layers. The number of hidden layers
and the number of neurons in each hidden layer are usually determined by trial and error.

All of the inputs from the previous layer are connected to the first neuron of the first hidden
layer. The second neuron in the first hidden layer is likewise connected to all of the preceding
layer’s inputs, and so forth for all of the first hidden layer’s neurons. The outputs of the
previous hidden layer are treated as inputs for the neurons in the second hidden layer, and each
of these neurons is connected to all of the neurons in the preceding layer.

What is a Feed-Forward Neural Network and how does it work?


In its most basic form, a Feed-Forward Neural Network is a single-layer perceptron. A sequence
of inputs enters the layer and is multiplied by the weights in this model. The weighted input
values are then summed together to form a total. If the sum is greater than a predetermined
threshold, which is normally set at zero, the output value is usually 1, and if the sum is less
than the threshold, the output value is usually -1. The single-layer perceptron is a popular
feed-forward neural network model that is frequently used for classification. Single-layer
perceptrons can also be trained, so that their weights are learned from data.
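
A minimal sketch of the thresholding step described above, assuming a threshold of zero and outputs of 1 and -1. The function name and the example weights are hypothetical.

```python
import numpy as np

def perceptron_predict(x, weights, threshold=0.0):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else -1."""
    total = np.dot(x, weights)
    return 1 if total > threshold else -1

print(perceptron_predict(np.array([1.0, 2.0]), np.array([0.5, 0.5])))    # weighted sum is 1.5
print(perceptron_predict(np.array([1.0, 2.0]), np.array([-1.0, 0.2])))   # weighted sum is -0.6
```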

The neural network can compare the outputs of its nodes with the desired values using a
learning rule known as the delta rule, allowing the network to adjust its weights through training
to produce more accurate output values. This training procedure is a form of gradient descent.
The technique of updating weights in multi-layered perceptrons is virtually the same; however,
the process is referred to as back-propagation. In that case, the error measured at the final
layer is propagated backward and used to adjust the weights of each hidden layer inside the network.
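
To make the delta rule concrete, here is a small sketch for a single linear unit: after each example, every weight is nudged in proportion to the error (target minus output) times the corresponding input. The synthetic data, the target weights [2, -1] and the learning rate are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free data generated by a known linear rule.
X = rng.normal(size=(200, 2))
targets = X @ np.array([2.0, -1.0])

weights = np.zeros(2)
learning_rate = 0.05

for epoch in range(20):
    for x, target in zip(X, targets):
        output = x @ weights                                # prediction of the linear unit
        weights += learning_rate * (target - output) * x    # delta rule update

print(weights)
```

With noise-free targets the learned weights should end up very close to the rule that generated the data.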

Representing the feed-forward neural network using Python

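Let us create sample weights to be applied at the input layer and at the first and second hidden
layers. The shapes below are assumptions chosen for illustration (4 samples, 2 input features,
6 neurons in each hidden layer and 3 output classes):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))                          # assumed input: 4 samples, 2 features
W1, b1 = rng.normal(size=(2, 6)), np.zeros((1, 6))   # input layer -> first hidden layer
W2, b2 = rng.normal(size=(6, 6)), np.zeros((1, 6))   # first -> second hidden layer
W3, b3 = rng.normal(size=(6, 3)), np.zeros((1, 3))   # second hidden layer -> output layer

Python code implementation for the propagation of the input signal through the different layers
towards the output layer:

# Forward propagation of the input signals to the 6 neurons in the first hidden layer;
# the activation is calculated with the tanh function
z1 = X.dot(W1) + b1
a1 = np.tanh(z1)

# Forward propagation of the activation signals from the first hidden layer to the
# 6 neurons in the second hidden layer; the activation is calculated with the tanh function
z2 = a1.dot(W2) + b2
a2 = np.tanh(z2)

# Forward propagation of the activation signals from the second hidden layer
# to the 3 neurons in the output layer
z3 = a2.dot(W3) + b3

# Probabilities are calculated as the output of the softmax function
# (the row-wise maximum is subtracted first for numerical stability)
exp_scores = np.exp(z3 - np.max(z3, axis=1, keepdims=True))
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)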



Why does this Strategy Work?


As we’ve seen, the function of each neuron in the network is similar to that of linear regression.
In addition, each neuron has its own weight vector and an activation function at the end.

We’ve seen how the computation works so far. The main purpose of this section is to explain
why this strategy works. In theory, a sufficiently large neural network can approximate any
continuous function, no matter how complex or non-linear it is.

Importance of the Non-Linearity

When two or more linear objects, such as a line, plane, or hyperplane, are combined, the
outcome is also a linear object: line, plane, or hyperplane. No matter how many of these linear
things we add, we’ll still end up with a linear object.

However, this is not the case when adding non-linear objects. When two separate curves are
combined, the result is likely to be a more complex curve.

With activation functions, we introduce non-linearity at every layer instead of only adding up
linear objects such as hyperplanes. In other words, from the second layer onward we are applying
a non-linear function to an object that is already non-linear.

What if activation functions were not used in neural networks?

If neural networks didn’t have an activation function, they’d just be one huge linear unit
that a single linear regression model could easily replace.

a = m*x + d

Z = k*a + t => k*(m*x + d) + t => k*m*x + k*d + t => (k*m)*x + (k*d + t)
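
The same collapse can be checked numerically for whole layers. In this sketch (random weights, shapes chosen arbitrarily) two stacked linear layers give exactly the same output as one linear layer with combined weights, while inserting a tanh between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                              # 5 hypothetical samples, 3 features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)     # first layer
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)     # second layer

# Two layers with no activation function in between.
two_linear_layers = (X @ W1 + b1) @ W2 + b2

# A single linear layer with weights W1 W2 and bias b1 W2 + b2.
W, b = W1 @ W2, b1 @ W2 + b2
single_linear_layer = X @ W + b

# The same two layers with a tanh activation after the first one.
with_activation = np.tanh(X @ W1 + b1) @ W2 + b2

print(np.allclose(two_linear_layers, single_linear_layer))
print(np.allclose(with_activation, single_linear_layer))
```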

Applications of the Feed Forward Neural Networks


A Feed Forward Neural Network is an artificial neural network in which the connections between
nodes do not form a cycle. It is the opposite of a recurrent neural network, in which some paths
form cycles. The feed-forward model is the simplest type of neural network because the input is
only processed in one direction. The data always flows in one direction and never backwards,
regardless of how many hidden nodes it passes through.

Q1. What is feed-forward vs deep feed-forward?

A. Feed-forward refers to a neural network architecture where information flows in one direction,
from input to output, with no feedback loops. Deep feed-forward, commonly known as a deep
neural network, consists of multiple hidden layers between input and output layers, enabling the
network to learn complex hierarchical features and patterns, enhancing its ability to model
intricate relationships in data.

Q2. What is feed-forward vs feedback neural network?



A. Feed-forward neural networks transmit data in one direction, from input to output, without
feedback loops, making them suitable for tasks like pattern recognition and classification.
Feedback neural networks, on the other hand, incorporate feedback connections, allowing output
to affect subsequent processing. Recurrent Neural Networks (RNNs) are a common type of
feedback network, useful for sequential data tasks like language modeling, where context
matters.

--------------------------------------------------------------------------------------------------------------------------

GRADIENT & STOCHASTIC GRADIENT DESCENT (PART B)

Stochastic gradient descent is a popular algorithm used throughout machine learning and, most
importantly, it forms the basis of training neural networks. This section explains it in detail,
yet in simple terms. Familiarity with linear regression is recommended before proceeding.

What is the objective of Gradient Descent?

In plain terms, the gradient is the slope or slant of a surface. So gradient descent literally
means descending a slope to reach the lowest point on that surface. Let us imagine a
two-dimensional graph, such as a parabola in the figure below.

A parabolic function with two dimensions (x,y)

In the above graph, the lowest point on the parabola occurs at x = 1. The objective of gradient
descent algorithm is to find the value of “x” such that “y” is minimum. “y” here is termed as the
objective function that the gradient descent algorithm operates upon, to descend to the lowest
point.

It is important to understand the above before proceeding further.

Gradient Descent — the algorithm

The objective of regression is to minimize the sum of squared residuals. We know that at the
minimum of a smooth function the slope is equal to 0. By using this fact, we solved the
linear regression problem and learnt the weight vector. The same problem can be solved by the
gradient descent technique.

“Gradient descent is an iterative algorithm, that starts from a random point on a function and
travels down its slope in steps until it reaches the lowest point of that function.”

This algorithm is useful in cases where the optimal points cannot be found by equating the slope
of the function to 0. In the case of linear regression, you can mentally map the sum of squared
residuals as the function “y” and the weight vector as “x” in the parabola above.

How to move down in steps?

This is the crux of the algorithm. The general idea is to start with a random point (in our parabola
example start with a random “x”) and find a way to update this point with each iteration such
that we descend the slope.

The steps of the algorithm are

1. Find the slope of the objective function with respect to each parameter/feature. In
other words, compute the gradient of the function. (To clarify, in the parabola example,
differentiate “y” with respect to “x”. If we had more features like x1, x2 etc., we take
the partial derivative of “y” with respect to each of the features.)

2. Pick a random initial value for the parameters.

3. Evaluate the gradient function by plugging in the current parameter values.

4. Calculate the step sizes for each feature as : step size = gradient * learning rate.

5. Calculate the new parameters as : new params = old params - step size

6. Repeat steps 3 to 5 until the gradient is almost 0.

The “learning rate” mentioned above is a tunable parameter which heavily influences the
convergence of the algorithm. Larger learning rates make the algorithm take huge steps down
the slope, and it might jump across the minimum point, thereby missing it. A small learning
rate such as 0.01 is therefore a common starting point, although a rate that is too small makes
convergence slow. Because the step size is proportional to the gradient, the algorithm takes
larger steps where the slope is steep and smaller steps as it gets closer to the minimum, where
the slope flattens out.
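
The steps above can be sketched for the parabola example. The objective y = (x - 1)^2 is an assumption chosen so that the minimum lies at x = 1, as in the graph; the function names, starting point and stopping tolerance are illustrative. The second call shows how a learning rate that is too large overshoots the minimum and moves further away on every step.

```python
def gradient_descent(gradient, x0, learning_rate=0.01, tolerance=1e-6, max_iterations=10_000):
    """Repeat x <- x - learning_rate * gradient(x) until the gradient is almost 0."""
    x = x0
    for _ in range(max_iterations):
        g = gradient(x)                  # step 3: evaluate the gradient at the current point
        if abs(g) < tolerance:           # step 6: stop when the gradient is almost 0
            break
        x = x - learning_rate * g        # steps 4 and 5: step size, then update
    return x

def parabola_gradient(x):
    # Derivative of the assumed objective y = (x - 1)**2.
    return 2 * (x - 1)

x_small_rate = gradient_descent(parabola_gradient, x0=5.0, learning_rate=0.01)
x_large_rate = gradient_descent(parabola_gradient, x0=5.0, learning_rate=1.1, max_iterations=50)

print(x_small_rate)   # close to the minimum at x = 1
print(x_large_rate)   # far from 1: each step overshoots the minimum
```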

Stochastic Gradient Descent (SGD)

There are a few downsides of the gradient descent algorithm. We need to take a closer look at
the amount of computation we make for each iteration of the algorithm.

Say we have 10,000 data points and 10 features. The sum of squared residuals consists of as
many terms as there are data points, so 10,000 terms in our case. We need to compute the
derivative of this function with respect to each of the features, so in effect we will be doing
10,000 * 10 = 100,000 computations per iteration. It is common to take 1,000 iterations, so in
effect we have 100,000 * 1,000 = 100,000,000 computations to complete the algorithm. That is a
substantial overhead, and hence gradient descent is slow on huge data.

The cost function used by a machine learning algorithm often decomposes as a sum over training
examples of some per-example loss function. For example, the negative conditional log-likelihood
of the training data can be written as

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(x^{(i)}, y^{(i)}, \theta),$$

where $L(x, y, \theta) = -\log p(y \mid x; \theta)$ is the per-example loss. For these additive
cost functions, gradient descent requires computing

$$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta).$$

The computational cost of this operation is O(m). As the training set size grows to billions of
examples, the time to take a single gradient step becomes prohibitively long. The insight of SGD
is that the gradient is an expectation. The expectation may be approximately estimated using a
small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of
examples $B = \{x^{(1)}, \ldots, x^{(m')}\}$ drawn uniformly from the training set. The minibatch
size $m'$ is typically chosen to be a relatively small number of examples, ranging from one to a
few hundred. Crucially, $m'$ is usually held fixed as the training set size m grows. We may fit a
training set with billions of examples using updates computed on only a hundred examples. The
estimate of the gradient is formed as

$$g = \frac{1}{m'} \nabla_\theta \sum_{i=1}^{m'} L(x^{(i)}, y^{(i)}, \theta)$$

using examples from the minibatch B. The stochastic gradient descent algorithm then follows the
estimated gradient downhill:

$$\theta \leftarrow \theta - \epsilon g,$$

where $\epsilon$ is the learning rate.
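
A minimal minibatch SGD sketch for linear regression, assuming a mean squared error loss and synthetic data with 10,000 examples and 10 features (matching the sizes used in the earlier cost estimate). The learning rate, batch size and number of steps are illustrative choices, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: m examples with n features and slightly noisy linear targets.
m, n = 10_000, 10
X = rng.normal(size=(m, n))
true_w = rng.normal(size=n)
y = X @ true_w + 0.01 * rng.normal(size=m)

w = np.zeros(n)
learning_rate = 0.05
batch_size = 100     # minibatch size, held fixed regardless of m

for step in range(2000):
    # Sample a minibatch uniformly from the training set (with replacement).
    idx = rng.integers(0, m, size=batch_size)
    Xb, yb = X[idx], y[idx]

    # Gradient of the mean squared error estimated from the minibatch only.
    g = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)

    # Follow the estimated gradient downhill.
    w = w - learning_rate * g

print(np.linalg.norm(w - true_w))
```

Each update touches only 100 examples, so its cost does not grow with the size of the training set.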

--------------------------------------------------------------------------------------------------------------------------

GRADIENT BASED OPTIMIZATION (PART B)



Most deep learning algorithms involve optimization of some sort. Optimization refers to the task
of either minimizing or maximizing some function f(x) by altering x. We usually phrase most
optimization problems in terms of minimizing f(x). Maximization may be accomplished via a
minimization algorithm by minimizing −f(x). The function we want to minimize or maximize is called
the objective function or criterion. When we are minimizing it, we may also call it the cost
function, loss function, or error function. Here, we use these terms interchangeably,
though some machine learning publications assign special meaning to some of these terms.

We often denote the value that minimizes or maximizes a function with a superscript ∗. For
example, we might say x∗ = arg min f(x).

Fig 1 Uphill and the Groundhill of the gradient problem



We assume the reader is already familiar with calculus, but provide a brief review of how calculus
concepts relate to optimization here. Suppose we have a function y = f (x), where both x and y are
real numbers.

The derivative of this function is denoted as $f'(x)$ or as $\frac{dy}{dx}$. The derivative
$f'(x)$ gives the slope of f(x) at the point x. In other words, it specifies how to scale a small
change in the input in order to obtain the corresponding change in the output:

$$f(x + \epsilon) \approx f(x) + \epsilon f'(x).$$

The derivative is therefore useful for minimizing a function because it tells us how to change x in
order to make a small improvement in y. For example, we know that
$f(x - \epsilon\, \mathrm{sign}(f'(x)))$ is less than f(x) for small enough $\epsilon$. We can thus
reduce f(x) by moving x in small steps with the opposite sign of the derivative. When $f'(x) = 0$,
the derivative provides no information about which direction to move. Points where $f'(x) = 0$ are
known as critical points or stationary points. A local minimum is a point where f(x) is lower than
at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal
steps. A local maximum is a point where f(x) is higher than at all neighboring points, so it is not
possible to increase f(x) by making infinitesimal steps. Some critical points are neither maxima
nor minima; these are known as saddle points.

Fig representing minimum, maximum and saddle point

A point that obtains the absolute lowest value of f(x) is a global minimum. It is possible for there
to be only one global minimum or multiple global minima of the function. It is also possible for
there to be local minima that are not globally optimal. In the context of deep learning, we optimize
functions that may have many local minima that are not optimal, and many saddle points
surrounded by very flat regions. All of this makes optimization very difficult, especially when the
input to the function is multidimensional. We therefore usually settle for finding a value of f that
is very low, but not necessarily minimal in any formal sense.
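
The difference between local and global minima can be seen on a small one-dimensional example. The function f(x) = x^4 - 3x^2 + x below is an arbitrary choice with two local minima; stepping against the sign of the derivative from two different starting points ends at two different critical points, only one of which is the global minimum. The learning rate and step count are illustrative.

```python
def f(x):
    # Hypothetical non-convex objective with two local minima.
    return x**4 - 3 * x**2 + x

def f_prime(x):
    # Derivative of f.
    return 4 * x**3 - 6 * x + 1

def descend(x, learning_rate=0.01, steps=2000):
    """Move x in small steps with the opposite sign of the derivative."""
    for _ in range(steps):
        x = x - learning_rate * f_prime(x)
    return x

x_left = descend(-2.0)    # starts on the left, settles in the left basin
x_right = descend(2.0)    # starts on the right, settles in the right basin

print(x_left, f(x_left))
print(x_right, f(x_right))
```

Both end points are critical points (the derivative is essentially zero there), but the left one has the lower value of f, so the run started on the right has settled for a local minimum that is not globally optimal.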

-------------------------------------------------------------------------------------------------------------------------

MACHINE LEARNING BASICS (Part B)

Learning Algorithm:

A machine learning algorithm is an algorithm that is able to learn from data. “A computer
program is said to learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E.” One
can imagine a very wide variety of experiences E, tasks T, and performance measures P, and we
do not make any attempt here to provide a formal definition of what may be used for each
of these entities. Instead, the following sections provide intuitive descriptions and examples of
the different kinds of tasks, performance measures and experiences that can be used to construct
machine learning algorithms.

The Task, T

Machine learning allows us to tackle tasks that are too difficult to solve with fixed programs
written and designed by human beings. From a scientific and philosophical point of view, machine
learning is interesting because developing our understanding of machine learning entails
developing our understanding of the principles that underlie intelligence. In this relatively formal
definition of the word “task,” the process of learning itself is not the task. Learning is our means
of attaining the ability to perform the task. For example, if we want a robot to be able to walk,
then walking is the task. We could program the robot to learn to walk, or we could attempt to
directly write a program that specifies how to walk manually.

Machine learning tasks are usually described in terms of how the machine learning system should
process an example. An example is a collection of features that have been quantitatively
measured from some object or event that we want the machine learning system to process. We
typically represent an example as a vector $x \in \mathbb{R}^n$ where each entry $x_i$ of the vector is another
feature. For example, the features of an image are usually the values of the pixels in the image.
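
As a small illustration of representing an example as a feature vector, the sketch below flattens a tiny made-up grayscale image into a vector whose entries are the pixel values. The 3 x 4 size and the pixel values are arbitrary.

```python
import numpy as np

# Hypothetical 3 x 4 grayscale image; each entry is a pixel brightness value.
image = np.arange(12, dtype=np.float64).reshape(3, 4)

# The example x is a vector in R^12: one feature per pixel, read row by row.
x = image.reshape(-1)

print(x.shape)
print(x)
```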

The machine learning tasks are

1. Classification: In this type of task, the computer program is asked to specify which of k
categories some input belongs to. To solve this task, the learning algorithm is usually asked
to produce a function $f : \mathbb{R}^n \to \{1, \ldots, k\}$. When y = f(x), the model assigns an
input described by vector x to a category identified by numeric code y. There are other variants
of the classification task, for example, where f outputs a probability distribution over
classes. An example of a classification task is object recognition, where the input is an image
(usually described as a set of pixel brightness values), and the output is a numeric code
identifying the object in the image.

2. Classification with missing inputs: Classification becomes more challenging if the computer
program is not guaranteed that every measurement in its input vector will always be
provided. In order to solve the classification task, the learning algorithm only has to define
a single function mapping from a vector input to a categorical output. When some of the
inputs may be missing, rather than providing a single classification function, the learning
algorithm must learn a set of functions. Each function corresponds to classifying x with a
different subset of its inputs missing. This kind of situation arises frequently in medical
diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently
define such a large set of functions is to learn a probability distribution over all of the
relevant variables, then solve the classification task by marginalizing out the missing
variables. With n input variables, we can now obtain all $2^n$ different classification functions
needed for each possible set of missing inputs, but we only need to learn a single function
describing the joint probability distribution.
3. Regression : In this type of task, the computer program is asked to predict a numerical value
given some input. To solve this task, the learning algorithm is asked to output a function
$f : \mathbb{R}^n \to \mathbb{R}$. This type of task is similar to classification, except that the format of output is
different. An example of a regression task is the prediction of the expected claim amount
that an insured person will make (used to set insurance premiums), or the prediction of
future prices of securities. These kinds of predictions are also used for algorithmic trading.
4. Transcription: In this type of task, the machine learning system is asked to observe a
relatively unstructured representation of some kind of data and transcribe it into discrete,
textual form. For example, in optical character recognition, the computer program is shown
a photograph containing an image of text and is asked to return this text in the form of a
sequence of characters (e.g., in ASCII or Unicode format). Google Street View uses deep
learning to process address numbers in this way (Goodfellow et al., 2014d). Another
example is speech recognition, where the computer program is provided an audio
waveform and emits a sequence of characters or word ID codes describing the words that
were spoken in the audio recording. Deep learning is a crucial component of modern speech
recognition systems.
5. Machine translation: In a machine translation task, the input already consists of a sequence of
symbols in some language, and the computer program must convert this into a sequence of
symbols in another language. This is commonly applied to natural languages, such as
translating from English to French. Deep learning has recently begun to have an important
impact on this kind of task.

6. Structured output: Structured output tasks involve any task where the output is a vector (or
other data structure containing multiple values) with important relationships between the
different elements. This is a broad category, and subsumes the transcription and translation
tasks described above, but also many other tasks. One example is parsing: mapping a natural
language sentence into a tree that describes its grammatical structure and tagging nodes of the
trees as being verbs, nouns, or adverbs, and so on. See Collobert (2011) for an example of deep
learning applied to a parsing task. Another example is pixel-wise segmentation of images, where
the computer program assigns every pixel in an image to a specific category.

7. Anomaly detection: In this type of task, the computer program sifts through a set of
events or objects, and flags some of them as being unusual or atypical. An example of an
anomaly detection task is credit card fraud detection. By modeling your purchasing habits,
a credit card company can detect misuse of your cards. If a thief steals your credit card or
credit card information, the thief’s purchases will often come from a different probability
distribution over purchase types than your own. The credit card company can prevent fraud
by placing a hold on an account as soon as that card has been used for an uncharacteristic
purchase.
8. Synthesis and sampling: In this type of task, the machine learning algorithm is asked to
generate new examples that are similar to those in the training data. Synthesis and
sampling via machine learning can be useful for media applications where it can be
expensive or boring for an artist to generate large volumes of content by hand. For example,
video games can automatically generate textures for large objects or landscapes. In some
cases, we want the sampling or synthesis procedure to generate some specific kind
of output given the input. For example, in a speech synthesis task, we provide a written
sentence and ask the program to emit an audio waveform containing a spoken version of
that sentence. This is a kind of structured output task, but with the added qualification that
there is no single correct output for each input, and we explicitly desire a large amount of
variation in the output, in order for the output to seem more natural and realistic.
9. Imputation of missing values: In this type of task, the machine learning algorithm is given a
new example $x \in \mathbb{R}^n$, but with some entries $x_i$ of $x$ missing. The algorithm
must provide a prediction of the values of the missing entries.
10. Denoising: In this type of task, the machine learning algorithm is given as input a corrupted
example $\tilde{x} \in \mathbb{R}^n$ obtained by an unknown corruption process from a clean
example $x \in \mathbb{R}^n$. The learner must predict the clean example $x$ from its corrupted
version $\tilde{x}$, or more generally predict the conditional probability distribution
$p(x \mid \tilde{x})$.
11. Density estimation or probability mass function estimation: In the density estimation
problem, the machine learning algorithm is asked to learn a function
$p_{\text{model}} : \mathbb{R}^n \to \mathbb{R}$, where $p_{\text{model}}(x)$ can be interpreted
as a probability density function (if x is continuous) or a probability mass function (if x is
discrete) on the space that the examples were drawn from. To do such a task well (we will
specify exactly what that means when we discuss performance measures P), the algorithm
needs to learn the structure of the data it has seen. It must know where examples cluster
tightly and where they are unlikely to occur. Most of the tasks described above require the
learning algorithm to at least implicitly capture the structure of the probability distribution.
Density estimation allows us to explicitly capture that distribution. In principle, we can then
perform computations on that distribution in order to solve the other tasks as well. For
example, if we have performed density estimation to obtain a probability distribution p(x),
we can use that distribution to solve the missing value imputation task. If a value $x_i$ is
missing and all of the other values, denoted $x_{-i}$, are given, then we know the distribution
over it is given by $p(x_i \mid x_{-i})$. In practice, density estimation does not always allow
us to solve all of these related tasks, because in many cases the required operations on p(x)
are computationally intractable.
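As a rough sketch of using an estimated density for imputation, the code below fits a two-dimensional Gaussian to data and then fills in a missing second coordinate with the mean of p(x2 | x1). The Gaussian model, the function names and the toy data are all assumptions chosen to keep the example tractable; the text does not prescribe any particular density model.

```python
def fit_gaussian_2d(points):
    """Density estimation step: estimate means, variance of x1, and covariance."""
    n = len(points)
    mean_1 = sum(p[0] for p in points) / n
    mean_2 = sum(p[1] for p in points) / n
    var_1 = sum((p[0] - mean_1) ** 2 for p in points) / (n - 1)
    cov_12 = sum((p[0] - mean_1) * (p[1] - mean_2) for p in points) / (n - 1)
    return mean_1, mean_2, var_1, cov_12

def impute_x2_given_x1(x1, model):
    """Mean of the conditional distribution p(x2 | x1) under the fitted Gaussian."""
    mean_1, mean_2, var_1, cov_12 = model
    return mean_2 + (cov_12 / var_1) * (x1 - mean_1)

# Toy data in which x2 = 2 * x1 + 1 exactly (made up for illustration).
data = [(i / 10, 2 * (i / 10) + 1) for i in range(50)]
model = fit_gaussian_2d(data)
print(impute_x2_given_x1(1.0, model))  # close to 3.0 for this data
```

For a Gaussian the conditional distribution has a closed form, which is why this case is easy. For richer density models the same conditioning step may be intractable, as noted above.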

--------------------------------------------------------------------------------------------------------------------

CHALLENGES MOTIVATING DEEP LEARNING(Part B)

The simple machine learning algorithms described in this
chapter work well on a wide variety of important problems. They have not succeeded, however,
in solving the central problems in AI, such as recognizing speech or recognizing objects.

The development of deep learning was motivated in part by the failure of traditional algorithms
to generalize well on such AI tasks. This section is about how the challenge of generalizing to new
examples becomes exponentially more difficult when working with high-dimensional data, and
how the mechanisms used to achieve generalization in traditional machine learning are
insufficient to learn complicated functions in high-dimensional spaces. Such spaces also often
impose high computational costs. Deep learning was designed to overcome these and other
obstacles.

The Curse of Dimensionality(Part A)

Many machine learning problems become exceedingly difficult when the number of dimensions
in the data is high. This phenomenon is known as the curse of dimensionality. Of particular
concern is that the number of possible distinct configurations of a set of variables increases
exponentially as the number of variables increases. The curse of dimensionality arises in many
places in computer science, especially in machine learning.

A statistical challenge arises because the number of possible configurations of x is much larger
than the number of training examples. To understand the issue, let us consider that the input
space is organized into a grid, as in the figure. We can describe low-dimensional space with a small
number of grid cells that are mostly occupied by the data. When generalizing to a new data point,
we can usually tell what to do simply by inspecting the training examples that lie in the same cell
as the new input. For example, if estimating the probability density at some point x, we can just
return the number of training examples in the same unit volume cell as x, divided by the total
number of training examples.
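The grid argument can be illustrated with a short sketch. It assumes inputs drawn uniformly from the unit hypercube and 10 cells per dimension; both choices, and the function names, are hypothetical. `cell_density` returns the fraction of training examples sharing a cell with x, which equals the density estimate described above when cells have unit volume (otherwise divide by the cell volume).

```python
import random

def cell_of(point, cells_per_dim):
    """Index tuple of the grid cell containing a point of [0, 1)^d."""
    return tuple(min(int(v * cells_per_dim), cells_per_dim - 1) for v in point)

def cell_density(x, training_points, cells_per_dim):
    """Fraction of training examples that fall in the same cell as x."""
    target = cell_of(x, cells_per_dim)
    hits = sum(1 for p in training_points if cell_of(p, cells_per_dim) == target)
    return hits / len(training_points)

def occupied_fraction(n_points, dim, cells_per_dim=10, seed=0):
    """Share of the cells_per_dim**dim grid cells that contain any example."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    occupied = {cell_of(p, cells_per_dim) for p in points}
    return len(occupied) / cells_per_dim ** dim

# With a fixed budget of 1000 examples, coverage collapses as dimension grows.
for dim in (1, 2, 3, 6):
    print(dim, occupied_fraction(1000, dim))
```

In six dimensions the grid has a million cells, so 1000 examples can touch at most 0.1% of them, and most new inputs land in a cell with no training data at all.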

Local Constancy and Smoothness Regularization

To generalize well, machine learning algorithms
need to be guided by prior beliefs about what kind of function they should learn. We have seen
these priors incorporated as explicit beliefs in the form of probability distributions over
parameters of the model. More informally, we may also discuss prior beliefs as directly influencing
the function itself and influencing the parameters only indirectly, as a result of the relationship
between the parameters and the function. Additionally, we informally discuss prior beliefs as
being expressed implicitly by choosing algorithms that are biased toward choosing some class of
functions over another, even though these biases may not be expressed (or even be possible to
express) in terms of a probability distribution representing our degree of belief in various
functions.

Among the most widely used of these implicit “priors” is the smoothness prior, or local constancy
prior. This prior states that the function we learn should not change very much within a small
region.

There are many different ways to implicitly or explicitly express a prior belief that the learned
function should be smooth or locally constant. All these different methods are designed to
encourage the learning process to learn a function $f^*$ that satisfies the condition

$$f^*(x) \approx f^*(x + \epsilon)$$

for most configurations $x$ and small change $\epsilon$. In other words, if we know a good answer
for an input x (for example, if x is a labeled training example), then that answer is probably good
in the neighborhood of x. If we have several good answers in some neighborhood, we would combine
them (by some form of averaging or interpolation) to produce an answer that agrees with as many
of them as much as possible.
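One simple learner that relies on this prior is k-nearest-neighbors averaging. It is used here only as an illustration and is not named in the passage above. The sketch works on one-dimensional inputs with made-up data.

```python
def knn_predict(x, train_x, train_y, k=3):
    """Average the targets of the k training inputs closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

# Toy training set (hypothetical values).
train_x = [0.0, 1.0, 2.0, 3.0, 10.0]
train_y = [0.0, 1.0, 2.0, 3.0, 10.0]

# Nearby query points share the same neighbors, so the prediction is
# locally constant: a small change in x leaves the answer unchanged.
print(knn_predict(1.1, train_x, train_y))
print(knn_predict(1.2, train_x, train_y))
```

The prediction only changes when the set of nearest neighbors changes, so the learned function is piecewise constant, with roughly one region per group of training examples.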

Manifold Learning(Part A)

An important concept underlying many ideas in machine learning is that of a manifold. A manifold
is a connected region. Mathematically, it is a set of points associated with a neighborhood
around each point. From any given point, the manifold locally appears to be a Euclidean space. In
everyday life, we experience the surface of the world as a 2-D plane, but it is in fact a spherical
manifold in 3-D space.

The first observation in favor of the manifold hypothesis is that the probability distribution over
images, text strings, and sounds that occur in real life is highly concentrated. Uniform noise
essentially never resembles structured inputs from these domains.


-----------------------------------------------------------------------------------------------------------------------------

REGULARIZATION?(Part B)

Regularization in Deep Learning:

In the context of deep learning models, most regularization strategies revolve
around regularizing estimators. So the question arises: what does regularizing
an estimator mean?

The bias vs variance tradeoff graph sheds a bit more light on the nuances of this
topic:

[Figure: Bias vs Variance tradeoff graph]


Regularization of an estimator works by trading increased bias for reduced
variance. An effective regularizer is one that makes the best trade
between bias and variance: the end product of the tradeoff should be a
significant reduction in variance at minimum expense to bias. In simpler terms,
this means low variance without greatly increasing the bias.

We consider two scenarios:

1. The true data-generating process/function, F1, which created the dataset

2. A generating process/function, F2, that mimics F1 but also explores
other possible generating scenarios/functions

The work of regularization techniques is to help take our model from F2 to F1
without overly complicating F2. Deep learning algorithms are mostly used in more
complicated domains like images, audio, text sequences or simulating complex
decision making tasks. The true data-generating process F1 is almost impossible
to map correctly, hence with regularization we aim to bring our model, with
its F2 function, as close as possible to the original F1 function. The following
analogy helps understand it more clearly:

Fitting F2, our ML model, onto F1, our true data generation process, is almost like fitting a
square-shaped toy in a round hole by close approximations.

In practical deep learning training scenarios, we mostly find that the best-fitting
model (in the sense of least generalization error) is a large model that has been
regularized appropriately.

We will now dive deep to study one type of regularization technique that helps to
create a large but deeply regularized model using parameter-based penalties.

Parameter Norm Penalties:

Under this kind of regularization technique, the capacity of models such as neural
networks, linear regression or logistic regression is limited by adding a parameter norm
penalty Ω(θ) to the objective function J. The regularized objective can be written as:

$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \, \Omega(\theta)$$

where α ∈ [0, ∞) is a hyperparameter that weights the contribution of the norm
penalty term Ω relative to the standard objective function J.

Setting α to 0 results in no regularization, while larger values correspond to
greater regularization.

This type of regularization penalizes only the weights of the affine transformation
at each layer of the network, which leaves the biases unregularized. This is done
with the notion in mind that it typically requires less data to fit the biases than
the weights.
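A minimal sketch of such a penalized objective for a linear model follows. It assumes mean squared error as the standard objective J; the function names and the toy data are hypothetical. Note that the penalty is applied to the weights only, never to the bias.

```python
def predict(weights, bias, x):
    """Affine model: weighted sum of the inputs plus a bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def data_loss(weights, bias, X, y):
    """Standard objective J: mean squared error (an assumed choice)."""
    errors = [predict(weights, bias, xi) - yi for xi, yi in zip(X, y)]
    return sum(e * e for e in errors) / len(errors)

def l1_norm(weights):
    return sum(abs(w) for w in weights)

def squared_l2_norm(weights):
    return sum(w * w for w in weights)

def regularized_loss(weights, bias, X, y, alpha, omega):
    """J plus alpha times the norm penalty; the bias is left unpenalized."""
    return data_loss(weights, bias, X, y) + alpha * omega(weights)

# Toy data (made up).
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]
w, b = [0.5, -0.5], 3.0
print(regularized_loss(w, b, X, y, alpha=0.0, omega=l1_norm))  # plain J
print(regularized_loss(w, b, X, y, alpha=0.1, omega=l1_norm))  # J + 0.1 * L1 norm
```

Passing the penalty as a function makes it easy to swap between L1 and squared L2 while keeping the rest of the objective unchanged.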

For deep learning, it is sometimes desirable to use a separate parameter to
induce the same effect.

L1 Parameter Regularization:

L1 regularization penalizes the sum of the absolute values of the weights. The
penalized objective is still an optimization problem that can be solved with
gradient-based methods.

The formula and its high-level meaning are given here:

[Figure: Formula for L1 regularization terms]

Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the
“absolute value of magnitude” of the coefficients as a penalty term to the loss function.
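One common way to write this objective is shown below. It assumes a squared-error loss, weights $w_j$, predictions $\hat{y}_i$ and a penalty weight $\lambda$; these symbols are assumptions, since the original formula image is not available.

$$\text{Loss}_{\text{lasso}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |w_j|$$

The first term is the usual data-fitting loss, and the second term is the L1 penalty on the coefficients.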

Lasso shrinks the coefficients of the less important features to zero, thus removing some
features altogether. So, this works well for feature selection in case we have a huge
number of features.
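The exact-zero behaviour can be seen in the simplest possible setting: a single coefficient whose unpenalized value is w. The sketch below compares the minimizer of a one-dimensional L1-penalized problem (soft-thresholding) with the minimizer of the matching L2-penalized problem. The one-dimensional setup and the numbers are illustrative assumptions, not a full lasso solver.

```python
def soft_threshold(w, alpha):
    """Minimizer over v of 0.5 * (v - w)**2 + alpha * abs(v)  (L1 penalty)."""
    if w > alpha:
        return w - alpha
    if w < -alpha:
        return w + alpha
    return 0.0  # small coefficients are set exactly to zero

def l2_shrink(w, alpha):
    """Minimizer over v of 0.5 * (v - w)**2 + 0.5 * alpha * v**2  (L2 penalty)."""
    return w / (1.0 + alpha)

unpenalized = [3.0, -0.2, 0.05, -1.5]  # hypothetical coefficients
alpha = 0.5
print([soft_threshold(w, alpha) for w in unpenalized])
print([l2_shrink(w, alpha) for w in unpenalized])
```

With the L1 penalty the two small coefficients become exactly zero, while the L2 penalty only scales every coefficient down and leaves all of them nonzero.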

The L1 regularizer basically looks for parameter vectors that minimize the L1 norm
of the parameter vector (the sum of the absolute values of its entries). This is essentially
the problem of how to optimize the parameters of a single neuron, a single layer neural
network in general, and a single layer feed-forward neural network in particular.

Building an ML model while accounting for outliers to be incorporated in the cost
penalization is not a trivial task. The image shows a visualization of one dummy
dataset where L1 helps to identify the outliers in the distant vicinity.

A good way of conceptualizing it is as a method of maximizing the area
of the parameter hyperspace that the true parameter vector lies within. To do this,
it finds the “sharpest” edge, one that is as close to the parameter vector as possible.

Key points that should be noted for L1 regularization:

1. L1 regularization is easy to implement and can be trained in one shot,
meaning that once it is trained you are done with it and can just use
the parameter vector and weights.

2. L1 regularization is robust in dealing with outliers. It creates sparsity in the
solution (most of the coefficients of the solution are zero), which means the
coefficients of less important features or noise terms will be zero. This makes
L1 regularization robust to outliers.

To understand the above-mentioned point, let us go through the following example
and try to understand what it means when an algorithm is said to be sensitive to
outliers:

1. For instance, we are trying to classify images of various birds of different species
and have a neural network with a few hundred parameters.

2. We find a sample of birds of one species, which we have no reason to believe are
of any different species from all the others.

3. We add this image to the training set and try to train the neural network. This
is like throwing an outlier into the mix of all the others. By looking at the edge
of the hyperspace where the hyperplane is closest to, we pick up on this outlier,
but by the time we’ve got to the hyperplane it’s quite far from the plane and is
hence an outlier.

4. The solution in such cases is to perform iterative dropout. L1 regularization is a
one-shot solution, but in the end we are going to have to make some kind of
hard decision on where to cut off the edges of the hyperspace.

5. Iterative dropout is a method of deciding exactly where to cut off. It is a little
more expensive in terms of training time, but in the end it might give us an
easier decision about how far the hyperspace edges are.

Along with shrinking coefficients, the lasso performs feature selection as well.
(Remember the ‘selection‘ in the lasso full form?) This is because some of the
coefficients become exactly zero, which is equivalent to the corresponding features
being excluded from the model.

L2 Parameter Regularization:

The Regression model that uses L2 regularization is called Ridge Regression.


[Figure: Formula for Ridge Regression]

Regularization adds a penalty as model complexity increases. The regularization
parameter (lambda) penalizes all the parameters except the intercept so that the
model generalizes to the data and won’t overfit. Ridge regression adds the “squared
magnitude of the coefficient” as a penalty term to the loss function. The boxed
part in the above image represents the L2 regularization term.
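One common way to write the ridge objective is shown below. It assumes a squared-error loss, weights $w_j$ (excluding the intercept) and predictions $\hat{y}_i$; the exact notation of the original image is not recoverable, so these symbols are assumptions.

$$\text{Loss}_{\text{ridge}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} w_j^2$$

The second term, $\lambda \sum_j w_j^2$, is the L2 regularization term.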

Lambda is a hyperparameter.

If lambda is zero, then it is equivalent to OLS.

Ordinary Least Squares (OLS) is a statistical model that also helps us identify the
more significant features that can have a heavy influence on the output.

But if lambda is very large, then it will add too much penalty, and it will lead to
under-fitting. Important points to be considered about L2 are listed below:

1. Ridge regularization forces the weights to be small but does not make them
zero and does not give the sparse solution.

2. Ridge is not robust to outliers as square terms blow up the error differences of
the outliers, and the regularization term tries to fix it by penalizing the weights.

3. Ridge regression performs better when all the input features influence the
output, and all the weights are of roughly equal size.

4. L2 regularization can learn complex data patterns.
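The effect of lambda can be checked on a one-feature ridge regression, where the solution has a simple closed form. The sketch below assumes a squared-error loss with the intercept left unpenalized, as described above; the function name and the toy data are hypothetical.

```python
def ridge_fit_1d(xs, ys, lam):
    """Minimize sum((y - c - m*x)**2) + lam * m**2; the intercept c is not penalized."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + lam)  # lam = 0 recovers ordinary least squares
    intercept = y_mean - slope * x_mean
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1 (made-up data)

for lam in (0.0, 10.0, 1e6):
    print(lam, ridge_fit_1d(xs, ys, lam))
```

With lambda at zero the fit matches OLS. As lambda grows the slope is pulled toward zero but never reaches it exactly, and for very large lambda the model predicts roughly the mean of y everywhere, which is under-fitting.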

-----------------------------------------------------------------------------------------------------------------------------

WHAT IS OVERFITTING?(Part B)

When a model performs very well for training data but has poor performance with
test data (new data), it is known as overfitting. In this case, the machine learning
model learns the details and noise in the training data such that it negatively affects
the performance of the model on test data. Overfitting can happen due to low bias
and high variance.

Reasons for Overfitting

• Data used for training is not cleaned and contains noise (garbage values) in it
• The model has a high variance
• The size of the training dataset used is not enough
• The model is too complex

Ways to Tackle Overfitting

• Using K-fold cross-validation


• Using Regularization techniques such as Lasso and Ridge
• Training model with sufficient data
• Adopting ensembling techniques
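As an illustration of the first item, K-fold cross-validation splits the data into k parts and uses each part once for validation while training on the rest. The helper below is a hypothetical stdlib-only sketch that produces the index splits; it does not shuffle the data, which is often done first in practice.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for train_idx, val_idx in k_fold_splits(10, 3):
    print(len(train_idx), val_idx)
```

Averaging the validation score over the k folds gives a less noisy estimate of performance on unseen data than a single train/validation split.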

WHAT IS UNDERFITTING?(Part B)

When a model has not learned the patterns in the training data well and is unable
to generalize well on the new data, it is known as underfitting. An underfit model
has poor performance on the training data and will result in unreliable predictions.
Underfitting occurs due to high bias and low variance.

Reasons for Underfitting

• Data used for training is not cleaned and contains noise (garbage values) in it
• The model has a high bias
• The size of the training dataset used is not enough
• The model is too simple

Ways to Tackle Underfitting

• Increase the number of features in the dataset


• Increase model complexity
• Reduce noise in the data
• Increase the duration of training



WHAT IS A MODEL PARAMETER?(Part B)

A model parameter is a variable of the selected model which can be estimated by
fitting the given data to the model.

Example:

In the above plot, x is the independent variable, and y is the dependent variable.
The objective is to fit a regression line y = mx + c to the data. This line (the model)
is then used to predict the y-value for unseen values of x. Here, m is the slope and c
is the intercept of the line. These two parameters (m and c) are estimated by fitting a
straight line to the data by minimizing the RMSE (root mean squared error). Hence,
these parameters are called the model parameters.

Model parameters in different models:


• m(slope) and c(intercept) in Linear Regression
• weights and biases in Neural Networks

What is a Model Hyperparameter?(Part B)


A model hyperparameter is a parameter whose value is set before the
model starts training. Hyperparameters cannot be learned by fitting the model to the data.

Example:

In the above plot, the x-axis represents the number of epochs and the y-axis represents
the accuracy. We can see that beyond a certain number of epochs, although the training
accuracy keeps increasing, the validation and test accuracy start decreasing. This is a
case of overfitting. Here the number of epochs is a hyperparameter and is set manually.
Setting this number to a small value may cause underfitting, and a high value may cause
overfitting.

Model hyperparameters in different models:


• Learning rate in gradient descent
• Number of iterations in gradient descent
• Number of layers in a Neural Network
• Number of neurons per layer in a Neural Network
• Number of clusters (k) in k-means clustering

Table of differences between Model Parameters and Hyperparameters

| PARAMETERS | HYPERPARAMETERS |
|---|---|
| They are required for making predictions | They are required for estimating the model parameters |
| They are estimated by optimization algorithms (Gradient Descent, Adam, Adagrad) | They are estimated by hyperparameter tuning |
| They are not set manually | They are set manually |
| The final parameters found after training decide how the model will perform on unseen data | The choice of hyperparameters decides how efficient the training is. In gradient descent, the learning rate decides how efficient and accurate the optimization process is in estimating the parameters |

In a machine learning model, there are 2 types of parameters:

1. Model Parameters: These are the parameters in the model that must be
determined using the training data set. These are the fitted parameters.

2. Hyperparameters: These are adjustable parameters that must be tuned in
order to obtain a model with optimal performance.

For example, suppose you want to build a simple linear regression
model using an m-dimensional training data set. Then your model
can be written as:

$$\hat{y} = w_0 + w_1 X_1 + w_2 X_2 + \dots + w_m X_m$$

where X is the predictor matrix (with columns X_1, …, X_m), and w are the
weights. Here w_0, w_1, w_2, …, w_m are the model parameters. If the model
uses the gradient descent algorithm to minimize the objective function in
order to determine the weights w_0, w_1, w_2, …, w_m, then we can have
an optimizer such as GradientDescent(eta, n_iter). Here eta
(learning rate) and n_iter (number of iterations) are the
hyperparameters that would have to be adjusted in order to
obtain the best values for the model parameters w_0, w_1, w_2,
…, w_m.
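The split between parameters and hyperparameters can be made concrete with a small sketch. The function below is a hypothetical stand-in for the GradientDescent(eta, n_iter) optimizer mentioned above, written for a single feature with a mean-squared-error objective; the default values and the toy data are assumptions.

```python
def gradient_descent(xs, ys, eta=0.05, n_iter=2000):
    """Fit y ~ w0 + w1 * x by batch gradient descent on the mean squared error.

    eta and n_iter are hyperparameters (chosen by hand);
    w0 and w1 are model parameters (learned from the data).
    """
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        errors = [w0 + w1 * x - y for x, y in zip(xs, ys)]
        grad_w0 = 2.0 * sum(errors) / n
        grad_w1 = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        w0 -= eta * grad_w0
        w1 -= eta * grad_w1
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1 (made-up data)

print(gradient_descent(xs, ys))            # enough iterations to converge
print(gradient_descent(xs, ys, n_iter=5))  # too few iterations: poor estimates
```

Changing eta or n_iter does not change what the model is, only how well the weights are estimated. On this data a learning rate that is too large (for example 0.2) makes the updates diverge instead of converging.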

EPOCH IN MACHINE LEARNING(Part B)

Machine learning is a field where the learning aspect of Artificial
Intelligence (AI) is the focus. This learning aspect is developed by
algorithms that represent a set of data. Machine learning models
are trained with specific datasets passed through the algorithm.

Each time the dataset passes through the algorithm, it is said to have
completed an epoch. An epoch, in machine learning, therefore
refers to one entire pass of the training data through the
algorithm. The number of epochs is a hyperparameter that determines
the process of training the machine learning model.

The training data is usually broken down into small batches to
overcome the storage space limitations of a computer system. These
smaller batches can be easily fed into the machine learning model to
train it. Each of these smaller pieces is called a batch in machine
learning. One epoch is complete when all the batches have been fed
into the model once.
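The relationship between batches and epochs can be sketched as two nested loops. The code below is a hypothetical skeleton: the parameter update is replaced by a counter, and the function names are assumptions.

```python
def iterate_minibatches(data, batch_size):
    """Yield consecutive slices of the data, each with at most batch_size items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def train(data, n_epochs, batch_size):
    """Count the updates performed: one per batch, for every epoch."""
    updates = 0
    for epoch in range(n_epochs):        # one epoch = one full pass over the data
        for batch in iterate_minibatches(data, batch_size):
            updates += 1                 # placeholder for one parameter update
    return updates

data = list(range(10))
print([len(b) for b in iterate_minibatches(data, 4)])
print(train(data, n_epochs=5, batch_size=4))
```

Ten examples with a batch size of 4 give three batches per epoch (the last one smaller), so five epochs perform fifteen updates.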
