DL Unit 2

Scalars:

Scalars are single numbers and are an example of a 0th-order tensor. In mathematics it is necessary to describe the set of values to which a scalar belongs. The notation 𝑥∈𝑅 states that the (lowercase) scalar value 𝑥 is an element of (or member of) the set of real-valued numbers, 𝑅.

There are various sets of numbers of interest within machine learning. 𝑁 represents
the set of positive integers (1,2,3,…). 𝑍 represents the integers, which include
positive, negative and zero values. 𝑄 represents the set of rational numbers that
may be expressed as a fraction of two integers.

Vectors:

Vectors are ordered arrays of single numbers and are an example of a 1st-order
tensor. Vectors are members of objects known as vector spaces. A vector space
can be thought of as the entire collection of all possible vectors of a particular
length (or dimension). The three-dimensional real-valued vector space, denoted
by 𝑅3 is often used to represent our real-world notion of three-dimensional space
mathematically.

More formally a vector space is an 𝑛-dimensional Cartesian product of a set with itself, along with proper definitions on how to add vectors and multiply them with scalar values. If all of the scalars in a vector are real-valued then the notation 𝑥∈𝑅𝑛 states that the (boldface lowercase) vector value 𝑥 is a member of the 𝑛-dimensional vector space of real numbers, 𝑅𝑛.

Sometimes it is necessary to identify the components of a vector explicitly. The 𝑖th scalar element of a vector is written as 𝑥𝑖. Notice that this is non-bold lowercase since the element is a scalar. An 𝑛-dimensional vector itself can be explicitly written using the following notation:

x = [x1, x2, …, xn]ᵀ

Given that scalars exist to represent values why are vectors necessary? One of the
primary use cases for vectors is to represent physical quantities that have both
a magnitude and a direction. Scalars are only capable of representing magnitudes.

For instance scalars and vectors encode the difference between the speed of a car
and its velocity. The velocity contains not only its speed but also its direction of
travel.

Matrices:

Matrices are rectangular arrays consisting of numbers and are an example of 2nd-order tensors. If 𝑚 and 𝑛 are positive integers, that is 𝑚, 𝑛 ∈ 𝑁, then the 𝑚×𝑛 matrix contains 𝑚𝑛 numbers, with 𝑚 rows and 𝑛 columns.

If all of the scalars in a matrix are real-valued then a matrix is denoted with
uppercase boldface letters, such as 𝐴∈𝑅𝑚×𝑛. That is the matrix lives in a 𝑚×𝑛-
dimensional real-valued vector space. Hence matrices are really vectors that are
just written in a two-dimensional table-like manner.

Its components are now identified by two indices 𝑖 and 𝑗. 𝑖 represents the index to
the matrix row, while 𝑗 represents the index to the matrix column. Each component
of 𝐴 is identified by 𝑎𝑖𝑗.

The full 𝑚×𝑛 matrix can be written as:

A = [ a11  a12  a13  …  a1n
      a21  a22  a23  …  a2n
      a31  a32  a33  …  a3n
       ⋮    ⋮    ⋮   ⋱   ⋮
      am1  am2  am3  …  amn ]
It is often useful to abbreviate the full matrix component display into the following
expression:

𝐴=[𝑎𝑖𝑗]𝑚×𝑛

Where 𝑎𝑖𝑗 is referred to as the (𝑖,𝑗)-element of the matrix 𝐴. The subscript 𝑚×𝑛 can be dropped if the dimension of the matrix is clear from the context.

Note that a column vector is a size 𝑚×1 matrix, since it has 𝑚 rows and 1 column.
Unless otherwise specified all vectors will be considered to be column vectors.

Matrices represent a type of function known as a linear map. It is possible to define multiplication operations between matrices or between matrices and vectors. Such operations are immensely important across the physical sciences, quantitative finance, computer science and machine learning.

Matrices can encode geometric operations such as rotation, reflection and transformation.

In deep learning neural network weights are stored as matrices, while feature
inputs are stored as vectors.
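As a concrete companion to the definitions above, here is a minimal NumPy sketch (not part of the original notes) showing a scalar, a vector and a matrix as 0th-, 1st- and 2nd-order tensors; the particular values are arbitrary.

import numpy as np

# A scalar is a 0th-order tensor: a single number.
x = np.array(3.5)
print(x.ndim, x.shape)          # 0 ()

# A vector is a 1st-order tensor: an ordered array of numbers.
v = np.array([1.0, 2.0, 3.0])   # v lives in R^3
print(v.ndim, v.shape)          # 1 (3,)
print(v[0])                     # the i-th scalar element, here x1 = 1.0

# A matrix is a 2nd-order tensor: an m x n rectangular array.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # A lives in R^(2x3)
print(A.ndim, A.shape)          # 2 (2, 3)
print(A[1, 2])                  # element a_ij with i=2, j=3 (0-based indices)

# A matrix acts as a linear map: multiplying A by a vector in R^3 gives a vector in R^2.
print(A @ v)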

What is Statistics?
Statistics is the science of collecting, organizing, analyzing, interpreting,
and presenting data. It encompasses a wide range of techniques for
summarizing data, making inferences, and drawing conclusions.
Statistical methods help quantify uncertainty and variability in data,
allowing researchers and analysts to make data-driven decisions with
confidence.
We need to highlight some basic concepts in statistics, such as the following:

• Probabilities

• Distributions

• Likelihood

We also want to highlight the distinction between descriptive statistics and inferential statistics. Descriptive statistics include the following:
• Histograms

• Boxplots

• Scatterplots

• Mean

• Standard deviation

• Correlation coefficient

In contrast, inferential statistics are concerned with techniques for generalizing from a sample to a population. Here are some examples of inferential statistics:

• p-values

• credibility intervals

The relationship between probability and inferential statistics:

• Probability reasons from the population to the sample (deductive reasoning)

• Inferential statistics reason from the sample to the population

Probability is a fundamental concept in mathematics that deals with the likelihood of an event occurring. It allows us to predict how likely certain outcomes are. We define the probability of an event E as a number always between 0 and 1. In this context, the value 0 means that the event E has no chance of occurring, and the value 1 means that the event E is certain to occur. Many times we'll see this probability expressed as a floating-point number, but we can also express it as a percentage between 0 and 100 percent; we will not see valid probabilities lower than 0 percent or greater than 100 percent. An example would be a probability of 0.35 expressed as 35 percent (e.g., 0.35 x 100 == 35 percent).

The canonical example of measuring probability is observing how many times a fair coin flipped
comes up heads or tails (e.g., 0.5 for each side). The probability of the sample space is always 1
because the sample space represents all possible outcomes for a given trial. As we can see with the
two outcomes (“heads” and its complement, “tails”) for the flipped coin, 0.5 + 0.5 == 1.0 because the
total probability of the sample space must always add up to 1. We express the probability of an
event as follows:

P( E ) = 0.5

And we read this like so:

The probability of an event E is 0.5

Conditional Probabilities
When we want to know the probability of a given event based on the existing presence of another event occurring, we express this as a conditional probability. This is expressed in the literature in the form:

P( E | F )
where:

E is the event for which we’re interested in a probability.

F is the event that has already occurred.

An example would be expressing how a person with a healthy heart rate has a lower probability of
ICU death during a hospital visit:

P(ICU Death | Poor Heart Rate) > P(ICU Death | Healthy Heart Rate)

What Is Posterior Probability?


Posterior probability is a concept in probability theory that represents the updated
probability of an event occurring after accounting for new information. It takes into account
prior beliefs (a priori probabilities) and adjusts them based on observed data or evidence. In
other words, it reflects our updated belief about an event given additional information.
For example, let there be two vases, vase A having 5 black balls and 10
red balls and vase B having 10 black balls and 5 red balls. Now if a vase
is selected at random, the probability that vase A is chosen is 0.5. This is
the a priori probability. If we are given an additional piece of information
that a ball was drawn at random from the selected vase, and that ball was
black, what is the probability that the chosen vase is vase A? Posterior
probability takes into account this additional information and revises the
probability downward from 0.5 to 0.333 according to Bayes' theorem,
because a black ball is more probable from vase B than vase A.
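To make the arithmetic concrete, the following short Python sketch (added here for illustration, not from the original text) reproduces the vase calculation using Bayes' theorem; the counts match the example above.

# Posterior probability for the vase example via Bayes' theorem.
prior_A, prior_B = 0.5, 0.5           # each vase equally likely a priori
p_black_given_A = 5 / 15              # vase A: 5 black out of 15 balls
p_black_given_B = 10 / 15             # vase B: 10 black out of 15 balls

# P(black) by the law of total probability
p_black = prior_A * p_black_given_A + prior_B * p_black_given_B

# P(A | black) = P(black | A) * P(A) / P(black)
posterior_A = p_black_given_A * prior_A / p_black
print(round(posterior_A, 3))          # 0.333, revised down from the prior 0.5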

Distributions

A probability distribution is a specification of the stochastic structure of random variables. In statistics, we rely on making assumptions about how the data is distributed to make inferences about the data. We want a formula that specifies how frequent values of observations in the distribution are and how values can be taken by points in the distribution. A common distribution is known as the normal distribution (also called the Gaussian distribution, or the bell curve). We like to fit a dataset to a distribution because if the dataset is reasonably close to the distribution, we can make assumptions based on the theoretical distribution when we operate with the data.

We classify distributions as continuous or discrete. A discrete distribution has data that can assume only certain values. In a continuous distribution, data can be any value within the range. An example of a continuous distribution is the normal distribution; an example of a discrete distribution is the binomial distribution.

The normal distribution allows us to assume that sampling distributions of statistics (e.g., the “sample mean”) are normally distributed under specified conditions.
Other relevant distributions in machine learning include the following:

• Binomial distribution

• Inverse Gaussian distribution

• Log normal distribution
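As a small illustration of continuous versus discrete distributions, here is a hedged NumPy sketch (not part of the original notes); the sample sizes and parameters are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

# Continuous distribution: samples can take any value in a range.
normal_samples = rng.normal(loc=0.0, scale=1.0, size=5)
print(normal_samples)            # real-valued draws centered around 0

# Discrete distribution: samples can assume only certain values (here 0..10).
binomial_samples = rng.binomial(n=10, p=0.5, size=5)
print(binomial_samples)          # integer counts of successes in 10 trials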

Likelihood
When we discuss the likeliness that an event will occur yet do not specifically reference its numeric probability, we are using the informal term likelihood. Typically, when we use this term, we are talking about an event that has a reasonable probability of happening but still might not; there might also be factors not yet observed that will influence the event. Informally, likelihood is also used as a synonym for probability.
Regression Analysis in Machine Learning
Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us to understand how the value of the dependent variable changes with respect to one independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A, which runs various advertisements every year and obtains sales based on them. The list below shows the advertising spend of the company in the last 5 years and the corresponding sales:

Now, the company wants to spend $200 on advertising in the year 2019 and wants to know the prediction about the sales for this year. To solve such prediction problems in machine learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict the continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time series modeling, and determining the cause-and-effect relationship between variables.

In regression, we plot a line or curve between the variables that best fits the given data points; using this fit, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through the data points on the target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimized." The distance between the data points and the line tells whether a model has captured a strong relationship or not.

Some examples of regression can be as:

o Prediction of rain using temperature and other factors


o Determining Market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to Regression Analysis:
o Dependent Variable: The main factor in Regression analysis which we want to
predict or understand is called the dependent variable. It is also called target
variable.
o Independent Variable: The factors which affect the dependent variables or which are
used to predict the values of the dependent variables are called independent
variable, also called as a predictor.
o Outliers: An outlier is an observation with either a very low or a very high value in comparison to the other observed values. An outlier may distort the results, so it should be handled carefully.
o Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting. And if our algorithm does not perform well even with the training dataset, the problem is called underfitting.
Types of Regression
There are various types of regression which are used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all the regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression

Linear Regression:

o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest and easiest algorithms; it works on regression and shows the relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent
variable (X-axis) and the dependent variable (Y-axis), hence called linear
regression.
o If there is only one input variable (x), then such linear regression is
called simple linear regression. And if there is more than one input variable,
then such linear regression is called multiple linear regression.
o The relationship between variables in the linear regression model can be
explained using the below image. Here we are predicting the salary of an
employee on the basis of the year of experience.
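The following scikit-learn sketch illustrates simple linear regression in the spirit of the salary-versus-experience example above; the numbers are made up for illustration and are not taken from the notes.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience (X) and salary (y); illustrative only.
X = np.array([[1], [2], [3], [4], [5]])            # one input variable -> simple linear regression
y = np.array([30000, 35000, 41000, 45000, 50000])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)               # slope and intercept of the best-fit line
print(model.predict([[6]]))                        # predicted salary for 6 years of experience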

Logistic Regression:

o Logistic regression is another supervised learning algorithm which is used to solve classification problems. In classification problems, we have dependent variables in a binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or
1, Yes or No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it is different from the linear regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function (or logistic function), which maps any real-valued input to a value between 0 and 1. This sigmoid function is used to model the data in logistic regression. The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output value between 0 and 1
o x = input to the function
o e = base of the natural logarithm

When we provide the input values (data) to the function, it gives the S-curve as
follows:

o It uses the concept of a threshold level: values above the threshold are rounded up to 1, and values below the threshold are rounded down to 0, as sketched below.
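A minimal sketch of the sigmoid function and the threshold rule described above (illustrative only; the scores and the 0.5 threshold are assumptions):

import numpy as np

def sigmoid(x):
    # Logistic function: maps any real input to a value between 0 and 1.
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probs = sigmoid(scores)
labels = (probs >= 0.5).astype(int)   # values at or above the threshold map to 1, below to 0
print(probs)                          # S-curve values between 0 and 1
print(labels)                         # e.g. [0 0 1 1 1]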

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)

o Below is the mathematical equation for linear regression:

Y = aX + b

Here, Y = dependent variable (target variable), X = independent variable (predictor variable), and a and b are the linear coefficients.

Polynomial Regression:

o Polynomial regression is a type of regression which models a non-linear dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between
the value of x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a
non-linear fashion, so for such case, linear regression will not best fit to those
datapoints. To cover such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into
polynomial features of given degree and then modeled using a linear
model. Which means the datapoints are best fitted using a polynomial line.

o The equation for polynomial regression is also derived from the linear regression equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + … + bnxⁿ.
o Here Y is the predicted/target output, b0, b1,... bn are the regression
coefficients. x is our independent/input variable.
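A short scikit-learn sketch of polynomial regression as described above, assuming a made-up quadratic dataset and degree 2; the specific values are illustrative only:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical non-linear data: y roughly follows a quadratic in x.
x = np.arange(-5, 6).reshape(-1, 1).astype(float)
y = 2.0 + 1.5 * x.ravel() + 0.8 * x.ravel() ** 2

# Transform the original feature into polynomial features of degree 2,
# then fit an ordinary linear model on those features.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[7.0]]))   # prediction from the fitted polynomial curve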

Classification
Classification is a supervised machine learning method where the model tries to predict the
correct label of a given input data. In classification, the model is fully trained using the
training data, and then it is evaluated on test data before being used to perform prediction on
new unseen data.
For instance, an algorithm can learn to predict whether a given email is spam or ham (not spam), as illustrated below.
Before diving into the classification concept, we will first understand the difference between
the two types of learners in classification: lazy and eager learners. Then we will clarify the
misconception between classification and regression.
Lazy Learners Vs. Eager Learners

There are two types of learners in machine learning classification: lazy and eager
learners.

Eager learners are machine learning algorithms that first build a model from the training dataset before making any predictions on future datasets. They spend more time during the training process because they aim for better generalization by learning the weights, but they require less time to make predictions.

Most machine learning algorithms are eager learners, and below are some examples:

 Logistic Regression.
 Support Vector Machine.
 Decision Trees.
 Artificial Neural Networks.
Lazy learners or instance-based learners, on the other hand, do not create any model
immediately from the training data, and this is where the lazy aspect comes from. They just
memorize the training data, and each time there is a need to make a prediction, they search
for the nearest neighbor from the whole training data, which makes them very slow during
prediction. Some examples of this kind are:
 K-Nearest Neighbor.
 Case-based reasoning.

Evaluating a Classification model:


Once our model is completed, it is necessary to evaluate its performance, whether it is a classification or regression model. For evaluating a classification model, we have the following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
o For a good binary Classification model, the value of log loss should be near to 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o The lower log loss represents the higher accuracy of the model.
o For Binary classification, cross-entropy can be calculated as:

−(y log(p) + (1 − y) log(1 − p))

Where y = actual output and p = predicted output.
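A small Python sketch (illustrative, not from the original text) of the binary cross-entropy formula above, averaged over a handful of made-up predictions:

import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-12):
    # -(y*log(p) + (1-y)*log(1-p)), averaged over all examples.
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1, 0, 1, 1])
p_good = np.array([0.9, 0.1, 0.8, 0.7])   # predictions close to the actual values
p_bad  = np.array([0.4, 0.6, 0.3, 0.2])   # predictions far from the actual values
print(binary_log_loss(y_true, p_good))    # small value, near 0
print(binary_log_loss(y_true, p_bad))     # noticeably larger value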

2. Confusion Matrix:

o The confusion matrix provides us with a matrix/table as output and describes the performance of the model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, showing the total numbers of correct predictions and incorrect predictions. The matrix looks like the table below:

                      Actual Positive     Actual Negative
Predicted Positive    True Positive       False Positive
Predicted Negative    False Negative      True Negative
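A brief sketch using scikit-learn's confusion_matrix on made-up labels; note that scikit-learn places actual classes on the rows and predicted classes on the columns, which is the transpose of the table above:

from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0]

# scikit-learn convention: rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_actual, y_predicted)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tp, fp, fn, tn)   # true positives, false positives, false negatives, true negatives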


3. AUC-ROC curve:

o ROC stands for Receiver Operating Characteristic curve, and AUC stands for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different classification thresholds.
o We use the AUC-ROC curve to visualize and compare the performance of classification models.
o The ROC curve is plotted with TPR against FPR, where TPR (True Positive Rate) is on the Y-axis and FPR (False Positive Rate) is on the X-axis.
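A short sketch computing the ROC curve and AUC with scikit-learn on made-up labels and scores (illustrative only):

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)      # FPR on the X-axis, TPR on the Y-axis
auc = roc_auc_score(y_true, y_scores)
print(list(zip(thresholds, fpr, tpr)))
print(auc)   # area under the ROC curve: 1.0 is perfect, 0.5 is random guessing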

Overfitting and Underfitting in Machine Learning
Overfitting and Underfitting are the two main problems that occur in machine
learning and degrade the performance of the machine learning models.

The main goal of each machine learning model is to generalize well. Here, generalization refers to the ability of an ML model to provide suitable outputs for previously unseen inputs. It means that after being trained on the dataset, the model can produce reliable and accurate output. Hence, underfitting and overfitting are the two conditions that need to be checked to judge whether the model is generalizing well or not.

Before understanding overfitting and underfitting, let's understand some basic terms that will help to understand this topic well:

o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying
the machine learning algorithms. Or it is the difference between the predicted values
and the actual values.
o Variance: If the machine learning model performs well with the training dataset, but
does not perform well with the test dataset, then variance occurs.

Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more data points than required, in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and these factors reduce the efficiency and accuracy of the model. An overfitted model has low bias and high variance.

The chances of overfitting increase the more we train our model: the longer we train, the greater the chance of producing an overfitted model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of the overfitting can be understood by the below graph of
the linear regression output:

As we can see from the above graph, the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not, because the goal of the regression model is to find the best-fit trend; here the model has fit the training points too closely, so it will generate prediction errors on new data.
How to avoid Overfitting in a Model

Both overfitting and underfitting degrade the performance of a machine learning model, but overfitting is the more common problem, so there are some ways by which we can reduce its occurrence in our model (a short sketch of two of these techniques follows the list):

o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
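As a brief illustration of two of the techniques listed above (cross-validation and regularization), here is a sketch using scikit-learn on synthetic data; the dataset and the regularization strength are assumptions for illustration, not part of the original notes:

import numpy as np
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=50)   # only the first feature truly matters

# Cross-validation estimates how well each model generalizes to unseen data.
plain = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
ridge = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(plain.mean(), ridge.mean())   # compare mean CV scores; the Ridge model is regularized and less prone to overfitting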

Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data may be stopped at an early stage, due to which the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the training
data, and hence it reduces the accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

Example: We can understand underfitting using the output of the linear regression model below:

As we can see from the diagram, the model is unable to capture the data points present in the plot.

How to avoid underfitting:

o By increasing the training time of the model.
o By increasing the number of features.

Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine
learning models to achieve the goodness of fit. In statistics modeling, it defines how
closely the result or predicted values match the true values of the dataset.

The model with a good fit is between the underfitted and overfitted model, and
ideally, it makes predictions with 0 errors, but in practice, it is difficult to achieve it.

As we train our model over time, the errors on the training data go down, and initially the same happens with the test data. But if we train the model for too long, its performance may decrease due to overfitting, as the model also learns the noise present in the dataset. The errors on the test dataset then start increasing, so the point just before the test error starts rising is the good point, and we can stop there to achieve a good model.

There are two other methods by which we can find a good point for our model: the resampling method to estimate model accuracy, and a validation dataset.
Gradient Descent Overview:

Gradient descent is one of the most commonly used optimization algorithms to train machine learning models by minimizing the error between actual and expected results. Gradient descent is also used to train neural networks.

In mathematical terminology, an optimization algorithm refers to the task of minimizing or maximizing an objective function f(x) parameterized by x.

What is Gradient Descent or Steepest Descent?

Gradient descent was initially proposed by Augustin-Louis Cauchy in the mid-19th century. Gradient descent is defined as one of the most commonly used iterative optimization algorithms of machine learning, used to train machine learning and deep learning models. It helps in finding the local minimum of a function.

The best way to define the local minimum or local maximum of a function using
gradient descent is as follows:

o If we move in the direction of the negative gradient, i.e., away from the gradient of the function at the current point, we move towards a local minimum of that function.
o If we move in the direction of the positive gradient, i.e., towards the gradient of the function at the current point, we move towards a local maximum of that function; this procedure is known as gradient ascent.

Gradient descent, also known as steepest descent, has as its main objective to minimize the cost function using iteration. To achieve this goal, it performs two steps iteratively:

o Calculate the first-order derivative of the function to compute the gradient, or slope, at the current point.
o Move in the direction opposite to the gradient, stepping away from the current point by alpha times the gradient, where alpha is the learning rate. The learning rate is a tuning parameter in the optimization process which helps to decide the length of the steps.

What is Cost-function?
The cost function is defined as the measurement of the difference, or error, between actual values and predicted values at the current position, expressed as a single real number. It helps to improve machine learning efficiency by providing feedback to the model so that it can minimize the error and find the local or global minimum. The algorithm continuously iterates along the direction of the negative gradient until the cost function approaches its minimum; at that point, the model stops learning further. Although the terms cost function and loss function are often used synonymously, there is a minor difference between them: the loss function refers to the error of one training example, while the cost function calculates the average error across an entire training set.
How does Gradient Descent work?
Before starting the working principle of gradient descent, we should know some
basic concepts to find out the slope of a line from linear regression. The equation for
simple linear regression is given as:

Y = mX + c

Where 'm' represents the slope of the line, and 'c' represents the intercepts on the y-
axis.

The starting point (shown in the figure above) is used to evaluate the performance, as it is considered just an arbitrary point. At this starting point, we will derive the first derivative, or slope, and then use a tangent line to calculate the steepness of this slope. Further, this slope will inform the updates to the parameters (weights and bias).

The slope is steeper at the starting (arbitrary) point, but as new parameters are generated, the steepness gradually reduces until it approaches the lowest point, which is called the point of convergence.

The main objective of gradient descent is to minimize the cost function, i.e., the error between expected and actual values. To minimize the cost function, two factors are required:

o Direction & Learning Rate

These two factors determine the partial-derivative calculations of future iterations and allow the algorithm to reach the point of convergence, i.e., a local or global minimum. Let's discuss the learning rate briefly:

Learning Rate:
It is defined as the step size taken to reach the minimum, or lowest point. This is typically a small value that is evaluated and updated based on the behavior of the cost function. If the learning rate is high, the algorithm takes larger steps but risks overshooting the minimum. A low learning rate, on the other hand, gives small step sizes, which compromises overall efficiency but gives the advantage of more precision.
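A minimal sketch of gradient descent applied to the simple linear regression Y = mX + c, minimizing the mean squared error; the data, the learning rate and the number of iterations are illustrative assumptions, not taken from the notes:

import numpy as np

# Hypothetical data roughly following Y = 2X + 1.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

m, c = 0.0, 0.0          # initial (arbitrary) starting point
alpha = 0.01             # learning rate: the step size towards the minimum
n = len(X)

for _ in range(2000):
    Y_pred = m * X + c
    error = Y_pred - Y
    # First-order derivatives of the mean-squared-error cost w.r.t. m and c.
    grad_m = (2.0 / n) * np.sum(error * X)
    grad_c = (2.0 / n) * np.sum(error)
    # Move in the direction opposite to the gradient.
    m -= alpha * grad_m
    c -= alpha * grad_c

print(m, c)   # should approach the slope and intercept of the best-fit line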

Types of Gradient Descent


Based on the error in various training models, the Gradient Descent learning
algorithm can be divided into Batch gradient descent, stochastic gradient
descent, and mini-batch gradient descent. Let's understand these different types
of gradient descent:

1. Batch Gradient Descent:


Batch gradient descent (BGD) computes the error for each point in the training set and updates the model only after evaluating all training examples. One such pass over the data is known as a training epoch. In simple words, it is an approach where we sum over all examples for each update.
Advantages of Batch gradient descent:

o It produces less noise in comparison to the other types of gradient descent.
o It produces stable gradient descent convergence.
o It is computationally efficient, as all resources are used for all training samples.

2. Stochastic gradient descent


Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration. In other words, it updates the parameters for each training example, one at a time. As it requires only one training example at a time, it is easier to store in allocated memory. However, it loses some computational efficiency in comparison to batch gradient descent, because the frequent updates require more computation. Further, due to the frequent updates, the gradient estimates are noisy. However, this noise can sometimes be helpful for escaping local minima and finding the global minimum.

Advantages of Stochastic gradient descent:

In stochastic gradient descent (SGD), learning happens on every example, and it has a few advantages over the other types of gradient descent:

o It is easier to fit in the available memory.
o It is relatively fast to compute compared with batch gradient descent.
o It is more efficient for large datasets.

3. Mini-Batch Gradient Descent:

Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and then performs updates on each batch separately. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we can achieve a type of gradient descent with higher computational efficiency and a less noisy gradient.

Advantages of Mini Batch gradient descent:

o It is easier to fit in allocated memory.
o It is computationally efficient.
o It produces stable gradient descent convergence.

Activation Functions in Neural Networks
A pattern for information processing that draws inspiration from the brain is called an artificial neural network (ANN). ANNs learn via imitation, just like people do. Through a learning process, an ANN is tailored for a particular purpose, such as pattern classification or data classification.

Which activation function to employ in the hidden layers and at the output level of the network is one of the decisions you get to make while creating a neural network. This section discusses a few of the alternatives.

The nerve impulse in neurology serves as a model for activation functions in computer science. A chain reaction permits a neuron to "fire" and send a signal to nearby neurons if the induced voltage between its interior and exterior exceeds a threshold value known as the action potential. The resulting series of activations, known as a "spike train," enables motor neurons to transfer commands from the brain to the limbs and sensory neurons to transmit sensation from the digits to the brain.

Neural Network Components


Layers are the vertically stacked parts that make up a neural network. In the image, the dotted lines each signify a layer. A neural network has three different types of layers.

Input Layer
The input layer is first. The data will be accepted by this layer and forwarded to the
remainder of the network. This layer allows feature input. It feeds the network with
data from the outside world; no calculation is done here; instead, nodes simply
transmit the information (features) to the hidden units.

Hidden Layer
Since they are a component of the abstraction that any neural network provides, the nodes in this layer are not visible to the outside world. Any features entered through the input layer are processed by the hidden layer, with the results being sent to the output layer. The hidden layer is the second kind of layer; a neural network has one or more hidden layers (in the example above, the number is 1). In practice, hidden layers are what give neural networks their exceptional performance. They carry out several tasks concurrently, including data transformation and automatic feature generation.

Output Layer
This layer presents the knowledge that the network has acquired to the outside world. The output layer is the final kind of layer, and it contains the answer to the problem. We receive output from the output layer after passing raw inputs (such as photos) to the input layer.

Data science makes extensive use of the rectified linear unit (ReLU) function and of the family of sigmoid functions, which includes the logistic function, the hyperbolic tangent, and the arctangent function.

Activation Function
Definition
In artificial neural networks, an activation function is one that outputs a smaller value for small inputs and a larger value if its inputs exceed a threshold. An activation function "fires" if the inputs are big enough; otherwise, nothing happens. An activation function, then, is a gate that checks whether an incoming value is higher than a threshold value.

Activation functions are helpful because they introduce non-linearities into neural networks and enable the networks to learn powerful operations. If the activation functions were removed, a feedforward neural network could be refactored into a straightforward linear function or matrix transformation of its input.

By generating a weighted total of the inputs and then adding a bias to it, a neuron produces a value; the activation function then determines whether the neuron should be turned on. The activation function seeks to introduce nonlinearity into a neuron's output.

Explanation: As we are aware, neurons in neural networks operate in accordance with their weights, biases, and corresponding activation functions. Based on the error, the values of the weights and biases inside a neural network are modified. This process is known as back-propagation. Back-propagation is made possible by activation functions, since they provide the gradients along with the error required to adjust the biases and weights.

Linear
A linear transform (see Figure 2-11) is basically the identity function, f(x) = Wx, where the dependent variable has a direct, proportional relationship with the independent variable. In practical terms, it means the function passes the signal through unchanged.

We see this activation function used in the input layer of neural networks.

Sigmoid
Like all logistic transforms, sigmoids can reduce extreme values or outliers in data without
removing them. The vertical line in Figure 2-12 is the decision boundary.
A sigmoid function is a machine that converts independent variables of near infinite range
into simple probabilities between 0 and 1, and most of its output will be very close to 0 or 1.

Understanding Sigmoid Output

A sigmoid activation function outputs an independent probability for each class.

Tanh
Pronounced “tanch,” tanh is a hyperbolic trigonometric function (see Figure 2-13). Just as the tangent represents a ratio between the opposite and adjacent sides of a right triangle, tanh represents the ratio of the hyperbolic sine to the hyperbolic cosine: tanh(x) = sinh(x) / cosh(x). Unlike the sigmoid function, the normalized range of tanh is –1 to 1. The advantage of tanh is that it can deal more easily with negative numbers.
Hard Tanh
Similar to tanh, hard tanh simply applies hard caps to the normalized range. Anything more
than 1 is made into 1, and anything less than –1 is made into –1. This allows for a more
robust activation function that allows for a limited decision boundary.
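A minimal NumPy sketch of the activation functions discussed above (linear, sigmoid, tanh, hard tanh, plus the ReLU mentioned earlier); the input values are arbitrary and the code is illustrative only:

import numpy as np

def linear(x):      return x                        # identity: passes the signal through unchanged
def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x)) # squashes input into (0, 1)
def tanh(x):        return np.tanh(x)               # sinh(x)/cosh(x), range (-1, 1)
def hard_tanh(x):   return np.clip(x, -1.0, 1.0)    # caps anything beyond [-1, 1]
def relu(x):        return np.maximum(0.0, x)       # rectified linear unit

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
for f in (linear, sigmoid, tanh, hard_tanh, relu):
    print(f.__name__, np.round(f(x), 3))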

Softmax

Softmax is a generalization of logistic regression inasmuch as it can be applied to continuous data (rather than classifying binary data) and can contain multiple decision boundaries. It handles multinomial labeling systems. Softmax is the function you will often find at the output layer of a classifier.

Understanding Softmax Output

The softmax activation function returns the probability distribution over mutually exclusive
output classes.

To further illustrate the idea of the softmax output layer and how to use it, let’s consider two
use cases. If we have a multiclass modeling problem yet we care only about the best score
across these classes, we’d use a softmax output layer with an argmax() function to get the
highest score of all the classes.
Dealing with Multiple Classifications

If we want to get multiple classifications per output (e.g., “person + car”), we do not want softmax as an output layer. Instead, we'd use the sigmoid output layer, giving us a probability for every class independently.
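A short sketch contrasting a softmax output (mutually exclusive classes, probabilities summing to 1, argmax picking the single best class) with independent sigmoid outputs for multi-label classification; the class names and scores are made up for illustration:

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; outputs sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, 1.0, 0.1])        # raw scores for hypothetical classes: person, car, tree

probs = softmax(scores)
print(probs, probs.sum())                  # probability distribution over mutually exclusive classes
print(int(np.argmax(probs)))               # index of the single best class

multi = sigmoid(scores)
print(multi)                               # independent probability per class, e.g. "person + car"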

For the case in which we have a large set of labels (e.g., thousands of labels), we'd use the variant of the softmax activation function called the hierarchical softmax activation function. This variant decomposes the labels into a tree structure, and the softmax classifier is trained at each node of the tree to direct the branching for classification.
