
Linear Models (Unit II) Chapter III 1

The document discusses linear models and optimization techniques for machine learning. It introduces linear classifiers like perceptrons and discusses how to optimize non-convex loss functions using convex surrogate loss functions. It describes four common surrogate loss functions and explains how weight regularization and gradient descent can be used to minimize the regularized objective function. Gradient descent algorithms like batch, stochastic, and mini-batch gradient descent are also summarized.

Uploaded by

Anil
Copyright: © All Rights Reserved

LINEAR MODELS

The term linear model implies that the model is specified as a linear combination
of features. Based on training data, the learning process computes one weight for
each feature to form a model that can predict or estimate the target value.

The perceptron is a linear classifier: it runs a particular algorithm until a linear
separator is found.

The perceptron algorithm will find a w with zero training error if the data is separable
‣ its efficiency depends on the margin and the norm of the data

However, if the data is not separable, minimizing the number of training errors
(the zero/one loss) is NP-hard
‣ i.e., there is no known efficient way to minimize it

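Optimization provides a way to minimize the loss function. In addition to
minimizing training error, we want a simpler model
‣ Remember our goal is to minimize generalization error
‣ Recall the bias and variance tradeoff for learners

We can add a regularization term R(w,b) that prefers simpler models
‣ For example, we may prefer decision trees of shallow depth

The regularized objective then has the form
$\min_{w,b} \sum_n \ell(y_n, w^T x_n + b) + \lambda R(w,b)$
where $\ell$ is the loss function.

Here λ is a hyperparameter of the optimization problem; it controls the tradeoff
between fitting the training data and keeping the model simple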

Convex Surrogate Loss Functions:-

Zero/one loss is hard to optimize
‣ Small changes in w can cause large changes in the loss

Surrogate loss: replace the zero/one loss by a smooth function
‣ A smooth version of the threshold function is known as a sigmoid/logistic function
‣ It gives a smooth transition between 0 and 1

$h_w(x) = g(w^T x)$, with $z = w^T x$ and $g(z) = \frac{1}{1 + e^{-z}}$,
where g is the sigmoid function

Decision Boundary –
y = 1: $h_w(x) \ge 0.5$, i.e. $w^T x \ge 0$
y = 0: $h_w(x) < 0.5$, i.e. $w^T x < 0$
The benefit of using such an S-function is that it is smooth, and potentially easier
to optimize. The difficulty is that it is not convex.
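
The sigmoid hypothesis and its decision rule can be sketched in a few lines of Python. This is only an illustration: the function names `sigmoid` and `predict` and the example weights are assumptions, not part of the original notes.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); fine for moderate |z|."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Return 1 when h_w(x) = g(w^T x) >= 0.5 (equivalently w^T x >= 0), else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if sigmoid(z) >= 0.5 else 0

# Hypothetical weights and inputs, chosen only for illustration.
w = [1.0, -1.0]
label_a = predict(w, [2.0, 1.0])   # w^T x = 1  -> class 1
label_b = predict(w, [1.0, 2.0])   # w^T x = -1 -> class 0
```

Note that thresholding g(z) at 0.5 gives exactly the same labels as thresholding z at 0, which is why the decision boundary is still linear.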

Error surface:- for a linear classifier, the hypothesis space is parameterized by w

Easier to optimize if the surrogate loss is convex

An easy mnemonic: a convex function is one that looks like a happy face, while a
concave function is one that looks like a sad face. There are
two equivalent definitions of a convex function (for twice-differentiable functions):

1. Its second derivative is always non-negative.

2. Any chord of the function lies above it.


A convex function is easy to minimize. This leads to the idea of convex surrogate
loss functions. Since zero/one loss is hard to optimize, you want to optimize
something else, instead. Since convex functions are easy to optimize, we want to
approximate zero/one loss with a convex function. This approximating function
will be called a surrogate loss. The surrogate losses we construct will always be
upper bounds on the true loss function: this guarantees that if you minimize the
surrogate loss, you are also pushing down the real loss.

There are four common surrogate loss functions, each with their own properties:
hinge loss, logistic loss, exponential loss and squared loss.
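Zero/one: $\ell^{(0/1)}(y, \hat{y}) = \mathbf{1}[y\hat{y} \le 0]$

Hinge: $\ell^{(hin)}(y, \hat{y}) = \max\{0, 1 - y\hat{y}\}$

Logistic: $\ell^{(log)}(y, \hat{y}) = \frac{1}{\log 2} \log(1 + \exp[-y\hat{y}])$

Exponential: $\ell^{(exp)}(y, \hat{y}) = \exp[-y\hat{y}]$

Squared: $\ell^{(sqr)}(y, \hat{y}) = (y - \hat{y})^2$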
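
As a rough check of the upper-bound property, the five losses can be written out directly in Python. The function names and the grid of test predictions are assumptions made for this sketch; labels are taken to be y in {-1, +1}.

```python
import math

def zero_one(y, y_hat):
    return 1.0 if y * y_hat <= 0 else 0.0

def hinge(y, y_hat):
    return max(0.0, 1.0 - y * y_hat)

def logistic(y, y_hat):
    # The 1/log(2) factor makes the loss equal 1 at y * y_hat = 0.
    return math.log(1.0 + math.exp(-y * y_hat)) / math.log(2.0)

def exponential(y, y_hat):
    return math.exp(-y * y_hat)

def squared(y, y_hat):
    return (y - y_hat) ** 2

surrogates = [hinge, logistic, exponential, squared]

# Every surrogate should be at least as large as the zero/one loss.
is_upper_bound = all(
    loss(y, y_hat) >= zero_one(y, y_hat)
    for loss in surrogates
    for y in (-1, 1)
    for y_hat in (-2.0, -0.5, 0.0, 0.5, 2.0)
)
```

On this small grid `is_upper_bound` comes out true, which matches the claim that minimizing a surrogate also pushes down the zero/one loss.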

Weight regularization:-

Regularization is a technique in which slight modifications are made to the learning
algorithm so that the model generalizes better. This in turn improves
the model's performance on test data or unseen data. Weight
regularization penalizes the weights (in a network, the weight matrices of the nodes).
It results in a simpler linear network and slight underfitting of the training data.

We have a regularization term R(w,b) added to the loss.

R(w,b) is a good regularization function when it encourages the weights:

‣ To be small:

➡ A change in the features causes only a small change in the score

➡ Robustness to noise

‣ To be sparse:

➡ Use as few features as possible

➡ Similar to controlling the depth of a decision tree


In L1 weight regularization, the sum of the absolute values of the weights is used to
calculate the size of the weights.

In L2 weight regularization, the sum of the squared values of the weights is used to
calculate the size of the weights. L2 regularization used in linear regression and logistic
regression is often referred to as Ridge Regression or Tikhonov regularization.

Both are based on the p-norm $\|w\|_p = \left(\sum_d |w_d|^p\right)^{1/p}$.
For example, with $w = (w_1, w_2)$ and $p = 2$: $\|w\|_2 = (|w_1|^2 + |w_2|^2)^{1/2}$
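
A small sketch of the p-norm may help here: p = 1 sums absolute values (the L1 size) and p = 2 takes the square root of the sum of squares (the L2 size). The helper name `p_norm` and the example vector are assumptions for illustration.

```python
def p_norm(w, p):
    """Return (sum_d |w_d|^p)^(1/p) for a list of weights w."""
    return sum(abs(wd) ** p for wd in w) ** (1.0 / p)

w = [3.0, -4.0]              # hypothetical weight vector
l1_size = p_norm(w, 1)       # |3| + |-4| = 7
l2_size = p_norm(w, 2)       # sqrt(9 + 16) = 5
l2_penalty = sum(wd * wd for wd in w)  # squared 2-norm, as used in ridge-style penalties
```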
General optimization framework:-

Select a suitable:

‣ convex surrogate loss

‣ convex regularization

Select the hyperparameter λ

Minimize the regularized objective with respect to w

This framework for optimization is called Tikhonov regularization or, more generally,
Structural Risk Minimization (SRM)

Optimization with Gradient Descent:-

Gradient descent is one of the most commonly used optimization
algorithms for training machine learning models: it minimizes the error
between actual and expected results. It is also used to train
neural networks. Its main objective is to minimize a function, ideally a
convex one, through iterative parameter updates.

Gradient descent was first proposed by Augustin-Louis Cauchy in the mid-19th
century (1847). It is an iterative optimization algorithm that
helps in finding a local minimum of a function.
The gradient is the slope; gradient descent moves downhill, i.e. opposite to the
direction of maximum slope.

The gradient is how much the output of a function (L) changes with respect to
a change in its input (w).

Mathematically, the gradient is the partial derivative of the function with respect to
each weight: the change in error with regard to a change in the weights.

For a function that has more than one input variable, the gradient is the vector of
partial derivatives; it plays the role of the slope of the function.

Gradient descent works best on convex functions, where any local minimum it
finds is also the global minimum.

Learning rates:- η is called the learning rate

The learning rate is a hyperparameter of the optimization algorithm that
determines the step size of the iterative search for the minimum of the loss.
In other words, the learning rate is the step size.

A higher learning rate allows the algorithm to learn faster, i.e. to update the weights
and biases faster, at the cost of possibly arriving at a sub-optimal solution.

A smaller learning rate tends to give a more optimal solution, but it may take
significantly longer to reach that solution.

We can also use an adaptive learning rate. With an adaptive learning rate, the algorithm
starts with larger steps and reduces them as training proceeds, which can reduce the
training time compared with a fixed learning rate.
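
The effect of the step size can be seen on a toy problem. The sketch below minimizes the hypothetical function f(w) = w^2 (gradient 2w) from w = 1 with three learning rates; the function, starting point and step counts are assumptions chosen only to make the behaviour easy to follow.

```python
def descend(eta, steps=20, w=1.0):
    """Run `steps` gradient descent updates on f(w) = w^2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w          # derivative of w^2
        w = w - eta * grad      # w <- w - eta * gradient
    return w

w_small = descend(0.01)   # small steps: still far from the minimum at 0
w_good = descend(0.1)     # moderate steps: close to 0 after 20 updates
w_large = descend(1.1)    # too large: each step overshoots and |w| grows
```

With η = 0.01 each update multiplies w by 0.98, with η = 0.1 by 0.8, and with η = 1.1 by -1.2, so the last run moves away from the minimum instead of toward it.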

As an example, choose exponential loss as the loss function and the 2-norm as the regularizer:

$L(w, b) = \sum_n \exp[-y_n (w^T x_n + b)] + \frac{\lambda}{2} \|w\|^2$

The only “strange” thing in this objective is that we have replaced λ with λ/2. The
reason for this change is just to make the gradients cleaner. We can first compute
derivatives with respect to b:

$\frac{\partial L}{\partial b} = -\sum_n y_n \exp[-y_n (w^T x_n + b)]$

The gradient with respect to w is

$\nabla_w L = -\sum_n y_n x_n \exp[-y_n (w^T x_n + b)] + \lambda w$

The update is of the form w ← w − η∇wL.

For poorly classified points, the gradient points in the direction −ynxn, so the
update is of the form

w ← w + cynxn, where c is some constant

Note that c is large for very poorly classified points and small for relatively well
classified points. Looking at the part of the gradient related to the regularization,

the update says (ignoring the step size η): w ← w − λw = (1 − λ)w. This has the effect of shrinking the
weights toward zero.
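
A minimal sketch of this update, assuming the objective is the sum of exponential losses plus (λ/2)·||w||², might look as follows. The toy data set, the values of λ and η, and the function names are all assumptions made for illustration.

```python
import math

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def objective(w, b, data, lam):
    loss = sum(math.exp(-y * score(w, b, x)) for x, y in data)
    return loss + 0.5 * lam * sum(wi * wi for wi in w)

def gradient_step(w, b, data, lam, eta):
    grad_w = [lam * wi for wi in w]      # regularizer part: lambda * w
    grad_b = 0.0
    for x, y in data:
        c = math.exp(-y * score(w, b, x))  # large for poorly classified points
        for d in range(len(w)):
            grad_w[d] -= c * y * x[d]      # loss part: -c * y_n * x_n
        grad_b -= c * y
    new_w = [wi - eta * gi for wi, gi in zip(w, grad_w)]
    return new_w, b - eta * grad_b

# Hypothetical, linearly separable toy data: (features, label in {-1, +1}).
data = [([1.0, 2.0], 1), ([2.0, 1.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w, b = [0.0, 0.0], 0.0
start = objective(w, b, data, lam=0.1)
for _ in range(100):
    w, b = gradient_step(w, b, data, lam=0.1, eta=0.05)
end = objective(w, b, data, lam=0.1)
```

The factor `c` in the loop is the constant from the text: it scales each point's contribution y_n·x_n by how badly that point is currently classified.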
Types of Gradient Descent:-

Based on how much of the training data is used to compute the error for each update,
the gradient descent learning algorithm can be divided into:

Batch gradient descent

Stochastic gradient descent

Mini-batch gradient descent

1. Batch Gradient Descent: Batch gradient descent (BGD) computes the
error for each point in the training set and updates the model only after evaluating
all training examples.

Say there are a total of ‘m’ observations in a data set. If we use all of these
observations to calculate the loss function for each update, this is known as batch gradient
descent.

Forward propagation and backward propagation are performed and then the parameters
are updated. Since batch gradient descent uses the entire training set for each update,
the parameters are updated only once per epoch.

One such pass over the training set is known as a training epoch. In simple words, it is a greedy
approach where we have to sum over all examples for each update.

Advantages of Batch gradient descent:

It produces less noise than the other types of gradient descent.
It produces stable gradient descent convergence.
It is computationally efficient, as all resources are used to process all training
samples together.
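
To make the "one update per epoch" idea concrete, here is a small sketch of batch gradient descent. It swaps the neural network mentioned above for a linear model with squared error, purely to keep the example short; the data, learning rate and function name are assumptions.

```python
def batch_gradient_descent(X, y, eta=0.1, epochs=500):
    """Fit w so that w^T x approximates y, using all m examples per update."""
    m = len(X)
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):            # sum over ALL examples first
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(len(w)):
                grad[j] += err * xi[j] / m
        # a single parameter update per epoch
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical data generated from y = 1 + 2 * x; the first feature is a constant 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w_batch = batch_gradient_descent(X, y)
```

On this data the weights should approach (1, 2), the intercept and slope used to generate it.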
2. Stochastic Gradient Descent: Stochastic gradient descent (SGD) is a type of
gradient descent that processes one training example per iteration. In other words,
within each training epoch it goes through the examples of the dataset one at a time
and updates the parameters after each example.
As it requires only one training example at a time, it is easier to fit in
allocated memory. However, it loses some computational efficiency in
comparison to batch gradient descent, because it performs far more frequent
updates.
Further, due to the frequent updates, its gradient is noisy. However,
this noise can sometimes be helpful for escaping a local minimum and finding
the global minimum.

Say we have 5 observations, each with three features, and the
values are completely random.
If we use SGD, we take the first observation, pass it through the
neural network, calculate the error and then update the parameters. The same is
then done for each remaining observation in turn.

Advantages of Stochastic gradient descent:

In stochastic gradient descent (SGD), learning happens on every example,
which gives it a few advantages over the other types of gradient descent.
It is easier to fit in allocated memory.
Each update is relatively fast to compute compared with batch gradient descent.
It is more efficient for large datasets.
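
The per-example update can be sketched as below. As a simplification, a linear model with squared error stands in for the neural network, and the data, learning rate, seed and function name are assumptions for illustration only.

```python
import random

def sgd(X, y, eta=0.05, epochs=500, seed=0):
    """Stochastic gradient descent: one parameter update per training example."""
    rng = random.Random(seed)       # fixed seed so the run is repeatable
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)          # visit the examples in random order
        for i in order:
            err = sum(wj * xj for wj, xj in zip(w, X[i])) - y[i]
            # update immediately, using only example i
            w = [wj - eta * err * xj for wj, xj in zip(w, X[i])]
    return w

# Hypothetical data generated from y = 1 + 2 * x; the first feature is a constant 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w_sgd = sgd(X, y)
```

Each epoch here performs four updates instead of one, which is the source of both the extra noise and the faster early progress described above.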
3. Mini-Batch Gradient Descent: Mini-batch gradient descent is a
combination of batch gradient descent and stochastic gradient descent. It
divides the training dataset into small batches and then performs an update on
each batch separately. Splitting the training dataset into smaller batches strikes a
balance between the computational efficiency of batch gradient descent and the
speed of stochastic gradient descent. Hence, we get a type of
gradient descent with higher computational efficiency and a less noisy
gradient.

Again, take the same example of 5 observations. Assume that the batch size is 2. We
take the first two observations, pass them through the neural network, calculate
the error and then update the parameters.

Then we take the next two observations and perform the same steps, i.e.
pass them through the network, calculate the error and update the parameters.

In the final iteration only a single observation is left, so we
update the parameters using this
observation alone.
Advantages of Mini Batch gradient descent:
It is easier to fit in allocated memory.
It is computationally efficient.
It produces stable gradient descent convergence.
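
The batching in the example above (5 observations, batch size 2, giving batches of 2, 2 and 1) can be sketched as follows. The linear model, data values and function names are assumptions; the original example uses a neural network with random feature values.

```python
def minibatches(n, batch_size):
    """Split indices 0..n-1 into consecutive batches; the last one may be smaller."""
    return [list(range(s, min(s + batch_size, n))) for s in range(0, n, batch_size)]

def minibatch_gd(X, y, batch_size=2, eta=0.05, epochs=500):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for batch in minibatches(len(X), batch_size):
            grad = [0.0] * len(w)
            for i in batch:          # average the gradient over this batch only
                err = sum(wj * xj for wj, xj in zip(w, X[i])) - y[i]
                for j in range(len(w)):
                    grad[j] += err * X[i][j] / len(batch)
            w = [wj - eta * gj for wj, gj in zip(w, grad)]   # one update per batch
    return w

batches = minibatches(5, 2)          # [[0, 1], [2, 3], [4]]

# Hypothetical data generated from y = 1 + 2 * x; the first feature is a constant 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
w_mini = minibatch_gd(X, y)
```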

Challenges with the Gradient Descent

Although gradient descent is one of the most popular methods for
optimization problems, it still has some challenges:

1. Local Minima and Saddle Points: For convex problems, gradient
descent can find the global minimum easily, while for non-convex
problems it is sometimes difficult to find the global minimum,
where the machine learning model achieves its best results.
Whenever the slope of the cost function is zero or close to
zero, the model stops learning. Apart from the global
minimum, two other situations can show this slope:
saddle points and local minima.
A local minimum has a shape similar to the global minimum:
the slope of the cost function increases on both sides of the
current point. At a saddle point the slope is also zero, but the cost
function rises in some directions and falls in others.

2. Vanishing and Exploding Gradients
In a deep neural network trained with gradient descent
and backpropagation, two more issues can occur besides local
minima and saddle points.
Vanishing Gradients:
A vanishing gradient occurs when the gradient is smaller than expected.
During backpropagation the gradient becomes progressively smaller, so the
earlier layers of the network learn more slowly than the later layers.
Once this happens, the weight updates
become insignificant.
Exploding Gradients:
An exploding gradient is the opposite of a vanishing gradient: it occurs when
the gradient is too large and creates an unstable model. In this scenario the
model weights grow so large that they are represented as NaN.

This problem can be mitigated using dimensionality reduction,
which helps to minimize complexity within the model.
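
A toy calculation shows why gradients shrink or blow up with depth. The sketch assumes a chain of sigmoid layers that all share one scalar weight, which is a deliberate oversimplification; the `clip_by_norm` helper illustrates gradient clipping, a widely used remedy for exploding gradients that these notes do not mention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, weight, z=0.0):
    """Product of per-layer factors weight * g'(z) across `depth` layers (toy model)."""
    s = sigmoid(z)
    local = weight * s * (1.0 - s)   # g'(z) = g(z)(1 - g(z)) is at most 0.25
    return local ** depth

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its 2-norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

vanishing = chain_gradient(depth=10, weight=1.0)   # 0.25 ** 10, nearly zero
exploding = chain_gradient(depth=10, weight=8.0)   # 2.0 ** 10, very large
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # direction kept, length cut to 1
```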
Support Vector Machine Algorithm
Support Vector Machine, or SVM, is one of the most popular supervised
learning algorithms. It is used for classification as well as regression problems,
but primarily for classification problems in machine learning.
The objective of the support vector machine algorithm is to find a
hyperplane in an N-dimensional space (N being the number of features) that
distinctly classifies the data points.

To separate the two classes of data points, there are many possible hyperplanes that
could be chosen.
Our objective is to find the plane that has the maximum margin, i.e. the maximum
distance to the nearest data points of both classes.
Maximizing the margin provides some reinforcement so that future data points
can be classified with more confidence.
The goal of the SVM algorithm is to create the best line or decision boundary that
segregates n-dimensional space into classes, so that we can easily put a new data
point in the correct category in the future. This best decision boundary is called a
hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine. Consider the below diagram, in which two different
categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the
KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a
model that can accurately identify whether it is a cat or a dog, such a model can
be created by using the SVM algorithm.
We first train our model with lots of images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this
strange creature.
Because the SVM creates a decision boundary between these two classes (cat and
dog) and chooses the extreme cases (support vectors), it will see the extreme cases of
cat and dog. On the basis of the support vectors, it will classify the creature as a cat.
Consider the below diagram:

The SVM algorithm can be used for face detection, image
classification, text categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset
can be classified into two classes by using a single
straight line, such data is termed linearly separable data, and
the classifier used is called a linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable
data. If a dataset cannot be classified by using a straight line,
such data is termed non-linear data, and the classifier used is called a
non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries that segregate the classes
in n-dimensional space, but we need to find the best decision boundary for
classifying the data points. This best boundary is known as the hyperplane
of the SVM.
The dimension of the hyperplane depends on the number of features in the dataset:
if there are 2 features (as shown in the image), the hyperplane is
a straight line, and if there are 3 features, the hyperplane is a 2-dimensional
plane.
We always create the hyperplane that has the maximum margin, which means the
maximum distance between the hyperplane and the nearest data points of each class.

Linear SVM:
o The working of the SVM algorithm can be understood by using an
example. Suppose we have a dataset that has two tags (green and blue)
and two features, x1 and x2. We want a classifier that can
classify a pair (x1, x2) of coordinates as either green or blue. Consider
the below image:

Since this is a 2-d space, we can easily separate
these two classes with a straight line. But there can be multiple lines that separate these
classes. Consider the below image:
The SVM algorithm helps to find the best line or decision boundary;
this best boundary is called the hyperplane.
The SVM algorithm finds the points of both classes that lie closest to the line.
These points are called support vectors.
The distance between the support vectors and the hyperplane is called the margin, and
the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
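
To illustrate what "maximum margin" means, the sketch below computes the margin of two candidate separating lines on a tiny, made-up dataset and keeps the one with the larger margin. It only compares two hand-picked lines and is not an SVM solver; the points, labels and line parameters are assumptions.

```python
import math

def margin(w, b, points):
    """Smallest signed distance from any point to the line w^T x + b = 0.

    A positive value means every point is on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in points
    )

# Hypothetical 2-d data: label +1 for "blue", -1 for "green".
points = [([2.0, 2.0], 1), ([3.0, 3.0], 1), ([0.0, 0.0], -1), ([-1.0, 0.0], -1)]

margin_a = margin([1.0, 1.0], -2.0, points)   # line x1 + x2 = 2
margin_b = margin([1.0, 0.0], -1.0, points)   # line x1 = 1
best = "A" if margin_a > margin_b else "B"
```

Both lines separate the two classes, but line A leaves more room around the closest points (2, 2) and (0, 0), which play the role of support vectors here.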

Non-Linear SVM:
If data is linearly arranged, we can separate it by using a straight line,
but for non-linear data we cannot draw a single straight line. Consider the
below image:
To separate these data points, we need to add one more dimension.

Consider the following sample of 2-d data, which we cannot separate using a
straight line.

For linear data we have used two dimensions, x and y, so for non-linear data
we add a third dimension z. It can be calculated as:
$z = x^2 + y^2$
By adding the third dimension, the sample space becomes as in the below
image:
Now SVM divides the dataset into classes in such a way that all
data points are classified properly.

Since we are in 3-d space, the boundary looks like a plane parallel to the x-axis.
If we convert it back to 2-d space with z = 1, it becomes the circle $x^2 + y^2 = 1$.

Hence we get a circumference of radius 1 in the case of non-linear data.
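
The extra dimension can be tried out on a few made-up points. In the sketch below, points near the origin and points farther away cannot be split by a straight line in (x, y), but after adding z = x^2 + y^2 the plane z = 1 separates them. The sample points and the threshold of 1 are assumptions chosen to match the radius-1 circle above.

```python
def lift(x, y):
    """Map a 2-d point to 3-d by adding z = x^2 + y^2."""
    return (x, y, x * x + y * y)

def classify(x, y, threshold=1.0):
    """Points below the plane z = threshold are 'inner', the rest 'outer'."""
    return "inner" if lift(x, y)[2] < threshold else "outer"

# Hypothetical samples: one class inside the unit circle, one outside it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]
outer = [(2.0, 0.0), (0.0, -1.5), (-1.2, 1.2)]

inner_labels = [classify(x, y) for x, y in inner]
outer_labels = [classify(x, y) for x, y in outer]
```

Seen from the original 2-d space, the condition z = 1 is exactly the circle of radius 1 around the origin.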
