Linear Models (Unit II) Chapter III
The term linear model implies that the model is specified as a linear combination
of features. Based on training data, the learning process computes one weight for
each feature to form a model that can predict or estimate the target value.
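As a minimal sketch of this idea (the feature values, weights and bias below are made up purely for illustration), the model's output is just a weighted sum of the features plus a bias:

import numpy as np

# Hypothetical weights learned from training data: one weight per feature.
w = np.array([0.8, -1.2, 0.3])
b = 0.5  # bias term

# A single example with three feature values (made up for illustration).
x = np.array([2.0, 1.0, 4.0])

# Prediction = weighted sum of the features plus the bias.
y_hat = np.dot(w, x) + b
print(y_hat)  # 2.0*0.8 + 1.0*(-1.2) + 4.0*0.3 + 0.5 = 2.1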
Decision Boundary –
y = 1: h(x) ≥ 0.5, which corresponds to wᵀx ≥ 0
y = 0: h(x) < 0.5, which corresponds to wᵀx < 0
The benefit of using such an S-function is that it is smooth, and potentially easier
to optimize. The difficulty is that it is not convex.
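Assuming the S-function here is the standard sigmoid applied to the linear score (a common choice, stated here as an assumption), the prediction and the 0.5 threshold can be written as:

\[
h(x) = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)}, \qquad h(x) \ge 0.5 \iff w^\top x \ge 0,
\]

since \( \sigma(0) = 0.5 \) and \( \sigma \) is monotonically increasing.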
A convex function is one that looks like a happy face; a concave function is one
that looks like a sad face (an easy mnemonic). There are two equivalent definitions
of a convex function: the chord joining any two points on the function lies above the
function, or, for twice-differentiable functions, the second derivative is everywhere
non-negative.
There are four common surrogate loss functions, each with its own properties:
hinge loss, logistic loss, exponential loss and squared loss.
Zero/one: \( \ell^{(0/1)}(y, \hat{y}) = \mathbf{1}[y\hat{y} \le 0] \)
Logistic: \( \ell^{(\mathrm{log})}(y, \hat{y}) = \frac{1}{\log 2} \log\left(1 + \exp[-y\hat{y}]\right) \)
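For completeness, the standard forms of the other losses named above are:

Hinge: \( \ell^{(\mathrm{hin})}(y, \hat{y}) = \max\{0,\, 1 - y\hat{y}\} \)
Exponential: \( \ell^{(\mathrm{exp})}(y, \hat{y}) = \exp[-y\hat{y}] \)
Squared: \( \ell^{(\mathrm{sqr})}(y, \hat{y}) = (y - \hat{y})^2 \)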
Weight regularization:-
➡ Robustness to noise
‣ Encouraging weights to be sparse
For example, for w = (w1, w2) and p = 2, the norm used as a regularizer is \( \|w\|_2 = \left(|w_1|^2 + |w_2|^2\right)^{1/2} \).
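More generally (extrapolating the p = 2 case above to the usual family of p-norms), the p-norm of the weight vector is:

\[
\|w\|_p = \Bigl(\sum_{d} |w_d|^p\Bigr)^{1/p}
\]

where p = 2 gives the Euclidean norm used for standard weight decay, and p = 1 gives \( \|w\|_1 = \sum_d |w_d| \), which tends to drive many weights exactly to zero and thus encourages sparsity.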
General optimization framework:-
Select a suitable:
‣ convex surrogate loss
‣ convex regularizer
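Putting these together (a standard formulation, with notation following the losses and norms above), the framework minimizes a regularized empirical loss:

\[
\min_{w,\, b} \; \sum_{n} \ell\bigl(y_n,\; w^\top x_n + b\bigr) \;+\; \lambda\, R(w)
\]

where \( \ell \) is the chosen convex surrogate loss, \( R(w) \) is the convex regularizer (e.g. \( \|w\|_2^2 \) or \( \|w\|_1 \)), and \( \lambda \ge 0 \) trades off fitting the training data against keeping the weights small.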
A higher learning rate allows the algorithm to learn faster, i.e. to update the weights
and biases more aggressively, but at the cost of possibly arriving at a sub-optimal solution.
A smaller learning rate can lead to a more optimal solution, but it may take
significantly longer to reach that solution.
We can also use an adaptive learning rate. An adaptive schedule starts with a larger
learning rate and gradually reduces it as training progresses, which reduces the training
time compared to using a small fixed learning rate.
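As a minimal sketch (the toy objective, decay rule and constants are made up for illustration), a simple schedule starts with a larger step size and shrinks it each epoch:

# Toy objective: f(w) = (w - 3)^2, with gradient 2 * (w - 3).
grad = lambda w: 2.0 * (w - 3.0)

w = 0.0
eta0 = 0.5  # initial (larger) learning rate
for epoch in range(50):
    eta = eta0 / (1.0 + 0.1 * epoch)  # decay the learning rate over time
    w = w - eta * grad(w)

print(w)  # close to the minimizer w = 3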
As a running example for gradient descent, consider minimizing the exponential loss with an L2 regularizer:

\[
\mathcal{L}(w, b) = \sum_{n} \exp\bigl[-y_n (w^\top x_n + b)\bigr] + \frac{\lambda}{2} \|w\|^2
\]

The only "strange" thing in this objective is that we have replaced λ with λ/2. The
reason for this change is just to make the gradients cleaner. We can first compute the
derivative with respect to b:

\[
\frac{\partial \mathcal{L}}{\partial b} = -\sum_{n} y_n \exp\bigl[-y_n (w^\top x_n + b)\bigr]
\]
The update is of the form w ← w − η∇wL.
For poorly classified points, the gradient points in the direction −y_n x_n, so the
update for such a point is of the form w ← w + η c y_n x_n, where c = exp[−y_n(wᵀx_n + b)].
Note that c is large for very poorly classified points and small for relatively well
classified points. Looking at the part of the gradient related to the regularizer,
the corresponding update is w ← w − ηλw = (1 − ηλ)w. This has the effect of shrinking the
weights toward zero.
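As a minimal sketch of these updates (the toy data, learning rate and λ are made up for illustration; the loss and regularizer follow the objective above):

import numpy as np

# Toy binary data: rows are examples, labels are +1 / -1 (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.5, -0.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
b = 0.0
eta, lam = 0.1, 0.01  # learning rate and regularization strength

for epoch in range(100):
    margins = y * (X @ w + b)
    c = np.exp(-margins)               # per-example "how poorly classified" factor
    grad_w = -(c * y) @ X + lam * w    # gradient of sum_n exp[-y_n(w.x_n + b)] + (lam/2)*||w||^2
    grad_b = -np.sum(c * y)
    w = w - eta * grad_w               # includes the (1 - eta*lam) shrinkage of w
    b = b - eta * grad_b

print(w, b)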
Types of Gradient Descent:-
Based on how many training examples are used to compute the error for each parameter
update, the Gradient Descent learning algorithm can be divided into the following types:
1. Batch Gradient Descent: Batch gradient descent (BGD) is used to find the
error for each point in the training set and update the model after evaluating
all training examples.
Let's say there are a total of 'm' observations in a data set; if we use all of these
observations to calculate the loss function, this is known as Batch Gradient Descent.
Forward propagation and backward propagation are performed and the parameters
are updated. In batch Gradient Descent, since we are using the entire training set,
the parameters will be updated only once per epoch.
Let's say we have 5 observations, each with three features, and the values taken are
completely random.
2. Stochastic Gradient Descent (SGD): Now if we use SGD, we will take the first
observation, pass it through the neural network, calculate the error and then update
the parameters; the same steps are then repeated for each remaining observation in turn.
3. Mini-batch Gradient Descent: Again, let's take the same example and assume that the
batch size is 2. We'll take the first two observations, pass them through the neural
network, calculate the error and then update the parameters.
Then we will take the next two observations and perform the same steps, i.e. pass them
through the network, calculate the error and update the parameters.
Since only a single observation is left for the final iteration, we will update the
parameters using that one observation.
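As a minimal sketch of this batching scheme (the data, batch size and placeholder update are made up for illustration; a real model would replace the placeholder):

import numpy as np

# 5 observations with 3 features each, plus targets (random values, for illustration only).
X = np.random.rand(5, 3)
y = np.random.rand(5)

batch_size = 2
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]  # batches of 2, 2, and finally 1 observation
    y_batch = y[start:start + batch_size]
    # Placeholder for: forward pass, error computation, backward pass, parameter update.
    print("batch shape:", X_batch.shape)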
Advantages of Mini Batch gradient descent:
It is easier to fit in allocated memory.
It is computationally efficient.
It produces stable gradient descent convergence.
Although Gradient Descent is one of the most popular methods for optimization problems,
it still has some challenges.
Support Vector Machine (SVM):-
To separate two classes of data points, there are many possible hyperplanes that
could be chosen.
Our objective is to find the plane that has the maximum margin, i.e. the maximum
distance between the data points of both classes.
Maximizing the margin provides some reinforcement so that future data points
can be classified with more confidence.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data
point in the correct category in the future. This best decision boundary is called a
hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a Support
Vector Machine. Consider the diagram below, in which two different categories are
classified using a decision boundary or hyperplane:
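As a minimal sketch (using scikit-learn's SVC on a tiny made-up 2-D dataset; all values are illustrative only), the fitted model exposes the support vectors it chose:

import numpy as np
from sklearn.svm import SVC

# Tiny 2-D dataset with two categories (values made up for illustration).
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.5, 1.5], [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")        # linear kernel -> a separating hyperplane
clf.fit(X, y)

print(clf.support_vectors_)       # the extreme points that define the margin
print(clf.predict([[3.0, 3.0]]))  # classify a new point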
Example: SVM can be understood with the example that we have used in the
KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a
model that can accurately identify whether it is a cat or a dog, such a model can
be created by using the SVM algorithm.
We will first train our model with lots of images of cats and dogs so that it can
learn about different features of cats and dogs, and then we test it with this
strange creature.
The SVM creates a decision boundary between these two classes (cat and dog) and
chooses the extreme cases (support vectors), so it will look at the extreme cases of
cats and dogs. On the basis of the support vectors, it will classify the new example as a cat.
Consider the below diagram:
Types of SVM
Linear SVM:
o The working of the SVM algorithm can be understood by using an
example. Suppose we have a dataset that has two tags (green and blue),
and the dataset has two features, x1 and x2. We want a classifier that can
classify the pair (x1, x2) of coordinates as either green or blue. Consider
the image below:
Non-Linear SVM:
If data is linearly separable, then we can separate it by using a straight line,
but for non-linear data we cannot draw a single straight line. Consider the
image below:
So to separate these data points, we need to add one more dimension.
For linear data we have used two dimensions, x and y, so for non-linear data
we will add a third dimension, z. It can be calculated as:
\( z = x^2 + y^2 \)
By adding the third dimension, the sample space will become as in the image below:
Now SVM will divide the dataset into classes in such a way that all data points are
classified correctly. Since we are in 3-D space, the separating surface looks like a
plane parallel to the x-axis.
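As a minimal sketch of this idea (using scikit-learn's make_circles to generate non-linearly separable toy data; all names and values are illustrative), adding the feature z = x² + y² makes the two rings separable by a linear SVM:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line in (x, y).
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add the third dimension z = x^2 + y^2.
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3 = np.hstack([X, z])

# A linear SVM in the lifted 3-D space separates the rings with a plane.
clf = SVC(kernel="linear").fit(X3, y)
print(clf.score(X3, y))  # close to 1.0 on this toy data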