ML Unit-4
1. Logistic Regression
2. Log Loss
4. SVM
5. Decision Tree
LOGISTIC REGRESSION
Logistic Regression is used when the dependent variable (target) is categorical. For example, to
predict whether an email is spam (1) or not (0), or whether a tumor is malignant (1) or not (0).
Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based
on prior observations of a data set.
o Logistic Regression is quite similar to Linear Regression except in how they are used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is used
for solving classification problems.
In Logistic Regression, Ŷi is a nonlinear function of the inputs, Ŷ = 1 / (1 + e^(-z)). If we substitute this
into the MSE equation used for Linear Regression, the result is a non-convex cost function, as shown.
When we try to optimize such a function using gradient descent, it becomes difficult to find the global
minimum. For this reason, Logistic Regression is trained with the log loss (cross-entropy) instead.
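Below is a minimal NumPy sketch (the labels, scores and helper names are made up purely for illustration) of the sigmoid function and of the log loss just mentioned:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_pred, eps=1e-12):
    # Log loss (binary cross-entropy); convex in the model parameters,
    # unlike MSE applied to sigmoid outputs
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Toy example: linear scores w.x + b passed through the sigmoid
y_true = np.array([1, 0, 1, 0])
z = np.array([2.0, -1.5, 0.3, -0.2])
y_pred = sigmoid(z)
print(log_loss(y_true, y_pred))
```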
Linear Regression | Logistic Regression
Linear Regression is used to predict the continuous dependent variable using a given set of independent variables. | Logistic Regression is used to predict the categorical dependent variable using a given set of independent variables.
Linear Regression is used for solving regression problems. | Logistic Regression is used for solving classification problems.
In Linear Regression, we predict the value of continuous variables. | In Logistic Regression, we predict the values of categorical variables.
In Linear Regression, we find the best-fit line, by which we can easily predict the output. | In Logistic Regression, we find the S-curve, by which we can classify the samples.
The least squares estimation method is used for estimating the coefficients. | The maximum likelihood estimation method is used for estimating the coefficients.
The output of Linear Regression must be a continuous value, such as price or age. | The output of Logistic Regression must be a categorical value, such as 0 or 1, Yes or No.
This method of regression analysis begins with a set of data points plotted on an x- and y-axis
graph. An analyst using the least squares method generates a line of best fit that explains the
potential relationship between the independent and dependent variables.
If the data shows a linear relationship between two variables, the line that best fits this
relationship is known as the least-squares regression line, which minimizes the vertical distances from
the data points to the line. The term “least squares” is used because the line is chosen so that the sum
of the squared errors (residuals) is as small as possible.
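A minimal NumPy sketch of this idea, using made-up data points and the closed-form formulas for the slope and intercept that minimize the sum of squared vertical distances:

```python
import numpy as np

# Made-up data points (x, y) assumed to have a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Closed-form least-squares estimates for the line y = b0 + b1*x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"best-fit line: y = {b0:.3f} + {b1:.3f} * x")
```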
Support vector machine:
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression. Although it can handle regression problems as well, it is best suited
for classification. The objective of the SVM algorithm is to find a hyperplane in an N-
dimensional space that distinctly classifies the data points. The dimension of the
hyperplane depends upon the number of features: if the number of input features is two,
the hyperplane is just a line; if the number of input features is three, the
hyperplane becomes a 2-D plane. It becomes difficult to imagine when the number of
features exceeds three.
Let’s consider two independent variables x1, x2 and one dependent variable which is
either a blue circle or a red circle.
From the figure above it is very clear that there are multiple lines (our hyperplane here is a
line because we are considering only two input features, x1 and x2) that segregate our data
points, i.e. classify between the red and blue circles. So how do we choose the
best line, or in general the best hyperplane, that segregates our data points?
Selecting the best hyperplane:
One reasonable choice as the best hyperplane is the one that represents the largest
separation or margin between the two classes.
So we choose the hyperplane whose distance to the nearest data point on each
side is maximized. If such a hyperplane exists, it is known as the maximum-margin
hyperplane, or hard margin. So from the above figure, we choose L2.
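As an illustration, here is a small scikit-learn sketch (the points and labels are made up) that fits a linear-kernel SVC with a very large C, which approximates the hard-margin, maximum-margin hyperplane described above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two features (x1, x2) and two classes (0 = "blue", 1 = "red")
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel with a large C approximates a hard-margin SVM
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("hyperplane coefficients (w):", clf.coef_)
print("intercept (b):", clf.intercept_)
print("support vectors:", clf.support_vectors_)
```

The support vectors printed at the end are the points closest to the separating hyperplane; they alone determine where the maximum-margin boundary lies.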
Let’s consider a scenario like shown below
Here we have one blue ball inside the boundary of the red balls. So how does SVM classify the
data? It’s simple! The blue ball within the boundary of the red ones is an outlier of the blue balls. The
SVM algorithm has the characteristic of ignoring the outlier and finding the best hyperplane
that maximizes the margin, so SVM is robust to outliers.
So for this type of data, what SVM does is find the maximum margin as done with the
previous data sets, and along with that it adds a penalty each time a point crosses the margin.
The margins in these types of cases are called soft margins. When there is a soft margin in
the data set, the SVM tries to minimize (1/margin + λ·(∑penalty)). Hinge loss is a commonly
used penalty: if there are no violations there is no hinge loss, and if there are, the hinge loss is
proportional to the distance of the violation.
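A minimal NumPy sketch of the hinge loss (assuming labels coded as -1/+1 and made-up scores) showing that correctly classified points outside the margin contribute nothing, while margin violations are penalized in proportion to how far they cross:

```python
import numpy as np

def hinge_loss(y, scores):
    # Hinge loss: zero when y * score >= 1 (correct side, outside the margin),
    # otherwise it grows linearly with the size of the violation
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

y = np.array([1, -1, 1, -1])              # true labels in {-1, +1}
scores = np.array([2.0, -0.5, 0.3, 1.2])  # signed distances w.x + b
print(hinge_loss(y, scores))
```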
Say our data is like the one shown in the figure above. SVM solves this by creating a new
variable using a kernel. We take a point xi on the line and create a new variable yi as a
function of its distance from the origin O. If we plot this, we get something like what is shown below.
In this case, the new variable y is created as a function of distance from the origin. A non-
linear function that creates such a new variable is referred to as a kernel.
SVM Kernel:
The SVM kernel is a function that takes a low-dimensional input space and transforms it into a
higher-dimensional space, i.e. it converts a non-separable problem into a separable problem. It is
mostly useful in non-linear separation problems. Simply put, the kernel does some
extremely complex data transformations and then finds out the process to separate the data
based on the labels or outputs defined.
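As a sketch, here is scikit-learn's SVC with an RBF kernel on a made-up "concentric circles" dataset (generated with make_circles purely for illustration), where no straight line can separate the classes in the original 2-D space:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles are not linearly separable in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the points into a higher-dimensional
# space where a separating hyperplane exists
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```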
Advantages of SVM:
o Effective in high-dimensional cases.
o It is memory efficient, as it uses a subset of the training points in the decision function, called
support vectors.
o Different kernel functions can be specified for the decision function, and it is possible to
specify custom kernels.
Basic Steps
The basic steps of the SVM are:
1. Select two hyperplanes (in 2D, lines) which separate the data with no points between
them (the red lines in the figure).
2. Maximize their distance (the margin).
3. The average line (here, the line halfway between the two red lines) will be
the decision boundary.
SVM for Non-Linear Data Sets
The basic idea is that when a data set is inseparable in the current
dimensions, we add another dimension; maybe that way the data will
become separable. Just think about it: the example above is in 2D and it
is inseparable, but maybe in 3D there is a gap between the apples
and the lemons, maybe there is a level difference, so the lemons are on
level one and the apples are on level two. In this case, we can easily
draw a separating hyperplane (in 3D a hyperplane is a plane)
between levels 1 and 2.
Mapping to Higher Dimensions
Mapping from 2D to 3D
Now we have to map the apples and lemons (which are just simple
points) to this new space. Think about it carefully: what did we do?
We just used a transformation in which we added levels based on
distance. If you are at the origin, the points will be on the
lowest level. As we move away from the origin, we are climbing the hill
(moving from the center of the plane towards the margins), so the
level of the points will be higher. Now if we consider that the lemon at
the center is at the origin, we will have something like what is shown in the figure.
After using the kernel and applying all the transformations, we get the separable arrangement shown in the figure.
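A minimal sketch of this kind of mapping with made-up points: adding a third coordinate equal to the squared distance from the origin puts the points near the center and the points far from the center on different "levels", so a plane can separate them:

```python
import numpy as np

# Made-up 2-D points: "lemons" near the origin, "apples" far from it
X = np.array([[0.1, 0.2], [-0.2, 0.1], [0.0, -0.1],
              [2.0, 0.1], [-1.8, 1.1], [0.2, -2.2]])

# New third coordinate: squared distance from the origin (the "level")
z = (X ** 2).sum(axis=1)
X_3d = np.column_stack([X, z])

# In 3-D the plane z = 1 now separates the two groups:
# the inner points have z < 1, the outer points have z > 1
print(X_3d)
```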
Tuning Parameters
Regularization
The regularization parameter C tells the SVM optimization how much we want to avoid
misclassifying each training example. As you can see in the image, when C is low, the margin is wider
(so implicitly we don't have so many curves; the line doesn't strictly
follow the data points) even if two apples were classified as
lemons. When C is high, the boundary is full of curves and all
the training data was classified correctly. Don't forget: even if all
the training data was correctly classified, this doesn't mean that
increasing C will always increase the precision (because of
overfitting).
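A small scikit-learn sketch (toy data generated with make_blobs, purely for illustration) comparing a very low and a very high C on the same training set:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping clusters of points
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1000.0):
    clf = SVC(kernel="rbf", gamma="scale", C=C).fit(X, y)
    print(f"C={C}: training accuracy = {clf.score(X, y):.2f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```

Typically the low-C model keeps a wider margin and more support vectors at the cost of a few training errors, while the high-C model bends the boundary to fit the training data, which can overfit.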
Gamma
The gamma parameter defines how far the influence of a single training example reaches: with a low
gamma, points far away from the plausible separation line are also taken into account; with a high
gamma, only the points close to it are considered.
Margin
A good margin is one that is as wide as possible, keeping the data points of both classes as far from
the decision boundary as possible.
DECISION TREE
Note: A decision tree can handle categorical data (YES/NO) as well as numeric data.
o Decision Trees usually mimic human thinking ability while making a decision, so
it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a
tree-like structure.
o Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute
Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the
best attribute.
o Step-4: Generate the decision tree node, which contains the best
attribute.
o Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is
reached where you cannot classify the nodes any further; the final node
is then called a leaf node (see the sketch below).
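As a sketch of these steps in practice, here is a minimal scikit-learn example (using the built-in iris dataset purely for illustration); DecisionTreeClassifier picks the best attribute at each node according to the chosen criterion and recursively splits until the leaves are pure or a depth limit is reached:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small, well-known dataset (used here purely for illustration)
iris = load_iris()
X, y = iris.data, iris.target

# criterion="entropy" uses information gain; criterion="gini" uses the Gini index
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned tree: each internal node shows the attribute chosen for the split
print(export_text(tree, feature_names=list(iris.feature_names)))
```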
The two popular Attribute Selection Measures are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of information gain,
and a node/attribute having the highest information gain is split first. It can be
calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy(S) = −P(yes) log2(P(yes)) − P(no) log2(P(no))
Where S is the total number of samples, P(yes) is the probability of yes and P(no) is the
probability of no.
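A minimal NumPy sketch with made-up sample counts, computing the entropy of a parent node and the information gain of one candidate split using the formula above:

```python
import numpy as np

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes)*log2 P(yes) - P(no)*log2 P(no)
    terms = [p * np.log2(p) for p in (p_yes, p_no) if p > 0]
    return -sum(terms)

# Made-up parent node: 9 "yes" and 5 "no" samples
parent = entropy(9/14, 5/14)

# Splitting on a feature gives two children (weighted by their sample counts)
child_1 = entropy(6/8, 2/8)   # 8 samples
child_2 = entropy(3/6, 3/6)   # 6 samples
weighted = (8/14) * child_1 + (6/14) * child_2

print("information gain:", parent - weighted)
```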
Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree
in the CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high
Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − ∑j (Pj)²
Where Pj is the proportion of samples belonging to class j at the node.
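A minimal NumPy sketch with made-up class counts, computing the Gini index of a node from the formula above:

```python
import numpy as np

def gini_index(class_counts):
    # Gini = 1 - sum_j (p_j ** 2), where p_j is the proportion of class j
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Made-up node with 9 samples of class "yes" and 5 of class "no"
print(gini_index([9, 5]))   # about 0.459; a pure node would give 0.0
```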