
UNIT-4

1. Logistic Regression

2. Log Loss

3. Least Squares Method

4. SVM

5. Decision Tree

6. Probability Estimation Trees

LOGISTIC REGRESSION

Logistic Regression is used when the dependent variable (target) is categorical. For example, to predict whether an email is spam (1) or not (0), or whether a tumor is malignant (1) or not (0).

Logistic regression is a statistical analysis method used to predict a binary outcome, such as yes or no, based on prior observations of a data set.

o Logistic Regression is very similar to Linear Regression except in how it is used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.

o In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to probabilities.
o It maps any real value into another value within the range of 0 and 1.
o The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, so it forms an "S"-shaped curve. The S-form curve is called the sigmoid function or the logistic function: σ(z) = 1 / (1 + e^(-z)).
o In logistic regression, we use the concept of a threshold value, which decides between the classes 0 and 1: values above the threshold tend to 1, and values below the threshold tend to 0. A minimal sketch of the sigmoid and a 0.5 threshold is shown below.
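The following is a small illustrative sketch of the sigmoid mapping and a 0.5 threshold rule; the scores in z are made-up values standing in for a linear combination w·x + b.

```python
# A minimal sketch of the sigmoid function and a 0.5 threshold rule.
# The scores below are hypothetical, purely for illustration.
import numpy as np

def sigmoid(z):
    """Map any real value z into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.3, 4.1])                     # hypothetical linear scores w.x + b
probabilities = sigmoid(z)                         # roughly [0.12, 0.57, 0.98]
predictions = (probabilities >= 0.5).astype(int)   # values above the threshold map to class 1

print(probabilities, predictions)
```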
LOSS FUNCTION IN LOGISTIC REGRESSION:

Why don't we use Mean Squared Error as a cost function in Logistic Regression?

In Logistic Regression, Ŷᵢ is a nonlinear function, Ŷ = 1 / (1 + e^(-z)). If we substitute this into the MSE equation, it gives a non-convex function, as shown:

When we try to optimize the values using gradient descent, this non-convex shape makes it difficult to find the global minimum. Instead, logistic regression is trained with the log loss (binary cross-entropy), which is convex for this model.
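To make the log loss concrete, here is a small sketch of the binary cross-entropy that logistic regression minimizes in place of MSE; the labels and predicted probabilities are made up for illustration.

```python
# A sketch of the log loss (binary cross-entropy) used by logistic regression.
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])              # illustrative labels
y_prob = np.array([0.9, 0.2, 0.6, 0.55])     # illustrative predicted probabilities
print(log_loss(y_true, y_prob))              # lower is better; confident wrong predictions are punished heavily
```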
Linear Regression vs. Logistic Regression

o Linear regression is used to predict a continuous dependent variable using a given set of independent variables; logistic regression is used to predict a categorical dependent variable.

o Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.

o In linear regression, we predict the value of continuous variables; in logistic regression, we predict the values of categorical variables.

o In linear regression, we find the best-fit line, with which we can easily predict the output; in logistic regression, we find the S-curve, with which we can classify the samples.

o The least squares method is used to estimate the parameters of linear regression; the maximum likelihood method is used to estimate the parameters of logistic regression.

o The output of linear regression must be a continuous value, such as price or age; the output of logistic regression must be a categorical value, such as 0 or 1, Yes or No.

o In linear regression, the relationship between the dependent and independent variables must be linear; in logistic regression, a linear relationship between the dependent and independent variables is not required.

What Is the Least Squares Method?

The least squares method is a form of mathematical regression analysis used to determine the line of best fit for a set of data, providing a visual demonstration of the relationship between the data points. Each data point represents the relationship between a known independent variable and an unknown dependent variable.

This method of regression analysis begins with a set of data points plotted on an x- and y-axis graph. An analyst using the least squares method generates a line of best fit that explains the potential relationship between the independent and dependent variables.

Least Squares Regression Line

If the data show a linear relationship between two variables, the line that best fits this relationship is known as the least-squares regression line. It minimizes the vertical distances from the data points to the line; the term "least squares" is used because this line yields the smallest possible sum of squared errors (residuals).
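Below is a minimal sketch of fitting a least-squares regression line y = m·x + b using the closed-form slope and intercept; the five (x, y) points are made up for illustration.

```python
# A minimal sketch of a least-squares fit on illustrative data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Closed-form slope and intercept that minimize the sum of squared errors
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

residuals = y - (m * x + b)
print(m, b, np.sum(residuals ** 2))   # slope, intercept, minimized sum of squares
```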
Support Vector Machine:
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. Although it can handle regression problems as well, it is best suited for classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. The dimension of the hyperplane depends on the number of features: if the number of input features is two, the hyperplane is just a line; if the number of input features is three, the hyperplane becomes a 2-D plane. It becomes difficult to visualize when the number of features exceeds three.
Let's consider two independent variables x1, x2 and one dependent variable, which is either a blue circle or a red circle.

Linearly Separable Data points

From the figure above it is clear that there are multiple lines (our hyperplane here is a line because we are considering only two input features, x1 and x2) that segregate the data points, i.e., classify them into red and blue circles. So how do we choose the best line, or in general the best hyperplane, that segregates our data points?
Selecting the best hyperplane:
One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane whose distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, or hard margin. So from the above figure, we choose L2.
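As a small illustration of the maximum-margin idea, the sketch below fits a linear SVM with scikit-learn on made-up, linearly separable toy points; the class labels and coordinates are assumptions for demonstration only.

```python
# A minimal sketch of fitting a linear SVM on two separable classes.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1],      # class 0 (e.g. "blue circles")
              [6, 5], [7, 7], [8, 6]])     # class 1 (e.g. "red circles")
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)          # on separable data this behaves close to a hard margin
clf.fit(X, y)

print(clf.support_vectors_)                # the points that define the maximum margin
print(clf.predict([[3, 2], [7, 6]]))       # classify new points
```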
Let's consider a scenario like the one shown below.

Here we have one blue ball inside the boundary of the red balls. So how does SVM classify the data? It's simple! The blue ball among the red ones is an outlier of the blue class. The SVM algorithm has the ability to ignore such outliers and still find the hyperplane that maximizes the margin, so SVM is robust to outliers.

For data points like these, SVM finds the maximum margin, as with the previous data sets, and additionally adds a penalty each time a point crosses the margin. The margins in these cases are called soft margins. With a soft margin, the SVM tries to minimize (1/margin + λ·(∑penalty)). The hinge loss (used for "maximum-margin" classification) is a commonly used penalty: if there are no violations there is no hinge loss, and if there are violations the hinge loss is proportional to the distance of the violation.
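To make the penalty concrete, here is a small sketch of the hinge loss for a single example: it is zero when the point lies on the correct side of the margin and grows linearly with the distance of the violation. The labels follow the usual {-1, +1} convention and the scores are made up.

```python
# A sketch of the hinge loss max(0, 1 - y * f(x)).
import numpy as np

def hinge_loss(y, score):
    """Hinge loss for label y in {-1, +1} and decision score f(x)."""
    return np.maximum(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.5))   # 0.0 -> correctly classified, outside the margin
print(hinge_loss(+1, 0.4))   # 0.6 -> inside the margin, small penalty
print(hinge_loss(+1, -1.0))  # 2.0 -> misclassified, larger penalty
```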
Until now, we were talking about linearly separable data (the groups of blue balls and red balls are separable by a straight line). What do we do if the data are not linearly separable?

Say our data look like the figure above. SVM solves this by creating a new variable using a kernel. We take a point xᵢ on the line and create a new variable yᵢ as a function of its distance from the origin o. If we plot this, we get something like the figure below. In this case, the new variable y is created as a function of distance from the origin. A non-linear function that creates such a new variable is referred to as a kernel.
SVM Kernel:
The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e., it converts a non-separable problem into a separable one. It is mostly useful in non-linear separation problems. Simply put, the kernel performs some complex data transformations and then finds the process to separate the data based on the labels or outputs defined.
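The sketch below illustrates this idea with scikit-learn: a linear SVM cannot separate concentric-circle data, while an RBF-kernel SVM can. The make_circles toy dataset is an assumption standing in for the non-separable data described above.

```python
# A sketch contrasting a linear kernel and an RBF kernel on non-separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel="linear").fit(X, y)
rbf_clf = SVC(kernel="rbf").fit(X, y)      # the kernel implicitly maps the data to a higher-dimensional space

print(linear_clf.score(X, y))              # poor fit: no separating straight line exists
print(rbf_clf.score(X, y))                 # near-perfect fit on this toy data
```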
Advantages of SVM:
 Effective in high-dimensional cases.
 Memory efficient, as it uses only a subset of the training points in the decision function, called support vectors.
 Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.

Basic Steps
The basic steps of the SVM are:
1. Select two hyperplanes (in 2D, lines) that separate the data with no points between them (the red lines).
2. Maximize their distance (the margin).
3. The average line (here, the line halfway between the two red lines) is the decision boundary.
SVM for Non-Linear Data Sets

An example of non-linear data is:

In this case we cannot find a straight line to separate the apples from the lemons. So how can we solve this problem? We will use the Kernel Trick!

The basic idea is that when a data set is inseparable in the current dimensions, we add another dimension; maybe that way the data will become separable. Just think about it: the example above is in 2D and it is inseparable, but maybe in 3D there is a gap between the apples and the lemons, maybe there is a level difference, so the lemons are on level one and the apples are on level two. In this case, we can easily draw a separating hyperplane (in 3D a hyperplane is a plane) between levels 1 and 2.
Mapping to Higher Dimensions

To solve this problem we shouldn't just blindly add another dimension; we should transform the space so that we generate this level difference intentionally.

Mapping from 2D to 3D

Let's assume that we add another dimension called x3. The important transformation is that in the new dimension the points are organized using the formula x3 = x1² + x2².

If we plot the surface defined by the x1² + x2² formula, we will get something like this:

Now we have to map the apples and lemons (which are just simple points) to this new space. Think about it carefully: what did we do? We just used a transformation in which we added levels based on distance from the origin. If a point is at the origin, it will be on the lowest level. As we move away from the origin, we are climbing the hill (moving from the center of the plane towards the margins), so the level of the points will be higher. Now, if we consider that the origin is the lemon at the center, we will have something like this:

Now we can easily separate the two classes. These transformations are called kernels. Popular kernels are: the Polynomial Kernel, the Gaussian Kernel, the Radial Basis Function (RBF), the Laplace RBF Kernel, the Sigmoid Kernel, the ANOVA RBF Kernel, etc. (see Kernel Functions, or for a more detailed description, Machine Learning Kernels).
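As a concrete sketch of the explicit x1² + x2² mapping described above, the code below adds a third coordinate so that points near the origin ("lemons") sit on a low level and points far from it ("apples") on a higher one; the example points are made up.

```python
# A sketch of the explicit 2D -> 3D feature map x3 = x1^2 + x2^2.
import numpy as np

def add_level(X):
    """Map 2-D points (x1, x2) to 3-D points (x1, x2, x1^2 + x2^2)."""
    x3 = X[:, 0] ** 2 + X[:, 1] ** 2
    return np.column_stack([X, x3])

lemons = np.array([[0.2, 0.1], [-0.3, 0.2]])   # close to the origin
apples = np.array([[2.0, 1.5], [-2.2, 1.8]])   # far from the origin

print(add_level(lemons))   # third column is small
print(add_level(apples))   # third column is large -> a plane can separate the levels
```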
Mapping from 1D to 2D

Another, easier example in 2D would be:

After using the kernel and after all the transformations we will get:

So after the transformation, we can easily delimit the two classes using just a single line.

In real-life applications we won't have a simple straight line, but lots of curves and high dimensions. In some cases we won't have two hyperplanes that separate the data with no points between them, so we need some trade-offs and tolerance for outliers. Fortunately, the SVM algorithm has a so-called regularization parameter to configure the trade-off and tolerate outliers.

Tuning Parameters

As we saw in the previous section, choosing the right kernel is crucial, because if the transformation is incorrect, the model can give very poor results. As a rule of thumb, always check whether you have linear data, and in that case use a linear SVM (linear kernel). A linear SVM is a parametric model, but an RBF-kernel SVM isn't, so the complexity of the latter grows with the size of the training set. Not only is it more expensive to train an RBF-kernel SVM, but you also have to keep the kernel matrix around, and the projection into this "infinite" higher-dimensional space where the data become linearly separable is more expensive during prediction as well.

Regularization

The regularization parameter (in Python's scikit-learn it is called C) tells the SVM optimization how much you want to avoid misclassifying each training example.

If C is higher, the optimization will choose a smaller-margin hyperplane, so the misclassification rate on the training data will be lower. On the other hand, if C is low, the margin will be big, even if there are misclassified training examples. This is shown in the following two diagrams:

As you can see in the image, when C is low, the margin is wider (so implicitly we don't have so many curves; the boundary doesn't strictly follow the data points), even though two apples were classified as lemons. When C is high, the boundary is full of curves and all the training data were classified correctly. Don't forget: even if all the training data are correctly classified, this doesn't mean that increasing C will always increase the precision (because of overfitting).
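The following sketch shows the effect of C with scikit-learn on a toy dataset (make_moons is an assumption standing in for the apples/lemons example): a low C tolerates misclassified training points for a wider margin, while a high C tries to classify every training point correctly and can overfit.

```python
# A sketch of how the regularization parameter C changes the fit.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    # Training accuracy usually rises with C, but a very high C can overfit.
    print(C, clf.score(X, y))
```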

Gamma

The next important parameter is gamma. The gamma parameter defines how far the influence of a single training example reaches. A high gamma means that only points close to the plausible hyperplane are considered, while a low gamma also takes points at greater distances into account.

As you can see, decreasing gamma means that finding the correct hyperplane considers points at greater distances, so more and more points are used (the green lines indicate which points were considered when finding the optimal hyperplane).
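As with C, a small sketch can illustrate gamma on the same toy dataset (again an assumption for demonstration): a low gamma gives a smoother boundary influenced by distant points, while a very high gamma lets only nearby points matter and tends to memorize the training set.

```python
# A sketch of the gamma parameter of the RBF kernel.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    print(gamma, clf.score(X, y))   # very high gamma tends to overfit the training set
```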

Margin

The last concept is the margin. We've already talked about the margin: a higher margin results in a better model, and therefore better classification (or prediction). The margin should always be maximized.

Decision Tree Classification Algorithm

o A Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands into further branches and constructs a tree-like structure.
o To build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
o The diagram below explains the general structure of a decision tree.

Note: A decision tree can handle categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?

There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Below are two reasons for using the decision tree:

o Decision trees usually mimic human thinking ability while making a decision, so they are easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.

How does the Decision Tree algorithm work?

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further, and call each final node a leaf node. A minimal sketch of these steps with scikit-learn is shown after this list.
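The sketch below walks through these steps with scikit-learn's CART-based DecisionTreeClassifier on the built-in Iris dataset (the dataset and hyperparameters are assumptions for illustration, not part of the original notes).

```python
# A minimal sketch of training and inspecting a decision tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" selects splits by information gain; the default "gini" uses the Gini index
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on held-out data
print(export_text(tree))            # the learned decision rules, node by node
```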

Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute for the nodes of the tree. There are two popular ASM techniques:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of the change in entropy after segmenting a dataset based on an attribute.
o It calculates how much information a feature provides about the class.
o According to the value of the information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of the information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula below:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric that measures the impurity of a given attribute. It specifies the randomness in the data. Entropy can be calculated as (a small computation sketch follows the definitions below):

Entropy(S) = −P(yes)·log₂(P(yes)) − P(no)·log₂(P(no))

Where:

o S = the set of samples
o P(yes) = probability of yes
o P(no) = probability of no
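Here is a small sketch of computing entropy and information gain for a yes/no target; the toy split counts are made up for illustration.

```python
# A sketch of entropy and information gain on an illustrative split.
import numpy as np

def entropy(labels):
    """-sum(p * log2(p)) over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array(["yes"] * 9 + ["no"] * 5)    # S: 9 yes, 5 no
left   = np.array(["yes"] * 6 + ["no"] * 1)    # subset for one attribute value
right  = np.array(["yes"] * 3 + ["no"] * 4)    # subset for the other attribute value

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted
print(entropy(parent), info_gain)              # higher information gain -> better split
```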

Gini Index:
o The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o The CART algorithm only creates binary splits, and it uses the Gini index to create them.
o The Gini index can be calculated using the formula below (a small computation sketch follows it):

Gini Index = 1 − ∑ⱼ (Pⱼ)²
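The following small sketch computes the Gini index of a node from its class proportions; the label counts are illustrative only.

```python
# A sketch of the Gini index 1 - sum(p_j^2) for a node.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["yes"] * 10))                  # 0.0 -> pure node, preferred
print(gini(["yes"] * 5 + ["no"] * 5))      # 0.5 -> maximally impure binary node
```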

Advantages of the Decision Tree

o It is simple to understand, as it follows the same process a human follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps us think about all the possible outcomes of a problem.
o It requires less data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o A decision tree may contain many layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.