ML-Unit 5

UNIT V NON-PARAMETRIC MACHINE LEARNING

k-Nearest Neighbors - Decision Trees – Branching – Greedy Algorithm - Multiple Branches – Continuous
attributes – Pruning. Random Forests: ensemble learning. Boosting – Adaboost algorithm. Support Vector
Machines – Large Margin Intuition – Loss Function - Hinge Loss – SVM Kernels

1. K-Nearest Neighbor (KNN) Algorithm for Machine Learning


 K-Nearest Neighbor is one of the simplest Machine Learning algorithms, based on the Supervised Learning
technique.
 K-NN assumes similarity between the new case/data and the available cases, and puts the new
case into the category that is most similar to the available categories.
 K-NN stores all the available data and classifies a new data point based on similarity.
This means that when new data appears, it can be easily classified into a well-suited category using the
K-NN algorithm.
 K-NN can be used for Regression as well as for Classification, but it is mostly used for
Classification problems.
 K-NN is a non-parametric algorithm, which means it makes no assumptions about the underlying
data.
 It is also called a lazy learner algorithm because it does not learn from the training set immediately;
instead it stores the dataset and, at the time of classification, performs an action on the dataset.
 At the training phase the KNN algorithm just stores the dataset, and when it gets new data, it classifies
that data into the category most similar to the new data.
 Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to
know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on
a similarity measure. Our KNN model will compare the features of the new image with the cat and
dog images and, based on the most similar features, put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, Category A and Category B, and we have a new data point x1: in which of
these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm.
With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below
diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance between the new data point and each training point.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the below image:

o Firstly, we will choose the number of neighbors; here we choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. For points (x1, y1) and (x2, y2)
it is calculated as: d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:
o As the 3 nearest neighbors are from category A, this new data point must belong to category A.
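To make these steps concrete, here is a minimal from-scratch sketch of the K-NN procedure in Python (not the sklearn classifier used later in this unit; the toy points and labels are invented for illustration):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    #Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    #Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    #Steps 4-5: count categories among the neighbors, take the majority
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 6]]) #invented data
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=5)) #-> 'A'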

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm:
 There is no particular way to determine the best value for "K", so we need to try several values to find
the best among them. The most commonly used value for K is 5.
 A very low value for K, such as K=1 or K=2, can be noisy and sensitive to the effects of outliers in the
model.
 Large values for K reduce noise, but they can smooth over genuine class boundaries and increase computation.

Advantages of KNN Algorithm:


 It is simple to implement.
 It is robust to noisy training data.
 It can be effective when the training data is large.
Disadvantages of KNN Algorithm:
 The value of K always needs to be determined, which can be complex at times.
 The computation cost is high, because the distance between the new point and all the
training samples must be calculated.

Problem for the K-NN Algorithm: A car manufacturer has produced a new SUV. The company wants to
show ads to the users who are interested in buying that SUV. For this problem, we have a dataset that
contains information about users from a social network. The dataset contains lots of information, but we
will use Estimated Salary and Age as the independent variables and Purchased as the
dependent variable. Below is the dataset:

Steps to implement the K-NN algorithm:
 Data Pre-processing step
 Fitting the K-NN algorithm to the Training set
 Predicting the test result
 Test accuracy of the result(Creation of Confusion matrix)
 Visualizing the test set result.

Data Pre-Processing Step:


The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the code for it:
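The pre-processing code itself is not reproduced in the source here; it is the same code used again in the Decision Tree and Random Forest sections later in this unit. A sketch, assuming the user_data.csv layout used throughout these notes:

import pandas as pd #importing libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data_set= pd.read_csv('user_data.csv') #importing the dataset
x= data_set.iloc[:, [2,3]].values #Age and Estimated Salary as independent variables
y= data_set.iloc[:, 4].values #Purchased as the dependent variable
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0) #train/test split
st_x= StandardScaler() #feature scaling
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)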
Executing this code imports the dataset into our program and pre-processes it. After feature
scaling, the Age and Estimated Salary columns of the test set are standardized.

Fitting K-NN classifier to the Training data:


Now we will fit the K-NN classifier to the training data. To do this we will import
the KNeighborsClassifier class from the sklearn.neighbors library. After importing the class, we will create
the classifier object. The main parameters of this class are:
o n_neighbors: the required number of neighbors for the algorithm. Usually, it is set to 5.
o metric='minkowski': the default parameter; it decides how the distance between points is measured.
o p=2: with the Minkowski metric, this is equivalent to the standard Euclidean metric.

And then we will fit the classifier to the training data. Below is the code for it:
#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we did in Logistic
Regression. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)

Creating the Confusion Matrix:


Now we will create the Confusion Matrix for our K-NN model to see the accuracy of the classifier. Below is
the code for it:
from sklearn.metrics import confusion_matrix #Creating the Confusion matrix
cm= confusion_matrix(y_test, y_pred)

Visualizing the Training set result:
Now, we will visualize the training set result for the K-NN model. The code remains the same as in
Logistic Regression, except for the title of the graph.

The output graph differs from the graph we obtained in Logistic Regression. It can be
understood from the points below:
 The graph shows red and green points. The green points are for the Purchased(1) variable
and the red points for the not Purchased(0) variable.
 The graph shows an irregular boundary instead of a straight line or a curve,
because K-NN classifies by finding the nearest neighbors.
 The graph has classified users into the correct categories, as most of the users who didn't buy the SUV
are in the red region and users who bought the SUV are in the green region.
 The graph shows a good result, but there are still some green points in the red region and red
points in the green region. This is not a big issue, as it prevents the model from
overfitting.
 Hence our model is well trained.

Visualizing the Test set result:


After training the model, we will now test it on a new dataset, i.e., the Test dataset. The code
remains the same apart from some minor changes: x_train and y_train are replaced by x_test and
y_test.

2.Decision Tree Classification Algorithm:


 Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.
 In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes
are used to make a decision and have multiple branches, whereas Leaf nodes are the outputs of those
decisions and do not contain any further branches.
 The decisions or the test are performed on the basis of features of the given dataset.
 It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.
 In order to build a tree, we use the CART algorithm, which stands for Classification and Regression
Tree algorithm.
 A decision tree simply asks a question and, based on the answer (Yes/No), splits the tree further into
subtrees.
 Below diagram explains the general structure of a decision tree:
A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset and
problem is the main point to remember while creating a machine learning model. Below are the two reasons
for using the Decision tree:
 Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
 The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies
 Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting
a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
 Branch/Sub Tree: A subtree formed by splitting a node of the tree.
 Pruning: Pruning is the process of removing the unwanted branches from the tree.
 Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the
tree. This algorithm compares the values of root attribute with the record (real dataset) attribute and, based on
the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further.
It continues the process until it reaches the leaf node of the tree. The complete process can be better understood
using the below algorithm:
 Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
 Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
 Step-3: Divide S into subsets that contain the possible values of the best attribute.
 Step-4: Generate the decision tree node that contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further; these
final nodes are the leaf nodes.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the
offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary attribute by ASM).
The root node splits further into the next decision node (distance from the office) and one leaf node based on
the corresponding labels. The next decision node further gets split into one decision node (Cab facility) and
one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:

Attribute Selection Measures
While implementing a Decision tree, the main issue arises that how to select the best attribute for the root
node and for sub-nodes. So, to solve such problems there is a technique which is called as Attribute selection
measure or ASM. By this measurement, we can easily select the best attribute for the nodes of the tree. There
are two popular techniques for ASM, which are:
 Information Gain
 Gini Index

1. Information Gain:
 Information gain is the measurement of changes in entropy after the segmentation of a dataset based
on an attribute.
 It calculates how much information a feature provides us about a class.
 According to the value of information gain, we split the node and build the decision tree.
 A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute
having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data.
Entropy can be calculated as:
Entropy(S) = −P(yes)·log2 P(yes) − P(no)·log2 P(no)
Where,
 S = the total set of samples
 P(yes) = probability of yes
 P(no) = probability of no

2. Gini Index:
 Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
 An attribute with the low Gini index should be preferred as compared to the high Gini index.
 It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
 Gini index can be calculated using the below formula:
Gini Index = 1 − Σj Pj²
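As a sanity check of the formulas above, the following sketch computes the entropy, the information gain of one candidate split, and the Gini index for a small invented label set:

import numpy as np

def entropy(labels): #Entropy(S) = -sum p*log2(p)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels): #Gini = 1 - sum p^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

def information_gain(parent, children): #Entropy(S) - weighted avg of child entropies
    weighted = sum(len(c) / len(parent) * entropy(c) for c in children)
    return entropy(parent) - weighted

S = np.array(['yes'] * 9 + ['no'] * 5) #9 yes, 5 no (invented counts)
left, right = S[:8], S[8:] #one candidate split
print(entropy(S)) #about 0.94
print(information_gain(S, [left, right])) #about 0.66
print(gini(S)) #about 0.46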
Pruning: Getting an Optimal Decision tree
Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal decision tree.
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of
the dataset. A technique that decreases the size of the learned tree without reducing accuracy is therefore
known as pruning. There are mainly two types of tree pruning technique used:
 Cost Complexity Pruning
 Reduced Error Pruning.

Advantages of the Decision Tree


 It is simple to understand, as it follows the same process which a human follows while making a
decision in real life.
 It can be very useful for solving decision-related problems.
 It helps to think about all the possible outcomes for a problem.
 There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
 The decision tree contains lots of layers, which makes it complex.
 It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
 For more class labels, the computational complexity of the decision tree may increase.

Python Implementation of Decision Tree
Now we will implement the Decision tree using Python. For this, we will use the dataset "user_data.csv",
which we have used in previous classification models. By using the same dataset, we can compare the Decision
Tree classifier with other classification models such as KNN, SVM, Logistic Regression, etc.
 Data Pre-processing step
 Fitting a Decision-Tree algorithm to the Training set
 Predicting the test result
 Test accuracy of the result(Creation of Confusion matrix)
 Visualizing the test set result.

1. Data Pre-Processing Step:


Below is the code for the pre-processing step:
import numpy as nm # importing libraries
import matplotlib.pyplot as mtp
import pandas as pd

data_set= pd.read_csv('user_data.csv') #importing datasets


x= data_set.iloc[:, [2,3]].values #Extracting Independent and dependent Variable
y= data_set.iloc[:, 4].values
from sklearn.model_selection import train_test_split # Splitting the dataset into training and test set.
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
from sklearn.preprocessing import StandardScaler #feature Scaling
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

In the above code, we have pre-processed the data. Where we have loaded the dataset, which is given as:

2. Fitting a Decision-Tree algorithm to the Training set


Now we will fit the model to the training set. For this, we will import the DecisionTreeClassifier class
from sklearn.tree library. Below is the code for it:
#Fitting Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have created a classifier object, to which we have passed two main parameters:
 criterion='entropy': Criterion is used to measure the quality of a split, which is calculated by the
information gain given by entropy.
 random_state=0: For generating reproducible random states.
3. Predicting the test result
Now we will predict the test set result. We will create a new prediction vector y_pred. Below is the code for
it:
#Predicting the test set result
y_pred= classifier.predict(x_test)

Output:
In the below output image, the predicted output and the real test output are given. We can clearly see that
some values in the prediction vector differ from the real vector values. These are prediction
errors.

4. Test accuracy of the result (Creation of Confusion matrix)


In the above output, we have seen that there were some incorrect predictions, so if we want to know the
number of correct and incorrect predictions, we need to use the confusion matrix. Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:

In the above output image, we can see the confusion matrix, which has 6+3 = 9 incorrect
predictions and 62+29 = 91 correct predictions. Therefore, we can say that, compared to other
classification models, the Decision Tree classifier made a good prediction.

5. Visualizing the training set result:


Here we will visualize the training set result. To visualize the training set result we will plot a graph for the
decision tree classifier. The classifier will predict yes or No for the users who have either Purchased or Not
purchased the SUV car as we did in Logistic Regression. Below is the code for it:
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:

The above output is completely different from those of the other classification models. It has both vertical and horizontal
lines that split the dataset according to the Age and Estimated Salary variables.
As we can see, the tree is trying to capture every data point, which is a case of overfitting.

6. Visualizing the test set result:

Visualization of test set result will be similar to the visualization of the training set except that the training set
will be replaced with the test set.
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

3. Greedy Algorithm:
The greedy method is one of the strategies, like Divide and Conquer, used to solve problems. This
method is used for solving optimization problems. An optimization problem is a problem that demands
either a maximum or a minimum result.
The greedy method is the simplest and most straightforward approach. It is not an algorithm, but a
technique. The main feature of this approach is that decisions are taken on the basis of the currently
available information. Whatever the current information is, the decision is made without
worrying about the effect of the current decision in the future.
This technique is used to determine a feasible solution, which may or may not be optimal. A
feasible solution is one that satisfies the given criteria. If more than one solution
satisfies the given criteria, all of those solutions are considered feasible, whereas the
optimal solution is the best and most favorable solution among all the feasible ones.

Characteristics of Greedy method


The following are the characteristics of a greedy method:
 To construct the solution in an optimal way, the algorithm maintains two sets: one contains the
chosen items, and the other contains the rejected items.
 A greedy algorithm makes good local choices in the hope that the resulting solution will be feasible
or optimal.
Components of Greedy Algorithm
The components that can be used in the greedy algorithm are:
 Candidate set: the set of items from which a solution is created.
 Selection function: chooses the candidate to be added to the
solution next.
 Feasibility function: determines whether a candidate can be used
to contribute to the solution.
 Objective function: assigns a value to the solution or the partial solution.
 Solution function: indicates whether a complete solution has been reached
or not.

Applications of Greedy Algorithm


 It is used for finding the shortest path.
 It is used to find the minimum spanning tree, using Prim's algorithm or Kruskal's algorithm.
 It is used in job sequencing with deadlines.
 It is also used to solve the fractional knapsack problem.

Pseudo code of Greedy Algorithm
Algorithm Greedy (a, n)
{
    solution := ∅;
    for i = 1 to n do
    {
        x := select(a);
        if feasible(solution, x) then
            solution := union(solution, x);
    }
    return solution;
}

The above is the general greedy algorithm. Initially, the solution is empty. We pass the array and the
number of elements to the algorithm. Inside the for loop, we select the elements one by one and check
whether the solution remains feasible. If it does, we perform the union.
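As a concrete instance of this select/feasible/union loop, here is a minimal Python sketch of the fractional knapsack problem, one of the applications listed above (the values and weights are invented):

def fractional_knapsack(values, weights, capacity):
    #select(): consider items in decreasing value-to-weight ratio
    items = sorted(zip(values, weights), key=lambda vw: vw[0] / vw[1], reverse=True)
    total = 0.0
    for value, weight in items:
        if capacity <= 0: #no capacity left: stop
            break
        take = min(weight, capacity) #feasible(): take only what fits
        total += value * take / weight #union(): add the chosen fraction to the solution
        capacity -= take
    return total

print(fractional_knapsack([60, 100, 120], [10, 20, 30], capacity=50)) #240.0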
Let's understand through an example.
Suppose there is a problem 'P'. I want to travel from A to B shown as below:
P:A→B

The problem is that we have to travel from A to B. There are various ways to go from A to
B: by walking, car, bike, train, aeroplane, etc. There is a constraint on the journey: we
have to complete it within 12 hrs. Only the train and the aeroplane can cover the distance
within 12 hrs, so although there are many solutions to this problem, only two satisfy the
constraint.
Now suppose we also have to make the journey at minimum cost. This makes the problem a
minimization problem. So far we have two feasible solutions, one by train and one by air. Since travelling
by train has the lower cost, it is the optimal solution. An optimal solution is also a feasible solution, but one
providing the best result, here the minimum cost. There is only one optimal solution.
The problem that requires either minimum or maximum result then that problem is known as an optimization
problem. Greedy method is one of the strategies used for solving the optimization problems.

Disadvantages of using Greedy algorithm


A greedy algorithm makes decisions based on the information available at each phase without considering the
broader problem. So there is a possibility that the greedy solution does not give the best solution for
every problem.
It follows the locally optimal choice at each stage with the intent of finding the global optimum. Let's understand
this through an example.
through an example.

Consider the graph which is given below:

We have to travel from the source to the destination at the minimum cost. Suppose we have three feasible
solutions with path costs 10, 20, and 5. Since 5 is the minimum cost path, it is the optimal solution. This is
the local optimum, and in this way, we find the local optimum at each stage in order to arrive at the global
optimal solution.

Continuous attributes
What are Continuous Variables?
Simply put, if a variable can take any value between its minimum and maximum value, then it is called a
continuous variable. By nature, a lot of things we deal with fall in this category: age, weight, height being
some of them.

Just to make sure the difference is clear, let me ask you to classify whether a variable is continuous or
categorical:
1. Gender of a person
2. Number of siblings of a Person
3. Time on which a laptop runs on battery

Methods to deal with Continuous Variables

Binning The Variable:


Binning refers to dividing a continuous variable into groups (bins). It is done to discover sets of patterns in
continuous variables which are difficult to analyze otherwise; bins are also easy to analyze and interpret. However,
binning leads to a loss of information and power: once the bins are created, the information gets
compressed into groups, which affects the final model. Hence, it is advisable to create small bins initially,
which keeps the loss of information minimal and produces better results.
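In Python, binning is commonly done with pandas: pd.cut creates equal-width bins and pd.qcut creates equal-frequency bins. A small sketch with invented ages:

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 42, 47, 55, 63, 71]) #invented data
equal_width = pd.cut(ages, bins=3, labels=['young', 'middle', 'senior']) #equal-width bins
equal_freq = pd.qcut(ages, q=3, labels=['low', 'mid', 'high']) #equal-frequency bins
print(pd.DataFrame({'age': ages, 'width_bin': equal_width, 'freq_bin': equal_freq}))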

Normalization:
In simpler words, it is a process of comparing variables on a ‘neutral’ or ‘standard’ scale. It brings
variables to the same range of values. Normally distributed data is easy to read and interpret: in normally
distributed data, 99.7% of the observations lie within 3 standard deviations of the mean, and after
standardization the mean is zero and the standard deviation is one. Normalization is commonly used in
algorithms such as k-means clustering.

A commonly used normalization method is the z-score. The z-score of an observation is the number of standard
deviations it falls above or below the mean. Its formula is:
z = (x − μ) / σ
where x = observation, μ = mean (population), σ = standard deviation (population)


For example:

Randy scored 76 in a maths test. Katie scored 86 in a science test. The maths test has (mean = 70, sd = 2); the science
test has (mean = 80, sd = 3).
z(Randy) = (76 – 70)/2 = 3
z(Katie) = (86 – 80)/3 = 2

Transformations for Skewed Distribution:


Transformation is required when we encounter highly skewed data. It is best not to work on skewed data
in its raw form, because skew reduces the impact of low-frequency values that could be equally significant. At
times, skewness is caused by the presence of outliers, so we need to be careful when using this approach.
Techniques for dealing with outliers are discussed separately.

There are various types of transformation methods, such as log, sqrt, exp, Box-Cox, and power transforms. The
most commonly used is the log transformation.
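A minimal sketch of a log transformation on a right-skewed variable (log1p, i.e., log(1 + x), is used so that zero values do not break the transform; the incomes are invented):

import numpy as np

incomes = np.array([20000, 25000, 30000, 40000, 60000, 500000]) #right-skewed, invented
log_incomes = np.log1p(incomes) #log(1 + x)
print(log_incomes.round(2))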

Principal Component Analysis:


Sometimes a data set has too many variables: maybe 100, 200, or even more. In such cases, you can't
build a model on all the variables, because 1) it would be time consuming, 2) there might be lots of noise, and 3)
many of the variables will carry similar information.

Hence, to avoid such a situation we use PCA, a.k.a. Principal Component Analysis. It amounts to finding a
few ‘principal’ variables (components) that explain a significant amount of the variation in the data. Using this
technique, a large number of variables is reduced to a few significant ones. This technique helps reduce
noise and redundancy, and enables quick computation.
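A minimal sklearn sketch of PCA on standardized data; the 95% variance threshold is an illustrative choice, not a rule:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) #stand-in for a wide dataset
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200) #a nearly redundant column

X_scaled = StandardScaler().fit_transform(X) #PCA is scale-sensitive, so standardize first
pca = PCA(n_components=0.95) #keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.round(3))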

Factor Analysis:
Factor Analysis was invented by Charles Spearman (1904). It is a variable reduction technique used
to determine the factor structure or model, explaining the maximum amount of variance in the model. Let's
say some variables are highly correlated. These variables can be grouped by their correlations, i.e., all variables
in a particular group can be highly correlated among themselves but have low correlation with variables of
other group(s). Each group then represents a single underlying construct or factor. Factor analysis is of two
types:
1. EFA (Exploratory Factor Analysis) – identifies and summarizes the underlying correlation structure
in a data set.
2. CFA (Confirmatory Factor Analysis) – attempts to confirm a hypothesis using the correlation structure
and rates ‘goodness of fit’.

Methods to work with Date & Time Variable


The presence of a date-time variable in a data set usually gives lots of confidence. Seriously! It does. Because
with a date-time variable you get lots of scope to practice the techniques learnt above: you can create bins, you can
create new features, convert its type, etc. Date & time values are commonly found in formats such as:
DD-MM-YYYY HH:MM:SS or MM-DD-YYYY HH:MM:SS
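A short pandas sketch of deriving new features from a date-time column (the timestamps are invented, and the format string assumes the DD-MM-YYYY layout above):

import pandas as pd

df = pd.DataFrame({'ts': ['01-03-2023 09:15:00', '15-07-2023 18:40:00']}) #invented timestamps
df['ts'] = pd.to_datetime(df['ts'], format='%d-%m-%Y %H:%M:%S') #convert type
df['year'] = df['ts'].dt.year #new features from the date-time variable
df['month'] = df['ts'].dt.month
df['weekday'] = df['ts'].dt.day_name()
df['hour'] = df['ts'].dt.hour
print(df)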
Pruning
When the size of the feature set exceeds a certain limit, regression trees become impractical due to overfitting.
The overfitting problem of decision trees is caused by other factors as well, such as branches that are
impacted by noise and outliers in the data. Pruning is a critical step in constructing tree-based machine
learning models that helps overcome these issues.
1. A snippet about decision trees
2. About pruning
3. Strategies for pruning
4. Pruning methods
A decision tree is a traditional supervised machine learning technique.

About pruning
Pruning is the process of eliminating weight connections from a network to speed up inference and reduce
model storage size. Decision trees and neural networks, in general, are overparameterized; pruning
entails deleting unneeded parameters from an overly parameterized model.
Pruning mostly serves as an architectural search inside the tree or network. In fact, because pruning functions
as a regularizer, a model will often generalise slightly better at low levels of sparsity. The pruned model will
match the baseline at higher levels of sparsity. If you push it too far, the model will start to generalise worse than the
baseline, but it will be smaller and faster.

Need for pruning


Pruning simplifies a classifier by combining disjuncts that are adjacent in instance space. By removing error-
prone components, the classifier's performance may be improved. It also permits further analysis of the model
for the purpose of knowledge gain. Pruning should never remove the predictive components of a classifier.
As a result, the pruning operation needs a technique for determining whether a group of disjuncts is predictive or
should be merged into a single, bigger disjunct.
The pruned disjunct represents the "null hypothesis" in a significance test, whereas the unpruned disjuncts
represent the "alternative hypothesis." The test determines whether the data offer adequate evidence to support the
alternative. If so, the unpruned disjuncts are left alone; otherwise, pruning continues.
The rationale for significance tests is that they evaluate whether the apparent correlation between a
collection of disjuncts and the data is likely to be attributable to chance alone. They do so by calculating the
likelihood of observing a random relationship at least as strong as the observed association if the null
hypothesis is true. If the observed relationship is unlikely to be attributable to chance, i.e., this likelihood
does not exceed a set threshold, the unpruned disjuncts are deemed to be predictive; otherwise, the model is
simplified. The aggressiveness of the pruning operation is determined by the "significance level" threshold
used in the test.

Strategies for pruning


Pruning is a critical step in developing a decision tree model. Pruning is commonly employed to alleviate the
overfitting issue in decision trees. Pre-pruning and post-pruning are two common model tree generating
procedures.
Pre-pruning
Pre-pruning is the process of pruning the model by halting the tree's formation in advance. When construction
is halted, the leaf nodes inherit the label of the most common class in the subset associated with the
current node. There are various criteria for pre-pruning, including the following; a sketch of the corresponding
sklearn parameters is given after this list.
 When the tree reaches a specified height, its growth is stopped.
 When the feature vectors of the instances associated with a node are identical, the tree stops
growing.
 When the number of instances within a node falls below a certain threshold, the tree stops growing.
The downside of this criterion is that it works poorly in circumstances where the amount
of data is small.
 An expansion is the process of dividing a node into two child nodes. When the gain value of an expansion
falls below a certain threshold, the tree stops expanding as well.

The major disadvantage of pre-pruning is its narrow field of view: the tree's current
expansion may not meet the criteria, while a later expansion might. In that situation, the decision tree's
development is halted too early.
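In sklearn, these pre-pruning criteria correspond to the growth-stopping parameters of DecisionTreeClassifier. A sketch with illustrative threshold values (the numbers themselves are assumptions, not recommendations):

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(
    max_depth=5, #stop when the tree reaches a specified height
    min_samples_split=20, #stop when a node holds too few instances to split
    min_impurity_decrease=0.01 #stop when an expansion's gain falls below a threshold
)
#classifier.fit(x_train, y_train), as in the earlier sections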

Post-pruning
Post-pruning divides decision tree generation into two steps. The first step is the tree-building
process, with the termination condition that the fraction of a certain class in each node reaches 100%; the
second phase prunes the tree structure obtained in the first phase.
Post-pruning techniques circumvent the problem of the narrow field of view in this way. As a result, post-
pruning procedures are often more accurate than pre-pruning methods, and are therefore
more widely used. The pruning procedure turns a node into a leaf node by
using the label of the most common class in the subset associated with the current node, the same as
in pre-pruning.

Pruning methods
The goal of pruning is to remove the sections of a classification model that explain random variation in the training
sample rather than actual domain characteristics. This makes the model more understandable to the user and,
often, more accurate on fresh data that was not used to train the classifier. Pruning requires an effective approach
for distinguishing the parts of a classifier that are attributable to random effects from the parts that describe
significant structure. The methods listed below are used in both
strategies.

Reduced Error Pruning (REP)


The aim is to discover the smallest subtree that is most accurate on the pruning set.
The pruning set is used to evaluate the efficacy of a subtree (branch) of a fully grown tree in this approach,
which is conceptually the simplest. It starts with the entire tree and compares the number of classification
mistakes made on the pruning set when the subtree is retained to the number of classification errors made
when internal nodes are transformed into leaves and assigned to the best class for each internal node of the
tree. The simplified tree can sometimes outperform the original tree. It is best to prune the subtree in this
scenario. This branch trimming procedure is continued on the simplified tree until the misclassification rate
rises. Another restriction limits the pruning condition: the internal node can be pruned only if it includes no
subtree with a lower error rate than the internal node itself. This indicates that trimmed nodes are evaluated
using a bottom-up traversal technique.
The advantage of this strategy is its linear computing complexity, as each node is only visited once to evaluate
the possibility of trimming it. REP, on the other hand, has a proclivity towards over-pruning. This is because
all evidence contained in the training set and used to construct a fully grown tree is ignored during the pruning
step. This issue is most obvious when the pruning set is significantly smaller than the training set, but it
becomes less significant as the percentage of instances in the pruning set grows.

Pessimistic Error Pruning (PEP)


The fact that the same training set is utilised for both growing and trimming a tree distinguishes this pruning
strategy. The apparent error rate, that is, the error rate on the training set, is optimistic and cannot be used to
select the best-pruned tree. As a result, the continuity correction for the binomial distribution was proposed,
which may give “a more realistic error rate.”
The distribution of errors at the node is roughly a binomial distribution. The binomial distribution’s mean and
variance are the likelihood of success and failure; the binomial distribution converges to a normal distribution.
The PEP approach is regarded as one of the most accurate decision tree pruning algorithms available today.
However, because the mechanism for traversing PEP is similar to pre-pruning, PEP suffers from excessive
pruning. Furthermore, due to its top-down nature, each subtree in the tree only has to be consulted once, and
the time complexity is in the worst-case linear with the number of non-leaf nodes in the decision tree.

Minimum Error Pruning (MEP)
This method is a bottom-up strategy that seeks a single tree with the lowest "anticipated error rate on an
independent data set." This does not mean that a pruning set is adopted, but rather that the developer wants
to estimate the error rate for unseen cases. Indeed, both the original and the enhanced version described
exploit only information from the training set.
In the presence of noisy data, Laplace probability estimation is employed to improve the performance of ID3.
Later, a Bayesian technique was employed to enhance this procedure; the approach is known as
m-probability estimation. There were two modifications:
 Prior probabilities are used in estimate rather than assuming a uniform starting distribution of classes.
 Several trees with differing degrees of pruning may be generated by adjusting the value of the
parameter. The degree of pruning is now decided by parameters rather than the number of classes.
Furthermore, factors like the degree of noise in the training data may be changed based on domain
expertise or the complexity of the problem.
The predicted error rate for each internal node is estimated in the minimal error pruning approach and is
referred to as static error. The anticipated error rate of the branch with the node is then estimated as a weighted
sum of the expected error rates of the node’s children, where each weight represents the chance that
observation in the node would reach the associated child.

Critical Value Pruning (CVP)


This post-pruning approach is quite similar to pre-pruning. A critical value threshold is defined for the
node selection measure. Then an internal node of the tree is pruned if, for each test associated with
edges flowing out of that node, the value returned by the selection measure does not exceed the critical value.
However, a node may meet the pruning criterion while not all of its children do. In that case the branch is
retained, because it includes significant nodes. This additional check is typical of a bottom-up strategy and
distinguishes CVP from pre-pruning methods, which stop a tree from growing even if later tests would prove
to be important.
The degree of pruning changes obviously with the critical value: a greater critical value results in more extreme
pruning. The approach is divided into two major steps:
 Prune the mature tree to increase crucial values.
 Choose the best tree from the sequence of trimmed trees by weighing the tree’s overall relevance and
forecasting abilities.

Cost-Complexity Pruning (CCP)


The CART pruning algorithm is another name for this approach. It is divided into two steps:
1. Using certain techniques, select a parametric family of subtrees from a fully formed tree.
2. The optimal tree is chosen based on an estimation of the real error rates of the trees in the parametric
family.
In terms of the first phase, the primary concept is to prune the branches that exhibit the least increase in
apparent error rate per cut leaf to produce the next best tree from the best tree. When a tree is pruned at a node,
the apparent error rate increases by a certain amount while the number of leaves reduces by a certain number
of units. As a result, the following ratio of the error rate increase to leaf reduction measures the rise in apparent
error rate per trimmed leaf. The next best tree in the parametric family is then created by trimming all nodes
in the subtree with the lowest value of the above-mentioned ratio.
The best tree in the entire grown tree in terms of predicted accuracy is picked in the second phase. The real
error rate of each tree in the family may be estimated in two ways: one using cross-validation sets and the
other using an independent pruning set.
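sklearn exposes this method as minimal cost-complexity pruning. A self-contained sketch on stand-in data: cost_complexity_pruning_path generates the alphas that index the parametric family of subtrees, and ccp_alpha selects one of them (here the middle alpha, purely as an illustrative choice; in practice it is chosen by cross-validation or an independent pruning set, as described above):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0) #stand-in data
#Phase 1: the family of subtrees is indexed by the alphas along the pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
#Phase 2: pick one tree from the family
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2] #illustrative choice
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
print(pruned.get_n_leaves())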

Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It
can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the
performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and
based on the majority votes of predictions, and it predicts the final output.

The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.

Assumptions for Random Forest


Since the random forest combines multiple trees to predict the class of the dataset, it is possible that some
decision trees may predict the correct output, while others may not. But together, all the trees predict the
correct output. Therefore, below are two assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier can
predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.

Why use Random Forest?


Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, even for the large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?


Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and
the second is to make predictions with each tree created in the first phase.
The working process can be explained by the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the
category that wins the majority vote.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the Random
forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase,
each decision tree produces a prediction result, and when a new data point occurs, then based on the majority
of results, the Random Forest classifier predicts the final decision. Consider the below image:
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Advantages of Random Forest
o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest
o Although random forest can be used for both classification and regression tasks, it is less suitable
for regression tasks.

Python Implementation of Random Forest Algorithm


Now we will implement the Random Forest Algorithm tree using Python. For this, we will use the same
dataset "user_data.csv", which we have used in previous classification models. By using the same dataset, we
can compare the Random Forest classifier with other classification models such as Decision tree
Classifier, KNN, SVM, Logistic Regression, etc.

Implementation Steps are given below:


o Data Pre-processing step
o Fitting the Random forest algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result.

1.Data Pre-Processing Step:


Below is the code for the pre-processing step:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have pre-processed the data. Where we have loaded the dataset, which is given as:

2. Fitting the Random Forest algorithm to the training set:


Now we will fit the Random forest algorithm to the training set. To fit it, we will import
the RandomForestClassifier class from the sklearn.ensemble library. The code is given below:
#Fitting Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier= RandomForestClassifier(n_estimators= 10, criterion="entropy")
classifier.fit(x_train, y_train)

In the above code, the classifier object takes below parameters:


o n_estimators= the required number of trees in the Random Forest. The default value is 10. We can
choose any number, but we need to take care of the overfitting issue.
o criterion= the function used to measure the quality of a split. Here we have taken "entropy", i.e., the
information gain.

3. Predicting the Test Set result
Since our model is fitted to the training set, we can now predict the test result. For prediction, we will create
a new prediction vector y_pred. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
Output:
The prediction vector is given as:
By checking the above prediction vector and test set real vector, we can determine the incorrect predictions
done by the classifier.

4. Creating the Confusion Matrix


Now we will create the confusion matrix to determine the correct and incorrect predictions. Below is the code
for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output: As we can see in the above matrix, there are 4+4= 8 incorrect predictions and 64+28= 92
correct predictions.

5. Visualizing the training Set result


Here we will visualize the training set result. To visualize the training set result we will plot a graph for the
Random forest classifier. The classifier will predict yes or No for the users who have either Purchased or Not
purchased the SUV car as we did in Logistic Regression. Below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

6. Visualizing the test set result


Now we will visualize the test set result. Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Ensemble Learning
Ensemble learning helps improve machine learning results by combining several models. This approach
allows better predictive performance than a single model. The basic idea is to learn a
set of classifiers (experts) and to allow them to vote.

Advantage: Improvement in predictive accuracy.


Disadvantage : It is difficult to understand an ensemble of classifiers.

Main Challenge for Developing Ensemble Models?


The main challenge is not to obtain highly accurate base models, but rather to obtain base models which
make different kinds of errors. For example, if ensembles are used for classification, high accuracies can be
accomplished if different base models misclassify different training examples, even if the base classifier
accuracy is low.
Methods for Independently Constructing Ensembles –
 Majority Vote
 Bagging and Random Forest
 Randomness Injection
 Feature-Selection Ensembles
 Error-Correcting Output Coding

Methods for Coordinated Construction of Ensembles –


 Boosting
 Stacking

Types of Ensemble Classifier –


Bagging:
Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree. Given a set D of d
tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap
sample). A classifier model Mi is then learned for each training set Di. Each classifier Mi returns its class
prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X
(the unknown sample).
Implementation steps of Bagging –
1. Multiple subsets are created from the original data set with equal tuples, selecting observations
with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of each other.
4. The final predictions are determined by combining the predictions from all the models.
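sklearn's BaggingClassifier implements these steps directly (its default base model is a decision tree). A minimal sketch on stand-in data:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, random_state=0) #stand-in data
bagging = BaggingClassifier(
    n_estimators=10, #number of bootstrap subsets / base models
    bootstrap=True, #sample observations with replacement
    random_state=0,
).fit(X, y)
print(bagging.predict(X[:5])) #majority vote across the 10 base models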

Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a decision tree classifier
and is generated using a random selection of attributes at each node to determine the split. During
classification, each tree votes and the most popular class is returned.

Implementation steps of Random Forest –


1. Multiple subsets are created from the original data set, selecting observations with replacement.
2. A subset of features is selected randomly, and whichever feature gives the best split is used to split
the node iteratively.
3. Each tree is grown to its largest extent.
4. Repeat the above steps; the prediction is given by aggregating the predictions of the n
trees.

Boosting in Machine Learning - Boosting and AdaBoost


Boosting is an ensemble modeling technique that attempts to build a strong classifier from a
number of weak classifiers. It does so by building weak models in series. First,
a model is built from the training data. Then a second model is built which tries to correct the
errors present in the first model. This procedure is continued, and models are added, until either the
complete training data set is predicted correctly or the maximum number of models has been added.

AdaBoost was the first really successful boosting algorithm developed for the purpose of binary
classification. AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that
combines multiple “weak classifiers” into a single “strong classifier”. It was formulated by Yoav Freund
and Robert Schapire. They also won the 2003 Gödel Prize for their work.

Algorithm:
1. Initialise the dataset and assign an equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End

Explanation:
The above diagram explains the AdaBoost algorithm in a very simple way. Let’s try to understand it in a
stepwise process:
 B1 consists of 10 data points of two types, plus(+) and minus(-): 5 are
plus(+) and the other 5 are minus(-), and each is initially assigned an equal weight.
The first model tries to classify the data points and generates a vertical separator line,
but it wrongly classifies 3 plus(+) as minus(-).
 B2 consists of the 10 data points from the previous model in which the 3 wrongly classified
plus(+) are weighted more so that the current model tries more to classify these pluses(+)
correctly. This model generates a vertical separator line that correctly classifies the previously
wrongly classified pluses(+) but in this attempt, it wrongly classifies three minuses(-).
 B3 consists of the 10 data points from the previous model in which the 3 wrongly classified
minus(-) are weighted more so that the current model tries more to classify these minuses(-)
correctly. This model generates a horizontal separator line that correctly classifies the previously
wrongly classified minuses(-).
 B4 combines together B1, B2, and B3 in order to build a strong prediction model which is much
better than any individual model used.

Making Predictions with AdaBoost
Predictions are made by calculating the weighted average of the weak classifiers.
For a new input instance, each weak learner calculates a predicted value of either +1.0 or -1.0. The predicted
values are weighted by each weak learner's stage value. The prediction of the ensemble model is taken as the
sum of the weighted predictions: if the sum is positive, the first class is predicted; if negative, the second
class is predicted.
For example, 5 weak classifiers may predict the values 1.0, 1.0, -1.0, 1.0, -1.0. From a majority vote, it looks
like the model will predict a value of 1.0 or the first class. These same 5 weak classifiers may have the stage
values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these predictions results in an
output of -0.8, which would be an ensemble prediction of -1.0 or the second class.
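The worked example can be checked with a few lines of Python:

# Weighted vote of the 5 weak classifiers from the example above
predictions = [1.0, 1.0, -1.0, 1.0, -1.0]
stage_values = [0.2, 0.5, 0.8, 0.2, 0.9]

weighted_sum = sum(s * p for s, p in zip(stage_values, predictions))
prediction = 1.0 if weighted_sum > 0 else -1.0
print(weighted_sum, prediction)  # -0.8 -1.0, i.e., the second class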

Data Preparation for AdaBoost
This section lists some heuristics for best preparing your data for AdaBoost.
 Quality Data: Because the ensemble method continues to attempt to correct misclassifications in the
training data, you need to be careful that the training data is of high quality.
 Outliers: Outliers will force the ensemble down the rabbit hole of working hard to correct for cases
that are unrealistic. These could be removed from the training dataset.
 Noisy Data: Noisy data, specifically noise in the output variable can be problematic. If possible,
attempt to isolate and clean these from your training dataset.

Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used
for Classification as well as Regression problems. However, primarily, it is used for Classification problems
in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional
space into classes so that we can easily put the new data point in the correct category in the future. This best
decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called
support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in
which two different categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a
strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is
a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of
images of cats and dogs so that it can learn their different features, and then we test it with this strange
creature. The support vector machine creates a decision boundary between the two classes (cat and dog)
using the extreme cases (support vectors), and on the basis of these support vectors it will classify the
creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data: if a dataset can be classified into two
classes by using a single straight line, it is termed linearly separable data, and the classifier used is
called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data: if a dataset cannot be
classified by using a straight line, it is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space,
but we need to find out the best decision boundary that helps to classify the data points. This best boundary is
known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2
features (as shown in the image), the hyperplane is a straight line; if there are 3 features, the hyperplane is a
two-dimensional plane.
We always create the hyperplane with the maximum margin, i.e., the maximum distance between the
hyperplane and the nearest data points of either class.

Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect its position are termed support
vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood with an example. Suppose we have a dataset with two
tags (green and blue) and two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of
coordinates as either green or blue. Consider the below image:

Since this is a 2-D space, we can separate these two classes just by using a straight line. But there can be
multiple lines that separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is
called a hyperplane. The SVM algorithm finds the points of both classes that are closest to the boundary;
these points are called support vectors. The distance between the support vectors and the hyperplane is
called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum
margin is called the optimal hyperplane.

Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot
draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data we have used the two
dimensions x and y, so for non-linear data we add a third dimension z, calculated as:
z = x² + y²
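A short sketch of this feature mapping (the random data is illustrative):

import numpy as np

# Lift 2-D points into 3-D with z = x^2 + y^2: points near the origin get
# small z values and points far away get large ones, so a flat plane in 3-D
# can now separate classes that form concentric rings in 2-D.
X = np.random.randn(200, 2)
z = X[:, 0] ** 2 + X[:, 1] ** 2
X_3d = np.column_stack([X, z])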

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below image:

Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-y plane. If we convert it
back to 2-D space by taking the slice z = 1, it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.

Python Implementation of Support Vector Machine
Now we will implement the SVM algorithm using Python. Here we will use the same dataset user_data,
which we have used in Logistic regression and KNN classification.

Data Pre-processing step
Up to the data pre-processing step, the code remains the same as in the earlier chapters. Below is the code:
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
After executing the above code, we will pre-process the data. The code will give the dataset as:

The scaled output for the test set will be:

Fitting the SVM classifier to the training set:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we import
the SVC class from the sklearn.svm library. Below is the code for it:
from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', since we are creating an SVM for linearly separable data;
it can be changed for non-linear data. We then fitted the classifier to the training data (x_train, y_train).

Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)

The model performance can be altered by changing the value of C (the regularization factor), gamma, and
the kernel.
Predicting the test set result:
Now we will predict the output for the test set. For this, we will create a new vector y_pred. Below is the
code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)

After getting the y_pred vector, we can compare the result of y_pred and y_test to check the difference
between the actual value and predicted value.

Output: Below is the output for the prediction of the test set:

Creating the confusion matrix:
Now we will see the performance of the SVM classifier, i.e., how many incorrect predictions it makes
compared to the Logistic regression classifier. To create the confusion matrix, we need to import
the confusion_matrix function of the sklearn library. After importing the function, we will call it and store
the result in a new variable cm. The function takes two parameters, mainly y_true (the actual values)
and y_pred (the values predicted by the classifier). Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:

As we can see in the above output image, there are 66+24 = 90 correct predictions and 8+2 = 10 incorrect
predictions. Therefore we can say that our SVM model improved as compared to the Logistic regression
model.
Visualizing the training set result:
Now we will visualize the training set result, below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:

As we can see, the above output appears similar to the Logistic regression output. In the output, we got a
straight line as the hyperplane because we used a linear kernel in the classifier; as discussed above, for 2-D
space the hyperplane in SVM is a straight line.
Visualizing the test set result:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:

As we can see in the above output image, the SVM classifier has divided the users into two regions (Purchased
or Not purchased). Users who purchased the SUV are in the red region with red scatter points, and users who
did not purchase the SUV are in the green region with green scatter points. The hyperplane has divided the
data into the Purchased and Not purchased classes.

Large Margin Intuition
SVM Decision Boundary
Consider a case where we set the constant C to a very large value. When minimizing the optimization
objective, we are then highly motivated to choose parameters so that the first term equals 0. What would it
take to make this first term equal to 0? We need θᵀx(i) ≥ 1 whenever y(i) = 1, and θᵀx(i) ≤ −1 whenever
y(i) = 0.

When the first term is equal to 0, we only need to minimize (ignoring θ0):

(1/2) ∑j=1 to n θj²

Linearly separable case

In the linearly separable case, the decision boundary obtained by minimizing this objective has as large a
margin as possible (hence the name Large Margin Intuition). This means SVM will choose the black
decision boundary below instead of the pink or green ones:

Mathematics Behind Large Margin Intuition

Vector Inner Product

For two vectors u and v, the inner product can be written as uᵀv = p · ‖u‖, where p is the length of the
projection of v onto u; p can be positive or negative.
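A quick numeric check of this identity (the vectors are illustrative):

import numpy as np

u = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])

p = np.dot(u, v) / np.linalg.norm(u)        # signed length of the projection of v onto u
print(np.dot(u, v), p * np.linalg.norm(u))  # both equal 5.0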

SVM Decision Boundary
We can rewrite the optimization objective of SVM as minimizing (1/2) ∑j=1 to n θj² subject to
p(i) · ‖θ‖ ≥ 1 if y(i) = 1 and p(i) · ‖θ‖ ≤ −1 if y(i) = 0, where p(i) is the projection of x(i) onto the vector θ.
Simplification: θ0 = 0.
According to the illustration below, for these constraints to hold while the magnitude of θ is minimized, the
absolute value of p must be as large as possible (hence the large margin).

In logistic regression, we take the output of the linear function and squash it into the range [0, 1] using the
sigmoid function; if the squashed value is greater than a threshold (0.5) we assign the label 1, else the label 0.
In SVM, we take the output of the linear function directly: if that output is greater than 1, we identify the
instance with one class, and if the output is less than -1, we identify it with the other class. Since the
threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values ([-1, 1])
which acts as the margin.

Loss Function

In machine learning, the loss function is the difference between the actual output and the output predicted
by the model for a single training example, while the average of the loss function over all training examples
is termed the cost function. The difference computed by the loss function (such as a Regression Loss,
Binary Classification, or Multiclass Classification loss function) is termed the error value; this error value
is directly proportional to the difference between the actual and the predicted value.

How do Loss Functions Work?
The word ‘Loss’ states the penalty for failing to achieve the expected output. If the deviation of the value
predicted by our model from the expected value is large, the loss function outputs a higher number; if the
deviation is small and close to the expected value, it outputs a smaller number.

It is important to note that for classification losses, what matters most is whether the label predicted by our
model is right or wrong, whereas for regression losses the amount of deviation matters. Loss functions differ
based on the problem statement to which machine learning is applied. The term cost function is sometimes
used interchangeably with loss function, but it holds a slightly different meaning: a loss function is for a single
training example, while a cost function is the average loss over the complete training dataset.

Types of Loss Functions in Machine Learning
Below are the different types of loss functions in machine learning:

1. Regression loss functions
Linear regression is the fundamental model behind these loss functions. A regression model establishes a
relationship between a dependent variable (Y) and independent variables (X), and we try to fit the best line
in space on these variables:
Y = b0 + b1X1 + b2X2 + … + bnXn
 X = independent variables
 Y = dependent variable

Mean Squared Error Loss
MSE(L2 error) measures the average squared difference between the actual and predicted values by the model.
The output is a single number associated with a set of values. Our aim is to reduce MSE to improve the
accuracy of the model.

Consider the linear equation y = mx + b; we can derive MSE as:

MSE = (1/N) ∑i=1 to N (y(i) − (m·x(i) + b))²

Here, N is the total number of data points, y(i) is the actual value, and m·x(i) + b is its predicted value.

Mean Squared Logarithmic Error Loss (MSLE)
MSLE measures the relative (ratio-based) difference between actual and predicted values, which introduces
an asymmetry in the error curve: MSLE only cares about the percentage difference between actual and
predicted values. It can be a good choice as a loss function when we want to predict, say, house or bakery
sales prices, and the data is continuous. The loss is calculated as the mean of the squared differences between
the log-transformed actual and predicted values:

L = (1/n) ∑i=1 to n (log(y(i) + 1) − log(ŷ(i) + 1))²

Mean Absolute Error (MAE)
MAE calculates the sum of absolute differences between actual and predicted values, i.e., it measures the
average magnitude of the errors in a set of predictions. The mean squared error is easier to optimize, but the
absolute error is more robust to outliers. Outliers are values that deviate extremely from the other observed
data points.

MAE can be calculated as:

L = (1/n) ∑i=1 to n |y(i) − ŷ(i)|
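The three regression losses above can be written directly in NumPy (a minimal sketch; y_true holds the actual values and y_pred the predictions, both as arrays):

import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences
    return np.mean((y_true - y_pred) ** 2)

def msle(y_true, y_pred):
    # Mean Squared Logarithmic Error: squared difference of log(value + 1)
    return np.mean((np.log(y_true + 1) - np.log(y_pred + 1)) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: average of absolute differences
    return np.mean(np.abs(y_true - y_pred))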

2. Binary Classification Loss Functions
These loss functions measure the performance of a classification model. Here, each data point is assigned
one of two labels, i.e. either 0 or 1. Further, they can be classified as:
Binary Cross-Entropy
It is the default loss function for binary classification problems. Cross-entropy loss measures the performance
of a classification model that outputs a probability value between 0 and 1. The cross-entropy loss increases
as the predicted probability deviates from the actual label.
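A minimal NumPy sketch of binary cross-entropy (labels in {0, 1}, predictions are probabilities; the clipping guards against log(0)):

import numpy as np

def binary_cross_entropy(y_true, p_pred):
    # keep probabilities away from exactly 0 and 1 so the logs stay finite
    p_pred = np.clip(p_pred, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))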

Hinge loss
Hinge loss can be used as an alternative to cross-entropy; it was initially developed for use with the support
vector machine algorithm. Hinge loss works best for classification problems in which the target values are in
the set {-1, 1}. It assigns more error when there is a difference in sign between the actual and predicted values,
which can result in better performance than cross-entropy for maximum-margin classification.

Squared Hinge loss
An extension of hinge loss, which simply calculates the square of the hinge loss score. Squaring smooths the
error function and makes it numerically easier to work with. It finds the classification boundary that specifies
the maximum margin between the data points of the various classes. Squared hinge loss fits perfectly for
YES-or-NO decision problems, where the probability deviation is not the concern.

3. Multi-class Classification Loss Functions
Multi-class classification refers to predictive models in which data points are assigned to more than two
classes. Each class is assigned a unique integer value from 0 to (Number_of_classes – 1). These losses are
widely used for image or text classification problems, where a single document can belong to multiple topics.

Multi-class Cross-Entropy
In this case, the target values are in the set {0, 1, 2, …, n}. It calculates a score based on the average
difference between the actual and predicted probability values, and the score is minimized to reach the best
possible accuracy. Multi-class cross-entropy is the default loss function for text classification problems.
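A sketch of multi-class cross-entropy over one-hot labels and predicted class probabilities:

import numpy as np

def categorical_cross_entropy(y_true, p_pred):
    # y_true: one-hot labels, p_pred: predicted probabilities,
    # both of shape (n_samples, n_classes)
    p_pred = np.clip(p_pred, 1e-12, 1.0)
    return -np.mean(np.sum(y_true * np.log(p_pred), axis=1))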

Sparse Multi-class Cross-Entropy
The one-hot encoding step makes multi-class cross-entropy difficult to scale to a large number of classes and
data points. Sparse cross-entropy solves this problem by computing the error directly on integer class labels,
without one-hot encoding.

Kullback Leibler Divergence Loss
KL divergence loss calculates the divergence between a probability distribution and a baseline distribution
and determines how much information is lost, in bits. The output is a non-negative value that specifies how
close two probability distributions are. From a probabilistic point of view, KL divergence is described using
the likelihood ratio.

Hinge Loss
The hinge loss is a specific type of cost function that incorporates a margin or distance from the
classification boundary into the cost calculation. Even if new observations are classified correctly, they
can incur a penalty if the margin from the decision boundary is not large enough. The hinge loss
increases linearly.
The hinge loss is mostly associated with soft-margin support vector machines.

If you are familiar with the construction of hyperplanes and their margins in support vector machines, you
probably know that margins are often defined as having a distance equal to 1 from the data-separating
hyperplane. We want data points to not only fall on the correct side of the hyperplane but also to be located
beyond the margin.

Support vector machines address a classification problem where observations either have an outcome of +1
or -1. The support vector machine produces a real-valued output that is negative or positive depending on
which side of the decision boundary it falls. Only if an observation is classified correctly and the distance from
the plane is larger than the margin will it incur no penalty. The distance from the hyperplane can be regarded
as a measure of confidence. The further an observation lies from the plane, the more confident it is in the
classification.

For example, if an observation was associated with an actual outcome of +1, and the SVM produced an output
of 1.5, the loss would equal 0.
Contrary to methods like linear regression, where we try to find a line that minimizes the distance from the
data points, an SVM tries to maximize the distance (margin). Comparing the two approaches nicely illustrates
the difference between the nature of regression and classification problems.

An observation that is located directly on the boundary would incur a loss of 1 regardless of whether the real
outcome was +1 or -1.

Observations that fall on the correct side of the decision boundary (hyperplane) but are within the margin incur
a cost between 0 and 1. For example, if the actual outcome was 1 and the classifier predicted 0.5, the
corresponding loss would be 0.5 even though the classification is correct.

All observations that end up on the wrong side of the hyperplane incur a loss greater than 1, which increases
linearly with the distance from the margin.
Now that we have a strong intuitive understanding of the hinge loss, understanding the math will be a breeze.

Hinge Loss Formula
The loss is defined according to the following formula, where t is the actual outcome (either 1 or -1), and y is
the output of the classifier:

l(y) = max(0, 1 − t·y)

Let's plug in the values from our last example. The outcome was 1, and the prediction was 0.5:

l(y) = max(0, 1 − 1·0.5) = 0.5

If, on the other hand, the outcome was -1, the loss would be higher since we've misclassified our example:

l(y) = max(0, 1 − (−1)·0.5) = 1.5
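Both examples can be verified with a tiny Python function (a minimal sketch of the formula above):

def hinge_loss(t, y):
    # t: actual label in {-1, +1}; y: real-valued classifier output
    return max(0.0, 1.0 - t * y)

print(hinge_loss(1, 0.5))   # 0.5  (correct side, but inside the margin)
print(hinge_loss(-1, 0.5))  # 1.5  (wrong side of the boundary)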
Instead of using a labelling convention of -1 and 1, we could also use 0 and 1 and use the cross-entropy
formula to set one of the terms equal to zero, but the math works out more elegantly in the former case.
With the hinge loss defined, we are now in a position to understand the loss function for the support vector
machine. But before we do this, we’ll briefly discuss why and when we actually need a cost function.

Hard Margin vs Soft Margin Support Vector Machine
In a hard margin SVM, we want to linearly separate the data without misclassification. This implies that the
data actually has to be linearly separable.

In this case, the blue and red data points are linearly separable, allowing for a hard margin classifier.
If the data is not linearly separable, hard margin classification is not applicable.
Even though support vector machines are linear classifiers, they are still able to separate data points that are
not linearly separable by applying the kernel trick.

The blue and the red data points are not linearly separable.
Furthermore, if the margin of the SVM is very small, the model is more likely to overfit. In these cases, we
can choose to cut the model some slack by allowing for misclassifications. We call this a soft margin support
vector machine. But if the model produces too many misclassifications, its utility declines. Therefore, we need
to penalize the misclassified samples by introducing a cost function.
In summary, the soft margin support vector machine requires a cost function while the hard margin SVM does
not.

SVM Cost
For the support vector classifier, the optimization objective is to minimize the term w, a vector orthogonal to
the data-separating hyperplane onto which we project our data points:

min_w (1/2) ∑i=1 to n wi²
This minimization problem represents the primal form of the hard margin SVM, which doesn't account for
classification errors.
For the soft-margin SVM, we combine the minimization objective with a loss function such as the hinge loss:

min_w (1/2) ∑i=1 to n wi² + ∑j=1 to m max(0, 1 − tj·yj)

The first term sums over the number of features (n), while the second term sums over the number of samples
in the data (m). Here tj is the actual label of sample j, and yj is the output produced by the model as a product
of the weight parameter w and the data input x:

yj = wᵀxj
The loss term has a regularizing effect on the model. But how can we control the regularization, that is, how
aggressively the model should try to avoid misclassifications? To manually control the number of
misclassifications tolerated during training, we introduce an additional parameter, C, which we multiply with
the loss term:

min_w (1/2) ∑i=1 to n wi² + C ∑j=1 to m max(0, 1 − tj·yj)

The smaller C is, the stronger the regularization. Accordingly, the model will attempt to maximize the margin
and be more tolerant towards misclassifications.
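As a sketch, this soft-margin objective can be written directly in NumPy (the names w, X, t are illustrative: weight vector, input matrix, and labels in {-1, +1}):

import numpy as np

def soft_margin_cost(w, X, t, C=1.0):
    # 0.5 * ||w||^2 (margin term) + C * sum of hinge losses (error term)
    y = X @ w                             # model outputs y_j = w^T x_j
    hinge = np.maximum(0.0, 1.0 - t * y)  # per-sample hinge loss
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)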

Cost function with a small regularization parameter C
If we set C to a large number, then the SVM will pursue outliers more aggressively, which potentially comes
at the cost of a smaller margin and may lead to overfitting on the training data. The classifier might be less
robust on unseen data.

Cost function with a large regularization parameter C leading to less regularization.

SVM Kernels
A kernel function is a method used to take data as input and transform it into the required form for processing.
The name “kernel” refers to a set of mathematical functions used in Support Vector Machines that provide a
window to manipulate the data. A kernel function generally transforms the training data so that a non-linear
decision surface becomes a linear equation in a higher-dimensional space. Basically, it returns the inner
product between two points in a standard feature space.
Standard kernel function equation: K(x, y) = <φ(x), φ(y)>, where φ maps the inputs into the
(higher-dimensional) feature space.
Major Kernel Functions :-
To implement kernel functions, first install the “scikit-learn” library from the command prompt / terminal:

pip install scikit-learn

Gaussian Kernel: It is used to perform transformation when there is no prior knowledge about the data.

 Gaussian Kernel Radial Basis Function (RBF): Same as the above kernel function, adding the radial
basis method to improve the transformation. It takes the form K(x, y) = exp(−‖x − y‖² / (2σ²)).

Gaussian Kernel Graph
Code:
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(x_train, y_train)  # training set in x, y axis

Sigmoid Kernel: This function is equivalent to a two-layer perceptron model of a neural network, and is used
as an activation function for artificial neurons. It takes the form K(x, y) = tanh(α·xᵀy + c).

39
Sigmoid Kernel Graph

Code:
from sklearn.svm import SVC
classifier = SVC(kernel='sigmoid')
classifier.fit(x_train, y_train)  # training set in x, y axis

Polynomial Kernel: It represents the similarity of vectors in the training set of data in a feature space over
polynomials of the original variables used in the kernel. It takes the form K(x, y) = (xᵀy + c)^d.

Polynomial Kernel Graph
Code:
from sklearn.svm import SVC
classifier = SVC(kernel='poly', degree=4)
classifier.fit(x_train, y_train)  # training set in x, y axis
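For reference, the three kernels above can also be written out explicitly in NumPy (a sketch; the parameter names sigma, alpha, c, and degree are illustrative, and x, y are 1-D feature vectors):

import numpy as np

def gaussian_rbf_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 * sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, alpha=1.0, c=0.0):
    # tanh(alpha * x.y + c)
    return np.tanh(alpha * np.dot(x, y) + c)

def polynomial_kernel(x, y, c=1.0, degree=4):
    # (x.y + c)^degree
    return (np.dot(x, y) + c) ** degree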
