ML-Unit 5
k-Nearest Neighbors – Decision Trees – Branching – Greedy Algorithm – Multiple Branches – Continuous
attributes – Pruning. Random Forests: ensemble learning. Boosting – AdaBoost algorithm. Support Vector
Machines – Large Margin Intuition – Loss Function – Hinge Loss – SVM Kernels
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it is calculated as: d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:
o As we can see, the 3 nearest neighbors are from category A, hence this new data point must belong to
category A.
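To make these steps concrete, below is a minimal NumPy sketch of the distance-and-vote procedure; the toy points, labels and query point are illustrative assumptions and are not taken from the SUV dataset used later.
# Minimal K-NN sketch (illustrative toy data; k is chosen by the user)
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Steps 2-3: Euclidean distance to every training point, keep the k nearest
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count categories among the k neighbors and take the majority
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 7]])   # toy points
y_train = np.array(['A', 'A', 'A', 'B', 'B'])                  # toy categories
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))    # -> 'A'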
Problem for K-NN Algorithm: There is a car manufacturer company that has manufactured a new SUV car.
The company wants to show ads to the users who are interested in buying that SUV. For this problem,
we have a dataset that contains information about multiple users collected through a social network. The dataset contains
a lot of information, but we will consider Estimated Salary and Age as the independent variables and
the Purchased variable as the dependent variable. Below is the dataset:
Steps to implement the K-NN algorithm:
Data Pre-processing step
Fitting the K-NN algorithm to the Training set
Predicting the test result
Test accuracy of the result(Creation of Confusion matrix)
Visualizing the test set result.
And then we will fit the classifier to the training data. Below is the code for it:
#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we did in Logistic
Regression. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
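For the confusion-matrix step listed above, a short sketch (assuming y_test from the same train/test split is available) would be:
#Creating the Confusion matrix for the K-NN predictions
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)     # rows: actual class, columns: predicted class
print(cm)
print(accuracy_score(y_test, y_pred))     # fraction of correct predictions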
Visualizing the Training set result:
Now, we will visualize the training set result for the K-NN model. The code remains the same as in
Logistic Regression, except for the name of the graph.
The output graph is different from the graph we obtained in Logistic Regression. It can be
understood from the points below:
As we can see, the graph shows red points and green points. The green points are for the
Purchased (1) variable and the red points for the Not Purchased (0) variable.
The graph shows an irregular boundary instead of a straight line or a curve
because the K-NN algorithm classifies each point by its nearest neighbors.
The graph has classified users into the correct categories, as most of the users who didn't buy the SUV
are in the red region and users who bought the SUV are in the green region.
The graph shows a good result, but there are still some green points in the red region and red
points in the green region. This is not a big issue, as it keeps the model from
overfitting.
Hence our model is well trained.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset and
problem is the main point to remember while creating a machine learning model. Below are the two reasons
for using the Decision tree:
Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting
a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that splits into sub-nodes is called the parent node of those sub-nodes, and the
sub-nodes are called its child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the
tree. This algorithm compares the values of root attribute with the record (real dataset) attribute and, based on
the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further.
It continues the process until it reaches a leaf node of the tree. The complete process can be better understood
using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the nodes; call
the final node a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether to accept the
offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM).
The root node splits further into the next decision node (distance from the office) and one leaf node based on
the corresponding labels. The next decision node further splits into one decision node (cab facility) and
one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer).
Consider the below diagram:
Attribute Selection Measures
While implementing a Decision tree, the main issue that arises is how to select the best attribute for the root
node and for the sub-nodes. To solve such problems there is a technique called Attribute Selection
Measure, or ASM. With this measure, we can easily select the best attribute for the nodes of the tree. There
are two popular techniques for ASM, which are:
Information Gain
Gini Index
1. Information Gain:
Information gain is the measurement of changes in entropy after the segmentation of a dataset based
on an attribute.
It calculates how much information a feature provides us about a class.
According to the value of information gain, we split the node and build the decision tree.
A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute
having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data.
Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
S= Total number of samples
P(yes)= probability of yes
P(no)= probability of no
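As a worked illustration of these formulas, the small Python sketch below computes the entropy of a yes/no node and the information gain of a hypothetical split; the counts (9 yes / 5 no, subsets of size 8 and 6) are made-up numbers for illustration only.
import math

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
    return -sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)

parent = entropy(9/14, 5/14)                     # parent node, about 0.940
child1 = entropy(6/8, 2/8)                       # first subset, about 0.811
child2 = entropy(3/6, 3/6)                       # second subset, exactly 1.0
weighted = (8/14) * child1 + (6/14) * child2     # weighted average over the children
info_gain = parent - weighted                    # Information Gain of this split
print(round(info_gain, 3))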
2. Gini Index:
Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over an attribute with a high Gini index.
It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
Gini index can be calculated using the below formula:
Gini Index = 1 − ∑j Pj²
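A corresponding sketch for the Gini index, reusing the same illustrative class proportions as above:
def gini_index(*class_probs):
    # Gini Index = 1 - sum_j Pj^2
    return 1 - sum(p ** 2 for p in class_probs)

print(gini_index(9/14, 5/14))   # impure node, about 0.459
print(gini_index(1.0, 0.0))     # pure node -> 0.0 (a lower Gini index is preferred)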
Pruning: Getting an Optimal Decision tree
Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal decision tree.
A tree that is too large increases the risk of overfitting, while a small tree may not capture all the important features of
the dataset. A technique that decreases the size of the learning tree without reducing accuracy is therefore
known as pruning. There are mainly two types of tree pruning techniques used:
Cost Complexity Pruning
Reduced Error Pruning.
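As a concrete illustration of cost complexity pruning, scikit-learn's DecisionTreeClassifier exposes a ccp_alpha parameter; the sketch below is only indicative (the alpha value is an arbitrary illustrative choice, and x_train/y_train are assumed to come from the usual pre-processing step).
from sklearn.tree import DecisionTreeClassifier

# An unpruned tree (ccp_alpha=0 by default) versus a cost-complexity-pruned tree
full_tree = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(x_train, y_train)

# Candidate alpha values can be read off the pruning path of the full tree
path = full_tree.cost_complexity_pruning_path(x_train, y_train)
print(path.ccp_alphas)                                        # effective alphas of the subtrees
print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())   # the pruned tree has fewer leaves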
Python Implementation of Decision Tree
Now we will implement the Decision tree using Python. For this, we will use the dataset "user_data.csv,"
which we have used in previous classification models. By using the same dataset, we can compare the Decision
tree classifier with other classification models such as KNN, SVM, Logistic Regression, etc.
Data Pre-processing step
Fitting a Decision-Tree algorithm to the Training set
Predicting the test result
Test accuracy of the result(Creation of Confusion matrix)
Visualizing the test set result.
The data pre-processing step is the same as in the previous classification models: we load the user_data.csv dataset, take Age and Estimated Salary as the independent variables and Purchased as the dependent variable, split the data into training and test sets, and apply feature scaling.
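The fitting step itself is not reproduced here; a minimal sketch consistent with the two parameters described below (using scikit-learn's DecisionTreeClassifier and the x_train/y_train produced by the pre-processing) would be:
#Fitting Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)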
In the above code, we have created a classifier object, to which we have passed two main parameters:
criterion='entropy': the criterion is used to measure the quality of a split, which here is calculated via the
information gain given by entropy.
random_state=0: for generating reproducible random states.
3. Predicting the test result
Now we will predict the test set result. We will create a new prediction vector y_pred. Below is the code for
it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
Output:
In the below output image, the predicted output and real test output are given. We can clearly see that there
are some values in the prediction vector, which are different from the real vector values. These are prediction
errors.
In the above output image, we can see the confusion matrix, which has 6+3 = 9 incorrect
predictions and 62+29 = 91 correct predictions. Therefore, we can say that, compared to the other
classification models, the Decision Tree classifier made a good prediction.
The above output is completely different from the rest of the classification models. It has both vertical and horizontal
lines that split the dataset according to the Age and Estimated Salary variables.
As we can see, the tree is trying to capture every data point, which is a sign of overfitting.
Visualization of test set result will be similar to the visualization of the training set except that the training set
will be replaced with the test set.
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
3. Greedy Algorithm:
The greedy method is one of the strategies, like divide and conquer, used to solve problems. This
method is used for solving optimization problems. An optimization problem is a problem that demands
either a maximum or a minimum result.
The greedy method is the simplest and most straightforward approach. It is not an algorithm but a
technique. The main idea of this approach is that each decision is taken on the basis of the currently
available information: whatever information is currently present, the decision is made without
worrying about the effect of the current decision in the future.
This technique is basically used to determine a feasible solution, which may or may not be optimal. A
feasible solution is a subset of the candidates that satisfies the given criteria. The optimal solution is the
best and most favourable solution among the feasible solutions. If more than one
solution satisfies the given criteria, then all of those solutions are considered feasible, whereas the
optimal solution is the best solution among them.
Pseudo code of Greedy Algorithm
Algorithm Greedy(a, n)
{
    solution := 0;
    for i = 1 to n do
    {
        x := select(a);
        if feasible(solution, x) then
            solution := union(solution, x);
    }
    return solution;
}
The above is the greedy algorithm. Initially, the solution is assigned the value zero. We pass the array and the
number of elements to the greedy algorithm. Inside the for loop, we select the elements one by one and check
whether the solution is feasible or not. If the solution is feasible, then we perform the union.
Let's understand through an example.
Suppose there is a problem 'P'. I want to travel from A to B shown as below:
P:A→B
The problem is that we have to travel from A to B. There are various ways to go from A to
B: on foot, by car, bike, train, aeroplane, etc. There is a constraint on the journey: we
have to complete it within 12 hrs. Only by train or aeroplane can the distance be covered
within 12 hrs. So there are many solutions to this problem, but only two solutions satisfy the
constraint.
If we also say that we have to cover the journey at the minimum cost, meaning we have to travel this
distance as cheaply as possible, then this problem becomes a minimization problem. So far, we have two
feasible solutions, one by train and one by air. Since travelling by train leads to the minimum
cost, it is the optimal solution. An optimal solution is also a feasible solution, but it is the one providing the best result,
i.e., the minimum cost. There is only one optimal solution.
A problem that requires either a minimum or a maximum result is known as an optimization
problem. The greedy method is one of the strategies used for solving optimization problems.
We have to travel from the source to the destination at the minimum cost. Suppose we have three feasible
solutions with path costs 10, 20, and 5. Since 5 is the minimum-cost path, it is the optimal solution. This is
the local optimum, and in this way we find the local optimum at each stage in order to build up the globally
optimal solution.
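A small Python sketch of this greedy selection for the A to B journey; the costs and durations below are illustrative assumptions (only the 12-hour constraint and the conclusion that the train is cheapest come from the discussion above).
# Greedy choice for the A -> B journey: among the feasible options
# (duration <= 12 hrs), pick the one with minimum cost.
options = {              # mode: (duration_hrs, cost) -- illustrative values
    'walk':  (200, 0),
    'car':   (30, 25),
    'bike':  (60, 15),
    'train': (10, 10),
    'plane': (3, 20),
}
feasible = {m: dc for m, dc in options.items() if dc[0] <= 12}   # train and plane
optimal = min(feasible, key=lambda m: feasible[m][1])            # minimum-cost feasible mode
print(sorted(feasible), optimal)                                 # -> ['plane', 'train'] train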
Continuous attributes
What are Continuous Variables?
Simply put, if a variable can take any value between its minimum and maximum value, then it is called a
continuous variable. By nature, a lot of things we deal with fall in this category: age, weight, height being
some of them.
Just to make sure the difference is clear, let me ask you to classify whether a variable is continuous or
categorical:
1. Gender of a person
2. Number of siblings of a Person
3. Time on which a laptop runs on battery
Normalization:
In simpler words, it is the process of comparing variables on a 'neutral' or 'standard' scale. It helps to obtain
the same range of values for different variables. Normally distributed data is easy to read and interpret: in normally
distributed data, 99.7% of the observations lie within 3 standard deviations of the mean, and after standardization the mean is
zero and the standard deviation is one. Normalization is commonly used in algorithms such as k-means
clustering.
A commonly used normalization method is the z-score. The z-score of an observation is the number of standard
deviations it falls above or below the mean. Its formula is: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
Randy scored 76 on a maths test. Katie scored 86 on a science test. The maths test has (mean = 70, sd = 2). The science test
has (mean = 80, sd = 3).
z(Randy) = (76 – 70)/2 = 3
z(Katie) = (86 – 80)/3 = 2
So, relative to their respective classes, Randy's score is the more exceptional one.
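The same computation in Python, by hand and with scikit-learn's StandardScaler; the list of maths scores passed to the scaler is an illustrative assumption.
import numpy as np
from sklearn.preprocessing import StandardScaler

# z = (x - mean) / sd, using the test statistics from the example
z_randy = (76 - 70) / 2      # 3.0
z_katie = (86 - 80) / 3      # 2.0
print(z_randy, z_katie)

# StandardScaler estimates the mean and sd from the data itself
scores = np.array([[70.0], [72.0], [76.0], [68.0], [74.0]])   # illustrative maths scores
print(StandardScaler().fit_transform(scores).ravel())         # z-score of each score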
There are various types of transformation methods, such as log, sqrt, exp, Box-Cox, power, etc. The most
commonly used method is the log transformation.
When there are many (possibly correlated) variables, we can use PCA, a.k.a. Principal Component Analysis. It amounts to finding a
few 'principal' components that explain a significant amount of the variation in the data. Using this
technique, a large number of variables is reduced to a few significant ones. This helps to reduce
noise and redundancy and enables quick computations.
Factor Analysis:
Factor Analysis was invented by Charles Spearman (1904). This is a variable reduction technique. It is used
to determine factor structure or model. It also explains the maximum amount of variance in the model. Let’s
say some variables are highly correlated. These variables can be grouped by their correlations i.e., all variables
in a particular group can be highly correlated among themselves but have low correlation with variables of
other group(s). Here each group represents a single underlying construct or factor. Factor analysis is of two
types:
1. EFA (Exploratory Factor Analysis) – It identifies and summarizes the underlying correlation structure
in a data set
2. CFA (Confirmatory Factor Analysis) – It attempts to confirm hypotheses using the correlation structure
and rates the 'goodness of fit'.
About pruning
Pruning is the process of eliminating parts of a model, such as weight connections in a neural network or branches in a
decision tree, to speed up inference and reduce model storage size. Decision trees and neural networks are, in general,
overparameterized; pruning entails deleting unneeded parameters from an overly parameterized model.
Pruning mostly serves as an architectural search inside the tree or network. In fact, because pruning functions
as a regularizer, a model will often generalise slightly better at low levels of sparsity, and the pruned model will
match the baseline at moderate levels. If you push it too far, the model will start to generalise worse than the
baseline, although it becomes smaller and faster.
Pre-pruning
In pre-pruning, the growth of the decision tree is halted early. The major disadvantage of pre-pruning is its narrow
viewing field: the tree's current expansion may not meet the splitting standard even though a later expansion might,
and in that situation the decision tree's development is stopped prematurely.
Post-pruning
The decision tree generation is divided into two steps by post-pruning. The first step is the tree-building
process, with the termination condition that the fraction of a certain class in the node reaches 100%, and the
second phase is pruning the tree structure gained in the first phase.
Post-pruning techniques circumvent the problem of a narrow viewing field. As a result, post-pruning
procedures are often more accurate than pre-pruning methods and are therefore more widely used. The
pruning procedure turns a node into a leaf node labelled with the most common class in the subset
associated with that node, the same as in pre-pruning.
Pruning methods
The goal of pruning is to remove sections of a classification model that explain random variation in the training
sample rather than actual domain characteristics. This makes the model more understandable to the user and,
often, more accurate on fresh data that was not used to train the classifier. Pruning therefore requires an effective
approach for differentiating the parts of a classifier that are attributable to random effects from the parts that describe
significant structure. The methods listed below are used in both strategies (pre-pruning and post-pruning).
Minimum Error Pruning (MEP)
This method is a bottom-up strategy that seeks a single tree with the lowest "anticipated error rate on an
independent data set." This does not imply the use of a separate pruning set, but rather that the developer wants
to estimate the error rate for unseen cases. Indeed, both the original and the enhanced versions
exploit only information from the training set.
In the presence of noisy data, Laplace probability estimation is employed to improve the performance of ID3.
Later, a Bayesian technique was employed to enhance this procedure; the approach is known as m-probability
estimation. There were two modifications:
Prior probabilities are used in the estimation rather than assuming a uniform starting distribution of classes.
Several trees with differing degrees of pruning may be generated by adjusting the value of the
parameter m. The degree of pruning is now decided by the parameter rather than by the number of classes.
Furthermore, factors like the degree of noise in the training data may be accounted for on the basis of domain
expertise or the complexity of the problem.
In the minimum error pruning approach, the predicted error rate for each internal node is estimated and is
referred to as the static error. The anticipated error rate of the branch rooted at the node is then estimated as a weighted
sum of the expected error rates of the node's children, where each weight represents the probability that an
observation in the node reaches the corresponding child.
Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It
can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the
performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and
based on the majority votes of predictions, and it predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
Disadvantages of Random Forest
o Although random forest can be used for both classification and regression tasks, it is not as well suited
to regression tasks.
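The fitting step that precedes the prediction below is not reproduced here; a sketch consistent with scikit-learn and the same user_data pre-processing (the n_estimators value is an illustrative choice) would be:
#Fitting Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)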
3. Predicting the Test Set result
Since our model is fitted to the training set, we can now predict the test result. For prediction, we will create
a new prediction vector y_pred. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
Output:
The prediction vector is given as:
By checking the above prediction vector and test set real vector, we can determine the incorrect predictions
done by the classifier.
Output: As we can see in the above matrix, there are 4+4= 8 incorrect predictions and 64+28= 92
correct predictions.
The visualization of the test set result follows the same pattern as in the earlier models:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Random Forest Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Ensemble Learning
Ensemble learning helps improve machine learning results by combining several models. This approach
allows for better predictive performance compared to a single model. The basic idea is to learn a
set of classifiers (experts) and to allow them to vote, as in the sketch below.
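A minimal sketch of this vote-of-experts idea with scikit-learn's VotingClassifier; the three base models are illustrative choices, and x_train, y_train, x_test are assumed to come from the usual pre-processing step.
#Hard-voting ensemble: each base classifier votes, the majority class wins
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('knn', KNeighborsClassifier(n_neighbors=5)),
                ('dt', DecisionTreeClassifier(random_state=0))],
    voting='hard')
ensemble.fit(x_train, y_train)
print(ensemble.predict(x_test[:5]))   # majority vote of the three models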
Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a decision tree classifier
and is generated using a random selection of attributes at each node to determine the split. During
classification, each tree votes and the most popular class is returned.
Boosting: AdaBoost algorithm
AdaBoost was the first really successful boosting algorithm developed for the purpose of binary
classification. AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that
combines multiple “weak classifiers” into a single “strong classifier”. It was formulated by Yoav Freund
and Robert Schapire. They also won the 2003 Gödel Prize for their work.
Algorithm:
1. Initialise the dataset and assign equal weight to each of the data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. If the required results are obtained,
go to step 5
else
go to step 2
5. End
Explanation:
The above diagram explains the AdaBoost algorithm in a very simple way. Let’s try to understand it in a
stepwise process:
B1 consists of 10 data points of two types, plus(+) and minus(-); 5 of
them are plus(+) and the other 5 are minus(-), and each has initially been assigned an equal
weight. The first model tries to classify the data points and generates a vertical separator line,
but it wrongly classifies 3 plus(+) points as minus(-).
B2 consists of the 10 data points from the previous model in which the 3 wrongly classified
plus(+) are weighted more so that the current model tries more to classify these pluses(+)
correctly. This model generates a vertical separator line that correctly classifies the previously
wrongly classified pluses(+) but in this attempt, it wrongly classifies three minuses(-).
B3 consists of the 10 data points from the previous model in which the 3 wrongly classified
minus(-) are weighted more so that the current model tries more to classify these minuses(-)
correctly. This model generates a horizontal separator line that correctly classifies the previously
wrongly classified minuses(-).
B4 combines together B1, B2, and B3 in order to build a strong prediction model which is much
better than any individual model used.
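In practice the loop above is available off the shelf; a hedged sketch using scikit-learn's AdaBoostClassifier (whose default weak classifier is a depth-1 decision tree, a "decision stump"), assuming the same x_train/y_train/x_test as before:
#AdaBoost: combine many weak classifiers into a single strong classifier
from sklearn.ensemble import AdaBoostClassifier

classifier = AdaBoostClassifier(n_estimators=50,    # number of boosting rounds (weak learners)
                                learning_rate=1.0,  # how strongly each round re-weights errors
                                random_state=0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)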
As an illustration of weighted voting in an ensemble, suppose the members' predictions are weighted by the
values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these signed predictions could then result in an
output of -0.8, which would correspond to an ensemble prediction of -1.0, i.e., the second class.
Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we
see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether
it is a cat or a dog, such a model can be created using the SVM algorithm. We will first train our model
with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we
test it with this strange creature. Because the support vector machine creates a decision boundary between these two classes (cat
and dog) and chooses the extreme cases (the support vectors), it will look at the extreme cases of cats and dogs. On the basis
of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be
classified into two classes by using a single straight line, then such data is termed linearly separable
data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset
cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier
used is called a Non-linear SVM classifier.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane
are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Since this is a 2-d space, by just using a straight line we can easily separate these two classes. But there can be
multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is
called a hyperplane. The SVM algorithm finds the closest points of the two classes to this line; these points
are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the
goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot
draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used two
dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-d space, the separating boundary looks like a plane parallel to the x-y plane. If we convert it back to 2-d space
with z = 1, the boundary becomes a circle of radius 1 around the origin.
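The effect of the added dimension can be checked with a small sketch: points near the origin get a small z value and points far from it get a large one, so a plane at a constant z (here z = 1, an illustrative threshold on made-up sample points) separates the two classes.
import numpy as np

# Illustrative 2-d points: class 0 near the origin, class 1 further away
x = np.array([[0.2, 0.3], [-0.4, 0.1], [1.5, 1.0], [-1.2, -1.4]])
y = np.array([0, 0, 1, 1])

z = x[:, 0] ** 2 + x[:, 1] ** 2           # third dimension z = x^2 + y^2
print(z)                                  # small for class 0, large for class 1
print(((z > 1).astype(int) == y).all())   # the plane z = 1 separates the classes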
Data Pre-processing step
Till the Data pre-processing step, the code will remain the same. Below is the code:
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
After executing the above code, we will pre-process the data. The code will give the dataset as:
Fitting the SVM classifier to the training set:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will
import SVC class from Sklearn.svm library. Below is the code for it:
from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)
In the above code, we have used kernel='linear', as here we are creating an SVM for linearly separable data.
However, we can change it for non-linear data. We then fitted the classifier to the training data (x_train,
y_train).
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the value of C(Regularization factor), gamma, and
kernel.
Predicting the test set result:
Now, we will predict the output for test set. For this, we will create a new vector y_pred. Below is the code
for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
After getting the y_pred vector, we can compare the result of y_pred and y_test to check the difference
between the actual value and predicted value.
Output: Below is the output for the prediction of the test set:
Output:
As we can see in the above output image, there are 66+24 = 90 correct predictions and 8+2 = 10 incorrect
predictions. Therefore, we can say that our SVM model improved compared to the Logistic Regression
model.
Visualizing the training set result:
Now we will visualize the training set result, below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
As we can see, the above output appears similar to the Logistic Regression output. In the output, we got
a straight line as the hyperplane because we used a linear kernel in the classifier. We also
discussed above that for a 2-d space, the hyperplane in SVM is a straight line.
Visualizing the test set result:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01),
nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
alpha = 0.75, cmap = ListedColormap(('red','green' )))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
By executing the above code, we will get the output as:
As we can see in the above output image, the SVM classifier has divided the users into two regions (Purchased
or Not purchased). Users who purchased the SUV are in the red region with the red scatter points. And users
who did not purchase the SUV are in the green region with green scatter points. The hyperplane has divided
the users into the two classes: Purchased and Not Purchased.
Large Margin Intuition
SVM Decision Boundary
Consider a case where we set the constant C to a very large value. When minimizing the optimization objective,
we are then highly motivated to choose parameter values such that the first term equals 0. So what would it
take to make this first term equal to 0?
SVM Decision Boundary
We can rewrite the optimization objective of SVM as follows:
\min_{\theta} \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad p^{(i)} \cdot \lVert \theta \rVert \ge 1 \text{ if } y^{(i)} = 1, \qquad p^{(i)} \cdot \lVert \theta \rVert \le -1 \text{ if } y^{(i)} = 0
where p^{(i)} is the projection of x^{(i)} onto the vector θ.
Simplification: θ0 = 0.
According to the illustration below, to keep the magnitude of θ minimal, the absolute value of p^{(i)} must be
as large as possible (hence the large margin).
In logistic regression, we take the output of the linear function and squash it into the range [0, 1]
using the sigmoid function. If the squashed value is greater than a threshold value (0.5), we assign it the label 1;
else we assign it the label 0. In SVM, we take the output of the linear function directly: if that output is greater than
1, we identify the example with one class, and if the output is less than -1, we identify it with the other class. Since the threshold
values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values ([-1, 1]) which acts as the
margin.
Loss Function
In machine learning, the loss function measures the difference between the actual output and the
output predicted by the model for a single training example, while the average of the loss over all
the training examples is termed the cost function. The difference computed by the loss function (such
as a regression loss, binary classification loss, or multiclass classification loss) is termed the error
value; the larger the gap between the predicted and actual values, the larger the error.
For classification losses it is often mainly a question of whether the value predicted by the model is right or
wrong, rather than the exact amount of deviation. Loss functions differ depending on the problem statement
to which machine learning is being applied. The term cost function is often used interchangeably with loss
function, but it holds a slightly different meaning: a loss function is defined for a single training example, while
a cost function is the average loss over the complete training dataset.
2. Binary Classification Loss Functions
These loss functions measure the performance of a classification model in which each data point is
assigned one of two labels, 0 or 1. They can be further classified as:
Binary Cross-Entropy
It is the default loss function for binary classification problems. Cross-entropy loss measures the performance
of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss
increases as the predicted probability deviates from the actual label.
Hinge loss
Hinge loss can be used as an alternative to cross-entropy; it was originally developed for use with the support
vector machine algorithm. Hinge loss works best for classification problems in which the target values are in
the set {-1, 1}. It assigns more error when there is a difference in sign between the actual and predicted
values, sometimes resulting in better performance than cross-entropy.
Multi-class Cross-Entropy
In this case, the target values are in the set {0, 1, 2, …, n}. It calculates a score based on the average
difference between the actual and predicted probability values, and the score is minimized to reach the best
possible accuracy. Multi-class cross-entropy is a default loss function for multi-class problems such as text classification.
Hinge Loss
The hinge loss is a specific type of cost function that incorporates a margin or distance from the
classification boundary into the cost calculation. Even if new observations are classified correctly, they
can incur a penalty if the margin from the decision boundary is not large enough. The hinge loss
increases linearly.
The hinge loss is mostly associated with soft-margin support vector machines.
If you are familiar with the construction of hyperplanes and their margins in support vector machines, you
probably know that margins are often defined as having a distance equal to 1 from the data-separating
hyperplane. We want data points not only to fall on the correct side of the
hyperplane but also to be located beyond the margin.
Support vector machines address a classification problem where observations have an outcome of either +1
or -1. The support vector machine produces a real-valued output that is negative or positive depending on
which side of the decision boundary it falls on. Only if an observation is classified correctly and its distance from
the plane is larger than the margin will it incur no penalty. The distance from the hyperplane can be regarded
as a measure of confidence: the further an observation lies from the plane, the more confident the
classification.
For example, if an observation was associated with an actual outcome of +1, and the SVM produced an output
of 1.5, the loss would equal 0.
Contrary to methods like linear regression, where we try to find a line that minimizes the distance from the
data points, an SVM tries to maximize the distance (the margin). Comparing the two approaches nicely illustrates
the difference between the nature of regression and classification problems.
An observation that is located directly on the boundary incurs a loss of 1, regardless of whether the real
outcome was +1 or -1.
Observations that fall on the correct side of the decision boundary (hyperplane) but are within the margin incur
a cost between 0 and 1. For example, if the actual outcome was 1 and the classifier predicted 0.5, the
corresponding loss would be 0.5 even though the classification is correct.
All observations that end up on the wrong side of the hyperplane incur a loss that is greater than 1 and
increases linearly with the distance from the margin.
Now that we have a strong intuitive understanding of the hinge loss, understanding the math will be a breeze.
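Concretely, for a true label y in {-1, +1} and a raw classifier output t, the hinge loss is the standard expression below; the plugged-in numbers simply restate the examples discussed above.
\ell(y, t) = \max(0,\ 1 - y \cdot t)
\ell(+1, 1.5) = \max(0, 1 - 1.5) = 0        (correct and beyond the margin: no penalty)
\ell(+1, 0) = \max(0, 1 - 0) = 1            (directly on the decision boundary)
\ell(+1, 0.5) = \max(0, 1 - 0.5) = 0.5      (correct side, but inside the margin)
\ell(+1, -0.5) = \max(0, 1 + 0.5) = 1.5     (wrong side of the hyperplane: loss greater than 1)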
In the simplest case, the blue and red data points are linearly separable, allowing for a hard margin classifier.
If the data is not linearly separable, hard margin classification is not applicable.
Even though support vector machines are linear classifiers, they are still able to separate data points that are
not linearly separable by applying the kernel trick.
Furthermore, if the margin of the SVM is very small, the model is more likely to overfit. In these cases, we
can choose to cut the model some slack by allowing for misclassifications. We call this a soft margin support
vector machine. But if the model produces too many misclassifications, its utility declines. Therefore, we need
to penalize the misclassified samples by introducing a cost function.
In summary, the soft margin support vector machine requires a cost function while the hard margin SVM does
not.
SVM Cost
As established earlier, the optimization objective of the support vector classifier
is to minimize the norm of w, the vector orthogonal to the data-separating hyperplane onto which we
project our data points:
\min_{w} \frac{1}{2} \sum_{i=1}^{n} w_i^2
This minimization problem represents the primal form of the hard margin SVM, which doesn’t account for
classification errors.
For the soft-margin SVM, we combine the minimization objective with a loss function such as the hinge loss.
The first term sums over the number of features (n), while the second term sums over the number of samples
in the data (m).
The t variable is the output produced by the model as a product of the weight parameter w and the data input
x.
t_j = w^T x_j
The loss term has a regularizing effect on the model. But how can we control the regularization, that is,
how aggressively the model should try to avoid misclassifications? To manually control the
number of misclassifications tolerated during training, we introduce an additional parameter, C, which we multiply with
the loss term.
\min_{w} \frac{1}{2} \sum_{i=1}^{n} w_i^2 + C \sum_{j=1}^{m} \max(0, 1 - t_j \cdot y_j)
The smaller C is, the stronger the regularization. Accordingly, the model will attempt to maximize the margin
and be more tolerant towards misclassifications.
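A short sketch of how this plays out with scikit-learn's SVC, whose C parameter multiplies the hinge-loss term in the same way; the two C values are illustrative, and x_train/y_train are assumed from the earlier pre-processing.
from sklearn.svm import SVC

# Small C -> strong regularization: wider margin, more misclassifications tolerated
# Large C -> weak regularization: narrower margin, fewer training misclassifications
soft_margin = SVC(kernel='linear', C=0.01).fit(x_train, y_train)
hard_ish    = SVC(kernel='linear', C=100.0).fit(x_train, y_train)
print(soft_margin.score(x_train, y_train), hard_ish.score(x_train, y_train))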
SVM Kernels
A kernel function is a method used to take data as input and transform it into the required form for
processing. The word "kernel" is used because of the set of mathematical functions used in the Support Vector Machine
that provide a window to manipulate the data. A kernel function generally transforms the training
data so that a non-linear decision surface becomes a linear decision surface in a higher-dimensional
space. Essentially, it returns the inner product between two points in a suitable feature
space.
Standard kernel function equation: K(x, y) = ⟨φ(x), φ(y)⟩, where φ maps the inputs into the higher-dimensional feature space.
Gaussian Kernel: It is used to perform transformation when there is no prior knowledge about data.
Gaussian Kernel Radial Basis Function (RBF): Same as above kernel function, adding radial
basis method to improve the transformation.
Sigmoid Kernel: this function is equivalent to a two-layer perceptron model of a neural network,
where it is used as an activation function for artificial neurons.
Polynomial Kernel: It represents the similarity of vectors in the training set of data in a feature space
over polynomials of the original variables used in the kernel.
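A hedged sketch comparing these kernels with scikit-learn's SVC on the same training data; the gamma, degree and coef0 values are illustrative defaults, and x_train, y_train, x_test, y_test are assumed from the pre-processing step shown earlier.
from sklearn.svm import SVC

kernels = {
    'rbf':     SVC(kernel='rbf', gamma='scale'),                 # Gaussian RBF: exp(-gamma * ||x - x'||^2)
    'sigmoid': SVC(kernel='sigmoid', gamma='scale', coef0=0.0),  # tanh(gamma * <x, x'> + coef0)
    'poly':    SVC(kernel='poly', degree=3, coef0=1.0),          # (gamma * <x, x'> + coef0)^degree
}
for name, model in kernels.items():
    model.fit(x_train, y_train)
    print(name, model.score(x_test, y_test))   # test accuracy for each kernel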