Classification (Part II)
• Classification and clustering are two methods of pattern identification with some similarities and dissimilarities.
For example, in a banking application, a customer who applies for a loan may be classified as a safe or a risky applicant.
Clustering
• Clustering is a technique of organising a group of data into classes and clusters such that the objects inside a cluster have high similarity to one another, while the objects of two different clusters are dissimilar to each other.
• In clustering, the similarity as well as the distance between two objects is measured by a similarity function. Generally, intra-cluster distances are smaller than inter-cluster distances.
Classification Vs Clustering
Parameter | Classification | Clustering
Type | Used for supervised learning | Used for unsupervised learning
Basic | Process of classifying the input instances based on their corresponding class labels | Process of grouping the instances based on their similarity, without the help of class labels
Need | Has labels, so a training and a testing dataset are needed to build and verify the model | No need for training and testing datasets
Complexity | More complex | Less complex
Example Algorithms | Logistic regression, Naive Bayes classifier, Support Vector Machines, KNN, Decision Tree, etc. | k-means clustering, fuzzy c-means clustering, Gaussian (EM) clustering, etc.
Application | Detection of unsolicited email; recognition of faces; approval of a bank loan | Investigation of social networks; segmentation of an image; recommendation engines
Supervised vs. Unsupervised Learning
Supervised and unsupervised learning are the two main techniques of ML. They are used in different scenarios and with different kinds of datasets.
Supervised vs. Unsupervised Learning
Supervised learning (classification) –
In supervised learning, models are trained using labelled data.
The model needs to find the mapping function that maps the input variable (X) to the output variable (Y):
Y = f(X)
Supervised learning needs supervision to train the model, similar to how a student learns in the presence of a teacher.
Supervised learning can be used for two types of problems: Classification and Regression.
• Find a model for the class attribute as a function of the values of the other
attributes
• Use the training set to construct the model and the test set to estimate the accuracy of the model
Classification: Test data are used to estimate the accuracy of the classification rules. If the
accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
Issues regarding classification and prediction:
(1) Data Preparation
Issues regarding classification and prediction:
(2) Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability
time to construct the model
time to use the model
• Robustness
handling noise and missing values
• Scalability
efficiency in disk-resident databases
• Interpretability
understanding and insight provided by the model
• Goodness of rules
decision tree size
compactness of classification rules
Classification Techniques
• Decision Trees
• Bayesian Classifiers
• Neural Networks
• K-Nearest Neighbour
• Support Vector Machines
• Linear Regression
• Logistic Regression
• Fuzzy Algorithm
• Genetic Algorithm
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
– The topmost node in the tree is the root node.
• Decision tree generation consists of two phases
–Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
–Tree pruning
Identify and remove branches that reflect noise or outliers
– Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Decision Tree for Cheat or not Cheat
Decision Tree for Playing Tennis
Play-Tennis Data
DAY OUTLOOK TEMP HUMIDITY WIND PLAY
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
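To make the construction phase concrete, here is a minimal sketch (not part of the original slides, assuming pandas and scikit-learn are available) that fits a decision tree to the Play-Tennis table above; criterion="entropy" mirrors ID3-style information gain.

```python
# A minimal sketch: fit a decision tree to the Play-Tennis data with scikit-learn.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":  ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
                 "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"],
    "Temp":     ["Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                 "Mild","Cool","Mild","Mild","Mild","Hot","Mild"],
    "Humidity": ["High","High","High","High","Normal","Normal","Normal",
                 "High","Normal","Normal","Normal","High","Normal","High"],
    "Wind":     ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                 "Weak","Weak","Weak","Strong","Strong","Weak","Strong"],
    "Play":     ["No","No","Yes","Yes","Yes","No","Yes",
                 "No","Yes","Yes","Yes","Yes","Yes","No"],
})

X = pd.get_dummies(data.drop(columns="Play"))   # one-hot encode the categorical attributes
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # text view of the learned tree
```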
Decision Tree for Conjunction
Decision Tree for Disjunction
Decision Tree for XOR
Decision Tree for Disjunction of Conjunctions
Three Possible Partition Schemes
Attribute Selection Measure
• An attribute selection measure finds the best attribute at each step, so as to give the smallest decision tree. It provides a splitting rule that determines how the tuples at a given node can best be split.
• On the basis of the sample/training dataset, the attribute selection measure provides a ranking for each attribute.
• The attribute with the best rank/score for the measure is chosen as the splitting attribute for the given tuples.
• Three popular attribute selection measure:
• Information Gain
• Gain Ratio
• Gini Index
Information Gain Method
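For reference, the standard (ID3) definitions behind this measure, which the slides apply but do not restate here, are:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i), \qquad Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, Info(D_j), \qquad Gain(A) = Info(D) - Info_A(D)$$

where p_i is the proportion of tuples in D belonging to class C_i, and D_1, ..., D_v are the partitions of D produced by splitting on attribute A. The attribute with the highest Gain is chosen.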
Age has the highest information gain among all the attributes, and hence is selected as the splitting attribute.
Level 0 Decision Tree using 1st Splitting Attribute
Age
• The tuples falling on the branch age = middle_aged all belong to the same class "YES", and hence a leaf can be created for that branch.
Example:
• If (Age == Youth) AND (Student == No) then buys_computer=”No”
• If (Age == Youth) AND (Student == Yes) then buys_computer=”Yes”
• If (Age == middle_aged) then buys_computer=”Yes”
• If( Age == senior) AND credit_rating == “Excellent” then buys_computer=”No”
• If(Age == senior) AND credit_rating == “fair” then buys_computer=”Yes”
Drawback of Information gain as attribute selection measure
• The information gain measure is biased towards tests with many outcomes, i.e. it prefers to select attributes having a large number of distinct values.
• For instance, for this dataset information gain prefers to split on the attribute "income" of the table, which would result in a large number of pure partitions, each containing only one tuple.
Info_income(D) = 0 [variation = 0]
=> Gain_income(D) = Info(D) − Info_income(D) = Info(D)
• Hence, information gain w.r.t. income is maximal.
Solution:
C4.5, a successor of ID3
Gain Ratio
• C4.5, a successor of ID3, uses an extension to information gain known as gain ratio,
which attempts to overcome this bias.
• It attempts a kind of normalization to information gain using a “split information” value
defined analogously with Info(D) as follows.
• If there are “v” partitions (outcomes) of attribute A, then SplitInfo measures, for each outcome, the number of tuples having that outcome relative to the total number of tuples.
• To determine the best splitting attribute, the criterion “GainRatio” is used instead of “Gain”, and it in turn uses “SplitInfo”.
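For reference, the standard C4.5 definitions (not restated on the slides) are:

$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right), \qquad GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}$$

The attribute with the maximum gain ratio is selected as the splitting attribute.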
A prior probability is the probability that an observation will fall into a group before the data are collected.
Example:
1. John conducts a single coin toss. What is the a priori probability of landing a head?
A priori probability of landing a head = 1 / 2 = 50%.
P(A|B) = P(B|A) · P(A) / P(B)
Where, A, B = Events
P(B|A):- The probability of B occurring given that A has happened
P(A) & P(B):- The prior probabilities of event A & event B
Probability Basics: Example Calculating Posterior Probability
Probability Basics: Posterior Probability contd...
For an intuitive understanding of this probability, suppose the following grid represents this
forest with 100 trees. Exactly 20 of the trees are Oak trees (20%) and 18 (90%) of them are
healthy. The other 80 (80%) trees are Maple and 40 (50%) of them are healthy.
(O = Oak, M = Maple, Green = Healthy, Red = Unhealthy)
Out of all 100 trees, exactly 58 are healthy, and 18 of these healthy ones are Oak trees. Thus, if a healthy tree is selected, the probability that it is an Oak tree is 18/58 = 0.3103.
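The same number follows directly from the posterior-probability formula (Bayes' theorem, introduced next):

$$P(\text{Oak} \mid \text{Healthy}) = \frac{P(\text{Healthy} \mid \text{Oak})\,P(\text{Oak})}{P(\text{Healthy})} = \frac{0.9 \times 0.2}{0.9 \times 0.2 + 0.5 \times 0.8} = \frac{0.18}{0.58} \approx 0.3103$$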
Bayes’ Theorem [useful in finding posterior probability]
An Example of Bayes’ Theorem
Determine:-
Normalization Constant
An Example of Bayes’ Theorem contd...
Naive Bayes Classifier
Naive Bayes Classifier
Hence, the naive Bayesian classifier predicts buys_computer = yes for the given tuple X
Problem
For the given training dataset, let the new customer X have the following attribute values:
X = (Outlook = Rainy, Temp = Hot, Humidity = High, Windy = False)
Find the probability that Golf will be played or not, i.e., determine which class the query X belongs to.
Partial Solution
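Not the slides' worked solution, but a minimal Python sketch of the same calculation, under the assumption that the training data is the Play-Tennis/Golf table shown earlier (Outlook "Rain" read as "Rainy", Wind "Strong" read as Windy = True). Laplace (add-one) smoothing is used so that unseen attribute-class combinations do not produce zero probabilities.

```python
# A minimal sketch: naive Bayes estimate of P(class | X) for
# X = (Rainy, Hot, High, Windy=False), with Laplace smoothing.
from collections import Counter

rows = [  # (Outlook, Temp, Humidity, Windy, Play)
    ("Sunny", "Hot", "High", False, "No"),      ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),  ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),  ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),  ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),   ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),("Rainy", "Mild", "High", True, "No"),
]

def naive_bayes(rows, x):
    n = len(rows)
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for c, nc in class_counts.items():
        score = nc / n                                        # prior P(c)
        for i, v in enumerate(x):
            match = sum(1 for r in rows if r[-1] == c and r[i] == v)
            n_values = len({r[i] for r in rows})              # distinct values of attribute i
            score *= (match + 1) / (nc + n_values)            # Laplace-smoothed P(v | c)
        scores[c] = score
    return scores

X = ("Rainy", "Hot", "High", False)
scores = naive_bayes(rows, X)
total = sum(scores.values())
for c in scores:
    print(c, round(scores[c] / total, 3))   # normalized posterior estimate for each class
```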
Advantage
• Simple, fast and easy to implement. Needs less training data. Highly scalable.
• Handles continuous and discrete data.
• Easily updatable if training dataset increases
Uses
• Text Classification
• Spam Filtering
• Hybrid Recommender
• Online Application
Disadvantages
• Assumption of class-conditional independence, which can cause a loss of accuracy.
• Practically, dependencies exist among variables.
Example: hospital patients' profiles, family history, disease symptoms, etc.
Major Issues
• Violation of the Independence Assumption: It assumes that the effect of an attribute value on a given class is independent of the values of the other attributes.
Solution: Bayesian belief network classifier allows the representation of
dependencies among subsets of attributes
• Zero Conditional Probability problem: If a given class and feature value never occur
together in the training set then the frequency based probability estimate will be zero. This
is problematic since it will wipe out all information in the other probabilities when they are
multiplied.
Therefore it is often desirable to incorporate a small-sample (Laplace) correction into every probability estimate, for example adding one to each count, so that no estimate is exactly zero; the effect of the correction is negligible when it is applied uniformly.
K-Nearest Neighbor
(Lazy Learners or Learning from your Neighbour)
Eager Learners Vs. Lazy Learners
Eager Learner
• The eager learner approach constructs a generalized classification model as soon as the training tuples are received (well before any new/test tuple arrives), and is ready and eager to classify the new unseen tuple.
• More work when the training tuples are received, less work when making a classification.
• Less expensive, but does not support incremental learning.
• e.g. Decision Tree, Bayesian Classifier, Backpropagation, Association Rule Mining.

Lazy Learner
• The lazy learner approach simply stores the received training tuples and waits until the new/test tuple arrives. Only when it receives the test tuple does it classify, based on the similarity between the test tuple and the stored training tuples.
• Less work (only storing) when the training tuples are received, more work when making a classification.
• More expensive, but supports incremental learning.
• e.g. K-Nearest Neighbor, case-based learning.
K-Nearest Neighbor Classifier
• The problem can be solved by tuning the value of k-neighbours parameter. As we increase
the number of neighbours k, the model starts to generalize well, but increasing the value too
much would again drop the performance.
• Therefore, it’s important to find an optimal value of K, such that the model is able to
classify well on the test data set with minimum error.
• Setting K to the square root of the number of training tuples, or to half of that square root, can lead to better results.
• For an even number of classes (e.g. 2 or 4) it is a good idea to choose an odd value of K to avoid ties; conversely, use an even K when the number of classes is odd.
How to Measure Distance
Majority of {Yes, Yes, No} is Yes. Hence Class of new tuple is Yes
Should all Attributes be weighted equally,
or
Some Attributes have more influence than others?
Attributes with large ranges (like income or loan amount) outweigh attributes with smaller ranges (like binary, categorical, or small numeric attributes such as age), and thus the result obtained can be a biased one.
Normalization is a solution to this: max-min normalization is generally used.
Solution : Normalization
KNN Classifier Algorithm
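As a concrete illustration (a minimal sketch with made-up age/income/loan values, not the slides' own algorithm listing, assuming scikit-learn is available): max-min normalization followed by a K-nearest-neighbour vote.

```python
# A minimal sketch: max-min normalization, then a 3-nearest-neighbour vote.
# The loan data below is hypothetical, for illustration only.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# columns: age, income, loan amount            class: 1 = default, 0 = repaid
X = np.array([[25, 40000,  5000], [35, 60000, 12000], [45, 80000, 30000],
              [52, 30000, 18000], [23, 25000,  4000], [40, 62000, 25000]], dtype=float)
y = np.array([0, 0, 1, 1, 0, 1])

scaler = MinMaxScaler()                       # max-min normalization to [0, 1]
X_norm = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3)     # odd K to avoid ties with 2 classes
knn.fit(X_norm, y)

query = scaler.transform([[30, 55000, 10000]])    # normalize the new tuple the same way
print(knn.predict(query))                         # majority class of the 3 nearest neighbours
```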
Advantage:
• No training period: there is no discriminative function or model built at the training stage
• As there is no training period, any new data can be added seamlessly without impacting
accuracy of the result.
• Easy to implement
• Best used where probability is unknown
• Useful for non-linear data as no assumption is required
Disadvantage:
• Computationally expensive, with higher space complexity
• Output depends on the chosen K value, which can reduce accuracy for some values of K
• Does not work well with large datasets and high dimensionality
• Sensitive to noisy, missing and outlier data
• Needs normalization
Application
• Used in classification and regression
• Used in pattern recognition
• Used in gene expression analysis, protein-protein interaction prediction, and obtaining the 3D structure of proteins
• Used to measure document similarity
Clustering
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups.
It is basically a collection of objects on the basis of similarity and dissimilarity between them.
Following is an example of finding clusters of population based on their income and debt.
Clustering Example
Imagine a number of objects in a basket. Each item has a distinct set of features (size, shape, color, etc.).
Now the task is to group each of the objects in the basket. A natural first question to ask is, ‘on what basis
these objects should be grouped?’ Perhaps size, shape, or color.
Clustering Example cont…
The data points clustered together can be classified into one single group. The clusters can be distinguished,
and can identify that there are 3 clusters.
Clustering Types
Clustering can be divided into two subgroups:
Hard Clustering: each data point either belongs to a cluster completely or not.
Soft Clustering: Instead of putting each data point into a separate cluster, a probability or likelihood of
that data point to be in those clusters is assigned.
Clustering Application
Marketing: It can be used to characterize & discover customer segments for marketing purposes.
Biology: It can be used for classification among different species of plants and animals.
Libraries: It is used in clustering different books on the basis of topics and information.
Insurance: It is used to acknowledge the customers, their policies and identifying the frauds.
City Planning: It is used to make groups of houses and to study their values based on their geographical
locations and other factors present.
Earthquake studies: By learning the earthquake affected areas, the dangerous zones can be
determined.
Healthcare: It can be used in identifying and classifying the cancerous gene.
Search Engine: It is the backbone behind the search engines. Search engines try to group similar
objects in one cluster and the dissimilar objects far from each other.
Education: It can be used to monitor students' academic performance. Based on their scores, the students are grouped into different clusters, where each cluster denotes a different level of performance.
Requirements of Clustering in Data Mining
Scalability − Highly scalable clustering algorithms are needed to deal with large databases.
Ability to deal with different kinds of attributes - Algorithms should be capable to be applied on any
kind of data such as interval-based (numerical) data, categorical, and binary data.
Discovery of clusters with attribute shape - The clustering algorithm should be capable of detecting
clusters of arbitrary shape. They should not be bounded to only distance measures that tend to find
spherical cluster of small sizes.
High dimensionality - The clustering algorithm should not only be able to handle low-dimensional data
but also the high dimensional space.
Ability to deal with noisy data − Dataset contain noisy, missing or erroneous data. Some algorithms are
sensitive to such data and may lead to poor quality clusters.
Interpretability − The clustering results should be interpretable, comprehensible, and usable.
Types of clustering algorithms
Since the task of clustering is subjective, the means that can be used for achieving this goal are plenty. Every
methodology follows a different set of rules for defining the ‘similarity’ among data points. The popular types are:
Connectivity models: These models are based on the notion that the data points closer in data space exhibit more
similarity to each other than the data points lying farther away. These models can follow two approaches. In the
first approach, they start with classifying all data points into separate clusters & then aggregating them as the
distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned
as the distance increases. Also, the choice of distance function is subjective. These models are very easy to interpret
but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithm and its variants.
Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived by the
closeness of a data point to the centroid of the clusters. K-Means clustering algorithm is a popular algorithm that
falls into this category. In these models, the no. of clusters required at the end have to be mentioned beforehand, which
makes it important to have prior knowledge of the dataset.
Types of clustering algorithms cont…
Distribution models: These clustering models are based on the notion of how probable is it that all data points in the
cluster belong to the same distribution. These models often suffer from overfitting. A popular example of these models is
Expectation-maximization algorithm which uses multivariate normal distributions.
Density Models: These models search the data space for areas of varied density of data points in the data space. It
isolates various different density regions and assign the data points within these regions in the same cluster. Popular
examples of density models are DBSCAN and OPTICS.
Summary
Connectivity based: Create a hierarchical decomposition of the set of data using some criterion.
Centroid based: Construct various partitions and then evaluate them by some criterion.
Distribution based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model.
Density based: Based on connectivity and density functions.
Hierarchical Clustering
Let’s say we have the below points and we want to cluster them into groups.
Now, based on the similarity of these clusters, the most similar clusters are combined together, and this process is repeated until only a single cluster is left.
We are essentially building a hierarchy of clusters. That’s why this algorithm is called hierarchical clustering.
Hierarchical Clustering Example
Suppose a faculty wants to divide the students into different groups. The faculty has the marks scored by each student
in an assignment and based on these marks, he/she wants to segment them into groups. There’s no fixed target as to
how many groups to have. Since the faculty does not know what type of students should be assigned to which group, it
cannot be solved as a supervised learning problem. So, hierarchical clustering is applied to segment the students into
different groups. Let’s take a sample of 5 students.
Hierarchical Clustering Example cont…
Proximity Matrix
Roll No 1 2 3 4 5
1 0 3 18 10 25
2 3 0 21 13 28
3 18 21 0 8 7
4 10 13 8 0 15
5 25 28 7 15 0
The diagonal elements of this matrix are always 0, as the distance of a point from itself is always 0. The Euclidean distance formula is used to calculate the rest of the distances. So, the distance between
Point 1 and 2: √((10 − 7)²) = √9 = 3
Point 1 and 3: √((10 − 28)²) = √324 = 18, and so on…
Similarly, all the distances are calculated and the proximity matrix is filled.
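These hand calculations can be checked with a short sketch (not from the slides; it assumes the five marks are 10, 7, 28, 20 and 35, as implied by the matrix above):

```python
# A minimal sketch: recompute the 5x5 proximity matrix with SciPy.
import numpy as np
from scipy.spatial.distance import pdist, squareform

marks = np.array([[10], [7], [28], [20], [35]])           # one mark per student (roll no 1..5)
proximity = squareform(pdist(marks, metric="euclidean"))  # pairwise Euclidean distances
print(proximity.astype(int))                              # matches the matrix shown above
```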
Steps to Perform Hierarchical Clustering
Step 1: First, each point is assigned to an individual cluster. Different colors here represent different clusters. Hence, there are 5 different clusters for the 5 points in the data.
Step 2: Next, look at the smallest distance in the proximity matrix and merge the points with the smallest distance.
Then the proximity matrix is updated.
Roll No 1 2 3 4 5
1 0 3 18 10 25
2 3 0 21 13 28
3 18 21 0 8 7
4 10 13 8 0 15
5 25 28 7 15 0
Here, the smallest distance is 3, and hence points 1 and 2 are merged.
Steps to Perform Hierarchical Clustering cont…
Let’s look at the updated clusters and accordingly update the proximity matrix. Here, we have taken the maximum of
the two marks (7, 10) to replace the marks for this cluster. Instead of the maximum, the minimum value or the average
values can also be considered.
Roll No Mark
(1, 2) 10
3 28
4 20
5 35
Steps to Perform Hierarchical Clustering cont…
Revised Proximity Matrix
Roll No 1, 2 3 4 5
1, 2 0 18 10 25
3 18 0 8 7
4 10 8 0 15
5 25 7 15 0
Step 3: Step 2 is repeated until only a single cluster is left. So, look at the minimum distance in the proximity matrix
and then merge the closest pair of clusters. We will get the merged clusters after repeating these steps:
Steps to Perform Hierarchical Clustering cont…
Types of Hierarchical Clustering
There are mainly two types of hierarchical clustering:
Agglomerative hierarchical clustering
Divisive Hierarchical clustering
Agglomerative Hierarchical Clustering
Each point is assigned to an individual cluster in this technique. Suppose there are 4 data points, so each of these
points would be assigned to a cluster and hence there would be 4 clusters in the beginning.
Then, at each iteration, closest pair of clusters are merged and this step is repeated until only a single cluster is left.
Types of Hierarchical Clustering cont…
Divisive hierarchical clustering works in the opposite way: all points start in a single cluster, which is then split recursively until each point forms its own cluster.
Dendrogram
To get the number of clusters for hierarchical clustering, we make use of the concept called a Dendrogram.
A dendrogram is a tree-like diagram that records the sequences of merges or splits.
Let’s get back to the faculty-student example. Whenever we merge two clusters, a dendrogram records the distance between these clusters and represents it in graph form.
Dendrogram cont…
We started by merging sample 1 and 2 and the distance
between these two samples was 3. Let’s plot this in the
dendrogram.
Here, we can see that we have merged sample 1 and 2. The vertical line represents the distance between these
samples.
Dendrogram cont…
Similarly, we plot all the steps where we merged the clusters and finally, we get a dendrogram like this:
We can clearly visualize the steps of hierarchical clustering. The greater the length of the vertical lines in the dendrogram, the greater the distance between those clusters.
Dendrogram cont…
Now, we can set a threshold distance and draw a horizontal line (Generally, the threshold is set in such a way that it cuts
the tallest vertical line). Let’s set this threshold as 12 and draw a horizontal line:
The number of clusters will be the number of vertical lines intersected by the line drawn at the threshold. In the above example, since the red line intersects 2 vertical lines, we will have 2 clusters. One cluster will have the samples (1, 2, 4) and the other will have the samples (3, 5).
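A minimal SciPy sketch of the dendrogram and the threshold cut (an illustration, not the slides' own code; SciPy's standard linkage criteria differ from the slides' "maximum mark" merging rule, so the merge heights, and possibly the cut result, will not match the hand calculation exactly):

```python
# A minimal sketch: dendrogram for the five students' marks (assumed to be
# 10, 7, 28, 20, 35, as implied by the proximity matrix) using complete linkage.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

marks = np.array([[10], [7], [28], [20], [35]])
Z = linkage(marks, method="complete")            # merge history (heights differ from the slides)

dendrogram(Z, labels=[1, 2, 3, 4, 5])
plt.axhline(y=12, color="red", linestyle="--")   # threshold line at distance 12
plt.show()

print(fcluster(Z, t=12, criterion="distance"))   # one cluster label per student at this cut
```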
Hierarchical Clustering closeness of two clusters
The decision of merging two clusters is taken on the basis of closeness of these clusters. There are multiple
metrics for deciding the closeness of two clusters and primarily are:
Euclidean distance
Squared Euclidean distance
Manhattan distance
Maximum distance
Mahalanobis distance
K- Means Clustering Algorithm
It is an iterative algorithm that divides the unlabelled dataset into k different clusters in such a way that each data point belongs to only one group, whose members have similar properties.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters. A
centroid is a data point (imaginary or real) at the center of a cluster.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and repeats
the process until it does not find the best clusters. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
Determines the best value for k center points or centroids by an iterative process.
Assigns each data point to its closest k-center. The data points which are near a particular k-center form a cluster.
Hence each cluster has data points with some commonalities and is away from the other clusters.
K- Means Algorithm cont…
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
1. Begin
2. Step-1: Select the number K to decide the number of clusters.
3. Step-2: Select K random points or centroids. (These can be points other than those in the input dataset.)
4. Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.
5. Step-4: Calculate the variance and place a new centroid of each cluster.
6. Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.
7. Step-6: If any reassignment occurs, then go to step-4 else go to step-7.
8. Step-7: The model is ready.
9. End
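A minimal scikit-learn sketch of these steps (an illustration with made-up points, not the slides' own example; the library runs steps 2 to 6 internally):

```python
# A minimal sketch: K-means with scikit-learn on a small unlabeled 2-D dataset.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])               # unlabeled (x, y) points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # K chosen in advance
print(kmeans.labels_)            # cluster assigned to each data point
print(kmeans.cluster_centers_)   # final centroids
```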
Working of K-Means Algorithm
Suppose we have two variables x and y. The x-y axis scatter plot of these two variables is given below:
Working of K-Means Algorithm cont…
Let's take the number of clusters as K = 2, i.e., here we will try to group the dataset into two different clusters.
We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the two points shown below as the K points, which are not part of the dataset. Consider the below image:
Working of K-Means Algorithm cont…
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by calculating the distances between the points, and so we will draw a median (perpendicular bisector) between the two centroids. Consider the below image:
Working of K-Means Algorithm cont…
From the image, it is clear that the points on the left side of the line are nearer to K1, the blue centroid, and the points to the right of the line are closer to the orange centroid, K2. Let's color them blue and yellow for clear visualization.
Working of K-Means Algorithm cont…
As we need to find the closest clusters, we will repeat the process by choosing new centroids. The new centroid of each cluster is computed as the center of gravity of the points currently assigned to it, giving the new centroids shown below:
Working of K-Means Algorithm cont…
Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of finding
a median line. The median will be like below image:
From the above image, we can see, one yellow point is on the left side of the line, and two blue points are right to
the line. So, these three points will be assigned to new centroids.
Working of K-Means Algorithm cont…
As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or K-
points.
Working of K-Means Algorithm cont…
We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:
Working of K-Means Algorithm cont…
As we got the new centroids so again will draw the median line and reassign the data points. So, the image
will be:
Working of K-Means Algorithm cont…
We can see in the previous image that there are no data points on the wrong side of the line, which means our model is formed. Consider the below image:
Working of K-Means Algorithm cont…
As our model is ready, so we can now remove the assumed centroids, and the two final clusters will be as
shown in the below image:
K-means Algorithm
• Initialization
• Arbitrarily choose k objects as the initial cluster
centers (centroids)
• Iteration until no change
• For each object Oi
• Calculate the distances between Oi and the k
centroids
• (Re)assign Oi to the cluster whose centroid is the
closest to Oi
• Update the cluster centroids based on current
assignment
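The loop above can also be written out directly; the following is a minimal NumPy sketch (not the slides' own code) of the initialization and the assign/update iteration:

```python
# A minimal sketch of the K-means loop: arbitrary initial centroids (chosen from
# the objects), then repeated (re)assignment and centroid updates until no change.
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialization
    for _ in range(max_iter):
        # distance between every object Oi and each of the k centroids,
        # then (re)assign Oi to the cluster whose centroid is closest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update the cluster centroids based on the current assignment
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # no change: stop
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [6.0, 6.0], [6.2, 5.8], [5.9, 6.3]])             # two obvious groups
labels, centroids = kmeans(X, k=2)
print(labels)       # cluster index per object
print(centroids)    # final cluster centers, near (1, 1) and (6, 6)
```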
Illustrating K-Means
(Figures: K-means example, steps 1 through 5.)
K-Means Clustering Example cont..
(Calculate the distances of all data points from the 3 centroids.)
• Compare the 3 distances of each individual data point from the corresponding centroids and find the minimum distance.
• The data point will belong to the cluster whose centroid it is closest to.
Clusters after iteration 1:
G-1 = {1, 2, 4}   G-2 = {7, 10}   G-3 = {3, 5, 6, 8, 9}
K-Means Clustering Example cont..
Iteration-2 — New clusters:
G-1 = {1, 2, 4}   G-2 = {7, 9, 10}   G-3 = {3, 5, 6, 8}
K-Means Clustering Example cont..
Iteration-3 — New clusters:
G-1 = {1, 2, 4}   G-2 = {5, 7, 9, 10}   G-3 = {3, 6, 8}
K-Means Clustering Example cont..
Iteration-4 — New clusters:
G-1 = {1, 2}   G-2 = {5, 7, 9, 10}   G-3 = {3, 4, 6, 8}
K-Means Clustering Example cont..
Iteration-5 — New clusters:
G-1 = {1, 2}   G-2 = {5, 7, 9, 10}   G-3 = {3, 4, 6, 8}
Since the clusters did not change from iteration 4, the algorithm terminates.
Example:
• Apply K-means clustering to the following dataset for two clusters. Tabulate all the assignments.
Step-2: New Centroid
Step-3: Update the cluster centroid
Step-4: Similarly process for next data set
Step-7: Update the cluster centroid
(Figure: scatter plot of the dataset; x-axis approximately 165–190, y-axis 0–90.)
Support Vector Machines
Support Vector Machine (SVM) is an algorithm which can be used for both classification and regression challenges.
However, it is mostly used in two-group classification problems. In the SVM algorithm, each data item is plotted as a
point in n-dimensional space (where n is number of features) with the value of each feature being the value of a
particular coordinate. Then, classification is performed by finding the hyperplane that differentiates the two classes.
What is hyperplane?
A hyperplane is a generalization of a plane.
In one dimension, it is called a point.
In two dimensions, it is a line.
In three dimensions, it is a plane.
In more dimensions, it is called a hyperplane.
The following figure represents data points in one dimension, where the point L is a separating hyperplane.
Support Vector Machine
A support vector machine takes these data points and outputs the hyperplane (which in two dimensions is simply a line) that best separates the tags. This line is the decision boundary: anything that falls to one side of it we classify as blue, and anything that falls to the other side as red.
In 2D, the best hyper-plane is simply a line.
Support Vector Machines cont…
But, what exactly is the best hyperplane? For SVM, it’s the one that maximizes the margins from both tags. In other
words: the hyperplane (in 2D, it's a line) whose distance to the nearest element of each tag is the largest.
Support Vector Machines cont…
(Scenario-1) Identification of the right hyperplane: Here, we have three hyperplanes (A, B and C). Now, the job is to
identify the right hyperplane to classify star and circle. Remember a thumb rule - identify the right hyperplane: “Select
the hyperplane which segregates the two classes better”. In this scenario, hyperplane “B” has excellently performed this
job.
Support Vector Machines cont…
(Scenario-2) Identify the right hyperplane: Here, we have three hyperplanes (A, B and C) and all are segregating the
classes well. Now, How can we identify the right hyperplane?
Here, maximizing the distances between nearest data point (either class) and hyperplane will help to decide the right
hyperplane. This distance is called as margin.
Support Vector Machines cont…
Below, you can see that the margin for hyperplane C is high compared to both A and B. Hence, we name C as the right hyperplane. Another compelling reason for selecting the hyperplane with the higher margin is robustness: if we select a hyperplane having a low margin, then there is a high chance of misclassification.
Support Vector Machines cont…
(Scenario-3) Identify the right hyperplane: Use the rules as discussed in previous section to identify the right
hyperplane.
Some of you may have selected hyperplane B as it has a higher margin compared to A. But here is the catch: SVM selects the hyperplane which classifies the classes accurately prior to maximizing the margin. Here, hyperplane B has a classification error and A has classified all points correctly. Therefore, the right hyperplane is A.
Support Vector Machines cont…
(Scenario-4) Below, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.
The star at the other end is like an outlier for the star class. The SVM algorithm has a feature to ignore outliers and find the hyperplane that has the maximum margin. Hence, we can say that SVM classification is robust to outliers.
Support Vector Machines cont…
(Scenario-5) Find the hyperplane to segregate the two classes: In this scenario, we cannot have a linear hyperplane between the two classes, so how does SVM classify these two classes? Till now, we have only looked at linear hyperplanes.
Types of SVM
Linear SVM: Linear SVM is used for data that are linearly separable, i.e. for a dataset that can be separated into two categories by a single straight line. Such data points are termed linearly separable data, and the classifier used is described as a linear SVM classifier.
Non-linear SVM: Non-Linear SVM is used for data that are non-linearly separable data i.e. a straight line
cannot be used to classify the dataset. For this, we use something known as a kernel trick that sets data
points in a higher dimension where they can be separated using planes or other mathematical functions.
Such data points are termed as non-linear data, and the classifier used is termed as a non-linear SVM
classifier.
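A minimal scikit-learn sketch contrasting the two (an illustration with made-up points, not the slides' own example):

```python
# A minimal sketch: a linear SVM versus an RBF-kernel (non-linear) SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])                 # two linearly separable classes

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)            # maximum-margin line
rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)   # kernel trick

print(linear_svm.support_vectors_)               # the points that define the margin
print(linear_svm.predict([[3, 2], [7, 6]]))      # classify two new points
print(rbf_svm.predict([[3, 2], [7, 6]]))
```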