
Module III

What Is Classification?
 Data classification is a two-step process, consisting of a learning
step (where a classification model is constructed) and a
classification step (where the model is used to predict class labels
for given data).
 For example, a bank loans officer needs analysis of her data to learn
which loan applicants are "safe" and which are "risky" for the bank.
For this, classification is done as follows.
 In the first step, a classifier is built describing a predetermined set
of data classes or concepts. This is the learning step (or training
phase), where a classification algorithm builds the classifier by
analyzing or “learning from” a training set made up of database
tuples and their associated class labels.
 A tuple, X, is represented by an n-dimensional attribute
vector, X = (x1, x2,..., xn), depicting n measurements made
on the tuple from n database attributes, respectively, A1,
A2,..., An.
 Each tuple, X, is assumed to belong to a predefined class as
determined by another database attribute called the class
label attribute. The class label attribute is discrete-valued and
unordered. It is categorical (or nominal) in that each value
serves as a category or class.
 The individual tuples making up the training set are referred
to as training tuples and are randomly sampled from the
database under analysis.
 Because the class label of each training tuple is provided, this
step is also known as supervised learning.
 It contrasts with unsupervised learning (or clustering), in
which the class label of each training tuple is not known, and
the number or set of classes to be learned may not be known
in advance
 This first step of the classification process can also be viewed
as the learning of a mapping or function, y = f (X), that can
predict the associated class label y of a given tuple X.
 This mapping or function separates the data classes.
 Typically, this mapping is represented in the form of
classification rules, decision trees, or mathematical formulae
 In the above example, the mapping is represented as
classification rules that identify loan applications as being
either safe or risky.
 The rules can be used to categorize future data tuples, as well
as provide deeper insight into the data contents.
 In the second step (Figure b), the model is used for classification.
 First, the predictive accuracy of the classifier is estimated. If the
training set is used to measure the classifier’s accuracy, this
estimate would likely be optimistic, because the classifier tends to
overfit the data.
 Therefore, a test set is used, made up of test tuples and their
associated class labels.
 They are independent of the training tuples, meaning that they
were not used to construct the classifier.
 The accuracy of a classifier on a given test set is the percentage of
test set tuples that are correctly classified by the classifier. The
associated class label of each test tuple is compared with the
learned classifier’s class prediction for that tuple.
 If the accuracy of the classifier is considered acceptable, the
classifier can be used to classify future data tuples for which the
class label is not known.
DECISION TREE-BASED ALGORITHMS
 The decision tree approach is most useful in classification
problems.
 With this technique, a tree is constructed to model the
classification process.
 Once the tree is built, it is applied to each tuple in the
database and results in a classification for that tuple.
 There are two basic steps in the technique: building the tree
and applying the tree to the database.
 Most research has focused on how to build effective trees as
the application process is straightforward.
DECISION TREE-BASED ALGORITHMS
 DEFINITION 4.3. Given a database D = {t1, ..., tn} where ti = (ti1, ..., tih) and the database schema contains the attributes {A1, A2, ..., Ah}. Also given is a set of classes C = {C1, ..., Cm}. A decision tree (DT) or classification tree is a tree associated with D that has the following properties:
 Each internal node is labeled with an attribute, Ai.
 Each arc is labeled with a predicate that can be applied to the attribute associated with the parent.
 Each leaf node is labeled with a class, Cj.
 Solving the classification problem using decision trees is a two-step process:
 1. Decision tree induction: construct a DT using training data.
 2. For each ti ∈ D, apply the DT to determine its class.
DECISION TREE-BASED ALGORITHMS
 Advantages
 1. DTs are easy to use and efficient.
 2. Rules can be generated that are easy to interpret and
understand.
 3. They scale well for large databases because the tree size is
independent of the database size.
 4. Trees can be constructed for data with many attributes.
 Disadvantages
 1. Do not easily handle continuous data.
 2. Handling missing data is difficult because the correct branches in the tree cannot be taken.
 Since the DT is constructed from the training data, overfitting may
occur. This can be overcome via tree pruning.
 Correlations among attributes in the database are ignored by the
DT process.
Decision Tree Induction
 The algorithm is called with three parameters: D, attribute list, and Attribute selection method, where
 D is the complete set of training tuples and their associated class labels.
 The parameter attribute list is a list of attributes describing the
tuples.
 Attribute selection method specifies a heuristic procedure for
selecting the attribute that “best” discriminates the given
tuples according to class.
Decision Tree Induction
 The tree starts as a single node, N, representing the training
tuples in D. (step 1).
 If the tuples in D are all of the same class, then node N becomes a
leaf and is labeled with that class (steps 2 and 3). Note that
steps 4 and 5 are terminating conditions.
 Otherwise, the algorithm calls Attribute selection method to
determine the splitting criterion. The splitting criterion tells
us which attribute to test at node N by determining the “best”
way to separate or partition the tuples in D into individual
classes(step 6). The splitting criterion is determined so that,
ideally, the resulting partitions at each branch are as “pure” as
possible. A partition is pure if all the tuples in it belong to
the same class.
Decision Tree Induction
 The node N is labeled with the splitting criterion, which serves as a test at the node (step 7). A branch is grown from node N for each of the outcomes of the splitting criterion, and the tuples in D are partitioned accordingly (steps 10 to 11). There are three possible scenarios, as illustrated in Figure. Let A be the splitting attribute. A has v distinct values, {a1, a2, ..., av}, based on the training data.
 1.A is discrete-valued: In this case, the outcomes of the test at
node N correspond directly to the known values of A. A branch is
created for each known value, aj of A and labeled with that value.
Partition Dj is the subset of class-labeled tuples in D having
value aj of A. Then A is removed from the attribute list.
Decision Tree Induction
 2.A is continuous-valued: In this case, the test at node N has
two possible outcomes, corresponding to the conditions A <=
split point and A > split point, respectively, where split point is the
split-point returned by Attribute selection method as part of the
splitting criterion. Two branches are grown from N and
labelled according to the previous outcomes (Figure). The tuples are
partitioned such that D1 holds the subset of class-labelled tuples
in D for which A<= split point, while D2 holds the rest.
Decision Tree Induction
 3. A is discrete-valued and a binary tree must be
produced (as dictated by the attribute selection
measure or algorithm being used): The test at node N is
of the form "A ∈ SA?", where SA is the splitting subset for A, returned by Attribute selection method as part of the splitting criterion. It is a subset of the known values of A. If a given tuple has value aj of A and if aj ∈ SA, then the test at node N is satisfied. Two branches are grown from N (Figure). By convention, the left
branch out of N is labeled yes so that D1 corresponds to the subset
of class-labelled tuples in D that satisfy the test. The right branch
out of N is labelled no so that D2 corresponds to the subset of
class-labelled tuples from D that do not satisfy the test.
Decision Tree Induction
 The algorithm uses the same process recursively to form a
decision tree for the tuples at each resulting partition, Dj , of
D (step 14).
 The recursive partitioning stops only when any one of the
following terminating conditions is true:
 1. All the tuples in partition D (represented at node N) belong to
the same class (steps 2 and 3).
 2. There are no remaining attributes on which the tuples may be further partitioned (step 4). In this case, majority voting is employed (step 5). This involves converting node N into a leaf and labeling it with the most common class in D. Alternatively, the class distribution of the node tuples may be stored.
 3. There are no tuples for a given branch, that is, a partition Dj is empty (step 12). In this case, a leaf is created with the
majority class in D (step 13).
Decision Tree Induction
 The computational complexity of the algorithm given training set D is O(n × |D| × log(|D|)), where n is the number of attributes describing the tuples in D and |D| is the number of training tuples in D.
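To make the induction procedure above concrete, the following is a minimal Python sketch, assuming tuples are represented as dictionaries with a 'class' key, discrete-valued attributes only, and an `attribute_selection` callback standing in for any of the measures discussed next (information gain, gain ratio, Gini index); it is an illustration of the loop, not the textbook pseudocode.

```python
from collections import Counter

def majority_class(tuples):
    """Most common class label among the given training tuples."""
    return Counter(t['class'] for t in tuples).most_common(1)[0][0]

def induce_tree(tuples, attributes, attribute_selection):
    """Recursive decision-tree induction following the steps described above."""
    classes = {t['class'] for t in tuples}
    if len(classes) == 1:                     # terminating condition 1: pure partition
        return {'leaf': classes.pop()}
    if not attributes:                        # terminating condition 2: majority voting
        return {'leaf': majority_class(tuples)}
    best = attribute_selection(tuples, attributes)    # the "best" splitting attribute
    node = {'attribute': best, 'branches': {}}
    for value in {t[best] for t in tuples}:           # one branch per known value of A
        subset = [t for t in tuples if t[best] == value]
        remaining = [a for a in attributes if a != best]
        node['branches'][value] = induce_tree(subset, remaining, attribute_selection)
    return node
```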
ID3 Algorithm
 ID3 uses information gain as its attribute selection measure.
Inputs: R: a set of non-target attributes, C: the target attribute, D: training data.
Output: returns a decision tree
Start
Initialize to empty tree;
If D is empty then
Return a single node with value Failure
End If
If all records in D have the same value for the target attribute then
Return a single node with that value
End If
If R is empty then
Return a single node with value the most common value of the target attribute found in D
End If
A ← the attribute in R that has the largest Gain(A, D) among all the attributes of R
{aj, j = 1, 2, ..., m} ← the values of attribute A
{Dj, j = 1, 2, ..., m} ← the subsets of D whose records have value aj for attribute A, respectively
Return a tree whose root is labeled A and whose arcs are labeled a1, a2, ..., am, going respectively to the sub-trees ID3(R - {A}, C, D1), ID3(R - {A}, C, D2), ..., ID3(R - {A}, C, Dm)
End
ID3 Algorithm
 The gain of an attribute A is calculated as follows.
If D is the sample or training data, then the information carried by it (also called entropy) is given by the equation

Entropy(D) = Info(D) = - Σ (i = 1..m) pi log2(pi)

where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D| (|Ci,D| is the number of tuples in D having class value Ci), and m is the number of classes.
Now Gain(A,D) = Info(D) - InfoA(D), where

InfoA(D) = Σ (j = 1..v) (|Dj|/|D|) × Info(Dj)

and v is the number of attribute values of A and Dj is the set of tuples having the same value aj for attribute A.
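A small Python sketch of Info(D) and Gain(A, D), assuming the same dictionary-based tuple representation as in the induction sketch above:

```python
import math
from collections import Counter

def info(tuples):
    """Entropy: Info(D) = -sum over classes of p_i * log2(p_i)."""
    n = len(tuples)
    counts = Counter(t['class'] for t in tuples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(tuples, attribute):
    """Gain(A, D) = Info(D) - Info_A(D), where Info_A(D) weights each
    partition Dj by |Dj| / |D|."""
    n = len(tuples)
    partitions = {}
    for t in tuples:
        partitions.setdefault(t[attribute], []).append(t)
    info_a = sum(len(dj) / n * info(dj) for dj in partitions.values())
    return info(tuples) - info_a
```

Selecting, at each node, the attribute with the largest info_gain value turns the earlier induction sketch into an ID3-style tree builder.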
Q1. Construct the decision tree using the ID3 algorithm for the following training data.
Q2. Construct the decision tree using the ID3 algorithm for the following training data.
Advantages of ID3
 Understandable prediction rules are created from the
training data.
 Builds the fastest tree.
 Builds a short tree.
 Only need to test enough attributes until all data is
classified.
 Finding leaf nodes enables test data to be pruned, reducing the number of tests.
 The whole dataset is searched to create the tree.
Disadvantages of ID3
 Only one attribute at a time is tested for making a decision.
 Classifying continuous data may be computationally
expensive, as many trees must be generated to see where to
break the continuum.
 The information gain measure is biased toward tests with
many outcomes. That is, it prefers to select attributes having
a large number of values.
C4.5 Algorithm
 It uses gain ratio as the attribute selection measure:

GainRatio(A) = Gain(A,D) / SplitInfoA(D)

where Gain(A,D) = Info(D) - InfoA(D), and

SplitInfoA(D) = - Σ (j = 1..v) (|Dj|/|D|) × log2(|Dj|/|D|)

where v is the number of values of the attribute A, |Dj| is the number of tuples having the j-th value for attribute A, and |D| is the total number of tuples.
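Continuing the same sketch (and reusing the info_gain helper from the ID3 section above), the gain ratio could be computed as follows; the guard against a zero SplitInfo is an implementation convenience, not part of the formula:

```python
import math
from collections import Counter

def split_info(tuples, attribute):
    """SplitInfo_A(D) = -sum over j of (|Dj|/|D|) * log2(|Dj|/|D|)."""
    n = len(tuples)
    counts = Counter(t[attribute] for t in tuples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(tuples, attribute):
    """GainRatio(A) = Gain(A, D) / SplitInfo_A(D), with info_gain from the ID3 sketch."""
    si = split_info(tuples, attribute)
    return info_gain(tuples, attribute) / si if si > 0 else 0.0
```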
C4.5 Algorithm
 C4.5 uses gain ratio as its attribute selection measure.
Inputs: R: a set of non-target attributes, C: the target attribute, D: training data.
Output: returns a decision tree
Start
Initialize to empty tree;
If D is empty then
Return a single node with value Failure
End If
If all records in D have the same value for the target attribute then
Return a single node with that value
End If
If R is empty then
Return a single node with value the most common value of the target attribute found in D
End If
A ← the attribute in R that has the largest GainRatio(A, D) among all the attributes of R
{aj, j = 1, 2, ..., m} ← the values of attribute A
{Dj, j = 1, 2, ..., m} ← the subsets of D whose records have value aj for attribute A, respectively
Return a tree whose root is labeled A and whose arcs are labeled a1, a2, ..., am, going respectively to the sub-trees C4.5(R - {A}, C, D1), C4.5(R - {A}, C, D2), ..., C4.5(R - {A}, C, Dm)
End
CART
 Gini Index used in CART
 The Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - Σ (i = 1..m) pi²

 where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|.
 The Gini index considers a binary split for each attribute.
 Let A be a discrete-valued attribute having v distinct values, {a1, a2, ..., av}, occurring in D.
 To determine the best binary split on A, examine all the possible subsets that can be formed using known values of A.
 Each subset, SA, can be considered as a binary test for attribute A of the form "A ∈ SA?"
 Given a tuple, this test is satisfied if the value of A for the tuple is among the values listed in SA.
 If A has v possible values, then there are 2^v possible subsets.
 In a binary split, compute a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is

GiniA(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)
 For each attribute, each of the possible binary splits is considered.
 For a discrete-valued attribute, the subset that gives the minimum
Gini index for that attribute is selected as its splitting subset
 For continuous-valued attributes, each possible split-point must be
considered.
 The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is

ΔGini(A) = Gini(D) - GiniA(D)

 The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini index) is selected as the splitting attribute.
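The same quantities expressed as a hedged Python sketch, using the dictionary-based tuples of the earlier examples; the empty-partition guard is only there to keep the sketch robust:

```python
from collections import Counter

def gini(tuples):
    """Gini(D) = 1 - sum of pi^2 over the classes present in D."""
    n = len(tuples)
    if n == 0:
        return 0.0
    counts = Counter(t['class'] for t in tuples)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_binary_split(tuples, attribute, subset):
    """Gini_A(D) for the binary test "A in subset": weighted Gini of D1 and D2."""
    d1 = [t for t in tuples if t[attribute] in subset]
    d2 = [t for t in tuples if t[attribute] not in subset]
    n = len(tuples)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def gini_reduction(tuples, attribute, subset):
    """Delta Gini(A) = Gini(D) - Gini_A(D); CART picks the split maximizing this."""
    return gini(tuples) - gini_binary_split(tuples, attribute, subset)
```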
Tree Pruning
 When a decision tree is built, many of the branches will reflect
anomalies in the training data due to noise or outliers.
 Tree pruning methods address this problem of overfitting the data.
 These methods use statistical measures to remove the least-reliable
branches.
 Pruned trees tend to be smaller and less complex and, thus, easier
to comprehend.
 They are usually faster and better at correctly classifying
independent test data (i.e., of previously unseen tuples) than
unpruned trees.
 An unpruned tree and a pruned version of it are shown in Figure.
 There are two common approaches to tree pruning:
prepruning and postpruning.
 In the prepruning approach, a tree is “pruned” by
halting its construction early.
 Upon halting, the node becomes a leaf.
 The leaf may hold the most frequent class among the subset
tuples or the probability distribution of those tuples.
 When constructing a tree, measures such as statistical
significance, information gain,Gini index, and so on, can be
used to assess the goodness of a split.
 If partitioning the tuples at a node would result in a split that
falls below a prespecified threshold, then further partitioning
of the given subset is halted.
 There are difficulties, however, in choosing an appropriate
threshold.
 High thresholds could result in oversimplified trees, whereas
low thresholds could result in very little simplification.
 The second and more common approach is postpruning,
which removes subtrees from a “fully grown” tree.
 A subtree at a given node is pruned by removing its branches
and replacing it with a leaf.
 The leaf is labeled with the most frequent class among the
subtree being replaced.
 For example, notice the subtree at node “A3?” in the unpruned
tree of Figure.
 Suppose that the most common class within this subtree is
“class B.”
 In the pruned version of the tree, the subtree in question is
pruned by replacing it with the leaf “class B.”
 The cost complexity pruning algorithm used in
CART is an example of the postpruning approach.
 Decision trees can suffer from repetition and replication.
 Repetition occurs when an attribute is repeatedly tested
along a given branch of the tree (e.g., “age < 60?,” followed by
“age < 45?,” and so on).
 In replication, duplicate subtrees exist within the
tree.
METRICS FOR EVALUATING
CLASSIFIER PERFORMANCE
Important Terminology
 positive tuples (tuples of the main class of interest)
 negative tuples (all other tuples).
 Given two classes, for example, the positive tuples may be buys computer = yes while the negative tuples are buys computer = no.
 P is the number of positive tuples and
 N is the number of negative tuples.
 True positives (TP): These refer to the positive tuples that were
correctly labeled by the classifier. Let TP be the number of true
positives.
 True negatives(TN): These are the negative tuples that were
correctly labeled by the classifier. Let TN be the number of true
negatives.
 False positives (FP): These are the negative tuples that were
incorrectly labeled as positive (e.g., tuples of class buys computer
= no for which the classifier predicted buys computer = yes). Let
FP be the number of false positives.
 False negatives (FN): These are the positive tuples that were
mislabeled as negative (e.g., tuples of class buys computer = yes
for which the classifier predicted buys computer = no). Let FN be
the number of false negatives.
 If we were to use the training set (instead of a test set) to estimate
the error rate of a model, this quantity is known as the
resubstitution error.
 The class imbalance problem arises when the main class of interest is rare. That is, the data set distribution reflects a significant majority of the negative class and a minority positive class.
 An accuracy rate of 97% may not be acceptable—the classifier
could be correctly labeling only the noncancer tuples, for instance,
and misclassifying all the cancer tuples. Instead, we need other
measures, which assess how well the classifier can recognize the
positive tuples (cancer = yes) and how well it can recognize the
negative tuples (cancer = no).
 The sensitivity and specificity measures can be used:
 Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are correctly identified): Sensitivity = TP / P.
 Specificity is the true negative rate (i.e., the proportion of negative tuples that are correctly identified): Specificity = TN / N.
 Accuracy is a function of sensitivity and specificity: Accuracy = Sensitivity × P/(P + N) + Specificity × N/(P + N).
 The precision and recall measures are also widely used in
classification.
 Precision can be thought of as a measure of exactness (i.e.,
what percentage of tuples labeled as positive are actually
such),
 Recall is a measure of completeness (what percentage of
positive tuples are labeled as such). If recall seems familiar,
that’s because it is the same as sensitivity (or the true positive
rate).
 An alternative way to use precision and recall is to combine them into a single measure. This is the approach of the F measure (also known as the F1 score or F-score) and the Fβ measure. They are defined as

Precision = TP / (TP + FP)
Recall = TP / (TP + FN) = TP / P
F = (2 × precision × recall) / (precision + recall)
Fβ = ((1 + β²) × precision × recall) / (β² × precision + recall)

where β is a non-negative real number.
 Consider the following confusion matrix for medical data where the class values are yes and no for a class label attribute, cancer (the counts implied by the figures below are TP = 90, FN = 210, FP = 140, TN = 9560, so P = 300 and N = 9700):

                 predicted yes   predicted no   total
actual yes              90            210         300
actual no              140           9560        9700
total                  230           9770       10000

 Sensitivity = 90/300 = 30.00%
 Specificity = 9560/9700 = 98.56%
 Overall Accuracy = 9650/10000 = 96.50%
 Note: Although the classifier has a high accuracy, its ability to correctly label the positive (rare) class is poor given its low sensitivity. It has high specificity, meaning that it can accurately recognize negative tuples.
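All of the measures above follow directly from the four confusion-matrix counts. The sketch below reproduces the cancer example using the counts inferred above (TP = 90, FP = 140, TN = 9560, FN = 210):

```python
def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Evaluation measures described above, from raw confusion-matrix counts."""
    p, n = tp + fn, fp + tn                  # P = actual positives, N = actual negatives
    sensitivity = tp / p                     # recall / true positive rate
    specificity = tn / n                     # true negative rate
    accuracy = (tp + tn) / (p + n)
    precision = tp / (tp + fp)
    f_beta = ((1 + beta ** 2) * precision * sensitivity /
              (beta ** 2 * precision + sensitivity))
    return {'sensitivity': sensitivity, 'specificity': specificity,
            'accuracy': accuracy, 'precision': precision, 'f': f_beta}

# The cancer example above: TP = 90, FP = 140, TN = 9560, FN = 210
print(classification_metrics(90, 140, 9560, 210))
# sensitivity 0.30, specificity ~0.9856, accuracy 0.965, precision ~0.3913, f ~0.3396
```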
Problem 1
 Consider a two-class classification problem of predicting whether
a photograph contains a man or a woman. Suppose we have a test
dataset of 10 records with expected outcomes and a set of
predictions from our classification algorithm.
Problem 2
 A database contains 80 records on a particular topic of which
55 are relevant to a certain investigation. A search was
conducted on that topic and 50 records were retrieved. Of
the 50 records retrieved, 40 were relevant. Construct the
confusion matrix and calculate the precision and recall scores
for the search
Problem 3
 For a spam email classifier, is precision or recall the more important evaluation measure? Justify your answer.
Classifier accuracy

 Estimating classifier accuracy is important in that it allows one to evaluate how accurately a given classifier will correctly label future data, i.e., data on which the classifier has not been trained.
 For example, if data from previous sales are used to train a classifier to predict customer purchasing behavior, it is necessary to estimate how accurately the classifier can predict the purchasing behavior of future customers.
Estimating classifier accuracy

 Holdout and cross-validation are two common techniques for assessing classifier accuracy,
based on randomly-sampled partitions of the given data.
Holdout method
 In the holdout method, the given data are randomly partitioned into two independent sets, a
training set and a test set.
 Typically, two thirds of the data are allocated to the training set, and the remaining one third is
allocated to the test set.
 The training set is used to derive the classifier, whose accuracy is estimated with the test set.
Random subsampling
 It is a variation of the holdout method in which the holdout method is repeated k times.
 The overall accuracy estimate is taken as the average of the accuracies obtained from each
iteration.
Estimating classifier accuracy

 Cross Validation
 In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds", S1, S2, ..., Sk, each of approximately equal size.
 Training and testing is performed k times.
 In iteration i, the subset Si is reserved as the
test set, and the remaining subsets are
collectively used to train the classifier.
Estimating classifier accuracy
 That is, the classifier of the first iteration is trained on subsets S2, ..., Sk and tested on S1; the classifier of the second iteration is trained on subsets S1, S3, ..., Sk and tested on S2; and so on.
 The accuracy estimate is the overall number of correct
classifications from the k iterations, divided by the total
number of samples in the initial data.
 In stratified cross-validation, the folds are stratified so that
the class distribution of the samples in each fold is
approximately the same as that in the initial data.
 Other methods of estimating classifier accuracy include
bootstrapping, which samples the given training instances
uniformly with replacement, and leave-one-out, which is
k-fold cross validation with k set to s, the number of initial
samples.
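A plain-Python sketch of k-fold cross-validation as described above; `train_fn` and `predict_fn` are hypothetical placeholders for whatever classifier is being evaluated:

```python
import random

def k_fold_accuracy(data, labels, train_fn, predict_fn, k=10, seed=0):
    """k-fold cross-validation: each fold is used once as the test set; accuracy is
    the total number of correct classifications divided by the number of samples."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k roughly equal-sized folds
    correct = 0
    for i in range(k):
        test = set(folds[i])
        train_x = [data[j] for j in idx if j not in test]
        train_y = [labels[j] for j in idx if j not in test]
        model = train_fn(train_x, train_y)         # train on the other k-1 folds
        correct += sum(predict_fn(model, data[j]) == labels[j] for j in folds[i])
    return correct / len(data)
```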
Increasing classifier accuracy
 Bagging (or bootstrap aggregation) and boosting are two such techniques (Figure). Each combines a series of T learned classifiers, C1, C2, ..., CT, with the aim of creating an improved composite classifier, C*.
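As an illustration only (not the full bagging or boosting procedure), a bootstrap-aggregation sketch that trains T classifiers on bootstrap samples and combines them by majority vote; `train_fn` and `predict_fn` are again hypothetical placeholders:

```python
import random
from collections import Counter

def bagging_predict(data, labels, train_fn, predict_fn, x, T=10, seed=0):
    """Bootstrap aggregation: train T classifiers C1..CT on bootstrap samples of
    the training data and classify x by majority vote of their predictions."""
    rng = random.Random(seed)
    n = len(data)
    votes = []
    for _ in range(T):
        idx = [rng.randrange(n) for _ in range(n)]           # sample with replacement
        model = train_fn([data[i] for i in idx], [labels[i] for i in idx])
        votes.append(predict_fn(model, x))
    return Counter(votes).most_common(1)[0][0]               # composite classifier C*
```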
Cluster Analysis
 Clustering is the process of grouping the data into classes or
clusters so that objects within a cluster have high similarity in
comparison to one another, but are very dissimilar to objects in
other clusters.
 Dissimilarities are assessed based on the attribute values describing
the objects.
 Often, distance measures are used.
 Clustering is known as unsupervised learning because the
class label information is not present
 Because a cluster is a collection of data objects that are similar to
one another within the cluster and dissimilar to objects in other
clusters, a cluster of data objects can be treated as an implicit class.
In this sense, clustering is sometimes called automatic
classification.
Applications of clustering
 Cluster analysis has been widely used in many applications such as
business intelligence, image pattern recognition, Web search,
biology, and security.
 In business intelligence, clustering can be used to organize a large number of customers into groups, where customers within a group share strongly similar characteristics.
 This facilitates the development of business strategies for enhanced
customer relationship management.
 In image recognition, clustering can be used to discover
clusters or “subclasses” in handwritten character recognition
systems.
 There can be a large variance in the way in which people write the same digit. Based on these variations, subclasses can be formed, which improves the overall image recognition process.
Applications of clustering
 In Web search, a keyword search may often return a very large
number of hits due to the extremely large number of web pages.
 Clustering can be used to organize the search results into groups
and present the results in a concise and easily accessible way.
 Clustering techniques have been developed to cluster documents
into topics, which are commonly used in information retrieval
practice.
 As a data mining function, cluster analysis can be used as a
standalone tool to gain insight into the distribution of data, to
observe the characteristics of each cluster, and to focus on a
particular set of clusters for further analysis
Requirements of clustering in data
mining
 Scalability: Many clustering algorithms work well on small data sets; however, clustering on only a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are therefore needed.
 Ability to deal with different types of attributes:
Many algorithms are designed to cluster interval-based
(numerical) data. However, applications may require
clustering other types of data, such as binary, categorical
(nominal), and ordinal data, or mixtures of these data types
Requirements of clustering in data mining
 Discovery of clusters with arbitrary shape: Most algorithms produce spherical clusters of similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
 Minimal requirements for domain knowledge to
determine input parameters.
 Ability to deal with noisy data.
 Insensitivity to the order of input records: The order of the records should not affect the resulting clusters.
 High dimensionality: Algorithms should work well for data of any dimensionality.
 Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints, so algorithms must be able to cope with a variety of constraints.
 Interpretability and usability: Users expect clustering results
to be interpretable, comprehensible, and usable.
A categorization of major clustering
methods
 Large number of clustering algorithms in the literature
 The choice of clustering algorithm depends both on the type
of data available and on the particular purpose and
application.
 In general, major clustering methods can be classified into
the following categories.
1. Partitioning methods
 Given a database of n objects or data tuples, a partitioning method
constructs k partitions of the data, where each partition represents
a cluster, and k ≤ n.
 That is, it classifies the data into k groups, which together satisfy
the following requirements.
 (1) Each group must contain at least one object, and
(2) Each object must belong to exactly one group.
 Given k, the number of partitions to construct, a partitioning
method creates an initial partitioning.
 It then uses an iterative relocation technique which attempts to
improve the partitioning by moving objects from one group to
another.
 Eg:(1) the k-means algorithm, where each cluster is represented
by the mean value of the objects in the cluster;
(2) the k-medoids algorithm, where each cluster is represented by
one of the objects located near the center of the cluster
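A minimal k-means sketch of the iterative relocation idea, assuming points are tuples of numbers; it is a toy illustration rather than a production implementation:

```python
import random

def k_means(points, k, iterations=100, seed=0):
    """Toy k-means: each cluster is represented by the mean of its members;
    points are reassigned to the nearest mean until the means stop changing."""
    rng = random.Random(seed)
    means = [tuple(map(float, p)) for p in rng.sample(points, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment (relocation) step
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, means[i])))
            clusters[j].append(p)
        new_means = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[i]
                     for i, cl in enumerate(clusters)]
        if new_means == means:                    # converged: no mean moved
            break
        means = new_means
    return means, clusters
```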
2. Hierarchical methods.
 A hierarchical method creates a hierarchical decomposition
of the given set of data objects.
 A hierarchical method can be classified as being either
agglomerative or divisive, based on how the hierarchical
decomposition is formed.
 The agglomerative approach, also called the "bottom-up" approach, starts with each object forming a separate group.
 It successively merges the objects or groups close to one
another, until all of the groups are merged into one (the
topmost level of the hierarchy), or until a termination
condition holds
2. Hierarchical methods.
 The divisive approach, also called the "top-down" approach, starts with all the objects in the same cluster.
 In each successive iteration, a cluster is split up into smaller
clusters, until eventually each object is in one cluster, or until
a termination condition holds.
 Hierarchical methods suffer from the fact that once a step
(merge or split) is done, it can never be undone.
3. Density-based methods
 Most partitioning methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shape.
 Density-based methods, based on the notion of density, can produce clusters of arbitrary shape.
 The general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold.
 That is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
 E.g., DBSCAN, OPTICS
4. Grid-based methods
 Grid-based methods quantize the object space into a finite
number of cells which form a grid structure.
 All of the clustering operations are performed on the grid
structure (i.e., on the quantized space).
 The main advantage of this approach is its fast processing
time which is typically independent of the number of data
objects, and dependent only on the number of cells in each
dimension in the quantized space.
 Eg: STING
Summary
K medoid clustering algorithm
 A problem of the k-means algorithm is that, if outliers are present, it may cluster the data points wrongly.
 The distance between the centroid and the data objects may become very large.
Steps
1.Initialize: randomly select k of the n data points as
the medoids.
2. Assignment step: Associate each data point to the closest
medoid by calculating the distance from the medoid.
(Manhattan distance is used, which is calculated for two points (x1, y1) and (x2, y2) as |x2 - x1| + |y2 - y1|.)
3.Update step: For each medoid m and each data
point o associated to m swap m and o and compute the total
cost of the configuration (that is, the average dissimilarity
of o to all the data points associated to m). Select the
medoid o with the lowest cost of the configuration.
4.Repeat alternating steps 2 and 3 until there is no change in
the assignments.
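A hedged Python sketch of these steps using Manhattan distance and a PAM-style swap loop; this is one straightforward way to realize the update step, not the only one:

```python
import random

def manhattan(p, q):
    """Manhattan distance |x2 - x1| + |y2 - y1| (generalized to any dimension)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    """Sum of distances from every point to its closest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    """PAM-style sketch of the steps above: start from k random medoids, then keep
    trying medoid/non-medoid swaps while the total configuration cost decreases."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    improved = True
    while improved:                               # repeat steps 2 and 3 until stable
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]   # swap m and o
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids, improved = candidate, True
    return medoids
```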
 Use the k-medoids algorithm to cluster the following data into 2 clusters.
DBSCAN Algorithm
 What is DBSCAN ?
➢ DBSCAN is an algorithm that can identify clusters in large spatial data by looking at the local density of the data elements.
➢ DBSCAN can find clusters of arbitrary shape. Points that lie close to each other tend to belong to the same cluster.
Terminologies
 Eps: maximum radius of the neighborhood.
 MinPts: minimum number of points in the Eps-neighborhood of a point.
 The Eps-neighborhood of a point: N(p) = {q ∈ D | dist(p, q) <= Eps}. For a point to belong to a cluster, it needs to have at least one other point that lies within distance Eps of it.
 Core point: a point is a core point if and only if it has at least MinPts neighbors.
 Border point: a point is a border point if it does not have MinPts neighbors but shares a neighborhood with at least one core point.
 Noise point: a point which does not share a neighborhood with any core point.
Terminologies
 Directly density reachable: A point p is directly density reachable from a point q with respect to Eps and MinPts if
1. p belongs to the neighborhood of q, i.e., dist(q, p) <= Eps, and
2. q is a core point.
Terminologies
 Density reachable: A point p is density reachable from a point q with respect to Eps and MinPts if there is a chain of points p1, p2, ..., pn with p1 = q and pn = p such that pi+1 is directly density reachable from pi.
 Density connected: A point p is density connected to a point q
with respect to Eps and Minpts if there is a point o such that both
p and q are density reachable from point o with respect to Eps and
Minpts.
Terminologies
DBSCAN algorithm
 Label all points as core, border, or noise points.
 Put an edge among all core points that are within Eps of each
other.
 Make each group of connected core points into separate
cluster.
 Assign each of the border points to one of the clusters of its
associated core points.
 If a point is a noise point then it is either ignored or kept as
reference to check different aspects of the dataset.
 When all the instances are processed then the algorithm
converges.
Pseudo code
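The pseudo code figure is not reproduced here; instead, below is a hedged Python sketch of the procedure described above, assuming distinct points represented as hashable tuples and `dist` being any distance function (e.g., Euclidean):

```python
def region_query(points, p, eps, dist):
    """Eps-neighborhood N(p): every point within distance eps of p (p included)."""
    return [q for q in points if dist(p, q) <= eps]

def dbscan(points, eps, min_pts, dist):
    """DBSCAN sketch: grow a cluster from each unvisited core point by following
    density-reachable points; points reachable from no core point stay as noise (-1)."""
    labels = {p: None for p in points}
    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        neighbors = region_query(points, p, eps, dist)
        if len(neighbors) < min_pts:
            labels[p] = -1                        # provisionally noise
            continue
        cluster_id += 1                           # p is a core point: start a new cluster
        labels[p] = cluster_id
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:                   # previously noise -> border point
                labels[q] = cluster_id
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(points, q, eps, dist)
            if len(q_neighbors) >= min_pts:       # q is itself a core point: keep expanding
                queue.extend(q_neighbors)
    return labels
```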
Advantages and disadvantages
➢ Advantages
 Does not require a predefined number of clusters.
Clusters can be of any shape.
Able to identify noisy data.
➢ Disadvantages
 Density-based algorithms fail if there is no density drop between the clusters.
It is also sensitive to the parameters that define the density (Eps and MinPts).
Proper parameter setting may require domain knowledge.
 Q: Execute the DBSCAN algorithm (Eps = 1.5, MinPts = 2) for the data given below. Determine whether the points are core, border, or noise. What are the resulting clusters?

    x1    x2
A   1     1
B   1.5   1.5
C   5     5
D   3     4
E   4     4
F   3     3.5
Hierarchical Clustering method
 A hierarchical clustering method works by grouping data
objects into a tree of clusters.
 A tree data structure, called a dendrogram, can be used to illustrate the hierarchical clustering technique and the sets of different clusters.
 The root in a dendrogram tree contains one cluster
where all elements are together.
 The leaves in the dendrogram each consist of a single
element cluster.
 Internal nodes in the dendrogram represent new
clusters formed by merging the clusters that appear as
its children in the tree.
Hierarchical Clustering method
 The figure shows six elements, {A, B, C, D, E, F}, to be clustered.
Hierarchical Clustering method
 A dendrogram for the above clustering is shown in
Figure.
Hierarchical Clustering method
 In general, there are two types of hierarchical
clustering methods.
 1. Agglomerative hierarchical clustering: This
bottom-up strategy starts by placing each object in
its own cluster and then merges these atomic
clusters into larger and larger clusters, until all of the
objects are in a single cluster or until certain
termination conditions are satisfied.
 Most hierarchical clustering methods belong to this
category.
Hierarchical Clustering method
 2. Divisive hierarchical clustering: This top-down
strategy does the reverse of agglomerative
hierarchical clustering by starting with all objects in
one cluster.
 It subdivides the cluster into smaller and smaller
pieces, until each object forms a cluster on its own
or until it satisfies certain termination conditions,
such as a desired number of clusters is obtained or
the distance between the two closest clusters is
above a certain threshold distance
Hierarchical Clustering method
 Figure shows the application of AGNES (AGglomerative
NESting), an agglomerative hierarchical clustering method,
and DIANA (DIvisive ANAlysis), a divisive hierarchical
clustering method, on a data set of five objects, {a,b, c,d, e}.
Hierarchical Clustering method
 Initially, AGNES places each object into a cluster of its own.
 The clusters are then merged step-by-step according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters.
 This is a single-link approach in that each cluster is
represented by all of the objects in the cluster, and the
similarity between two clusters is measured by the similarity
of the closest pair of data points belonging to different
clusters.
 The cluster merging process repeats until all of the objects
are eventually merged to form one cluster.
Hierarchical Clustering method
 In DIANA, all of the objects are used to form one
initial cluster.
 The cluster is split according to some principle, such
as the maximum Euclidean distance between the
closest neighboring objects in the cluster.
 The cluster splitting process repeats until,
eventually, each new cluster contains only a single
object.
Hierarchical Clustering method
 A tree structure called a dendrogram is commonly
used to represent the process of hierarchical
clustering.
 It shows how objects are grouped together (in an
agglomerative method) or partitioned (in a divisive
method) step-by-step.
 Above Figure shows a dendrogram for the five
objects a,b,c,d,e.
 At level l = 0, the five objects appear as singleton clusters. At l = 1, objects a and b are grouped together to form the first cluster, and they stay together at all subsequent levels.
Hierarchical Clustering method
 Distance Measures in Algorithmic Methods
 The two cluster-distance measures referred to below are the minimum and maximum distances between clusters, where |p - q| is the distance between objects p and q:
dmin(Ci, Cj) = min { |p - q| : p ∈ Ci, q ∈ Cj }
dmax(Ci, Cj) = max { |p - q| : p ∈ Ci, q ∈ Cj }
Hierarchical Clustering method
 When an algorithm uses the minimum distance,
dmin(Ci ,Cj) to measure the distance between clusters,
it is sometimes called a nearest-neighbor clustering
algorithm.
If the clustering process is terminated when the
distance between nearest clusters exceeds a user-
defined threshold, it is called a single-linkage
algorithm.
Hierarchical Clustering method
 If we view the data points as nodes of a graph, with
edges forming a path between the nodes in a cluster,
then the merging of two clusters, Ci and Cj , corresponds
to adding an edge between the nearest pair of nodes in
Ci and Cj .
 Because edges linking clusters always go between
distinct clusters, the resulting graph will generate a tree.
 Thus, an agglomerative hierarchical clustering algorithm
that uses the minimum distance measure is also called a
minimal spanning tree algorithm, where a spanning tree
of a graph is a tree that connects all vertices, and a
minimal spanning tree is the one with the least sum of
edge weights.
Hierarchical Clustering method
 When an algorithm uses the maximum distance,
dmax(Ci ,Cj) to measure the distance between clusters,
it is sometimes called a farthest-neighbor clustering
algorithm.
 If the clustering process is terminated when the
maximum distance between nearest clusters
exceeds a user-defined threshold, it is called a
complete-linkage algorithm.
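To make the two linkage measures concrete, a brief sketch follows; `euclid` is an assumed Euclidean distance helper and the merge loop is a simplified AGNES-style illustration, not an efficient implementation:

```python
import math

def euclid(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dist_min(ci, cj, dist):
    """Single-link distance d_min(Ci, Cj): the closest pair across the two clusters."""
    return min(dist(p, q) for p in ci for q in cj)

def dist_max(ci, cj, dist):
    """Complete-link distance d_max(Ci, Cj): the farthest pair across the two clusters."""
    return max(dist(p, q) for p in ci for q in cj)

def agglomerative(points, cluster_dist, dist, threshold):
    """AGNES-style loop: start with singleton clusters and repeatedly merge the
    closest pair until their distance exceeds the termination threshold."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]], dist))
        if cluster_dist(clusters[i], clusters[j], dist) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]   # merge Cj into Ci
        del clusters[j]                           # j > i, so index i stays valid
    return clusters
```

Calling agglomerative(points, dist_min, euclid, threshold) gives single-linkage behaviour, while passing dist_max instead gives complete-linkage behaviour.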