Module III
What Is Classification?
Data classification is a two-step process, consisting of a learning
step (where a classification model is constructed) and a
classification step (where the model is used to predict class labels
for given data).
For example, a bank loans officer needs an analysis of her data to learn
which loan applicants are “safe” and which are “risky” for the bank.
For that, classification is done as follows.
In the first step, a classifier is built describing a predetermined set
of data classes or concepts. This is the learning step (or training
phase), where a classification algorithm builds the classifier by
analyzing or “learning from” a training set made up of database
tuples and their associated class labels.
A tuple, X, is represented by an n-dimensional attribute
vector, X = (x1, x2,..., xn), depicting n measurements made
on the tuple from n database attributes, respectively, A1,
A2,..., An.
Each tuple, X, is assumed to belong to a predefined class as
determined by another database attribute called the class
label attribute. The class label attribute is discrete-valued and
unordered. It is categorical (or nominal) in that each value
serves as a category or class.
The individual tuples making up the training set are referred
to as training tuples and are randomly sampled from the
database under analysis.
Because the class label of each training tuple is provided, this
step is also known as supervised learning.
It contrasts with unsupervised learning (or clustering), in
which the class label of each training tuple is not known, and
the number or set of classes to be learned may not be known
in advance
This first step of the classification process can also be viewed
as the learning of a mapping or function, y = f (X), that can
predict the associated class label y of a given tuple X.
This mapping or function separates the data classes.
Typically, this mapping is represented in the form of
classification rules, decision trees, or mathematical formulae
In the above example, the mapping is represented as
classification rules that identify loan applications as being
either safe or risky.
The rules can be used to categorize future data tuples, as well
as provide deeper insight into the data contents.
In the second step (Figure b), the model is used for classification.
First, the predictive accuracy of the classifier is estimated. If the
training set is used to measure the classifier’s accuracy, this
estimate would likely be optimistic, because the classifier tends to
overfit the data.
Therefore, a test set is used, made up of test tuples and their
associated class labels.
They are independent of the training tuples, meaning that they
were not used to construct the classifier.
The accuracy of a classifier on a given test set is the percentage of
test set tuples that are correctly classified by the classifier. The
associated class label of each test tuple is compared with the
learned classifier’s class prediction for that tuple.
If the accuracy of the classifier is considered acceptable, the
classifier can be used to classify future data tuples for which the
class label is not known.
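The two steps above can be illustrated with a minimal Python sketch using scikit-learn (the library choice and the toy loan-applicant data are assumptions, not part of the original example):

# Step 1 (learning): build a classifier from class-labeled training tuples.
# Step 2 (classification): estimate accuracy on an independent test set, then label new tuples.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical loan-applicant tuples: (income in thousands, years employed) -> class label
X = [[25, 1], [60, 8], [15, 0], [80, 12], [40, 3], [70, 10], [20, 2], [55, 6]]
y = ["risky", "safe", "risky", "safe", "risky", "safe", "risky", "safe"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
clf = DecisionTreeClassifier().fit(X_train, y_train)           # learning step
print("estimated accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("new applicant:", clf.predict([[30, 2]]))                # classify a future tuple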
DECISION TREE-BASED ALGORITHMS
The decision tree approach is most useful in classification
problems.
With this technique, a tree is constructed to model the
classification process.
Once the tree is built, it is applied to each tuple in the
database and results in a classification for that tuple.
There are two basic steps in the technique: building the tree
and applying the tree to the database.
Most research has focused on how to build effective trees as
the application process is straightforward.
DECISION TREE-BASED ALGORITHMS
DEFINITION 4.3. Given a database D = {t1, ..., tn}, where ti =
(ti1, ..., tih), and a database schema containing the attributes
{A1, A2, ..., Ah}. Also given is a set of classes C =
{C1, ..., Cm}. A decision tree (DT) or classification tree is a tree
associated with D that has the following properties:
Each internal node is labeled with an attribute, Ai.
Each arc is labeled with a predicate that can be applied to the
attribute associated with the parent.
Each leaf node is labeled with a class, C j.
Solving the classification problem using decision trees is a two-step
process:
1. Decision tree induction: Construct a DT using training data.
2. For each ti ∈ D, apply the DT to determine its class (see the sketch below).
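As a small illustration of applying a decision tree to tuples (the attributes, values, and classes below are assumptions, not from the text), a DT can be represented and applied as follows:

# Sketch: a decision tree as nested dictionaries; leaves are class labels.
tree = {
    "attribute": "income",
    "branches": {
        "low":    "risky",
        "medium": {"attribute": "credit_rating",
                   "branches": {"fair": "risky", "excellent": "safe"}},
        "high":   "safe",
    },
}

def classify(node, tuple_x):
    # Walk from the root, following the arc matching the tuple's attribute value,
    # until a leaf (class label) is reached.
    while isinstance(node, dict):
        node = node["branches"][tuple_x[node["attribute"]]]
    return node

print(classify(tree, {"income": "medium", "credit_rating": "excellent"}))   # -> safe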
DECISION TREE-BASED ALGORITHMS
Advantages
1. DTs are easy to use and efficient.
2. Rules can be generated that are easy to interpret and
understand.
3. They scale well for large databases because the tree size is
independent of the database size.
4. Trees can be constructed for data with many attributes.
Disadvantages
1. They do not easily handle continuous data.
2. Handling missing data is difficult because the correct branch in
the tree cannot be taken.
Since the DT is constructed from the training data, overfitting may
occur. This can be overcome via tree pruning.
Correlations among attributes in the database are ignored by the
DT process.
Decision Tree Induction
The algorithm is called with three parameters: D, attribute list,
and Attribute selection method. Where
D is the complete set of training tuples and their associated
class labels.
The parameter attribute list is a list of attributes describing the
tuples.
Attribute selection method specifies a heuristic procedure for
selecting the attribute that “best” discriminates the given
tuples according to class.
Decision Tree Induction
The tree starts as a single node, N, representing the training
tuples in D (step 1).
If the tuples in D are all of the same class, then node N becomes a
leaf and is labeled with that class (steps 2 and 3). Note that
steps 4 and 5 are terminating conditions.
Otherwise, the algorithm calls Attribute selection method to
determine the splitting criterion. The splitting criterion tells
us which attribute to test at node N by determining the “best”
way to separate or partition the tuples in D into individual
classes (step 6). The splitting criterion is determined so that,
ideally, the resulting partitions at each branch are as “pure” as
possible. A partition is pure if all the tuples in it belong to
the same class.
Decision Tree Induction
The node N is labeled with the splitting criterion, which serves as a
test at the node (step 7). A branch is grown from node N for each
of the outcomes of the splitting criterion. The tuples in D are
partitioned accordingly (steps 10 to 11). There are three possible
scenarios, as illustrated in Figure. Let A be the splitting
attribute. A has v distinct values, {a1, a2, : : : , av}, based on the
training data.
1. A is discrete-valued: In this case, the outcomes of the test at
node N correspond directly to the known values of A. A branch is
created for each known value, aj, of A and labeled with that value.
Partition Dj is the subset of class-labeled tuples in D having
value aj of A. A is then removed from the attribute list.
Decision Tree Induction
2. A is continuous-valued: In this case, the test at node N has
two possible outcomes, corresponding to the conditions A <=
split point and A > split point, respectively, where split point is the
split-point returned by Attribute selection method as part of the
splitting criterion. Two branches are grown from N and
labeled according to these outcomes (Figure). The tuples are
partitioned such that D1 holds the subset of class-labeled tuples
in D for which A <= split point, while D2 holds the rest.
Decision Tree Induction
3. A is discrete-valued and a binary tree must be
produced (as dictated by the attribute selection
measure or algorithm being used): The test at node N is
of the form “A ∈ SA?”, where SA is the splitting subset for A, returned
by Attribute selection method as part of the splitting criterion. It
is a subset of the known values of A. If a given tuple has value
aj of A and if aj ∈ SA, then the test at node N is satisfied. Two
branches are grown from N (Figure). By convention, the left
branch out of N is labeled yes, so that D1 corresponds to the subset
of class-labeled tuples in D that satisfy the test. The right branch
out of N is labeled no, so that D2 corresponds to the subset of
class-labeled tuples from D that do not satisfy the test.
Decision Tree Induction
The algorithm uses the same process recursively to form a
decision tree for the tuples at each resulting partition, Dj , of
D (step 14).
The recursive partitioning stops only when any one of the
following terminating conditions is true:
1. All the tuples in partition D (represented at node N) belong to
the same class (steps 2 and 3).
2. There are no remaining attributes on which the tuples may
be further partitioned (step 4). In this case, majority voting is
employed (step 5). This involves converting node N into a leaf
and labeling it with the most common class in D. Alternatively, the
class distribution of the node tuples may be stored.
3. There are no tuples for a given branch, that is, a partition
Dj is empty (step 12). In this case, a leaf is created with the
majority class in D (step 13).
Decision Tree Induction
The computational complexity of the algorithm, given
training set D, is O(n × |D| × log(|D|)), where n is the number of
attributes describing the tuples in D and |D| is the number of
training tuples in D.
ID3 Algorithm
ID3 uses information gain as its attribute selection measure.
Inputs: R: the set of non-target attributes, C: the target attribute, D: the training data.
Output: a decision tree
Start
Initialize an empty tree;
If D is empty then
Return a single node with value Failure
End If
If all records in D have the same value for the target attribute then
Return a single node with that value
End If
If R is empty then
Return a single node with value the most common value of the target attribute found in D
End If
A ← the attribute in R with the largest Gain(A, D)
{aj, j = 1, 2, ..., m} ← the values of attribute A
{Dj, j = 1, 2, ..., m} ← the subsets of D consisting of the records having value aj for attribute A
Return a tree whose root is labeled A, whose arcs are labeled a1, a2, ..., am, and which lead to the
subtrees ID3(R − {A}, C, D1), ID3(R − {A}, C, D2), ..., ID3(R − {A}, C, Dm)
End
ID3 Algorithm
Gain of an attribute A is calculated as follows.
If D is the training data with m classes, the information carried by D
(also called entropy) is
Entropy(D) = − Σ_{i=1}^{m} p_i log2(p_i),
where p_i is the proportion of tuples in D belonging to class C_i.
Splitting D on attribute A gives the expected information
Entropy_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Entropy(Dj),
where v is the number of distinct values of A and Dj is the set of
tuples having the value aj for A.
The gain is then
Gain(A) = Entropy(D) − Entropy_A(D).
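A small Python sketch of these two formulas (the toy attribute values and class labels are assumptions, purely for illustration):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(D) = -sum_i p_i * log2(p_i) over the class distribution of D.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    # Gain(A) = Entropy(D) - sum_j (|Dj|/|D|) * Entropy(Dj).
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    expected = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - expected

# Toy data (assumed): attribute 0 = outlook, class label = play?
rows   = [["sunny"], ["sunny"], ["overcast"], ["rain"], ["rain"]]
labels = ["no", "no", "yes", "yes", "no"]
print(information_gain(rows, labels, 0))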
Q1. Construct the decision tree using the ID3 algorithm for the
following training data.
Q2. Construct the decision tree using the ID3 algorithm for the
following training data.
Advantages of ID3
Understandable prediction rules are created from the
training data.
Builds the fastest tree.
Builds a short tree.
Only needs to test enough attributes until all data is
classified.
Finding leaf nodes enables test data to be pruned, reducing the
number of tests.
The whole dataset is searched to create the tree.
Disadvantages of ID3
Only one attribute at a time is tested for making a decision.
Classifying continuous data may be computationally
expensive, as many trees must be generated to see where to
break the continuum.
The information gain measure is biased toward tests with
many outcomes. That is, it prefers to select attributes having
a large number of values.
C4.5 Algorithm
It uses gain ratio as its attribute selection measure.
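C4.5 selects the attribute with the maximum gain ratio. For reference (standard definitions, not reproduced on the original slide):
SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)
GainRatio(A) = Gain(A) / SplitInfo_A(D)
This normalization reduces the information gain measure's bias toward attributes with many values.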
Holdout and cross-validation are two common techniques for assessing classifier accuracy,
based on randomly-sampled partitions of the given data.
Holdout method
In the holdout method, the given data are randomly partitioned into two independent sets, a
training set and a test set.
Typically, two thirds of the data are allocated to the training set, and the remaining one third is
allocated to the test set.
The training set is used to derive the classifier, whose accuracy is estimated with the test set.
Random subsampling
It is a variation of the holdout method in which the holdout method is repeated k times.
The overall accuracy estimate is taken as the average of the accuracies obtained from each
iteration.
Estimating classifier accuracy
Cross Validation
In k-fold cross validation, the initial data are
randomly partitioned into k mutually
exclusive subsets or "folds", S1, S2, ..., Sk,
each of approximately equal size.
Training and testing is performed k times.
In iteration i, the subset Si is reserved as the
test set, and the remaining subsets are
collectively used to train the classifier.
Estimating classifier accuracy
That is, the classifier of the first iteration is trained on
subsets S2, ..., Sk and tested on S1; the classifier of the
second iteration is trained on subsets S1, S3, ..., Sk and
tested on S2; and so on.
The accuracy estimate is the overall number of correct
classifications from the k iterations, divided by the total
number of samples in the initial data.
In stratified cross-validation, the folds are stratified so that
the class distribution of the samples in each fold is
approximately the same as that in the initial data.
Other methods of estimating classifier accuracy include
bootstrapping, which samples the given training instances
uniformly with replacement, and leave-one-out, which is
k-fold cross validation with k set to s, the number of initial
samples.
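A minimal Python sketch of k-fold cross validation (the data and the simple 1-nearest-neighbour "classifier" are placeholders, not from the text):

import random

def k_fold_accuracy(samples, labels, k, train_and_predict):
    indices = list(range(len(samples)))
    random.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]            # k roughly equal, mutually exclusive folds
    correct = 0
    for i in range(k):
        test_idx = folds[i]                              # fold i is the test set in iteration i
        train_idx = [j for f in range(k) if f != i for j in folds[f]]
        predictions = train_and_predict(
            [samples[j] for j in train_idx], [labels[j] for j in train_idx],
            [samples[j] for j in test_idx])
        correct += sum(p == labels[j] for p, j in zip(predictions, test_idx))
    return correct / len(samples)                        # total correct over the k iterations

def one_nn(train_x, train_y, test_x):                    # placeholder classifier
    return [train_y[min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))] for x in test_x]

data, labels = [1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]
print(k_fold_accuracy(data, labels, 3, one_nn))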
Increasing classifier accuracy
Bagging (or bootstrap aggregation) and
boosting are two such techniques (Figure).
Each combines a series of T learned
classifiers, C1, C2, ..., CT, with the aim of
creating an improved composite classifier,
C*.
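A brief sketch of bagging with scikit-learn (the library and the toy data are assumptions; by default the base learner is a decision tree):

from sklearn.ensemble import BaggingClassifier

X = [[25, 1], [60, 8], [15, 0], [80, 12], [40, 3], [70, 10]]
y = ["risky", "safe", "risky", "safe", "risky", "safe"]

# T = 10 classifiers are trained on bootstrap samples of the data;
# their predictions are combined by voting into the composite classifier C*.
composite = BaggingClassifier(n_estimators=10).fit(X, y)
print(composite.predict([[30, 2]]))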
Cluster Analysis
Clustering is the process of grouping the data into classes or
clusters so that objects within a cluster have high similarity in
comparison to one another, but are very dissimilar to objects in
other clusters.
Dissimilarities are assessed based on the attribute values describing
the objects.
Often, distance measures are used.
Clustering is known as unsupervised learning because the
class label information is not present
Because a cluster is a collection of data objects that are similar to
one another within the cluster and dissimilar to objects in other
clusters, a cluster of data objects can be treated as an implicit class.
In this sense, clustering is sometimes called automatic
classification.
Applications of clustering
Cluster analysis has been widely used in many applications such as
business intelligence,image pattern recognition, Web search,
biology, and security.
In business intelligence, clustering can be used to organize a
large number of customers into groups, where customers within a
group share strongly similar characteristics.
This facilitates the development of business strategies for enhanced
customer relationship management.
In image recognition, clustering can be used to discover
clusters or “subclasses” in handwritten character recognition
systems.
There can be a large variance in the way in which people write the
same digit. Based on these variations, subclasses can be formed,
which improves the overall image recognition process.
Applications of clustering
In Web search, a keyword search may often return a very large
number of hits due to the extremely large number of web pages.
Clustering can be used to organize the search results into groups
and present the results in a concise and easily accessible way.
Clustering techniques have been developed to cluster documents
into topics, which are commonly used in information retrieval
practice.
As a data mining function, cluster analysis can be used as a
standalone tool to gain insight into the distribution of data, to
observe the characteristics of each cluster, and to focus on a
particular set of clusters for further analysis
Requirements of clustering in data mining
Scalability: Many clustering algorithms work well on small
data sets; however, clustering on only a sample of a given
large data set may lead to biased results. Highly scalable
clustering algorithms are therefore needed.
Ability to deal with different types of attributes:
Many algorithms are designed to cluster interval-based
(numerical) data. However, applications may require
clustering other types of data, such as binary, categorical
(nominal), and ordinal data, or mixtures of these data types
Requirements of clustering in data mining
Discovery of clusters with arbitrary shape: Most algorithms
find spherical clusters with similar size and density. However, a
cluster could be of any shape, so it is important to develop
algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to
determine input parameters.
Ability to deal with noisy data.
Insensitivity to the order of input records: The order of the
records should not affect the clustering results.
High dimensionality: Algorithms should work well for
high-dimensional data.
Constraint-based clustering: Real-world applications may
need to perform clustering under various kinds of constraints, so
algorithms must be able to cope with a variety of constraints.
Interpretability and usability: Users expect clustering results
to be interpretable, comprehensible, and usable.
A categorization of major clustering methods
A large number of clustering algorithms exist in the literature.
The choice of clustering algorithm depends both on the type
of data available and on the particular purpose and
application.
In general, major clustering methods can be classified into
the following categories.
1. Partitioning methods
Given a database of n objects or data tuples, a partitioning method
constructs k partitions of the data, where each partition represents
a cluster, and k ≤ n.
That is, it classifies the data into k groups, which together satisfy
the following requirements.
(1) Each group must contain at least one object, and
(2) Each object must belong to exactly one group.
Given k, the number of partitions to construct, a partitioning
method creates an initial partitioning.
It then uses an iterative relocation technique which attempts to
improve the partitioning by moving objects from one group to
another.
Eg:(1) the k-means algorithm, where each cluster is represented
by the mean value of the objects in the cluster;
(2) the k-medoids algorithm, where each cluster is represented by
one of the objects located near the center of the cluster
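As an illustration of the iterative relocation idea, here is a minimal k-means sketch (pure Python, 1-D toy data assumed):

import random

def k_means(points, k, iterations=100):
    means = random.sample(points, k)                       # initial cluster representatives
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assign each object to its nearest mean
            clusters[min(range(k), key=lambda i: abs(p - means[i]))].append(p)
        means = [sum(c) / len(c) if c else means[i]        # recompute each cluster's mean value
                 for i, c in enumerate(clusters)]
    return clusters

print(k_means([1.0, 1.2, 0.8, 5.0, 5.3, 4.9], k=2))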
2. Hierarchical methods.
A hierarchical method creates a hierarchical decomposition
of the given set of data objects.
A hierarchical method can be classified as being either
agglomerative or divisive, based on how the hierarchical
decomposition is formed.
The agglomerative approach, also called the "bottom-up"
approach, starts with each object forming a separate group.
It successively merges the objects or groups close to one
another, until all of the groups are merged into one (the
topmost level of the hierarchy), or until a termination
condition holds
2. Hierarchical methods.
The divisive approach, also called the "top-down" approach,
starts with all the objects in the same cluster.
In each successive iteration, a cluster is split up into smaller
clusters, until eventually each object is in one cluster, or until
a termination condition holds.
Hierarchical methods suffer from the fact that once a step
(merge or split) is done, it can never be undone.
3. Density-based methods
Most partitioning methods can find only spherical-shaped
clusters and encounter difficulty in discovering clusters of
arbitrary shape.
Density-based methods, which are based on the notion of density,
can produce clusters of arbitrary shape.
The general idea is to continue growing the given cluster as
long as the density (number of objects or data points) in the
“neighborhood" exceeds some threshold.
That is, for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a
minimum number of points.
Eg: DBSCAN, OPTICS
4. Grid-based methods
Grid-based methods quantize the object space into a finite
number of cells which form a grid structure.
All of the clustering operations are performed on the grid
structure (i.e., on the quantized space).
The main advantage of this approach is its fast processing
time which is typically independent of the number of data
objects, and dependent only on the number of cells in each
dimension in the quantized space.
Eg: STING
K-medoids clustering algorithm
A problem with the k-means algorithm is that, if outliers are present,
it may cluster the data points wrongly, because the distance between
a centroid and outlying data objects can be very large.
Steps
1.Initialize: randomly select k of the n data points as
the medoids.
2. Assignment step: Associate each data point with the closest
medoid by calculating its distance from each medoid.
(Manhattan distance is used, which for two points (x1, y1) and
(x2, y2) is |x2 - x1| + |y2 - y1|.)
3. Update step: For each medoid m and each data
point o associated with m, swap m and o and compute the total
cost of the configuration (that is, the average dissimilarity
of o to all the data points associated with m). Select the
point o with the lowest configuration cost as the new medoid.
4. Repeat steps 2 and 3 alternately until there is no change in
the assignments (see the sketch below).
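A minimal sketch of these steps in Python (2-D points with Manhattan distance; the sample points are assumptions for illustration):

import random

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    # Total dissimilarity: each point's distance to its closest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, iterations=100):
    medoids = random.sample(points, k)                     # step 1: initialize
    for _ in range(iterations):
        best = list(medoids)
        for m in medoids:                                  # step 3: try swapping each medoid
            for o in points:                               #         with a non-medoid point
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(points, candidate) < total_cost(points, best):
                    best = candidate
        if best == medoids:                                # step 4: stop when nothing changes
            break
        medoids = best
    clusters = {m: [] for m in medoids}                    # step 2: final assignment
    for p in points:
        clusters[min(medoids, key=lambda m: manhattan(p, m))].append(p)
    return clusters

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)]
print(k_medoids(data, k=2))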
Use the k-medoids algorithm to cluster the following data into 2
clusters.
DBSCAN Algorithm
What is DBSCAN ?
➢ DBSCAN is an algorithm that can identify clusters in large
spatial data sets by looking at the local density of the data
elements.
➢ DBSCAN can find clusters of arbitrary shape. Points
that lie close to each other tend to belong to the same cluster.
Terminologies
Eps: Maximum radius of the neighborhood
MinPts: Minimum no of points in the Eps-neighborhood of a
point.
The Eps-neighbourhood of a point p: N(p) = {q ∈ D | dist(p, q) <= Eps}.
For a point to belong to a cluster, it needs to have at least one
other point that lies closer to it than the distance Eps.
Core point: A point is a core point if and only if it has at
least MinPts points in its Eps-neighborhood.
Border point: A point is a border point if it does not have MinPts
neighbors itself but shares a neighborhood with at least one
core point.
Noise point: A point that does not share a neighborhood with
any core point.
Terminologies
Directly density reachable: A point p is directly density
reachable from a point q with respect to Eps and MinPts if
1. p belongs to the neighborhood of q, i.e., dist(q, p) <= Eps, and
2. q is a core point.
Terminologies
Density reachable: A point p is density reachable from a point q
with respect to Eps and MinPts if there is a chain of points
p1, p2, ..., pn with p1 = q and pn = p such that pi+1 is directly
density reachable from pi.
Density connected: A point p is density connected to a point q
with respect to Eps and MinPts if there is a point o such that both
p and q are density reachable from o with respect to Eps and
MinPts.
DBSCAN algorithm
Label all points as core, border, or noise points.
Put an edge between each pair of core points that are within Eps of
each other.
Make each group of connected core points into a separate
cluster.
Assign each of the border points to one of the clusters of its
associated core points.
If a point is a noise point, it is either ignored or kept as a
reference to check different aspects of the dataset.
When all the points have been processed, the algorithm
terminates.
Pseudo code
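A minimal Python sketch of the procedure described above (Euclidean distance and the sample points are assumptions, not from the slide):

from math import dist

def region_query(points, p, eps):
    # Eps-neighbourhood of p: all points within distance eps (including p itself).
    return [q for q in points if dist(p, q) <= eps]

def dbscan(points, eps, min_pts):
    labels = {}                                          # point -> cluster id (or "noise")
    cluster_id = 0
    for p in points:
        if p in labels:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:
            labels[p] = "noise"                          # may later become a border point
            continue
        cluster_id += 1                                  # p is a core point: start a new cluster
        labels[p] = cluster_id
        seeds = list(neighbours)
        while seeds:                                     # grow the cluster through core points
            q = seeds.pop()
            if labels.get(q) == "noise":
                labels[q] = cluster_id                   # border point of this cluster
            if q in labels:
                continue
            labels[q] = cluster_id
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts:             # q is also a core point: expand further
                seeds.extend(q_neighbours)
    return labels

data = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]
print(dbscan(data, eps=2, min_pts=2))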
Advantages and disadvantages
➢ Advantages
Does not require a predefined number of clusters.
Clusters can be of any shape.
Able to identify noisy data.
➢Disadvantages
Density-based algorithms fail if there is no density drop
between the clusters.
They are also sensitive to the parameters that define the
density (Eps and MinPts).
Proper parameter setting may require domain knowledge.
Q: Execute the DBSCAN algorithm (Eps = 1.5, MinPts = 2) for the
data given below. Determine whether the points are core,
border, or noise. What are the resulting clusters?
Point   x1    x2
A       1     1
B       1.5   1.5
C       5     5
D       3     4
E       4     4
F       3     3.5
Hierarchical Clustering method
A hierarchical clustering method works by grouping data
objects into a tree of clusters.
A tree data structure, called a dendrogram, can be used to
represent the process of hierarchical clustering.
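A brief sketch of building such a tree with SciPy (the library choice and the sample points are assumptions):

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [9, 8]]
merges = linkage(points, method="single")   # agglomerative: successively merge the closest groups
dendrogram(merges)                          # the dendrogram records the sequence of merges
plt.show()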