2. Classification and clustering algorithms
PHDS-634
Topic 2: Classification, Clustering and Bayesian Theory
Classification and Clustering
Modeling from past data is a crucial tool to help
experts and physicians make more informed
decisions.
Machine learning models provide evidence about
how things happen, which experts can use to
better understand a public health situation and to
choose appropriate interventions.
Understanding the value of modeling in the
field of health science is therefore beneficial.
Basic Algorithms
⚫ Classification: also called supervised learning, maps data
into predefined groups or classes to support prediction.
⚫ Clustering: also called unsupervised learning, groups
similar data together into clusters.
⚫ It is used to find appropriate groupings of elements for
a set of data.
⚫ Unlike classification, clustering is a kind of
undirected knowledge discovery or unsupervised
learning; i.e., there is no target field, and the
relationships among the data are identified by a
bottom-up approach.
Supervised vs. Unsupervised Learning
⚫ Supervised learning (classification)
⚫ Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
⚫ New data is classified based on the training set
CLASSIFICATION
⚫ Given a collection of records (the training set), each record
contains a set of attributes, one of which is the class
(label) attribute.
⚫ Goal: find a model for the class attribute as a function of the
values of the other attributes.
⚫ Previously unseen records should be assigned a class as
accurately as possible. A test set is used to determine the
accuracy of the model.
⚫ Usually, the data set is therefore divided into training and test sets.
⚫ For example:
⚫ One can use a classification model to predict whether a patient is
HIV +ve or -ve
⚫ Classification can also be used to categorize the effectiveness of a
drug as “High”, “Moderate” or “Low”
CLASSIFICATION: A TWO-STEP PROCESS
1. Model construction: describing a set of predetermined classes
⚫ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
⚫ The set of tuples used for model construction is called the training
set
⚫ The model is represented as classification rules, decision trees,
or mathematical formulae
2. Model usage: for classifying new or unseen instances into one
of the class values or labels.
⚫ Estimate the accuracy of the model
⚫ The known label of each test instance is compared with the model’s
predicted class
⚫ The accuracy rate is the percentage of test set samples that are correctly
classified by the model
⚫ The test set should be different from the training set
⚫ If the accuracy is acceptable, use the model to classify new data
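A minimal sketch of this two-step process using scikit-learn (assuming scikit-learn is available; the dataset and the choice of a decision tree are illustrative, not from the slides):

```python
# Minimal sketch of the two-step classification process (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Divide the labeled data into a training set and a separate test set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: model construction - learn the class attribute from the other attributes.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage - classify unseen records and estimate accuracy on the test set.
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```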
Illustrating Classification Task
Instance-Based Methods
⚫ Instance-based learning:
⚫ Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
⚫ Typical approaches
⚫ k-nearest neighbor approach
⚫ Instances represented as points in a Euclidean
space.
⚫ Locally weighted regression
⚫ Constructs local approximation
⚫ Case-based reasoning
⚫ Uses symbolic representations and
knowledge-based inference
Different Learning Methods
⚫ Eager Learning
⚫ Learning = acquiring an explicit structure of a
classifier on the whole training set;
⚫ Classification = an instance gets a classification
using the explicit structure of the classifier.
K-Nearest Neighbors
⚫ K-nearest neighbors is a supervised learning algorithm
in which a new query instance is classified based on the
majority class of its K nearest neighbors.
⚫ The purpose of the algorithm is to classify a new object
based on attributes and training samples (xi, f(xi)), i = 1..N.
⚫ Given a query point Q, we find the K objects
(training points) closest to the query point.
⚫ The decision uses a majority vote among the classes of the K
objects.
⚫ The K-nearest neighbor algorithm uses this neighborhood classification as
the prediction value for the new query instance.
⚫ It uses a distance measure from the query instance to the
training samples to determine the K nearest neighbors.
How does the K-Nearest Neighbor (KNN) Algorithm work?
1. Determine the parameter K = the number of nearest neighbors
2. Calculate the distance between the query instance and
all the training samples
⚫ we can use the Euclidean distance or any other distance/similarity
measure
3. Sort the distances and determine the nearest neighbors based
on the K-th minimum distance
4. Gather the categories of those nearest neighbors
5. Use the simple majority of the categories of the nearest neighbors
as the prediction for the query instance
⚫ Any ties can be broken at random or by another reasonable rule.
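A minimal from-scratch sketch of these five steps (assuming numeric attributes and Euclidean distance; the function and variable names are illustrative):

```python
# Illustrative KNN sketch following steps 1-5 above, assuming numeric attributes.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Step 2: Euclidean distance from the query to every training sample.
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Step 3: sort and take the K nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: gather their categories and take a simple majority vote.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```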
K Nearest Neighbors: Key issues
The key issues involved in training a KNN model include:
⚫ Setting the variable K (the number of nearest neighbors)
⚫ The number of nearest neighbors (K) should be chosen based on cross
validation over a range of K settings.
⚫ k = 1 is a good baseline model to benchmark against.
⚫ A good rule of thumb is that K should be less than or equal to the
square root of the total number of training patterns.
Example
⚫ We have data from people and an objective test with two
attributes (acid durability and strength) used to classify whether a
special medical tissue is good or not. Here are four training
samples:

X1 = Acid durability (seconds)   X2 = Strength (kg/m2)   Y = Classification
7                                7                       Bad
7                                4                       Bad
3                                4                       Good
1                                4                       Good

Test instance
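A short sketch applying KNN (k = 3) to these four samples; the query values (3, 7) are a hypothetical test instance chosen for illustration, since the slide does not show them:

```python
# KNN on the four training samples above; the query (3, 7) is hypothetical.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[7, 7], [7, 4], [3, 4], [1, 4]]   # (acid durability, strength)
y_train = ["Bad", "Bad", "Good", "Good"]

knn = KNeighborsClassifier(n_neighbors=3)    # k = 3, Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.predict([[3, 7]]))                 # majority vote among the 3 nearest samples
```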
KNN: Advantages & Disadvantages
⚫ Advantage
⚫ Simple
⚫ Powerful
⚫ Requires no training time (prediction done online)
⚫ Nonparametric architecture
⚫ Disadvantage: Difficulties with k-nearest
neighbor
⚫ Memory intensive:
⚫ training examples should always be available
⚫ Classification/estimation is slow
⚫ Have to calculate the distance of the test case from all
training cases
⚫ There may be irrelevant attributes amongst the attributes
– the curse of dimensionality
Bayesian Learning
Why Bayesian Classification?
⚫ Provides practical learning algorithms
⚫ Probabilistic learning: Calculate explicit probabilities for
hypothesis. E.g. Naïve Bayes
⚫ Prior knowledge and observed data can be combined
⚫ Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct.
⚫ It is a generative (model based) approach, which offers a
useful conceptual framework
⚫ Probabilistic prediction: Predict multiple hypotheses, weighted
by their probabilities.
E.g., sequences could be classified based on a probabilistic model
specification.
⚫ Any kind of object can be classified based on a probabilistic
model specification.
CONDITIONAL PROBABILITY
⚫ Probability: how likely is it that an event will happen?
⚫ Sample space S
⚫ Events A and C are subsets of S
⚫ Therefore, the conditional probability of A given C is:
P(A|C) = P(A ∩ C) / P(C), provided P(C) > 0
Bayes Theorem
⚫ Example: a medical cancer diagnosis problem. There are two possible outcomes of
a diagnosis: +ve and -ve.
We know that 0.8% of the world population has cancer. The test gives a correct +ve result 98% of the
time and a correct -ve result 97% of the time. If a patient’s test returns +ve,
should we diagnose the patient as having cancer?
P(cancer) = 0.008          P(no-cancer) = 0.992
P(+ve|cancer) = 0.98       P(-ve|cancer) = 0.02
P(+ve|no-cancer) = 0.03    P(-ve|no-cancer) = 0.97
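A short computation of the posterior with Bayes' theorem, P(cancer|+ve) = P(+ve|cancer)·P(cancer) / P(+ve), using the numbers above:

```python
# Bayes' theorem applied to the cancer-diagnosis numbers above.
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_no_cancer = 0.98, 0.03

# Total probability of a positive test.
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * p_no_cancer

# Posterior probability of cancer given a positive test.
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))   # ~0.21, so a single +ve test does not imply cancer
```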
General Bayes Theorem
⚫ Consider each attribute & class label as random variables
⚫ Given a record with attributes (A1, A2,…,An)
⚫ Goal is to predict class C
⚫ Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
⚫ Can we estimate P(C | A1, A2, …, An) directly from the data?
⚫ Approach: compute the posterior probability P(C | A1, A2, …, An)
for all values of C using Bayes' theorem
Example. ‘Play Tennis’ data
• Suppose that someone who loves tennis has a free afternoon and is
thinking about whether or not to go and play. How would you help?
✔ Based on the following training data, predict when this player will play
tennis.
Naive Bayes Classifier
⚫ Given a training set, we can compute the probabilities
P(Play) = 9/14
P(NoPlay) = 5/14
Working (decision rule, illustrated with a mammals vs. non-mammals example):
A: attributes, M: mammals, N: non-mammals
If P(A|M)·P(M) > P(A|N)·P(N), then predict Mammal.
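A minimal sketch of this naive Bayes decision rule for the play-tennis setting; the priors come from the slide, but the conditional probabilities below are hypothetical placeholders, since the training-data table is not reproduced in this text:

```python
# Naive Bayes decision rule sketch; conditional probabilities are hypothetical.
priors = {"Play": 9/14, "NoPlay": 5/14}              # priors from the slide

# Hypothetical P(attribute value | class) estimates for one query day.
likelihoods = {
    "Play":   {"Outlook=Sunny": 2/9, "Humidity=High": 3/9, "Wind=Strong": 3/9},
    "NoPlay": {"Outlook=Sunny": 3/5, "Humidity=High": 4/5, "Wind=Strong": 3/5},
}

scores = {}
for cls, prior in priors.items():
    score = prior
    for p in likelihoods[cls].values():
        score *= p                                    # naive independence assumption
    scores[cls] = score

print(max(scores, key=scores.get), scores)            # pick the class with the larger score
```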
Naive Bayesian Classifier
⚫ Advantages
⚫ Easy to implement
⚫ Good results obtained in most of the cases
⚫ Robust to isolated noise points
⚫ Handles missing values by ignoring the instance during probability
estimate calculations
⚫ Robust to irrelevant attributes
⚫ Disadvantages
⚫ Class conditional independence assumption may not hold for some
attributes, therefore loss of accuracy
⚫ In practice, dependencies exist among variables
⚫ E.g., in hospitals, a patient's profile (age, family history, etc.), symptoms (fever,
cough, etc.) and disease (lung cancer, diabetes, etc.) are interdependent
⚫ Dependencies among these cannot be modeled by the naïve Bayesian classifier
⚫ How to deal with these dependencies?
⚫ Bayesian Belief Networks
Clustering
Paper Review
Journal URL:
http://www.jmlr.org/
Clustering
• Clustering is a data mining (machine learning) technique that
finds similarities between data according to the characteristics
found in the data & groups similar data objects into one cluster
⚫ Given a set of points, with a notion of distance between points, group the
points into some number of clusters, so that members of a cluster are in
some sense as close to each other as possible.
⚫ While data points in the same cluster are similar, those in separate
clusters are dissimilar to one another.
(Figure: scatter plot of points grouped into clusters.)
Example: Clustering
⚫ The example below demonstrates the clustering of padlocks
of the same kind. There are a total of 10 padlocks, which
vary in color, size, shape, etc.
What is Cluster Analysis?
⚫ Cluster: a collection of data objects
⚫ Similar to one another within the same cluster
⚫ Dissimilar to the objects in other clusters
⚫ Cluster analysis
⚫ Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
⚫ Unsupervised learning: no predefined classes
⚫ Typical applications
⚫ As a stand-alone tool to get insight into data distribution
⚫ As a preprocessing step for other algorithms like
information retrieval
Examples of Clustering Applications
⚫ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing programs
● Feature Selection
● identifying the most effective subset of the original
features to use in clustering
● Feature Extraction
● transformations of the input features to produce new
salient features.
● Inter-Pattern Similarity
● measured by a distance function defined on pairs of
patterns.
● Grouping
● methods to group similar patterns in the same cluster
Evaluation
⚫ Intra-cluster cohesion (compactness):
⚫ Cohesion measures how near the data points in a
cluster are to the cluster centroid.
⚫ Sum of squared error (SSE) is a commonly used
measure.
⚫ Inter-cluster separation (isolation):
⚫ Separation means that different cluster centroids
should be far away from one another.
⚫ In most applications, expert judgments are still
the key.
Cluster Evaluation: Hard Problem
⚫ The quality of a clustering is very hard to evaluate
because
⚫ We do not know the correct clusters/classes
⚫ User inspection
⚫ Study centroids of the cluster, and spreads of data items in each
cluster
⚫ For a text document clustering problem, one can read some
documents in each cluster to see whether they are about similar
topics, and thereby judge the quality of the clustering algorithm employed.
Measuring clustering validity
Internal index:
• Validates a clustering without external information
• Can compare clusterings with different numbers of clusters
• Can be used to help determine the number of clusters
Estimating the “right” number of clusters
⚫ Typical approach: find a “knee” in an internal measure
curve.
⚫ Desirable property:
⚫ some clustering algorithms do not require the number of
clusters to be specified (e.g., DBSCAN)
⚫ Discussion topic:
⚫ why not simply pick the k that minimizes the SSE?
Internal Measures: Cohesion and Separation
⚫ Cluster cohesion: measures how closely related the objects
within a single cluster are
⚫ Cluster separation: measures how distinct or
well-separated a cluster is from the other clusters
⚫ Example: Squared Error
⚫ Cohesion is measured by the within cluster sum of squares
(SSE)
Ideally this should be small
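One common way to write this measure (a sketch, where $m_i$ denotes the centroid of cluster $C_i$ and the sum runs over all $k$ clusters):

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^{2}$$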
Internal Measures: Cohesion and Separation
⚫ A proximity graph based approach can also be used for
cohesion and separation.
⚫ Cluster cohesion is the sum of the weight of all links within
a cluster.
⚫ Cluster separation is the sum of the weights between
nodes in the cluster and nodes outside the cluster.
(Figure: proximity-graph illustration of cohesion (links within a cluster) and separation (links between clusters).)
External Measure: Ground Truth
⚫ We use some labeled data (for classification) to check if
the clustering or grouping is acceptable
⚫ Assumption: each class in the labeled data corresponds to a cluster.
(Figure: examples of good and bad clusterings of labeled data.)
           Cluster I   Cluster II   Cluster III
Class 1       250           20            10
Class 2        20          180            80
Class 3        30          100           210
Total         300          300           300
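A small sketch computing cluster purity from this table (assuming rows are classes, columns are clusters, and purity is the chosen external measure):

```python
# Purity of the clustering in the table above (rows = classes, columns = clusters).
counts = [
    [250, 20, 10],    # class 1
    [20, 180, 80],    # class 2
    [30, 100, 210],   # class 3
]

total = sum(sum(row) for row in counts)
# For each cluster (column), take the count of its majority class, then sum.
majority = sum(max(row[j] for row in counts) for j in range(len(counts[0])))
print("Purity:", majority / total)   # (250 + 180 + 210) / 900 ≈ 0.71
```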
A remark about ground truth evaluation
⚫ Commonly used to compare different clustering
algorithms.
⚫ A real-life data set for clustering has no class labels.
⚫ Thus, although an algorithm may perform very well on some
labeled data sets, there is no guarantee that it will perform well on
the actual application data at hand.
⚫ The fact that it performs well on some labeled data sets
does, however, give us some confidence in the quality of the
algorithm.
⚫ This evaluation method is said to be based on
external data or information.
Indirect Evaluation
⚫ In some applications, clustering is not the primary task, but
used to help perform another task.
⚫ We can use the performance on the primary task to compare
clustering methods.
⚫ For instance, in designing a recommender system, the
primary task may be to provide book-purchase
recommendations to online shoppers.
⚫ If we can cluster books according to their features, we might be
able to provide better recommendations.
⚫ We can evaluate different clustering algorithms based
on how well they help with the recommendation task.
⚫ Here, we assume that the recommendation can be reliably
evaluated.
Similarity/Dissimilarity Measures
⚫ Each clustering problem is based on some
kind of distance (“farness” or “nearness”)
measurement between data points.
⚫ The clustering task performs some kind of
comparison:
⚫ Among the members of the same cluster and
⚫ Among members of different clusters
⚫ Thus, distances are normally used to measure
the similarity or dissimilarity between two
data objects
How to Measure Similarity/Dissimilarity?
⚫ A popular similarity measure is the Minkowski distance:
dis(X, Y) = (|x1 − y1|^q + |x2 − y2|^q + … + |xn − yn|^q)^(1/q)
where X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) are two
n-dimensional data objects; n is the size of the attribute vectors of the
data objects; q = 1, 2, 3, …
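A quick check of the special cases q = 1 (Manhattan) and q = 2 (Euclidean) using SciPy; this is an illustrative sketch and the vectors are made up:

```python
# Minkowski distance for q = 1 (Manhattan) and q = 2 (Euclidean); example vectors are made up.
from scipy.spatial.distance import minkowski

x = [2, 10]
y = [8, 4]
print(minkowski(x, y, p=1))   # |2-8| + |10-4| = 12   (Manhattan)
print(minkowski(x, y, p=2))   # sqrt(36 + 36) ≈ 8.49  (Euclidean)
```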
Similarity & Dissimilarity Between Objects
⚫ If q = 2, dis is the Euclidean distance:
dis(X, Y) = (|x1 − y1|^2 + |x2 − y2|^2 + … + |xn − yn|^2)^(1/2)
⚫ Cosine similarity
⚫ If X and Y are two vector attributes of data objects, then the
cosine similarity measure is given by:
cos(X, Y) = (X ∙ Y) / (||X|| ||Y||)
Example: Similarity measure
⚫ Ex: find the cosine similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1∙d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^½ = (42)^½ ≈ 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^½ = (17)^½ ≈ 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
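A short sketch verifying this computation with NumPy (assuming NumPy is available):

```python
# Cosine similarity between the two document vectors above.
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 3))   # ≈ 0.936
```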
The need for representative
⚫ Key problem: as you build clusters, more than one
instance may be placed in the same cluster. How do
you represent the location of each cluster? The
representative of the cluster will be used to tell which pair
of clusters is closest.
⚫ Not all members of a cluster can be used as its
representative; thus it is important to pick one
representative for each cluster.
⚫ For each cluster, assign a centroid (the point closest to all other
points within the cluster), computed as the average of its points.
• One can then measure inter-cluster distances by the distances between cluster centroids.
Major Clustering Approaches
⚫ Partitioning clustering approach:
⚫ Construct various partitions as per the given number of clusters
⚫ Typical methods:
⚫ distance-based: K-means clustering
⚫ model-based: expectation maximization (EM) clustering.
Major Clustering Approaches (I)
⚫ Partitioning approach:
⚫ Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square
errors
⚫ Typical methods: k-means, k-medoids, CLARANS
⚫ Hierarchical approach:
⚫ Create a hierarchical decomposition of the set of data
(or objects) using some criterion
⚫ Typical methods: Diana, Agnes, BIRCH, ROCK,
CAMELEON
⚫ Density-based approach:
⚫ Based on connectivity and density functions
⚫ Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches (II)
⚫ Grid-based approach:
⚫ based on a multiple-level granularity structure
⚫ Typical methods: STING, WaveCluster, CLIQUE
⚫ Model-based:
⚫ A model is hypothesized for each of the clusters, and the method tries to find
the best fit of the data to the given model
⚫ Typical methods: EM, SOM, COBWEB
⚫ Frequent pattern-based:
⚫ Based on the analysis of frequent patterns
⚫ Typical methods: pCluster
⚫ User-guided or constraint-based:
⚫ Clustering by considering user-specified or application-specific
constraints
⚫ Typical methods: COD (obstacles), constrained clustering
Partitioning Algorithms: Basic Concept
⚫ Partitioning method: construct a partition of a database D
of n objects into a set of k clusters, such that the sum of
squared distances is minimized
⚫ Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
⚫ Global optimal: exhaustively enumerate all partitions
⚫ Heuristic methods: k-means and k-medoids algorithms
⚫ k-means: Each cluster is represented by the center of the
cluster
⚫ K is the number of clusters into which the dataset is partitioned
⚫ Means refers to the average location of the members of a particular cluster
⚫ k-medoids or PAM (Partition Around Medoids):
⚫ Each cluster is represented by one of the objects in the
cluster
The K-Means Clustering Algorithm
▪ Given k (number of clusters), the k-means algorithm is
implemented as follows:
1. Select K points randomly as initial centroids
2. Compute the similarity (distance) between each instance and each
cluster centroid
3. Assign each instance to the cluster with the nearest
seed point
4. Re-compute the centroid of each of the K clusters of the
current partition (the centroid is the center, i.e., mean
point, of the cluster)
5. Repeat steps 2–4 until the centroids no longer change
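A compact from-scratch sketch of these steps (illustrative; assumes the data is a 2-D NumPy array of numeric attributes):

```python
# Illustrative k-means sketch following steps 1-5 above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 2-3: compute distances and assign each instance to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: re-compute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```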
The K-Means Clustering Method
⚫ Example
(Figure: with K = 2, the K cluster centers are chosen arbitrarily; each object is assigned to the most similar center, the cluster means are updated, and objects are reassigned; this repeats until the assignment no longer changes.)
Example Problem
⚫ Cluster the following eight points (with (x, y)
representing locations) into three clusters :
A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8)
A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9).
⚫ Assume that the randomly selected initial cluster centers are:
A1(2, 10), A3(8, 4) and A7(1, 2).
⚫ The distance function between two points Aj = (x1, y1)
and Ci = (x2, y2) is defined as the Manhattan distance:
dis(Aj, Ci) = |x2 – x1| + |y2 – y1|
⚫ Use k-means algorithm to find optimal centroids to
group the given data into three clusters.
Iteration 1
⚫ Starting from point A1(2, 10) calculate the distance to each of
the three means, by using the distance function:
dis (A1, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0
dis(A1, mean2) = |8 – 2| + |4 – 10| = 6 + 6 = 12
dis(A1, mean3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
⚫ Fill these values in the table and decide in which cluster the point (2, 10)
should be placed: the cluster whose mean is at the shortest distance –
i.e., mean 1 (cluster 1), since the distance is 0.
⚫ Next go to the second point A2(2, 5) and calculate the distance:
dis(A2, mean1) = |2 – 2| + |10 – 5| = 0 + 5 = 5
dis(A2, mean2) = |8 – 2| + |4 – 5| = 6 + 1 = 7
dis(A2, mean3) = |1 – 2| + |2 – 5| = 1 + 3 = 4
⚫ So, we fill in these values in the table and assign the point (2, 5) to
cluster 3, since mean 3 is the closest to A2.
⚫ Analogously, we fill in the rest of the table and place each point
in one of the clusters.
Iteration 1
First we list all points in the first column of the table below. The initial cluster
centers (centroids) are (2, 10), (8, 4) and (1, 2), chosen randomly.
The table shows the distance of each data point (instance) from the chosen
centroids. The last column shows the cluster to which the instance should be assigned,
based on its distance from the centroids.
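A short sketch that reproduces this first iteration (distances and assignments), assuming the Manhattan distance and the initial centroids given above:

```python
# Iteration 1 of k-means on the eight points, with Manhattan distance.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centroids = [(2, 10), (8, 4), (1, 2)]           # initial cluster centers

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

for name, p in points.items():
    dists = [manhattan(p, c) for c in centroids]
    cluster = dists.index(min(dists)) + 1       # nearest centroid wins
    print(name, dists, "-> cluster", cluster)
```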
Comments on the K-Means Method
⚫ Strength: simple assumptions and relatively efficient:
O(tkn), where n is the number of objects, k is the number
of clusters, and t is the number of iterations. Normally k, t << n.
⚫ Weakness
⚫ Applicable only when a mean is defined, and K, the number
of clusters, must be specified in advance
⚫ One can use hierarchical clustering to overcome this weakness
Hierarchical clustering
(Figure: dendrogram showing how cutting the hierarchy at different levels yields K = 1, 2, …, 8 clusters.)
Two main types of hierarchical clustering
⚫ Agglomerative: a bottom-up clustering technique
⚫ Start with all sample units in n clusters of size 1.
⚫ Then, at each step of the algorithm, the pair of clusters with the shortest distance
is combined into a single cluster.
⚫ The algorithm stops when all sample units are grouped into one cluster of size n.
⚫ Divisive: a top-down clustering technique
⚫ Start with all sample units in a single cluster of size n.
⚫ Then, at each step of the algorithm, a cluster is partitioned into a pair of clusters,
selected to maximize the distance between the two resulting clusters.
⚫ The algorithm stops when the sample units are partitioned into n clusters of size 1.
Example: clustering analysis with the agglomerative algorithm (figures)
• Start from the data matrix and compute the distance matrix.
• Iteration 1: update the distance matrix.
• Iterations 2–4: merge the two closest clusters and update the distance matrix.
• Final result: the termination condition is met.
Example: dendrogram tree representation
1. In the beginning we have 6 clusters: A, B, C, D, E and F.
2. We merge clusters D and F into cluster (D, F) at distance 0.50.
3. We merge cluster A and cluster B into (A, B) at distance 0.71.
(The vertical axis of the dendrogram shows the merge distance, i.e., the cluster lifetime.)
Data item    X     Y
1            4     4
2            8     4
3           15     8
4           24     4
5           24    12
Proximity Matrix: First epoch
             1=(4,4)   2=(8,4)   3=(15,8)   4=(24,4)   5=(24,12)
1=(4,4)         -         4        15         20         28
2=(8,4)         4         -        11         16         24
3=(15,8)       15        11         -         13         13
4=(24,4)       20        16        13          -          8
5=(24,12)      28        24        13          8          -
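A sketch reproducing this matrix and the agglomerative merges with SciPy; the entries above match the Manhattan (city-block) distance between the points, so that metric is assumed here:

```python
# Proximity matrix and agglomerative (single-link) clustering of the 5 data items.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

X = np.array([[4, 4], [8, 4], [15, 8], [24, 4], [24, 12]])

# Pairwise city-block (Manhattan) distances; squareform gives the full matrix above.
print(squareform(pdist(X, metric="cityblock")))

# Agglomerative clustering: each row of Z records one merge (clusters, distance, size).
Z = linkage(X, method="single", metric="cityblock")
print(Z)
```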
Exercise: Hierarchical clustering
⚫ Given the following 8 samples having two attributes,
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8),
A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
⚫ Weakness
⚫ Does not scale well: time complexity of at least O(n2),
where n is the total number of data objects
⚫ If there are many data points (examples), the algorithm becomes very slow
Semi Supervised Learning
Use a small amount of labeled data to help label a large amount of
cheap unlabeled data.
Basic idea: similar examples should be given the same
classification.
Typical example:
Web page classification: there is an unlimited amount of cheap
unlabeled data, while labeling is expensive.
Why Semi-Supervised
Because people want better performance at a lower cost:
unlabeled data is cheap
labeled data can be hard to get and is expensive
human annotation is time-consuming, boring and error-prone
labels may require experts
labels may require special devices
How to Semi-Supervise
Use both labeled and unlabeled data to build better
learners than using either one alone.
Semi-supervised learning
□ Label propagation
□ Transductive learning
□ Co-training
□ Active learning
End of Topic 2