BDA Unit 2
INDEX:
Clustering
o K-means
Overview
Use Cases
Overview of the Method
Steps
Example
Determining the Number of Clusters
Diagnostics
Reasons to Choose and Cautions
Classification
o Working Model
o Decision Trees
Overview of a Decision Tree
The General Algorithm
Decision Tree Algorithms
Evaluating a Decision Tree
Decision Trees in R
o Naïve Bayes
Bayes‘ Theorem
Naïve Bayes Classifier
LECTURE NOTES:
Supervised Learning:
Learning with class labels is called supervised learning.
Example: Classification, Prediction
o Decision Tree, Naïve Bayes
Unsupervised Learning:
Learning without class labels is called unsupervised learning.
Example: Clustering, Association Rules
o K-Means, Apriori Algorithm
Clustering - Definition:
Grouping of similar data items is called clustering.
Similar objects are grouped in one cluster and dissimilar objects are grouped in
another cluster.
For example, based on customers’ personal income, it is straightforward to divide the
customers into three groups:
o Earn less than $10,000
o Earn between $10,000 and $99,999
o Earn $100,000 or more
Clustering – Applications:
Clustering can be helpful in many fields, such as:
o Marketing: Clustering helps to find groups of customers with similar behavior in a
given set of customer records.
o Biology: Clustering of plants and animals according to their features.
o Library: Clustering of similar books.
Requirements of Clustering:
Scalability − We need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes − Algorithms should be capable of being
applied to any kind of data, such as interval-based (numerical), categorical, and binary
data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be capable
of detecting clusters of arbitrary shape. It should not be bounded to distance measures
that tend to find only small spherical clusters.
High dimensionality − The clustering algorithm should be able to handle not only low-
dimensional data but also high-dimensional data.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
Interpretability − The clustering results should be interpretable, comprehensible, and
usable.
Types of Clustering:
Partitioning: In this approach, several partitions are created and then evaluated based on
given criteria.
o Example: K-Means, K-Medoids
Hierarchical method: In this method, the set of data objects is decomposed hierarchically
(at multiple levels) according to certain criteria.
Density-based method: This method is based on density (density reachability and density
connectivity).
Grid-based methods: This approach is based on a multi-resolution grid data structure.
K-MEANS:
It is one of the partitioning-based clustering approaches.
In this approach, several partitions are created and then evaluated based on given
criteria.
K-means clustering aims to partition n observations into k clusters.
Each observation belongs to the cluster with the nearest mean.
Use Cases:
Image Processing: Video is one example of the growing volumes of unstructured data
being collected. Within each frame of a video, k-means analysis can be used to identify
objects in the video. For each frame, the task is to determine which pixels are most similar
to each other. The attributes of each pixel can include brightness, color, and location, the x
and y coordinates in the frame. With security video images, for example, successive frames
are examined to identify any changes to the clusters. These newly identified clusters may
indicate unauthorized access to a facility.
Medical: Patient attributes such as age, height, weight, systolic and diastolic blood
pressures, cholesterol level, and other attributes can identify naturally occurring clusters.
These clusters could be used to target individuals for specific preventive measures or
clinical trial participation. Clustering, in general, is useful in biology for the classification
of plants and animals as well as in the field of human genetics.
Customer Segmentation: Marketing and sales groups use k-means to better identify
customers who have similar behaviors and spending patterns. For example, a wireless
provider may look at the following customer attributes: monthly bill, number of text
messages, data volume consumed, minutes used during various daily periods, and years as
a customer. The wireless company could then look at the naturally occurring clusters and
consider tactics to increase sales or reduce the customer churn rate, the proportion of
customers who end their relationship with a particular company.
Overview of the Method:
Considering the same data set, let us solve the problem using K-Means clustering (taking
K = 2).
The first step in k-means clustering is the allocation of two centroids randomly (as K=2).
Two random points are assigned as centroids. Note that the points can be anywhere, as
they are chosen at random.
The next step is to determine the actual centroid for these two clusters. The original
randomly allocated centroid is to be repositioned to the actual centroid of the clusters.
This process of calculating the distance and repositioning the centroid continues until
we obtain our final cluster. Then the centroid repositioning stops.
As seen above, once the centroids need no further repositioning, the algorithm has
converged, and we have the two clusters, each with its own centroid.
Steps:
Input:
x: Records
k: Number of clusters
Working Principle:
1) Randomly select K data points as the initial cluster centers (centroids).
2) Calculate the distance between each data point and the cluster centers, using the Euclidean
distance:

$$d(x, c) = \sqrt{\sum_{j=1}^{m} (x_j - c_j)^2}$$

where $x_j$ and $c_j$ are the j-th attribute values of the data point and the cluster center.
3) Assign each data point to the cluster center whose distance from it is the minimum over all
the cluster centers.
4) Recalculate each cluster center as the mean of all data points assigned to it.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned, then stop; otherwise repeat from step 3.
(A small R sketch of steps 2 to 4 appears after the Output below.)
Output:
Number of records in “K” clusters
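A minimal R sketch of steps 2 to 4 above, using a small illustrative data set and two arbitrary starting centroids (these values are assumptions, not from the notes):

# 4 records with 2 attributes, and K = 2 centroids chosen arbitrarily
x <- matrix(c(1, 1,
              2, 1,
              8, 9,
              9, 8), ncol = 2, byrow = TRUE)
centers <- matrix(c(1, 2,
                    9, 9), ncol = 2, byrow = TRUE)
# Step 2: Euclidean distance from every record to every centroid
d <- sapply(1:nrow(centers),
            function(i) sqrt(rowSums(sweep(x, 2, centers[i, ])^2)))
# Step 3: assign each record to the nearest centroid
cluster <- apply(d, 1, which.min)
# Step 4: recompute each centroid as the mean of its assigned records
new_centers <- t(sapply(1:nrow(centers),
                        function(i) colMeans(x[cluster == i, , drop = FALSE])))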
Determining the Number of Clusters:
A common heuristic is to run k-means for several values of K and examine the within sum of
squares (WSS). WSS is the sum of the squares of the distances between each data point and the
closest centroid; the value of K at the "elbow" of the WSS curve, beyond which WSS decreases
only slowly, is a reasonable choice.
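Written out, with $x_i$ the data points and $c_j$ the cluster centroids:

$$\mathrm{WSS} = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2$$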
Diagnostics:
The heuristic using WSS can provide several possible values of k to consider (see the sketch
after the questions below). When the number of attributes is relatively small, a common approach
to further refine the choice of k is to plot the data and determine how distinct the identified
clusters are from each other. In
general, the following questions should be considered.
o Are the clusters well separated from each other?
o Do any of the clusters have only a few points?
o Do any of the centroids appear to be too close to each other?
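A minimal R sketch of the WSS ("elbow") heuristic, assuming a numeric data frame named data:

# WSS for k = 1..10 (k must not exceed the number of distinct records)
wss <- sapply(1:10, function(k) kmeans(data, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Within-cluster sum of squares (WSS)")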
K-Means Clustering - Reasons to Choose (+) and Cautions (-):
Object Attributes
Regarding which object attributes (for example, age and income) to use in the analysis, it
is important to understand what attributes will be known at the time a new object will be
assigned to a cluster.
Units of Measure
From a computational perspective, the k-means algorithm is somewhat indifferent to the
units of measure for a given attribute (for example, meters or centimeters for a patient’s
height). However, the algorithm will identify different clusters depending on the choice of
the units of measure.
Rescaling
Attributes that are expressed in dollars are common in clustering analyses and can differ in
magnitude from the other attributes.
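A minimal R sketch of rescaling before clustering, assuming a numeric data frame named data, so that a dollar-valued attribute does not dominate the distance calculation:

scaled_data <- scale(data)              # standardize each attribute (zero mean, unit variance)
km <- kmeans(scaled_data, centers = 3)  # cluster on the rescaled attributes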
Classification – Definition:
The goal of classification is to find a model to accurately predict the target class for
each record in the data set.
It is a process that describes and distinguishes data classes and concepts.
For example, a classification model could be used to identify loan applicants as low,
medium, or high credit risks.
Steps in Classification:
Classification is a two-step process:
o Model Construction (Learning Step / Training Phase): A classification model is
constructed. Different algorithms are used to build a classifier by making the
model learn from the available training set. The model has to be trained so that
it predicts accurate results.
o Model Usage (Classification Step): The model is used to predict class labels; the
constructed model is tested on test data to estimate the accuracy of the
classification rules.
NAÏVE BAYES:
Naïve Bayes is a probabilistic classification method based on Bayes’ theorem.
Bayes’ theorem gives the relationship between the probabilities of two events and
their conditional probabilities.
A naïve Bayes classifier assumes that the presence or absence of a particular feature of
a class is unrelated to the presence or absence of other features. For example, an
object can be classified based on its attributes such as shape, color, and weight. A
reasonable classification for an object that is spherical, yellow, and less than 60 grams
in weight may be a tennis ball. Even if these features depend on each other or upon the
existence of the other features, a naïve Bayes classifier considers all these properties to
contribute independently to the probability that the object is a tennis ball.
The input variables are generally categorical, but variations of the algorithm can accept
continuous variables. There are also ways to convert continuous variables into
categorical ones. This process is often referred to as the discretization of continuous
variables. In the tennis ball example, a continuous variable such as weight can be
grouped into intervals to be converted into a categorical variable. For an attribute such
as income, the attribute can be converted into categorical values as shown below.
o Low Income: income < $10,000
o Working Class: $10,000 ≤ income < $50,000
o Middle Class: $50,000 ≤ income < $1,000,000
o Upper Class: income ≥ $1,000,000
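A sketch of this discretization in R using cut(), assuming a numeric vector named income (the values below are purely illustrative):

income <- c(5000, 23000, 250000, 1200000)
breaks <- c(-Inf, 10000, 50000, 1000000, Inf)
labels <- c("Low Income", "Working Class", "Middle Class", "Upper Class")
# right = FALSE makes each interval of the form [lower, upper), matching the list above
income_class <- cut(income, breaks = breaks, labels = labels, right = FALSE)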
Examples:
o Classifying text documents, spam filtering
Bayes’ Theorem:
Class Probabilities: The probabilities of each class in the training dataset.
Conditional Probabilities: The conditional probabilities of each input value given each
class value.
The conditional probability of class c given predictor x is given by Bayes' theorem:

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

Here,
o P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
o P(c) is the prior probability of class.
o P(x|c) is the likelihood which is the probability of predictor given class.
o P(x) is the prior probability of predictor.
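Under the naïve conditional-independence assumption, the classifier combines the individual attribute likelihoods; for attributes $x_1, \dots, x_n$:

$$P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{j=1}^{n} P(x_j \mid c)$$

The class with the largest value is chosen as the prediction.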
Diagnostics:
Naïve Bayes classifiers can handle missing values.
Naïve Bayes is also robust to irrelevant variables.
Naïve Bayes is computationally efficient and is able to handle high-dimensional data
efficiently.
Naïve Bayes assumes the variables in the data are conditionally independent. Therefore, it
is sensitive to correlated variables.
Data scarcity can lead to a loss of accuracy. In particular, there is the zero-frequency
problem: if a category of a categorical variable is not seen in the training data set, the model
assigns a zero probability to that category and a prediction cannot be made. This is commonly
handled by smoothing, for example Laplace (add-one) smoothing, as in the sketch below.
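A minimal R sketch, assuming the e1071 package (not used elsewhere in these notes) and a tiny illustrative training set, showing how Laplace smoothing is requested to avoid zero frequencies:

library(e1071)                                   # assumed package providing naiveBayes()
train <- data.frame(                             # illustrative data, not from the notes
  colour = factor(c("yellow", "yellow", "green", "red", "red")),
  class  = factor(c("ball", "ball", "apple", "apple", "apple"))
)
model <- naiveBayes(class ~ colour, data = train, laplace = 1)   # add-one smoothing
predict(model, data.frame(colour = factor("green", levels = levels(train$colour))))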
Example 1:
To demonstrate the concept of Naïve Bayes Classification, consider the example displayed in the
illustration above. As indicated, the objects can be classified as either GREEN or RED. Our task
is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the
currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case
(which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED.
In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based
on previous experience, in this case the percentage of GREEN and RED objects, and often used to
predict outcomes before they actually happen.
Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities
for class membership are:

Prior probability of GREEN = 40/60
Prior probability of RED = 20/60
Having formulated our prior probability, we are now ready to classify a new object (WHITE
circle). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or
RED) objects there are in the vicinity of X, the more likely it is that the new case belongs to that
particular color.
To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen
a priori) of points irrespective of their class labels. Then we calculate the number of points in the
circle belonging to each class label. From this we calculate the likelihoods:

Likelihood of X given GREEN = number of GREEN objects in the vicinity of X / total number of GREEN objects
Likelihood of X given RED = number of RED objects in the vicinity of X / total number of RED objects
From the illustration above, it is clear that the likelihood of X given GREEN is smaller than the
likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:

Likelihood of X given GREEN = 1/40
Likelihood of X given RED = 3/20
In the Bayesian analysis, the final classification is produced by combining both sources of
information, the prior and the likelihood, to form a posterior probability:

Posterior probability of X being GREEN ∝ 40/60 × 1/40 = 1/60
Posterior probability of X being RED ∝ 20/60 × 3/20 = 1/20

Finally, we classify X as RED since its class membership achieves the largest posterior probability.
DECISION TREE:
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and
each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a
test on an attribute. Each leaf node represents a class.
Step1: The first step will be to create a root node.
Step2: If all the observations in the current node have the same class label (for example, all
"yes"), return a leaf node with that label ("yes" or "no").
Step3: Otherwise, find the entropy of all observations, E(S), and the entropy with respect to each
attribute "x", E(S, x).
Step4: Find the information gain of each attribute and select the attribute with the highest
information gain.
Step5: Repeat the above steps until all attributes are covered.
Attribute selection measure is a heuristic for selecting the splitting criterion that "best"
separates a given data partition, D, of class-labeled training tuples into individual classes.
It determines how the tuples at a given node are to be split.
Information Gain
This is the main measure used to build decision trees. It reduces the information that is required
to classify the tuples and the number of tests that are needed to classify a given tuple. The
attribute with the highest information gain is selected.
The original information needed to classify a tuple in dataset D is given by:

$$\text{Info}(D) = E(S) = -\sum_{i=1}^{m} p_i \log_2 (p_i)$$

where $p_i$ is the probability that a tuple belongs to class $C_i$. The information is encoded in bits,
therefore log to the base 2 is used. E(S) represents the average amount of information required to
find out the class label of a tuple in dataset D; this information measure is also called entropy.
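For illustration (the counts here are assumed, not from the notes): if D contains 9 tuples of class "yes" and 5 tuples of class "no", then

$$\text{Info}(D) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \text{ bits}$$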
The information required for exact classification after partitioning on attribute X is given by:

$$\text{Info}_X(D) = E(S, X) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \text{Info}(D_j)$$

where $|D_j|/|D|$ is the weight of the j-th partition. This represents the information still needed to
classify dataset D after partitioning on X.

Information gain is the difference between the original and the expected information required to
classify the tuples of dataset D:

$$\text{Gain}(X) = \text{Info}(D) - \text{Info}_X(D)$$

The gain is the reduction in required information obtained by knowing the value of X. The attribute
with the highest information gain is chosen as the "best" splitting attribute.
Gain Ratio
Information gain is biased toward attributes with many values, which can produce partitions that
are useless for classification. The gain ratio corrects for this: it splits the training data set into
partitions and also considers the number of tuples in each partition relative to the total number of
tuples. The attribute with the maximum gain ratio is used as the splitting attribute.
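The usual definitions (as used in the C4.5 algorithm) are:

$$\text{SplitInfo}_X(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\frac{|D_j|}{|D|}, \qquad \text{GainRatio}(X) = \frac{\text{Gain}(X)}{\text{SplitInfo}_X(D)}$$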
Gini Index
The Gini index is used for binary splits. It measures the impurity of the training tuples in
dataset D as:

$$\text{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$$

where $p_i$ is the probability that a tuple belongs to class $C_i$. The Gini index calculated for a
binary split of dataset D by attribute A into partitions $D_1$ and $D_2$ is given by:

$$\text{Gini}_A(D) = \frac{|D_1|}{|D|}\,\text{Gini}(D_1) + \frac{|D_2|}{|D|}\,\text{Gini}(D_2)$$
CONFUSION MATRIX
The confusion matrix visualizes the accuracy of a classifier by comparing the actual and predicted
classes. The binary confusion matrix is composed of four cells:

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)
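From these four counts, the overall accuracy (as reported by confusionMatrix in the R program later in these notes) is

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$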
ROC, AUC
An ROC curve (receiver operating characteristic curve) plots the true positive rate against the
false positive rate at various classification thresholds.
AUC stands for "Area under the ROC Curve."
The AUC-ROC curve is a performance measurement for classification problems at various
threshold settings. ROC is a probability curve and AUC represents the degree or measure of
separability: it tells how capable the model is of distinguishing between classes. The higher the
AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
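The two rates plotted by the ROC curve can be written in terms of the confusion-matrix counts:

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$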
ENSEMBLE
Ensemble methods use multiple learning algorithms to obtain better predictive performance than
could be obtained from any of the constituent learning algorithms alone.
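A minimal R sketch of one ensemble method, an ensemble of decision trees (random forest), using the built-in iris data set and the randomForest package (both assumptions, not from the notes):

library(randomForest)                                        # assumed package
rf <- randomForest(Species ~ ., data = iris, ntree = 100)    # 100 trees voting together
print(rf)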
KMEANS – R PROGRAM:
#READ DATA
data<-read.csv("mark.csv")
#PRINT MAXIMUM MARK IN MATHS
print(max(data$maths))
50
#DISPLAY SUMMARY
summary(data)
maths science
Min. : 5 Min. :10
1st Qu.:10 1st Qu.:20
Median :40 Median :40
Mean :29 Mean :33
3rd Qu.:40 3rd Qu.:45
Max. :50 Max. :50
#PRINT DATA
print(data)
maths science
1 10 20
2 40 50
3 5 10
4 40 45
5 50 40
#DRAW PLOT FOR GIVEN DATA
plot(data)
#APPLY K-MEANS
km<-kmeans(data,centers=2)
#DISPLAY CENTERS
print(km$centers)
maths science
1 43.33333 45
2 7.50000 15
#DISPLAY CLUSTERS
print(km$cluster)
2 1 2 1 1
#PLOT CENTERS
plot(km$centers)
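A common way to visualize the result (not part of the original program) is to colour the points by their cluster assignment and overlay the centers:

#PLOT POINTS COLOURED BY CLUSTER WITH CENTERS OVERLAID
plot(data, col = km$cluster, pch = 19)
points(km$centers, col = 1:2, pch = 8, cex = 2)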
DECISION TREE – R PROGRAM:
#READ DATA
data<-read.csv("data.csv")
#DISPLAY DATA
print(data)
6 8 7 9 9 6 S Pass
7 9 9 7 8 8 A Pass
8 8 7 8 9 6 A Pass
9 1 3 2 3 2 RA Fail
10 9 8 7 6 8 S Pass
#DISPLAY SUMMARY
summary(data)
Tamil English Maths Science
Min. :1.000 Min. :3.000 Min. :2.000 Min. :2.000
1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:3.000
Median :8.000 Median :7.000 Median :7.000 Median :6.000
Mean :6.222 Mean :6.333 Mean :6.111 Mean :6.111
3rd Qu.:8.000 3rd Qu.:8.000 3rd Qu.:8.000 3rd Qu.:9.000
Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.000
Social Grade Result
Min. :1.000 A :18 Fail:3
1st Qu.:2.000 RA:27 Pass:7
Median :6.000 S :36
Mean :5.111
3rd Qu.:8.000
Max. :8.000
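The commands that create training and testing are not shown in the notes; a minimal sketch using the caret package (an assumption, since confusionMatrix below is a caret function) is:

#CREATE TRAINING AND TESTING SETS (sketch, not in the original notes)
library(caret)
set.seed(1)
idx <- createDataPartition(data$Result, p = 0.5, list = FALSE)
training <- data[idx, ]
testing <- data[-idx, ]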
#FIND DIMENSIONS
dim(training)
5
dim(testing)
5
#CHECK MISSING VALUE
anyNA(data)
FALSE
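The step that builds the decision tree and produces test_pred is also not shown; a sketch using caret's train() with an rpart tree (an assumption, and it requires the rpart package) is:

#FIT A DECISION TREE AND PREDICT ON THE TEST DATA (sketch, not in the original notes)
dtree_fit <- train(Result ~ ., data = training, method = "rpart")
test_pred <- predict(dtree_fit, newdata = testing)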
#ACCURACY
confusionMatrix(test_pred,testing$Result)
Reference
Prediction Fail Pass
Fail 1 0
Pass 0 4
Accuracy : 1
#CHECK DATA FRAME
print(is.data.frame(data))
TRUE