
CS8091 - BIG DATA ANALYTICS – Unit II – Lecture Notes

UNIT II - CLUSTERING AND CLASSIFICATION


Advanced Analytical Theory and Methods: Overview of Clustering - K-means - Use Cases -
Overview of the Method - Determining the Number of Clusters - Diagnostics - Reasons to Choose
and Cautions - Classification: Decision Trees - Overview of a Decision Tree - The General
Algorithm - Decision Tree Algorithms - Evaluating a Decision Tree - Decision Trees in R - Naïve
Bayes - Bayes' Theorem - Naïve Bayes Classifier.

INDEX:
 Clustering
o K-means
 Overview
 Use Cases
 Overview of the Method
 Steps
 Example
 Determining the Number of Clusters
 Diagnostics
 Reasons to Choose and Cautions
 Classification
o Working Model
o Decision Trees
 Overview of a Decision Tree
 The General Algorithm
 Decision Tree Algorithms
 Evaluating a Decision Tree
 Decision Trees in R
o Naïve Bayes
 Bayes' Theorem
 Naïve Bayes Classifier

LECTURE NOTES:

Data Mining - Definition:


 Extraction of interesting knowledge from huge amounts of data.

Why Data Mining?


 Explosive Growth of Data
 Data collection and data availability
o Automated data collection tools, database systems, Web, computerized society
o Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks,
 Science: Remote sensing, bioinformatics, scientific simulation,
 Society and everyone: news, digital cameras, YouTube
 Too much data, less knowledge.

Data Mining Techniques:


 Clustering
 Classification/ Prediction
 Association Rules
 Clustering

Supervised Learning:
 Learning with a class label (labeled data) is called supervised learning.
 Example: Classification, Prediction
o Decision Tree, Naïve Bayes

Unsupervised Learning:
 Learning without a class label (unlabeled data) is called unsupervised learning.
 Example: Clustering, Association Rules
o K-Means, Apriori Algorithm

Clustering - Definition:
 Grouping of similar data items is called clustering.

 Similar objects are grouped in one cluster and dissimilar objects are grouped in
another cluster.
 For example, based on customers’ personal income, it is straightforward to divide the
customers into three groups:
o Earn less than $10,000
o Earn between $10,000 and $99,999
o Earn $100,000 or more

Clustering – Applications:
 Clustering can be helpful in many fields, such as:
o Marketing: Clustering helps to find groups of customers with similar behavior from
a given set of customer records.
o Biology: Clustering of plants and animals according to their features.
o Library: Clustering of similar books.

Requirements of Clustering:
 Scalability − We need highly scalable clustering algorithms to deal with large databases.
 Ability to deal with different kinds of attributes − Algorithms should be capable of being
applied to any kind of data, such as interval-based (numerical), categorical, and binary
data.
 Discovery of clusters with arbitrary shape − The clustering algorithm should be capable
of detecting clusters of arbitrary shape. It should not be restricted to distance measures
that tend to find only small spherical clusters.
 High dimensionality − The clustering algorithm should be able to handle not only low-
dimensional data but also high-dimensional data.
 Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
 Interpretability − The clustering results should be interpretable, comprehensible, and
usable.

Types of Clustering:
 Partitioning: In this approach, several partitions are created and then evaluated based on
given criteria.
o Example: K-Means, K-Medoids

 Hierarchical method: In this method, the set of data objects is decomposed hierarchically
(multilevel) by using certain criteria.
 Density-based method: This method is based on density (density reachability and density
connectivity).
 Grid-based method: This approach is based on a multi-resolution grid data structure.

K-MEANS:
 It is one of the partitioning-based clustering approaches.
 In this approach, several partitions are created and then evaluated based on given
criteria.
 K-means clustering aims to partition n observations into k clusters.
 Each observation belongs to the cluster with the nearest mean.

Use Cases:
 Image Processing: Video is one example of the growing volumes of unstructured data
being collected. Within each frame of a video, k-means analysis can be used to identify
objects in the video. For each frame, the task is to determine which pixels are most similar
to each other. The attributes of each pixel can include brightness, color, and location, the x
and y coordinates in the frame. With security video images, for example, successive frames
are examined to identify any changes to the clusters. These newly identified clusters may
indicate unauthorized access to a facility.
 Medical: Patient attributes such as age, height, weight, systolic and diastolic blood
pressures, cholesterol level, and other attributes can identify naturally occurring clusters.
These clusters could be used to target individuals for specific preventive measures or
clinical trial participation. Clustering, in general, is useful in biology for the classification
of plants and animals as well as in the field of human genetics.
 Customer Segmentation: Marketing and sales groups use k-means to better identify
customers who have similar behaviors and spending patterns. For example, a wireless
provider may look at the following customer attributes: monthly bill, number of text
messages, data volume consumed, minutes used during various daily periods, and years as
a customer. The wireless company could then look at the naturally occurring clusters and
consider tactics to increase sales or reduce the customer churn rate, the proportion of
customers who end their relationship with a particular company.

Overview of the Method:

Fig.: Flowchart – K-Means


 For a better understanding of k-means, let's take an example from cricket. Imagine you
received data on a lot of cricket players from all over the world, which gives information
on the runs scored by the player and the wickets taken by them in the last ten matches.
Based on this information, we need to group the data into two clusters, namely batsman
and bowlers.
 Let's take a look at the steps to create these clusters.

 Considering the same data set, let us solve the problem using K-Means clustering (taking
K = 2).

 The first step in k-means clustering is the allocation of two centroids randomly (as K=2).
Two random points are assigned as centroids. Note that these points can be anywhere, as
they are random points.

 The next step is to determine the actual centroid for these two clusters. The original
randomly allocated centroid is to be repositioned to the actual centroid of the clusters.

 This process of calculating the distance and repositioning the centroid continues until
we obtain our final cluster. Then the centroid repositioning stops.

 As seen above, the centroids do not need any more repositioning, which means the algorithm
has converged, and we have the two clusters, each with its centroid.

Steps:

Input:

x: Records

k: Number of clusters

Working Principle:

1) Randomly select ‘k’ cluster centers.

2) Calculate the distance between each data point and cluster centers.

Euclidean Distance between a data point x = (x1, x2, …, xn) and a cluster center c = (c1, c2, …, cn):

d(x, c) = sqrt( (x1 − c1)² + (x2 − c2)² + … + (xn − cn)² )

3) Assign each data point to the cluster center that is closest to it, i.e., the cluster center with the
minimum distance among all the cluster centers.

4) Recalculate the new cluster center.

5) Recalculate the distance between each data point and new obtained cluster centers.

6) If no data point was reassigned then stop, otherwise repeat from step 3.

Output:
Number of records in “K” clusters
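
A minimal from-scratch sketch of this working principle in R, using a few hypothetical
two-dimensional points (the kmeans() function used in the R program later in these notes does all
of this internally):

# Hypothetical 2-D data points and k = 2 centers
x <- matrix(c(10, 20,
              40, 50,
               5, 10,
              40, 45,
              50, 40), ncol = 2, byrow = TRUE)
k <- 2
set.seed(1)
centers <- x[sample(nrow(x), k), ]              # step 1: pick k centers at random

repeat {
  # steps 2-3: assign each point to the nearest center (Euclidean distance)
  d <- sapply(1:k, function(j) sqrt(rowSums(sweep(x, 2, centers[j, ])^2)))
  cluster <- apply(d, 1, which.min)

  # step 4: recompute each center as the mean of the points assigned to it
  # (empty clusters are not handled in this small sketch)
  new_centers <- t(sapply(1:k, function(j) colMeans(x[cluster == j, , drop = FALSE])))

  # steps 5-6: stop once the centers no longer move
  if (all(abs(new_centers - centers) < 1e-9)) break
  centers <- new_centers
}
print(cluster)     # cluster assignment of each point
print(centers)     # final cluster centers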

Determining the Number of Clusters:


 k clusters can be identified in a given dataset, but what value of k should be selected?
The value of k can be chosen based on a reasonable guess or some predefined requirement. A
common heuristic is to compute the Within Sum of Squares (WSS) for several candidate values
of k, plot WSS against k, and choose the k at the “elbow” of the curve, beyond which adding
more clusters gives little further reduction in WSS.

 WSS is the sum of the squares of the distances between each data point and its closest
centroid:

WSS = Σ_i d(x_i, nearest centroid of x_i)²
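
A minimal sketch of this elbow heuristic in R, run here on the numeric columns of the built-in iris
data set; kmeans() reports the WSS as tot.withinss:

# Compute WSS for k = 1..6 and look for the bend ("elbow") in the curve
wss <- sapply(1:6, function(k) kmeans(iris[, 1:4], centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Within Sum of Squares (WSS)")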

Diagnostics:
 The heuristic using WSS can provide at least several possible k values to consider. When
the number of attributes is relatively small, a common approach to further refine the choice of
k is to plot the data to determine how distinct the identified clusters are from each other. In
general, the following questions should be considered:
o Are the clusters well separated from each other?
o Do any of the clusters have only a few points?
o Do any of the centroids appear to be too close to each other?

K-Means Clustering - Reasons to Choose (+) and Cautions (-):

Reasons to Choose (+):
 Easy to implement.
 Easy to assign new data to existing clusters.
 Concise output: the coordinates of the K cluster centers.

Cautions (-):
 Does not handle categorical variables.
 Sensitive to initialization (the first guess of the centroids).
 Variables should all be measured on similar or compatible scales.
 K (the number of clusters) must be known or decided a priori; a wrong guess can give poor results.
 Tends to produce "round", equi-sized clusters, which is not always desirable.

Object Attributes
 Regarding which object attributes (for example, age and income) to use in the analysis, it
is important to understand what attributes will be known at the time a new object will be
assigned to a cluster.

Units of Measure
 From a computational perspective, the k-means algorithm is somewhat indifferent to the
units of measure for a given attribute (for example, meters or centimeters for a patient’s
height). However, the algorithm will identify different clusters depending on the choice of
the units of measure.

Rescaling
 Attributes that are expressed in dollars are common in clustering analyses and can differ in
magnitude from the other attributes. In such cases it is advisable to rescale (for example,
standardize) the attributes so that no single attribute dominates the distance calculation, as in
the sketch below.
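
A minimal sketch with a hypothetical data frame: income (dollars) and age (years) are on very
different scales, so we standardize them before clustering.

customers <- data.frame(income = c(25000, 48000, 120000, 31000, 95000, 27000),
                        age    = c(23, 45, 52, 31, 48, 29))

scaled <- scale(customers)        # center each column and divide by its standard deviation
km <- kmeans(scaled, centers = 2) # both attributes now contribute comparably to the distance
print(km$cluster)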

Classification – Definition:
 The goal of classification is to find a model to accurately predict the target class for
each record in the data set.
 The process that describes and distinguishes data classes and concepts.
 For example, a classification model could be used to identify loan applicants as low,
medium, or high credit risks.

Steps in Classification:
 A two-step process:
o Model Construction (Learning Step / Training Phase): Construction of the
classification model. An algorithm builds the classifier by making the model
learn from the available training set. The model has to be trained so that it
can make accurate predictions.
o Model Usage (Classification Step): The constructed model is used to predict
class labels for test data, and the accuracy of the classification rules is
estimated from these predictions.

NAÏVE BAYES:
 Naïve Bayes is a probabilistic classification method based on Bayes’ theorem.
 Bayes’ theorem gives the relationship between the probabilities of two events and
their conditional probabilities.
 A naïve Bayes classifier assumes that the presence or absence of a particular feature of
a class is unrelated to the presence or absence of other features. For example, an
object can be classified based on its attributes such as shape, color, and weight. A
reasonable classification for an object that is spherical, yellow, and less than 60 grams
in weight may be a tennis ball. Even if these features depend on each other or upon the
existence of the other features, a naïve Bayes classifier considers all these properties to
contribute independently to the probability that the object is a tennis ball.
 The input variables are generally categorical, but variations of the algorithm can accept
continuous variables. There are also ways to convert continuous variables into
categorical ones. This process is often referred to as the discretization of continuous
variables. In the tennis ball example, a continuous variable such as weight can be
grouped into intervals to be converted into a categorical variable. For an attribute such
as income, the attribute can be converted into categorical values as shown below (an R sketch
using cut() follows this list).
o Low Income: income < $10,000
o Working Class: $10,000 ≤ income < $50,000
o Middle Class: $50,000 ≤ income < $1,000,000
o Upper Class: income ≥ $1,000,000
 Examples:
o Classifying text documents, spam filtering
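
A minimal sketch of this discretization with R's cut() function, using hypothetical income values
and the boundaries listed above:

income <- c(8500, 32000, 75000, 1500000, 48000)   # hypothetical incomes in dollars

income_class <- cut(income,
                    breaks = c(-Inf, 10000, 50000, 1000000, Inf),
                    labels = c("Low Income", "Working Class", "Middle Class", "Upper Class"),
                    right  = FALSE)   # intervals are closed on the left, e.g. [10000, 50000)
print(income_class)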

Bayes’ Theorem:
 Class Probabilities: The probabilities of each class in the training dataset.
 Conditional Probabilities: The conditional probabilities of each input value given each
class value.
 The conditional probability of event C occurring, given that event A has already occurred, is:

P(C | A) = P(A ∩ C) / P(A)


 Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x),
and P(x|c):

P(c|x) = P(x|c) × P(c) / P(x)

 Here,

o P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
o P(c) is the prior probability of class.
o P(x|c) is the likelihood which is the probability of predictor given class.
o P(x) is the prior probability of predictor.
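
A minimal R sketch of Bayes' theorem with hypothetical probabilities (the numbers are purely
illustrative):

p_c         <- 0.3    # prior probability of the class, P(c)
p_x_given_c <- 0.8    # likelihood of the predictor given the class, P(x|c)
p_x         <- 0.5    # prior probability of the predictor, P(x)

p_c_given_x <- p_x_given_c * p_c / p_x   # posterior P(c|x) by Bayes' theorem
print(p_c_given_x)                       # 0.48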

Naïve Bayes Classifier:
 The naïve Bayes classifier assigns a record with attributes (x1, x2, …, xm) to the class c that
maximizes P(c) × P(x1|c) × P(x2|c) × … × P(xm|c), that is, the prior probability of the class
multiplied by the conditional probabilities of each attribute value given that class.

Diagnostics:
 Naïve Bayes classifiers can handle missing values.
 Naïve Bayes is also robust to irrelevant variables.
 Naïve Bayes is computationally efficient and is able to handle high-dimensional data
efficiently.
 Naïve Bayes assumes the variables in the data are conditionally independent. Therefore, it
is sensitive to correlated variables.
 Data scarcity can cause a loss of accuracy. Zero frequency: if a category of a categorical
variable is not seen in the training data set, the model assigns a zero probability to that
category, and then a prediction cannot be made.

Example 1:

To demonstrate the concept of naïve Bayes classification, consider the example displayed in the
illustration above. As indicated, the objects can be classified as either GREEN or RED. Our task
is to classify new cases as they arrive, i.e., to decide to which class label they belong, based on the
currently existing objects.

Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case
(which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED.
In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based
on previous experience, in this case the percentage of GREEN and RED objects, and often used to
predict outcomes before they actually happen.

Thus, we can write:

Prior probability of GREEN = number of GREEN objects / total number of objects
Prior probability of RED = number of RED objects / total number of objects

Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities
for class membership are:

Prior probability of GREEN = 40/60
Prior probability of RED = 20/60
Having formulated our prior probability, we are now ready to classify a new object (WHITE
circle). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or
RED) objects in the vicinity of X, the more likely that the new cases belong to that particular color.
To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen
a priori) of points irrespective of their class labels. Then we calculate the number of points in the
circle belonging to each class label. From this we calculate the likelihoods:

Likelihood of X given GREEN = number of GREEN in the vicinity of X / total number of GREEN = 1/40
Likelihood of X given RED = number of RED in the vicinity of X / total number of RED = 3/20

From the illustration above, it is clear that the likelihood of X given GREEN is smaller than the
likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones.

In the Bayesian analysis, the final classification is produced by combining both sources of
information, the prior and the likelihood, into a posterior probability:

Posterior probability of X being GREEN ∝ 4/6 × 1/40 = 1/60
Posterior probability of X being RED ∝ 2/6 × 3/20 = 1/20

Finally, we classify X as RED since its class membership achieves the largest posterior probability.
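
The same calculation can be reproduced in R; the small sketch below simply re-derives the
numbers above and is not part of the original notes:

prior_green <- 40 / 60            # 40 GREEN objects out of 60
prior_red   <- 20 / 60            # 20 RED objects out of 60

lik_green <- 1 / 40               # 1 GREEN object inside the circle around X
lik_red   <- 3 / 20               # 3 RED objects inside the circle around X

post_green <- prior_green * lik_green   # proportional to the posterior, = 1/60
post_red   <- prior_red   * lik_red     # proportional to the posterior, = 1/20

ifelse(post_red > post_green, "RED", "GREEN")   # X is classified as RED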

DECISION TREE:
 A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and
each leaf node holds a class label. The topmost node in the tree is the root node.

 The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a
test on an attribute. Each leaf node represents a class.

 The benefits of having a decision tree are as follows −


o It does not require any domain knowledge.
o It is easy to comprehend.
o The learning and classification steps of a decision tree are simple and fast.
 Tree Pruning: Tree pruning is performed in order to remove anomalies in the training data
due to noise or outliers. The pruned trees are smaller and less complex.
 There are two approaches to prune a tree −
o Pre-pruning − The tree is pruned by halting its construction early.
o Post-pruning - This approach removes a sub-tree from a fully grown tree.
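
As a hedged illustration (not from the original notes), post-pruning can be done in R with the
rpart package, which grows a full tree and then prunes it back using the complexity parameter cp:

# install.packages("rpart")   # assumed to be available
library(rpart)

fit <- rpart(Species ~ ., data = iris, control = rpart.control(cp = 0.001))  # grow a full tree
printcp(fit)                      # cross-validated error for each value of cp
pruned <- prune(fit, cp = 0.05)   # snip off splits that improve the fit by less than cp = 0.05
plot(pruned); text(pruned)        # plot the smaller, pruned tree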

Example of Decision Tree Algorithm:

Step 1: The first step is to create a root node containing all the observations.
Step 2: If all observations have the same outcome, return a leaf node with that label (“yes” or “no”).
Step 3: Otherwise, find the entropy of all observations, E(S), and the entropy after splitting on each
attribute x, E(S, x).
Step 4: Find the information gain of each attribute and select the attribute with the highest
information gain as the split.
Step 5: Repeat the above steps on each branch until all attributes are covered or every node is pure.

 Attribute selection measure is a heuristic for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training tuples into individual classes.
It determines how the tuples at a given node are to be split.
Information Gain
Information gain is the main measure used to build decision trees (it is the measure used by ID3).
Choosing the split with the highest information gain minimizes the information required to classify
the tuples and reduces the number of tests needed to classify a given tuple. The attribute with the
highest information gain is selected as the splitting attribute.

The original information (entropy) needed to classify a tuple in dataset D is given by:

E(S) = − Σ_i p_i log2(p_i)   (summed over the m classes C_i)

where p_i is the probability that a tuple belongs to class C_i. The information is encoded in bits;
therefore, log to the base 2 is used. E(S) represents the average amount of information required to
find out the class label of a tuple in dataset D. This measure is also called entropy.

The information still required for exact classification after partitioning on attribute X is given by:

E(S, X) = Σ_j (|D_j| / |D|) × E(D_j)   (summed over the v partitions D_j produced by X)

where |D_j| / |D| is the weight of the j-th partition. This represents the information needed to
classify the dataset D after partitioning by X.

Information gain is the difference between the original information and the expected information
that is required to classify the tuples of dataset D:

Gain(X) = E(S) − E(S, X)

The gain is the reduction in the information requirement obtained by knowing the value of X. The
attribute with the highest information gain is chosen as the “best” splitting attribute.
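
A small R sketch of these formulas, using a hypothetical vector of class labels and a hypothetical
attribute (not taken from the original notes):

# Entropy of a vector of class labels: E(S) = -sum(p_i * log2(p_i))
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of splitting the labels by attribute x:
# Gain(X) = E(S) - sum_j (|D_j| / |D|) * E(D_j)
info_gain <- function(labels, x) {
  weights  <- table(x) / length(x)                    # |D_j| / |D|
  expected <- sum(sapply(split(labels, x), entropy) * weights)
  entropy(labels) - expected
}

# Hypothetical example: ten class labels and a three-valued attribute
labels <- c("Pass","Pass","Fail","Fail","Pass","Pass","Pass","Pass","Fail","Pass")
grade  <- c("S","S","RA","RA","S","S","A","A","RA","S")
info_gain(labels, grade)    # here every partition is pure, so the gain equals E(S)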

Gain Ratio
Information gain is biased toward attributes with many distinct values and can sometimes produce
a partitioning that is useless for classification. The gain ratio corrects for this by dividing the
information gain by the split information, which measures how the training data set is split into
partitions with respect to the total number of tuples:

GainRatio(X) = Gain(X) / SplitInfo(X), where SplitInfo(X) = − Σ_j (|D_j| / |D|) log2(|D_j| / |D|)

The attribute with the maximum gain ratio is used as the splitting attribute.

Gini Index
The Gini index is used by CART and considers a binary split for each attribute. It measures the
impurity of the class-labeled training tuples in dataset D as:

Gini(D) = 1 − Σ_i p_i²

where p_i is the probability that a tuple in D belongs to class C_i. The Gini index of a binary split
of dataset D by attribute A into partitions D1 and D2 is given by:

Gini_A(D) = (|D1| / |D|) × Gini(D1) + (|D2| / |D|) × Gini(D2)

The attribute that gives the smallest Gini_A(D), i.e., the largest reduction in impurity, is chosen as
the splitting attribute.
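
A small R sketch of the Gini formulas, using the same hypothetical labels as in the information
gain sketch above:

# Gini impurity of a vector of class labels: Gini(D) = 1 - sum(p_i^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Gini index of a binary split of D into D1 and D2 (split_indicator is TRUE/FALSE)
gini_split <- function(labels, split_indicator) {
  parts   <- split(labels, split_indicator)            # the two partitions D1, D2
  weights <- sapply(parts, length) / length(labels)    # |D_j| / |D|
  sum(weights * sapply(parts, gini))
}

labels <- c("Pass","Pass","Fail","Fail","Pass","Pass","Pass","Pass","Fail","Pass")
grade  <- c("S","S","RA","RA","S","S","A","A","RA","S")
gini(labels)                       # impurity before any split (0.42)
gini_split(labels, grade == "S")   # weighted impurity after splitting on Grade == "S"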

The most notable types of decision tree algorithms are:


1. Iterative Dichotomiser 3 (ID3): This algorithm uses information gain to decide which attribute
is to be used to classify the current subset of the data. For each level of the tree, information gain
is calculated for the remaining data recursively.
2. C4.5: This algorithm is the successor of the ID3 algorithm. It uses either information gain or
gain ratio to decide upon the classifying attribute. It is a direct improvement over ID3, as it can
handle both continuous attributes and missing attribute values.
3. Classification and Regression Tree (CART): This algorithm can produce either a regression
tree or a classification tree, depending on whether the dependent variable is continuous or
categorical.

CONFUSION MATRIX
The confusion matrix visualizes the accuracy of a classifier by comparing the actual and predicted
classes. The binary confusion matrix is composed of four cells:

 TP: True Positive: positive values correctly predicted as positive
 FP: False Positive: negative values incorrectly predicted as positive
 FN: False Negative: positive values incorrectly predicted as negative
 TN: True Negative: negative values correctly predicted as negative

The accuracy can be computed from the confusion matrix as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
ROC, AUC
 An ROC curve (receiver operating characteristic curve) plots the true positive rate against
the false positive rate at various threshold settings.
 AUC stands for "Area Under the ROC Curve."
 The AUC-ROC curve is a performance measurement for classification problems at various
threshold settings. ROC is a probability curve, and AUC represents the degree or measure
of separability: it tells how capable the model is of distinguishing between classes. The
higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
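
A hedged sketch using the pROC package (assumed to be installed), with hypothetical labels and
predicted probabilities:

# install.packages("pROC")
library(pROC)

actual <- c(0, 0, 1, 1, 1, 0, 1, 0)                    # true binary labels
prob   <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.55)  # predicted probabilities of class 1

roc_obj <- roc(actual, prob)   # ROC curve over all thresholds
plot(roc_obj)                  # plot the curve
auc(roc_obj)                   # area under the ROC curve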

ENSEMBLE
Ensemble methods use multiple learning algorithms (for example, many decision trees) to obtain
better predictive performance than could be obtained from any of the constituent models alone.
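
A hedged illustration of an ensemble: a random forest of 100 decision trees on the built-in iris
data set (assumes the randomForest package is installed):

# install.packages("randomForest")
library(randomForest)

set.seed(123)
rf <- randomForest(Species ~ ., data = iris, ntree = 100)
print(rf)                  # shows the out-of-bag error estimate of the ensemble
predict(rf, iris[1:5, ])   # predictions combined (voted) across all 100 trees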

KMEANS – R PROGRAM:
#READ DATA
data<-read.csv("mark.csv")

#CHECK DATA FRAME


print(is.data.frame(data))
TRUE

#PRINT NUMBER OF COLUMNS


print(ncol(data))
2

#PRINT NUMBER OF ROWS


print(nrow(data))
5

#PRINT MAXIMUM MARK IN MATHS
print(max(data$maths))
50

#PRINT MINIMUM MARK IN MATHS


print(min(data$maths))
5

#DISPLAY SUMMARY
summary(data)
maths science
Min. : 5 Min. :10
1st Qu.:10 1st Qu.:20
Median :40 Median :40
Mean :29 Mean :33
3rd Qu.:40 3rd Qu.:45
Max. :50 Max. :50

#PRINT DATA
print(data)
maths science
1 10 20
2 40 50
3 5 10
4 40 45
5 50 40

#DRAW PLOT FOR GIVEN DATA
plot(data)

#APPLY K-MEANS
km<-kmeans(data,centers=2)

#DISPLAY CENTERS
print(km$centers)
maths science
1 43.33333 45
2 7.50000 15

#DISPLAY CLUSTERS
print(km$cluster)
2 1 2 1 1

#PLOT CENTERS
plot(km$centers)

DECISION TREE – R PROGRAM:
#READ DATA
data<-read.csv("data.csv")

#CHECK DATA FRAME


print(is.data.frame(data))
TRUE

#READ NUMBER OF COLUMNS


print(ncol(data))
7

#READ NUMBER OF ROWS


print(nrow(data))
10

#DISPLAY DATA
print(data)

Tamil English Maths Science Social Grade Result


1 9 8 7 6 8 S Pass
2 8 7 8 9 6 S Pass
3 2 4 4 3 1 RA Fail
4 3 4 3 2 1 RA Fail
5 8 8 7 6 8 S Pass
6 8 7 9 9 6 S Pass
7 9 9 7 8 8 A Pass
8 8 7 8 9 6 A Pass
9 1 3 2 3 2 RA Fail
10 9 8 7 6 8 S Pass

#DISPLAY SUMMARY
summary(data)
Tamil English Maths Science
Min. :1.000 Min. :3.000 Min. :2.000 Min. :2.000
1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:3.000
Median :8.000 Median :7.000 Median :7.000 Median :6.000
Mean :6.222 Mean :6.333 Mean :6.111 Mean :6.111
3rd Qu.:8.000 3rd Qu.:8.000 3rd Qu.:8.000 3rd Qu.:9.000
Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.000
Social Grade Result
Min. :1.000 A :18 Fail:3
1st Qu.:2.000 RA:27 Pass:7
Median :6.000 S :36
Mean :5.111
3rd Qu.:8.000
Max. :8.000

#INSTALL CARET PACKAGE


install.packages("caret")
library(caret)

#SET SEED VALUE


set.seed(3033)

#TRAIN AND TEST DATA


intrain <- createDataPartition(y = data$Result,list=FALSE)
training <- data[intrain,]
testing <- data[-intrain,]

#FIND DIMENSIONS
dim(training)
5 7

dim(testing)
5 7

#CHECK MISSING VALUE
anyNA(data)
FALSE

#TRAIN THE DATA


t <- trainControl(method = "repeatedcv")
set.seed(3333)
dtree_fit <- train(Result ~., data = training, method = "rpart", trControl = t)

#TEST THE MODEL


test_pred<-predict(dtree_fit,newdata=testing)

#ACCURACY
confusionMatrix(test_pred,testing$Result)
Reference
Prediction Fail Pass
Fail 1 0
Pass 0 4

Accuracy : 1

#INSTALL TREE PACKAGES


install.packages("tree")
library(tree)

#PLOT DECISION TREE


f<-tree(Result~., data)
plot(f)
text(f,all=TRUE)

NAÏVE BAYES – R PROGRAM:


#READ DATA
data<-read.csv("data.csv")

#CHECK DATA FRAME
print(is.data.frame(data))
TRUE

#READ NUMBER OF COLUMNS


print(ncol(data))
7

#READ NUMBER OF ROWS


print(nrow(data))
10

#DISPLAY DATA
print(data)

Tamil English Maths Science Social Grade Result


1 9 8 7 6 8 S Pass
2 8 7 8 9 6 S Pass
3 2 4 4 3 1 RA Fail
4 3 4 3 2 1 RA Fail
5 8 8 7 6 8 S Pass
6 8 7 9 9 6 S Pass
7 9 9 7 8 8 A Pass
8 8 7 8 9 6 A Pass
9 1 3 2 3 2 RA Fail
10 9 8 7 6 8 S Pass

#DISPLAY SUMMARY
summary(data)
Tamil English Maths Science
Min. :1.000 Min. :3.000 Min. :2.000 Min. :2.000
1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:3.000
Median :8.000 Median :7.000 Median :7.000 Median :6.000
Mean :6.222 Mean :6.333 Mean :6.111 Mean :6.111
3rd Qu.:8.000 3rd Qu.:8.000 3rd Qu.:8.000 3rd Qu.:9.000
Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.000
Social Grade Result
Min. :1.000 A :18 Fail:3
1st Qu.:2.000 RA:27 Pass:7
Median :6.000 S :36
Mean :5.111
3rd Qu.:8.000

Max. :8.000

#SET SEED VALUE


set.seed(3033)

#TRAIN AND TEST DATA


library(caret)   # caret provides createDataPartition, train, and confusionMatrix
intrain <- createDataPartition(y = data$Result,list=FALSE)
training <- data[intrain,]
testing <- data[-intrain,]

#FIND DIMENSIONS
dim(training)
5 7

dim(testing)
5 7

#CHECK MISSING VALUE


anyNA(data)
FALSE

#TRAIN THE DATA


t <- trainControl(method = "repeatedcv")
set.seed(3333)
nb <- train(Result ~., data = training, method = "nb", trControl = t)   # method "nb" uses the klaR package

#TEST THE MODEL


test_pred<-predict(nb,newdata=testing)

#ACCURACY
confusionMatrix(test_pred,testing$Result)
Reference
Prediction Fail Pass
Fail 1 0
Pass 0 4

Accuracy : 1
