UNIT 3: CLASSIFICATION
Example 1: Customer Segmentation
Dataset: Customer demographic and purchase data.
Customer ID | Age | Gender | Spend ($) | Purchase Frequency
1           | 25  | M      | 150       | 5
2           | 34  | F      | 200       | 3
3           | 29  | M      | 120       | 7
4           | 46  | F      | 300       | 2
5           | 31  | M      | 180       | 6
Descriptive Analysis:
• Clustering could be used to segment customers into groups:
o Group A: Young frequent shoppers (e.g., ages 20-30, high frequency).
o Group B: Middle-aged occasional shoppers (e.g., ages 30-50, low frequency).
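A minimal sketch of how such a segmentation could be produced with k-means clustering (scikit-learn assumed to be available), using the Age and Purchase Frequency columns from the table above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Customer data from the table above: [age, purchase frequency]
customers = np.array([
    [25, 5],
    [34, 3],
    [29, 7],
    [46, 2],
    [31, 6],
])

# Cluster the customers into two segments (frequent vs. occasional shoppers)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(customers)

for cust_id, label in zip(range(1, 6), labels):
    print(f"Customer {cust_id} -> segment {label}")
```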
Example 2: Market Basket Analysis
Dataset: Transaction data showing items purchased together.
Transaction ID | Items
1              | Bread, Milk
2              | Milk, Eggs
3              | Bread, Butter
5              | Milk, Butter
Descriptive Analysis:
• Association Rule Mining can reveal patterns, such as:
o Customers who buy Milk often buy Bread (e.g., rule: If Milk, then Bread).
o Support for Milk and Bread: 3 out of 5 transactions.
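As a rough illustration of how support counts like these are computed, here is a short Python sketch over the transactions listed above (transaction 4 is omitted because it is not shown in the table):

```python
# Transactions from the table above (transaction 4 is not shown in the notes)
transactions = {
    1: {"Bread", "Milk"},
    2: {"Milk", "Eggs"},
    3: {"Bread", "Butter"},
    5: {"Milk", "Butter"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

print("support({Milk, Bread}) =", support({"Milk", "Bread"}, transactions))
print("support({Milk})        =", support({"Milk"}, transactions))
```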
Example 1: Loan Default Prediction
Dataset: Loan applicant data with a default label (1 = Default, 0 = No Default).
Applicant ID | Age | Income  | Loan Amount | Default
1            | 28  | 50,000  | 20,000      | 0
2            | 40  | 75,000  | 30,000      | 1
3            | 35  | 60,000  | 25,000      | 0
4            | 50  | 100,000 | 50,000      | 1
5            | 22  | 30,000  | 15,000      | 0
Predictive Analysis:
• A classification model (e.g., a decision tree) could be trained on this data to predict whether a new applicant is likely to default based on age, income, and loan amount.
Example 2: Sales Forecasting
Dataset: Monthly sales figures.
Month    | Sales ($)
January  | 10,000
February | 12,000
March    | 15,000
April    | 18,000
May      | 20,000
Predictive Analysis:
• A time series analysis could be applied to forecast future sales.
• Based on the trend, the model may predict:
o June Sales: $22,000
o July Sales: $25,000
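A minimal sketch of a trend-based forecast for this series, using a simple straight-line fit (numpy assumed available); this only illustrates the idea and is not the exact model implied by the notes:

```python
import numpy as np

# Monthly sales from the table above (January = 1, ..., May = 5)
months = np.array([1, 2, 3, 4, 5])
sales = np.array([10_000, 12_000, 15_000, 18_000, 20_000])

# Fit a straight-line trend: sales ≈ slope * month + intercept
slope, intercept = np.polyfit(months, sales, deg=1)

# Extrapolate the trend to June (6) and July (7)
for month in (6, 7):
    print(f"Month {month} forecast: {slope * month + intercept:,.0f}")
```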
Summary
• Descriptive Modeling helps in summarizing and understanding data patterns, such as
customer segmentation and market basket analysis.
• Predictive Modeling involves forecasting future outcomes, such as predicting loan
defaults and sales forecasting.
Classification in data mining is a supervised learning technique used to categorize data into
predefined classes or categories based on their attributes. It plays a role in both descriptive
and predictive modeling. The key concepts are explained below with examples.
Classification is a fundamental problem in data mining that involves predicting the category
or class label of new observations based on past observations with known labels. Here's an
overview of the key components of classification in data mining.
1. Problem Definition
• Objective: The main goal is to build a model that can classify data points into
predefined classes based on input features.
• Input: A dataset consisting of features (independent variables) and corresponding
class labels (dependent variable).
• Output: A classification model that can predict the class label for new, unseen data.
3. Evaluation of Classifiers
Evaluation is critical to determine how well a classifier performs. Common evaluation
metrics include:
• Accuracy: The ratio of correctly predicted instances to the total instances.
Accuracy = (True Positives + True Negatives) / Total Instances
• Precision: The ratio of true positives to the sum of true positives and false positives.
Precision = True Positives / (True Positives + False Positives)
• Recall (Sensitivity): The ratio of true positives to the sum of true positives and false
negatives.
Recall = True Positives / (True Positives + False Negatives)
• F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
• Confusion Matrix: A table that summarizes the performance of a classification algorithm,
showing true positives, false positives, true negatives, and false negatives.
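A small sketch showing how these four metrics follow from the confusion-matrix counts (plain Python, no libraries assumed):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Example: TP=3, FP=2, TN=3, FN=2 (see the disease-prediction example later)
print(classification_metrics(3, 2, 3, 2))
```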
4. Classification Techniques
There are various techniques used in classification, including:
• Decision Trees: A flowchart-like structure where each internal node represents a
decision based on a feature, leading to class labels.
• Random Forest: An ensemble of decision trees that improves classification accuracy
by averaging the predictions of multiple trees.
Example Dataset: Email Spam Classification (1 = Spam, 0 = Not Spam)
Email ID | Word Count | Contains Links | Contains Attachments | Spam
1        | 200        | Yes            | No                   | 1
2        | 150        | No             | Yes                  | 0
4        | 100        | No             | No                   | 0
5        | 250        | Yes            | No                   | 1
Classification Process
1. Data Collection: The dataset contains features of emails and their corresponding
spam labels.
2. Data Preprocessing: Convert categorical variables into numerical format (e.g., "Yes"
= 1, "No" = 0).
3. Feature Selection: Use features like "Word Count," "Contains Links," and "Contains
Attachments."
4. Model Training: Train a classification algorithm (e.g., Decision Tree) on the training
dataset.
5. Model Evaluation: Use a test dataset to evaluate the model's performance:
o Accuracy: Percentage of correctly classified emails.
o Precision: Correctly predicted spam emails divided by all predicted spam
emails.
o Recall: Correctly predicted spam emails divided by all actual spam emails.
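Following these steps, a minimal sketch of the preprocessing and training stages on the small email dataset above (pandas and scikit-learn assumed available; with so few rows this is purely illustrative):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Email data from the table above (email 3 is not shown in the notes)
emails = pd.DataFrame({
    "word_count": [200, 150, 100, 250],
    "contains_links": ["Yes", "No", "No", "Yes"],
    "contains_attachments": ["No", "Yes", "No", "No"],
    "spam": [1, 0, 0, 1],
})

# Step 2: convert categorical Yes/No columns to 1/0
for col in ("contains_links", "contains_attachments"):
    emails[col] = emails[col].map({"Yes": 1, "No": 0})

# Steps 3-4: select features and train a decision tree
X = emails[["word_count", "contains_links", "contains_attachments"]]
y = emails["spam"]
model = DecisionTreeClassifier().fit(X, y)

# Step 5 would evaluate on a held-out test set; here we just classify a new email
new_email = pd.DataFrame({"word_count": [180],
                          "contains_links": [1],
                          "contains_attachments": [0]})
print(model.predict(new_email))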
Applications of Classification:
• Spam Filtering: Classifying emails as spam or not spam.
• Medical Diagnosis: Predicting the presence of a disease based on patient data.
• Credit Scoring: Classifying loan applicants as high or low risk.
• Image Recognition: Identifying objects or people in images.
Confusion Matrix
Consider the following actual and predicted labels:
Actual | Predicted
1      | 1
0      | 1
1      | 1
0      | 0
1      | 0
0      | 1
1      | 1
0      | 0
The outcomes are arranged in a confusion matrix as follows:
              Actual
              1     0
Predicted  1  TP    FP
           0  FN    TN
Example:
Suppose we have a binary classification problem where we want to classify whether an email
is "spam" or "not spam." We have the following dataset (predictions vs actual values):
Email | Actual | Predicted
1     | Spam   | Spam
5     | Spam   | Spam
Let’s assume:
• "Spam" is the positive class.
• "Not Spam" is the negative class.
Step 1: Create the Confusion Matrix
The confusion matrix is a table that shows the true positive, false positive, false negative, and
true negative counts:
Example 2:
Actual | Predicted
1      | 1
0      | 1
1      | 1
0      | 0
1      | 0
0      | 1
1      | 1
0      | 0
Computing the performance metrics for this example gives:
Metric    | Value
Accuracy  | 67%
Precision | 67%
Recall    | 80%
F1-Score  | 73%
Example 3:
Let's consider another simple dataset example, this time focusing on a binary classification
task such as predicting whether a patient has a disease based on certain features.
Example Dataset: Disease Prediction
Consider a dataset for predicting whether a patient has a specific disease (1 = Disease, 0 = No
Disease) based on two features: age and cholesterol level. Here’s the test dataset with the
actual and predicted labels:
Patient | Actual | Predicted
1       | 1      | 1
2       | 0      | 1
3       | 1      | 1
4       | 0      | 0
5       | 1      | 0
6       | 0      | 0
7       | 1      | 1
8       | 0      | 0
9       | 1      | 0
10      | 0      | 1
Step 1: Count the True Positives, False Positives, True Negatives, and False Negatives
From the dataset, we can summarize the results:
• True Positives (TP): Correctly predicted patients with the disease
• False Positives (FP): Incorrectly predicted patients with
the disease (predicted disease but actually no disease)
• True Negatives (TN): Correctly predicted patients without the disease
• False Negatives (FN): Incorrectly predicted patients without the disease (predicted
no disease but actually has disease)
Counts from the Dataset
• TP = 3 (Patients 1, 3, and 7)
• FP = 2 (Patients 2 and 10)
• TN = 3 (Patients 4, 6, and 8)
• FN = 2 (Patients 5 and 9)
Step 2: Calculate Performance Metrics
Metric    | Value
Accuracy  | 60%  ((3 + 3) / 10)
Precision | 60%  (3 / (3 + 2))
Recall    | 60%  (3 / (3 + 2))
F1-Score  | 60%
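These counts and metrics can be checked with scikit-learn (assumed available) on the actual and predicted labels from the table above:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Actual and predicted labels for patients 1-10 (from the table above)
actual    = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print("TP, FP, TN, FN =", tp, fp, tn, fn)                 # 3 2 3 2
print("Accuracy :", accuracy_score(actual, predicted))    # 0.6
print("Precision:", precision_score(actual, predicted))   # 0.6
print("Recall   :", recall_score(actual, predicted))      # 0.6
print("F1       :", f1_score(actual, predicted))          # 0.6
```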
Conclusion
In this example, we calculated performance metrics based on a simple disease prediction
dataset.
• The model has a 60% accuracy, indicating that it correctly classified a little over
half of the patients.
• The precision and recall are both 60%, suggesting that the model has room for
improvement, particularly in distinguishing patients with the disease from those
without it.
• The F1-Score of 60% reflects this balance between precision and recall.
Decision Trees
Decision Trees are a popular method for classification and regression tasks in data mining
and machine learning. They model decisions and their possible consequences as a tree-like
structure, making them easy to interpret and visualize.
Example 1: A decision tree for this kind of classification task. The root node splits on Age
(Young / Middle-aged / Senior); the Young branch splits further on a Yes/No attribute (such as
Student), the other branches split on Credit Rating (Fair / Excellent), and each leaf node gives
the predicted class (Yes or No).
Example 2:
We will classify whether a fruit is an Apple or an Orange based on two attributes: Color and Size.
Dataset: fruit examples described by two attributes, Color and Size.
Decision tree: the root node splits on Color (Red vs. Orange), and each branch then splits
further on Size before assigning the class Apple or Orange.
Example 3: Classifying cars (for instance, cars sold in India) by attributes such as Fuel Type, Price, and Year.
In this example, a split on the Fuel Type attribute categorizes the cars into two groups: those
that run on Petrol and those that run on Diesel. Each group can then be further analyzed or
split on other attributes (like Price or Year) to refine the classification.
Using splits effectively helps in making decisions or predictions about car types based on
various features, which is useful for both consumers and car manufacturers in India.
1. Information Gain
• Definition: Information Gain measures the reduction in entropy obtained by splitting the
dataset on a particular attribute; the attribute with the highest gain is chosen for the split.
• Formula: Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) × Entropy(Sv)
• where S is the dataset and Sv is the subset of S for which attribute A takes the
value v.
2. Gini Impurity
• Definition: Gini Impurity measures the probability of incorrectly classifying a
randomly chosen element if it was randomly labeled according to the distribution of
labels in the subset.
• Calculation: Lower Gini Impurity values indicate a better split.
• Formula: Gini(S) = 1 − Σi (pi)²
where pi is the probability of class i in the set S and C is the number of classes (the sum runs over i = 1, ..., C).
3. Entropy
• Definition: Entropy measures the amount of disorder or unpredictability in the
dataset. It quantifies the impurity of the dataset.
• Calculation: Higher entropy indicates more disorder, while lower entropy indicates
more predictability.
• Formula: Entropy(S) = − Σi pi log2(pi), where the sum runs over the C classes.
These measures help in determining the best attribute to split the dataset at each node
in a decision tree. By evaluating the quality of splits using these metrics, we can build
a more accurate and efficient decision tree model.
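A short sketch of how these three measures can be computed for a candidate split (plain Python; the Yes/No labels below are hypothetical values in the style of the decision-tree examples above):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes in S."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """Gini(S) = 1 - sum(p_i^2) over the classes in S."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """Gain = Entropy(parent) - weighted sum of child entropies."""
    total = len(parent_labels)
    weighted = sum(len(g) / total * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

# Hypothetical split: 10 labels divided into two subsets by some attribute
parent = ["Yes"] * 6 + ["No"] * 4
children = [["Yes"] * 5 + ["No"], ["Yes"] + ["No"] * 3]
print("Entropy:", entropy(parent))
print("Gini   :", gini(parent))
print("Gain   :", information_gain(parent, children))
```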
Here are some examples of datasets that are commonly used for decision tree
classification tasks:
1. Drugs A, B, C, X, Y for Decision Trees:
o Features: Age, Sex, Blood Pressure, Cholesterol
Example:
Let's imagine we're working with the Drugs A, B, C, X, Y for Decision Trees dataset. Here's a
simplified example of how a decision tree might work to predict which drug is best for a
patient:
Algorithm and its implementation, using the ID3 (Iterative Dichotomiser 3) algorithm as an
example.
ID3 Algorithm
1. Start with the root node: This node represents the entire dataset.
2. Calculate Information Gain for each attribute:
o Entropy: A measure of the impurity of the dataset. A pure dataset (all
instances belong to the same class) has an entropy of 0.
o Information Gain: The reduction in entropy obtained by splitting the dataset on the attribute.
3. Select the attribute with the highest information gain as the decision node.
4. Split the dataset into subsets, one for each value of the chosen attribute.
5. Repeat the process recursively on each subset until all instances in a node belong to
the same class or no attributes remain.
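The code block that the explanation below refers to is reproduced here as a minimal sketch following the listed steps; 'your_dataset.csv' and the 'target' column name are placeholders, not actual file or column names from the notes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data (replace 'your_dataset.csv' with the path to your dataset)
data = pd.read_csv('your_dataset.csv')

# Prepare data: separate features (X) and the target variable (y)
X = data.drop(columns=['target'])   # 'target' is a placeholder column name
y = data['target']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create an ID3-style classifier (entropy-based information gain)
tree = DecisionTreeClassifier(criterion='entropy')

# Train the model
tree.fit(X_train, y_train)

# Make predictions on the test data
y_pred = tree.predict(X_test)

# Evaluate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
```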
Explanation:
• Import libraries: pandas for data
manipulation, DecisionTreeClassifier from sklearn.tree for the decision tree
model, train_test_split for splitting data, and accuracy_score for evaluation.
• Load data: Replace 'your_dataset.csv' with the path to your dataset.
• Prepare data: Separate features (X) and the target variable (y).
• Split data: Create training and testing sets for model training and evaluation.
• Create classifier: DecisionTreeClassifier(criterion='entropy') creates a decision tree
model using ID3 (entropy-based information gain).
• Train the model: tree.fit(X_train, y_train) fits the model to the training data.
• Make predictions: tree.predict(X_test) uses the trained model to predict the target
variable for the test data.
• Evaluate accuracy: accuracy_score(y_test, y_pred) calculates the accuracy of the
model's predictions.
Important Notes:
• This is a simplified example. Real-world implementations might involve data
preprocessing, feature engineering, hyperparameter tuning, and other techniques.
• There are many other decision tree algorithms besides ID3 (e.g., C4.5, CART). The
choice of algorithm depends on the specific dataset and task.
Naive-Bayes Classifier
The Naive Bayes Classifier is a family of probabilistic algorithms based on Bayes' Theorem,
which is particularly effective for classification tasks. Here’s a detailed overview of its key
features and how it works:
Key Features of Naive Bayes Classifier
1. Probabilistic Nature: It calculates the probability of each class given the features and
assigns the class with the highest probability to the instance.
2. Class Assignment: For a new instance, it computes the posterior probability of each class
and selects the class with the highest posterior probability.
Applications
• Text Classification: Commonly used for spam detection and sentiment analysis.
• Medical Diagnosis: Helps in predicting diseases based on symptoms.
• Recommendation Systems: Used to recommend products based on user preferences.
Advantages
• Simple and Fast: Easy to implement and computationally efficient.
• Effective with Large Datasets: Performs well even with a large number of features.
Disadvantages
• Independence Assumption: The assumption of feature independence may not hold
true in real-world scenarios, which can affect accuracy.
Age | Income | Purchase
25  | 50     | Yes
30  | 60     | Yes
35  | 70     | Yes
40  | 80     | No
45  | 90     | No
50  | 100    | No
Explanation:
• Age: The age of the individual in years.
• Income: The individual's annual income in thousands of dollars.
• Purchase: Whether the individual purchased the product (Yes or No).
Using this dataset, we can apply algorithms like Decision Trees or Naive Bayes to build
models that predict the likelihood of a purchase based on age and income.
For example, a simple decision tree might look like this:
• If Income > 75, then Purchase = No
• If Income <= 75, then Purchase = Yes
This is a very basic example, and real-world models would be more complex. However, it
illustrates how you can use a simple dataset to explore classification tasks.
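A minimal sketch of fitting a Naive Bayes model to this small dataset (scikit-learn's GaussianNB assumed; with only six rows the result is purely illustrative):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Age and Income (in thousands) with the Purchase label, from the table above
X = np.array([[25, 50], [30, 60], [35, 70], [40, 80], [45, 90], [50, 100]])
y = np.array(["Yes", "Yes", "Yes", "No", "No", "No"])

model = GaussianNB().fit(X, y)

# Predict the purchase decision for a new 33-year-old with income 65k
print(model.predict([[33, 65]]))          # predicted class
print(model.predict_proba([[33, 65]]))    # class probabilities
```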
Second Approach
Naive Bayesian
The Naive Bayesian classifier is based on Bayes’ theorem with the independence assumptions
between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter
estimation which makes it particularly useful for very large datasets. Despite its simplicity, the Naive
Bayesian classifier often does surprisingly well and is widely used because it
often outperforms more sophisticated classification methods.
Algorithm
Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x),
and P(x|c):
P(c|x) = (P(x|c) × P(c)) / P(x)
The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given
class (c) is independent of the values of the other predictors. This assumption is called class
conditional independence.
In ZeroR model there is no predictor, in OneR model we try to find the single best predictor, naïve
Bayesian includes all predictors using Bayes' rule and the independence assumptions between
predictors.
Example 1:
In this example we have 4 inputs (predictors). The final posterior probabilities can be standardized
between 0 and 1.
Add 1 to the count for every attribute value-class combination (Laplace estimator) when an attribute
value (Outlook=Overcast) doesn’t occur with every class value (Play Golf=no).
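A brief sketch of the Laplace (add-one) estimator on hypothetical Outlook counts for the Play Golf example; the counts here are made-up placeholders, since the frequency tables are not reproduced above:

```python
# Hypothetical raw counts of Outlook values within the class Play Golf = no
raw_counts = {"Sunny": 3, "Overcast": 0, "Rainy": 2}   # illustrative values only

# Without smoothing, P(Overcast | no) = 0, which would zero out the whole product
total = sum(raw_counts.values())
print({value: count / total for value, count in raw_counts.items()})

# Laplace estimator: add 1 to every attribute value-class count
k = len(raw_counts)                      # number of distinct attribute values
smoothed = {value: (count + 1) / (total + k) for value, count in raw_counts.items()}
print(smoothed)                          # no probability is exactly zero now
```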
Numerical Predictors
Numerical predictors are usually handled by assuming a normal (Gaussian) distribution within
each class. The probability density function for the normal distribution is defined by two
parameters (mean μ and standard deviation σ):
f(x) = (1 / (σ √(2π))) × exp(−(x − μ)² / (2σ²))
Kononenko's information gain as a sum of information contributed by each attribute can offer an
explanation on how values of the predictors influence the class probability.
The contribution of predictors can also be visualized by plotting nomograms. Nomogram plots log
odds ratios for each value of each predictor. Lengths of the lines correspond to spans of odds ratios,
suggesting importance of the related predictor. It also shows impacts of individual values of the
predictor.
Bayesian Network
Real-world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it
consists of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where each node
represents a random variable and each arc represents a direct dependency from a parent node
to a child node.
The joint probability distribution P[x1, x2, x3, ..., xn] can be written, using the chain rule, as:
P[x1, x2, ..., xn] = P[x1 | x2, ..., xn] · P[x2 | x3, ..., xn] · ... · P[xn-1 | xn] · P[xn]
In general, for each variable Xi we can write the equation as:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
Problem: A house has a burglar alarm that responds to burglaries but can also be triggered by
minor earthquakes. Two neighbours, David and Sophia, have agreed to call when they hear the
alarm. We want to model the joint probability of a burglary, an earthquake, the alarm going
off, and the two calls.
Solution:
o The Bayesian network for the above problem is given below. The network structure
shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect
the probability of the alarm going off, while David's and Sophia's calls depend only
on the alarm.
o The network represents the assumptions that David and Sophia do not perceive the
burglary directly, do not notice minor earthquakes, and do not confer with each other
before calling.
o The conditional distribution for each node is given as a conditional probability table,
or CPT.
o Each row in the CPT must sum to 1 because the entries in a row represent an
exhaustive set of cases for the variable.
o In a CPT, a boolean variable with k boolean parents contains 2^k probabilities. Hence, if
there are two parents, the CPT will contain 4 probability values.
List of all events occurring in this network:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
o We can write the events of the problem statement in the form of the probability
P[D, S, A, B, E], which can be rewritten using the chain rule of the joint probability
distribution:
o P[D, S, A, B, E]= P[D | S, A, B, E]. P[S, A, B, E]
o =P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]
o = P [D| A]. P [ S| A, B, E]. P[ A, B, E]
o = P[D | A]. P[ S | A]. P[A| B, E]. P[B, E]
o = P[D | A ]. P[S | A]. P[A| B, E]. P[B |E]. P[E]
Let's take the observed probability for the Burglary and earthquake component:
P(E= False)= 0.999, Which is the probability that an earthquake not occurred.
From the formula of joint distribution, we can write the problem statement in the form of
probability distribution:
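A small sketch of evaluating this factorization numerically; every probability below except P(E = False) = 0.999 is a hypothetical placeholder, so the output is purely illustrative:

```python
# Hypothetical CPT values; only p_e_true = 0.001 (i.e., P(E=False)=0.999) comes from the notes.
p_b_true = 0.002        # P(Burglary)           - placeholder
p_e_true = 0.001        # P(Earthquake)         - placeholder
p_a_given = {            # P(Alarm | Burglary, Earthquake) - placeholders
    (True, True): 0.94, (True, False): 0.95,
    (False, True): 0.31, (False, False): 0.001,
}
p_d_given_a = {True: 0.91, False: 0.05}   # P(David calls | Alarm)  - placeholders
p_s_given_a = {True: 0.75, False: 0.02}   # P(Sophia calls | Alarm) - placeholders

def joint(d, s, a, b, e):
    """P[D, S, A, B, E] = P[D|A] * P[S|A] * P[A|B,E] * P[B] * P[E]."""
    p_b = p_b_true if b else 1 - p_b_true
    p_e = p_e_true if e else 1 - p_e_true
    p_a = p_a_given[(b, e)] if a else 1 - p_a_given[(b, e)]
    p_d = p_d_given_a[a] if d else 1 - p_d_given_a[a]
    p_s = p_s_given_a[a] if s else 1 - p_s_given_a[a]
    return p_d * p_s * p_a * p_b * p_e

# Probability that the alarm sounds and both neighbours call,
# with neither a burglary nor an earthquake:
print(joint(d=True, s=True, a=True, b=False, e=False))
```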
K-Nearest Neighbors (KNN)
K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases
based on a similarity measure (e.g., distance functions). KNN has been used in statistical
estimation and pattern recognition since the early 1970s as a non-parametric technique.
Algorithm
A case is classified by a majority vote of its neighbors, with the case being assigned to the class most
common amongst its K nearest neighbors measured by a distance function. If K = 1, then the
case is simply assigned to the class of its nearest neighbor.
It should also be noted that all three distance measures (Euclidean, Manhattan, and Minkowski)
are only valid for continuous variables. For categorical variables, the Hamming distance must
be used instead. This also raises the issue of standardizing the numerical variables between 0
and 1 when the dataset contains a mixture of numerical and categorical variables.
Example:
Consider the following data concerning credit default. Age and Loan are two numerical variables
(predictors) and Default is the target.
With K = 3, two of the three closest neighbors have Default = Y and one has Default = N, so
the prediction for the unknown case is Default = Y.
Standardized Distance
One major drawback in calculating distance measures directly from the training set is in the case
where variables have different measurement scales or there is a mixture of numerical and
categorical variables. For example, if one variable is based on annual income in dollars, and the
other is based on age in years then income will have a much higher influence on the distance
calculated. One solution is to standardize the training set as shown below.
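A minimal sketch of min-max standardization (rescaling each numeric variable to the 0-1 range before computing distances); the Age/Loan column names follow the credit-default example, and the values below are hypothetical:

```python
def min_max_standardize(values):
    """Rescale a list of numbers to [0, 1]: Xs = (X - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical Age and Loan columns (Loan would otherwise dominate the distance)
ages  = [25, 35, 45, 20, 33]
loans = [40_000, 60_000, 80_000, 20_000, 150_000]

print(min_max_standardize(ages))
print(min_max_standardize(loans))
```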
Second Approach:
K-Nearest Neighbors
K-nearest neighbors (KNN) is a supervised machine learning algorithm used for both
regression and classification tasks.
KNN is used to make predictions on the test data set based on the characteristics of the current
training data points. This is done by calculating the distance between the test data and training
data, assuming that similar things exist within close proximity.
The algorithm stores the training data rather than building an explicit model. When a new data
point is presented, KNN compares its characteristics/features with those of the stored training
points and associates it with the training points that share the most similar characteristics or
features.
The ‘K’ in KNN is a parameter that refers to the number of nearest neighbors. K is a positive
integer and is typically small in value and is recommended to be an odd number.
In layman's terms, the K-value defines a neighborhood around the new data point, which makes
it easier to decide which class the point belongs to.
The example below shows 3 graphs. The first, the ‘Initial Data’ graph, shows data points
plotted and clustered into classes, with a new example to classify. In the ‘Calculate
Distance’ graph, the distance from the new example data point to the closest trained data points
is calculated. However, this alone still does not categorise the new example data point.
Therefore, the k-value is used to create a neighborhood in which we can classify the new
example data point.
We would say that k=3 and the new data point will belong to Class B as there are more trained
Class B data points with similar characteristics to the new data point in comparison to Class A.
If we increase the k-value to 7, we will see that the new data point will belong to Class A as
there are more trained Class A data points with similar characteristics to the new data point in
comparison to Class B.
KNN calculates the distance between data points in order to classify new data points. The most
common methods used to calculate this distance in KNN are Euclidean, Manhattan, and
Minkowski.
Euclidean Distance is the length of the straight line between two points. The formula for
Euclidean Distance is the square root of the sum of the squared differences between a new
data point (x) and an existing trained data point (y).
Manhattan Distance between two points is the sum of the absolute differences of their
Cartesian coordinates. The formula for Manhattan Distance is the sum of the lengths between
a new data point (x) and an existing trained data point (y) measured along the coordinate axes.
Minkowski Distance is the distance between two points in a normed vector space and is a
generalization of both the Euclidean and the Manhattan distance. In the formula for Minkowski
Distance, p = 2 gives the Euclidean distance (also known as the L2 distance), while p = 1 gives
the Manhattan distance (also known as the L1 or city-block distance, the norm used in LASSO
regularization).
The formulas are:
Euclidean: d(x, y) = √( Σi (xi − yi)² )
Manhattan: d(x, y) = Σi |xi − yi|
Minkowski: d(x, y) = ( Σi |xi − yi|^p )^(1/p)
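A short sketch of these three distance measures expressed as a single Minkowski function (plain Python; the two sample points are arbitrary):

```python
def minkowski_distance(x, y, p):
    """Minkowski distance; p=1 gives Manhattan distance, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

point_a = [1.0, 2.0, 3.0]   # arbitrary example points
point_b = [4.0, 6.0, 3.0]

print("Manhattan:", minkowski_distance(point_a, point_b, p=1))  # 3 + 4 + 0 = 7
print("Euclidean:", minkowski_distance(point_a, point_b, p=2))  # sqrt(9 + 16) = 5
```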
(Figure: red and white wines plotted by their myricetin and rutin content. | Image: Dhilip Subramanian)
The “K” in KNN is a parameter that refers to the number of nearest neighbors to include in the
majority of the voting process.
Now suppose we add a new glass of wine in the data set, and we want to know whether this
new wine is red or white.
(Figure: identifying the new glass of wine based on its nearest neighbors on the chart. | Image: Dhilip Subramanian)
(Figure: classifying the new glass of wine using its nearest neighbors with a k value of five. | Image: Dhilip Subramanian)
(Figure: predicting Andrew’s default status using Euclidean distance with data from other customers. | Image: Dhilip Subramanian)
We need to predict Andrew’s default status — either yes or no.
Euclidean Distance
Euclidean distance is the most popular distance measure. It helps to find the distance between
two real-valued vectors, like integers or floats. Before using Euclidean distance, we must
normalize or standardize the data, otherwise, data with larger values will dominate the
outcome.
Mathematically, it’s represented by the following formula: d(x, y) = √( Σi (xi − yi)² ).
Manhattan Distance
Manhattan distance is the simplest measure, and it’s used to calculate the distance between two
real-valued vectors. It is also called the “taxicab” or “city block” distance measure.
If we start from one place and move to another, Manhattan distance will calculate the absolute
value between starting and destination points. Manhattan is preferred over Euclidean if the two
data points are in an integer space.
The Manhattan distance between two points (X1, Y1) and (X2, Y2) is represented by |X1 – X2|
+ |Y1 – Y2|.
Minkowski Distance
Minkowski distance is used to calculate the distance between two real value vectors. It is a
generalized form for Euclidean and Manhattan distance. In addition, it adds a parameter “p,”
which helps to calculate the different distance measures.
Mathematically it’s represented by the following formula: d(x, y) = ( Σi |xi − yi|^p )^(1/p).
Note that p = 2 gives the Euclidean distance and p = 1 gives the Manhattan distance.