
DATA MINING

UNIT 3: CLASSIFICATION

Classification: Problem Definition, General Approaches to solving a classification problem,
Evaluation of Classifiers, Classification techniques; Decision Trees - Decision tree
Construction, Methods for Expressing attribute test conditions, Measures for Selecting the Best
Split, Algorithm for Decision tree Induction; Naive-Bayes Classifier, Bayesian Belief
Networks; K-Nearest neighbor classification - Algorithm and characteristics.

Classification in Data Mining


Classification is a fundamental technique in data mining used to categorize data into
predefined classes or groups. It involves creating a model that can predict the class label of
new, unseen data based on the patterns learned from a training dataset. Classification is
widely used in various applications, including spam detection, sentiment analysis, credit
scoring, and medical diagnosis.
Key Points about Classification
• Supervised Learning: Classification is a type of supervised learning where the model
is trained on labeled data.
• Algorithms: Common algorithms include Decision Trees, Random Forests, Support
Vector Machines (SVM), and Neural Networks.
• Output: The output is a discrete label from a finite set of classes.
Descriptive Modeling
Descriptive modeling focuses on understanding the underlying patterns in the data. It does
not predict future outcomes but provides insights into the characteristics and relationships
within the data.
Characteristics of Descriptive Modeling
• Goal: To summarize and describe the data.
• Techniques: It often includes clustering, association rule mining, and summarization
techniques.
• Example: A retail company analyzes customer purchase data to identify common
purchasing patterns. By using clustering, they might discover distinct customer
segments based on buying behavior, such as "frequent buyers" and "occasional
buyers."
Example of Descriptive Modeling
• Data: Customer transaction data from a supermarket.
• Output: Segmentation of customers into different groups based on their purchasing
habits, revealing insights like:
o Group A: Regular shoppers of organic products.
o Group B: Seasonal shoppers during holidays.
Predictive Modeling
Predictive modeling, on the other hand, aims to forecast future outcomes based on historical
data. It involves using statistical techniques and machine learning algorithms to predict an
unknown or future value.
Characteristics of Predictive Modeling
• Goal: To make accurate predictions about future events or behaviors.
• Techniques: Often employs regression analysis, time series analysis, and
classification algorithms.



• Example: A bank uses historical data on customer transactions to predict whether a
new customer is likely to default on a loan.
Example of Predictive Modeling
• Data: Historical loan data, including customer characteristics and default status.
• Output: A model that predicts whether a new applicant is likely to default, providing
a probability score:
o Probability of default: 0.75 (indicating a high risk).
o Probability of repayment: 0.25 (indicating a lower risk).
Summary
• Classification is a powerful technique in data mining that helps categorize data into
classes.
• Descriptive Modeling provides insights and understanding of data patterns without
making predictions.
• Predictive Modeling uses historical data to forecast future outcomes.
By leveraging both descriptive and predictive modeling, organizations can gain a deeper
understanding of their data and make informed decisions based on insights and forecasts.

Descriptive Modeling Examples


Example 1: Customer Segmentation in Retail
Dataset: Customer transaction data from a supermarket.

Customer ID | Age | Gender | Total Spend ($) | Purchase Frequency (per month)
1           | 25  | M      | 150             | 5
2           | 34  | F      | 200             | 3
3           | 29  | M      | 120             | 7
4           | 46  | F      | 300             | 2

Descriptive Analysis:
• Clustering could be used to segment customers into groups:
o Group A: Young frequent shoppers (e.g., ages 20-30, high frequency).
o Group B: Middle-aged occasional shoppers (e.g., ages 30-50, low frequency).
Example 2: Market Basket Analysis
Dataset: Transaction data showing items purchased together.

Transaction ID Items Purchased

1 Bread, Milk

2 Milk, Eggs

3 Bread, Butter

4 Bread, Milk, Eggs, Butter

5 Milk, Butter

Descriptive Analysis:
• Association Rule Mining can reveal patterns, such as:
o Customers who buy Milk often also buy Bread (e.g., rule: If Milk, then Bread).
o Support for Milk and Bread together: 2 out of 5 transactions (40%).
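As a quick check of these numbers, the support and confidence of the rule can be counted directly from the five transactions. Below is a minimal Python sketch; the transaction list simply restates the table above, and the variable names are illustrative only.

# Minimal sketch: support and confidence for the rule "Milk -> Bread"
transactions = [
    {"Bread", "Milk"},
    {"Milk", "Eggs"},
    {"Bread", "Butter"},
    {"Bread", "Milk", "Eggs", "Butter"},
    {"Milk", "Butter"},
]

n = len(transactions)
milk = sum("Milk" in t for t in transactions)                        # transactions with Milk
milk_and_bread = sum({"Milk", "Bread"} <= t for t in transactions)   # transactions with both

support = milk_and_bread / n          # fraction of all transactions containing Milk and Bread
confidence = milk_and_bread / milk    # of the Milk transactions, how many also contain Bread

print(f"support(Milk, Bread) = {support:.2f}")        # 0.40 (2 of 5)
print(f"confidence(Milk -> Bread) = {confidence:.2f}")  # 0.50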

Predictive Modeling Examples


Example 1: Loan Default Prediction
Dataset: Historical loan data with features relevant to predicting defaults.

Customer ID | Age | Annual Income ($) | Loan Amount ($) | Default (0=No, 1=Yes)
1           | 28  | 50,000            | 20,000          | 0
2           | 40  | 75,000            | 30,000          | 1
3           | 35  | 60,000            | 25,000          | 0
4           | 50  | 100,000           | 50,000          | 1
5           | 22  | 30,000            | 15,000          | 0

Predictive Analysis:



• A logistic regression model might be used to predict the likelihood of default.
• Model output for a new applicant (Age: 30, Income: $55,000, Loan: $25,000):
o Probability of Default: 0.35 (indicating a moderate risk).
Example 2: Sales Forecasting
Dataset: Historical sales data for a product.

Month Sales ($)

January 10,000

February 12,000

March 15,000

April 18,000

May 20,000

Predictive Analysis:
• A time series analysis could be applied to forecast future sales.
• Based on the trend, the model may predict:
o June Sales: $22,000
o July Sales: $25,000
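For the sales trend above, a very simple way to produce such a forecast is to fit a straight line to the five observed months and extend it forward. The sketch below does this with numpy.polyfit; it is only an illustration of the idea, since a real forecast would normally use a proper time-series model.

# Minimal sketch of a trend-based forecast for the monthly sales above
import numpy as np

months = np.arange(1, 6)                              # Jan..May encoded as 1..5
sales = np.array([10000, 12000, 15000, 18000, 20000])

slope, intercept = np.polyfit(months, sales, deg=1)   # least-squares straight line

for m, name in [(6, "June"), (7, "July")]:
    print(f"{name} forecast: ${slope * m + intercept:,.0f}")
# Prints roughly $22,800 for June and $25,400 for July, close to the
# $22,000 / $25,000 figures quoted above.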
Summary
• Descriptive Modeling helps in summarizing and understanding data patterns, such as
customer segmentation and market basket analysis.
• Predictive Modeling involves forecasting future outcomes, such as predicting loan
defaults and sales forecasting.

Classification in data mining is a supervised learning technique used to categorize data into
predefined classes or categories based on their attributes. It's a crucial part of both descriptive
and predictive modeling. Let me explain each concept with examples:



• Classification in Data Mining: Classification is the task of assigning items in a dataset
to predefined categories or classes. The algorithm learns from a labeled training
dataset and then applies that knowledge to classify new, unlabeled data.
Example: Classifying emails as spam or not spam based on their content and metadata.
• Descriptive Modeling: Descriptive modeling focuses on understanding and
summarizing the main characteristics of a dataset. It aims to uncover patterns or
relationships within the data without making predictions.
Example Dataset: Customer Segmentation
• Dataset: Customer demographics and purchase history
• Goal: Group customers into segments based on their characteristics
• Method: Clustering algorithms (e.g., K-means)
• Outcome: Identifying distinct customer groups like "high-value customers," "price-
sensitive buyers," etc.
• Predictive Modeling: Predictive modeling uses historical data to forecast future
outcomes or behaviors. It builds a mathematical model to make predictions on new,
unseen data.
Example Dataset: Credit Risk Assessment
• Dataset: Loan applicant information (income, credit score, debt, etc.)
• Goal: Predict whether an applicant is likely to default on a loan
• Method: Classification algorithms (e.g., Logistic Regression, Random Forest)
• Outcome: A model that can classify new loan applications as "high risk" or "low risk"

Classification is a fundamental problem in data mining that involves predicting the category
or class label of new observations based on past observations with known labels. Here's an
overview of the key components of classification in data mining.

1. Problem Definition
• Objective: The main goal is to build a model that can classify data points into
predefined classes based on input features.
• Input: A dataset consisting of features (independent variables) and corresponding
class labels (dependent variable).
• Output: A classification model that can predict the class label for new, unseen data.



2. General Approaches to Solving a Classification Problem
• Supervised Learning: This approach uses labeled training data to learn the mapping
from inputs to outputs.
o Training Phase: The model learns from labeled data (features + labels).
o Testing Phase: The model is tested on unseen data to evaluate its
performance.
• Unsupervised Learning: In some cases, you might use clustering to group similar
data points before applying classification.
• Semi-Supervised Learning: Combines labeled and unlabeled data to improve
learning accuracy.

3. Evaluation of Classifiers
Evaluation is critical to determine how well a classifier performs. Common evaluation
metrics include:
• Accuracy: The ratio of correctly predicted instances to the total instances.
Accuracy = (True Positives + True Negatives) / Total Instances
• Precision: The ratio of true positives to the sum of true positives and false positives.
Precision = True Positives / (True Positives + False Positives)
• Recall (Sensitivity): The ratio of true positives to the sum of true positives and false
negatives.
Recall = True Positives / (True Positives + False Negatives)
• F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
• Confusion Matrix: A table that summarizes the performance of a classification algorithm,
showing true positives, false positives, true negatives, and false negatives.
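These metrics can be computed directly from the four confusion-matrix counts. The following is a minimal Python sketch; the function name and the sample counts are illustrative, not part of any standard library.

# Minimal sketch: the four evaluation metrics from confusion-matrix counts
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example counts: TP=2, TN=1, FP=1, FN=1 (5 instances in total)
print(classification_metrics(tp=2, tn=1, fp=1, fn=1))
# -> accuracy 0.60, precision/recall/F1 about 0.67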

4. Classification Techniques
There are various techniques used in classification, including:
• Decision Trees: A flowchart-like structure where each internal node represents a
decision based on a feature, leading to class labels.
• Random Forest: An ensemble of decision trees that improves classification accuracy
by averaging the predictions of multiple trees.



• Support Vector Machines (SVM): A technique that finds the optimal hyperplane to
separate different classes in the feature space.
• K-Nearest Neighbors (KNN): A non-parametric method where the class of a data
point is determined by the majority class among its k-nearest neighbors.
• Naïve Bayes: A probabilistic classifier based on Bayes' theorem, assuming
independence among predictors.
• Neural Networks: A set of algorithms modeled after the human brain, ideal for
capturing complex patterns in data.
• Logistic Regression: A statistical method for predicting binary classes based on one
or more predictor variables.

Classification in Data Mining


Classification is a key component of data mining that focuses on assigning items in a
dataset to target categories or classes. It is a type of supervised learning, meaning the model
learns from a labeled training dataset, where the outcome (class label) is already known.
Key Concepts in Classification
• Training Dataset: A dataset used to train the classification model, consisting of input
features and known labels.
• Test Dataset: A separate dataset used to evaluate the performance of the model after
training.
• Class Labels: The target categories that the model is trying to predict (e.g., spam or
not spam, disease or no disease).
• Algorithms: Various algorithms can be used for classification, including:
o Decision Trees
o Random Forest
o Support Vector Machines (SVM)
o Naive Bayes
o Neural Networks
Classification Process
1. Data Collection: Gather the dataset that contains features and corresponding labels.
2. Data Preprocessing: Clean and prepare the data for analysis (handle missing values,
normalize/standardize data).



3. Feature Selection: Identify the most relevant features for classification.
4. Model Training: Use the training dataset to train the classification algorithm.
5. Model Evaluation: Assess the model's performance using a test dataset and metrics
such as accuracy, precision, recall, and F1-score.
6. Prediction: Use the trained model to classify new, unseen data.
Example of Classification
Scenario: Email Spam Detection
Dataset: Email Features

Email ID | Word Count | Contains Links | Contains Attachments | Spam (0=No, 1=Yes)
1        | 200        | Yes            | No                   | 1
2        | 150        | No             | Yes                  | 0
3        | 300        | Yes            | Yes                  | 1
4        | 100        | No             | No                   | 0
5        | 250        | Yes            | No                   | 1

Classification Process
1. Data Collection: The dataset contains features of emails and their corresponding
spam labels.
2. Data Preprocessing: Convert categorical variables into numerical format (e.g., "Yes"
= 1, "No" = 0).
3. Feature Selection: Use features like "Word Count," "Contains Links," and "Contains
Attachments."
4. Model Training: Train a classification algorithm (e.g., Decision Tree) on the training
dataset.
5. Model Evaluation: Use a test dataset to evaluate the model's performance:
o Accuracy: Percentage of correctly classified emails.
o Precision: Correctly predicted spam emails divided by all predicted spam
emails.
o Recall: Correctly predicted spam emails divided by all actual spam emails.



6. Prediction: Classify new emails based on the trained model.
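As an illustration of steps 4-6, the small email table above can be fed to scikit-learn's DecisionTreeClassifier. The column names and the new email below are hypothetical, and with only five rows this is a toy model rather than a realistic spam filter (a real pipeline would also hold out a test set).

# Minimal sketch of the spam-detection process above with scikit-learn
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

emails = pd.DataFrame({
    "word_count":      [200, 150, 300, 100, 250],
    "contains_links":  [1, 0, 1, 0, 1],      # Yes = 1, No = 0
    "has_attachments": [0, 1, 1, 0, 0],
    "spam":            [1, 0, 1, 0, 1],
})

X, y = emails.drop(columns="spam"), emails["spam"]
model = DecisionTreeClassifier(random_state=0).fit(X, y)   # step 4: model training

# Step 6: classify a new, unseen email
new_email = pd.DataFrame([{"word_count": 220, "contains_links": 1, "has_attachments": 0}])
print("Predicted spam label:", model.predict(new_email)[0])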
Performance Metrics
• Accuracy: (True Positives + True Negatives) / Total Samples
• Precision: True Positives / (True Positives + False Positives)
• Recall: True Positives / (True Positives + False Negatives)
• F1-Score: 2 * (Precision * Recall) / (Precision + Recall)
Example of Results
Assuming the model was evaluated and showed the following performance metrics:
• Accuracy: 90%
• Precision: 85%
• Recall: 80%
• F1-Score: 82%
Key Note:
• Classification is a powerful tool in data mining used to categorize data into
predefined classes based on historical data.
• The spam detection example illustrates the complete classification process, from data
collection to model evaluation.
• Understanding the performance metrics is crucial for assessing the effectiveness of
classification models.

Applications of Classification:
• Spam Filtering: Classifying emails as spam or not spam.
• Medical Diagnosis: Predicting the presence of a disease based on patient data.
• Credit Scoring: Classifying loan applicants as high or low risk.
• Image Recognition: Identifying objects or people in images.

calculating performance metrics for a classification model


Let's go through an example of calculating performance metrics for a classification model on
a simple dataset. Common classification performance metrics include:
• Accuracy
• Precision



• Recall
• F1-Score

Understanding Confusion Matrix Components


In binary classification, a confusion matrix is a helpful tool that summarizes the
performance of a classification algorithm. It allows us to break down the predictions
into four categories:
• True Positives (TP)
• True Negatives (TN)
• False Positives (FP)
• False Negatives (FN)
Let’s clarify each of these terms using the simple dataset provided earlier.
Sample Dataset Recap

Actual Predicted

1 1

0 1

1 1

0 0

1 0

0 1

1 1

0 0



Definitions

1. True Positives (TP):


o Definition: The number of positive instances that were correctly predicted
as positive.
o Example: In our dataset, when the actual value is 1 (positive) and the
predicted value is also 1, it counts as a true positive.
o Count:
▪ Rows:
▪ (1, 1)
▪ (1, 1)
▪ (1, 1)
▪ TP = 3
2. True Negatives (TN):
o Definition: The number of negative instances that were correctly
predicted as negative.
o Example: In our dataset, when the actual value is 0 (negative) and the
predicted value is also 0, it counts as a true negative.
o Count:
▪ Rows:
▪ (0, 0)
▪ (0, 0)
▪ TN = 2
3. False Positives (FP):
o Definition: The number of negative instances that were incorrectly
predicted as positive.
o Example: In our dataset, when the actual value is 0 (negative) but the
predicted value is 1 (positive), it counts as a false positive.
o Count:
▪ Rows:
▪ (0, 1)



▪ (0, 1)
▪ FP = 2
4. False Negatives (FN):
o Definition: The number of positive instances that were incorrectly
predicted as negative.
o Example: In our dataset, when the actual value is 1 (positive) but the
predicted value is 0 (negative), it counts as a false negative.
o Count:
▪ Rows:
▪ (1, 0)
▪ FN = 1
Summary of Counts

Metric | Count | Explanation
TP     | 3     | Actual = 1, Predicted = 1 (Correctly identified positives)
TN     | 2     | Actual = 0, Predicted = 0 (Correctly identified negatives)
FP     | 2     | Actual = 0, Predicted = 1 (Incorrectly identified positives)
FN     | 1     | Actual = 1, Predicted = 0 (Incorrectly identified negatives)

Visual Representation of the Confusion Matrix


Below is a visual representation of the confusion matrix based on the above counts:

                   Actual = 1    Actual = 0
Predicted = 1    [    TP            FP    ]
Predicted = 0    [    FN            TN    ]



Understanding TP, TN, FP, and FN is crucial for evaluating the performance of
classification models. These metrics provide insight into how well the model is
performing in terms of correctly identifying positive and negative instances, allowing
for informed decisions about model improvements and adjustments.
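The four counts can be derived mechanically from the actual/predicted pairs. A minimal sketch, using the eight rows above, together with the equivalent scikit-learn confusion_matrix call:

# Minimal sketch: deriving TP, TN, FP and FN from the actual/predicted pairs
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 0, 1, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 1, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
print(tp, tn, fp, fn)                     # 3 2 2 1

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels [0, 1]
print(confusion_matrix(actual, predicted))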

Example:
Suppose we have a binary classification problem where we want to classify whether an email
is "spam" or "not spam." We have the following dataset (predictions vs actual values):

Email ID Actual Label (True) Predicted Label

1 Spam Spam

2 Not Spam Spam

3 Spam Not Spam

4 Not Spam Not Spam

5 Spam Spam

Let’s assume:
• "Spam" is the positive class.
• "Not Spam" is the negative class.
Step 1: Create the Confusion Matrix
The confusion matrix is a table that shows the true positive, false positive, false negative, and
true negative counts:

Predicted: Spam Predicted: Not Spam

Actual: Spam True Positive (TP) = 2 False Negative (FN) = 1

Actual: Not Spam False Positive (FP) = 1 True Negative (TN) = 1

From the confusion matrix:


• True Positives (TP) = 2 (emails correctly classified as spam)
• True Negatives (TN) = 1 (emails correctly classified as not spam)
• False Positives (FP) = 1 (emails incorrectly classified as spam)



• False Negatives (FN) = 1 (emails incorrectly classified as not spam)
Step 2: Calculate Performance Metrics
1. Accuracy: The proportion of correct predictions (both spam and not spam) out of all
predictions.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (2 + 1) / 5 = 0.60
So, the accuracy is 60%.

2. Precision (for spam): The proportion of correctly predicted spam emails out of all emails
predicted as spam.
Precision = TP / (TP + FP) = 2 / (2 + 1) ≈ 0.67
So, the precision is 67%.

3. Recall (for spam): The proportion of actual spam emails that were correctly predicted.
Recall = TP / (TP + FN) = 2 / (2 + 1) ≈ 0.67
So, the recall is 67%.

4. F1-Score: The harmonic mean of precision and recall, which balances both metrics.
F1 = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.67 × 0.67) / (0.67 + 0.67) ≈ 0.67
So, the F1-Score is 67%.


Summary of Metrics:
• Accuracy: 60%



• Precision: 67%
• Recall: 67%
• F1-Score: 67%
These metrics provide a clear picture of the model's performance in detecting spam emails,
balancing both false positives and false negatives.

Example 2:

Performance Metric Calculation with a Dataset


Let's walk through an example of calculating performance metrics using a small dataset for
binary classification. We will use the following dataset with actual and predicted values:
Sample Dataset

Actual Predicted

1 1

0 1

1 1

0 0

1 0

0 1

1 1

0 0

Step 1: Define the Confusion Matrix


From the dataset, we can summarize the outcomes in a confusion matrix:
• True Positives (TP): The model correctly predicts positive instances.



• True Negatives (TN): The model correctly predicts negative instances.
• False Positives (FP): The model incorrectly predicts positive instances.
• False Negatives (FN): The model incorrectly predicts negative instances.
Confusion Matrix Calculation
Count the values based on the dataset:
• TP (True Positives): 3 (Predicted 1 when Actual is 1)
• TN (True Negatives): 2 (Predicted 0 when Actual is 0)
• FP (False Positives): 2 (Predicted 1 when Actual is 0)
• FN (False Negatives): 1 (Predicted 0 when Actual is 1)
Step 2: Calculate Performance Metrics
Now, we can calculate the performance metrics using the counts from the confusion matrix:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (3 + 2) / 8 = 0.625 (62.5%)
Precision = TP / (TP + FP) = 3 / (3 + 2) = 0.60 (60%)
Recall (Sensitivity) = TP / (TP + FN) = 3 / (3 + 1) = 0.75 (75%)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.60 × 0.75) / (0.60 + 0.75) ≈ 0.67 (67%)

Summary of Performance Metrics

Metric    | Value
Accuracy  | 62.5%
Precision | 60%
Recall    | 75%
F1-Score  | 67%

Example 3:

Let's consider another simple dataset example, this time focusing on a binary classification
task such as predicting whether a patient has a disease based on certain features.
Example Dataset: Disease Prediction
Consider a dataset for predicting whether a patient has a specific disease (1 = Disease, 0 = No
Disease) based on two features: age and cholesterol level. Here’s the test dataset with the
actual and predicted labels:

Patient ID | Actual Label (1 = Disease, 0 = No Disease) | Predicted Label (1 = Disease, 0 = No Disease)
1  | 1 | 1
2  | 0 | 1
3  | 1 | 1
4  | 0 | 0
5  | 1 | 0
6  | 0 | 0
7  | 1 | 1
8  | 0 | 0
9  | 1 | 0
10 | 0 | 1

Step 1: Count the True Positives, False Positives, True Negatives, and False Negatives
From the dataset, we can summarize the results:
• True Positives (TP): Correctly predicted patients with the disease
• False Positives (FP): Incorrectly predicted patients with
the disease (predicted disease but actually no disease)
• True Negatives (TN): Correctly predicted patients without the disease
• False Negatives (FN): Incorrectly predicted patients without the disease (predicted
no disease but actually has disease)
Counts from the Dataset
• TP = 3 (Patients 1, 3, and 7)
• FP = 2 (Patients 2 and 10)
• TN = 3 (Patients 4, 6, and 8)
• FN = 2 (Patients 5 and 9)
Step 2: Calculate Performance Metrics



1. Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (3 + 3) / 10 = 0.60 or 60%
2. Precision:
Precision = TP / (TP + FP) = 3 / (3 + 2) = 0.60 or 60%
3. Recall:
Recall = TP / (TP + FN) = 3 / (3 + 2) = 0.60 or 60%
4. F1-Score:
F1-Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.60 × 0.60) / (0.60 + 0.60) = 0.60 or 60%

Summary of Performance Metrics

Metric    | Value
Accuracy  | 60.00%
Precision | 60.00%
Recall    | 60.00%
F1-Score  | 60.00%

Conclusion
In this example, we calculated performance metrics based on a simple disease prediction
dataset.
• The model has 60% accuracy, indicating that it correctly classified just over half of the
patients.
• The precision and recall are both 60%, suggesting that the model has room for
improvement, particularly in distinguishing patients with the disease from those
without it.
• The F1-Score of 60% reflects this balance between precision and recall.

Decision Trees
Decision Trees are a popular method for classification and regression tasks in data mining
and machine learning. They model decisions and their possible consequences as a tree-like
structure, making them easy to interpret and visualize.

1. Decision Tree Construction


The construction of a decision tree typically involves the following steps:
• Step 1: Select the Best Attribute
o Choose the attribute that best separates the data into classes. Common criteria
include:
▪ Gini Impurity: Measures the impurity of a dataset. Lower values are
preferred.
▪ Information Gain: Measures the reduction in entropy after a dataset is
split on an attribute. Higher values indicate better attributes.
▪ Gain Ratio: Adjusts Information Gain based on the intrinsic
information of a split, balancing the criteria for better performance.
• Step 2: Create a Node
o Create a node in the tree for the selected attribute.



• Step 3: Split the Dataset
o Divide the dataset into subsets based on the selected attribute’s values.
• Step 4: Recur for Each Subset
o Apply the same process recursively for each subset, creating child nodes.
• Step 5: Stop Condition
o The recursion stops when:
▪ All instances in a subset belong to the same class.
▪ There are no remaining attributes to split the data.
▪ A predefined tree depth or minimum number of samples per leaf is
reached.

2. Methods for Expressing Attribute Test Conditions


When constructing decision trees, expressing attribute test conditions is essential for
determining how to split the data. Here are common methods:
• Binary Tests:
o Categorical Attributes: Test if an attribute equals a specific value (e.g., If
Color = "Red").
o Example:
▪ If Color is an attribute with values {Red, Blue, Green}, splits can be
made based on equality.
• Numeric Tests:
o Continuous Attributes: Test if an attribute is greater than or less than a
threshold (e.g., If Age > 30).
o Example:
▪ If Age is a numeric attribute, splits can be made based on inequalities.
• Range Tests:
o Define a test condition that checks if an attribute falls within a specific range
(e.g., If Income >= 50000 and Income < 100000).
o Example:
▪ Useful for continuous attributes where multiple splits might be
required.
• Multi-way Tests:



o For categorical attributes with multiple values, tests can be expressed as multi-
way conditions (e.g., If City in {"New York", "Los Angeles"}).
o Example:
▪ Allows for splitting based on multiple categories simultaneously.
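These four kinds of test conditions are easy to picture as simple boolean predicates on a record. The sketch below uses a hypothetical record with made-up attribute values purely for illustration.

# Minimal sketch: the four kinds of attribute test conditions as Python predicates
record = {"Color": "Red", "Age": 35, "Income": 62000, "City": "New York"}

binary_test   = record["Color"] == "Red"                        # categorical equality test
numeric_test  = record["Age"] > 30                              # threshold on a number
range_test    = 50000 <= record["Income"] < 100000              # value within a range
multiway_test = record["City"] in {"New York", "Los Angeles"}   # membership in a set of categories

print(binary_test, numeric_test, range_test, multiway_test)     # True True True True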

Example 1 of Decision Tree Construction


Let's illustrate the decision tree construction process with a simple example. Consider a
dataset used for classifying whether a person will buy a computer based on certain attributes.
Dataset

Age Income Student Credit Rating Buys Computer

Young High No Fair No

Young High No Excellent No

Middle-aged High No Fair Yes

Senior Medium No Fair Yes

Senior Low Yes Fair Yes

Senior Low Yes Excellent No

Middle-aged Low Yes Excellent Yes

Young Medium No Fair No

Young Medium Yes Excellent Yes

Senior Medium Yes Excellent Yes

Middle-aged Medium No Excellent Yes

Steps for Decision Tree Construction


Step 1: Select the Best Attribute



Using Information Gain or Gini Impurity, we evaluate each attribute to determine which
one best separates the data.
• Let's assume Age is the attribute that provides the highest Information Gain.
Step 2: Create a Node
• Create a decision node for Age.
Step 3: Split the Dataset
• Based on the Age attribute, we split the dataset into three branches: Young, Middle-
aged, and Senior.
Step 4: Recur for Each Subset
Now we apply the same process to each subset:
1. For Young:
o Remaining attributes: Income, Student, Credit Rating.
o Assume we choose Student next based on Information Gain.
2. For Middle-aged:
o Remaining attributes: Income, Credit Rating.
o Assume we choose Credit Rating.
3. For Senior:
o Remaining attributes: Income, Student, Credit Rating.
o Assume we choose Credit Rating.
Step 5: Create Child Nodes
• Young:
o If Student = Yes: Buys Computer = Yes
o If Student = No: Buys Computer = No
• Middle-aged:
o If Credit Rating = Fair: Buys Computer = Yes
o If Credit Rating = Excellent: Buys Computer = Yes
• Senior:
o If Credit Rating = Fair: Buys Computer = Yes
o If Credit Rating = Excellent: Buys Computer = Yes



Final Decision Tree

                              Age
           /                   |                      \
        Young             Middle-aged                Senior
       /      \            /        \               /       \
  Student   Student   Credit =   Credit =      Credit =   Credit =
   = Yes     = No       Fair     Excellent       Fair     Excellent
     |         |          |          |             |          |
    Yes       No         Yes        Yes           Yes        Yes
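Step 1 (the assumption that Age provides the highest Information Gain) can be checked numerically. The sketch below rebuilds the table with pandas and computes the information gain of each attribute; on this particular 11-row table, Age should indeed come out highest. The helper names are illustrative.

# Minimal sketch: information gain of each attribute on the "buys computer" table
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Age": ["Young", "Young", "Middle-aged", "Senior", "Senior", "Senior",
            "Middle-aged", "Young", "Young", "Senior", "Middle-aged"],
    "Income": ["High", "High", "High", "Medium", "Low", "Low", "Low",
               "Medium", "Medium", "Medium", "Medium"],
    "Student": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"],
    "Credit": ["Fair", "Excellent", "Fair", "Fair", "Fair", "Excellent", "Excellent",
               "Fair", "Excellent", "Excellent", "Excellent"],
    "Buys": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes"],
})

def entropy(labels):
    # Entropy of a set of class labels
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df, attr, target="Buys"):
    # Entropy before the split minus the weighted entropy of each branch
    weighted = sum(len(sub) / len(df) * entropy(sub[target])
                   for _, sub in df.groupby(attr))
    return entropy(df[target]) - weighted

for attr in ["Age", "Income", "Student", "Credit"]:
    print(attr, round(information_gain(data, attr), 3))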

Example 2:

We will classify whether a fruit is an Apple or an Orange based on two attributes: Color and Size.

Dataset

Color Size Fruit

Red Small Apple

Red Large Apple

Orange Small Orange

Orange Large Orange

Yellow Small Orange



Red Small Apple

Steps for Decision Tree Construction


Step 1: Select the Best Attribute
We will first look at the attributes to decide which one helps us differentiate
between Apple and Orange.
• Color is a good choice because it can directly separate the fruits.
Step 2: Create a Node
• Create a decision node based on Color.
Step 3: Split the Dataset
Now, we split the dataset based on the Color attribute:
• If Color = Red: Possible fruit = Apple
• If Color = Orange: Possible fruit = Orange
• If Color = Yellow: Possible fruit = Orange
Step 4: Create Child Nodes
Next, we analyze the Size attribute for more specific distinctions:
1. For Red:
o Size = Small: Fruit = Apple
o Size = Large: Fruit = Apple
2. For Orange:
o Size = Small: Fruit = Orange
o Size = Large: Fruit = Orange
Final Decision Tree
               Color
            /        \
          Red       Orange
         /   \      /    \
     Small  Large Small  Large
       |      |     |      |
     Apple Apple Orange  Orange
Explanation
• If the fruit is Red, it is classified as an Apple.
• If the fruit is Orange, it can be classified as an Orange.
• The Size attribute confirms our classification, but since we already know the colors,
we can directly classify them.

Split in Decision Trees


In decision trees, a split refers to the process of dividing the dataset into subsets based on the
value of a chosen attribute (feature). The goal of a split is to create groups of data points that
are more homogeneous (similar) with respect to the target variable. The effectiveness of
a split can be evaluated using various measures, such as Information Gain, Gini Impurity,
or Entropy.
Example of a Split in Decision Trees
Let’s consider a simple example to illustrate how a split works in a decision tree.
Dataset
Imagine we have a dataset of animals with the following attributes:

Animal Color Size Type

Dog Brown Large Mammal

Cat Black Small Mammal

Parrot Green Small Bird

Dog Black Large Mammal

Parrot Red Small Bird

Cat White Small Mammal

Step 1: Choose an Attribute to Split


Let's say we want to split the dataset based on the Color attribute. The unique values for
the Color attribute in this dataset are Brown, Black, Green, Red, and White.



Step 2: Create the Splits
Based on the Color attribute, we can make the following splits:
1. If Color = Brown:
o Animals: Dog
o Type: Mammal
2. If Color = Black:
o Animals: Cat, Dog
o Types: Mammal, Mammal
3. If Color = Green:
o Animals: Parrot
o Type: Bird
4. If Color = Red:
o Animals: Parrot
o Type: Bird
5. If Color = White:
o Animals: Cat
o Type: Mammal
Visualization of the Split
Here is how the splits look, in text form:

    Color = Brown  -> Mammal (Dog)
    Color = Black  -> Mammal (Cat, Dog)
    Color = Green  -> Bird (Parrot)
    Color = Red    -> Bird (Parrot)
    Color = White  -> Mammal (Cat)



In this example, the split based on the Color attribute helps us classify animals into different
types (Mammal or Bird). The decision tree can continue to split further based on other
attributes (like Size) to refine the classification.
By making strategic splits, decision trees can effectively categorize data points and provide
clear, interpretable models for classification tasks.

Example of an Indian Cars Dataset


Here’s a simple dataset of popular Indian cars with various attributes such
as Make, Model, Year, Color, Fuel Type, and Price.
Dataset

Make Model Year Color Fuel Type Price (INR)

Maruti Suzuki Swift 2020 Blue Petrol 550000

Hyundai Creta 2021 White Diesel 1000000

Tata Nexon 2019 Grey Petrol 700000

Mahindra Thar 2021 Red Diesel 1500000

Kia Seltos 2020 Silver Petrol 900000

Honda City 2021 Black Petrol 1100000

Step 1: Choosing an Attribute to Split


Let’s say we want to split the dataset based on the Fuel Type attribute. The unique values for
the Fuel Type attribute in this dataset are Petrol and Diesel.
Step 2: Creating the Splits
Based on the Fuel Type attribute, we can create the following splits:
1. If Fuel Type = Petrol:
o Cars:
▪ Maruti Suzuki Swift (2020, Blue, ₹5,50,000)
▪ Tata Nexon (2019, Grey, ₹7,00,000)
▪ Kia Seltos (2020, Silver, ₹9,00,000)
▪ Honda City (2021, Black, ₹11,00,000)



o Price Range: ₹5,50,000 - ₹11,00,000
2. If Fuel Type = Diesel:
o Cars:
▪ Hyundai Creta (2021, White, ₹10,00,000)
▪ Mahindra Thar (2021, Red, ₹15,00,000)
o Price Range: ₹10,00,000 - ₹15,00,000
Visualization of the Split
Here is a simple text representation of how the split looks:

    Fuel Type = Petrol -> Swift, Nexon, Seltos, City (₹5,50,000 - ₹11,00,000)
    Fuel Type = Diesel -> Creta, Thar (₹10,00,000 - ₹15,00,000)

In this example, the split based on the Fuel Type attribute allows us to categorize the cars into two
groups: those that run on Petrol and those that are Diesel. Each group can then be further analyzed
or split based on other attributes (like Price or Year) to refine classifications.

Using splits effectively helps in making decisions or predictions about car types based on
various features, which is useful for both consumers and car manufacturers in India.
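The same Fuel Type split can be reproduced with a pandas groupby, which returns exactly the two subsets described above (the column names below are illustrative).

# Minimal sketch: performing the Fuel Type split on the car table with pandas
import pandas as pd

cars = pd.DataFrame({
    "Make":  ["Maruti Suzuki", "Hyundai", "Tata", "Mahindra", "Kia", "Honda"],
    "Model": ["Swift", "Creta", "Nexon", "Thar", "Seltos", "City"],
    "Fuel":  ["Petrol", "Diesel", "Petrol", "Diesel", "Petrol", "Petrol"],
    "Price": [550000, 1000000, 700000, 1500000, 900000, 1100000],
})

for fuel, subset in cars.groupby("Fuel"):
    print(f"{fuel}: {list(subset['Model'])}, "
          f"price range {int(subset['Price'].min()):,} - {int(subset['Price'].max()):,}")
# Diesel: ['Creta', 'Thar'], price range 1,000,000 - 1,500,000
# Petrol: ['Swift', 'Nexon', 'Seltos', 'City'], price range 550,000 - 1,100,000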

Measures for Selecting the Best Split in Decision Trees


When constructing a decision tree, selecting the best attribute to split the data at each node is
crucial for creating an effective model. Here are some common measures used to evaluate the
quality of splits:
1. Information Gain
• Definition: Information Gain measures the reduction in entropy or uncertainty about
the target variable after the dataset is split based on an attribute.
• Calculation:



o Compute the Entropy before the split.
o Compute the Entropy for each branch after the split.
o Information Gain = Entropy (before) - Weighted Average Entropy (after).
• Formula:
Gain(S, A) = Entropy(S) − Σ_v ( |Sv| / |S| ) × Entropy(Sv), summed over all values v of attribute A
• where S is the dataset and Sv is the subset of S for which attribute A takes the
value v.
2. Gini Impurity
• Definition: Gini Impurity measures the probability of incorrectly classifying a
randomly chosen element if it was randomly labeled according to the distribution of
labels in the subset.
• Calculation: Lower Gini Impurity values indicate a better split.
• Formula:
Gini(S) = 1 − Σ_{i=1..C} pi²
where pi is the probability of class i in the set S and C is the number of classes.
3. Entropy
• Definition: Entropy measures the amount of disorder or unpredictability in the
dataset. It quantifies the impurity of the dataset.
• Calculation: Higher entropy indicates more disorder, while lower entropy indicates
more predictability.
• Formula:
Entropy(S) = − Σ_i pi log2(pi)
where pi is the proportion of class i in the dataset S.


4. Chi-Squared Statistic
• Definition: The Chi-Squared test assesses whether the observed distribution of data
across different categories is significantly different from what would be expected if
there were no relationship between the variables.
• Calculation: A higher Chi-Squared value indicates a stronger association between the
attribute and the target variable.
• Formula:
χ² = Σ_i (Oi − Ei)² / Ei
where Oi is the observed frequency and Ei is the expected frequency.


5. Mean Squared Error (MSE) (For Regression Trees)
Definition: MSE measures the average of the squares of the errors—that is, the average
squared difference between estimated values and actual value.
• Calculation: Used for regression trees to evaluate the quality of splits.
• Formula:
MSE = (1/n) Σ_i (yi − ŷi)²
where yi is the true value and ŷi is the predicted value.

These measures help in determining the best attribute to split the dataset at each node
in a decision tree. By evaluating the quality of splits using these metrics, we can build
a more accurate and efficient decision tree model.
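To make the first three measures concrete, here is a minimal sketch of entropy and Gini impurity computed from the class counts in a node; the example counts are made up to show a maximally impure node, a mostly pure node, and a pure node.

# Minimal sketch: entropy and Gini impurity for a node, given its class counts
import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(entropy([5, 5]),  gini([5, 5]))    # most impure 2-class node: 1.0, 0.5
print(entropy([9, 1]),  gini([9, 1]))    # mostly one class: ~0.469, ~0.18
print(entropy([10, 0]), gini([10, 0]))   # pure node: 0.0, 0.0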

Here are some examples of datasets that are commonly used for decision tree
classification tasks:
1. Drugs A, B, C, X, Y for Decision Trees:
o Features: Age, Sex, Blood Pressure, Cholesterol



o Target: The drug that each patient responded to.
o This dataset is useful for understanding how different patient characteristics
influence the effectiveness of various drugs.
2. Diabetes Dataset:
o This dataset is often used to build and optimize decision tree classifiers. It
includes various health metrics and the target variable indicates whether a
patient has diabetes or not.
o It demonstrates how to apply decision trees using the Python Scikit-learn
package.
3. Iris Dataset:
o Features: Sepal Length, Sepal Width, Petal Length, Petal Width
o Target: Species of the iris flower (Setosa, Versicolor, Virginica).
o This classic dataset is frequently used for classification tasks and is ideal for
beginners.
4. Titanic Dataset:
o Features: Passenger Class, Sex, Age, Siblings/Spouses Aboard,
Parents/Children Aboard, Fare
o Target: Survival (Yes/No).
o This dataset is popular for demonstrating classification techniques and
understanding survival rates based on various factors.
5. Wine Quality Dataset:
o Features: Fixed Acidity, Volatile Acidity, Citric Acid, Residual Sugar,
Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, Density, pH, Sulphates,
Alcohol
o Target: Quality score (ranging from 0 to 10).
o This dataset is used to predict the quality of wine based on its chemical
properties.

Example:

Let's imagine we're working with the Drugs A, B, C, X, Y for Decision Trees dataset. Here's a
simplified example of how a decision tree might work to predict which drug is best for a
patient:

Scenario: We have a patient with the following characteristics:



• Age: 55
• Sex: Male
• Blood Pressure: High
• Cholesterol: Normal
Decision Tree:
1. Root Node: The tree starts by asking a question about the most important feature.
Let's say the tree decides to split based on Blood Pressure.
2. First Split: The tree branches into two paths: "High Blood Pressure" and
"Normal Blood Pressure".
3. High Blood Pressure Branch: Since our patient has high blood pressure, we follow
this branch. The next question might be about Age. The tree could split into "Age <
60" and "Age >= 60".
4. Age < 60 Branch: Our patient is 55, so we follow this branch. The tree might then
split based on Sex. Let's say it splits into "Male" and "Female".
5. Male Branch: Our patient is male, so we follow this branch. The tree might have
reached a leaf node, meaning it's made a prediction. Let's say the tree predicts
that Drug X is the best choice for this patient.
Outcome: The decision tree predicts that Drug X is the best option for this patient
based on their age, sex, blood pressure, and cholesterol.
Important Note: This is a very simplified example. Real-world decision trees can be
much more complex with many levels of splits and multiple features considered at
each node.

Algorithm for Decision tree Induction

The decision tree induction algorithm is a popular method used in data


mining and machine learning to create a model that predicts the value of a target
variable based on several input features. Here’s a structured overview of how the
algorithm typically works:
Key Steps in Decision Tree Induction
1. Select the Best Attribute:
o The algorithm evaluates the attributes in the dataset to determine which one best
separates the data into distinct classes. This is often done using metrics
like Information Gain, Gini Impurity, or Chi-Squared.
2. Create a Decision Node:
o Once the best attribute is identified, a decision node is created in the tree. This
node represents the attribute that will be used to split the data.
3. Split the Dataset:
o The dataset is divided into subsets based on the values of the selected attribute.
Each subset corresponds to a branch of the tree.
4. Repeat the Process:
o For each subset, the algorithm recursively applies the same process: selecting
the best attribute, creating decision nodes, and splitting the data until one of the
stopping criteria is met (e.g., all instances in a subset belong to the same class,
or a maximum tree depth is reached).
5. Assign Class Labels:
o Once the tree is fully grown, each leaf node is assigned a class label based on
the majority class of the instances that reach that leaf.
Stopping Criteria
The induction process can stop based on several criteria:
• All instances in a node belong to the same class.
• There are no remaining attributes to split on.
• A predefined maximum depth of the tree is reached.
• The number of instances in a node is below a certain threshold.

Example of Decision Tree Induction


A well-known algorithm for decision tree induction is ID3 (Iterative Dichotomiser 3),
developed by J. Ross Quinlan in the 1980s. It uses the concept of Information Gain to
select the attribute that best separates the data.

Algorithm and its implementation, using the ID3 (Iterative Dichotomiser 3) algorithm as an
example.
ID3 Algorithm
1. Start with the root node: This node represents the entire dataset.
2. Calculate Information Gain for each attribute:
o Entropy: A measure of the impurity of the dataset. A pure dataset (all
instances belong to the same class) has an entropy of 0.



o Information Gain (IG): The difference in entropy before and after splitting
the dataset based on an attribute. The attribute with the highest information
gain is chosen for the split.
3. Create a decision node: The chosen attribute becomes the decision node, with
branches for each of its possible values.
4. Split the dataset: Instances are directed down the branches of the decision node
based on their attribute value.
5. Recursively repeat steps 2-4: The process is repeated for each subset of data created
by the splits until one of the stopping criteria is met (e.g., all instances in a node
belong to the same class, or a maximum depth is reached).
6. Assign class labels: Leaf nodes (nodes with no further splits) are assigned the class
label that is most common among the instances that reach that node.

Implementation Example (Python)


import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load your dataset (replace 'your_dataset.csv' with your file)
data = pd.read_csv('your_dataset.csv')

# Separate features (X) and target variable (y)
X = data.drop('target_variable', axis=1)  # Replace 'target_variable' with your target column
y = data['target_variable']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree classifier (entropy criterion, as in ID3)
tree = DecisionTreeClassifier(criterion='entropy')

# Train the model
tree.fit(X_train, y_train)

# Make predictions on the test set
y_pred = tree.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Explanation:
• Import libraries: pandas for data
manipulation, DecisionTreeClassifier from sklearn.tree for the decision tree
model, train_test_split for splitting data, and accuracy_score for evaluation.
• Load data: Replace 'your_dataset.csv' with the path to your dataset.
• Prepare data: Separate features (X) and the target variable (y).
• Split data: Create training and testing sets for model training and evaluation.
• Create classifier: DecisionTreeClassifier(criterion='entropy') creates a decision tree
model that uses entropy-based information gain, the splitting criterion used by ID3
(scikit-learn itself implements an optimized version of the CART algorithm).
• Train the model: tree.fit(X_train, y_train) fits the model to the training data.
• Make predictions: tree.predict(X_test) uses the trained model to predict the target
variable for the test data.
• Evaluate accuracy: accuracy_score(y_test, y_pred) calculates the accuracy of the
model's predictions.
Important Notes:
• This is a simplified example. Real-world implementations might involve data
preprocessing, feature engineering, hyperparameter tuning, and other techniques.
• There are many other decision tree algorithms besides ID3 (e.g., C4.5, CART). The
choice of algorithm depends on the specific dataset and task.

Here are a few examples of datasets that are commonly used for decision tree
classification tasks:



1. Iris Dataset:
o Description: This classic dataset contains measurements of iris flowers from
three different species (Setosa, Versicolor, and Virginica). The features
include sepal length, sepal width, petal length, and petal width.
o Target Variable: Species of the iris flower.
2. Titanic Dataset:
o Description: This dataset includes information about the passengers on the
Titanic, such as age, sex, ticket class, and whether they survived the disaster.
o Target Variable: Survival (0 = No, 1 = Yes).
3. Breast Cancer Wisconsin Dataset:
o Description: This dataset contains features computed from a digitized image
of a fine needle aspirate (FNA) of a breast mass. It includes attributes like
radius, texture, perimeter, area, and smoothness.
o Target Variable: Diagnosis (0 = Benign, 1 = Malignant).
4. Drugs A, B, C, X, Y Dataset:
o Description: This dataset features attributes such as age, sex, blood pressure,
and cholesterol levels of patients, with the target being the drug that each
patient responded to.
o Target Variable: Drug response (A, B, C, X, Y).
5. Diabetes Dataset:
o Description: This dataset includes various medical predictor variables and one
target variable, indicating whether a patient has diabetes.
o Target Variable: Diabetes status (0 = No, 1 = Yes).

Naive-Bayes Classifier

The Naive Bayes Classifier is a family of probabilistic algorithms based on Bayes' Theorem,
which is particularly effective for classification tasks. Here’s a detailed overview of its key
features and how it works:
Key Features of Naive Bayes Classifier
1. Probabilistic Nature: It calculates the probability of each class given the features and
assigns the class with the highest probability to the instance.



2. Assumption of Independence: The "naive" aspect comes from the assumption that all
features are conditionally independent given the class label. This simplifies the
computation significantly.
3. Types of Naive Bayes Classifiers:
o Gaussian Naive Bayes: Assumes that the features follow a normal distribution.
o Multinomial Naive Bayes: Suitable for discrete counts, often used in text
classification.
o Bernoulli Naive Bayes: Works with binary/boolean features.
How It Works
1. Training Phase:
o Calculate the prior probability for each class.
o For each feature, calculate the likelihood of the feature given each class.
2. Prediction Phase:
o For a new instance, compute the posterior probability for each class
using Bayes' Theorem:
P(C | x1, ..., xn) ∝ P(C) × P(x1 | C) × P(x2 | C) × ... × P(xn | C)
o Select the class with the highest posterior probability.
Applications
• Text Classification: Commonly used for spam detection and sentiment analysis.
• Medical Diagnosis: Helps in predicting diseases based on symptoms.
• Recommendation Systems: Used to recommend products based on user preferences.
Advantages
• Simple and Fast: Easy to implement and computationally efficient.
• Effective with Large Datasets: Performs well even with a large number of features.
Disadvantages
• Independence Assumption: The assumption of feature independence may not hold
true in real-world scenarios, which can affect accuracy.



• Zero Probability Problem: If a category has a feature that was not present in the
training set, it can lead to zero probability. This can be mitigated using techniques
like Laplace smoothing.
In summary, the Naive Bayes Classifier is a powerful tool for classification tasks, especially
when dealing with large datasets and text data. Its simplicity and effectiveness make it a popular
choice in various applications.

Example Dataset (Product Purchase)

Age (Years) Income (Thousands) Purchase

25 50 Yes

30 60 Yes

35 70 Yes

40 80 No

45 90 No

50 100 No

Explanation:
• Age: The age of the individual in years.
• Income: The individual's annual income in thousands of dollars.
• Purchase: Whether the individual purchased the product (Yes or No).
Using this dataset, we can apply algorithms like Decision Trees or Naive Bayes to build
models that predict the likelihood of a purchase based on age and income.
For example, a simple decision tree might look like this:
• If Income > 75, then Purchase = No
• If Income <= 75, then Purchase = Yes
This is a very basic example, and real-world models would be more complex. However, it
illustrates how you can use a simple dataset to explore classification tasks.



Naive Bayes for Product Purchase
1. Prior Probabilities:
o P(Purchase = Yes) = 3/6 = 0.5
o P(Purchase = No) = 3/6 = 0.5
2. Likelihoods:
o We'll calculate the likelihoods for each feature (Age and Income) given each
purchase outcome. Let's simplify by grouping ages into ranges:
▪ Age Range 1 (25-35):
▪ P(Age Range 1 | Purchase = Yes) = 3/3 = 1
▪ P(Age Range 1 | Purchase = No) = 0/3 = 0
▪ Age Range 2 (40-50):
▪ P(Age Range 2 | Purchase = Yes) = 0/3 = 0
▪ P(Age Range 2 | Purchase = No) = 3/3 = 1
▪ Income: We'll do the same for income, grouping it into ranges (e.g.,
50-60, 70-80, etc.).
3. Prediction:
o Suppose we have a new customer with Age = 32 and Income = 65.
o We would calculate the posterior probability of Purchase = Yes and Purchase
= No using Bayes' Theorem:
▪ P(Purchase = Yes | Age = 32, Income = 65) = P(Age = 32, Income =
65 | Purchase = Yes) * P(Purchase = Yes) / P(Age = 32, Income = 65)
▪ P(Purchase = No | Age = 32, Income = 65) = P(Age = 32, Income = 65
| Purchase = No) * P(Purchase = No) / P(Age = 32, Income = 65)
o The outcome with the higher probability would be our prediction.
Key Points:
• Naive Assumption: Naive Bayes assumes that features are independent, meaning
knowing one feature doesn't affect the probability of another. This is often not true in
real-world data, but it can still be a surprisingly effective algorithm.
• Data Representation: The way we group data into ranges can impact the model's
performance. You might want to experiment with different groupings.
• Real-World Complexity: This is a very simplified example. Real-world Naive
Bayes models would typically use more features, handle continuous data more
precisely, and involve more sophisticated calculations.
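For comparison, scikit-learn's GaussianNB handles the two continuous features directly (no manual binning into age ranges) and can score the same new customer. A minimal sketch using the six rows of the purchase table:

# Minimal sketch: Gaussian Naive Bayes on the product-purchase table
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[25, 50], [30, 60], [35, 70], [40, 80], [45, 90], [50, 100]])
y = np.array([1, 1, 1, 0, 0, 0])   # 1 = Yes (purchase), 0 = No

model = GaussianNB().fit(X, y)
new_customer = np.array([[32, 65]])   # Age 32, Income 65 (thousands)

print("Predicted class:", model.predict(new_customer)[0])
print("P(No), P(Yes):", model.predict_proba(new_customer)[0].round(3))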



Example 2:
Let's create an example using the Iris dataset and the Naive Bayes classifier. We'll use a
simplified version for demonstration purposes.

Example Dataset (Simplified Iris)

Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) | Species
5.1 | 3.5 | 1.4 | 0.2 | Setosa
4.9 | 3.0 | 1.4 | 0.2 | Setosa
4.7 | 3.2 | 1.3 | 0.2 | Setosa
7.0 | 3.2 | 4.7 | 1.4 | Versicolor
6.4 | 3.2 | 4.5 | 1.5 | Versicolor
6.9 | 3.1 | 4.9 | 1.5 | Versicolor
6.3 | 3.3 | 6.0 | 2.5 | Virginica
5.8 | 2.7 | 5.1 | 1.9 | Virginica
7.1 | 3.0 | 5.9 | 2.1 | Virginica

Naive Bayes Algorithm (Simplified)


1. Calculate Prior Probabilities:
o P(Setosa) = 3/9 = 1/3
o P(Versicolor) = 3/9 = 1/3
o P(Virginica) = 3/9 = 1/3
2. Calculate Likelihoods:
o For each feature, calculate the probability of observing that feature value given
each species. For example:
▪ P(Sepal Length = 5.1 | Setosa) = 1/3 (since there's one instance with
5.1 sepal length in Setosa)
▪ P(Sepal Length = 7.0 | Versicolor) = 1/3



▪ P(Petal Width = 0.2 | Setosa) = 1/3
3. Prediction:
o Suppose we have a new iris with features: Sepal Length = 6.0 cm, Sepal
Width = 3.0 cm, Petal Length = 4.0 cm, Petal Width = 1.0 cm.
o Calculate the posterior probability for each species using Bayes' Theorem:
▪ P(Setosa | Features) = P(Features | Setosa) * P(Setosa) / P(Features)
▪ P(Versicolor | Features) = P(Features | Versicolor) * P(Versicolor) /
P(Features)
▪ P(Virginica | Features) = P(Features | Virginica) * P(Virginica) /
P(Features)
o The species with the highest posterior probability is the predicted class.
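The same idea can be run on the full Iris dataset with scikit-learn's GaussianNB; the sketch below adds a train/test split so the accuracy can be checked, and then classifies the new flower from the example above.

# Minimal sketch: Gaussian Naive Bayes on the full Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Classify the new flower from the example above
print("Predicted species index:", model.predict([[6.0, 3.0, 4.0, 1.0]])[0])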

Second Approach

Naive Bayesian

The Naive Bayesian classifier is based on Bayes' theorem with the assumption of independence
between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter
estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naive
Bayesian classifier often performs surprisingly well and is widely used because it
frequently outperforms more sophisticated classification methods.

Algorithm

Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x),
and P(x|c):

P(c|x) = P(x|c) × P(c) / P(x)

The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given
class (c) is independent of the values of the other predictors. This assumption is called class
conditional independence.



• P(c|x) is the posterior probability of class (target) given predictor (attribute).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood which is the probability of predictor given class.
• P(x) is the prior probability of predictor.

In the ZeroR model there is no predictor, and in the OneR model we try to find the single best
predictor. Naive Bayes includes all predictors, using Bayes' rule and the independence assumption
between predictors.

Example 1:

We use the same simple Weather dataset here.



The posterior probability can be calculated by first, constructing a frequency table for each attribute
against the target. Then, transforming the frequency tables to likelihood tables and finally use
the Naive Bayesian equation to calculate the posterior probability for each class. The class with the
highest posterior probability is the outcome of prediction.

Likelihood tables are constructed in the same way for each of the four predictors.



Example 2:

In this example we have 4 inputs (predictors). The final posterior probabilities can be standardized
between 0 and 1.



The zero-frequency problem

Add 1 to the count for every attribute value-class combination (Laplace estimator) when an attribute
value (Outlook=Overcast) doesn’t occur with every class value (Play Golf=no).
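A minimal sketch of the Laplace estimator, assuming (for illustration) Outlook counts of Sunny = 3, Overcast = 0 and Rainy = 2 within the Play Golf = no class:

# Minimal sketch of the Laplace (add-1) estimator: raw vs. smoothed likelihoods
raw_counts = {"Sunny": 3, "Overcast": 0, "Rainy": 2}   # Outlook counts when Play Golf = no
total = sum(raw_counts.values())
k = len(raw_counts)                                    # number of attribute values

for value, count in raw_counts.items():
    raw = count / total                    # zero for Overcast -> wipes out the whole product
    smoothed = (count + 1) / (total + k)   # add 1 to every attribute value-class combination
    print(f"P({value} | no): raw = {raw:.2f}, smoothed = {smoothed:.2f}")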

Numerical Predictors

Numerical variables need to be transformed to their categorical counterparts (binning) before


constructing their frequency tables. The other option we have is using the distribution of the
numerical variable to have a good guess of the frequency. For example, one common practice is to
assume normal distributions for numerical variables.

The probability density function for the normal distribution is defined by two parameters
(mean and standard deviation).

Example:

Humidity values by class                                   Mean   StDev
Play Golf = yes: 86, 96, 80, 65, 70, 80, 70, 90, 75        79.1   10.2
Play Golf = no:  85, 90, 70, 95, 91                        86.2    9.7
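Using the means and standard deviations from this table, the likelihood of a new humidity value under each class can be read off the normal probability density function. A minimal sketch (the value 74 is an arbitrary example):

# Minimal sketch: Gaussian likelihood of a humidity value under each class
import math

def normal_pdf(x, mean, stdev):
    coeff = 1.0 / (stdev * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))

print("P(Humidity=74 | yes) ≈", round(normal_pdf(74, 79.1, 10.2), 4))
print("P(Humidity=74 | no)  ≈", round(normal_pdf(74, 86.2, 9.7), 4))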



Predictors Contribution

Kononenko's information gain, expressed as a sum of the information contributed by each attribute, can
offer an explanation of how the values of the predictors influence the class probability.

The contribution of predictors can also be visualized by plotting nomograms. Nomogram plots log
odds ratios for each value of each predictor. Lengths of the lines correspond to spans of odds ratios,
suggesting importance of the related predictor. It also shows impacts of individual values of the
predictor.

Bayesian Belief Network:


A Bayesian belief network is a key technique for dealing with probabilistic events and for solving
problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.



Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.

Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it
consists of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:

o Each node corresponds to a random variable, which can be continuous or discrete.
o Arcs (directed arrows) represent the causal relationships or conditional dependencies
between random variables. These directed links connect pairs of nodes in the graph.



A directed link indicates that one node directly influences the other; if there is no directed
link between two nodes, they are independent of each other.
o In the example graph, A, B, C, and D are random variables represented by the nodes of
the network.
o If node B is connected to node A by a directed arrow from A to B, then node A is called
the parent of node B.
o Node C is independent of node A.
Note: A Bayesian network graph does not contain any cycles; hence it is known as a directed
acyclic graph, or DAG.
The Bayesian network has mainly two components:
o Causal Component
o Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)),
which quantifies the effect of the parents on that node.

A Bayesian network is based on the joint probability distribution and conditional probability, so
let's first understand the joint probability distribution:

Joint probability distribution:


If we have variables x1, x2, x3, ..., xn, then the probability of every combination of their
values, P[x1, x2, x3, ..., xn], is known as the joint probability distribution.

Using the chain rule, the joint probability distribution can be factored as:

P[x1, x2, ..., xn] = P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]

                   = P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]

In a Bayesian network, for each variable Xi the factor simplifies to:

P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))

Explanation of Bayesian network:


Let's understand the Bayesian network through an example by creating a directed acyclic
graph:



Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds
reliably to a burglary but also goes off for minor earthquakes. Harry has two neighbors, David
and Sophia, who have taken on the responsibility of informing Harry at work when they hear the
alarm. David always calls Harry when he hears the alarm, but sometimes he confuses the phone
ringing with the alarm and calls then too. Sophia, on the other hand, likes to listen to loud
music, so she sometimes misses the alarm. Here we would like to compute the probability of the
alarm event.
Problem:
Calculate the probability that the alarm has sounded, neither a burglary nor an earthquake has
occurred, and both David and Sophia have called Harry.

Solution:
o The Bayesian network for the above problem is given below. The network structure shows
that Burglary and Earthquake are the parent nodes of Alarm and directly affect the
probability of the alarm going off, while David's and Sophia's calls depend only on the
alarm.
o The network also encodes our assumptions: David and Sophia do not directly perceive the
burglary, do not notice minor earthquakes, and do not confer before calling.
o The conditional distribution for each node is given as a conditional probability table,
or CPT.
o Each row in a CPT must sum to 1 because the entries in a row represent an exhaustive set
of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents has 2^k rows of probabilities, one
for each combination of parent values. Hence, if there are two parents, the CPT contains
4 rows of probability values.
List of all events occurring in this network:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
o We can write the joint probability of the events in the problem statement, P[D, S, A,
B, E], and rewrite it using the chain rule of the joint probability distribution:
o P[D, S, A, B, E]= P[D | S, A, B, E]. P[S, A, B, E]
o =P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]
o = P [D| A]. P [ S| A, B, E]. P[ A, B, E]
o = P[D | A]. P[ S | A]. P[A| B, E]. P[B, E]
o = P[D | A ]. P[S | A]. P[A| B, E]. P[B |E]. P[E]




Let's take the observed probability for the Burglary and earthquake component:

P(B= True) = 0.002, which is the probability of burglary.

P(B= False)= 0.998, which is the probability of no burglary.

P(E= True) = 0.001, which is the probability of a minor earthquake.

P(E= False) = 0.999, which is the probability that no earthquake occurred.

We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm A:

The Conditional probability of Alarm A depends on Burglar and earthquake:

B       E       P(A= True)   P(A= False)

True    True    0.94         0.06

True    False   0.95         0.05

False   True    0.31         0.69

False   False   0.001        0.999

Conditional probability table for David Calls:


The conditional probability that David calls depends on the state of the alarm.

A P(D= True) P(D= False)

True 0.91 0.09

False 0.05 0.95

Conditional probability table for Sophia Calls:


The conditional probability that Sophia calls depends on her parent node, Alarm.

A P(S= True) P(S= False)

True 0.75 0.25

False 0.02 0.98

From the formula of joint distribution, we can write the problem statement in the form of
probability distribution:

P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).


= 0.75* 0.91* 0.001* 0.998*0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using the joint probability
distribution.
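The hand calculation above can be checked with a few lines of Python (variable names are ours; the
numbers come from the CPTs above):

    # P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B,¬E) * P(¬B) * P(¬E)
    p_s_given_a     = 0.75    # Sophia calls given the alarm
    p_d_given_a     = 0.91    # David calls given the alarm
    p_a_given_nb_ne = 0.001   # alarm given no burglary and no earthquake
    p_not_b         = 0.998
    p_not_e         = 0.999

    joint = p_s_given_a * p_d_given_a * p_a_given_nb_ne * p_not_b * p_not_e
    print(round(joint, 8))    # 0.00068045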
The semantics of Bayesian Network:
There are two ways to understand the semantics of a Bayesian network:
1. As a representation of the joint probability distribution. This view is helpful for
understanding how to construct the network.
2. As an encoding of a collection of conditional independence statements.



K- Nearest neighbor classification:

K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases
based on a similarity measure (e.g., distance functions). KNN has been used in statistical
estimation and pattern recognition since the early 1970s as a non-parametric technique.

Algorithm

A case is classified by a majority vote of its neighbors, with the case being assigned to the class most
common amongst its K nearest neighbors measured by a distance function. If K = 1, then the
case is simply assigned to the class of its nearest neighbor.

It should also be noted that the three standard distance measures (Euclidean, Manhattan, and
Minkowski) are only valid for continuous variables. For categorical variables, the Hamming
distance must be used. This also raises the issue of standardizing the numerical variables to the
range 0-1 when there is a mixture of numerical and categorical variables in the dataset.



Choosing the optimal value for K is best done by first inspecting the data. In general, a larger K
value reduces the overall noise and tends to be more accurate, but there is no guarantee.
Cross-validation is another way to retrospectively determine a good K value, using an independent
dataset to validate it. Historically, the optimal K for most datasets has been between 3 and 10,
which generally produces much better results than 1NN.

Example:

Consider the following data concerning credit default. Age and Loan are two numerical variables
(predictors) and Default is the target.



We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000) using
Euclidean distance. If K=1 then the nearest neighbor is the last case in the training set with
Default=Y.

D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y

With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The prediction
for the unknown case is again Default=Y.
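A minimal sketch of this procedure (the training list comes from the table above, which is not
reproduced here; the distance worked out in the text is checked at the end):

    import math
    from collections import Counter

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_predict(training, query, k):
        # training: list of ((age, loan), default_label) pairs taken from the table above
        neighbours = sorted(training, key=lambda row: euclidean(row[0], query))[:k]
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]

    # The single distance worked out above for the K=1 case:
    print(round(euclidean((48, 142000), (33, 150000)), 2))   # 8000.01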

Standardized Distance

One major drawback in calculating distance measures directly from the training set is in the case
where variables have different measurement scales or there is a mixture of numerical and
categorical variables. For example, if one variable is based on annual income in dollars, and the
other is based on age in years then income will have a much higher influence on the distance
calculated. One solution is to standardize the training set as shown below.
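A minimal min-max standardization sketch (the ranges shown are placeholders; use the minimum and
maximum of each column in your own training set):

    def min_max(value, lo, hi):
        # Rescale to [0, 1] so that Loan (dollars) cannot dominate Age (years)
        return (value - lo) / (hi - lo)

    # Hypothetical column ranges for illustration only
    age_std  = min_max(48, 20, 65)
    loan_std = min_max(142000, 18000, 220000)
    print(round(age_std, 3), round(loan_std, 3))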



Using the standardized distance on the same training set, the unknown case is assigned a different
nearest neighbor, which is not a good sign of robustness.

Second Approach:
K-Nearest Neighbors

K-nearest neighbors (KNN) is a supervised machine learning algorithm used for both regression
and classification tasks.
KNN makes predictions for the test data set based on the characteristics of the current training
data points. This is done by calculating the distance between the test data and the training
data, assuming that similar things exist in close proximity.
The algorithm simply stores the training data. When a new data point is input, KNN compares its
characteristics/features with those of the stored points and places it close to the training
data points that share the same characteristics or features.

What Is The ‘k’ in KNN?

The ‘K’ in KNN is a parameter that refers to the number of nearest neighbors. K is a positive
integer, is typically small in value, and is recommended to be an odd number.
In layman's terms, the K-value defines a neighborhood around a new data point, which makes it
easier to decide which category that data point belongs to.
The example below shows 3 graphs. The first, ‘Initial Data’, is a graph where data points are
plotted and clustered into classes, and a new example to classify is present. In the ‘Calculate
Distance’ graph, the distance from the new example data point to the closest trained data points
is calculated. However, this still does not categorise the new example data point. Therefore,
using the k-value, we essentially create a neighborhood in which we can classify the new example
data point.
We would say that k=3 and the new data point will belong to Class B as there are more trained
Class B data points with similar characteristics to the new data point in comparison to Class A.

If we increase the k-value to 7, we will see that the new data point will belong to Class A as
there are more trained Class A data points with similar characteristics to the new data point in
comparison to Class B.



The k-value is typically a small number because, beyond a certain point, increasing the k-value
also increases the error rate. The graph below shows this:



However, if the k-value is small, the model has low bias but high variance, which leads to
overfitting.
It is also recommended that the k-value be an odd number. If we are classifying into an even
number of categories/classes (e.g. Class A and Class B), an even k-value can produce a tied
vote, so choosing an odd K-value is highly recommended to avoid ties.

Calculating The Distance

KNN calculates the distance between data points in order to classify new data points. The most
common methods used to calculate this distance in KNN are Euclidean, Manhattan, and Minkowski.
Euclidean distance is the length of the straight line between two points. The formula for
Euclidean distance is the square root of the sum of the squared differences between a new data
point (x) and an existing trained data point (y).
Manhattan distance between two points is the sum of the absolute differences of their Cartesian
coordinates. The formula for Manhattan distance is the sum of the lengths of the projections of
the line segment between a new data point (x) and an existing trained data point (y) onto the
coordinate axes.
Minkowski distance is the distance between two points in a normed vector space and is a
generalization of the Euclidean and Manhattan distances. In the formula for Minkowski distance,
p = 2 gives the Euclidean distance, also known as the L2 distance, and p = 1 gives the Manhattan
distance, also known as the L1 distance, city-block distance, or LASSO.
The image below shows the formulas:
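As a concrete stand-in for the formula image, here is a minimal Python sketch of the three
measures (the function name and sample points are ours):

    def minkowski(a, b, p):
        # Generalised distance: p = 1 gives Manhattan, p = 2 gives Euclidean
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

    a, b = (1, 2), (4, 6)
    print(minkowski(a, b, 1))   # Manhattan: |1-4| + |2-6| = 7.0
    print(minkowski(a, b, 2))   # Euclidean: sqrt(3^2 + 4^2) = 5.0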



The image below explains the difference between the three:



How Does The KNN Algorithm Work?
Below are the steps of how a KNN algorithm works:
1. Load in your dataset.
2. Choose a k-value. An odd number is recommended to avoid a tie.
3. Calculate the distance between the new data point and the neighboring existing
trained data points.
4. Find the K nearest neighbors to the new data point.
5. Classify the new data point by the majority vote of its K nearest neighbors.
Below is an image that gives an overview of these steps:



Approach 3:
K-nearest neighbor (KNN) is a simple algorithm that stores all available cases and classifies
new data or cases based on a similarity measure. It is mostly used to classify a data point based
on how its neighbors are classified.
What Is a K-Nearest Neighbor (KNN)?
K-nearest neighbor (KNN) is an algorithm that is used to classify a data point based on how its
neighbors are classified. The “K” value refers to the number of nearest neighbor data points to
include in the majority voting process.
Let’s break it down with a wine example examining two chemical components called rutin and
myricetin. Consider a measurement of the rutin vs. myricetin level with two data points, red
and white wines. After being tested, they’re placed on a graph based on how much rutin and
how much myricetin chemical content is present in the wines.

A graph representing data trends for red and white wine based on the amount of myricetin and
rutin. | Image: Dhilip Subramanian
The “K” in KNN is a parameter that refers to the number of nearest neighbors to include in the
majority of the voting process.
Now suppose we add a new glass of wine in the data set, and we want to know whether this
new wine is red or white.

Identifying a glass of wine based on its nearest neighbors on the chart. | Image: Dhilip
Subramanian



To do so, we need to find out what the neighbors are in this case. Let’s say k = 5, and the new
data point is classified by the majority of votes from its five neighbors. The new point would
be classified as a red wine since four out of five neighbors are red.
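The same majority vote can be written out as a tiny Python sketch (the label list simply mirrors
the five neighbors described above):

    from collections import Counter

    # Labels of the k = 5 nearest neighbours from the wine example above
    neighbour_labels = ["red", "red", "red", "red", "white"]
    prediction = Counter(neighbour_labels).most_common(1)[0][0]
    print(prediction)   # "red" wins the vote 4 to 1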

Defining a glass of
wine based on its nearest neighbors with a k value of five. | Image: Dhilip Subramanian

Determining the K-Nearest Neighbor Algorithm’s ‘K’ Value


The “K” in KNN algorithm is based on feature similarity. Choosing the right value for K is a
process called parameter tuning, which improves the algorithm accuracy. Finding the value of
K is not easy.

How to Define a ‘K’ Value


Below are some ideas on how to pick the value of K in a K-nearest neighbor algorithm:
1. There is no structured method for finding the best value for K. We need to assume that
the training data is unknown and find the best value through trial and error.
2. Choosing smaller values for K can be noisy and will have a higher influence on the
result.
3. Larger values of K will have smoother decision boundaries, which means a lower
variance but increased bias. Also, it can be computationally expensive.
4. Another way to choose K is through cross-validation. One way to select a cross-
validation data set from the training data set is to take a small portion of the training
data and call it a validation data set, then use it to evaluate different possible values
of K. In this way we can predict the label for every instance in the validation set using
K equal to one, K equal to two, K equal to three, and so on, and then look at which value
of K gives us the best performance on the validation set. From there, we can take that
value and use it as the final setting of our algorithm to minimize the validation error
(a sketch of this follows the list).
5. In general practice, a common choice is k = sqrt(N), where “N” stands for the number of
samples in your training data set.
6. Try to keep the value of K odd in order to avoid ties between two classes of data.
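A minimal sketch of idea 4, assuming scikit-learn is available and that X is a feature matrix with
labels y (neither is defined in these notes):

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def best_k(X, y, candidates=range(1, 21, 2)):
        # Try odd K values and keep the one with the best 5-fold cross-validation accuracy
        scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
                  for k in candidates}
        return max(scores, key=scores.get)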

How Does a K-Nearest Neighbor Algorithm Work?


In the classification setting, the K-nearest neighbor algorithm essentially boils down to forming
a majority vote among the K most similar instances to a given unseen observation. Similarity is
defined according to a distance metric between two data points. A popular one is the Euclidean
distance method.

Euclidean distance equation. | Image: Dhilip Subramanian


Other methods are Manhattan, Minkowski, and Hamming distance methods. For categorical
variables, the Hamming distance must be used.
Let’s take a small example examining age vs. loan amount.

Predicting Andrew’s default status using Euclidean distance with data from other customers. |
Image: Dhilip Subramanian
We need to predict Andrew’s default status — either yes or no.



Calculating the Euclidean distance based on other age and loan data points.
Then calculate the Euclidean distance for all the data points.

Calculating Euclidean distance with k=5.


With K=5, there are two Default=N and three Default=Y out of five closest neighbors. We can
safely say the default status for Andrew is “Y” based on the majority similarity in three points
out of five.
KNN is also a lazy learner because it doesn’t learn a discriminative function from the training
data but “memorizes” the training data set instead.

Computing K-Nearest Neighbor Distance Metrics


Hamming Distance
Hamming distance is mostly used with text data; it calculates the distance between two binary
vectors, i.e., data represented in the form of the binary digits 0 and 1, also called binary
strings.
Mathematically, it’s represented by the following formula:



Hamming distance equation. | Image: Dhilip Subramanian
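Since the equation image is not reproduced here, a minimal sketch of the idea in Python (the
example strings are ours):

    def hamming(a, b):
        # Number of positions at which two equal-length binary strings differ
        return sum(x != y for x, y in zip(a, b))

    print(hamming("1011101", "1001001"))   # 2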

Euclidean Distance
Euclidean distance is the most popular distance measure. It helps to find the distance between
two real-valued vectors, like integers or floats. Before using Euclidean distance, we must
normalize or standardize the data, otherwise, data with larger values will dominate the
outcome.
Mathematically, it’s represented by the following formula.

Euclidean distance equation. | Image: Dhilip Subramanian

Manhattan Distance
Manhattan distance is the simplest measure, and it is used to calculate the distance between two
real-valued vectors. It is also called the “taxicab” or “city block” distance measure.
If we start from one place and move to another, the Manhattan distance is the sum of the absolute
differences between the coordinates of the starting and destination points. Manhattan distance is
preferred over Euclidean distance when the two data points lie in an integer space.
The Manhattan distance between two points (X1, Y1) and (X2, Y2) is |X1 – X2| + |Y1 – Y2|.

Minkowski Distance
Minkowski distance is used to calculate the distance between two real value vectors. It is a
generalized form for Euclidean and Manhattan distance. In addition, it adds a parameter “p,”
which helps to calculate the different distance measures.
Mathematically it’s represented by the following formula. Note that in Euclidean distance p =
2, and p =1 if it is Manhattan distance.



K-Nearest Neighbor Pros
1. It’s simple to implement.
2. It’s flexible to different feature/distance choices.
3. It naturally handles multi-class cases.
4. It can do well in practice with enough representative data.

K-Nearest Neighbor Cons


1. We need to determine the value of parameter “K” (number of nearest neighbors).
2. Computation cost is quite high because we need to compute the distance of each
query instance to all training samples.
3. It requires a large storage of data.
4. We must have a meaningful distance function.

