
UNIT-4 CLASSIFICATION AND PREDICTION

CLASSIFICATION:
Classification is a supervised learning technique in data mining used to categorize data into
predefined groups or classes. The goal is to build a model that can accurately classify new,
unseen data based on past observations. Example: email spam detection.
Process of Classification:
1. Data Collection: In this step, relevant data is collected. The data should contain all the
necessary attributes and labels needed for classification. The data can be collected from
various sources, such as surveys, questionnaires, websites, and databases.
2. Data Pre-processing: The collected data needs to be pre-processed to ensure its quality.
This involves handling missing values, dealing with outliers, and transforming the data into
suitable form for analysis.
3. Feature Selection: It involves identifying the most relevant attributes in the dataset for
classification. This can be done using various techniques, such as:
Correlation Analysis measures the statistical relationship between two or more
variables to identify patterns, dependencies, and associations in the data.
Information Gain is a measure of the amount of information that a feature provides for
classification. Features with high information gain are selected for classification.
Principal Component Analysis is a technique used to reduce the dimensionality of the
dataset. It identifies the most important features in the dataset and removes redundant ones.
4. Model Selection: It involves selecting the appropriate classification algorithm. These are:
Decision Tree is a hierarchical model that splits data into subsets based on feature values,
creating a tree-like structure to predict the target class by following decision rules.
Bayesian Classification is a probabilistic technique that applies Bayes' Theorem to
predict the probability of a class based on prior knowledge and observed features.
Support Vector Machine (SVM) is a supervised learning algorithm that finds the optimal
hyperplane to separate data points of different classes with the maximum margin.
A Neural Network is a computational model inspired by the human brain, consisting of
layers of interconnected nodes (neurons) that learn complex patterns from data to
classify inputs into predefined categories.
5. Model Training: It involves using the selected classification algorithm to learn the patterns
in the data. The data is divided into a training set and a validation set. The model is trained
using the training set, and its performance is evaluated on the validation set.
6. Model Evaluation: It involves assessing the performance of the trained model on a test
set. This is done to ensure that the model generalizes well.
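The whole workflow can be illustrated with a short Python sketch, assuming the scikit-learn library is available and using its built-in Iris dataset as a stand-in for collected, pre-processed data (a minimal sketch, not a full pipeline):

# Steps 1-2: load already-cleaned data (Iris) as a stand-in for collected data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 5: split the data into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: select a model (here a decision tree) and train it on the training set
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Step 6: evaluate how well the model generalizes to unseen data
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))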
PREDICTION:
Prediction is a technique used to estimate unknown values or future outcomes based on
patterns found in historical data. It helps businesses and researchers make data-driven
decisions by forecasting trends, behaviours, or numerical values. For example:
 Predicting house prices based on location, size, and facilities.
 Forecasting stock market trends using historical stock data.
 Estimating customer purchases based on past shopping behaviour.
Process of Prediction:
1. Data Collection and Preparation: It gathers relevant data from various sources, cleans and
pre-processes the data to handle missing values, inconsistencies, and noise, and transforms
the data into a suitable format for analysis.
2. Feature Selection: It identifies the most important variables (features) that influence the
outcome you want to predict. This step helps simplify the model and improve its accuracy.
3. Model Selection: Chooses an appropriate prediction model based on the type of data and
the prediction task. Some common models include:
o Regression: Predicts continuous values (e.g., house prices, stock values)
o Classification: Predicts categories or classes (e.g., spam/not spam)
o Time Series Analysis: Predicts values over time (e.g., sales forecasts, weather patterns)
4. Model Training: It uses a portion of the data (training data) to teach the model the
relationships between the features and the outcomes. The model learns patterns and rules
that it can use to make predictions on new data.
5. Model Evaluation: It evaluates the model's performance using a separate portion of data
(testing data) that the model has not seen before, checking how well the model's predictions
match actual outcomes. The model is then fine-tuned to improve its accuracy and generalization ability.
6. Deployment and Monitoring: Once we are satisfied with the model's performance, deploy
it to make predictions on real-world data. Continuously monitor the model's performance
and retrain it as needed to maintain accuracy over time.
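As a rough illustration of these steps, the sketch below fits a simple regression model to a handful of invented house-price records (Python, assuming scikit-learn; the sizes, room counts, and prices are made up for demonstration only):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# hypothetical records: [size in sq. ft, number of rooms] -> price
X = [[800, 2], [1000, 3], [1200, 3], [1500, 4], [1800, 4], [2200, 5]]
y = [60000, 75000, 88000, 110000, 130000, 160000]

# training data vs. testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = LinearRegression().fit(X_train, y_train)                                  # model training
print("MAE on test data:", mean_absolute_error(y_test, model.predict(X_test)))   # evaluation
print("Predicted price for 1600 sq. ft, 4 rooms:", model.predict([[1600, 4]])[0])  # prediction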

Training Data:
 Used to build and train the model.
Test Data:
 Used to evaluate the model's performance after training.

Issues in Classification and Prediction:
1. Data cleaning:
 This refers to the pre-processing of data to remove or reduce noise (using smoothing
methods) and to handle missing values. Although many classification algorithms have
some mechanisms for handling noisy or missing data, this step helps reduce confusion
during learning.
2. Relevance analysis:
 There are various attributes in the data that can be irrelevant to the classification or
prediction task. For example, data recording the day of the week on which a bike was
sold is unlikely to be relevant to the success of the application. Moreover, some other
attributes can be redundant.
 Therefore, relevance analysis is performed on the data to remove irrelevant or
redundant attributes from the learning procedure. In machine learning, this step is
referred to as feature selection. The main objective is to improve the efficiency and
scalability of classification without reducing performance.
3. Data transformation:
 The data can be generalized to higher-level concepts. Here, concept hierarchies can
be used. This is especially helpful for continuous-valued attributes. For example,
numeric values for the attribute income can be generalized to the discrete field
including low, medium, and high. Similarly, nominal-valued attributes, such as the
street, can be generalized to higher-level concepts, such as the city.
 Since generalization reduces the initial training data, the data can also be normalized.
Normalization includes scaling all values for a given attribute inside a small specified
area, including -1 to 1 or 0 to 1.
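A minimal sketch of both ideas in plain Python (the income values and the low/medium/high thresholds are invented for illustration):

# Min-max normalization: scale the attribute 'income' into the range [0, 1]
incomes = [20000, 35000, 50000, 80000, 120000]
lo, hi = min(incomes), max(incomes)
normalized = [(v - lo) / (hi - lo) for v in incomes]
print(normalized)                        # 20000 -> 0.0, 120000 -> 1.0

# Concept-hierarchy generalization: map numeric income to low / medium / high
def income_level(v):                     # thresholds are hypothetical
    if v < 40000:
        return "low"
    elif v < 90000:
        return "medium"
    return "high"

print([income_level(v) for v in incomes])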

DECISION TREE INDUCTION:

 Decision Tree is a supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
 It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
 In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
a) Decision nodes are used to make any decision and have multiple branches.
b) Leaf nodes are output of those decisions and do not contain any further branches.
 The decisions or the test are performed on the basis of features of the given dataset.
 It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions. A decision tree simply asks a question, and based on the answer
(Yes/No), it further splits the tree into sub-trees. It can handle categorical data (YES/NO) and
numeric data.
 In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm. It is called a decision tree because, similar to a tree, it starts
with the root node, which expands on further branches and constructs a tree-like structure.
Decision Tree Terminologies:
1. Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
2. Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
3. Splitting: Splitting is the process of dividing the decision node/root node into subnodes
according to the given conditions.
4. Branch/Sub Tree: A tree formed by splitting the tree.
5. Parent/Child node: The root node of tree is parent node, & other nodes are child nodes.
Advantages of the Decision Tree:
1. It is simple to understand.
2. It can be very useful for solving decision-related problems.
3. It helps to think about all the possible outcomes for a problem.
Disadvantages of the Decision Tree:
1. The decision tree contains lots of layers, which makes it complex.
2. It may have over-fitting issue, which can be resolved using Random Forest algorithm.
3. For more class labels, the computational complexity of the decision tree may increase.

For Example:
 Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or not. To solve this problem, the decision tree starts with the root node
(the Salary attribute, chosen by an attribute selection measure).
 The root node splits further into the next decision node (Distance from the office) and
one leaf node, based on the corresponding labels.
 The next decision node further splits into one decision node (Cab facility) and one leaf
node. Finally, that decision node splits into two leaf nodes (Offer accepted and Offer
declined).

Applications of Decision Trees:


1. Loan Approval in Banking: A bank needs to decide whether to approve a loan application
based on customer profiles. Input features include income, credit score, employment status,
and loan history. The decision tree predicts loan approval or rejection, helping the bank
make quick and reliable decisions.
2. Medical Diagnosis: A healthcare provider wants to predict whether a patient has diabetes
based on clinical test results. Features like glucose levels, BMI, and blood pressure are used
to build the decision tree. The tree classifies patients into diabetic or non-diabetic, assisting
doctors in diagnosis.

STEPS IN DECISION TREE INDUCTION:


1. Selecting the Best Attribute for Splitting:
 This algorithm selects the attribute that best divides data into distinct groups. Various
measures are used for attribute selection, such as:
Entropy of the entire dataset:
Entropy(S) = − Σ pi log2(pi)
where pi is the probability of class i and S is the dataset.
Information Gain of each attribute:
IG(S, A) = Entropy(S) − Σ ( |Sv| / |S| ) × Entropy(Sv)
where S is the dataset, A is the attribute, and Sv are the subsets of S after splitting by A.
(A small Python sketch of these two measures is given after this list.)
2. Splitting the Data: Once the best attribute is chosen, the dataset is divided based on its
values. Each subset forms a child node.
3. Recursion:
 The process is repeated for each child node until a stopping condition is met, such as:
 All instances in a node belong to the same class.
 No remaining attributes to split.
 The tree reaches a predefined depth.
4. Assigning Class Labels: Once splitting stops, each leaf node is assigned a class label based
on majority voting or averaging (for regression tasks).
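The two attribute-selection measures from step 1 can be written as small plain-Python helpers (a sketch; the function and variable names are illustrative):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = - sum(p_i * log2(p_i)) over the class proportions
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, labels):
    # IG(S, A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv)) over the values v of A
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# tiny usage example
rows   = [{"Outlook": "Sunny"}, {"Outlook": "Sunny"}, {"Outlook": "Overcast"}]
labels = ["No", "Yes", "Yes"]
print(information_gain(rows, "Outlook", labels))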

EXAMPLE DATASET:
We will use the "Play Tennis" dataset where the goal is to predict whether a person will play
tennis based on the weather conditions.
 Target Variable: Play Tennis (Yes/No)
 Features: Outlook, Temperature, Humidity, Wind
ID Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rainy Mild High Weak Yes
5 Rainy Cool Normal Weak Yes
6 Rainy Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rainy Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rainy Mild High Strong No

Step 1: Calculate the Entropy of the entire dataset: Entropy(S) = − Σ pi log2(pi)
Total Yes = 9; Total No = 5; Total Samples = 14
Entropy(S) = − ( (9/14) log2(9/14) + (5/14) log2(5/14) )
⇒ Entropy(S) = − { (0.642 × −0.64) + (0.357 × −1.48) }
⇒ Entropy(S) = − { −0.94 }
⇒ Entropy(S) = 0.94
Step 2: Compute Information Gain for Each Attribute
IG(S, A) = Entropy(S) − Σ ( |Sv| / |S| ) × Entropy(Sv)

1. Information Gain for "Outlook": We split the dataset based on the Outlook attribute:
 Sunny: 5 instances → (2 Yes, 3 No) → Entropy = 0.97
 Overcast: 4 instances → (4 Yes, 0 No) → Entropy = 0
 Rainy: 5 instances → (3 Yes, 2 No) → Entropy = 0.97
Entropy for Sunny:
Entropy(Sunny) = − ( (2/5) log2(2/5) + (3/5) log2(3/5) )
⇒ Entropy(Sunny) = − { (0.4 × −1.32) + (0.6 × −0.74) }
⇒ Entropy(Sunny) = − { −0.97 }
⇒ Entropy(Sunny) = 0.97
Entropy for Overcast: (since all instances are Yes, the entropy is 0)
Entropy(Overcast) = − ( (4/4) log2(4/4) + (0/4) log2(0/4) ) ⇒ Entropy(Overcast) = 0

Entropy for Rainy:
Entropy(Rainy) = − ( (3/5) log2(3/5) + (2/5) log2(2/5) )
⇒ Entropy(Rainy) = − { (0.6 × −0.74) + (0.4 × −1.32) }
⇒ Entropy(Rainy) = − { −0.97 }
⇒ Entropy(Rainy) = 0.97
Now, calculate the Information Gain of Outlook:
Information Gain(S, Outlook) = 0.94 − [ (5/14 × 0.97) + (4/14 × 0) + (5/14 × 0.97) ]
⇒ IG(S, Outlook) = 0.94 − 0.69
⇒ IG(S, Outlook) = 0.25
2. Information Gain for "Humidity": We split the dataset based on the Humidity attribute:
 High: 7 instances → (4 No, 3 Yes) → Entropy = 0.99
 Normal: 7 instances → (1 No, 6 Yes) → Entropy = 0.59
Information Gain(S, Humidity) = 0.94 − [ (7/14 × 0.99) + (7/14 × 0.59) ]
⇒ IG(S, Humidity) = 0.94 − 0.79
⇒ IG(S, Humidity) = 0.15
3. Information Gain for "Wind": We split the dataset based on the Wind attribute:
 Weak: 8 instances → (2 No, 6 Yes) → Entropy = 0.81
 Strong: 6 instances → (3 No, 3 Yes) → Entropy = 1.00
Information Gain(S, Wind) = 0.94 − [ (8/14 × 0.81) + (6/14 × 1.00) ]
⇒ IG(S, Wind) = 0.94 − 0.89
⇒ IG(S, Wind) = 0.05
Step 3: Construct the Decision Tree
The attribute with the highest Information Gain is selected as the root. Since "Outlook" has
the highest IG (0.25), it becomes the root.
If Outlook is Overcast: Play Tennis = Yes
If Outlook is Sunny: Check Humidity → If High, Play Tennis = No; If Normal, Play Tennis = Yes
If Outlook is Rainy: Check Wind → If Strong, Play Tennis = No; If Weak, Play Tennis = Yes
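The arithmetic above can be checked with a few lines of plain Python, using only the class counts taken from the table (9 Yes / 5 No overall, and the per-value splits for Outlook):

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

S = entropy([9, 5])                                       # ≈ 0.94
outlook_splits = {"Sunny": [2, 3], "Overcast": [4, 0], "Rainy": [3, 2]}
weighted = sum(sum(c) / 14 * entropy(c) for c in outlook_splits.values())

print("Entropy(S)     ≈", round(S, 2))                    # 0.94
print("IG(S, Outlook) ≈", round(S - weighted, 2))         # 0.25 -> highest gain, so root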

BAYESIAN CLASSIFICATION:
 Bayesian classification is a probabilistic approach to classification based on Bayes'
theorem, which calculates the probability of a class given observed data. It determines the
most likely class for a given input by updating prior beliefs with new evidence.
 Bayesian classifiers belong to generative models, which learn the joint probability
distribution P(X, Y) of the features X and labels Y.
Bayes' Theorem:
Bayes' theorem provides a way to update our beliefs about an event given some evidence. It
helps us calculate the probability of a data point belonging to a certain class, given its features.
In classification, we use Bayes' theorem to calculate P(C|X) for each possible class C. We then
assign the data point to the class with the highest posterior probability.
Bayes' Theorem: P(C|X) = [ P(X|C) × P(C) ] / P(X)
Where:
 P(C|X): Posterior probability - The probability of the data point belonging to class C,
given the observed features X. This is what we want to calculate.
 P(X|C): Likelihood - The probability of observing features X, given that the data point
belongs to class C.
 P(C): Prior probability - The prior probability of the data point belonging to class C,
before considering the features X.
 P(X): Evidence - The probability of observing features X, regardless of the class. This
acts as a normalizing constant.
Numerical Example: Spam Email Classification
We will classify an email as Spam (S) or Not Spam (¬S) based on the presence of the word
"free" using Bayes’ Theorem:
P(C|X) = [ P(X|C) × P(C) ] / P(X)

Step 1: Given Data (X means the word "free")
Prior Probabilities:
P(S) = 0.4 (40% of emails are spam)
P(¬S) = 0.6 (60% of emails are not spam)
Likelihood (probability of "free" given the class):
P(X|S) = 0.8 (80% of spam emails contain "free")
P(X|¬S) = 0.2 (20% of non-spam emails contain "free")

Step 2: Compute Evidence P(X) - The total probability of an email containing "free" is:
P(X) = [ P(X∣S) * P(S) ] + [ P(X∣¬S) * P(¬S) ]
 (0.8×0.4) + (0.2×0.6)
 0.32 + 0.12
 0.44
Step 3: Compute Posterior Probability P(S|X)
P(S|X) = [ P(X|S) × P(S) ] / P(X)
⇒ (0.8 × 0.4) / 0.44
⇒ 0.32 / 0.44 ≈ 0.727
Step 4: Compute P(¬S|X)
P(¬S|X) = [ P(X|¬S) × P(¬S) ] / P(X)
⇒ (0.2 × 0.6) / 0.44
⇒ 0.12 / 0.44 ≈ 0.273
Step 5: Classification Decision
Since, P(S∣X) = 0.727 is greater than P(¬S∣X)=0.273, the email is more likely to be Spam.
Final Answer: The email is classified as Spam.
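The same calculation expressed as a few lines of plain Python, using the probabilities given above:

p_spam, p_not_spam = 0.4, 0.6            # priors P(S), P(¬S)
p_free_spam, p_free_not = 0.8, 0.2       # likelihoods P(X|S), P(X|¬S)

p_free = p_free_spam * p_spam + p_free_not * p_not_spam   # evidence P(X) = 0.44
p_spam_given_free = p_free_spam * p_spam / p_free         # ≈ 0.727
p_not_given_free  = p_free_not * p_not_spam / p_free      # ≈ 0.273

label = "Spam" if p_spam_given_free > p_not_given_free else "Not Spam"
print(round(p_spam_given_free, 3), round(p_not_given_free, 3), label)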
Advantages of Bayesian Classification:
1. It is easy to implement, computationally inexpensive, and suitable for large datasets.
2. It can handle a large number of features, making it suitable for text classification.
3. It works well with limited data.
Disadvantages:
1. It assumes features are independent, which is often unrealistic.
2. Performance drops if features are dependent.
3. Poor priors can lead to biased results.
Applications:
1. Spam filtering: Classifies emails as spam or not spam.
2. Medical diagnosis: Predicts diseases based on symptoms.
3. Fraud detection: Identifies fraudulent transactions based on past data.
Naive Bayes Classifier:
 The Naïve Bayes classifier makes a naïve assumption that all features are conditionally
independent given the class label. This means that the presence of one feature does not
affect the probability of another feature occurring, given the class.
 Naïve Bayes is widely used in text classification because it is fast, scalable, and effective
even with simple assumptions.
 This assumption is "naïve" because, in real-world data, features are often correlated. For
example, in email classification, words like "free" and "offer" frequently appear together
in spam emails.
Benefits of the Naïve Assumption:
1. Fast and Efficient
2. Works Well with Small Data
3. Handles High-Dimensional Data Well: performs well in text classification.
4. Works well with categorical and binary data: suitable for problems like medical diagnosis.

Types of Naïve Bayes Classifiers:


1. Gaussian Naïve Bayes (GNB):
 It is used for continuous numerical features. It assumes data follows a normal (Gaussian)
distribution. Example: Classifying iris flowers based on petal width and length.
2. Multinomial Naïve Bayes (MNB):
 It is used for discrete features (word counts, frequencies). It is common in text
classification (e.g., spam detection). Example: Email classification, where words are
treated as features with probabilities based on their occurrence in spam vs. non-spam
emails.
3. Bernoulli Naïve Bayes (BNB):
 It is used for binary features (presence or absence of a feature). It is suitable for text
classification where words are represented as binary indicators (word appears = 1,
word doesn’t appear = 0). Example: Document classification, where each word’s
presence determines class probability.
Example: Multinomial Naïve Bayes for Text Classification
In spam filtering, words in an email are treated as features. The probability of an email being
spam is calculated based on word frequencies in spam vs. non-spam emails. If words like
"free," "win," "prize" appear frequently in spam emails, an email containing them is more
likely classified as spam.
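A hedged sketch of this idea using scikit-learn's MultinomialNB (the four-email corpus below is invented purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "free offer just for you",
          "meeting agenda for monday", "lunch with the project team"]
labels = ["spam", "spam", "not spam", "not spam"]

vectorizer = CountVectorizer()               # word counts as features
X = vectorizer.fit_transform(emails)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["claim your free prize"])))   # most likely 'spam'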
CLASSIFICATION BASED ON CONCEPTS FROM ASSOCIATION RULE MINING:
Introduction to Association Rule Mining:
 Association Rule Mining is a technique used to identify relationships, patterns, and
correlations between items in large datasets. It is commonly applied in market basket
analysis, where businesses analyse purchasing patterns to recommend related products.
 The goal of association rule mining is to discover interesting and useful rules from
transaction data, helping businesses make data-driven decisions, such as cross-selling,
product placement, and recommendation systems.
Key Concepts:
1. Itemsets: A collection of items bought together in a transaction. Example: {bread,
milk, eggs} is an itemset in a shopping basket.
2. Support: Measures how frequently an itemset appears in the dataset.
Support(X) = (Number of transactions containing X) / (Total number of transactions)
Example: If 3 out of 10 transactions include {bread, milk}, the support is 0.3 (30%).
3. Confidence: Measures the likelihood that if an item X is purchased, item Y is also
purchased.
Confidence(X⇒Y) = Support(X∪Y) / Support(X)
Example: If 80% of people who buy bread also buy milk, the confidence of the rule
{bread} → {milk} is 0.8 (80%).
4. Lift: Measures how much more likely Y is bought when X is bought, compared to
random chance.
Lift(X⇒Y) = Confidence(X⇒Y) / Support(Y)
 Lift > 1: X and Y are positively correlated (Buying X increases chance of buying Y).
 Lift = 1: No correlation.
 Lift < 1: Negative correlation (Buying X reduces the chance of buying Y).

Example of an Association Rule: "Customers who buy bread also tend to buy milk"
 Support: 30% (30% of transactions include both bread and milk).
 Confidence: 80% (80% of people who buy bread also buy milk).
 Lift: 1.5 (Buying bread increases the likelihood of buying milk by 1.5 times).
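The three measures can be computed for the rule {bread} → {milk} with a few lines of plain Python over a made-up transaction list:

transactions = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread", "butter"},
    {"milk", "eggs"}, {"bread", "milk", "butter"}, {"eggs"},
]
n = len(transactions)

support_bread      = sum("bread" in t for t in transactions) / n
support_milk       = sum("milk" in t for t in transactions) / n
support_bread_milk = sum({"bread", "milk"} <= t for t in transactions) / n   # Support(X∪Y)

confidence = support_bread_milk / support_bread     # P(milk | bread)
lift       = confidence / support_milk              # > 1 means positive correlation
print(support_bread_milk, round(confidence, 2), round(lift, 2))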
ASSOCIATIVE CLASSIFICATION:
 Associative Classification (AC) is a classification technique that integrates association
rule mining with classification. Instead of using traditional classifiers like decision trees or
Naïve Bayes, it derives classification rules from frequent patterns in the dataset.
 The goal is to build accurate and interpretable classifiers by discovering patterns in the data.
It is particularly useful when traditional classifiers struggle with complex relationships
between attributes.

Algorithms for Associative Classification: (CBA and CMAR)


1. CBA (Classification Based on Associations):
 CBA is one of the earliest associative classification algorithms. It integrates
association rule mining with classification by generating rules from frequent itemsets.
 Its limitations are: it is computationally expensive because the Apriori algorithm
requires multiple database scans, and it may generate too many rules, requiring extra
pruning to keep only the useful ones.
How CBA Works?
1. Frequent Itemset Mining:
 The dataset is analyzed to find frequent itemsets that appear together. It uses the
Apriori algorithm to find frequent itemsets.
 Example-1: {Outlook=Sunny, Temperature=Hot} → {Play=No}.
 Example-2: {Age=Young, Income=Low} → {Buy=No}.
2. Rule Generation:
 From frequent itemsets, association rules are generated in form of IF-THEN rules.
 Example-1: IF (Outlook=Sunny AND Temperature=Hot), THEN Play=No.
 Example-2: IF (Age=Young AND Income=Low), THEN Buy=No.
3. Rule Pruning and Selection:
 Since many rules can be generated, only the best rules are kept for classification,
based on measures like support, confidence, and lift. The rules are sorted by
confidence, support, and rule length.
4. Classification:
 When a new instance needs to be classified, the classifier finds the most relevant rule
and assigns the corresponding class label. If multiple rules apply, a ranking
mechanism (such as confidence-based selection) is used.
2. CMAR (Classification based on Multiple Association Rules)
 CMAR addresses the limitations of CBA by making rule mining more efficient and by
using multiple rules to improve classification accuracy.
 It is more efficient than CBA due to FP-tree’s ability to compress data. It handles rule
conflicts better by considering multiple rules instead of just one.
How CMAR Works?
1. Frequent Pattern Mining with FP-tree:
 Instead of Apriori, CMAR uses an FP-tree (Frequent Pattern Tree) to mine
frequent patterns more efficiently. The FP-tree reduces the number of database
scans, improving speed and memory usage.
2. Rule Selection and Classification:
 CMAR stores classification association rules (CARs) in a classification rule tree
(CR-tree), a tree-structured rule database. Instead of using just one rule like CBA,
CMAR applies multiple strong rules for classification.
3. Rule Weighting and Conflict Resolution:
 CMAR assigns different weights to rules based on support and confidence. If
multiple rules predict different class labels, it chooses the one with the highest weight.

Advantages of Associative Classification:


1. Interpretability and Transparency: The generated rules are easy to understand.
2. High Accuracy: It achieves better classification accuracy compared to traditional
classifiers (e.g., decision trees or Naïve Bayes) because it considers multiple rules.
3. Handles Numeric and Categorical Data: It can work with both categorical and numeric
attributes, making it flexible across different datasets.
4. Effective Feature Selection: Since it selects strong rules based on support and confidence,
it acts as a feature selection method, reducing noise.
Disadvantages of Associative Classification:
1. High Computational Cost: Generating association rules is expensive, especially with
large datasets, as it involves mining frequent itemsets and generating multiple rules.
2. Rule Pruning Complexity: Effective rule pruning is required to remove redundant or weak
rules, which adds complexity to the model-building process.
3. Large Number of Rules: It can generate a large number of rules, making rule selection and
management complex.

Example Associative Classification:
 Classifying Customers Based on Purchase History.
 A retail store wants to classify customers as "High Spender" or "Low Spender" based
on their purchasing behavior.
Step 1: Transaction Data.
The store collects data on customer purchases:
Customer Items Purchased Spender Class
C1 Laptop, Mouse, Headphones High Spender
C2 Laptop, Mouse High Spender
C3 Notebook, Pen Low Spender
C4 Laptop, Headphones High Spender
C5 Notebook, Pen, Mouse Low Spender

Step 2: Finding Frequent Itemsets.


Using association rule mining, we find frequently occurring patterns:
 {Laptop, Mouse} → High Spender (Support = 40%, Confidence = 100%)
 {Notebook, Pen} → Low Spender (Support = 40%, Confidence = 100%)
 {Laptop, Headphones} → High Spender (Support = 40%, Confidence = 100%)

Step 3: Generating Classification Rules.


From frequent itemsets, we derive classification rules:
 Rule 1: IF (Laptop AND Mouse) → High Spender
 Rule 2: IF (Notebook AND Pen) → Low Spender
 Rule 3: IF (Laptop AND Headphones) → High Spender
Step 4: Classifying a New Customer.
A new customer C6 purchases: Laptop, Mouse, Notebook.
 Matching Rules:
o {Laptop, Mouse} → High Spender
o {Notebook} → No rule found
 Final Classification: Since {Laptop, Mouse} strongly indicates High Spender,
customer C6 is classified as a High Spender.
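The final classification step can be sketched in a few lines of plain Python: the mined IF-THEN rules (assumed here to be already sorted by confidence) are scanned, and the first rule whose items all appear in the basket assigns the label:

rules = [
    ({"Laptop", "Mouse"}, "High Spender"),
    ({"Notebook", "Pen"}, "Low Spender"),
    ({"Laptop", "Headphones"}, "High Spender"),
]

def classify(basket, rules, default="Unknown"):
    for antecedent, label in rules:          # rules assumed pre-sorted by confidence
        if antecedent <= basket:             # every item of the rule is present
            return label
    return default

print(classify({"Laptop", "Mouse", "Notebook"}, rules))   # -> 'High Spender'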

OTHER METHODS OF CLASSIFICATION:

k-NEAREST NEIGHBOURS (k-NN) ALGORITHM FOR CLASSIFICATION:


 The k-Nearest Neighbours algorithm is a simple, non-parametric, and instance-based
classification technique. It classifies a new data point based on the majority class of its k
nearest neighbours in the feature space.
 This is an effective, easy-to-understand classification method used in pattern recognition,
recommendation systems, and medical diagnosis. However, it requires careful selection of
k and proper feature scaling for better results.

Working of k-Nearest Neighbour Algorithm:


1. Choose the number of neighbours (k).
2. Compute Euclidean distance between the new data point and all training samples.
3. Select the k nearest neighbours (smallest distance values).
4. Assign the majority class label among the k neighbours to the new data point.

Example of k-Nearest Neighbour Classification:


 Suppose we have a dataset where we classify whether a fruit is an apple or an orange
based on weight and texture.
Weight (grams) Texture (1 = Smooth, 0 = Rough) Fruit Label
150 1 Apple
180 0 Orange
170 0 Orange
140 1 Apple

1. The New Data Point: Suppose we have a new fruit with:


 Weight: 160 grams
 Texture: 1 (Smooth)
We want to classify this fruit as either apple or orange using k-NN.
2. Calculating Distances (k=3):
We'll use Euclidean distance to calculate the distance between the new fruit and each fruit
in the training data.

The formula for Euclidean distance between two points (x1, y1) and (x2, y2) is:

√((x2 − x1)² + (y2 − y1)²)

Here, (x1, y1) are the weight and texture of a training data point, and (x2, y2) are the weight
and texture of the new data point.
 Distance to Apple 1 (150, 1):

 √((160 − 150)² + (1 − 1)²) = √(100 + 0) = 10


 Distance to Orange 1 (180, 0):

 √((160 − 180)² + (1 − 0)²) = √(400 + 1) = √401 ≈ 20.02


 Distance to Orange 2 (170, 0):

 √((160 − 170)² + (1 − 0)²) = √(100 + 1) = √101 ≈ 10.05


 Distance to Apple 2 (140, 1)

 √((160 − 140)² + (1 − 1)²) = √(400 + 0) = 20


3. Selecting the k-Nearest Neighbors (k=3):
We're using k=3, so we select the three closest neighbors:
1. Apple 1 (distance = 10)
2. Orange 2 (distance ≈ 10.05)
3. Apple 2 (distance = 20)
4. Classification by Majority Vote:
Out of the 3 nearest neighbors, 2 are apples and 1 is an orange. Therefore, by majority vote,
the new fruit is classified as an apple.
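The same example in plain Python (note that, as in the table, the features are left unscaled, so weight dominates the distance):

from collections import Counter
from math import dist

training = [((150, 1), "Apple"), ((180, 0), "Orange"),
            ((170, 0), "Orange"), ((140, 1), "Apple")]
new_point = (160, 1)

# Euclidean distance to every training sample, keep the k = 3 closest
neighbours = sorted(training, key=lambda item: dist(item[0], new_point))[:3]
votes = Counter(label for _, label in neighbours)

print(neighbours)                  # Apple (10.0), Orange (≈10.05), Apple (20.0)
print(votes.most_common(1)[0][0])  # majority vote -> 'Apple'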

Advantages and Disadvantages of k-Nearest Neighbour Algorithm:


Advantages:
1. Simple to implement
2. Works well for small datasets
3. No training phase (lazy learner)
Disadvantages:
1. Computationally expensive for large datasets
2. Sensitive to irrelevant features and noise
3. Choosing the optimal k value is challenging

PREDICTION AND CLASSIFIER ACCURACY:
Prediction and classifier accuracy are fundamental concepts in evaluating the performance of
machine learning models, particularly in classification.
Prediction:
 Prediction refers to the process of using a trained model to estimate the output for new,
unseen data. For classification, this involves assigning a data point to one of the
predefined categories.
 The goal is to create a model that generalizes well and can accurately predict the class labels
of new data. The quality of predictions is assessed using various metrics, the simplest of
which is accuracy.
Classifier Accuracy: It is a single metric that quantifies overall correctness of a classification
model's predictions. Accuracy is typically expressed as a percentage. For example, if a model
correctly classifies 80 out of 100 instances, its accuracy is 80%. It's calculated as:
Accuracy = (Number of correctly classified instances) / (Total number of instances)
Limitations of Accuracy and the Need for Other Metrics: Imagine a dataset with 95
instances of class A and 5 instances of class B. A model that always predicts class A would
have an accuracy of 95%, which seems excellent. However, it completely fails to recognize
class B. This highlights the limitation of accuracy in imbalanced scenarios. Therefore, we need
other metrics to provide a more complete evaluation.
Other Important Metrics:
 Precision: Measures how many of the positive predictions were actually correct. It's
calculated as: Precision = (True Positives) / (True Positives + False Positives)
 Recall: Measures how many of the actual positive instances were correctly predicted. It's
calculated as: Recall = (True Positives) / (True Positives + False Negatives)
 F1-score: The harmonic mean of precision and recall, providing a balanced measure. It's
calculated as: F1-score = 2 * (Precision * Recall) / (Precision + Recall)
Example: Let's say we're building a spam email classifier. We have 100 emails, and the model
predicts 60 as spam. Out of those 60, 50 were actually spam (TP), and 10 were not (FP). There
were 40 emails that were not predicted as spam. Out of those, 30 were correctly identified as
not spam (TN), and 10 were actually spam but were missed by the model (FN).
Confusion Matrix:
                     Predicted Spam    Predicted Not Spam
Actual Spam               50                  10
Actual Not Spam           10                  30

Given Information:
Total Emails: 100
Predicted Spam: 60 → True Positives (TP): 50 (correctly predicted spam); False Positives (FP): 10 (incorrectly predicted as spam; actually not spam)
Predicted Not Spam: 40 → True Negatives (TN): 30 (correctly predicted not spam); False Negatives (FN): 10 (incorrectly predicted as not spam; actually spam)

Calculations:
Accuracy = (TP + TN) / Total = (50 + 30) / 100 = 80/100 = 0.8 (approx. 80%)
Precision (for Spam) = TP / (TP + FP) = 50 / (50 + 10) = 50/60 = 0.83 (approx. 83%)
→ Out of all emails predicted as spam, 83% were actually spam.
Recall (for Spam) = TP / (TP + FN) = 50 / (50 + 10) = 50/60 = 0.83 (approx. 83%)
→ Out of all actual spam emails, 83% were correctly identified.
Precision (for Not Spam) = TN / (TN + FN) = 30 / (30 + 10) = 30/40 = 0.75 (approx. 75%)
→ Out of all emails predicted as not spam, 75% were actually not spam.
Recall (for Not Spam) = TN / (TN + FP) = 30 / (30 + 10) = 30/40 = 0.75 (approx. 75%)
→ Out of all actual not spam emails, 75% were correctly identified.
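The same numbers can be reproduced in a few lines of plain Python from the confusion-matrix counts:

TP, FP, TN, FN = 50, 10, 30, 10

accuracy  = (TP + TN) / (TP + FP + TN + FN)                # 0.80
precision = TP / (TP + FP)                                 # ≈ 0.83 (for Spam)
recall    = TP / (TP + FN)                                 # ≈ 0.83 (for Spam)
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.83

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))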
