Unit-4 Data Mining
CLASSIFICATION:
Classification is a supervised learning technique in data mining used to categorize data into
predefined groups or classes. The goal is to build a model that can accurately classify new,
unseen data based on past observations. Example: email spam detection.
Process of Classification:
1. Data Collection: In this step, relevant data is collected. The data should contain all the
necessary attributes and labels needed for classification. The data can be collected from
various sources, such as surveys, questionnaires, websites, and databases.
2. Data Pre-processing: The collected data needs to be pre-processed to ensure its quality.
This involves handling missing values, dealing with outliers, and transforming the data into a
suitable form for analysis.
3. Feature Selection: It involves identifying the most relevant attributes in the dataset for
classification. This can be done using various techniques, such as:
Correlation Analysis measures the statistical relationship between two or more
variables to identify patterns, dependencies, and associations in the data.
Information Gain is a measure of the amount of information that a feature provides for
classification. Features with high information gain are selected for classification.
Principal Component Analysis is a technique used to reduce the dimensionality of the
dataset. It identifies the most important features in the dataset and removes redundant ones.
4. Model Selection: It involves selecting the appropriate classification algorithm. These are:
Decision Tree is hierarchical model that splits data into subsets based on feature values,
creating a tree-like structure to predict the target class by following decision rules.
Bayesian classification is a probabilistic classification technique that applies Bayes'
Theorem to predict the probability of a class based on prior knowledge and observed features.
Support Vector Machine (SVM) is a supervised learning algorithm that finds the optimal
hyperplane to separate data points of different classes with the maximum margin.
A Neural Network is a computational model inspired by the human brain, consisting of
layers of interconnected nodes (neurons) that learn complex patterns from data to
classify inputs into predefined categories.
5. Model Training: It involves using the selected classification algorithm to learn the patterns in
the data. The data is divided into a training set and a validation set. The model is trained
using the training set, and its performance is evaluated on the validation set.
6. Model Evaluation: It involves assessing the performance of the trained model on a test
set. This is done to ensure that the model generalizes well.
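These steps can be sketched in a few lines of Python. The snippet below is a minimal illustration only: the scikit-learn library, the built-in Iris dataset, and the 70/30 split are assumptions chosen for demonstration, not part of the notes above.

```python
# Minimal sketch of the classification workflow using scikit-learn (assumed library).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Steps 1-2. Data collection and pre-processing: Iris is already clean and numeric.
X, y = load_iris(return_X_y=True)

# Step 5. Model training: hold out part of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4. Model selection: a decision tree classifier.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Step 6. Model evaluation: accuracy on unseen test data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```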
PREDICTION:
Prediction is a technique used to estimate unknown values or future outcomes based on
patterns found in historical data. It helps businesses and researchers make data-driven
decisions by forecasting trends, behaviours, or numerical values. For example:
Predicting house prices based on location, size, and facilities.
Forecasting stock market trends using historical stock data.
Estimating customer purchases based on past shopping behaviour.
Process of Prediction:
1. Data Collection and Preparation: This step gathers relevant data from various sources, cleans
and pre-processes it to handle missing values, inconsistencies, and noise, and transforms
the data into a suitable format for analysis.
2. Feature Selection: It identifies the most important variables (features) that influence the
outcome you want to predict. This step helps simplify the model and improve its accuracy.
3. Model Selection: Chooses an appropriate prediction model based on the type of data and
the prediction task. Some common models include:
o Regression: Predicts continuous values (e.g., house prices, stock values)
o Classification: Predicts categories or classes (e.g., spam/not spam)
o Time Series Analysis: Predicts values over time (e.g., sales forecasts, weather patterns)
4. Model Training: It uses a portion of our data (training data) to teach the model the
relationships between the features and the outcomes. The model learns patterns and rules
that it can use to make predictions on new data.
5. Model Evaluation: The model's performance is evaluated on a separate portion of data (testing
data) that the model has not seen before, by checking how well its predictions match the
actual outcomes. The model is then fine-tuned to improve its accuracy and generalization ability.
6. Deployment and Monitoring: Once we are satisfied with the model's performance, deploy
it to make predictions on real-world data. Continuously monitor the model's performance
and retrain it as needed to maintain accuracy over time.
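As a rough sketch of this workflow, the snippet below fits a linear regression on a tiny made-up house-price dataset (all sizes and prices are invented for illustration) and predicts the price of a new house.

```python
# Minimal prediction sketch: linear regression on made-up house sizes and prices.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: size in square feet -> price (invented numbers).
sizes = np.array([[600], [800], [1000], [1200], [1500]])
prices = np.array([30, 40, 52, 61, 75])

model = LinearRegression()
model.fit(sizes, prices)                    # model training on historical data

new_house = np.array([[1100]])              # unseen data point
print("Predicted price:", model.predict(new_house)[0])
```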
Training Data: Used to build and train the model.
Test Data: Used to evaluate the model's performance after training.
Issues in Classification and Prediction:
1. Data cleaning:
This refers to the pre-processing of data to remove or reduce noise using smoothing
methods and to handle missing values. Although various classification algorithms
have some mechanisms for managing noisy or missing information, this step can help
reduce confusion during learning.
2. Relevance analysis:
There may be attributes in the data that are irrelevant to the classification or
prediction task. For example, data recording the day of the week on which a dealer's
bike sale was filed is unlikely to be relevant to the success of the application. Moreover,
some other attributes can be redundant.
Therefore, relevance analysis is performed on the data to remove irrelevant or
redundant attributes from the learning procedure. In machine learning, this step is
referred to as feature selection. The main objective is to improve classification
efficiency and scalability without reducing performance.
3. Data transformation:
The data can be generalized to higher-level concepts. Here, concept hierarchies can
be used. This is especially helpful for continuous-valued attributes. For example,
numeric values for the attribute income can be generalized to the discrete field
including low, medium, and high. Similarly, nominal-valued attributes, such as the
street, can be generalized to higher-level concepts, such as the city.
Generalization compresses the original training data. The data can also be normalized.
Normalization involves scaling all values of a given attribute so that they fall within a
small specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
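A minimal sketch of such normalization is given below, assuming scikit-learn's MinMaxScaler and made-up income values; it simply rescales the attribute into the 0 to 1 range described above.

```python
# Min-max normalization of an 'income' attribute into the range [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

income = np.array([[25000], [48000], [60000], [110000]])   # hypothetical values

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(income)        # (x - min) / (max - min) per attribute
print(scaled.ravel())
```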
DECISION TREE:
Decision Tree is a supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
a) Decision nodes are used to make any decision and have multiple branches.
b) Leaf nodes are the output of those decisions and do not contain any further branches.
The decisions or tests are performed on the basis of the features of the given dataset.
It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions. A decision tree simply asks a question, and based on the answer
(Yes/No), it further splits the tree into sub-trees. It can handle both categorical data (YES/NO)
and numeric data.
In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm. It is called a decision tree because, similar to a tree, it starts
with the root node, which expands on further branches and constructs a tree-like structure.
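scikit-learn's DecisionTreeClassifier is based on an optimised version of CART. The sketch below, using a tiny invented salary/distance dataset, shows how the learned decision rules can be printed as a tree; the feature values and names are assumptions for illustration only.

```python
# Sketch: CART-style decision tree whose learned rules are printed as text.
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny invented dataset: [salary, distance_km] -> accept offer (1) / decline (0)
X = [[12, 5], [12, 30], [6, 5], [15, 10], [7, 25], [14, 28]]
y = [1, 0, 0, 1, 0, 1]

tree = DecisionTreeClassifier(criterion="gini", random_state=0)   # CART uses the Gini index
tree.fit(X, y)

print(export_text(tree, feature_names=["salary", "distance"]))
```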
Decision Tree Terminologies:
1. Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
2. Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
3. Splitting: Splitting is the process of dividing the decision node/root node into subnodes
according to the given conditions.
4. Branch/Sub Tree: A tree formed by splitting the tree.
5. Parent/Child node: The root node of the tree is the parent node, and the other nodes are child nodes.
Advantages of the Decision Tree:
1. It is simple to understand.
2. It can be very useful for solving decision-related problems.
3. It helps to think about all the possible outcomes for a problem.
Disadvantages of the Decision Tree:
1. The decision tree contains lots of layers, which makes it complex.
2. It may have over-fitting issue, which can be resolved using Random Forest algorithm.
3. For more class labels, the computational complexity of the decision tree may increase.
For Example:
Suppose a candidate has a job offer and wants to decide whether he should accept it or not.
To solve this problem, the decision tree starts with the root node (the Salary attribute,
selected by an attribute selection measure). The root node splits further into the next
decision node (Distance from the office) and one leaf node, based on the corresponding
labels. The next decision node further splits into one decision node (Cab facility) and one
leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and
Declined offer).
EXAMPLE DATASET:
We will use the "Play Tennis" dataset where the goal is to predict whether a person will play
tennis based on the weather conditions.
Target Variable: Play Tennis (Yes/No)
Features: Outlook, Temperature, Humidity, Wind
ID Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rainy Mild High Weak Yes
5 Rainy Cool Normal Weak Yes
6 Rainy Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rainy Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rainy Mild High Strong No
Step 1: Calculate the Entropy of the entire dataset: Entropy(S) = − Σ pi log2 pi
Total Yes = 9; Total No = 5; Total Samples = 14
Entropy(S) = − ( (9/14) log2(9/14) + (5/14) log2(5/14) ) ≈ 0.94
1. Information Gain for "Outlook": We split the dataset based on the Outlook attribute:
Sunny: 5 instances → (2 Yes, 3 No) → Entropy = 0.97
Overcast: 4 instances → (4 Yes, 0 No) → Entropy = 0
Rainy: 5 instances → (3 Yes, 2 No) → Entropy = 0.97
Entropy for Sunny:
Entropy(Sunny) = − ( (2/5) log2(2/5) + (3/5) log2(3/5) ) ≈ 0.97
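The same arithmetic can be verified with a short script. This is only a checking sketch for the counts above (9 Yes / 5 No overall; 2/3, 4/0 and 3/2 within the Outlook branches); the helper function is not part of the original example.

```python
# Verify the entropy and information-gain arithmetic for the Outlook attribute.
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

S = entropy([9, 5])                      # whole dataset: 9 Yes, 5 No  -> ~0.940
sunny    = entropy([2, 3])               # Sunny:    2 Yes, 3 No       -> ~0.971
overcast = entropy([4, 0])               # Overcast: 4 Yes, 0 No       ->  0.0
rainy    = entropy([3, 2])               # Rainy:    3 Yes, 2 No       -> ~0.971

gain_outlook = S - (5/14 * sunny + 4/14 * overcast + 5/14 * rainy)
print(round(S, 3), round(gain_outlook, 3))   # 0.94  0.247
```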
BAYESIAN CLASSIFICATION:
Bayesian classification is a probabilistic approach to classification based on Bayes'
theorem, which calculates the probability of a class given observed data. It determines the
most likely class for a given input by updating prior beliefs with new evidence.
Bayesian classifiers are generative models, which learn the joint probability
distribution P(X, Y) of the features X and the labels Y.
Bayes' Theorem:
Bayes' theorem provides a way to update our beliefs about an event given some evidence. It
helps us calculate the probability of a data point belonging to a certain class, given its features.
In classification, we use Bayes' theorem to calculate P(C|X) for each possible class C. We then
assign the data point to the class with the highest posterior probability.
Bayes' Theorem: P(C|X) = [ P(X|C) * P(C) ] / P(X)
Where:
P(C|X): Posterior probability - The probability of the data point belonging to class C,
given the observed features X. This is what we want to calculate.
P(X|C): Likelihood - The probability of observing features X, given that the data point
belongs to class C.
P(C): Prior probability - The prior probability of the data point belonging to class C,
before considering the features X.
P(X): Evidence - The probability of observing features X, regardless of the class. This
acts as a normalizing constant.
Numerical Example: Spam Email Classification
We will classify an email as Spam (S) or Not Spam (¬S) based on the presence of the word
"free" using Bayes’ Theorem:
P(C|X) = [ P(X|C) * P(C) ] / P(X)
Step 1: Given probabilities (used throughout the example): P(S) = 0.4, P(¬S) = 0.6,
P(X|S) = 0.8 (a spam email contains "free"), P(X|¬S) = 0.2 (a non-spam email contains "free").
Step 2: Compute Evidence P(X) - The total probability of an email containing "free" is:
P(X) = [ P(X|S) * P(S) ] + [ P(X|¬S) * P(¬S) ] = (0.8 × 0.4) + (0.2 × 0.6) = 0.32 + 0.12 = 0.44
Step 3: Compute Posterior Probability P(S|X)
P(S|X) = [ P(X|S) * P(S) ] / P(X) = (0.8 × 0.4) / 0.44 = 0.32 / 0.44 ≈ 0.727
Step 4: Compute P(¬S|X)
P(¬S|X) = [ P(X|¬S) * P(¬S) ] / P(X) = (0.2 × 0.6) / 0.44 = 0.12 / 0.44 ≈ 0.273
Step 5: Classification Decision
Since P(S|X) = 0.727 is greater than P(¬S|X) = 0.273, the email is more likely to be spam.
Final Answer: The email is classified as Spam.
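The worked example can be reproduced in a few lines of Python, using the same assumed probabilities (P(S) = 0.4, P(X|S) = 0.8, P(X|¬S) = 0.2):

```python
# Reproduce the spam-classification example with Bayes' theorem.
p_spam      = 0.4    # P(S): prior probability of spam (assumed in the example)
p_not_spam  = 0.6    # P(¬S)
p_free_spam = 0.8    # P(X|S): "free" appears in a spam email
p_free_ham  = 0.2    # P(X|¬S): "free" appears in a non-spam email

evidence = p_free_spam * p_spam + p_free_ham * p_not_spam       # P(X) = 0.44
p_spam_given_free = p_free_spam * p_spam / evidence             # ≈ 0.727
p_ham_given_free  = p_free_ham * p_not_spam / evidence          # ≈ 0.273

print("Spam" if p_spam_given_free > p_ham_given_free else "Not Spam")
```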
Advantages of Bayesian Classification:
1. They are easy to implement, computationally inexpensive and suitable for large datasets.
2. They can handle a large number of features, which makes them suitable for text classification.
3. They work well with limited data.
Disadvantages:
1. It assumes features are independent, which is often unrealistic.
2. Performance drops if features are dependent.
3. Poor priors can lead to biased results.
Applications:
1. Spam filtering: Classifies emails as spam or not spam.
2. Medical diagnosis: Predicts diseases based on symptoms.
3. Fraud detection: Identifies fraudulent transactions based on past data.
Naive Bayes Classifier:
The Naïve Bayes classifier makes a naïve assumption that all features are conditionally
independent given the class label. This means that the presence of one feature does not
affect the probability of another feature occurring, given the class.
Naïve Bayes is widely used in text classification because it is fast, scalable, and effective
even with simple assumptions.
This assumption is "naïve" because, in real-world data, features are often correlated. For
example, in email classification, words like "free" and "offer" frequently appear together
in spam emails.
Benefits of the Naïve Assumption:
1. Fast and Efficient
2. Works Well with Small Data
3. Handles High-Dimensional Data Well, e.g. it performs well in text classification.
4. Works well with categorical and binary data, making it suitable for problems like medical diagnosis.
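As a rough illustration of Naïve Bayes on text, the sketch below trains scikit-learn's MultinomialNB on a handful of invented email snippets; the messages, labels and test phrase are purely hypothetical.

```python
# Tiny Naïve Bayes spam-filter sketch (invented training messages).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["free offer win money", "limited free prize",
            "meeting at noon", "project status report"]
labels   = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)      # word-count (bag-of-words) features

clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["free money offer"])))   # -> ['spam']
```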
Example of an Association Rule: "Customers who buy bread also tend to buy milk"
Support: 30% (30% of transactions include both bread and milk).
Confidence: 80% (80% of people who buy bread also buy milk).
Lift: 1.5 (Buying bread increases the likelihood of buying milk by 1.5 times).
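These three measures can be computed directly from a transaction list, as in the sketch below; the five transactions are invented and therefore do not reproduce the 30% / 80% / 1.5 figures quoted above.

```python
# Compute support, confidence and lift for the rule {bread} -> {milk}.
transactions = [
    {"bread", "milk"}, {"bread", "milk", "butter"}, {"bread", "milk"},
    {"eggs", "butter"}, {"bread"},
]
n = len(transactions)

support_bread      = sum("bread" in t for t in transactions) / n
support_milk       = sum("milk" in t for t in transactions) / n
support_bread_milk = sum({"bread", "milk"} <= t for t in transactions) / n

confidence = support_bread_milk / support_bread     # P(milk | bread)
lift       = confidence / support_milk              # how much bread boosts milk

print(support_bread_milk, confidence, lift)         # 0.6  0.75  1.25
```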
ASSOCIATIVE CLASSIFICATION:
Associative Classification (AC) is a classification technique that integrates association
rule mining with classification. Instead of using traditional classifiers like decision trees or
Naïve Bayes, it derives classification rules from frequent patterns in the dataset.
The goal is to build accurate and interpretable classifiers by discovering patterns in the data.
It is particularly useful when traditional classifiers struggle with complex relationships
between attributes.
Example Associative Classification:
Classifying Customers Based on Purchase History.
A retail store wants to classify customers as "High Spender" or "Low Spender" based
on their purchasing behavior.
Step 1: Transaction Data.
The store collects data on customer purchases:
Customer Items Purchased Spender Class
C1 Laptop, Mouse, Headphones High Spender
C2 Laptop, Mouse High Spender
C3 Notebook, Pen Low Spender
C4 Laptop, Headphones High Spender
C5 Notebook, Pen, Mouse Low Spender
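A minimal sketch of the next step, deriving simple class association rules of the form {item} → class from these five transactions, is shown below; the support and confidence thresholds are assumed values, not the store's actual criteria.

```python
# Derive simple class association rules {item} -> class from the transactions above.
data = [
    ({"Laptop", "Mouse", "Headphones"}, "High Spender"),
    ({"Laptop", "Mouse"},               "High Spender"),
    ({"Notebook", "Pen"},               "Low Spender"),
    ({"Laptop", "Headphones"},          "High Spender"),
    ({"Notebook", "Pen", "Mouse"},      "Low Spender"),
]
n = len(data)

items = {i for basket, _ in data for i in basket}
for item in sorted(items):
    for cls in ("High Spender", "Low Spender"):
        both = sum(item in basket and label == cls for basket, label in data)
        has_item = sum(item in basket for basket, _ in data)
        support, confidence = both / n, both / has_item
        if support >= 0.4 and confidence >= 0.8:        # assumed thresholds
            print(f"{{{item}}} -> {cls}  (support={support:.2f}, confidence={confidence:.2f})")
```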
OTHER METHODS OF CLASSIFICATION:
One such method is k-Nearest Neighbours (KNN), which classifies a new data point according
to the classes of its closest points in the training data, typically measured with the Euclidean
distance.
The formula for the Euclidean distance between two points (x1, y1) and (x2, y2) is:
d = √( (x2 − x1)² + (y2 − y1)² )
Here,
x1: weight of the given data point
x2: weight of the new data point
y1: texture of the given data point
y2: texture of the new data point
Distance to Apple 1 (150, 1):
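A short sketch of this distance calculation is given below. Only Apple 1 (150, 1) comes from the example; the other known fruits and the new data point (160, 1) are assumed values, since the full example table is not reproduced here.

```python
# Euclidean distance between a new fruit and known fruits, using (weight, texture).
from math import sqrt

def euclidean(p, q):
    """Distance between two points (x1, y1) and (x2, y2)."""
    return sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

known = {"Apple 1": (150, 1), "Apple 2": (170, 1), "Orange 1": (140, 0)}  # assumed data
new_fruit = (160, 1)                                                      # assumed new point

# Print the known fruits from nearest to farthest, with their distances.
for name, point in sorted(known.items(), key=lambda kv: euclidean(new_fruit, kv[1])):
    print(name, round(euclidean(new_fruit, point), 2))
```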
PREDICTION AND CLASSIFIER ACCURACY:
Prediction and classifier accuracy are fundamental concepts in evaluating the performance of
machine learning models, particularly in classification.
Prediction:
Prediction refers to the process of using a trained model to estimate the output for new,
unseen data. For classification, this involves assigning a data point to one of the
predefined categories.
The goal is to create a model that generalizes well and can accurately predict class labels
of unseen data. The quality of predictions is assessed using various metrics, the simplest of which is accuracy.
Classifier Accuracy: It is a single metric that quantifies overall correctness of a classification
model's predictions. Accuracy is typically expressed as a percentage. For example, if a model
correctly classifies 80 out of 100 instances, its accuracy is 80%. It's calculated as:
Accuracy = (Number of correctly classified instances) / (Total number of instances)
Limitations of Accuracy and the Need for Other Metrics: Imagine a dataset with 95
instances of class A and 5 instances of class B. A model that always predicts class A would
have an accuracy of 95%, which seems excellent. However, it completely fails to recognize
class B. This highlights the limitation of accuracy in imbalanced scenarios. Therefore, we need
other metrics to provide a more complete evaluation.
Other Important Metrics:
Precision: Measures how many of the positive predictions were actually correct. It's
calculated as: Precision = (True Positives) / (True Positives + False Positives)
Recall: Measures how many of the actual positive instances were correctly predicted. It's
calculated as: Recall = (True Positives) / (True Positives + False Negatives)
F1-score: The harmonic mean of precision and recall, providing a balanced measure. It's
calculated as: F1-score = 2 * (Precision * Recall) / (Precision + Recall)
Example: Let's say we're building a spam email classifier. We have 100 emails, and the model
predicts 60 as spam. Out of those 60, 50 were actually spam (TP), and 10 were not (FP). There
were 40 emails that were not predicted as spam. Out of those, 30 were correctly identified as
not spam (TN), and 10 were actually spam but were missed by the model (FN).
Confusion Matrix:
Predicted Spam Predicted Not Spam
Actual Spam 50 10
Actual Not Spam 10 30
Given Information:
Total Emails: 100
Predicted Spam: 60 → True Positives (TP): 50 (correctly predicted spam); False Positives (FP): 10 (incorrectly predicted spam - actually not spam)
Predicted Not Spam: 40 → True Negatives (TN): 30 (correctly predicted not spam); False Negatives (FN): 10 (incorrectly predicted not spam - actually spam)
Calculations:
Accuracy = (TP + TN) / Total = (50 + 30) / 100 = 80/100 = 0.80 (80%)
Precision (for Spam) = TP / (TP + FP) = 50 / (50 + 10) = 50/60 ≈ 0.83 (83%)
Out of all emails predicted as spam, 83% were actually spam.
Recall (for Spam) = TP / (TP + FN) = 50 / (50 + 10) = 50/60 ≈ 0.83 (83%)
Out of all actual spam emails, 83% were correctly identified.
Precision (for Not Spam) = TN / (TN + FN) = 30 / (30 + 10) = 30/40 = 0.75 (75%)
Out of all emails predicted as not spam, 75% were actually not spam.
Recall (for Not Spam) = TN / (TN + FP) = 30 / (30 + 10) = 30/40 = 0.75 (75%)
Out of all actual not spam emails, 75% were correctly identified.
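The same numbers can be confirmed with scikit-learn's metric functions. The sketch below simply rebuilds the 100 labels implied by the confusion matrix (50 TP, 10 FN, 10 FP, 30 TN) rather than using real emails.

```python
# Re-derive accuracy, precision, recall and F1 from the confusion-matrix counts.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1 = spam, 0 = not spam. Reconstruct labels: 50 TP, 10 FN, 10 FP, 30 TN.
y_true = [1] * 50 + [1] * 10 + [0] * 10 + [0] * 30
y_pred = [1] * 50 + [0] * 10 + [1] * 10 + [0] * 30

print("Accuracy :", accuracy_score(y_true, y_pred))                 # 0.80
print("Precision:", round(precision_score(y_true, y_pred), 2))      # 0.83
print("Recall   :", round(recall_score(y_true, y_pred), 2))         # 0.83
print("F1-score :", round(f1_score(y_true, y_pred), 2))             # 0.83
```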