Classification Notes
Classification
Classification is a technique in data mining that involves categorizing data objects into predefined classes, categories, or groups based on their features or attributes. In other words, data classification is a process in which data is organized or labeled into predefined categories or classes based on its characteristics. The goal of classification is to assign new, unseen data instances to the correct predefined categories. This task is fundamental in supervised machine learning and data mining.
Data classification is a two-step process, consisting of a learning step (where a classification model is
constructed) and a classification step (where the model is used to predict class labels for given data).
In the first step (learning), we build a classification model from previous data (training or sample data).
In the second step (classification), we determine whether the model's accuracy is acceptable and, if so, use the model to classify new data.
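As a rough illustration of the two steps, here is a minimal scikit-learn sketch (assuming that library is available; the dataset and the 0.8 acceptance threshold are illustrative choices, not part of these notes):

```python
# Sketch of the two-step classification process with scikit-learn;
# dataset and acceptance threshold are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1 (learning): construct a classification model from training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2 (classification): check that accuracy is acceptable, then use the
# model to assign class labels to new, unseen tuples.
accuracy = accuracy_score(y_test, model.predict(X_test))
if accuracy >= 0.8:  # illustrative acceptance threshold
    print("predicted labels:", model.predict(X_test[:5]))
```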
Applications:
Data classification has numerous applications across various domains due to its ability to automatically
categorize data into predefined classes. Here are some key applications:
1. Spam Detection: Identifying and filtering out spam emails from legitimate emails. Example: Email
services like Gmail and Outlook use classification algorithms to move spam emails to a spam folder.
2. Medical Diagnosis: Predicting diseases or health conditions based on patient data. Example:
Classifying patients as diabetic or non-diabetic based on their medical history and test results.
3. Credit Scoring: Assessing the creditworthiness of individuals or businesses. Example: Banks and
financial institutions classify loan applicants into risk categories (e.g., low risk, high risk) based on their
financial history and credit score.
4. Image Recognition: Identifying objects, people, or scenes in images. Example: Facial recognition
systems classify images of faces to identify individuals for security purposes.
5. Sentiment Analysis: Determining the sentiment expressed in text data (e.g., positive, negative,
neutral). Example: Companies analyze customer reviews to classify them as positive or negative
feedback.
6. Fraud Detection: Identifying fraudulent activities in transactions. Example: Credit card companies classify transactions as fraudulent or legitimate based on transaction patterns.
7. Voice Recognition: Identifying spoken words or phrases. Example: Virtual assistants like Siri and Alexa classify audio input to recognize commands and provide appropriate responses.
8. Behavioral Targeting: Delivering personalized advertisements based on user behavior. Example: Online advertising platforms classify users based on their browsing history to show relevant ads.
Prediction
Prediction, in the context of data science and machine learning, refers to the process of using a trained
model to make forecasts or estimations about future or unseen data. The goal of prediction is to generate
accurate and meaningful insights by using patterns learned from historical or labeled data.
Classification and Regression are the two major types of prediction problems, where classification is used
to predict discrete or nominal values, while regression is used to predict continuous or ordered values.
Data Preparation for Classification
1. Data Cleaning: Data cleaning involves identifying and correcting (or removing) inaccuracies and inconsistencies in the data to improve its quality.
Missing Values: Missing data points can lead to biased or incorrect models.
Noisy Data: Data with errors, outliers, or irrelevant information can distort model predictions.
Duplicate Records: Multiple entries of the same data can skew results.
2. Relevance: Relevance involves ensuring that the features (variables) used in the classification task are
important and useful for making accurate predictions.
Irrelevant Features: Features that do not contribute to the prediction can add noise and reduce model
performance.
Redundant Features: Highly correlated features can provide duplicate information, making the model
more complex than necessary.
3. Data Transformation: Data transformation involves converting data into a suitable format or structure
for analysis, which can include scaling, encoding, or creating new features.
Scaling Issues: Features with different scales can disproportionately affect the model.
Categorical Data: Many algorithms require numerical input, so categorical data must be transformed
appropriately.
Non-linear Relationships: Some relationships between features and the target variable may be non-linear
and need transformation.
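A small pandas/scikit-learn sketch of these preparation steps (assuming both libraries; the column names and values are hypothetical):

```python
# Illustrative data cleaning and transformation; all data is hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":    [25, 32, None, 40, 32],
    "income": [30000, 52000, 47000, None, 52000],
    "city":   ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
    "label":  [0, 1, 0, 1, 1],
})

# Data cleaning: drop duplicate records, fill missing values with medians.
df = df.drop_duplicates()
df[["age", "income"]] = df[["age", "income"]].fillna(df[["age", "income"]].median())

# Data transformation: one-hot encode categorical data, scale numeric features.
df = pd.get_dummies(df, columns=["city"])
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])
print(df)
```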
Algorithms
1. Decision Tree Induction
ID3 (Iterative Dichotomiser 3)
C4.5
CART (Classification and Regression Trees)
Random Forest (an ensemble of decision trees)
2. Bayes Classification Methods
Bayes' Theorem
Naive Bayesian Classification
3. Rule-Based Classification
4. Lazy Learning (learn from your neighbors)
K-Nearest Neighbors (K-NN)
Decision Tree Induction
Decision tree induction is a popular and powerful method for classification and regression tasks in
machine learning. It involves creating a model that predicts the value of a target variable by learning
simple decision rules inferred from the data features.
1. Select the Best Attribute: Choose the feature that best splits the data according to a specific
criterion (such as information gain for ID3 or Gini impurity for CART).
2. Create a Decision Node: Create a node in the tree that represents the selected attribute.
3. Split the Data: Divide the dataset into subsets based on the selected attribute's values.
4. Repeat: Recursively apply the above steps to each subset.
5. Stopping Criteria: The recursion stops when one of the following conditions is met:
o All instances in a subset belong to the same class.
o No further attributes are left to split the data.
o The tree reaches a predefined maximum depth or a minimum number of instances per
node.
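To make step 1 concrete, here is a pure-Python sketch of information gain, the attribute-selection criterion ID3 uses (the four toy tuples are hypothetical):

```python
# Entropy and information gain for attribute selection (ID3-style).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Expected reduction in entropy from splitting on one attribute."""
    total = entropy(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return total - weighted

# Hypothetical tuples: (outlook, temperature) -> play?
rows   = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # outlook: gain = 1.0 (best split)
print(information_gain(rows, labels, 1))  # temperature: gain = 0.0
```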
Advantages:
Interpretability: Trees are easy to understand, and the decision rules can be visualized.
Versatility: They handle both numerical and categorical data.
Minimal Preparation: They require relatively little data preparation (e.g., no feature scaling).
Disadvantages:
Overfitting: Decision trees can easily overfit the training data, especially if they are deep.
Instability: Small changes in the data can lead to significant changes in the tree structure.
Bias towards Features with More Levels: Features with more levels can dominate splits,
leading to biased models.
C4.5
Uses gain ratio (an extension of information gain) to handle attributes with many values.
Handles both categorical and continuous attributes.
Prunes the tree after creation to remove branches that do not provide additional power in classification.
Can handle missing values in the data.
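A pure-Python sketch of C4.5's gain ratio, which divides information gain by split information to penalize attributes with many distinct values (the toy data is hypothetical, and the entropy helper matches the ID3 sketch above):

```python
# C4.5's gain ratio = information gain / split information.
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

attr   = ["sunny", "sunny", "rain", "rain"]   # attribute value per tuple
labels = ["no", "no", "yes", "yes"]           # class label per tuple

gain = entropy(labels) - sum(
    attr.count(v) / len(attr)
    * entropy([l for a, l in zip(attr, labels) if a == v])
    for v in set(attr)
)
split_info = entropy(attr)   # high for many-valued attributes
print(gain / split_info)     # gain ratio = 1.0 / 1.0 = 1.0 here
```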
CART (Classification and Regression Trees)
Uses Gini impurity or entropy to split the data for classification tasks.
Uses variance reduction to split the data for regression tasks.
Constructs binary trees, where each node has exactly two children.
Prunes the tree using cost-complexity pruning to avoid overfitting.
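A pure-Python sketch of evaluating a candidate binary split by weighted Gini impurity, as CART does (the class counts are hypothetical):

```python
# Gini impurity for a candidate binary split (CART-style); counts are toy values.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

left, right = [8, 2], [1, 9]          # class counts in each child node
n = sum(left) + sum(right)
weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
print(gini([9, 11]), weighted)        # parent impurity 0.495 vs. 0.25 after split
```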
Random Forest
Uses bootstrap aggregating (bagging) to create multiple subsets of the training data.
Each tree is trained on a different subset, and a random subset of features is used to split each
node.
Reduces overfitting by averaging the results of many trees.
Provides a measure of feature importance based on how much each feature improves the split
criterion.
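A brief scikit-learn sketch of a random forest (assuming that library; n_estimators=100 and max_features="sqrt" are illustrative settings):

```python
# Random forest sketch; bagging and per-split feature sampling
# happen inside the estimator.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)

# Feature importance: how much each feature improves the split criterion.
print(forest.feature_importances_)
```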
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities such as the
probability that a given tuple belongs to a particular class.
Bayesian classification is based on Bayes’ theorem, described next. Studies comparing classification
algorithms have found a simple Bayesian classifier known as the naive Bayesian classifier to be
comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers
have also exhibited high accuracy and speed when applied to large databases. Naive Bayesian classifiers
assume that the effect of an attribute value on a given class is independent of the values of the other
attributes. This assumption is called class conditional independence. It is made to simplify the
computations involved and, in this sense, is considered “naive”.
Bayes Theorem
Bayes' theorem is one of the most important concepts in machine learning: it lets us compute the probability of one event, under uncertain knowledge, given that a related event has already been observed.
Bayes' theorem can be derived from the product rule of probability applied to two events X and Y.
According to the product rule, the joint probability of X and Y can be written in two ways:
P(X and Y) = P(X|Y) P(Y)    and    P(X and Y) = P(Y|X) P(X)
Equating the two right-hand sides and solving for P(X|Y) gives:
P(X|Y) = P(Y|X) P(X) / P(Y)
This equation is called Bayes' rule or Bayes' theorem. Here P(X|Y) is the posterior probability of X given Y, P(Y|X) is the likelihood, P(X) is the prior probability, and P(Y) is the evidence.
Hence, Bayes Theorem can be written as: posterior = likelihood * prior / evidence
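A worked numeric check of the rule in plain Python, using hypothetical probabilities for a spam-filtering scenario:

```python
# Worked Bayes' rule example; all probabilities are hypothetical.
p_spam = 0.2             # prior: P(spam)
p_word_given_spam = 0.6  # likelihood: P("offer" appears | spam)
p_word = 0.25            # evidence: P("offer" appears)

# posterior = likelihood * prior / evidence
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # 0.6 * 0.2 / 0.25 = 0.48
```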
Naive Bayesian Classification
A naive Bayesian classifier assigns a tuple X = (x1, ..., xn) to the class Ci that maximizes the posterior probability P(Ci|X). Under class conditional independence, P(X|Ci) factors into the product of the individual attribute probabilities, so the classifier simply picks the class that maximizes P(Ci) * P(x1|Ci) * ... * P(xn|Ci).
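A minimal pure-Python sketch of this rule on a hypothetical two-attribute weather-style dataset (the tuples and labels are illustrative):

```python
# Minimal categorical naive Bayes; all data is hypothetical.
rows   = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]

def naive_bayes_predict(x):
    scores = {}
    for c in set(labels):
        c_rows = [r for r, l in zip(rows, labels) if l == c]
        score = len(c_rows) / len(rows)        # prior P(Ci)
        for i, value in enumerate(x):          # times each P(xk | Ci)
            score *= sum(r[i] == value for r in c_rows) / len(c_rows)
        scores[c] = score
    return max(scores, key=scores.get)         # class with highest score

print(naive_bayes_predict(("rain", "hot")))    # -> "yes"
```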
Lazy Learning (Learn from your neighbors)
K-Nearest Neighbors (K-NN)
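K-nearest neighbors is a lazy learner: it simply stores the training tuples and defers all computation until a new tuple arrives, which is then assigned the majority class among its k closest training tuples. A minimal pure-Python sketch (the 2-D toy points and k = 3 are hypothetical choices):

```python
# Minimal k-NN classifier; the toy points and k=3 are illustrative.
from collections import Counter
from math import dist

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((4.5, 3.9), "B"), ((4.1, 4.0), "B")]

def knn_predict(x, k=3):
    # Sort training points by Euclidean distance to x, vote among the k nearest.
    nearest = sorted(train, key=lambda p: dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((4.0, 4.0)))  # -> "B"
```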
A classification matrix, also known as a confusion matrix, is a table used to evaluate the performance of a
classification algorithm. It compares the actual labels with the predicted labels generated by the model.
A confusion matrix for a binary classification problem typically looks like this:

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)

From these four counts, several evaluation metrics are defined:
1. Accuracy: The proportion of correctly classified instances (both true positives and true negatives) among the total instances. Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Precision (Positive Predictive Value): The proportion of true positive predictions among all positive predictions. Precision = TP / (TP + FP)
3. Recall (Sensitivity, True Positive Rate): The proportion of true positive predictions among all actual positive instances. Recall = TP / (TP + FN)
4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics. F1 = 2 * Precision * Recall / (Precision + Recall)
5. Specificity (True Negative Rate): The proportion of true negative predictions among all actual negative instances. Specificity = TN / (TN + FP)
6. False Positive Rate (FPR): The proportion of false positive predictions among all actual negative instances. FPR = FP / (FP + TN)
7. False Negative Rate (FNR): The proportion of false negative predictions among all actual positive instances. FNR = FN / (FN + TP)
Example
Consider a binary classification problem where a model predicts whether an email is spam (positive class) or not spam (negative class).
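A quick sketch computing the metrics above for this spam filter from a hypothetical confusion matrix (the counts are illustrative, not taken from real data):

```python
# Metrics from a hypothetical spam-filter confusion matrix.
TP, FN = 40, 10   # spam caught / spam missed
FP, TN = 5, 45    # ham wrongly flagged / ham passed through

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.85
precision   = TP / (TP + FP)                    # ~0.889
recall      = TP / (TP + FN)                    # 0.80
f1          = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)                    # 0.90
print(accuracy, precision, recall, f1, specificity)
```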
Difference between Classification and Prediction

Classification: Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known.
Prediction: Prediction is the process of identifying the missing or unavailable numerical data for a new observation.

Classification: The accuracy depends on finding the class label correctly.
Prediction: The accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data.

Classification: The model can be known as the classifier.
Prediction: The model can be known as the predictor.

Classification: A model or classifier is constructed to find the categorical labels.
Prediction: A model or predictor is constructed that predicts a continuous-valued function or ordered value.

Classification: For example, grouping patients based on their medical records can be considered classification.
Prediction: For example, predicting the correct treatment for a particular disease for a person can be considered prediction.