
Classification in Data Mining

• Classification is a supervised machine learning technique used in data mining to assign categories or labels to data points based on their features.
• It is widely used in applications such as spam detection, fraud detection, sentiment analysis, and medical diagnosis.
Steps in Classification

1. Data Collection – Gather the dataset with labeled examples.
2. Preprocessing – Clean and normalize the data; handle missing values.
3. Feature Selection – Identify the most relevant attributes for classification.
4. Model Training – Use a classification algorithm to learn from the training data.
5. Model Evaluation – Test the model on unseen data using metrics like accuracy, precision, recall, and F1-score.
6. Prediction & Deployment – Apply the trained model to classify new data.
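
To make these steps concrete, here is a minimal sketch of the workflow in Python with scikit-learn; the iris dataset, the 70/30 split, and the decision tree model are illustrative assumptions rather than part of the original slides.

# Minimal classification workflow (assumed dataset and model choice).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                  # 1. data collection (labeled examples)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)          # hold out unseen data for evaluation

scaler = StandardScaler()                          # 2. preprocessing (normalization)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = DecisionTreeClassifier(random_state=42)    # 4. model training
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                     # 5. evaluation on unseen data
print("Accuracy:", accuracy_score(y_test, y_pred))
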
Common Classification Algorithms

• Decision Tree
• Naïve Bayes
• k-Nearest Neighbors (k-NN)
• Support Vector Machine (SVM)
• Neural Networks (Deep Learning)
• Random Forest
• Logistic Regression
A. Decision Tree

• Uses a tree-like structure to make decisions based on feature values.
• Examples: ID3, C4.5, CART.

B. Naïve Bayes

• Based on Bayes' theorem, assuming independence between features.
• Suitable for text classification (e.g., spam filtering).
C. k-Nearest Neighbors (k-NN)

• Classifies data based on the majority class of the k nearest neighbors.
• Works well with smaller datasets.

D. Support Vector Machine (SVM)

• Uses a hyperplane to separate classes with the maximum margin.
• Effective for high-dimensional data.
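
As a hedged illustration, the sketch below fits a linear SVM with scikit-learn; the synthetic dataset and the C value are assumptions made for demonstration.

# Maximum-margin classification with a linear SVM (assumed data).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = SVC(kernel="linear", C=1.0)   # separating hyperplane with maximum margin
clf.fit(X, y)
print(clf.predict(X[:5]))           # class labels for the first five points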


E. Neural Networks (Deep Learning)
• Mimics the human brain with interconnected layers of neurons.
• Used in complex tasks like image recognition and NLP.
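
As a rough illustration, a small feed-forward network can be trained with scikit-learn's MLPClassifier; the digits dataset and the layer sizes below are assumed choices, not the slides' prescription.

# A small neural network: two hidden layers of interconnected neurons.
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
net.fit(X, y)                                   # backpropagation adjusts the weights
print("Training accuracy:", net.score(X, y))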

F. Random Forest
• An ensemble of multiple decision trees for improved accuracy.
• Reduces overfitting compared to a single decision tree.
G. Logistic Regression

• A statistical model that estimates the probability of a class.
• Often used for binary classification problems.
Types of Classification

Binary Classification – Two class labels (e.g., spam vs. not spam).

Multiclass Classification – More than two class labels (e.g., types of diseases).

Multi-Label Classification – A single instance can belong to multiple categories (e.g., tagging images with multiple objects).
Evaluation Metrics

Accuracy = (Correct Predictions) / (Total Predictions)

Precision = TP / (TP + FP) – Measures how many predicted positives are actually positive.

Recall (Sensitivity) = TP / (TP + FN) – Measures how many actual positives were correctly predicted.

F1-Score = 2 × (Precision × Recall) / (Precision + Recall) – Harmonic mean of precision and recall.
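
A short worked example of these metrics on an assumed set of true and predicted labels (1 = positive, 0 = negative):

# With TP = 3, FP = 1, FN = 1, TN = 3, all four metrics come out to 0.75.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # correct / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of the two
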
Applications of Classification

• Healthcare – Disease prediction and diagnosis.
• Finance – Credit scoring, fraud detection.
• E-commerce – Customer segmentation and recommendation systems.
• Social Media – Sentiment analysis and content moderation.
Statistical-Based Algorithms in Data Mining

• Statistical-based algorithms use mathematical models and probability distributions to identify patterns, relationships, and trends in data.
• These methods are widely used in classification, clustering, regression, and anomaly detection.
1. Common Statistical-Based Algorithms
A. Naïve Bayes Classifier

• Based on Bayes' Theorem and assumes independence between features.
• Used for spam detection, sentiment analysis, and medical diagnosis.
• Types:
  • Gaussian Naïve Bayes – Assumes a normal distribution (e.g., continuous data).
  • Multinomial Naïve Bayes – Suitable for text data (e.g., word counts).
  • Bernoulli Naïve Bayes – Deals with binary features.
Formula: P(C|X) = (P(X|C) × P(C)) / P(X), where C is a class and X is the observed feature vector.
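
A minimal sketch of Multinomial Naïve Bayes applied to spam filtering; the tiny corpus and its labels are made-up examples.

# Text classification with word counts and Bayes' theorem (assumed corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win money now", "meeting at noon", "free money offer", "lunch tomorrow"]
labels = [1, 0, 1, 0]                         # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(docs)                   # word counts per document
clf = MultinomialNB().fit(X, labels)          # estimates P(word | class) per class
print(clf.predict(vec.transform(["free money"])))   # expected: [1] (spam)
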
B. Logistic Regression

• Used for binary classification (e.g., spam vs. not spam).
• Uses the sigmoid function to model probability values.
• Can be extended to Multinomial Logistic Regression for multiple classes.

Sigmoid Function: σ(z) = 1 / (1 + e^(−z)), which maps any real-valued score z to a probability between 0 and 1.
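
A brief sketch showing that logistic regression passes a linear score through this sigmoid; the synthetic data and labels below are assumptions.

# Recovering predict_proba by applying the sigmoid to the linear score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # assumed linearly separable labels

clf = LogisticRegression().fit(X, y)
z = X @ clf.coef_.ravel() + clf.intercept_    # linear score z = wX + b
p = 1 / (1 + np.exp(-z))                      # sigmoid -> P(class = 1)
print(np.allclose(p, clf.predict_proba(X)[:, 1]))   # True
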
C. Linear Regression

• Predicts a continuous output based on input features.
• Used in sales prediction, price estimation, and trend analysis.
• Equation: Y = wX + b, where Y is the dependent variable, X is the independent variable, w is the coefficient, and b is the bias.
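
A small sketch of fitting Y = wX + b by ordinary least squares with NumPy; the synthetic data assumes a true w of 2.0 and b of 1.0.

# Least-squares estimate of the coefficient w and bias b.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=50)
Y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=50)   # noisy line

w, b = np.polyfit(X, Y, deg=1)          # degree-1 fit returns slope, intercept
print(f"w = {w:.2f}, b = {b:.2f}")      # close to the true 2.0 and 1.0
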
Distance-Based Algorithms in Data Mining

Distance-based algorithms use mathematical distance metrics to measure similarity between data points.

These methods are widely used in classification, clustering, and anomaly detection.
1. Common Distance-Based Algorithms

A. k-Nearest Neighbors (k-NN)

• A lazy learning algorithm that classifies data based on the k closest neighbors.
• Uses distance metrics like Euclidean, Manhattan, and Minkowski.
• Works well for pattern recognition, recommendation systems, and medical diagnosis.

Distance Formula (Euclidean Distance): d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)
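
A minimal k-NN sketch using this Euclidean metric (also scikit-learn's default); k = 3 and the iris dataset are illustrative choices.

# Lazy learning: fit() only stores the data; distances are computed at predict time.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)
print(knn.predict(X[:3]))   # majority vote among the 3 nearest neighbors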


B. K-Means Clustering
• An unsupervised learning algorithm that partitions data into K clusters.
• Assigns data points to the nearest centroid and updates centroids iteratively.
• Used in customer segmentation, image compression, and anomaly detection.

Steps:
1. Choose K cluster centroids.
2. Assign each data point to the nearest centroid.
3. Update centroids based on the assigned points.
4. Repeat until the centroids stabilize.
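
A compact sketch of this loop via scikit-learn's KMeans; the three Gaussian blobs and K = 3 are assumptions.

# Assign points to the nearest centroid, update centroids, repeat until stable.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)       # runs the assign/update loop internally
print(km.cluster_centers_)       # final (stabilized) centroids
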
C. Hierarchical Clustering
• Builds a tree-like dendrogram to show relationships between data points.
• Two types:
  • Agglomerative (Bottom-Up) – Merges smaller clusters into larger ones.
  • Divisive (Top-Down) – Splits large clusters into smaller ones.
• Uses distance metrics like Euclidean, Manhattan, and Cosine similarity.
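
A hedged sketch of the agglomerative (bottom-up) variant with SciPy; the toy points and Ward linkage are assumed choices.

# linkage() records the merge history that a dendrogram would draw.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8], [9, 1]])
Z = linkage(X, method="ward", metric="euclidean")   # bottom-up merges
print(fcluster(Z, t=3, criterion="maxclust"))       # cut the tree into 3 clusters
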
D. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

• Groups points based on density rather than a predefined number of clusters.
• Identifies outliers as noise.
• Works well for spatial data, fraud detection, and anomaly detection.

Key Parameters:
• Epsilon (ε): Defines the neighborhood radius.
• MinPts: Minimum number of points required to form a dense cluster.
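
A minimal DBSCAN sketch; the eps and min_samples (MinPts) values below are assumptions that would normally be tuned to the data.

# Density-based clustering: no K required, outliers get the label -1.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(loc=0, scale=0.3, size=(60, 2))   # one dense blob
noise = rng.uniform(-4, 4, size=(10, 2))             # scattered outliers
X = np.vstack([dense, noise])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)           # ε neighborhood, MinPts
print(set(db.labels_))          # cluster ids; -1 marks points treated as noise
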
Decision-Based Algorithms in Data Mining

Decision-based algorithms are a category of supervised learning techniques that use logical structures, such as trees and rule-based systems, to make decisions.

These algorithms are widely used in classification and regression tasks.
A. Decision Tree Algorithm
• A tree-like structure where each internal node represents a decision based on a feature.
• Uses splitting criteria like the Gini Index, Entropy (Information Gain), and Chi-Square.
• Can be used for both classification and regression.

Types:

1. ID3 (Iterative Dichotomiser 3) – Uses Entropy & Information Gain.
2. C4.5 – An improvement on ID3; handles missing values and continuous data.
3. CART (Classification and Regression Trees) – Uses the Gini Index.
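
A short sketch using scikit-learn's DecisionTreeClassifier (a CART-style implementation) to compare the Gini and entropy splitting criteria; the iris dataset and depth limit are assumptions.

# Same tree learner, two splitting criteria (CART-style Gini vs. ID3-style entropy).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    tree.fit(X, y)
    print(criterion, "training accuracy:", tree.score(X, y))
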
Random Forest Algorithm

• An ensemble of multiple decision trees that reduces overfitting.
• Uses Bootstrap Aggregating (Bagging) to improve prediction accuracy.

Steps:
• Create multiple decision trees from random subsets of the data.
• Aggregate the results (majority vote for classification, averaging for regression).
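
A minimal sketch of bagging with scikit-learn's RandomForestClassifier; the 100-tree ensemble and the iris data are assumed for illustration.

# Bagging: each tree trains on a bootstrap sample; predictions are a majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(
    n_estimators=100,    # number of bootstrapped decision trees
    bootstrap=True,      # each tree sees a random subset of the data
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:3]))   # majority vote across the 100 trees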
