
Classification in Data Mining

• Classification is a supervised machine learning technique used in data mining to assign categories or labels to data points based on their features.
• It is widely used in applications such as spam detection, fraud detection, sentiment analysis, and medical diagnosis.
Steps in Classification

1. Data Collection – Gather the dataset with labeled examples.
2. Preprocessing – Clean and normalize the data; handle missing values.
3. Feature Selection – Identify the most relevant attributes for classification.
4. Model Training – Use a classification algorithm to learn from the training data.
5. Model Evaluation – Test the model on unseen data using metrics like accuracy, precision, recall, and F1-score.
6. Prediction & Deployment – Apply the trained model to classify new data.
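
To make these steps concrete, here is a minimal sketch of the workflow in Python with scikit-learn; the iris dataset, the 70/30 split, and the decision tree model are illustrative assumptions rather than part of the original slides.

# Minimal classification workflow (assumed dataset and model choice).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                  # 1. data collection (labeled examples)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)          # hold out unseen data for evaluation

scaler = StandardScaler()                          # 2. preprocessing (normalization)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = DecisionTreeClassifier(random_state=42)    # 4. model training
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                     # 5. evaluation on unseen data
print("Accuracy:", accuracy_score(y_test, y_pred))
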
Common Classification Algorithms

• Decision Tree
• Naïve Bayes
• k-Nearest Neighbors (k-NN)
• Support Vector Machine (SVM)
• Neural Networks (Deep Learning)
• Random Forest
• Logistic Regression
A. Decision Tree

• Uses a tree-like structure to make decisions based on feature values.
• Examples: ID3, C4.5, CART.

B. Naïve Bayes

• Based on Bayes' theorem, assuming independence between features.
• Suitable for text classification (e.g., spam filtering).
C. k-Nearest Neighbors (k-NN)

• Classifies data based on the majority class of the k nearest neighbors.
• Works well with smaller datasets.

D. Support Vector Machine (SVM)

• Uses a hyperplane to separate classes with the maximum margin.
• Effective for high-dimensional data.
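
As a hedged illustration, the sketch below fits a linear SVM with scikit-learn; the synthetic dataset and the C value are assumptions made for demonstration.

# Maximum-margin classification with a linear SVM (assumed data).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = SVC(kernel="linear", C=1.0)   # separating hyperplane with maximum margin
clf.fit(X, y)
print(clf.predict(X[:5]))           # class labels for the first five points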


E. Neural Networks (Deep Learning)
• Mimics the human brain with interconnected layers of neurons.
• Used in complex tasks like image recognition and NLP.
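
As a rough illustration, a small feed-forward network can be trained with scikit-learn's MLPClassifier; the digits dataset and the layer sizes below are assumed choices, not the slides' prescription.

# A small neural network: two hidden layers of interconnected neurons.
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
net.fit(X, y)                                   # backpropagation adjusts the weights
print("Training accuracy:", net.score(X, y))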

F. Random Forest
• An ensemble of multiple decision trees for improved accuracy.
• Reduces overfitting compared to a single decision tree.
G. Logistic Regression

• A statistical model that estimates the probability of a class.
• Often used for binary classification problems.
Types of Classification

Binary Classification – Two class labels (e.g., spam vs. not spam).

Multiclass Classification – More than two class labels (e.g., types of diseases).

Multi-Label Classification – A single instance can belong to multiple categories (e.g., tagging images with multiple objects).
Evaluation Metrics

Accuracy = (Correct Predictions) / (Total Predictions)

Precision = TP / (TP + FP) – Measures how many predicted positives are actually positive.

Recall (Sensitivity) = TP / (TP + FN) – Measures how many actual positives were correctly predicted.

F1-Score = 2 × (Precision × Recall) / (Precision + Recall) – Harmonic mean of precision and recall.
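
A short worked example of these metrics on an assumed set of true and predicted labels (1 = positive, 0 = negative):

# With TP = 3, FP = 1, FN = 1, TN = 3, all four metrics come out to 0.75.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # correct / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of the two
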
Applications of Classification

• Healthcare – Disease prediction and diagnosis.
• Finance – Credit scoring, fraud detection.
• E-commerce – Customer segmentation and recommendation systems.
• Social Media – Sentiment analysis and content moderation.
Statistical-Based Algorithms in Data Mining

• Statistical-based algorithms use mathematical models and probability distributions to identify patterns, relationships, and trends in data.
• These methods are widely used in classification, clustering, regression, and anomaly detection.
1. Common Statistical-Based Algorithms
A. Naïve Bayes Classifier

• Based on Bayes' Theorem and assumes independence between features.
• Used for spam detection, sentiment analysis, and medical diagnosis.
• Types:
  • Gaussian Naïve Bayes – Assumes a normal distribution (e.g., continuous data).
  • Multinomial Naïve Bayes – Suitable for text data (e.g., word counts).
  • Bernoulli Naïve Bayes – Deals with binary features.
Formula: P(C|X) = (P(X|C) × P(C)) / P(X), where C is a class and X is the observed feature vector.
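
A minimal sketch of Multinomial Naïve Bayes applied to spam filtering; the tiny corpus and its labels are made-up examples.

# Text classification with word counts and Bayes' theorem (assumed corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win money now", "meeting at noon", "free money offer", "lunch tomorrow"]
labels = [1, 0, 1, 0]                         # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(docs)                   # word counts per document
clf = MultinomialNB().fit(X, labels)          # estimates P(word | class) per class
print(clf.predict(vec.transform(["free money"])))   # expected: [1] (spam)
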
B. Logistic Regression

• Used for binary classification (e.g., spam vs. not spam).
• Uses the sigmoid function to model probability values.
• Can be extended to Multinomial Logistic Regression for multiple classes.

Sigmoid Function: σ(z) = 1 / (1 + e^(−z)), which maps any real-valued score z to a probability between 0 and 1.
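
A brief sketch showing that logistic regression passes a linear score through this sigmoid; the synthetic data and labels below are assumptions.

# Recovering predict_proba by applying the sigmoid to the linear score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # assumed linearly separable labels

clf = LogisticRegression().fit(X, y)
z = X @ clf.coef_.ravel() + clf.intercept_    # linear score z = wX + b
p = 1 / (1 + np.exp(-z))                      # sigmoid -> P(class = 1)
print(np.allclose(p, clf.predict_proba(X)[:, 1]))   # True
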
C. Linear Regression

• Predicts a continuous output based on input features.
• Used in sales prediction, price estimation, and trend analysis.
• Equation: Y = wX + b, where Y is the dependent variable, X is the independent variable, w is the coefficient, and b is the bias.
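
A small sketch of fitting Y = wX + b by ordinary least squares with NumPy; the synthetic data assumes a true w of 2.0 and b of 1.0.

# Least-squares estimate of the coefficient w and bias b.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=50)
Y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=50)   # noisy line

w, b = np.polyfit(X, Y, deg=1)          # degree-1 fit returns slope, intercept
print(f"w = {w:.2f}, b = {b:.2f}")      # close to the true 2.0 and 1.0
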
Distance-Based Algorithms in Data Mining

Distance-based algorithms use mathematical distance metrics to measure similarity between data points.

These methods are widely used in classification, clustering, and anomaly detection.
1. Common Distance-Based Algorithms

A. k-Nearest Neighbors (k-NN)

• A lazy learning algorithm that classifies data based on the k closest neighbors.
• Uses distance metrics like Euclidean, Manhattan, and Minkowski.
• Works well for pattern recognition, recommendation systems, and medical diagnosis.

Distance Formula (Euclidean Distance): d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)
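
A minimal k-NN sketch using this Euclidean metric (also scikit-learn's default); k = 3 and the iris dataset are illustrative choices.

# Lazy learning: fit() only stores the data; distances are computed at predict time.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)
print(knn.predict(X[:3]))   # majority vote among the 3 nearest neighbors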


B. K-Means Clustering
• An unsupervised learning algorithm that partitions data into K clusters.
• Assigns data points to the nearest centroid and updates centroids iteratively.
• Used in customer segmentation, image compression, and anomaly detection.

Steps:
1. Choose K cluster centroids.
2. Assign each data point to the nearest centroid.
3. Update centroids based on the assigned points.
4. Repeat until the centroids stabilize.
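
A compact sketch of this loop via scikit-learn's KMeans; the three Gaussian blobs and K = 3 are assumptions.

# Assign points to the nearest centroid, update centroids, repeat until stable.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)       # runs the assign/update loop internally
print(km.cluster_centers_)       # final (stabilized) centroids
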
C. Hierarchical Clustering
• Builds a tree-like dendrogram to show relationships between data points.
• Two types:
  • Agglomerative (Bottom-Up) – Merges smaller clusters into larger ones.
  • Divisive (Top-Down) – Splits large clusters into smaller ones.
• Uses distance metrics like Euclidean, Manhattan, and Cosine similarity.
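
A hedged sketch of the agglomerative (bottom-up) variant with SciPy; the toy points and Ward linkage are assumed choices.

# linkage() records the merge history that a dendrogram would draw.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8], [9, 1]])
Z = linkage(X, method="ward", metric="euclidean")   # bottom-up merges
print(fcluster(Z, t=3, criterion="maxclust"))       # cut the tree into 3 clusters
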
D. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

• Groups points based on density rather than a predefined number of clusters.
• Identifies outliers as noise.
• Works well for spatial data, fraud detection, and anomaly detection.

Key Parameters:
• Epsilon (ε): Defines the neighborhood radius.
• MinPts: Minimum number of points required to form a dense cluster.
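
A minimal DBSCAN sketch; the eps and min_samples (MinPts) values below are assumptions that would normally be tuned to the data.

# Density-based clustering: no K required, outliers get the label -1.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(loc=0, scale=0.3, size=(60, 2))   # one dense blob
noise = rng.uniform(-4, 4, size=(10, 2))             # scattered outliers
X = np.vstack([dense, noise])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)           # ε neighborhood, MinPts
print(set(db.labels_))          # cluster ids; -1 marks points treated as noise
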
Decision-Based Algorithms in Data Mining

Decision-based algorithms are a category of supervised learning techniques that use logical structures, such as trees and rule-based systems, to make decisions.

These algorithms are widely used in classification and regression tasks.
A. Decision Tree Algorithm
• A tree-like structure where each internal node represents a decision based on a feature.
• Uses splitting criteria like the Gini Index, Entropy (Information Gain), and Chi-Square.
• Can be used for both classification and regression.

Types:

1. ID3 (Iterative Dichotomiser 3) – Uses Entropy & Information Gain.
2. C4.5 – An improvement on ID3; handles missing values and continuous data.
3. CART (Classification and Regression Trees) – Uses the Gini Index.
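
A short sketch using scikit-learn's DecisionTreeClassifier (a CART-style implementation) to compare the Gini and entropy splitting criteria; the iris dataset and depth limit are assumptions.

# Same tree learner, two splitting criteria (CART-style Gini vs. ID3-style entropy).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    tree.fit(X, y)
    print(criterion, "training accuracy:", tree.score(X, y))
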
Random Forest Algorithm

• An ensemble of multiple decision trees that reduces overfitting.
• Uses Bootstrap Aggregating (Bagging) to improve prediction accuracy.

Steps:
• Create multiple decision trees from random subsets of the data.
• Aggregate the results (majority vote for classification, averaging for regression).
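
A minimal sketch of bagging with scikit-learn's RandomForestClassifier; the 100-tree ensemble and the iris data are assumed for illustration.

# Bagging: each tree trains on a bootstrap sample; predictions are a majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(
    n_estimators=100,    # number of bootstrapped decision trees
    bootstrap=True,      # each tree sees a random subset of the data
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:3]))   # majority vote across the 100 trees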
