Classification Notes
Classification
Classification is a technique in data mining that involves categorizing data objects into predefined classes, categories, or groups based on their features or attributes. In other words, data classification is a process in which data is organized or labeled into predefined categories or classes based on its characteristics. The goal of classification is to assign new, unseen data instances to the correct predefined categories. This task is fundamental in supervised machine learning and data mining.
Data classification is a two-step process, consisting of a learning step (where a classification model is
constructed) and a classification step (where the model is used to predict class labels for given data).
In the first step (learning), we build a classification model from previous data (training or sample data).
In the second step (classification), we determine whether the model's accuracy is acceptable and, if so, use the model to classify new data.
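As a rough illustration of the two steps, here is a minimal scikit-learn sketch (assuming that library is available; the dataset and the 0.8 acceptance threshold are illustrative choices, not part of these notes):

```python
# Sketch of the two-step classification process with scikit-learn;
# dataset and acceptance threshold are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1 (learning): construct a classification model from training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2 (classification): check that accuracy is acceptable, then use the
# model to assign class labels to new, unseen tuples.
accuracy = accuracy_score(y_test, model.predict(X_test))
if accuracy >= 0.8:  # illustrative acceptance threshold
    print("predicted labels:", model.predict(X_test[:5]))
```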
Applications:
Data classification has numerous applications across various domains due to its ability to automatically
categorize data into predefined classes. Here are some key applications:
1. Spam Detection: Identifying and filtering out spam emails from legitimate emails. Example: Email
services like Gmail and Outlook use classification algorithms to move spam emails to a spam folder.
2. Medical Diagnosis: Predicting diseases or health conditions based on patient data. Example:
Classifying patients as diabetic or non-diabetic based on their medical history and test results.
3. Credit Scoring: Assessing the creditworthiness of individuals or businesses. Example: Banks and
financial institutions classify loan applicants into risk categories (e.g., low risk, high risk) based on their
financial history and credit score.
4. Image Recognition: Identifying objects, people, or scenes in images. Example: Facial recognition
systems classify images of faces to identify individuals for security purposes.
5. Sentiment Analysis: Determining the sentiment expressed in text data (e.g., positive, negative,
neutral). Example: Companies analyze customer reviews to classify them as positive or negative
feedback.
6. Fraud Detection: Identifying fraudulent activities in transactions. Example: Credit card companies classify transactions as fraudulent or legitimate based on transaction patterns.
7. Voice Recognition: Identifying spoken words or phrases. Example: Virtual assistants like Siri and Alexa classify audio input to recognize commands and provide appropriate responses.
8. Behavioral Targeting: Delivering personalized advertisements based on user behavior. Example: Online advertising platforms classify users based on their browsing history to show relevant ads.
Prediction
Prediction, in the context of data science and machine learning, refers to the process of using a trained
model to make forecasts or estimations about future or unseen data. The goal of prediction is to generate
accurate and meaningful insights by using patterns learned from historical or labeled data.
Classification and Regression are the two major types of prediction problems, where classification is used
to predict discrete or nominal values, while regression is used to predict continuous or ordered values.
Data Preparation for Classification
1. Data Cleaning: Data cleaning involves identifying and correcting (or removing) inaccuracies and inconsistencies in the data to improve its quality.
Missing Values: Missing data points can lead to biased or incorrect models.
Noisy Data: Data with errors, outliers, or irrelevant information can distort model predictions.
Duplicate Records: Multiple entries of the same data can skew results.
2. Relevance: Relevance involves ensuring that the features (variables) used in the classification task are
important and useful for making accurate predictions.
Irrelevant Features: Features that do not contribute to the prediction can add noise and reduce model
performance.
Redundant Features: Highly correlated features can provide duplicate information, making the model
more complex than necessary.
3. Data Transformation: Data transformation involves converting data into a suitable format or structure
for analysis, which can include scaling, encoding, or creating new features.
Scaling Issues: Features with different scales can disproportionately affect the model.
Categorical Data: Many algorithms require numerical input, so categorical data must be transformed
appropriately.
Non-linear Relationships: Some relationships between features and the target variable may be non-linear
and need transformation.
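A small pandas/scikit-learn sketch of these preparation steps (assuming both libraries; the column names and values are hypothetical):

```python
# Illustrative data cleaning and transformation; all data is hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":    [25, 32, None, 40, 32],
    "income": [30000, 52000, 47000, None, 52000],
    "city":   ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
    "label":  [0, 1, 0, 1, 1],
})

# Data cleaning: drop duplicate records, fill missing values with medians.
df = df.drop_duplicates()
df[["age", "income"]] = df[["age", "income"]].fillna(df[["age", "income"]].median())

# Data transformation: one-hot encode categorical data, scale numeric features.
df = pd.get_dummies(df, columns=["city"])
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])
print(df)
```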
Algorithms
1. Decision Tree Induction
ID3 (Iterative Dichotomiser 3)
C4.5
CART (Classification and Regression Trees)
Random Forest (an ensemble of decision trees)
2. Bayes Classification Methods
Bayes' Theorem
Naive Bayesian Classification
3. Rule-Based Classification
4. Lazy Learning (learn from your neighbors)
K-Nearest Neighbors (K-NN)
Decision Tree Induction
Decision tree induction is a popular and powerful method for classification and regression tasks in
machine learning. It involves creating a model that predicts the value of a target variable by learning
simple decision rules inferred from the data features.
1. Select the Best Attribute: Choose the feature that best splits the data according to a specific
criterion (such as information gain for ID3 or Gini impurity for CART).
2. Create a Decision Node: Create a node in the tree that represents the selected attribute.
3. Split the Data: Divide the dataset into subsets based on the selected attribute's values.
4. Repeat: Recursively apply the above steps to each subset.
5. Stopping Criteria: The recursion stops when one of the following conditions is met:
o All instances in a subset belong to the same class.
o No further attributes are left to split the data.
o The tree reaches a predefined maximum depth or a minimum number of instances per
node.
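To make step 1 concrete, here is a pure-Python sketch of information gain, the attribute-selection criterion ID3 uses (the four toy tuples are hypothetical):

```python
# Entropy and information gain for attribute selection (ID3-style).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Expected reduction in entropy from splitting on one attribute."""
    total = entropy(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return total - weighted

# Hypothetical tuples: (outlook, temperature) -> play?
rows   = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # outlook: gain = 1.0 (best split)
print(information_gain(rows, labels, 1))  # temperature: gain = 0.0
```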
Advantages:
Interpretability: Trees are easy to understand, and the decision rules can be visualized.
Versatility: They handle both numerical and categorical data.
Minimal Preparation: They require relatively little data preparation (e.g., no feature scaling).
Disadvantages:
Overfitting: Decision trees can easily overfit the training data, especially if they are deep.
Instability: Small changes in the data can lead to significant changes in the tree structure.
Bias towards Features with More Levels: Features with more levels can dominate splits,
leading to biased models.
C4.5
Uses gain ratio (an extension of information gain) to handle attributes with many values.
Handles both categorical and continuous attributes.
Prunes the tree after creation to remove branches that do not provide additional power in classification.
Can handle missing values in the data.
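A pure-Python sketch of C4.5's gain ratio, which divides information gain by split information to penalize attributes with many distinct values (the toy data is hypothetical, and the entropy helper matches the ID3 sketch above):

```python
# C4.5's gain ratio = information gain / split information.
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

attr   = ["sunny", "sunny", "rain", "rain"]   # attribute value per tuple
labels = ["no", "no", "yes", "yes"]           # class label per tuple

gain = entropy(labels) - sum(
    attr.count(v) / len(attr)
    * entropy([l for a, l in zip(attr, labels) if a == v])
    for v in set(attr)
)
split_info = entropy(attr)   # high for many-valued attributes
print(gain / split_info)     # gain ratio = 1.0 / 1.0 = 1.0 here
```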
CART (Classification and Regression Trees)
Uses Gini impurity or entropy to split the data for classification tasks.
Uses variance reduction to split the data for regression tasks.
Constructs binary trees, where each node has exactly two children.
Prunes the tree using cost-complexity pruning to avoid overfitting.
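A pure-Python sketch of evaluating a candidate binary split by weighted Gini impurity, as CART does (the class counts are hypothetical):

```python
# Gini impurity for a candidate binary split (CART-style); counts are toy values.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

left, right = [8, 2], [1, 9]          # class counts in each child node
n = sum(left) + sum(right)
weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
print(gini([9, 11]), weighted)        # parent impurity 0.495 vs. 0.25 after split
```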
Random Forest
Uses bootstrap aggregating (bagging) to create multiple subsets of the training data.
Each tree is trained on a different subset, and a random subset of features is used to split each
node.
Reduces overfitting by averaging the results of many trees.
Provides a measure of feature importance based on how much each feature improves the split
criterion.
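A brief scikit-learn sketch of a random forest (assuming that library; n_estimators=100 and max_features="sqrt" are illustrative settings):

```python
# Random forest sketch; bagging and per-split feature sampling
# happen inside the estimator.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)

# Feature importance: how much each feature improves the split criterion.
print(forest.feature_importances_)
```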
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities such as the
probability that a given tuple belongs to a particular class.
Bayesian classification is based on Bayes’ theorem, described next. Studies comparing classification
algorithms have found a simple Bayesian classifier known as the naive Bayesian classifier to be
comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers
have also exhibited high accuracy and speed when applied to large databases. Naive Bayesian classifiers
assume that the effect of an attribute value on a given class is independent of the values of the other
attributes. This assumption is called class conditional independence. It is made to simplify the
computations involved and, in this sense, is considered “naive”.
Bayes Theorem
Bayes' theorem is one of the most important concepts in machine learning: it lets us compute the probability of one event, under uncertain knowledge, given that a related event has already been observed.
Bayes' theorem can be derived from the product rule of probability applied to two events X and Y.
According to the product rule, the joint probability of X and Y can be written in two ways:
P(X and Y) = P(X|Y) P(Y)    and    P(X and Y) = P(Y|X) P(X)
Equating the two right-hand sides and solving for P(X|Y) gives:
P(X|Y) = P(Y|X) P(X) / P(Y)
This equation is called Bayes' rule or Bayes' theorem. Here P(X|Y) is the posterior probability of X given Y, P(Y|X) is the likelihood, P(X) is the prior probability, and P(Y) is the evidence.
Hence, Bayes Theorem can be written as: posterior = likelihood * prior / evidence
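A worked numeric check of the rule in plain Python, using hypothetical probabilities for a spam-filtering scenario:

```python
# Worked Bayes' rule example; all probabilities are hypothetical.
p_spam = 0.2             # prior: P(spam)
p_word_given_spam = 0.6  # likelihood: P("offer" appears | spam)
p_word = 0.25            # evidence: P("offer" appears)

# posterior = likelihood * prior / evidence
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # 0.6 * 0.2 / 0.25 = 0.48
```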
Naive Bayesian Classification
A naive Bayesian classifier assigns a tuple X = (x1, ..., xn) to the class Ci that maximizes the posterior probability P(Ci|X). Under class conditional independence, P(X|Ci) factors into the product of the individual attribute probabilities, so the classifier simply picks the class that maximizes P(Ci) * P(x1|Ci) * ... * P(xn|Ci).
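A minimal pure-Python sketch of this rule on a hypothetical two-attribute weather-style dataset (the tuples and labels are illustrative):

```python
# Minimal categorical naive Bayes; all data is hypothetical.
rows   = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]

def naive_bayes_predict(x):
    scores = {}
    for c in set(labels):
        c_rows = [r for r, l in zip(rows, labels) if l == c]
        score = len(c_rows) / len(rows)        # prior P(Ci)
        for i, value in enumerate(x):          # times each P(xk | Ci)
            score *= sum(r[i] == value for r in c_rows) / len(c_rows)
        scores[c] = score
    return max(scores, key=scores.get)         # class with highest score

print(naive_bayes_predict(("rain", "hot")))    # -> "yes"
```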
Lazy Learning (Learn from your neighbors)
K-Nearest Neighbors (K-NN)
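K-nearest neighbors is a lazy learner: it simply stores the training tuples and defers all computation until a new tuple arrives, which is then assigned the majority class among its k closest training tuples. A minimal pure-Python sketch (the 2-D toy points and k = 3 are hypothetical choices):

```python
# Minimal k-NN classifier; the toy points and k=3 are illustrative.
from collections import Counter
from math import dist

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((4.5, 3.9), "B"), ((4.1, 4.0), "B")]

def knn_predict(x, k=3):
    # Sort training points by Euclidean distance to x, vote among the k nearest.
    nearest = sorted(train, key=lambda p: dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((4.0, 4.0)))  # -> "B"
```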
A classification matrix, also known as a confusion matrix, is a table used to evaluate the performance of a
classification algorithm. It compares the actual labels with the predicted labels generated by the model.
A confusion matrix for a binary classification problem typically looks like this:

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)

From these four counts, several evaluation metrics are defined:
1. Accuracy: The proportion of correctly classified instances (both true positives and true negatives) among the total instances. Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Precision (Positive Predictive Value): The proportion of true positive predictions among all positive predictions. Precision = TP / (TP + FP)
3. Recall (Sensitivity, True Positive Rate): The proportion of true positive predictions among all actual positive instances. Recall = TP / (TP + FN)
4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics. F1 = 2 * Precision * Recall / (Precision + Recall)
5. Specificity (True Negative Rate): The proportion of true negative predictions among all actual negative instances. Specificity = TN / (TN + FP)
6. False Positive Rate (FPR): The proportion of false positive predictions among all actual negative instances. FPR = FP / (FP + TN)
7. False Negative Rate (FNR): The proportion of false negative predictions among all actual positive instances. FNR = FN / (FN + TP)
Example
Consider a binary classification problem where a model predicts whether an email is spam (positive class) or not spam (negative class).
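A quick sketch computing the metrics above for this spam filter from a hypothetical confusion matrix (the counts are illustrative, not taken from real data):

```python
# Metrics from a hypothetical spam-filter confusion matrix.
TP, FN = 40, 10   # spam caught / spam missed
FP, TN = 5, 45    # ham wrongly flagged / ham passed through

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.85
precision   = TP / (TP + FP)                    # ~0.889
recall      = TP / (TP + FN)                    # 0.80
f1          = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)                    # 0.90
print(accuracy, precision, recall, f1, specificity)
```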
Difference between Classification and Prediction

Classification: Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known.
Prediction: Prediction is the process of identifying the missing or unavailable numerical data for a new observation.

Classification: The accuracy depends on finding the class label correctly.
Prediction: The accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data.

Classification: The model can be known as the classifier.
Prediction: The model can be known as the predictor.

Classification: A model or classifier is constructed to find the categorical labels.
Prediction: A model or predictor is constructed that predicts a continuous-valued function or ordered value.

Classification: For example, grouping patients based on their medical records can be considered classification.
Prediction: For example, predicting the correct treatment for a particular disease for a person can be considered prediction.