Unit 3 Classification
Classification
Basic Concepts
Classification in data mining is a technique used to assign a class label to each instance, record, or data object in a dataset based on its features or attributes.
1. Data Collection:
The first step is to collect the data relevant to the classification problem.
2. Data Preprocessing:
The collected data needs to be pre-processed to ensure its quality. This involves handling missing
values, dealing with outliers, and transforming the data into a format suitable for analysis.
3. Feature Selection:
Feature selection involves identifying the most relevant attributes in the dataset for classification. This
can be done using various techniques, such as correlation analysis, information gain, and principal
component analysis.
4. Principal Component Analysis:
Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of the dataset.
PCA identifies the most important features in the dataset and removes the redundant ones.
5. Model Selection:
Model selection involves selecting the appropriate classification algorithm for the problem at hand.
There are several algorithms available, such as decision trees, support vector machines, and neural
networks.
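To make steps 2–5 concrete, here is a minimal end-to-end sketch with scikit-learn. The Iris dataset, the value of k, and n_components are arbitrary illustrative choices, not taken from the text.

```python
# Illustrative sketch of the classification workflow with scikit-learn
# (dataset, k, and n_components are arbitrary choices for demonstration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 2: preprocessing -- scale features to zero mean, unit variance
X_scaled = StandardScaler().fit_transform(X)

# Step 3: feature selection -- keep the 3 features with the highest information gain
X_selected = SelectKBest(mutual_info_classif, k=3).fit_transform(X_scaled, y)

# Step 4: PCA -- reduce the selected features to 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X_selected)

# Step 5: model selection -- here, a decision tree classifier
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```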
Real-Life Examples
Email spam classification
Image classification
Medical diagnosis
Credit risk analysis
Sentiment analysis
Customer segmentation
Fraud detection
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified any further; the final node is called a leaf node.
In the diagram below, the tree first asks: what is the weather? Is it sunny, cloudy, or rainy? Depending on the answer, it moves on to the next feature, such as humidity or wind. It then checks whether the wind is strong or weak; if the wind is weak and the weather is rainy, the person may go and play.
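As an illustration, this weather logic can be reproduced with a library decision tree. The sketch below assumes scikit-learn is available; the toy dataset and its numeric encoding are invented to mirror the example above, not taken from the text.

```python
# A minimal decision-tree sketch on a toy weather dataset
# (data and encoding are illustrative only).
from sklearn.tree import DecisionTreeClassifier

# Features: weather (0 = sunny, 1 = cloudy, 2 = rainy), wind (0 = weak, 1 = strong)
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [1, 1, 1, 1, 1, 0]  # 1 = play, 0 = don't play

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Rainy weather with a weak wind -> the person may go and play
print(tree.predict([[2, 0]]))  # -> [1]
```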
Attribute Selection Measures:
While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. With this measure, we can easily select the best attribute for the nodes of the tree. The main measures are:
1. Information Gain
2. Gain Ratio
3. Gini Index
1. Information Gain:
Information gain is used to decide which features/attributes provide the most information about a class. It is based on entropy: the aim is to reduce the level of entropy, starting from the root node down to the leaf nodes. For a split of dataset S on attribute A:
Gain(S, A) = Entropy(S) − Σ (|Sv| / |S|) × Entropy(Sv), summed over each value v of A.
2. Gain Ratio:
The information gain measure is biased toward tests with many outcomes: it tends to select attributes with a large number of values. For instance, consider an attribute that acts as a unique identifier, such as a product ID; splitting on it produces many pure partitions, yet it is useless for prediction. The gain ratio corrects this bias by normalizing information gain with the split information of the attribute:
GainRatio(A) = Gain(A) / SplitInfo(A)
3. Gini Index:
The Gini index is used in CART. It measures the impurity of D, a data partition or set of training tuples, as:
Gini(D) = 1 − Σ pi², where pi is the probability that a tuple in D belongs to class Ci.
Entropy: Entropy is a metric that measures the impurity of a given attribute; it specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = −Σ pi log2(pi), where pi is the proportion of tuples in S that belong to class i.
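A small pure-Python sketch of these three measures. The example split at the end is a hypothetical 14-day play/no-play dataset, used only for illustration.

```python
# Illustrative implementations of entropy, Gini index, and information gain.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, subsets):
    """Entropy of the parent minus the weighted entropy of its subsets."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)

# Hypothetical example: 14 days split on an attribute into three subsets
parent = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(entropy(parent), gini(parent), information_gain(parent, split))
```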
Tree Pruning
A technique to reduce the size of a decision tree by removing certain branches or nodes without significantly affecting the model's accuracy.
• One common approach evaluates the effect of removing each node on a validation set and prunes the nodes that do not improve the model's performance.
Advantages of Decision Tree:
• It is simple to understand, as it follows the same process a human follows while making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes of a problem.
• It requires less data cleaning compared to other algorithms.
Bayes' theorem is also known as Bayes' rule or Bayes' law. It is used to determine the probability of a hypothesis with prior knowledge, and it depends on conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) × P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Naïve Bayes Classifier Algorithm
• The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
• It is mainly used in text classification that involves a high-dimensional training dataset.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Some popular examples of the Naïve Bayes algorithm are spam filtration, sentiment analysis, and classifying articles.
Working of the Naïve Bayes classifier can be understood with the help of the example below.
Problem: If the weather is sunny, should the player play or not?
Frequency table for the Weather attribute:

Weather    No           Yes           P(Weather)
Overcast   0            5             5/14 = 0.35
Rainy      2            2             4/14 = 0.29
Sunny      2            3             5/14 = 0.35
All        4/14 = 0.29  10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Yes) = 0.71
P(Sunny) = 0.35
So P(Yes|Sunny) = 0.30 × 0.71 / 0.35 ≈ 0.60
P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 × 0.29 / 0.35 ≈ 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
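The same calculation can be scripted directly from the counts in the frequency table (a minimal sketch; no library needed):

```python
# Naive Bayes calculation for the sunny-weather example above.
p_sunny = 5 / 14                  # P(Sunny)
p_yes, p_no = 10 / 14, 4 / 14     # P(Yes), P(No)
p_sunny_given_yes = 3 / 10        # P(Sunny|Yes)
p_sunny_given_no = 2 / 4          # P(Sunny|No)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny  # ~0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny     # ~0.41

print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")  # Play
```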
There are three types of Naïve Bayes model, which are given below:
• Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, the model assumes these values are sampled from the Gaussian distribution.
• Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding which category a particular document belongs to, such as sports, politics, or education. The classifier uses the frequency of words as the predictors.
• Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also popular for document classification tasks.
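A brief sketch of the three variants in scikit-learn. The toy arrays below are invented only to show the kind of input each variant expects.

```python
# The three Naive Bayes variants in scikit-learn (toy data for illustration).
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])  # two classes

# Gaussian: continuous features, assumed normally distributed per class
GaussianNB().fit(np.array([[1.0], [1.2], [3.1], [2.9]]), y)

# Multinomial: word counts per document
MultinomialNB().fit(np.array([[3, 0], [2, 1], [0, 4], [1, 3]]), y)

# Bernoulli: binary word presence/absence per document
BernoulliNB().fit(np.array([[1, 0], [1, 0], [0, 1], [0, 1]]), y)
```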
Disadvantages of Naïve Bayes Classifier:
• Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
Rule-Based Classifier
Rule-based classifiers are another type of classifier that makes the class decision using a set of "if...then" rules.
These rules are easily interpretable and thus these classifiers are generally used to generate
descriptive models.
The condition used with “if” is called the antecedent.
The predicted class of each rule is called the consequent.
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
Example: in the weather tree above, the path from the root to a leaf yields a rule such as:
IF weather = rainy AND wind = weak THEN play = yes
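In practice, such rules can be printed from a fitted tree; the sketch below assumes scikit-learn and reuses the invented toy weather data from the earlier decision-tree example.

```python
# Extracting if-then rules from a fitted decision tree (illustrative data).
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: weather (0 = sunny, 2 = rainy), wind (0 = weak, 1 = strong)
X = [[0, 0], [0, 1], [2, 0], [2, 1]]
y = [1, 1, 1, 0]  # 1 = play, 0 = don't play

tree = DecisionTreeClassifier().fit(X, y)
# Each root-to-leaf path prints as a nested if-then rule
print(export_text(tree, feature_names=["weather", "wind"]))
```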
K-Nearest Neighbour:
K-Nearest Neighbour is one of the simplest algorithms based on the supervised learning technique.
The K-NN algorithm can be used for regression as well as for classification, but it is mostly used for classification problems.
The K-NN algorithm is a distance-based algorithm.
K-Nearest Neighbours is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on it.
The image shows how KNN predicts the category of a new data point based on its closest neighbours. The red diamonds represent Category 1 and the blue squares represent Category 2. The new data point checks its closest neighbours (the circled points). Since the majority of its closest neighbours are blue squares (Category 2), KNN predicts that the new data point belongs to Category 2.
KNN works by using proximity and majority voting to make predictions.
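A minimal KNN sketch with scikit-learn; the value of K and the 2-D points are arbitrary illustrative choices.

```python
# K-Nearest Neighbours with majority voting (illustrative data).
from sklearn.neighbors import KNeighborsClassifier

# Category 1 points cluster near (1, 1); Category 2 points near (6, 6)
X = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
y = [1, 1, 1, 2, 2, 2]

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3
knn.fit(X, y)  # "lazy" learner: this just stores the training set

# A new point near the Category 2 cluster is assigned Category 2 by majority vote
print(knn.predict([[6, 5]]))  # -> [2]
```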
Applications:
Handwriting detection application
Image recognition
Video recognition
Advantages of KNN Algorithm:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
The value of K always needs to be determined, which may be complex at times.
In data mining and machine learning, prediction, precision, and recall are key concepts used to evaluate the
performance of classification models.
Confusion Matrix
True Positives (TP): the actual value is positive and the prediction is also positive.
True Negatives (TN): the actual value is negative and the prediction is also negative.
False Positives (FP): the actual value is negative but the prediction is positive.
False Negatives (FN): the actual value is positive but the prediction is negative.
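These four counts can be read off a computed confusion matrix; a sketch with scikit-learn, where the labels are made up for illustration:

```python
# Computing TP/TN/FP/FN from predictions (hypothetical labels).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=4 TN=4 FP=1 FN=1
```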
Prediction
Definition: Prediction refers to the process of using a model to estimate or classify the value of an
unknown outcome based on input features. In the context of classification, it's about assigning a
class label to an instance.
For example, suppose we have a total of 20 cats and dogs, and our model predicts whether each one is a cat or not.
There are measures other than the confusion matrix which can help achieve better understanding and
analysis of our model and its performance.
a. Accuracy
b. Precision
c. Recall
Accuracy:
Accuracy simply measures how often the classifier makes the correct prediction. It is the ratio between the number of correct predictions and the total number of predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision:
Precision is a measure of the correctness achieved in true prediction. In simple words, it tells us how many of the total predicted positives are actually positive:
Precision = TP / (TP + FP)
Recall:
Recall is a measure of how many of the actual positives the model correctly predicted as positive:
Recall = TP / (TP + FN)
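All three measures follow directly from the four confusion-matrix counts; a self-contained sketch, reusing the hypothetical counts TP=4, TN=4, FP=1, FN=1 from the earlier example:

```python
# Accuracy, precision, and recall from confusion-matrix counts
# (hypothetical counts from the earlier confusion-matrix sketch).
tp, tn, fp, fn = 4, 4, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # TP / all predicted positives
recall = tp / (tp + fn)                     # TP / all actual positives
print(accuracy, precision, recall)          # 0.8 0.8 0.8
```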