
Unit IV: Classification

What is Classification?

Classification is the task of identifying the category, or class label, of a new observation. First, a set of data is used as training data: the input records and their corresponding class labels are given to the algorithm, which learns a model from them.
When there are more than two classes to choose from, the task is called multiclass classification.
How does Classification Work?
There are two stages in a data classification system: classifier (model) creation and applying the classifier for classification.
Developing the classifier (model creation): This is the learning stage, or learning process. The classification algorithm constructs the classifier in this stage from a training set composed of database records and their corresponding class labels.
Applying the classifier for classification: The classifier is used for classification at this stage. Test data are used here to estimate the accuracy of the classifier. If the accuracy is deemed acceptable, the classification rules can be applied to new data records. Applications include:
o Sentiment Analysis: Sentiment analysis is very useful in social
media monitoring, where it is used to extract insights from social
media posts. With advanced machine learning algorithms, sentiment
analysis models can even read and analyze misspelled words, and
well-trained models provide consistently accurate results in a
fraction of the time a human would need.
o Document Classification: Document classification organizes
documents into sections according to their content. It is a form of
text classification in which the words of an entire document are
analyzed, and with machine learning classification algorithms it
can be executed automatically.
o Image Classification: Image classification assigns an image to one
of a set of trained categories, such as a caption, a statistical
value, or a theme. By tagging images, you can train your model for
the relevant categories using supervised learning algorithms.
o Machine Learning Classification: It uses statistically demonstrable
algorithmic rules to execute analytical tasks that would take
humans hundreds of hours to perform.
2. Data Classification Process: The data classification process can be
divided into five steps:
o Define the goals, strategy, workflows, and architecture of data
classification.
o Classify the confidential data that is stored.
o Label the data by applying classification marks.
o Use the results to improve security and compliance.
o Recognize that data changes continuously, so classification is an
ongoing process.

A general approach to classification:


Classification is a two-step process involving:
Learning step: The classification model is constructed. In this phase, training data are analyzed by a classification algorithm.
Classification step: The model is used to predict class labels for given data. In this phase, test data are used to estimate the accuracy of the classification rules.

Evaluation of classifiers:
There are various methods commonly used to evaluate the performance of a
classifier, which are as follows −
Holdout Method –
In the holdout method, the initial dataset of labeled instances is partitioned
into two disjoint sets, known as the training set and the test set. A
classification model is induced from the training set, and its performance is
evaluated on the test set.
The accuracy of the classifier is estimated from the accuracy of the induced
model on the test set. The holdout method has several well-known drawbacks.
First, fewer labeled instances are available for training because a portion of
the data is withheld for testing. As a result, the induced model may not be as
good as one trained on all the labeled examples. Second, the model can be
highly dependent on how the data are split into training and test sets: the
smaller the training set, the larger the variance of the model, while if the
training set is too large, the accuracy estimated from the smaller test set is
less reliable and has a wide confidence interval. Finally, the training and
test sets are no longer independent of each other.
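As a minimal sketch of the holdout method (assuming scikit-learn and its bundled Iris dataset are available; the split ratio and model are illustrative choices), the split, training, and accuracy estimate could be written as:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out one third of the labeled records for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Induce a model from the training set only.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Estimate accuracy on the disjoint test set.
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```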
Random Subsampling –
The holdout method can be repeated multiple times to improve the estimate of a
classifier's performance. This approach is called random subsampling.
Let acc_i be the model accuracy during the i-th iteration. The overall accuracy is
given by acc_sub = (acc_1 + acc_2 + ⋯ + acc_k) / k
Random subsampling still encounters some of the issues of the holdout approach,
because it does not use as much of the data as possible for training. It also has
no control over how many times each record is used for testing and training, so
some records may be used for training more often than others.
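A minimal sketch of random subsampling (assuming scikit-learn; k, the split ratio, and the model are illustrative) averages the accuracy over k independent holdout splits:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

k = 5
accuracies = []
for i in range(k):
    # Each iteration is an independent random holdout split.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.33, random_state=i)
    model = DecisionTreeClassifier().fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))

# Overall accuracy is the mean of the k iteration accuracies.
print("acc_sub =", np.mean(accuracies))
```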
Cross-Validation −
An alternative to random subsampling is cross-validation. In this method, each
record is used the same number of times for training and exactly once for testing.
For example, suppose we partition the data into two equal-sized subsets. First, one
subset is used for training and the other for testing; then the roles of the subsets
are swapped, so the earlier training set becomes the test set. This approach is
known as twofold cross-validation.

Basic algorithms of classification:


Decision Tree: The decision tree is one of the most powerful and popular tools for
classification and prediction. A decision tree is a flowchart-like tree structure
in which each internal node denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node (terminal node) holds a
class label.
A decision tree for the concept Play Tennis.

Construction of Decision Tree:


A tree can be “learned” by splitting the source set into subsets based on an
attribute value test. This process is repeated on each derived subset in a
recursive manner called recursive partitioning.
● The recursion terminates when all records in the subset at a node have the
same value of the target variable, or when splitting no longer adds value to
the predictions.
● The construction of a decision tree classifier does not require any domain
knowledge or parameter setting, and is therefore appropriate for
exploratory knowledge discovery.
● Decision trees can handle high-dimensional data. In general, decision
tree classifiers have good accuracy. Decision tree induction is a typical
inductive approach to learning classification knowledge.
Decision Tree Representation:
Decision trees classify instances by sorting them down the tree from the
root to some leaf node, which provides the classification of the instance.
An instance is classified by starting at the root node of the tree, testing the
attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute as shown in the above figure. This
process is then repeated for the subtree rooted at the new node.
For example, the instance

(Outlook = Rain, Temperature = Hot, Humidity = High, Wind = Strong )

would be sorted down the leftmost branch of this decision tree and would
therefore be classified as a negative instance.
In other words, we can say that a decision tree represents a disjunction of
conjunctions of constraints on the attribute values of instances; the tree above
corresponds to the expression:

(Outlook = Sunny ^ Humidity = Normal) v (Outlook = Overcast) v (Outlook = Rain ^ Wind = Weak)
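As an illustrative sketch (assuming scikit-learn and pandas are available; the small PlayTennis-style table and its values are made up for demonstration), such a tree can be learned from one-hot encoded attributes and printed as rules:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# A small, hand-made PlayTennis-style dataset (illustrative values only).
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast", "Sunny", "Rain"],
    "Humidity": ["High", "Normal", "High", "High", "Normal", "Normal", "High", "Normal"],
    "Wind":     ["Weak", "Strong", "Weak", "Strong", "Weak", "Strong", "Weak", "Weak"],
    "Play":     ["No", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes"],
})

# One-hot encode the categorical attributes so the tree can test them.
X = pd.get_dummies(data[["Outlook", "Humidity", "Wind"]])
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```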

Strengths and Weakness of the Decision Tree Approach


The strengths of decision tree methods are:

● Decision trees are able to generate understandable rules.


● Decision trees perform classification without requiring much computation.
● Decision trees are able to handle both continuous and categorical
variables.
● Decision trees provide a clear indication of which fields are most
important for prediction or classification.
The weaknesses of decision tree methods are:

● Decision trees are less appropriate for estimation tasks where the goal is
to predict the value of a continuous attribute.
● Decision trees are prone to errors in classification problems with many
classes and a relatively small number of training examples.
● Decision trees can be computationally expensive to train.

Expressing attribute test conditions


Decision tree induction algorithms must provide a method for specifying an
attribute test condition and its corresponding outcomes for different attribute types.
Binary Attributes − A binary attribute is a nominal attribute with only two
categories or states, 0 or 1, where 0 typically means that the attribute is
absent and 1 means that it is present. Binary attributes are called Boolean if
the two states correspond to true and false.
A binary attribute is symmetric if both of its states are equally valuable and
carry the same weight; there is no preference about which outcome should be
coded as 0 or 1. An example is the attribute gender with the states
male and female.
Nominal Attributes − Nominal means relating to names. The values of a
nominal attribute are symbols or names of things, and each value represents some
kind of category, code, or state. Nominal attributes are also referred to as
categorical. The values do not have any meaningful order. In computer
science, the values are also known as enumerations.
Ordinal Attributes − An ordinal attribute is an attribute whose possible values
have a meaningful order or ranking among them, but the magnitude
between successive values is not known.
Ordinal attributes can produce binary or multiway splits. Ordinal attribute
values can be grouped as long as the grouping does not violate the order
of the attribute values.
Numeric Attributes − A numeric attribute is quantitative: it is a measurable
quantity, represented in integer or real values. It can be interval-scaled or
ratio-scaled.
Measures for Selecting the Best Split
Node splitting, or simply splitting, is the process of dividing a node into
multiple sub-nodes to create relatively pure nodes. There are multiple ways
of doing this, which can be broadly divided into two categories based on the
type of target variable:

1. Continuous Target Variable:

o Reduction in Variance

2. Categorical Target Variable:

o Gini Impurity
o Information Gain
o Chi-Square

Decision Tree Splitting Method #1: Reduction in Variance

Reduction in Variance is a method for splitting a node that is used when the target
variable is continuous, i.e., for regression problems. It is so called because it
uses variance as the measure for deciding which feature a node is split on
into child nodes.

Variance measures the homogeneity of a node: if a node is entirely homogeneous,
its variance is zero.

Here are the steps to split a decision tree using reduction in variance:

1. For each split, individually calculate the variance of each child node
2. Calculate the variance of each split as the weighted average variance of
child nodes
3. Select the split with the lowest variance
4. Perform steps 1-3 until completely homogeneous nodes are achieved
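A minimal sketch of these steps (pure Python with NumPy; the target values and the candidate split below are made up for illustration) might look like this:

```python
import numpy as np

def variance(values):
    # Variance of the target values in a node (zero for a pure node).
    return float(np.var(values)) if len(values) > 0 else 0.0

def split_variance(left, right):
    # Weighted average variance of the two child nodes.
    n = len(left) + len(right)
    return len(left) / n * variance(left) + len(right) / n * variance(right)

# Hypothetical target values reaching a node, and one candidate split.
parent = [10, 12, 11, 30, 32, 31]
left, right = parent[:3], parent[3:]

print("parent variance:", variance(parent))
print("weighted variance after split:", split_variance(left, right))
# The split with the lowest weighted variance would be selected.
```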
Decision Tree Splitting Method #2: Gini Impurity / Gini Index

1. The Gini Index is a metric that measures how often a randomly chosen
element would be incorrectly identified.
2. This means an attribute with a lower Gini Index should be preferred.
3. Sklearn supports the "gini" criterion for the Gini Index, and it is the
default value.

The formula for calculating the Gini Index is:

Gini = 1 − Σ (pᵢ)²

where pᵢ is the probability of class i in the node.

The Gini Index is a measure of the impurity of a distribution, commonly used in
decision trees and other machine learning algorithms. It ranges from 0 to 1,
where 0 means the node is perfectly pure (all samples belong to the same class)
and values close to 1 indicate maximum impurity (the samples are spread across
many classes).

Some additional features and characteristics of the Gini Index are:

1. It is calculated by summing the squared probabilities of each outcome
in a distribution and subtracting the result from 1.
2. A lower Gini Index indicates a more homogeneous or pure distribution,
while a higher Gini Index indicates a more heterogeneous or impure
distribution.
3. In decision trees, the Gini Index is used to evaluate the quality of a split
by measuring the difference between the impurity of the parent node
and the weighted impurity of the child nodes.
4. Compared to other impurity measures like entropy, the Gini Index is
faster to compute and more sensitive to changes in class probabilities.
5. One disadvantage of the Gini Index is that it tends to favor splits that
create equally sized child nodes, even if they are not optimal for
classification accuracy.
6. In practice, the choice between using the Gini Index or other impurity
measures depends on the specific problem and dataset, and often
requires experimentation and tuning.
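As a small illustration (pure Python; the class counts are made up), the Gini impurity of a node and the weighted impurity of a candidate split can be computed like this:

```python
def gini(class_counts):
    # Gini = 1 - sum(p_i^2) over the classes present in the node.
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def split_gini(children):
    # Weighted Gini impurity of the child nodes produced by a split.
    total = sum(sum(counts) for counts in children)
    return sum(sum(counts) / total * gini(counts) for counts in children)

print(gini([5, 5]))                   # 0.5: maximally impure two-class node
print(gini([10, 0]))                  # 0.0: pure node
print(split_gini([[8, 2], [1, 9]]))   # weighted impurity of a candidate split
```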
Decision Tree Splitting Method #3: Information Gain

Now, what if we have a categorical target variable? Reduction in variance
won't quite cut it.
The answer to that is Information Gain. Information Gain is used for splitting
the nodes when the target variable is categorical. It works on the concept of
entropy and is given by:

Information Gain = Entropy(parent) − weighted average Entropy(children)

Entropy is used for calculating the purity of a node: the lower the entropy, the
higher the purity of the node. The entropy of a homogeneous node is zero. Because
the weighted entropy of the children is subtracted from the entropy of the parent,
the information gain is highest for the purest splits. The entropy of a node is
calculated as:

Entropy = − Σ pᵢ log₂(pᵢ)

where pᵢ is the proportion of records in the node belonging to class i.

Steps to split a decision tree using Information Gain:

1. For each split, individually calculate the entropy of each child node
2. Calculate the entropy of each split as the weighted average entropy of
child nodes
3. Select the split with the lowest entropy or highest information gain
4. Until you achieve homogeneous nodes, repeat steps 1-3
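A minimal sketch of these steps (pure Python; the parent and child class counts are made up for one hypothetical split):

```python
import math

def entropy(class_counts):
    # Entropy = -sum(p_i * log2(p_i)); zero for a homogeneous node.
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def information_gain(parent_counts, children_counts):
    # Reduction in entropy achieved by splitting the parent node.
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical node with 9 "Yes" and 5 "No" records, split into two children.
parent = [9, 5]
children = [[6, 1], [3, 4]]
print("information gain:", information_gain(parent, children))
```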

Decision Tree Splitting Method #4: Chi-Square

Chi-square is another method of splitting nodes in a decision tree, used for
datasets with a categorical target variable. It can produce two or more splits.
It works on the statistical significance of the differences between the parent
node and the child nodes. The Chi-Square value for a class in a child node is:

Chi-Square = √((Actual − Expected)² / Expected)

Here, Expected is the expected count for a class in a child node based on
the distribution of classes in the parent node, and Actual is the actual count
for that class in the child node.

The formula above gives the Chi-Square value for one class. Summing the
Chi-Square values over all classes in a node gives the Chi-Square for that node.
The higher the value, the larger the differences between parent and child
distributions, i.e., the higher the homogeneity of the child node.

Here are the steps to split a decision tree using Chi-Square:

1. For each split, individually calculate the Chi-Square value of each child
node by taking the sum of Chi-Square values for each class in a node
2. Calculate the Chi-Square value of each split as the sum of Chi-Square
values for all the child nodes
3. Select the split with the highest Chi-Square value
4. Until you achieve homogeneous nodes, repeat steps 1-3
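A small illustration of these steps (pure Python; the parent class distribution and child counts are made up, and the per-class value follows the √((Actual − Expected)²/Expected) form used above):

```python
import math

def chi_square_node(actual_counts, parent_probs):
    # Sum the per-class Chi-Square values for one child node.
    n = sum(actual_counts)
    value = 0.0
    for actual, p in zip(actual_counts, parent_probs):
        expected = n * p  # expected count under the parent distribution
        value += math.sqrt((actual - expected) ** 2 / expected)
    return value

def chi_square_split(children_counts, parent_probs):
    # The Chi-Square of a split is the sum over all child nodes.
    return sum(chi_square_node(c, parent_probs) for c in children_counts)

# Hypothetical parent with classes split 50/50, and one candidate split.
parent_probs = [0.5, 0.5]
children = [[8, 2], [1, 9]]
print("chi-square of split:", chi_square_split(children, parent_probs))
```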

Naïve Bayes Classifier Algorithm


o Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional
training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective
classification algorithms; it helps in building fast machine
learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering,
sentiment analysis, and classifying articles.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior knowledge.
It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) P(A) / P(B)

Where,

P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the Likelihood: the probability of the evidence B given that hypothesis A is true.

P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.

P(B) is the Marginal probability: the probability of the evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the
below example:
Suppose we have a dataset of weather conditions with a corresponding target
variable "Play". Using this dataset, we need to decide whether we should play
on a particular day according to the weather conditions. To solve this problem,
we need to follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the below dataset:

    Outlook   Play
0   Rainy     Yes
1   Sunny     Yes
2   Overcast  Yes
3   Overcast  Yes
4   Sunny     No
5   Rainy     Yes
6   Sunny     Yes
7   Overcast  Yes
8   Rainy     No
9   Sunny     No
10  Sunny     Yes
11  Rainy     No
12  Overcast  Yes
13  Overcast  Yes

Frequency table for the weather conditions:

Weather    Yes   No
Overcast    5    0
Rainy       2    2
Sunny       3    2
Total      10    4

Likelihood table for the weather conditions:

Weather    No            Yes            All
Overcast   0             5              5/14 = 0.35
Rainy      2             2              4/14 = 0.29
Sunny      2             3              5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71

Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.35
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day the player can play the game.
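The same calculation can be reproduced in a few lines of code. This is a minimal sketch (pure Python, no intermediate rounding) that estimates the likelihood, prior, and evidence directly from the table above:

```python
# The weather dataset from the table above.
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

def posterior(weather, label):
    # P(label | weather) = P(weather | label) * P(label) / P(weather)
    n = len(play)
    n_label = play.count(label)
    n_weather = outlook.count(weather)
    n_both = sum(1 for o, p in zip(outlook, play) if o == weather and p == label)
    likelihood = n_both / n_label   # P(weather | label)
    prior = n_label / n             # P(label)
    evidence = n_weather / n        # P(weather)
    return likelihood * prior / evidence

print("P(Yes|Sunny) =", round(posterior("Sunny", "Yes"), 2))
print("P(No|Sunny)  =", round(posterior("Sunny", "No"), 2))
```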
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting
the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well on multi-class prediction compared to many other
algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so
it cannot learn the relationship between features.

Applications of Naïve Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is
an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment
analysis.

Bayesian Belief Networks


Bayesian classification is based on Bayes' Theorem. Bayesian classifiers
are statistical classifiers: they can predict class membership probabilities,
such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of
probabilities −
● Posterior Probability [P(H|X)]
● Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
Bayesian Belief Network

Bayesian Belief Networks specify joint conditional probability distributions.
They are also known as Belief Networks, Bayesian Networks, or Probabilistic
Networks.

● A Belief Network allows class conditional independencies to be defined
between subsets of variables.
● It provides a graphical model of causal relationships on which learning
can be performed.
● We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −
● Directed acyclic graph
● A set of conditional probability tables

Directed Acyclic Graph

● Each node in a directed acyclic graph represents a random variable.
● These variables may be discrete- or continuous-valued.
● These variables may correspond to actual attributes given in the data.

Directed Acyclic Graph Representation

The following diagram shows a directed acyclic graph for six Boolean variables.

The arcs in the diagram represent causal knowledge. For example, lung cancer
is influenced by a person's family history of lung cancer, as well as by whether
or not the person is a smoker. It is worth noting that the variable PositiveXray
is independent of whether the patient has a family history of lung cancer or is
a smoker, given that we know the patient has lung cancer.

Conditional Probability Table

The conditional probability table (CPT) for the variable LungCancer (LC) gives
the probability of each value of LC for every possible combination of the values
of its parent nodes, FamilyHistory (FH) and Smoker (S).
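As an illustrative sketch (the probability values below are made up for demonstration, not taken from the original table), such a CPT can be represented and queried in code:

```python
# Hypothetical CPT for P(LungCancer = yes | FamilyHistory, Smoker).
# Keys are (FamilyHistory, Smoker); the values are illustrative probabilities only.
cpt_lung_cancer = {
    ("yes", "yes"): 0.8,
    ("yes", "no"):  0.5,
    ("no",  "yes"): 0.7,
    ("no",  "no"):  0.1,
}

def p_lung_cancer(family_history, smoker, value="yes"):
    # Look up P(LC = value | FH, S) in the conditional probability table.
    p_yes = cpt_lung_cancer[(family_history, smoker)]
    return p_yes if value == "yes" else 1.0 - p_yes

print(p_lung_cancer("yes", "no"))        # P(LC = yes | FH = yes, S = no)
print(p_lung_cancer("no", "no", "no"))   # P(LC = no  | FH = no,  S = no)
```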

K-Nearest Neighbor (KNN) Classification Algorithm

What is KNN?

KNN (K-Nearest Neighbors) is one of many supervised learning algorithms used in
data mining and machine learning. It is a classifier algorithm in which the
learning is based on how similar one data point (a vector) is to the others.

How does it work?

KNN is pretty simple. Imagine that you have data about colored balls:

● Purple balls;

● Yellow balls;

● And a ball whose color you do not know, although you have all the other
data about it (everything except the color label).
So, how are you going to determine the ball's color? Imagine that, like a machine,
you have only the ball's characteristics (its data) but not its final label.
How would you decide the ball's color (the final label, i.e., its class)?

Note: Suppose that the rows labeled 1 (with class R) refer to the purple balls
and the rows labeled 2 (with class A) refer to the yellow balls; this is just to
make the explanation easier.

Each row refers to a ball and each column refers to one of the ball's
characteristics; in the last column we have the class (color) of each ball:

● R -> purple;

● A -> yellow

We have 5 balls (5 rows), each one with its classification. You could try to
discover the new ball's color (its class) in many ways; one of them is to compare
the new ball's characteristics with those of all the others and see which it
resembles most. If the data (characteristics) of the new ball, whose correct class
you do not know, are similar to the data of the yellow balls, then its color is
yellow; if its data are more similar to those of the purple balls than to the
yellow ones, then its color is purple. It sounds simple, and that is almost
exactly what KNN does, just in a more sophisticated way.
The KNN steps are:
1 — Receive an unclassified data point;
2 — Measure the distance (Euclidean, Manhattan, Minkowski, or weighted)
from the new data point to all the data points that are already classified;
3 — Get the K (K is a parameter that you define) smallest distances;
4 — Check the classes of the points with the smallest distances and count how
many times each class appears;
5 — Take as the correct class the class that appeared most often;
6 — Classify the new data point with the class chosen in step 5.

Calculating distance:
Calculating the distance between two points (your new sample and each of the
points already in your dataset) is very simple. As said before, there are several
ways to obtain this value; in this article we will use the Euclidean distance.

The Euclidean distance between two points a and b with p features is:

d(a, b) = √((a1 − b1)² + (a2 − b2)² + ⋯ + (ap − bp)²)
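Putting the steps together, a minimal from-scratch sketch of the KNN classifier (pure Python; the tiny ball dataset and its feature values are made up for illustration) could look like this:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(new_point, dataset, k=3):
    # dataset is a list of (feature_vector, class_label) pairs.
    # Steps 1-2: measure the distance from the new point to every labeled point.
    distances = [(euclidean(new_point, features), label)
                 for features, label in dataset]
    # Step 3: keep the k smallest distances.
    nearest = sorted(distances)[:k]
    # Steps 4-5: count the classes among the k neighbors, take the most frequent.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny made-up dataset of balls: (features, color class).
balls = [([1.0, 1.1], "R"), ([0.9, 1.0], "R"), ([1.2, 0.8], "R"),
         ([3.0, 3.2], "A"), ([3.1, 2.9], "A")]

print(knn_classify([2.8, 3.0], balls, k=3))  # expected: "A" (yellow)
```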

Characteristics of KNN
Between-sample geometric distance
The k-nearest-neighbor classifier is commonly based on the Euclidean
distance between a test sample and the specified training samples. Let xi be
an input sample with p features (xi1, xi2, …, xip), let n be the total number of
input samples (i = 1, 2, …, n), and let p be the total number of features
(j = 1, 2, …, p). The Euclidean distance between samples xi and xl
(l = 1, 2, …, n) is defined as
d(xi, xl) = √((xi1 − xl1)² + (xi2 − xl2)² + ⋯ + (xip − xlp)²)

Classification decision rule and confusion matrix


Classification typically involves partitioning the samples into training and
testing sets. Let xi be a training sample and x be a test sample, let ω be the
true class of a training sample, and let ω̂ be the predicted class for a test
sample (ω, ω̂ = 1, 2, …, Ω), where Ω is the total number of classes. The
predictions for the test samples can be summarized in a confusion matrix, which
counts, for every pair of classes, how many samples of true class ω were
predicted as class ω̂.
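As a brief sketch (assuming scikit-learn is available; the label vectors below are made up for illustration), a confusion matrix can be computed from true and predicted labels as follows:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted class labels for six test samples.
y_true = ["yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no",  "no", "no", "yes", "yes"]

# Rows correspond to the true class, columns to the predicted class.
print(confusion_matrix(y_true, y_pred, labels=["yes", "no"]))
```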
Feature transformation
Increased performance of a classifier can sometimes be achieved when the
feature values are transformed prior to classification analysis. Two commonly
used feature transformations are standardization and fuzzification.
Standardization removes scale effects caused by use of features with
different measurement scales. For example, if one feature is based on patient
weight in units of kg and another feature is based on blood protein values in
units of ng/dL in the range [-3,3], then patient weight will have a much
greater influence on the distance between samples and may bias the
performance of the classifier. Standardization transforms raw feature values
into z-scores using the mean and standard deviation of a feature's values over
all input samples, given by the relationship
zij = (xij − μj) / σj,
where xij is the value of the jth feature for the ith sample, μj is the mean of
all xij for feature j, and σj is the standard deviation of all xij for feature j
over all input samples.
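A minimal sketch of z-score standardization (NumPy only; the feature matrix is made up for illustration, with two features on very different scales):

```python
import numpy as np

# Hypothetical raw feature matrix: rows are samples, columns are features
# (e.g., weight in kg and a blood protein value on a much smaller scale).
X = np.array([[70.0, 1.2],
              [82.0, -0.5],
              [65.0, 2.1],
              [90.0, 0.3]])

mu = X.mean(axis=0)       # mean of each feature over all samples
sigma = X.std(axis=0)     # standard deviation of each feature

Z = (X - mu) / sigma      # z_ij = (x_ij - mu_j) / sigma_j
print(Z)
```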
Performance assessment with cross-validation
A basic rule in classification analysis is that class predictions are not made for
data samples that were used for training or learning. If class predictions are
made for samples used in training, the accuracy estimate will be artificially
biased upward. Instead, class predictions are made for samples that are kept
out of the training process.
The performance of most classifiers is typically evaluated
through cross-validation, which involves determining classification
accuracy for multiple partitions of the input samples used in training. For
example, during 5-fold (κ = 5) cross-validation, the set of input samples
is split into 5 partitions D1, D2, …, D5 of equal size, to the extent possible.
Ensuring uniform class representation among the partitions is called stratified
cross-validation, which is preferred. To begin 5-fold cross-validation, the
samples in partitions D2, D3, …, D5 are first used for training while the
samples in partition D1 are used for testing. Next, the samples in partitions
D1, D3, …, D5 are used for training and the samples in partition D2 are used
for testing. This is repeated until each partition has been used exactly once
for testing. It is also customary to re-partition all of the input samples
several times (e.g., 10 times) in order to get a better estimate of accuracy.
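As a concluding sketch (assuming scikit-learn and its bundled Iris dataset; the classifier and parameter choices are illustrative), stratified 5-fold cross-validation of a k-nearest-neighbor classifier might look like this:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []

for train_idx, test_idx in skf.split(X, y):
    # Train on four partitions, test on the held-out partition.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X[train_idx], y[train_idx])
    fold_accuracies.append(knn.score(X[test_idx], y[test_idx]))

print("per-fold accuracy:", np.round(fold_accuracies, 3))
print("mean accuracy:", np.mean(fold_accuracies))
```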
