CLASSIFICATION

Structure
10.1 Introduction
10.2 Objectives
10.3 Understanding of Supervised Learning
10.4 Introduction to Classification
10.5 Classification Algorithms
10.5.1 Naïve Bayes
10.6 Summary
10.7 Solutions/Answers
10.8 Further Readings
10.1 INTRODUCTION
What exactly does learning entail? And what is meant by "machine learning"? These are philosophical questions, but we will not dwell on philosophy in this unit; the focus will be on gaining a solid understanding of how things work in practice. Many of the ideas covered here, such as classification and clustering, are also addressed in the field of data mining, so in this unit we revisit those concepts. Therefore, in order to build a better understanding, the first step is to differentiate between the two fields of study known as data mining and machine learning. At their core, data mining and machine learning are both about learning from data and improving decision-making. However, they approach the problem in different ways. To get started, let us address the most important question: what exactly is the difference between Data Mining and Machine Learning?
What is data mining? Data mining is a subset of business analytics that involves
exploring an existing huge dataset in order to discover previously unknown
patterns, correlations, and anomalies that are present in the data. This process is
referred to as "data exploration." It enables us to come up with wholly original
ideas and perspectives.
What exactly is meant by "machine learning"? The field of artificial intelligence
(AI) includes the subfield of machine learning . Machine learning involves
computers performing analyses on large data sets, after which the computers
"learn" patterns that will assist them in making predictions regarding additional
data sets. It is not necessary for a person to interact with the computer for it to learn from the data; the initial programming and possibly some fine-tuning are all that are required.
It has come to our attention that there are a number of parallels between the two
ideas, namely Data Mining and Machine Learning. These parallels include the
following:
• Both are considered to be analytical processes;
• Both are effective at recognising patterns;
• Both focus on gaining knowledge from data in order to enhance decision-
making capabilities;
• Both need a substantial quantity of information in order to be precise.
Due to these similarities, people often confuse the two concepts. To clearly demarcate them, one should understand the key differences between Data Mining and Machine Learning.
The following are some of the most important distinctions between the two:
• Machine learning goes beyond what has happened in the past to make
predictions about future events based on the pre-existing data. Data mining,
on the other hand, consists of just looking for patterns that already exist in
the data.
• At the beginning of the process of data mining, the 'rules' or patterns that
will be used are unknown. In contrast, when it comes to machine learning,
the computer is typically provided with some rules or variables to follow in
order to comprehend the data and learn from it.
• The mining of data is a more manual process that is dependent on the
involvement and choice-making of humans. With machine learning, on
the other hand, once the foundational principles have been established,
the process of information extraction, as well as "learning" and refining, is
fully automated and does not require the participation of a human. To put it
another way, the machine is able to improve its own level of intelligence.
• Finding patterns in an existing dataset (like a data warehouse) can be
accomplished through the process of data mining. On the other hand,
machine learning is trained on a data set referred to as a "training" data set,
which teaches the computer how to make sense of data and then how to
make predictions about fresh data sets.
The approaches to data mining problems are based on the type of information/knowledge to be mined. We will emphasize three different approaches: Classification, Clustering, and Association Rules.
The classification task puts data into groups or classes that have already been defined. The value of a user-specified goal attribute indicates the class of a tuple. Tuples are made up of one or more predicting attributes and a goal attribute. The task is to find some kind of relationship between the predicting attributes and the goal attribute, so that the discovered information or knowledge can be used to predict the class of new tuple(s).
The purpose of the clustering process is to create distinct classes from groups of tuples that share characteristic values. Clustering is the process of defining a mapping, using as input a database containing tuples and an integer value k, in such a way that the tuples are mapped to various clusters.
The idea entails increasing the degree of similarity within a class while decreasing the degree of similarity between classes. There is no goal attribute in the clustering process. Clustering is therefore an example of unsupervised classification, in contrast to classification, which is supervised by the goal attribute.
The goal of association rule mining is to find interesting connections between
elements in a data set. Its initial use was for "market basket data." The rule is
written as X→Y, where X and Y are two sets of objects that do not intersect.
Support and confidence are the two metrics for any rule. The aim is to identify rules whose support and confidence are above the user-specified minimum support and minimum confidence.
The distance measure determines the distance between items or their
dissimilarity. The following are the measures used in this unit:
Euclidean distance: dis(ti, tj) = √( Σ_{h=1}^{k} (tih − tjh)² )

where ti and tj are tuples and h indexes the attributes, which can take values from 1 to k.
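To make the formula concrete, here is a minimal Python sketch of this Euclidean distance for two tuples represented as lists of k numeric attribute values; the function name and the sample tuples are illustrative only and are not part of the original text.

import math

def dis(ti, tj):
    # Euclidean distance: square root of the sum of squared
    # attribute-wise differences over the k attributes.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ti, tj)))

# Two illustrative 3-attribute tuples
print(dis([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))   # 5.0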
There are some clear differences between the two, though. But as businesses
try to get better at predicting the future, machine learning and data mining
may merge more in the future. For example, more businesses may want to use
machine learning algorithms to improve their data mining analytics.
Machine learning algorithms use computational methods to "learn" information
directly from data, without using an equation as a model. As more examples
are available for learning, the algorithms get better and better at what they do.
Machine learning algorithms look for patterns in data that occur naturally. This
gives you more information and helps you make better decisions and forecasts.
They are used every day to make important decisions in diagnosing medical
conditions, trading stocks, predicting energy load, and more. Machine learning
is used by media sites to sort through millions of options and suggest songs or
movies. It helps retailers figure out what their customers buy and how they buy
it. With the rise of "big data," machine learning has become very important for
solving problems in areas like:
• Computational finance, for applications such as credit scoring and
algorithmic trading
• Face identification, motion detection, and object detection can all be
accomplished through image processing and computer vision.
• Tumor detection, drug development, and DNA sequencing can all be accomplished through computational biology.
• Production of energy, for the sake of pricing and load forecasting
• Automotive, aerospace, and manufacturing, for the purpose of predictive maintenance
• Processing of natural languages
In general, Classical Machine Learning Algorithms can be put into two groups:
Supervised Learning Algorithms, which use data that has been labelled, and
Un-Supervised Learning Algorithms, which use data that has not been labelled
and are used for Clustering. We will talk more about Clustering in Unit 15,
which is part of Block 4 of this course.
[Figure: Machine learning taxonomy – supervised learning, which develops a predictive model based on both input and output data, branches into classification and regression.]
10.2 OBJECTIVES
After completing this unit, you should be able to:
• Understand Supervised Learning
• Understand Un-Supervised Learning
• Understand various Classification Algorithms
Your responses to these questions will assist you in determining whether supervised or unsupervised learning is best for you.

[Figure: Machine learning workflow at a glance – the steps include pre-processing the data and iterating to find the best model.]
Example: Predicting heart attacks with the help of supervised learning: Let's
say doctors want to know if someone will have a heart attack in the next year.
They have information about the age, weight, height, and blood pressure of
past patients. They also know whether these past patients had heart attacks within a year. The problem, then, is to build a model from the existing data that can tell whether a new person will have a heart attack within the next year.
The following steps are involved in supervised learning, and they are self-explanatory (an illustrative sketch follows the list):
1. Determine the Type of Training Examples
2. Prepare/Gather the Training Data
3. Determine Relation Between Input Feature & Representing Learned
Function
4. Select a Learning Algorithm
5. Run the Selected Algorithm on Training Data
6. Evaluate the Accuracy of the Learned Function Using Values from Test Set
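The six steps above map naturally onto a typical scikit-learn workflow. The sketch below is illustrative only and assumes scikit-learn is installed; the Iris dataset and the Gaussian naïve Bayes classifier are merely stand-ins for whatever data and algorithm are chosen in steps 1 to 4.

from sklearn.datasets import load_iris                 # step 2: gather the training data
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB             # step 4: select a learning algorithm
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # steps 1 and 3: labelled feature vectors
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)                            # step 5: run the algorithm on the training data

y_pred = model.predict(X_test)                         # step 6: evaluate using the test set
print("Accuracy:", accuracy_score(y_test, y_pred))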
There are some common issues which are generally faced when one applies the
Supervised Learning, and they are listed below:
(i) Training and classifying require a lot of computer time, especially when big
data is involved.
(ii) Overfitting: The model may learn so much from the noise in the data that it treats the noise as a concept to be learned instead of recognising it as error.
(iii) If an input does not fit into any of the known classes, the model will assign it to one of the existing classes instead of creating a new one (unlike unsupervised learning, which can discover new groupings).
Let us discuss some of the practical applications of Supervised Machine Learning. For beginners at least, knowing 'what does supervised learning achieve' is probably equally or more important than simply knowing 'what is supervised learning'.
A very large number of practical applications of the method can be outlined, but the following are some of the common ones:
a) Detection of spam
b) Detection of fraudulent banking or other activities
c) Medical Diagnosis
d) Image recognition
e) Predictive maintenance
With increasing applications each day in all the fields, machine learning
knowledge is an essential skill.
☞ Check Your Progress 1
1. Compare between Supervised and Un-Supervised Learning.
.....................................................................................................................
.....................................................................................................................
.....................................................................................................................
2. List the Steps Involved in Supervised Learning
.....................................................................................................................
.....................................................................................................................
.....................................................................................................................
3. What are the Common Issues Faced While Using Supervised Learning
.....................................................................................................................
.....................................................................................................................
.....................................................................................................................
Popular algorithms that can be used for binary classification include:
• k-Nearest Neighbors.
• Decision Trees.
• Support Vector Machine.
• Naive Bayes.
Major application areas of Binary Classification:
• Detection of Email spam.
• Prediction of Churn.
• Prediction of Purchase or Conversion(buy or not).
• Multi-class classification: Multi-class classification means putting things
into more than two groups. In multiclass classification, there is only one
target label for each sample. This is done by predicting which of more than
two classes the sample belongs to. An animal, for instance, can either be a
cat or a dog, but not both. Face classification, plant species classification,
and optical character recognition are some of the examples.
Popular algorithms that can be used for multi-class classification include:
• k-Nearest Neighbors.
• Decision Trees.
• Naive Bayes.
• Random Forest.
• Gradient Boosting.
Binary classification algorithms can be changed to work for problems with
more than two classes. This is done by fitting multiple binary classification
models for each class vs. all other classes (called "one-vs-rest") or one model
for each pair of classes (called one-vs-one).
• One versus the Rest: Fit one binary classification model for each class
versus all of the other classes in the dataset.
• One-versus-one: Fit one binary classification model for each pair of classes.
Binary classification techniques such as logistic regression and support vector machines are examples of algorithms that can use these strategies for multi-class classification.
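As a rough illustration of these two strategies (assuming scikit-learn is available), the wrappers below turn a binary classifier such as logistic regression into a multi-class one; the wine dataset is only a placeholder.

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_wine(return_X_y=True)          # a 3-class dataset used only for illustration

# One-vs-Rest: one binary model per class versus all other classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=5000)).fit(X, y)

# One-vs-One: one binary model per pair of classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=5000)).fit(X, y)

print(len(ovr.estimators_))   # 3 models (one per class)
print(len(ovo.estimators_))   # 3 models (one per pair: 3*2/2)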
• Multi-label classification: Multi-label classification, also known as more
than one class classification, is a classification task in which each sample
is mapped to a collection of target labels. This classification task involves
making predictions about one or more classes; say for example, a news
story can be about Games, People, and a Location all together at the same
time.
Note: Classification algorithms used for binary or multi-class classification cannot be used directly for multi-label classification.
cannot be used directly for multi-label classification.
Specialized versions of standard classification algorithms can be used, so-called
multi-label versions of the algorithms, including:
• Multi-label Decision Trees
• Multi-label Random Forests
• Multi-label Gradient Boosting
Another approach is to use a separate classification algorithm to predict the
labels for each class.
• An imbalanced classification task is one in which the number of examples in each class is not the same. Examples include fraud detection, outlier detection, and medical diagnostic tests. These problems are modelled as binary classification tasks, but specialised methods may be needed to solve them.
The different types of classification discussed above involve different types of learners. Learners in classification problems are categorized into the following two types:
1. Lazy Learners: Lazy Learner will first store the training dataset, and then
it will wait until it is given the test dataset. In the case of the Lazy learner,
classification is done based on the information in the training dataset that is
most relevant to the question at hand. Less time is needed for training, but
more time is needed for making predictions. K-NN algorithm and Case-
based reasoning are two examples.
2. Eager Learners: Eager learners build a classification model from the training dataset before they receive a test dataset. In contrast to lazy learners, eager learners spend more time on training and less time on making predictions. Decision Trees, Naive Bayes, and ANN are some examples.
Examples of eager learners include the classification techniques of Bayesian
classification, decision tree induction, rule-based classification, classification
by back-propagation, support vector machines, and classification based on
association rule mining. Eager learners, when presented with a collection
of training pairs, will construct a generalisation model (also known as a
classification model) before being presented with new tuples, also known as
test tuples, to classify. One way to think about the learnt model is as one that is
prepared and eager to categorise tuples that have not been seen before.
Imagine, on the other hand, the lazy learner approach, in which the learner waits until the last moment to develop a model for classifying a given test tuple. In the lazy approach the learner simply stores the training tuples given for classification and does not generalise until it is given a test tuple; after receiving the test tuple, it classifies the tuple based on how similar it is to
the training tuples that it has previously stored. Lazy learning methods, on the other hand, do less work when a training pair is presented but more work when classifying or making a prediction. Because lazy learners store the training tuples, which are also called "instances", they are also referred to as "instance-based learners", even though all learning is essentially based on instances.
When classifying or making a prediction, lazy learners can require a lot of processing power. They need efficient storage techniques and are well suited to implementation on parallel hardware. They offer little explanation or insight into the structure of the data. Lazy learners, however, naturally support incremental learning. They can model complex decision spaces
with hyper-polygonal shapes that other learning algorithms may not be able to
do as well (such as hyper-rectangular shapes modeled by decision trees). The
k-nearest neighbour classifier and the case-based reasoning classifier are both
types of lazy learners.
☞ Check Your Progress 2
1. Compare between Multi Class and Multi Label Classification
.....................................................................................................................
.....................................................................................................................
.....................................................................................................................
2. Compare between structured and unstructured data
.....................................................................................................................
.....................................................................................................................
.....................................................................................................................
3. Compare between Lazy learners and Eager Learners algorithms for machine
learning.
.....................................................................................................................
.....................................................................................................................
.....................................................................................................................
3. AUC-ROC curve:
• The letters AUC and ROC stand for "area under the curve" and "receiver
operating characteristics curve," respectively.
• This graph shows how well the classification model works at several
different thresholds.
• We use the AUC-ROC Curve to see how well the multi-class classification
model is doing.
• The ROC curve is plotted using the True Positive Rate (TPR) and the False Positive Rate (FPR), with TPR on the Y-axis and FPR on the X-axis (a brief computational sketch follows this list).
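A minimal sketch of computing the ROC curve and the AUC with scikit-learn, assuming it is installed, is given below; the synthetic data and the logistic regression model are purely illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)      # synthetic binary data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)   # FPR on the X-axis, TPR on the Y-axis
print("AUC:", roc_auc_score(y_te, scores))       # area under the ROC curve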
Classification algorithms have several applications, Following are some popular
applications or use cases of Classification Algorithms:
o Detecting Email Spam
o Recognizing Speech
o Detection of Cancer tumor cells.
o Classifying Drugs
o Biometric Identification, etc.
☞ Check Your Progress 3
1. List the classification algorithms under the categories of Linear and Non-
Linear Models. Also Discuss the various methods used for evaluating a
classification model
.....................................................................................................................
.....................................................................................................................
.....................................................................................................................
.....................................................................................................................
Note: From the data sample X and the training data, we can obtain P(X), P(X | H), and P(H). However, P(H | X), the probability that defines the likelihood that X belongs to a class C, cannot be computed directly from the data. Bayes' theorem serves exactly this purpose.
Bayes' theorem states:

P(H | X) = P(X | H) P(H) / P(X)
Now after defining the Bayes theorem, let us explain the Bayesian classification
with the help of an example.
i) Consider the sample having an n-dimensional feature vector. For our
example, it is a 3-dimensional (Department, Age, Salary) vector with
training data as given in the Figure 3.
ii) Assume that there are m classes C1 to Cm, and an unknown sample X. The problem is to determine which class X belongs to. As per Bayesian classification, the sample is assigned to class Ci if the following holds:
P(Ci|X) > P(Cj|X) where j is from 1 to m but j ≠ i
In other words the class for the data sample X will be the class, which has the
maximum probability for the unknown sample. Please note: The P(Ci |X) will
be found using:
P(Ci | X) = P(X | Ci) P(Ci) / P(X)    ...(3)
In our example, we are trying to classify the following data:
X = (Department = "Personnel", Age = "31 – 40", Salary = "Medium_Range")
into two classes (based on position): C1 = _BOSS_ or C2 = _ASSISTANT_.
iii) The value of P(X) is constant for all the classes; therefore, only P(X|Ci) P(Ci) needs to be maximised. Also, if the classes are equally likely, that is, P(C1) = P(C2) = … = P(Cm), then we only need to maximise P(X|Ci).
How is P(Ci) calculated?

P(Ci) = (Number of training samples of class Ci) / (Total number of training samples)

In our example, P(C1) = 5/11 and P(C2) = 6/11, so P(C1) ≠ P(C2).
iv) The calculation of P(X|Ci) may be computationally expensive if there are a large number of attributes. To simplify the evaluation, the naïve Bayesian classification uses the assumption of class conditional independence, that is, the values of the attributes are independent of each other. In such a situation:

P(X|Ci) = Π_{k=1}^{n} P(xk|Ci)    ...(4)

where xk represents a single dimension or attribute.
The P(xk|Ci) can be calculated using a mathematical function if the attribute is continuous; otherwise, if it is categorical, this probability can be calculated as:

P(xk|Ci) = (Number of training samples of class Ci having the value xk for the attribute Ak) / (Number of training samples belonging to Ci)

Example: Consider the following weather data, in which the task is to predict whether play will take place for a given outlook:
_OUTLOOK_ _PLAY_
0 _RAINY_ YES
1 _SUNNY_ YES
2 _OVERCAST_ YES
3 _OVERCAST_ YES
4 _SUNNY_ NO
5 _RAINY_ YES
6 _SUNNY_ YES
7 _OVERCAST_ YES
8 _RAINY_ NO
9 _SUNNY_ NO
10 _SUNNY_ YES
11 _RAINY_ NO
12 _OVERCAST_ YES
13 _OVERCAST_ YES
Frequency table:

_WEATHER_      NO    YES
_OVERCAST_      0      5
_RAINY_         2      2
_SUNNY_         2      3
TOTAL           4     10

Likelihood table:

_WEATHER_      NO    YES
_OVERCAST_      0      5    5/14 = 0.35
_RAINY_         2      2    4/14 = 0.29
_SUNNY_         2      3    5/14 = 0.35
ALL        4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes|_SUNNY_)= P(_SUNNY_|Yes)*P(Yes)/P(_SUNNY_)
P(_SUNNY_|Yes)= 3/10= 0.3
P(_SUNNY_)= 0.35
P(Yes)=0.71
So P(Yes|_SUNNY_) = 0.3*0.71/0.35= 0.60
P(No|_SUNNY_)= P(_SUNNY_|No)*P(No)/P(_SUNNY_)
P(_SUNNY_|NO)= 2/4=0.5
P(No)= 0.29
P(_SUNNY_)= 0.35
So P(No|_SUNNY_)= 0.5*0.29/0.35 = 0.41
As we can see from the above calculation, P(Yes|_SUNNY_) > P(No|_SUNNY_).
Hence, on a sunny day, the player can play the game.
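The same calculation can be reproduced with a few lines of Python. The sketch below simply recomputes the counts and the posterior probabilities from the 14-row weather table above; it is illustrative only.

data = [("RAINY","YES"),("SUNNY","YES"),("OVERCAST","YES"),("OVERCAST","YES"),
        ("SUNNY","NO"),("RAINY","YES"),("SUNNY","YES"),("OVERCAST","YES"),
        ("RAINY","NO"),("SUNNY","NO"),("SUNNY","YES"),("RAINY","NO"),
        ("OVERCAST","YES"),("OVERCAST","YES")]

n = len(data)                                                        # 14 observations
p_yes = sum(1 for _, play in data if play == "YES") / n              # P(Yes)   = 10/14
p_sunny = sum(1 for outlook, _ in data if outlook == "SUNNY") / n    # P(Sunny) = 5/14
p_sunny_given_yes = (sum(1 for o, p in data if o == "SUNNY" and p == "YES")
                     / sum(1 for _, p in data if p == "YES"))        # P(Sunny|Yes) = 3/10

# Bayes' theorem: P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6, so "play" is predicted on a sunny day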
☞ Check Your Progress 4
1. Predicting a class label using naïve Bayesian classification. We wish to
predict the class label of a tuple using naïve Bayesian classification, given
the training data as shown in Table-1 Below. The data tuples are described
by the attributes age, income, student, and credit rating.
The class label attribute known as "buys computer" can take on one of two
distinct values—specifically, "yes" or "no." Let's say that C1 represents the class
buying a computer and C2 represents the class deciding not to buy a computer.
We are interested in classifying X as having the following characteristics: (age
= youth, income = medium, student status = yes, credit rating = fair).
Some of the most common techniques used for classification include Decision Trees, K-NN, etc. Most of these techniques are based on finding distances or use statistical methods.
The distance measure finds the distance or dissimilarity between objects. The measures used in this unit are as follows:

• Manhattan distance: dis(ti, tj) = Σ_{h=1}^{k} |tih − tjh|

where ti and tj are tuples and h indexes the attributes, which can take values from 1 to k.
In this section, we look at the distance based classifier i.e. the k-nearest neighbor
classifiers.
Nearest-neighbour classifiers work by comparing a given test tuple with the training tuples that are similar to it.
There are n different characteristics that can be used to define the training tuples.
Each tuple can be thought of as a point located in a space that has n dimensions.
In this method, each and every one of the training tuples is preserved within
an n-dimensional pattern space. A K-nearest-neighbor classifier searches the
pattern space in order to find the k training tuples that are the most comparable
to an unknown tuple. These k training tuples are referred to as the "k nearest
neighbours" of the unknown tuple.
A distance metric, like Euclidean distance, is used to define "closeness". The Euclidean distance between two points or tuples, say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

dist(X1, X2) = √( Σ_{i=1}^{n} (x1i − x2i)² )    (eq. 1)
In other words, for each numeric attribute, we take the difference in value
between the corresponding values of that property in tuple X1 and the values of
that attribute in tuple X2 and then square this difference. Finally, we add up all
of these differences. The final tally of all the accumulated distances is used to
calculate the square root. In most cases, prior to making use of (eq. 1), we first normalize the values of every attribute. This prevents attributes with initially large ranges (like income) from outweighing attributes with originally smaller ranges (such as binary attributes). Min-max normalization, for example, can be used to change the value v of a numeric attribute A to a value v' in the range [0, 1]:

v' = (v − minA) / (maxA − minA)    (eq. 2)

where minA and maxA are the minimum and maximum values of attribute A.
For the purpose of k-nearest-neighbor classification, the unknown tuple
is assigned to the class that has the highest frequency among its k closest
neighbours. When k equals 1, the class of the training tuple that is assigned to
the unknown tuple is the one that is most similar to the unknown tuple in pattern
space. It is also possible to utilise nearest neighbour classifiers for prediction,
which means that they can be used to deliver a real-valued forecast for a given
unknown tuple. The result that the classifier produces in this scenario is the
weighted average of the real-valued labels that are associated with the unknown
tuple's k nearest neighbours.
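A compact sketch of this procedure in plain Python is given below, using the Euclidean distance and a simple majority vote; the tiny dataset is invented purely for illustration, and in practice the attributes would first be min-max normalized as in (eq. 2).

import math
from collections import Counter

def knn_classify(train, labels, x, k=3):
    # distance from x to every stored training tuple (Euclidean distance, eq. 1)
    dists = [(math.dist(t, x), lab) for t, lab in zip(train, labels)]
    nearest = sorted(dists)[:k]                    # the k closest neighbours
    # majority vote among the k nearest neighbours
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]

# Illustrative training tuples (assumed already normalized) and their classes
train = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = ["A", "A", "B", "B"]
print(knn_classify(train, labels, (0.85, 0.75), k=3))   # -> "B"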
Classification Using Distance (K-Nearest Neighbours) - Some of the basic
points to be noted about this algorithm are:
• The training set includes classes along with other attributes. (Please refer to
the training data given in the Table given below).
• The value of the K defines the number of near items (items that have less
distance to the attributes of concern) that should be used from the given set
of training data (just to remind you again, training data is already classified
data). This is explained in point (2) of the following example.
• A new item is placed in the class in which the most number of close items
are placed. (Please refer to point (3) in the following example).
2. Sort the distances and determine the nearest neighbours based on the K-th minimum distance
minimum distance
.....................................................................................................................
.....................................................................................................................
.....................................................................................................................

10.5.3 Decision Trees
Given a data set D = {t1,t2 …, tn} where ti=<ti1, …, tih>, that is, each tuple is
represented by h attributes, assume that, the database schema contains attributes
as {A1, A2, …, Ah}. Also, let us suppose that the classes are C={C1, …., Cm},
then:
Decision or Classification Tree is a tree associated with D such that
• Each internal node is labeled with attribute, Ai
• Each arc is labeled with the predicate which can be applied to the attribute
at the parent node.
• Each leaf node is labeled with a class, Cj
Basics steps in the Decision Tree are as follows:
• Building the tree by using the training set dataset/database.
• Applying the tree to the new dataset/database.
Decision Tree Induction is the process of learning about the classification using
the inductive approach. During this process, we create a decision tree from the
training data. This decision tree can, then be used, for making classifications.
To proceed, we need the following definitions.

Let us assume that we are given probabilities p1, p2, ..., ps whose sum is 1. Let us also define the term Entropy, which is a measure of the amount of randomness, surprise or uncertainty. Our basic goal in the classification process is that the entropy for a classification should be zero: if there is no surprise, the entropy is zero. Entropy is defined as:

H(p1, p2, ..., ps) = Σ_{i=1}^{s} pi * log(1/pi)    ...(1)

Gain(D, S) = H(D) − Σ_{i=1}^{s} P(Di) * H(Di)    ...(2)

where S represents the new states {D1, D2, D3, ..., Ds} and H(D) measures the amount of order in state D.

Consider the following data, in which the Position attribute acts as the class:
Department Age Salary Position
Since age has the maximum gain, this attribute is selected as the first splitting attribute. In the age range 31 – 40 the class is not defined, while for the other ranges it is defined. So we have to again calculate the splitting attribute for this age range (31 – 40). Now, the tuples that belong to this range are as follows:
= 0.13335
The Gain is maximum for the salary attribute, so we take salary as the next splitting attribute. In the middle salary range the class is not defined, while for the other ranges it is defined. So we have to again calculate the splitting attribute for this middle range. Since only department is left, department will be the next splitting attribute. Now, the tuples that belong to this salary range are as follows:
Department Position
_PERSONNEL_ _BOSS_
_ADMIN_ _BOSS_
_ADMIN_ _ASSISTANT_
Again in the _PERSONNEL_ department, all persons are _BOSS_, while, in
the _ADMIN_ there is a tie between the classes. So, the person can be either
_BOSS_ or _ASSISTANT_ in the _ADMIN_ department.
Now the decision tree will be as follows:
Figure 4: The decision tree using ID3 algorithm for the sample data
Now, we will take a new dataset and we will classify the class of each tuple by
applying the decision tree that we have built above.
Steps of the decision tree algorithm (an illustrative sketch follows the list):
1. Data Pre-processing step
2. Fitting a Decision-Tree algorithm to the Training set
3. Predicting the test result
4. Test accuracy of the result(Creation of Confusion matrix)
5. Visualizing the test set result.
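These five steps correspond closely to a scikit-learn based sketch such as the one below (assuming scikit-learn and matplotlib are installed; the Iris data and the chosen parameters are placeholders, not part of the original example).

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)

# 1. Data pre-processing: scale the features and split into train/test sets
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# 2. Fit a decision tree (ID3-style trees use entropy as the splitting criterion)
tree = DecisionTreeClassifier(criterion="entropy").fit(X_tr, y_tr)

# 3. Predict the test result
y_pred = tree.predict(X_te)

# 4. Test accuracy of the result via a confusion matrix
print(confusion_matrix(y_te, y_pred))

# 5. Visualise the fitted tree
plot_tree(tree)
plt.show()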
Example: Problem on Decision Tree - Consider a dataset based on which we will determine whether to play football or not.
_OUTLOOK_        PLAY = YES   PLAY = NO   TOTAL
_SUNNY_               3            2         5
_OVERCAST_            4            0         4
_RAINY_               2            3         5
Total                                        14
Now we have to calculate the average weighted entropy, i.e., the sum over each feature value of its weight multiplied by its entropy.

E(S, _OUTLOOK_) = (5/14)*E(3,2) + (4/14)*E(4,0) + (5/14)*E(2,3)
= (5/14)(−(3/5)log(3/5) − (2/5)log(2/5)) + (4/14)(0) + (5/14)(−(2/5)log(2/5) − (3/5)log(3/5))
= 0.693
The next step is to find the information gain. It is the difference between the parent entropy and the average weighted entropy found above. The parent entropy here is E(S) = E(9, 5) = −(9/14)log(9/14) − (5/14)log(5/14) ≈ 0.94, since the full dataset has 9 YES and 5 NO examples.

IG(S, _OUTLOOK_) = 0.94 − 0.693 = 0.247
Similarly find Information gain for TEMP., Humidity, and Windy.
IG(S, TEMP.) = 0.940 - 0.911 = 0.029
IG(S, Humidity) = 0.940 - 0.788 = 0.152
IG(S, Windy) = 0.940 - 0.8932 = 0.048
Now select the feature having the largest information gain. Here it is _OUTLOOK_, so it forms the first node (root node) of our decision tree.
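The numbers above can be checked with a short Python sketch that computes the entropy and the information gain directly from the 14-row table; the code is illustrative and not part of the original derivation.

import math

def entropy(yes, no):
    total = yes + no
    terms = [c / total for c in (yes, no) if c > 0]
    return -sum(p * math.log2(p) for p in terms)

# counts (YES, NO) per Outlook value from the table above
outlook = {"SUNNY": (3, 2), "OVERCAST": (4, 0), "RAINY": (2, 3)}
n = 14

parent = entropy(9, 5)                          # E(S) ≈ 0.940
weighted = sum((y + no) / n * entropy(y, no) for y, no in outlook.values())
print(round(parent, 3), round(weighted, 3), round(parent - weighted, 3))
# -> 0.94 0.694 0.247  (the text rounds the middle value to 0.693)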
Now our data looks as follows:
The next step is to find the next node in our decision tree. Now we will find
one under _SUNNY_. We have to determine which of the following TEMP.,
Humidity or Wind has higher information gain.
TEMP.            PLAY = YES   PLAY = NO   TOTAL
_HOT_                 0            2         2
_COOL_                1            1         2
_MILD_                1            0         1
Total                                         5
E(_SUNNY_, TEMP.) = (2/5)*E(0,2) + (2/5)*E(1,1) + (1/5)*E(1,0)=2/5=0.4
Now calculate information gain.
IG(_SUNNY_, TEMP.) = 0.971–0.4 =0.571
Similarly we get
IG(_SUNNY_, Humidity) = 0.971
IG(_SUNNY_, Windy) = 0.020
Here IG(_SUNNY_, Humidity) is the largest value. So Humidity is the node
that comes under _SUNNY_.
HUMIDITY         PLAY = YES   PLAY = NO   TOTAL
HIGH                  0            3         3
NORMAL                2            0         2
Total                                         5
From the above table for humidity, we can say that play will occur if humidity is normal and will not occur if it is high. Similarly, find the nodes under _RAINY_.
Note: A branch with entropy more than 0 needs further splitting.
Finally, our decision tree will look as below:
y = b0+b1X1+b2X2+…….+bnXn
• In Logistic Regression, y can only be between 0 and 1; so let us first form the odds y/(1−y), which is 0 for y = 0 and infinity for y = 1.
• But we need a range between −infinity and +infinity, so we take the logarithm of the odds, and the equation becomes: log[y/(1−y)] = b0 + b1X1 + b2X2 + ... + bnXn
The above equation is the final equation for Logistic Regression.
The statistical technique known as simple linear regression models the relationship between a numerical response and a single numerical or categorical predictor, while multiple regression models the relationship between a single numerical response and several numerical and/or categorical predictors. What should be done, however, when the predictors are odd (nonlinear, with an intricate dependence structure, and so on), or when the response is unusual (categorical, count data, and so on)? In such cases we deal with odds, which are another method of measuring the likelihood of an event and are frequently applied in the context of gambling (and of logistic regression).
Odds: for some event E,

odds(E) = P(E)/P(E^c) = P(E)/(1 − P(E))

Similarly, if we are told the odds of E are x to y, then

odds(E) = x/y = {x/(x + y)}/{y/(x + y)}

which implies P(E) = x/(x + y) and P(E^c) = y/(x + y).
Logistic regression is a statistical approach for modelling a binary categorical
variable using numerical and categorical predictors, and this idea of Odds is
commonly employed in it. We suppose the outcome variable was generated by
a binomial distribution, and we wish to build a model with p as the probability
of success for a given collection of predictors. There are other alternatives, but
the logit function is the most popular.
Logit function: logit (p) = log {p/(1 – p)} , for 0 ≤ p ≤ 1
Example-1: In a survey of 250 customers of an auto dealership, the service
department was asked if they would tell a friend about it. The number of people
who said "yes" was 210, where "p" is the percentage of customers in the group
from which the sample was taken who would answer "yes" to the question. Find
the sample odds and sample proportion.
Solution: The number of customers who would respond Yes in an simple
random sample (SRS) of size n has the binomial distribution with parameters n
and p. The sample size of customers is n = 250, and the number who responded
Yes is the count X = 210. Therefore, the sample proportion is p’=210/250 =
0.84
Since logistic regression works with odds rather than proportions, we need to calculate the odds. The odds are simply the ratio of the proportions for the two possible outcomes. If p' is the proportion for one outcome, then 1 − p' is the proportion for the second outcome:

odds = p' / (1 − p')
A similar formula for the population odds is obtained by substituting p for p’ in
this expression
Odds of responding Yes. For the customer service data, the proportion of
customers who would recommend the service in the sample of customers is p’
= 0.84, so the proportion of customers who would not recommend the service
department will be 1 – p’ i.e. 1 – p’ = 1 - 0.84 = 0.16
Therefore, the odds of recommending the service department are

odds = p'/(1 − p') = 0.84/0.16 = 5.25
When people speak about odds, they often round to integers or fractions. If we
round 5.25 to 5 = 5/1, we would say that the odds are approximately 5 to 1 that
a customer would recommend the service to a friend. In a similar way, we could
describe the odds that a customer would not recommend the service as 1 to 5.
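The arithmetic of this example translates directly into a few lines of Python; the figures are taken from the dealership survey above, and the logit is added only to show how the odds feed into logistic regression.

import math

n, x = 250, 210                    # sample size and number of "Yes" responses
p_hat = x / n                      # sample proportion p' = 0.84
odds = p_hat / (1 - p_hat)         # odds = p' / (1 - p') = 5.25
logit = math.log(odds)             # logit(p') = log(p'/(1 - p')) ≈ 1.658

print(round(p_hat, 2), round(odds, 2), round(logit, 3))   # 0.84 5.25 1.658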
☞ Check Your Progress 6
Q1 Odds of drawing a heart. If you deal one card from a standard deck, the
probability that the card is a heart is 13/52 = 1/4.
(a) Find the odds of drawing a heart.
(b) Find the odds of drawing a card that is not a heart.
There are many ways to put these points into groups, but the question is which
is the best and how do we find it?
NOTE: We call this decision boundary a "straight line" because we are plotting
data points on a two-dimensional graph. If there are more dimensions, we call
it a "hyperplane."
Fig4. Hyperplane
SVM is all about finding the hyperplane with the largest margin between the two classes; such a hyperplane is the best hyperplane. This is done by considering the many hyperplanes that separate the labels and picking the one that is farthest from the nearest data points, i.e., the one with the biggest margin.
[Figure: The optimal hyperplane with the maximum margin; the training points closest to it on either side are the support vectors.]
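As a rough sketch of this idea (assuming scikit-learn is available; the toy points are invented), a linear SVM can be fitted and its support vectors inspected as follows.

import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable groups of 2-D points (illustrative only)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The points that define the maximum-margin hyperplane
print(clf.support_vectors_)
# Coefficients of the separating hyperplane w.x + b = 0
print(clf.coef_, clf.intercept_)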
10.11 SOLUTIONS/ANSWERS
☞ Check Your Progress 1
1. Compare between Supervised and Un-Supervised Learning.
Solution : Refer to section 10.3
2. List the Steps Involved in Supervised Learning
Solution : Refer to section 10.3
3. What are the Common Issues Faced While Using Supervised Learning
Solution : Refer to section 10.3
☞ Check Your Progress 2
1. Compare between Multi Class and Multi Label Classification
Solution : Refer to section 10.4
2. Compare between structured and unstructured data