
UNIT-2

• The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm.
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms.
• The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity.
• The K-NN algorithm can be used for Regression as well as for Classification.
• Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, since it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images, and based on the most similar features it will place the image in either the cat or the dog category.
• Why do we need a K-NN Algorithm?
• Suppose there are two categories, Category A and Category B, and we have a new data point x1 (input value). In which of these categories will this new data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
Algorithm:
• The K-NN working can be explained on the basis of the below algorithm:
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance between the new data point and each training point.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step-4: Among these K neighbors, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
• Step-6: Our model is ready.
• Example: We have data from a questionnaire survey (asking people's opinion) and from objective testing, with two attributes (1. Acid Durability and 2. Strength), to classify whether a paper tissue is good or not.
• Four training data points are given:
Sr. No.   X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Y = Classification
01        7                                7                                 Bad
02        7                                4                                 Bad
03        3                                4                                 Good
04        1                                4                                 Good
• Training (sample) data set = (X1, X2) = (7,7), (7,4), (3,4), (1,4)
• Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7.
• Testing sample data = (3, 7)
• Without an expensive survey, can we guess/classify whether this new paper tissue is Good or Not Good (Bad)?
• Yes, by applying the KNN algorithm.
• Step 1: Determine K, the number of nearest neighbors. Suppose K = 3.
• Step 2: Calculate the distance between the query instance (testing sample) and all the training samples.
The coordinate of the query instance is (3, 7).
So we compute the Euclidean distances as follows:
1. (7,7) to (3,7): √((7-3)² + (7-7)²) = √16 = 4
2. (7,4) to (3,7): √((7-3)² + (4-7)²) = √25 = 5
3. (3,4) to (3,7): √((3-3)² + (4-7)²) = √9 = 3
4. (1,4) to (3,7): √((1-3)² + (4-7)²) = √13 ≈ 3.6
Step 3: Sort the distances in ascending order and determine the nearest neighbors based on the K-th minimum distance.

X1   X2   Euclidean Distance (ascending)   Included in 3 nearest neighbors?   Category of nearest neighbor
3    4    3                                Yes                                Good
1    4    3.6                              Yes                                Good
7    7    4                                Yes                                Bad
7    4    5                                No                                 Bad

Step 4: Count the number of nearest neighbors in each category.

Conclusion:
• Among the 3 nearest neighbors we have 2 Good and 1 Bad. Since 2 > 1, we conclude that a new paper tissue with X1 = 3 and X2 = 7 should be placed in the Good category. (The same result is reproduced in the code sketch below.)
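As a quick check, the worked tissue example can be reproduced in a few lines of Python. This is a minimal sketch (NumPy assumed available), not part of the original slides:

```python
import numpy as np

# Training samples: (Acid Durability, Strength) -> class
X_train = np.array([[7, 7], [7, 4], [3, 4], [1, 4]])
y_train = np.array(["Bad", "Bad", "Good", "Good"])

query = np.array([3, 7])   # new paper tissue: X1=3, X2=7
K = 3

# Euclidean distance from the query to every training sample
distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
# -> [4.0, 5.0, 3.0, 3.61]

# Labels of the K nearest neighbours, then a majority vote
nearest = np.argsort(distances)[:K]
labels, counts = np.unique(y_train[nearest], return_counts=True)
print(labels[counts.argmax()])   # prints "Good" (2 Good vs 1 Bad)
```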
The KNN Algorithm:
1. Load the data
2. Initialize K to your chosen number of neighbors
3. For each example in the data
• 3.1 Calculate the distance between the query example
and the current example from the data.
• 3.2 Add the distance and the index of the example to an
ordered collection
4. Sort the ordered collection of distances and indices from
smallest to largest (in ascending order) by the distances
5. Pick the first K entries from the sorted collection
6. Get the labels of the selected K entries
7. If regression, return the mean of the K labels
8. If classification, return the mode of the K labels
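The eight steps above translate almost line for line into Python. The following is an illustrative sketch with my own function names (not code from the slides), handling both classification (mode of the K labels) and regression (mean of the K labels):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(data, query, k, regression=False):
    """data: list of (feature_vector, label) pairs; query: feature vector."""
    # Steps 3-4: distance to every example, kept in ascending order
    neighbors = sorted((euclidean(features, query), label)
                       for features, label in data)
    # Steps 5-6: labels of the first K entries
    k_labels = [label for _, label in neighbors[:k]]
    # Step 7: mean for regression; Step 8: mode for classification
    if regression:
        return sum(k_labels) / k
    return Counter(k_labels).most_common(1)[0][0]

# Example with the paper-tissue data:
data = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
print(knn_predict(data, (3, 7), k=3))   # -> "Good"
```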
• How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K:
1. There is no particular way to determine the best value for "K", so we need to try several values and pick the best one (a cross-validation sketch follows this list).
2. The most commonly preferred value for K is 5.
3. A very low value of K, such as K = 1 or K = 2, can be noisy and makes the model sensitive to outliers.
4. Large values of K give more stable predictions, but they can include points from other categories and blur the decision boundary.
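One common way to put point 1 into practice is to try several values of K and keep the one with the best cross-validated accuracy. A hedged sketch using scikit-learn (the Iris dataset is only a placeholder; substitute your own X and y):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # placeholder dataset for illustration

scores = {}
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy for this value of K
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```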
Advantages of KNN Algorithm:
1. It is simple to implement.
2. It is robust to noisy training data.
3. It can be more effective when the training data is large.

Disadvantages of KNN Algorithm:
1. The value of K always needs to be determined, which may be complex at times.
2. The computation cost is high, because the distance from the new data point to all the training samples must be calculated.
3. The algorithm gets significantly slower as the number of examples and/or predictors (independent variables) increases.
Support Vector Machine Algorithm

• Support Vector Machine (SVM) is one of the most popular Supervised Learning algorithms, used for Classification as well as Regression problems.
• However, it is primarily used for Classification problems in Machine Learning.
• The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point into the correct category in the future.
• This best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane.
• These extreme cases are called support vectors.
• In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate.
• Then we perform classification by finding the hyperplane that differentiates the two classes very well.
• We choose the hyperplane whose distance to the nearest data point on each side is maximized.
• If such a hyperplane exists, it is known as the maximum-margin hyperplane (hard margin).
• The SVM algorithm can be used for face detection, image classification, text categorization, etc.
• Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm.
• We first train our model with many images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. Because SVM creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), it will look at the extreme cases of cat and dog.
• On the basis of the support vectors, it will classify it as a cat.
• Consider the below diagram:
Types of SVM

• Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.
• The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.
• We always create the hyperplane that has the maximum margin, i.e. the maximum distance to the nearest data points of each class.
Support Vectors:
• The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
• Linear SVM: Suppose we have a dataset with two tags (green and blue) and two features x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue. Consider the below image:
• Since this is a 2-D space, we can easily separate these two classes with just a straight line.
• But there can be multiple lines that separate these classes. Consider the below image:
• Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
• The SVM algorithm finds the points of each class that are closest to the boundary. These points are called support vectors.
• The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
• The hyperplane with maximum margin is called the optimal hyperplane.
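A brief, illustrative scikit-learn sketch of these ideas: a linear SVM is fitted to two separable blobs, and the fitted model exposes the support vectors that fix the maximum-margin hyperplane. The data here is synthetic and only for demonstration:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable clusters (the "green" and "blue" classes of the slides)
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)

clf = SVC(kernel="linear", C=1.0)   # linear kernel for linearly separable data
clf.fit(X, y)

print(clf.support_vectors_)         # the extreme points that define the margin
print(clf.coef_, clf.intercept_)    # w and b of the hyperplane w.x + b = 0
```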
• Non-Linear SVM: If the data is linearly arranged, we can separate it using a straight line, but for non-linear data we cannot draw a single straight line.
• Consider the below image:
• To separate these data points, we need to add one more dimension. For linear data we used the two dimensions x and y, so for non-linear data we add a third dimension z, calculated as:

z = x² + y²

• By adding the third dimension, the sample space becomes as shown in the image below.
• Since we are now in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes a circle of radius 1.
• How can we identify the right hyper-plane? Don't worry, it's not as hard as you think!
• Scenario 1 (Identify the right hyper-plane): Here, we have three hyper-planes (A, B, and C). Now, identify the right hyper-plane to classify stars and circles.
• To identify the right hyper-plane: "Select the hyper-plane which segregates the two classes better."
• In this scenario, hyper-plane "B" does this job excellently.
• Scenario 2 (Identify the right hyper-plane): Here, we have three hyper-planes (A, B, and C), and all of them segregate the classes well. Now, how can we identify the right hyper-plane?
• Here, maximizing the distance between the nearest data point (of either class) and the hyper-plane will help us decide the right hyper-plane. This distance is called the Margin.
• Let's look at the below snapshot:
• Above, you can see that the margin for hyper-plane C is higher than for both A and B. Hence, we name C as the right hyper-plane. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane with a low margin, there is a high chance of misclassification.
• Scenario 3 (Identify the right hyper-plane): Some of you may have selected hyper-plane B, as it has a higher margin than A. But here is the catch: SVM selects the hyper-plane which classifies the classes accurately prior to maximizing the margin. Here, hyper-plane B has a classification error and A has classified everything correctly. Therefore, the right hyper-plane is A.
• Scenario 4 (Can we classify the two classes?): Below, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.
• As already mentioned, the one star at the other end is like an outlier for the star class.
• The SVM algorithm has a feature to ignore outliers and find the hyper-plane that has the maximum margin. Hence, we can say that SVM classification is robust to outliers.
• Scenario 5 (Find the hyper-plane to segregate the classes): In the scenario below, we can't have a linear hyper-plane between the two classes, so how does SVM classify them? So far we have only looked at linear hyper-planes.
• SVM can solve this problem easily, by introducing an additional feature. Here we add a new feature z = x² + y². Now let's plot the data points on the x and z axes:
• In the above plot, the points to consider are:
• All values of z are always positive, because z is the squared sum of x and y.
• In the original plot, the red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z.
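The z = x² + y² trick from Scenario 5 can be checked directly in code. In this sketch (synthetic circular data, not from the slides), the two classes are not separable in the (x, y) plane, but after adding z they can be split by a simple threshold on z, i.e. by a linear boundary in the new space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner class (circles near the origin) and outer class (stars further away)
theta = rng.uniform(0, 2 * np.pi, 100)
r_inner, r_outer = rng.uniform(0, 1, 50), rng.uniform(2, 3, 50)
x = np.concatenate([r_inner * np.cos(theta[:50]), r_outer * np.cos(theta[50:])])
y = np.concatenate([r_inner * np.sin(theta[:50]), r_outer * np.sin(theta[50:])])
labels = np.array([0] * 50 + [1] * 50)

# New feature: z is always non-negative, small near the origin, large far away
z = x ** 2 + y ** 2

# A single threshold on z now separates the classes perfectly
threshold = 2.0
pred = (z > threshold).astype(int)
print((pred == labels).mean())   # -> 1.0
```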
• In the SVM classifier, it is now easy to have a linear hyper-plane between these two classes. But another burning question arises: do we need to add this feature manually to obtain a hyper-plane? No, the SVM algorithm has a technique called the kernel trick, which performs this kind of transformation implicitly, as described next.
SVM Kernel:
• The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable problem. It is mostly useful in non-linear separation problems. Simply put, the kernel performs some extremely complex data transformations and then finds out the process to separate the data based on the labels or outputs you have defined.

• Advantages of SVM:
• Effective in high-dimensional cases.
• It is memory efficient, as it uses only a subset of the training points (the support vectors) in the decision function.
• Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels (see the sketch below).
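As a hedged illustration of the last two points: scikit-learn's SVC accepts either a built-in kernel name or a custom kernel function that returns the Gram matrix between two sets of samples. The data here is synthetic concentric circles, used only for demonstration:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Built-in non-linear kernel (RBF) handles the concentric-circle data
rbf_clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(rbf_clf.score(X, y))

# A custom kernel: here simply the linear kernel written by hand
def my_kernel(A, B):
    return np.dot(A, B.T)

custom_clf = SVC(kernel=my_kernel).fit(X, y)
print(custom_clf.score(X, y))   # poor on circles, since this kernel is linear
```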
Sr. No.   Title of Book                                  Author             Publications
01        Machine Learning, Second Edition               Stephen Marsland   CRC Press
02        Introduction to Machine Learning, Third Ed.    Ethem Alpaydin     The MIT Press
03        Machine Learning                               Tom M. Mitchell    McGraw-Hill
Websites:
1. https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
2. https://medium.com/@adi.bronshtein
3. https://www.analyticsvidhya.com/blog/2021/05/knn-the-distance-based-machine-learning-algorithm
4. https://www.tutorialspoint.com/
5. https://towardsdatascience.com/
6. https://people.revoledu.com/kardi/tutorial/KNN/KNN_Numerical-example.html
7. https://medium.com/analytics-vidhya
Decision Tree Classification Algorithm:
• Decision Tree is a Supervised learning technique that can be used for both Classification and Regression problems, but it is mostly preferred for solving Classification problems.
• It uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
• Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the output of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
• In order to build a tree, we use the CART algorithm, which stands for Classification And Regression Tree algorithm.
• Below diagram explains the general structure
of a decision tree:
Decision Tree Terminologies:
• Root Node: Root node is from where the decision tree
starts. It represents the entire dataset, which further gets
divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be split further after reaching a leaf node.
• Splitting: Splitting is the process of dividing the decision
node/root node into sub-nodes according to the given
conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted
branches from the tree.
• Parent/Child node: The root node of the tree is called the
parent node, and other nodes are called the child nodes.
The independent variables are Outlook, Temperature, Humidity, and Wind. The
dependent variable is whether to play football or not.
As the first step, we have to find the Root Node (parent node) for our decision tree. For that, follow the steps below.
Entropy:
• Entropy is a measure of randomness. In other words, it is a measure of unpredictability. We will take a moment here to give entropy a mathematical form for a binary event (like a coin toss, where the output can be either of two events, head or tail):

Entropy = -[probability(a) × log2(probability(a))] - [probability(b) × log2(probability(b))]

where probability(a) is the probability of getting a head and probability(b) is the probability of getting a tail.
What is "Entropy", and what is its function?
• In machine learning, entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.
• It is a measure of the amount of uncertainty in a data set. Entropy controls how a Decision Tree decides to split the data; it actually affects how a Decision Tree draws its boundaries.
Example: Given the set S = {a, a, a, b, b, b, b, b}
Total instances: 8
Instances of a: 3
Instances of b: 5

Entropy(S) = -(3/8)·log2(3/8) - (5/8)·log2(5/8)
           = -0.375 × (-1.415) - 0.625 × (-0.678)
           = 0.530 + 0.424
           = 0.954 bits
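The arithmetic above is easy to verify in code. A small helper (the function name is mine, not from the slides) that computes the entropy of a class distribution given as raw counts:

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# S = {a, a, a, b, b, b, b, b}: 3 a's and 5 b's
print(round(entropy([3, 5]), 3))   # -> 0.954
```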
Information Gain:
• The measure we will use, called information gain, is simply the expected reduction in entropy caused by partitioning the data set according to an attribute.
• The information gain, Gain(S, A), of an attribute A relative to a collection of data S, is defined below.
• Information gain is used for determining the best features/attributes that render maximum information about a class.
• It follows the concept of entropy, aiming to decrease the level of entropy from the root node to the leaf nodes.
• Information gain computes the difference between the entropy before and after a split, and so measures the reduction in impurity of the class elements.
• Information Gain = Entropy before splitting - Entropy after splitting
The expected information needed to classify a tuple in D is given by

Info(D) = - Σ (i = 1 to m) [ pi · log2(pi) ]        ......(8.1)

(Information gain is calculated by comparing the entropy of the dataset before and after a transformation.)
• Here pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
• A log function to base 2 is used because the information is encoded in bits.
• Info(D) is also known as the entropy of D.
• How much more information would we still need (after partitioning on attribute A) to arrive at an exact classification? This amount is measured by

InfoA(D) = Σ (j = 1 to v) [ (|Dj| / |D|) · Info(Dj) ]        ......(8.2)

• The term |Dj| / |D| acts as the weight of the jth partition.
• InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A.
• The smaller the expected information (still) required, the greater the purity of the partitions.
• Information gain is defined as the difference between the original information requirement and the new requirement (i.e., obtained after partitioning on A). It is given by

Gain(A) = Info(D) - InfoA(D)        ......(8.3)

• In other words, Gain(A) tells us how much would be gained by branching on A.
• It is the expected reduction in the information requirement caused by knowing the value of A.
• The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
Table 8.1 Class-Labeled Training Tuples

RID   Age           Income   Student   Credit_rating   Class: buys_computer
1     Youth         High     No        Fair            No
2     Youth         High     No        Excellent       No
3     Middle_aged   High     No        Fair            Yes
4     Senior        Medium   No        Fair            Yes
5     Senior        Low      Yes       Fair            Yes
6     Senior        Low      Yes       Excellent       No
7     Middle_aged   Low      Yes       Excellent       Yes
8     Youth         Medium   No        Fair            No
9     Youth         Low      Yes       Fair            Yes
10    Senior        Medium   Yes       Fair            Yes
11    Youth         Medium   Yes       Excellent       Yes
12    Middle_aged   Medium   No        Excellent       Yes
13    Middle_aged   High     Yes       Fair            Yes
14    Senior        Medium   No        Excellent       No
Example 8.1: Induction of a decision tree using information gain.

Solution: Table 8.1 presents a training set, D.
• In this example, each attribute is discrete-valued. (Continuous-valued attributes have been generalized.)
• The class label attribute, buys_computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (i.e., m = 2).
• Let class C1 correspond to yes and class C2 correspond to no.
• There are 9 tuples of class yes and 5 tuples of class no.
• A (root) node N is created for the tuples in D.
• To find the splitting criterion for these tuples, we must compute the information gain of each attribute.
• We first use Eq. (8.1) to compute the expected information needed to classify a tuple in D:

Info(D) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940 bits
• Next, we need to compute the expected information requirement for each attribute.
• Let's start with the attribute age. We need to look at the distribution of yes and no tuples for each category of age.
• For the age category "youth," there are 2 yes tuples and 3 no tuples.
• For the category "middle_aged," there are 4 yes tuples and 0 no tuples.
• For the category "senior," there are 3 yes tuples and 2 no tuples.
• Using Eq. (8.2), the expected information needed to classify a tuple in D if the tuples are partitioned according to age is:

Info_age(D) = (5/14)·[-(2/5)log2(2/5) - (3/5)log2(3/5)]
            + (4/14)·[-(4/4)log2(4/4)]
            + (5/14)·[-(3/5)log2(3/5) - (2/5)log2(2/5)]
            = 0.694 bits

• Hence, the gain in information from such a partitioning would be:

Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits
• Next, we compute the expected information requirement for the attribute income.
• We look at the distribution of yes and no tuples for each category of income.
• For the income category "high," there are 2 yes tuples and 2 no tuples.
• For the income category "medium," there are 4 yes tuples and 2 no tuples.
• For the income category "low," there are 3 yes tuples and 1 no tuple.
• Using Eq. (8.2), the expected information needed to classify a tuple in D if the tuples are partitioned according to income is:

Info_income(D) = (4/14)·[-(2/4)log2(2/4) - (2/4)log2(2/4)]
               + (6/14)·[-(4/6)log2(4/6) - (2/6)log2(2/6)]
               + (4/14)·[-(3/4)log2(3/4) - (1/4)log2(1/4)]
               = (4/14)·[1.0] + (6/14)·[0.666×0.585 + 0.333×1.585] + (4/14)·[0.75×0.415 + 0.25×2.0]
               = 0.2857 + 0.3934 + 0.2317
               = 0.9108 bits

• Hence, the gain in information from such a partitioning would be:

Gain(income) = Info(D) - Info_income(D) = 0.940 - 0.9108 = 0.029 bits
• Next, we compute the expected information requirement for the attribute student.
• We look at the distribution of yes and no tuples for each category of student.
• For the student category "yes," there are 6 yes tuples and 1 no tuple.
• For the student category "no," there are 3 yes tuples and 4 no tuples.
• Using Eq. (8.2), the expected information needed to classify a tuple in D if the tuples are partitioned according to student is:

Info_student(D) = (7/14)·[-(6/7)log2(6/7) - (1/7)log2(1/7)]
                + (7/14)·[-(3/7)log2(3/7) - (4/7)log2(4/7)]
                = 0.5·(0.1906 + 0.4009) + 0.5·(0.5238 + 0.4613)
                = 0.2957 + 0.4925
                = 0.788 bits

• Hence, the gain in information from such a partitioning would be:

Gain(student) = Info(D) - Info_student(D) = 0.940 - 0.788 = 0.152 bits
• Similarly, we can compute Gain(credit_rating) = 0.048 bits.
• Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
• Node N is labeled with age, and branches are grown for each of the attribute's values.
• The tuples are then partitioned accordingly, as shown in Figure 8.5.
• Notice that the tuples falling into the partition for age = middle_aged all belong to the same class. Because they all belong to class "yes," a leaf should therefore be created at the end of this branch and labeled "yes."
• The final decision tree returned by the algorithm was shown earlier in Figure 8.2.
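The gains in Example 8.1 can be reproduced with a short script. This is an illustrative sketch (the table is retyped from Table 8.1, and the helper names are mine):

```python
import math
from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"), ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"), ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"), ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"), ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(attr_index):
    labels = [r[-1] for r in rows]
    info_d = entropy(labels)                                  # Eq. (8.1)
    info_a = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        info_a += len(subset) / len(rows) * entropy(subset)   # Eq. (8.2)
    return info_d - info_a                                    # Eq. (8.3)

for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(name, round(gain(i), 3))
# age 0.247 (0.246 in the text, which rounds intermediate values),
# income 0.029, student 0.152, credit_rating 0.048
```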
• Step 1: Calculate the Entropy of one attribute —
Prediction: Clare Will Play Tennis/ Clare Will Not
Play Tennis
• For this illustration, I will use this contingency table
to calculate the entropy of our target variable:
Played? (Yes/No). There are 14 observations (10
“Yes” and 4 “No”). The probability (p) of ‘Yes’ is
0.71428(10/14), and the probability of ‘No’ is
0.28571 (4/14). You can then calculate the entropy
of our target variable using the equation above.
• Step 2: Calculate the entropy for each feature using its contingency table.
• To illustrate, we use Outlook as an example to explain how to calculate its entropy. There are a total of 14 observations. Summing across the rows, we can see that 5 of them belong to Sunny, 4 belong to Overcast, and 5 belong to Rainy. Therefore, we can find the probability of Sunny, Overcast, and Rainy, and then calculate their entropy one by one using the above equation. The calculation steps are shown below.
• Definition: Information Gain is the reduction in entropy achieved when a node is split on an attribute.
• The equation of Information Gain (the information gained about Y from X):

Information Gain(Y, X) = Entropy(Y) - Entropy(Y | X)

• Step 3: Choose the attribute with the largest Information Gain as the root node.
• In this illustration, the information gain of 'Humidity' is the highest at 0.918, so Humidity is chosen as the root node.
• Step 4: A branch with an entropy of 0 is a leaf
node, while a branch with entropy more than 0
needs further splitting.
• Step 5: Nodes are grown recursively in the ID3
algorithm until all data is classified.
• We decided to make the first decision on the basis of Outlook. We could have made our first decision based on Humidity or Wind, but we chose Outlook. Why?
• Because making our decision on the basis of Outlook reduced the randomness in the outcome (which is whether to play or not) more than it would have been reduced by Humidity or Wind.
• Let's understand this with the example here.
• Please refer to the play tennis dataset pasted above.
• What is a good quantitative measure of the
worth of an attribute? Information gain
measures how well a given attribute separates
the training examples according to their target
classification. ID3 uses this information gain
measure to select among the candidate
attributes at each step while growing the tree.
• Information Gain is based on Entropy
• We have data for 14 days. We have only two
outcomes :
• Either we played tennis or we didn’t play.
• In the given 14 days, we played tennis on 9 occasions
and we did not play on 5 occasions.

• Probability of playing tennis:
• Number of favourable events: 9
• Number of total events: 14
• Probability = (Number of favourable events) / (Number of total events) = 9/14 = 0.643
• Now, we will see probability of not playing
tennis.
• Probability of not playing tennis:
Number of favourable events : 5
• Number of total events : 14
• Probability = (Number of favourable events) / (Number of total events)

=5/14
=0.357
• Entropy at source = -(Probability of playing tennis) × log2(Probability of playing tennis) - (Probability of not playing tennis) × log2(Probability of not playing tennis)

E(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14)
     = -0.643 × log2(0.643) - 0.357 × log2(0.357)
     = 0.940

So the entropy of the whole system, before we ask our first question, is 0.940.
Note: Here typically we will take log to base 2
• From the above data for Outlook, we can easily arrive at the following table.
• Now we have to calculate the average weighted entropy, i.e. the sum over branches of the weight of each branch multiplied by its entropy.
• Entropy among the three branches = ((number of sunny days)/(total days) × (entropy when sunny)) + ((number of overcast days)/(total days) × (entropy when overcast)) + ((number of rainy days)/(total days) × (entropy when rainy))

E(S, Outlook) = (5/14)·E(3,2) + (4/14)·E(4,0) + (5/14)·E(2,3)
              = (5/14)×0.971 + (4/14)×0.0 + (5/14)×0.971
              = 0.693

Hence, the gain in information from such a partitioning would be:

Gain(S, Outlook) = E(S) - E(S, Outlook) = 0.940 - 0.693 = 0.247
• Now we will calculate the Information Gain for Humidity:

Humidity   Play = Yes   Play = No   Total
High       3            4           7
Normal     6            1           7

• Entropy among the two branches = ((number of High-humidity days)/(total days) × (entropy when High)) + ((number of Normal-humidity days)/(total days) × (entropy when Normal))

E(S, Humidity) = (7/14)·E(3,4) + (7/14)·E(6,1)
               = (7/14)·(-(3/7)log2(3/7) - (4/7)log2(4/7)) + (7/14)·(-(6/7)log2(6/7) - (1/7)log2(1/7))
               = 0.788

Hence, the gain in information from such a partitioning would be:

Gain(S, Humidity) = E(S) - E(S, Humidity) = 0.940 - 0.788 = 0.152
Information gain for Outlook, Temperature, Humidity, and Wind is as follows:
• Outlook: Information Gain = 0.247
• Humidity: Information Gain = 0.152
• Windy: Information Gain = 0.048
• Temperature: Information Gain = 0.029

• Now select the feature having the largest Information Gain. Here it is Outlook, so it forms the first node (root node) of our decision tree.
• Since the Overcast branch contains only examples of class 'Yes', we can set it to Yes. That means that if the outlook is Overcast, tennis will be played. Now our decision tree looks as follows.
• The next step is to find the next node in our decision tree. We will now find one under the Sunny branch: we have to determine which of Temperature, Humidity or Wind has the higher information gain on the Sunny subset.

Day   Outlook   Temp   Humidity   Wind     Class: Play Tennis?
D1    Sunny     Hot    High       Weak     No
D2    Sunny     Hot    High       Strong   No
D8    Sunny     Mild   High       Weak     No
D9    Sunny     Cool   Normal     Weak     Yes
D11   Sunny     Mild   Normal     Strong   Yes
ID3 Algorithm: partial tree after the first split.
• Root: Outlook, over [D1, D2, ..., D14], [9+, 5-].
• Sunny branch: S_sunny = [D1, D2, D8, D9, D11], [2+, 3-]; split further on Humidity (High / Normal).
• Overcast branch: [D3, D7, D12, D13], [4+, 0-]; leaf "Yes".
• Rain branch: [D4, D5, D6, D10, D14], [3+, 2-]; test for this node still to be chosen.
• Calculate the parent entropy E(Sunny):

E(Sunny) = -(2/5)·log2(2/5) - (3/5)·log2(3/5)
         = -(0.4)×(-1.3219) - (0.6)×(-0.7369)
         = 0.5287 + 0.4421
         = 0.971

• Now calculate the information gain of Humidity, Temperature and Wind on the Sunny subset.
• IG(Sunny, Humidity):

                           Play Tennis?
                           Yes   No   Total
Sunny   Humidity   High    0     3    3
                   Normal  2     0    2
                                      5

E(Sunny, Humidity) = (3/5)·E(0,3) + (2/5)·E(2,0)
                   = (3/5)×0 + (2/5)×0
                   = 0

IG(Sunny, Humidity) = E(Sunny) - E(Sunny, Humidity) = 0.971 - 0 = 0.971
For Humidity, from the above table we can say that play will occur if humidity is Normal and will not occur if it is High. Similarly, we will find the nodes under the Rainy branch.

• IG(Sunny, Temperature):

                           Play Tennis?
                           Yes   No   Total
Sunny   Temp.      Hot     0     2    2
                   Mild    1     1    2
                   Cool    1     0    1
                                      5

E(Sunny, Temperature) = (2/5)·E(0,2) + (2/5)·E(1,1) + (1/5)·E(1,0)
                      = (2/5)×0 + (2/5)·(-(1/2)log2(1/2) - (1/2)log2(1/2)) + (1/5)×0
                      = 0 + (2/5)×1.0 + 0
                      = 0.40

IG(Sunny, Temperature) = E(Sunny) - E(Sunny, Temperature) = 0.971 - 0.40 = 0.571
• IG(Sunny, Wind):

                           Class: Play Tennis?
                           Yes   No   Total
Sunny   Wind       Strong  1     1    2
                   Weak    1     2    3
                                      5

E(Sunny, Wind) = (2/5)·E(1,1) + (3/5)·E(1,2)
               = (2/5)·(-(1/2)log2(1/2) - (1/2)log2(1/2)) + (3/5)·(-(1/3)log2(1/3) - (2/3)log2(2/3))
               = 0.4×1.0 + 0.6×(0.5283 + 0.3900)
               = 0.4 + 0.6×0.918
               = 0.951

IG(Sunny, Wind) = E(Sunny) - E(Sunny, Wind) = 0.971 - 0.951 = 0.020
Now consider the Rain subset:

Day   Outlook   Temp   Humidity   Wind     Class: Play Tennis?
D4    Rain      Mild   High       Weak     Yes
D5    Rain      Cool   Normal     Weak     Yes
D6    Rain      Cool   Normal     Strong   No
D10   Rain      Mild   Normal     Weak     Yes
D14   Rain      Mild   High       Strong   No

E(Rain) = E(3,2)
        = -(3/5)·log2(3/5) - (2/5)·log2(2/5)
        = 0.6×0.7369 + 0.4×1.3219
        = 0.971

Now calculate the information gain of Humidity, Wind and Temperature on the Rain subset.

                              Play Tennis?
                              Yes   No   Total
Outlook   Rain   Humidity
                 High         1     1    2
                 Normal       2     1    3
                                         5

E(Rain, Humidity) = (2/5)·E(1,1) + (3/5)·E(2,1)
                  = (2/5)·(-(1/2)log2(1/2) - (1/2)log2(1/2)) + (3/5)·(-(2/3)log2(2/3) - (1/3)log2(1/3))
                  = 0.4×1.0 + 0.6×(0.3900 + 0.5283)
                  = 0.4 + 0.551
                  = 0.951

IG(Rain, Humidity) = E(Rain) - E(Rain, Humidity) = 0.971 - 0.951 = 0.020
Day   Outlook   Wind     Play Tennis?
D4    Rain      Weak     Yes
D5    Rain      Weak     Yes
D6    Rain      Strong   No
D10   Rain      Weak     Yes
D14   Rain      Strong   No

On the Rain subset, Wind separates the classes perfectly: all 3 Weak days are Yes and both Strong days are No.

E(Rain, Wind) = (3/5)·E(3,0) + (2/5)·E(0,2)
              = (3/5)×0 + (2/5)×0
              = 0

IG(Rain, Wind) = E(Rain) - E(Rain, Wind) = 0.971 - 0 = 0.971

Since IG(Rain, Wind) > IG(Rain, Humidity), Wind is chosen as the splitting attribute under the Rain branch.
ID3 Algorithm: the same tree, with the information gains for the Sunny branch.
• Root: Outlook, over [D1, D2, ..., D14], [9+, 5-].
• Sunny branch: S_sunny = [D1, D2, D8, D9, D11], [2+, 3-]; test for this node chosen below.
• Overcast branch: [D3, D7, D12, D13], [4+, 0-]; leaf "Yes".
• Rain branch: [D4, D5, D6, D10, D14], [3+, 2-]; test chosen above (Wind).

Gain(S_sunny, Humidity) = 0.971 - (3/5)×0.0 - (2/5)×0.0 = 0.971
Gain(S_sunny, Temp.)    = 0.971 - (2/5)×0.0 - (2/5)×1.0 - (1/5)×0.0 = 0.571
Gain(S_sunny, Wind)     = 0.971 - (2/5)×1.0 - (3/5)×0.918 = 0.020

So Humidity is chosen as the test under the Sunny branch.
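The recursive growth of the tree sketched in the diagrams above can be written as a compact ID3-style function. This is an illustrative sketch with my own helper names (rows are dictionaries mapping attribute names to values), not code from the original slides:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # pure node -> leaf (e.g. Overcast -> Yes)
        return labels[0]
    if not attributes:                   # nothing left to split on -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    tree = {}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attributes if a != best]
        tree[value] = id3(subset, rest, target)
    return {best: tree}
```

Calling id3(data, ["Outlook", "Temp", "Humidity", "Wind"], "Play") on the 14-day table (listed again below) reproduces the tree derived above: Outlook at the root, Humidity under Sunny, Wind under Rain, and Overcast as a pure Yes leaf.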
Day   Outlook    Temp   Humidity   Wind     Class: Play Tennis?
D1    Sunny      Hot    High       Weak     No
D2    Sunny      Hot    High       Strong   No
D3    Overcast   Hot    High       Weak     Yes
D4    Rain       Mild   High       Weak     Yes
D5    Rain       Cool   Normal     Weak     Yes
D6    Rain       Cool   Normal     Strong   No
D7    Overcast   Cool   Normal     Strong   Yes
D8    Sunny      Mild   High       Weak     No
D9    Sunny      Cool   Normal     Weak     Yes
D10   Rain       Mild   Normal     Weak     Yes
D11   Sunny      Mild   Normal     Strong   Yes
D12   Overcast   Mild   High       Strong   Yes
D13   Overcast   Hot    Normal     Weak     Yes
D14   Rain       Mild   High       Strong   No
• Outlook: Information Gain = 0.247
• Humidity: Information Gain = 0.152
• Windy: Information Gain = 0.048
• Temperature: Information Gain = 0.029
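Finally, the same 14-day dataset can be handed to an off-the-shelf tree learner. A hedged sketch with scikit-learn (the categorical columns are one-hot encoded with pandas, since scikit-learn's trees expect numeric input):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                 "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                 "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data.drop(columns="Play"))   # one-hot encode the features
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```

Because the one-hot columns force binary splits, the printed tree is shaped differently from the multi-way ID3 tree above, but the first split it typically picks is Outlook_Overcast, reflecting the fact that Overcast is a pure "Yes" branch.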
