Learning
• Learning is essential for unknown environments, i.e., when the designer
lacks omniscience
• Learning is useful as a system construction method, i.e., expose the
agent to reality rather than trying to write it down
• Learning modifies the agent's decision mechanisms to improve
performance
Types of Learning
• Supervised learning:
• Correct answer for each example. Answer can be a numeric
variable, categorical variable etc.
• Both inputs and outputs are given
• The outputs are typically provided by a friendly teacher.
• Unsupervised learning:
• Correct answers not given – just examples
• The agent can learn relationships among its percepts, and how they
change over time
• Reinforcement learning:
• Occasional rewards
• The agent receives some evaluation of its actions (such as a fine for
stealing bananas), but is not told the correct action (such as how to
buy bananas).
Decision Tree
• A decision tree takes as input an object or situation described by a set of
properties, and outputs a yes/no “decision”.
• Decision tree induction is one of the simplest and yet most successful
forms of machine learning. We first describe the representation (the
hypothesis space) and then show how to learn a good hypothesis.
• Each node tests the value of an input attribute
• Branches from the node correspond to possible values of the attribute
• Leaf nodes supply the values to be returned if that leaf is reached
Simple Examples
• Decision: Whether to wait for a table at a restaurant.
1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar: whether the restaurant has a bar for waiting customers.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (None, Some, Full).
6. Price: the restaurant's price range ($, $$, $$$).
7. Raining: whether it is raining outside.
8. Reservation: whether we made a reservation.
9. Type: the kind of restaurant (French, Italian, Thai, Burger).
10. WaitEstimate: 0–10, 10–30, 30–60, or >60 minutes.
Representation
• To draw a decision tree from a dataset of some attributes:
● Each node corresponds to a splitting attribute.
● Each arc is a possible value of that attribute.
● Splitting attribute is selected to be the most informative among
the attributes.
● Entropy is used to measure how informative a node is.
● The algorithm uses the criterion of information gain to determine
the goodness of a split.
● The attribute with the greatest information gain is chosen as the
splitting attribute, and the data set is split on all distinct values of
that attribute.
• Classification Methods
Gini Index
• Used in CART, SLIQ, SPRINT to determine the best split.
• Tree Growing Process:
• Find each predictor’s best split.
• Find the node’s best split : Among the best splits found in step 1,
choose the one that maximizes the splitting criterion.
• Split the node using its best split found in step 2 if the stopping
rules are not satisfied.
• Finding the Best Split: GINI Index
• GINI index for a given node t:
GINI(t) = 1 − Σj [ p(j | t) ]²
where p(j | t) is the relative frequency of class j at node t
● Maximum (1 − 1/n): records equally distributed among n classes
● Minimum (0): all records in one class
Example
Revision Of Classification
Decision Tree
Naïve Bayes
Inductive Learning (Learning From Examples)
Different kinds of learning…
Supervised learning: someone gives us examples with the right answer
(output/label) for those examples. We have to predict the right answer
for unseen examples.
Unsupervised learning: we see examples only but get no feedback (no
labels/output). We need to find patterns in the data.
Semi-supervised learning: a small amount of labeled data plus a large
amount of unlabeled data.
Reinforcement learning: we take actions and get rewards. We have to
learn how to get high rewards.
Inductive Machine Learning Process …
(Process flow) Preparation of Training Data → Learning Scheme OR
Learning Algorithm → Trained Model → Testing and Evaluation on Test
Data → Deploy Trained & Tested Model in Domain
Machine Learning Process
(Diagram: the Learning Scheme OR Learning Algorithm at the center of
the machine learning process)
Supervised Machine Learning Types
Classification
Association
Regression / Numeric Prediction
Also known as predictive learning
Classification Algorithms
There are many algorithms for classification. Some of the
best known are the following:
Decision trees
Naive Bayes classifier
Linear Regression
Logistic regression
Neural networks
Perceptron
Multi Layer Neural Networks
Support vector machines (SVM)
Quadratic classifiers
Instance Based Learning
k-nearest neighbor
Decision Tree Learning
Classification training data may be used to create a
decision tree (model), which consists of:
Nodes:
Represent tests on attributes/features of the training data
Edges:
Correspond to the outcomes of a test and connect to the
next node or leaf
Leaves:
Predict the class (y)
Nominal and numeric attributes
Nominal:
number of children usually equal to the number of attribute values
attribute won’t get tested more than once
Other possibility: division into two subsets
Numeric:
test whether value is greater or less than constant
attribute may get tested several times
Other possibility: three-way split (or multi-way split)
Integer: less than, equal to, greater than
Real: below, within, above
Classification using Decision Tree
Example (Restaurant)
Decision Tree of Restaurant Example
Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type    Est    WillWait
Yes  No   Yes  Yes  Full  $      No    Yes  French  30–60  ?
Criterion for attribute selection
There is no way to efficiently search through all 2^(2^n) possible
trees
Want to get the smallest tree
Heuristic: choose the attribute that produces the
“purest” nodes
Strategy: choose attribute that gives greatest
information gain
Purity check of an attribute
(Figure: examples of pure and impure nodes)
Measuring Purity
Question is How to Measure Purity?
Two Methods are used
Entropy Based Information Gain
Gini Index
Information gain is used in ID3 and its successors, e.g. C4.5 / C5.0
(implemented in Weka as J48)
The Gini index is used in the CART algorithm as
implemented in Python's scikit-learn
Decision Trees
There are four cases to consider while creating a decision tree:
1. If the remaining examples are all positive (or all negative), then answer Yes
or No.
2. If there are some positive and some negative examples, then choose the
best attribute to split them.
3. No examples left: if there are no examples left, it means that no example
has been observed for this combination of attribute values, and we return a
default value calculated from the plurality classification of the parent
node's examples.
4. Noise in the data (no attributes left): if there are no attributes left, but both
positive and negative examples, it means that these examples have exactly
the same description but different classifications. This can happen because
there is error or noise in the data, because the domain is nondeterministic,
or because we can't observe an attribute that would distinguish the
examples. The best we can do is return the plurality classification of the
remaining examples.
A minimal code sketch of this recursion appears below.
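As an illustration only, here is a minimal Python sketch of the four-case recursion
described above, using entropy-based information gain. The helper names
(plurality_value, best_attribute) and the nested-dict tree representation are choices
made for this sketch, not part of any particular library.

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy (in bits) of a list of class labels
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def plurality_value(examples):
    # Most common class among the examples (the "default" answer)
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def best_attribute(attributes, examples):
    # Attribute with the highest information gain
    base = entropy([e["class"] for e in examples])
    def gain(a):
        remainder = 0.0
        for v in set(e[a] for e in examples):
            subset = [e for e in examples if e[a] == v]
            remainder += len(subset) / len(examples) * entropy([e["class"] for e in subset])
        return base - remainder
    return max(attributes, key=gain)

def decision_tree_learning(examples, attributes, parent_examples):
    if not examples:                              # case 3: no examples left
        return plurality_value(parent_examples)
    classes = set(e["class"] for e in examples)
    if len(classes) == 1:                         # case 1: all positive or all negative
        return classes.pop()
    if not attributes:                            # case 4: no attributes left (noise)
        return plurality_value(examples)
    a = best_attribute(attributes, examples)      # case 2: split on the best attribute
    tree = {a: {}}
    for v in set(e[a] for e in examples):
        subset = [e for e in examples if e[a] == v]
        tree[a][v] = decision_tree_learning(subset, [x for x in attributes if x != a], examples)
    return tree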
Example
(Weather Data)
Which attribute to select?
Computing information
Measure information in bits
Formula to calculate Entropy
entropy(p1, p2, ..., pn) = −p1 log2(p1) − p2 log2(p2) − ... − pn log2(pn)
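A quick numeric check of the formula (base-2 logarithms, so the result is in bits),
written as a small Python helper of our own:

from math import log2

def entropy(*probs):
    # entropy(p1, ..., pn) = -sum(pi * log2(pi)), with 0 * log(0) taken as 0
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy(2/5, 3/5))     # 0.971 bits, as in the Outlook = Sunny branch below
print(entropy(0.5, 0.5))     # 1.0 bit: maximal for two equally likely classes
print(entropy(9/14, 5/14))   # 0.940 bits: the whole weather data set (9 yes, 5 no)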
Constructing DT
1. Calculate the entropy of each branch (each value of the feature)
2. Compute the node's average entropy: the weighted average of the
branch entropies
3. Compute the node's information gain: the total information minus the
node's average entropy
4. Select the node (attribute) with the highest information gain
5. Expand the branches that still contain mixed classes and repeat
steps 1–4
Example: attribute Outlook
Outlook = Sunny:
info([2,3]) = entropy(2/5, 3/5) = −(2/5) log(2/5) − (3/5) log(3/5) = 0.971 bits
Outlook = Overcast:
info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
(Note: log(0) is normally undefined, but 0 · log(0) is taken to be 0.)
Outlook = Rainy:
info([3,2]) = entropy(3/5, 2/5) = −(3/5) log(3/5) − (2/5) log(2/5) = 0.971 bits
Expected information for the attribute:
info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971
= 0.693 bits
Which attribute to select?
Computing information gain
• Information gain: information before splitting – information
after splitting
• Gain(Outlook) = info([9,5]) − info([2,3],[4,0],[3,2])
= [−(9/14) log(9/14) − (5/14) log(5/14)] − 0.693
= 0.940 − 0.693
= 0.247 bits
• Information gain for attributes from weather data:
Gain(Outlook ) = 0.247 bits
Gain(Temperature ) = 0.029 bits
Gain(Humidity ) = 0.152 bits
Gain(Windy ) = 0.048 bits
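The gains above can be reproduced directly from the class counts in the weather
data. A small sketch, with the [yes, no] counts per attribute value hard-coded from
the frequency table shown later in these notes:

from math import log2

def entropy_from_counts(counts):
    # Entropy (in bits) of a class-count vector such as [9, 5]
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(branch_counts):
    # Gain = entropy of the whole node - weighted average entropy of its branches
    totals = [sum(c) for c in branch_counts]
    n = sum(totals)
    overall = [sum(col) for col in zip(*branch_counts)]   # e.g. [9, 5]
    remainder = sum(t / n * entropy_from_counts(c) for t, c in zip(totals, branch_counts))
    return entropy_from_counts(overall) - remainder

attributes = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],   # Sunny, Overcast, Rainy
    "Temperature": [[2, 2], [4, 2], [3, 1]],   # Hot, Mild, Cool
    "Humidity":    [[3, 4], [6, 1]],           # High, Normal
    "Windy":       [[6, 2], [3, 3]],           # False, True
}
for name, counts in attributes.items():
    print(f"Gain({name}) = {info_gain(counts):.3f} bits")
# Expected: Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048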
Which attribute to select?
Gain(Outlook ) = 0.247 bits
Gain(Temperature ) = 0.029 bits
Gain(Humidity ) = 0.152 bits
Gain(Windy ) = 0.048 bits
Continuing to split
gain(Temperature ) = 0.571 bits
gain(Humidity ) = 0.971 bits
gain(Windy ) = 0.020 bits
Final decision tree
Note: not all leaves need to be pure; sometimes
identical instances have different classes
Splitting stops when data can’t be split any further
Activity: Restaurant
• Construct the decision tree on a page and submit it
• Clearly calculate the information gain of every feature
• Submit by this Sunday
CART Algorithm
Many alternative measures to information gain exist
Most popular alternative: the Gini index, used e.g. in CART
(Classification And Regression Trees)
The average Gini index is used (instead of average entropy / information)
The average Gini index is minimized (rather than maximizing information gain)
Impurity measure for each feature value:
Gini(t) = 1 − Σj [ p(j | t) ]²
Average Gini for an attribute:
Gini(A) = Σv (nv / n) × Gini(v), summed over the values v of attribute A
CART Algorithm
• Gini(Outlook = Sunny) = 1 − (2/5)² − (3/5)² = 0.48
• Gini(Outlook = Overcast) = 1 − (4/4)² − (0/4)² = 0
• Gini(Outlook = Rainy) = 1 − (3/5)² − (2/5)² = 0.48
Average Gini for the attribute:
Gini(Outlook) = (5/14) × 0.48 + (4/14) × 0 + (5/14) × 0.48
= 0.342
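The same numbers can be checked with a few lines of Python, again hard-coding the
[yes, no] counts from the weather data:

def gini_from_counts(counts):
    # Gini impurity of a class-count vector, e.g. [2, 3] -> 0.48
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def average_gini(branch_counts):
    # Weighted average Gini over the branches of one attribute
    n = sum(sum(c) for c in branch_counts)
    return sum(sum(c) / n * gini_from_counts(c) for c in branch_counts)

outlook = [[2, 3], [4, 0], [3, 2]]   # Sunny, Overcast, Rainy: [yes, no]
print([round(gini_from_counts(c), 2) for c in outlook])   # [0.48, 0.0, 0.48]
print(round(average_gini(outlook), 3))                    # 0.343 (0.342 above, after truncation)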
Which attribute to select?
Activity: Weather
• Construct the decision tree
on a page and submit it
• Clearly calculate the Gini
measure for every feature
• Submit by this Sunday
Properties of Good Measure
We expect an information measure to have the
following properties:
1. When the number of either yes's or no's is
zero, the information is zero.
2. When the numbers of yes's and no's are equal,
the information reaches a maximum.
3. The information should obey the multistage
property.
Information gain based on entropy has all of the
above properties.
ISSUES With Training Data
Missing data: not all attribute values are known; the learner needs a way to handle
them during both training and testing.
Multivalued attributes: when an attribute has many values (e.g., ExactTime or an ID),
information gain gives an inappropriately high indication of the attribute's usefulness.
One solution is to use the gain ratio, as in C4.5.
Continuous and integer-valued input attributes: continuous or integer-valued
attributes such as Height and Weight have an infinite set of possible values.
Instead, a split point is chosen; for example, at a given node in the tree the test might be
Weight > 160. Candidate split points are found by first sorting the values.
Splitting on numeric attributes is the most expensive part of real-world decision tree
learning applications; a sketch of the split-point search follows below.
Continuous-valued output attributes: if we are trying to predict a numerical output
value, such as the price of an apartment, then we need a regression tree rather than a
classification tree.
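As an illustration of the split-point idea (not any particular package's
implementation), here is a minimal sketch that sorts a numeric attribute and scores
every midpoint between adjacent distinct values by information gain. The Weight
values and labels below are made up purely for illustration:

from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def best_split_point(values, labels):
    # Return (threshold, gain) for the best binary test "value > threshold"
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        remainder = (len(left) / len(pairs)) * entropy(left) \
                  + (len(right) / len(pairs)) * entropy(right)
        if base - remainder > best[1]:
            best = (t, base - remainder)
    return best

weights = [120, 135, 150, 155, 162, 170, 180, 200]
labels  = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
print(best_split_point(weights, labels))   # threshold 158.5, gain 1.0 bit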
DT Discussion
A decision-tree learning system for real-world applications must be able to
handle all of these problems.
Handling continuous-valued variables is especially important, because both
physical and financial processes provide numerical data. Several commercial
packages have been built that meet these criteria.
In many areas of industry and commerce, decision trees are usually the first
method tried when a classification method is to be extracted from a data set.
One important property of decision trees is that it is possible for a human to
understand the reason for the output of the learning algorithm. (Indeed, this is
a legal requirement for financial decisions that are subject to anti-
discrimination laws.)
This is a property not shared by some other representations, such as neural
networks.
Loading Data
import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
# Load the weather data set from CSV and inspect it
df = pandas.read_csv("weatherdata_converted.csv")
print(df)
print(df.info())
Splitting the data
features=['Outlook', 'Temperature', 'Humidity', 'Windy']
X = df[features]
y = df['Play']
print(X)
print(y)
print(df.groupby('Play').size())
Decision Tree (Entropy)
# Train a decision tree using entropy-based information gain
dtree1 = DecisionTreeClassifier(criterion="entropy")
dtree1.fit(X, y)
# Plot the learned tree with the feature names
tree.plot_tree(dtree1, feature_names=features)
Decision Tree (Gini Index)
# Classifier and prediction: train a tree using the Gini index (sklearn's default criterion)
dtree = DecisionTreeClassifier(criterion="gini")
dtree = dtree.fit(X, y)
#tree.plot_tree(dtree, feature_names=features)
print(X.iloc[1,:])   # one example row, used for prediction below
Prediction
# Predict for rows 1-2 and compare with the true labels
print(dtree.predict(X.iloc[1:3,:]))
print(y.iloc[1:3])
# Accuracy measured on the training data itself
x_predict = dtree.predict(X)
print(x_predict)
print("Accuracy : ", metrics.accuracy_score(y, x_predict)*100)
Statistical modeling (Naïve Bayes)
Two assumptions: Attributes are
equally important
statistically independent (given the class value)
I.e., knowing the value of one attribute says nothing
about the value of another (if the class is known)
Independence assumption is never correct!
But … this scheme works well in practice
Bayes’s rule
Probability of event H given evidence E:
Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]
A priori probability of H: Pr[H]
Probability of event before evidence is seen
A posteriori probability of H: Pr[H | E]
Probability of event after evidence is seen
Naïve Bayes for classification
Classification learning: what’s the probability
of the class given an instance?
• Evidence E = instance’s non-class attribute
values
• Event H = class value of instance
Naïve assumption: evidence splits into parts
(i.e. attributes) that are independent
Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × ... × Pr[En | H] × Pr[H] / Pr[E]
Weather data example
Evidence E (a new day):
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?
Probability of class "yes":
P(yes | E) = P(Outlook = Sunny | yes)
           × P(Temperature = Cool | yes)
           × P(Humidity = High | yes)
           × P(Windy = True | yes)
           × P(yes) / P(E)
           = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)
The weather data (training set):
Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
Prior Probabilities for weather data
(Training)
Probabilities for weather data
Attribute     Value     Yes count  No count  P(value | yes)  P(value | no)
Outlook       Sunny     2          3         2/9             3/5
Outlook       Overcast  4          0         4/9             0/5
Outlook       Rainy     3          2         3/9             2/5
Temperature   Hot       2          2         2/9             2/5
Temperature   Mild      4          2         4/9             2/5
Temperature   Cool      3          1         3/9             1/5
Humidity      High      3          4         3/9             4/5
Humidity      Normal    6          1         6/9             1/5
Windy         False     6          2         6/9             2/5
Windy         True      3          3         3/9             3/5
Priors:       Play = yes: 9/14     Play = no: 5/14
• A new day:
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?
Likelihood of "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
Likelihood of "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Normalizing the likelihoods so that the probabilities sum to 1:
P(yes | E) = 0.0053 / (0.0053 + 0.0206) ≈ 20.5%
P(no | E)  = 0.0206 / (0.0053 + 0.0206) ≈ 79.5%  →  predict NO
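A few lines of Python reproduce this hand calculation; the conditional probabilities
are copied from the table above for the new day (Sunny, Cool, High, True):

p_given_yes = [2/9, 3/9, 3/9, 3/9]   # P(value | yes) for Outlook, Temp, Humidity, Windy
p_given_no  = [3/5, 1/5, 4/5, 3/5]   # P(value | no)
prior_yes, prior_no = 9/14, 5/14

like_yes = prior_yes
for p in p_given_yes:
    like_yes *= p
like_no = prior_no
for p in p_given_no:
    like_no *= p

print(round(like_yes, 4), round(like_no, 4))                   # 0.0053 0.0206
total = like_yes + like_no
print(round(like_yes / total, 3), round(like_no / total, 3))   # 0.205 0.795 -> predict "no"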
How Naïve Bayes Deals….
Zero Frequency Problem
Missing Values
Numeric Values
The “zero-frequency problem”
What if an attribute value doesn’t occur with every class
value?
(e.g. “Outlook = Overcast” for class “no”)
Probability will be zero!
A posteriori probability will also be zero!
(No matter how likely the other values are!)
Remedy: add 1 to the count for every attribute value-
class combination (Laplace estimator)
Result: probabilities will never be zero!
(also: stabilizes probability estimates)
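For instance, applying the Laplace estimator to the Outlook counts for class "no"
(Sunny 3, Overcast 0, Rainy 2) looks like the sketch below; adding 1 per value is the
simplest choice, and other small constants can be used instead:

# Raw Outlook counts for class "no": Sunny 3, Overcast 0, Rainy 2
counts = {"Sunny": 3, "Overcast": 0, "Rainy": 2}

# Laplace estimator: add 1 to every attribute value / class count
smoothed = {v: c + 1 for v, c in counts.items()}
total = sum(smoothed.values())                 # 5 examples + 3 values = 8
probs = {v: c / total for v, c in smoothed.items()}
print(probs)   # {'Sunny': 0.5, 'Overcast': 0.125, 'Rainy': 0.375} - no zero probabilities left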
Missing values
Training: instance is not included in frequency
count for attribute value-class combination
Classification: attribute will be omitted from
calculation
Example:
Outlook Temp. Humidity Windy Play
? Cool High True ?
Numeric attributes
Outlook Temperature Humidity Windy Play
Sunny 85 85 FALSE no
Sunny 80 90 TRUE no
Overcast 83 86 FALSE yes
Rainy 70 96 FALSE yes
Rainy 68 80 FALSE yes
Rainy 65 70 TRUE no
Overcast 64 65 TRUE yes
Sunny 72 95 FALSE no
Sunny 69 70 FALSE yes
Rainy 75 80 FALSE yes
Sunny 75 70 TRUE yes
Overcast 72 90 TRUE yes
Overcast 81 75 FALSE yes
Rainy 71 91 TRUE no
Numeric attributes
• Usual assumption: attributes have a normal or Gaussian probability
distribution (given the class)
• The probability density function for the normal distribution is defined
by two parameters:
• Sample mean μ
• Standard deviation σ
• The density function is then
f(x) = (1 / (σ √(2π))) × exp(−(x − μ)² / (2σ²))
Statistics for weather data
Outlook:      Sunny 2/9 (yes), 3/5 (no);  Overcast 4/9 (yes), 0/5 (no);  Rainy 3/9 (yes), 2/5 (no)
Temperature:  yes: mean 73, std dev 6.2;  no: mean 75, std dev 7.9
Humidity:     yes: mean 79, std dev 10.2; no: mean 86, std dev 9.7
Windy:        False 6/9 (yes), 2/5 (no);  True 3/9 (yes), 3/5 (no)
Play:         yes 9/14;  no 5/14
• Example density value:
f(temperature = 66 | yes) = (1 / (6.2 √(2π))) × exp(−(66 − 73)² / (2 × 6.2²)) = 0.0340
Classifying a new day
• A new day: Outlook Temp. Humidity Windy Play
Sunny 66 90 true ?
Likelihood of “yes” = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of “no”  = 3/5 × 0.0221 × 0.0381 × 3/5 × 5/14 = 0.000108
P(“yes”) = 0.000036 / (0.000036 + 0.000108) = 25%
P(“no”)  = 0.000108 / (0.000036 + 0.000108) = 75%
• Missing values during training are not included
in calculation of mean and standard deviation
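The density values used above can be reproduced in Python. Here the mean and
standard deviation are computed directly from the numeric weather data (sample
standard deviation, i.e. dividing by n - 1):

from math import sqrt, pi, exp

def gaussian_density(x, mu, sigma):
    # Normal probability density f(x) with mean mu and standard deviation sigma
    return (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# Temperature values of the nine "yes" days in the numeric weather table
temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
mu = sum(temps_yes) / len(temps_yes)                                        # 73.0
sigma = sqrt(sum((t - mu) ** 2 for t in temps_yes) / (len(temps_yes) - 1))  # about 6.2

print(round(mu, 1), round(sigma, 1))
print(round(gaussian_density(66, mu, sigma), 4))   # about 0.034, the value used above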
Naïve Bayes: discussion
Naïve Bayes works surprisingly well (even if
independence assumption is clearly violated)
Why? Because classification doesn’t require
accurate probability estimates as long as
maximum probability is assigned to correct class
However: adding too many redundant attributes
will cause problems (e.g. identical attributes)
Note also: many numeric attributes are not
normally distributed (kernel density estimators can be used instead)
Multinomial naïve Bayes
• Version of naïve Bayes used for document classification using bag of words
model
• n1, n2, ..., nk: number of times word i occurs in the document
• P1, P2, ..., Pk: probability of obtaining word i when sampling from documents
in class H
• Probability of observing a particular document E given the class H
(based on the multinomial distribution):
Pr[E | H] = N! × Π i (Pi^ni / ni!),  where N = n1 + n2 + ... + nk
• Note that this expression ignores the probability of generating a document of
the right length
• This probability is assumed to be constant for all classes
Multinomial naïve Bayes
• Suppose the dictionary has two words, yellow and blue
• Suppose P(yellow | H) = 75% and P(blue | H) = 25%
• Suppose E is the document "blue yellow blue" (blue twice, yellow once)
• Probability of observing the document:
P({blue yellow blue} | H) = 3! × (0.75¹ / 1!) × (0.25² / 2!) = 9/64 ≈ 0.14
Suppose there is another class H' that has
P(yellow | H') = 10% and P(blue | H') = 90%:
P({blue yellow blue} | H') = 3! × (0.1¹ / 1!) × (0.9² / 2!) = 243/1000 ≈ 0.24
• Need to take prior probability of class into account to make the final
classification using Bayes’ rule
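These multinomial probabilities can be checked by hand or with a short Python
helper; the word counts for "blue yellow blue" are yellow = 1, blue = 2:

from math import factorial

def multinomial_doc_prob(word_counts, word_probs):
    # Pr[document | class] under the multinomial model:
    # N! * prod(Pi^ni / ni!), with N the total number of words
    n_total = sum(word_counts)
    prob = factorial(n_total)
    for n_i, p_i in zip(word_counts, word_probs):
        prob *= p_i ** n_i / factorial(n_i)
    return prob

counts = [1, 2]                                    # [yellow, blue]
print(multinomial_doc_prob(counts, [0.75, 0.25]))  # 0.140625 = 9/64 (class H)
print(multinomial_doc_prob(counts, [0.10, 0.90]))  # about 0.243 = 243/1000 (class H')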
Categorical Naïve Bayes
import pandas
from sklearn.naive_bayes import CategoricalNB
from sklearn import metrics
df = pandas.read_csv("weatherdata_converted.csv")
print(df)
features=['Outlook', 'Temperature', 'Humidity', 'Windy']
print(df.info())
# Treat the four features as categorical variables
df['Outlook'] = df['Outlook'].astype('category')
df['Temperature'] = df['Temperature'].astype('category')
df['Humidity'] = df['Humidity'].astype('category')
df['Windy'] = df['Windy'].astype('category')
X = df[features]
y = df['Play']
Categorical Naïve Bayes
print(df.info())
print(df.describe())
print(df.groupby('Play').size())
clf = CategoricalNB()
clf.fit(X, y)
y_pred = clf.predict(X)
print(y_pred)
print("Accuracy : ", metrics.accuracy_score(y, y_pred)*100)
Gaussian Naïve Bayes (Iris flower data set)
The dataset contains a set of 150 records under five attributes/features:
petal length, petal width, sepal length, sepal width, and species (the class).
Gaussian Naïve Bayes (Iris flower data set)
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
iris= datasets.load_iris()
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)
nb = GaussianNB()
nb.fit(X, y)
# Predict on the training data and report accuracy
y_pred = nb.predict(X)
print(y_pred)
print("Accuracy : ", metrics.accuracy_score(y, y_pred)*100)