Lecture 2: Decision Trees
Decision Trees
– A hierarchical data structure that represents data by implementing a
divide and conquer strategy
– Can be used as a non-parametric classification and regression method
– Given a collection of examples, learn a decision tree that represents it.
– Use this representation to classify new examples
2
Learning decision trees (ID3 algorithm)
Will I play tennis today?
• Features
– Outlook: {Sun, Overcast, Rain}
– Temperature: {Hot, Mild, Cool}
– Humidity: {High, Normal, Low}
– Wind: {Strong, Weak}
• Labels
– Binary classification task: Y = {+, -}
4
Will I play tennis today?
 #   O   T   H   W   Play?
 1   S   H   H   W    -
 2   S   H   H   S    -
 3   O   H   H   W    +
 4   R   M   H   W    +
 5   R   C   N   W    +
 6   R   C   N   S    -
 7   O   C   N   S    +
 8   S   M   H   W    -
 9   S   C   N   W    +
10   R   M   N   W    +
11   S   M   N   S    +
12   O   M   H   S    +
13   O   H   N   W    +
14   R   M   H   S    -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)
5
Basic Decision Trees Learning Algorithm
• Data is processed in batch (i.e., all the data is available).
• Recursively build a decision tree top-down.
Algorithm?

(Data: the 14-example table above.)

The resulting decision tree:
Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind
               Strong → No
               Weak   → Yes
Basic Decision Tree Algorithm
• Let S be the set of examples
  – Label is the target attribute (the prediction)
  – Attributes is the set of measured attributes
• ID3(S, Attributes, Label)
    If all examples in S have the same label, return a single-node tree with that label.
    Otherwise:
      A = the attribute in Attributes that best classifies S; create a Root node for the tree that tests A.
      For each possible value v of A:
        Add a new tree branch corresponding to A = v.
        Let Sv be the subset of examples in S with A = v.
        If Sv is empty: add a leaf node labeled with the most common value of Label in S
          (why? so the tree can still make a prediction at evaluation time).
        Else: below this branch add the subtree ID3(Sv, Attributes - {A}, Label).
      Return Root.
7
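A minimal Python sketch of the ID3 procedure above, using the information-gain heuristic defined in the following slides. The data representation (examples as dicts, a tree as either a label or an (attribute, children) pair) and all helper names are assumptions for illustration, not the lecture's code; for simplicity it only creates branches for attribute values actually observed in S, so the empty-Sv case never arises.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, attribute):
    """Expected reduction in entropy from splitting on `attribute`."""
    expected = 0.0
    for v in set(ex[attribute] for ex in examples):
        subset = [y for ex, y in zip(examples, labels) if ex[attribute] == v]
        expected += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - expected

def id3(examples, labels, attributes):
    """Return a tree: either a label (leaf) or (attribute, {value: subtree})."""
    if len(set(labels)) == 1:              # all examples share one label
        return labels[0]
    if not attributes:                     # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    children = {}
    for v in set(ex[best] for ex in examples):   # only values observed in S
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        children[v] = id3([examples[i] for i in idx],
                          [labels[i] for i in idx],
                          [a for a in attributes if a != best])
    return (best, children)

# Tiny usage example on a fragment of the tennis data:
X = [{'Outlook': 'S', 'Wind': 'W'}, {'Outlook': 'S', 'Wind': 'S'},
     {'Outlook': 'O', 'Wind': 'W'}, {'Outlook': 'R', 'Wind': 'W'}]
y = ['-', '-', '+', '+']
print(id3(X, y, ['Outlook', 'Wind']))    # splits on Outlook
```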
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
– But, finding the minimal decision tree consistent with the data is NP-hard.
• The recursive algorithm is a greedy heuristic search for a
simple tree, but cannot guarantee optimality.
• The main decision in the algorithm is the selection of the next
attribute to condition on.
8
Picking the Root Attribute
• The goal is to have the resulting decision tree as small as
possible (Occam’s Razor)
– The main decision in the algorithm is the selection of the next attribute
to condition on.
• We want attributes that split the examples to sets that are
relatively pure in one label; this way we are closer to a leaf
node.
– The most popular heuristic is based on information gain, and originated with
the ID3 system of Quinlan.
9
Entropy
• Entropy (impurity, disorder) of a set of examples, S, relative to a binary
classification is:

  $\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$

  where $p_{+}$ and $p_{-}$ are the proportions of positive and negative examples in S.

• Entropy can be viewed as the number of bits required, on average, to encode the
class of labels. If the probability for + is 0.5, a single bit is required for each
example; if it is 0.8, we can use less than 1 bit.
10
Entropy
• Entropy (impurity, disorder) of a set of examples, S, relative to a binary classification is:
  [Figure: example sets with different proportions of + and - labels and their entropies]
11
Entropy
(Convince yourself that the max value would be 1, attained when the two classes are equally likely.)
(Also note that the base of the log only introduces a constant factor; therefore, we'll think about base 2.)
12
Information Gain
High Entropy – High level of Uncertainty
Low Entropy – No Uncertainty.
13
Will I play tennis today?
(Data: the 14-example table above, with 9 positive and 5 negative examples.)

• Calculate the current entropy:
  $p_{+} = 9/14$, $p_{-} = 5/14$
  $\mathrm{Entropy}(\mathrm{Play}) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.94$
15
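A quick, self-contained numeric check of this value (the counts come from the table above):

```python
import math

# Entropy of the full tennis data: 9 positive and 5 negative examples.
p_pos, p_neg = 9 / 14, 5 / 14
H = -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
print(round(H, 3))   # 0.94
```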
Information Gain: Outlook
$$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

Outlook = Sunny: 2+, 3-     →  Entropy(O = S) = 0.971
Outlook = Overcast: 4+, 0-  →  Entropy(O = O) = 0
Outlook = Rainy: 3+, 2-     →  Entropy(O = R) = 0.971

Expected entropy = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.694

Information gain = 0.940 - 0.694 = 0.246
16
Information Gain: Humidity
$$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

Humidity = High: 3+, 4-    →  Entropy(H = H) = 0.985
Humidity = Normal: 6+, 1-  →  Entropy(H = N) = 0.592

Expected entropy = (7/14)×0.985 + (7/14)×0.592 = 0.7885

Information gain = 0.940 - 0.7885 = 0.151
17
Which feature to split on?
(Data: the 14-example table above.)

Information gain:
  Outlook:     0.246
  Humidity:    0.151
  Wind:        0.048
  Temperature: 0.029

→ Split on Outlook
18
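The four gains above can be re-derived mechanically. Below is a small, self-contained sketch; the data literals simply restate the table above, and entropy and gain follow the definitions given earlier.

```python
import math
from collections import Counter

data = [  # (Outlook, Temperature, Humidity, Wind, Play)
    ('S','H','H','W','-'), ('S','H','H','S','-'), ('O','H','H','W','+'),
    ('R','M','H','W','+'), ('R','C','N','W','+'), ('R','C','N','S','-'),
    ('O','C','N','S','+'), ('S','M','H','W','-'), ('S','C','N','W','+'),
    ('R','M','N','W','+'), ('S','M','N','S','+'), ('O','M','H','S','+'),
    ('O','H','N','W','+'), ('R','M','H','S','-'),
]
columns = {'Outlook': 0, 'Temperature': 1, 'Humidity': 2, 'Wind': 3}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    labels = [r[-1] for r in rows]
    expected = 0.0
    for v in set(r[col] for r in rows):
        sub = [r[-1] for r in rows if r[col] == v]
        expected += len(sub) / len(rows) * entropy(sub)
    return entropy(labels) - expected

for name, col in columns.items():
    print(f"{name}: {gain(data, col):.3f}")
# Outlook: 0.246, Temperature: 0.029, Humidity: 0.151, Wind: 0.048
```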
An Illustrative Example (III)
Gain(S, Outlook)     = 0.246
Gain(S, Humidity)    = 0.151
Gain(S, Wind)        = 0.048
Gain(S, Temperature) = 0.029

→ The root of the tree is Outlook.
19
An Illustrative Example (III)
Outlook
  Sunny:    examples 1,2,8,9,11   (2+, 3-)  → ?
  Overcast: examples 3,7,12,13    (4+, 0-)  → Yes
  Rain:     examples 4,5,6,10,14  (3+, 2-)  → ?

(Data: the 14-example table above.)
20
An Illustrative Example (III)
Outlook
  Sunny:    examples 1,2,8,9,11   (2+, 3-)  → ?
  Overcast: examples 3,7,12,13    (4+, 0-)  → Yes
  Rain:     examples 4,5,6,10,14  (3+, 2-)  → ?

Continue until:
• every attribute is included in the path, or
• all examples in the leaf have the same label.
21
An Illustrative Example (IV)
Outlook
  Sunny:    examples 1,2,8,9,11   (2+, 3-)  → ?
  Overcast: examples 3,7,12,13    (4+, 0-)  → Yes
  Rain:     examples 4,5,6,10,14  (3+, 2-)  → ?

Within the Sunny branch (Entropy(S_sunny) = 0.97):
  Gain(S_sunny, Humidity) = 0.97 - (3/5)·0 - (2/5)·0 = 0.97
  Gain(S_sunny, Temp)     = 0.97 - (2/5)·0 - (2/5)·1 - (1/5)·0 = 0.57
  Gain(S_sunny, Wind)     = 0.97 - (2/5)·1 - (3/5)·0.92 = 0.02

→ Split on Humidity
22
An Illustrative Example (V)
Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → ?
24
induceDecisionTree(S)
• 1. Does S uniquely define a class?
     if all s ∈ S have the same label y: return S;
• 2. Find the attribute with the most information gain:
     i = argmax_i Gain(S, Xi)
• 3. Add children to S:
     for k in Values(Xi):
       Sk = {s ∈ S | xi = k}
       addChild(S, Sk)
       induceDecisionTree(Sk)
     return S;
25
An Illustrative Example (VI)
Outlook
  Sunny    → Humidity
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind
               Strong → No
               Weak   → Yes
26
Hypothesis Space in Decision Tree Induction
• Conduct a search of the space of decision trees which can
represent all possible discrete functions. (pros and cons)
• Goal: to find the best decision tree
– Best could be “smallest depth”
– Best could be “minimizing the expected number of tests”
• Finding a minimal decision tree consistent with a set of data is
NP-hard.
• Performs a greedy heuristic search: hill climbing without
backtracking
• Makes statistically based decisions using all data
27
History of Decision Tree Research
• Hunt and colleagues in Psychology used full search decision tree
methods to model human concept learning in the 60s
– Quinlan developed ID3, with the information gain heuristics in the late 70s to
learn expert systems from examples
– Breiman, Friedman, and colleagues in statistics developed CART (Classification
And Regression Trees) around the same time
• A variety of improvements in the 80s: coping with noise, continuous
attributes, missing data, non-axis-parallel splits, etc.
– Quinlan’s updated algorithm, C4.5 (1993) is commonly used (New: C5)
• Boosting (or Bagging) over DTs is a very good general purpose
algorithm
28
Overfitting Example
• Suppose we get a new example:
  Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = Strong, label: No
• This example does not exist in the training data; the learned tree
  (Outlook = Sunny, Humidity = Normal → Yes) classifies it incorrectly.
30
Overfitting - Example
• This can always be fixed by growing the tree further, e.g., splitting the Sunny/Normal leaf
  on Wind (Strong → No, Weak → Yes)
  – but doing so may fit noise or other coincidental regularities.
31
Our training data
[Figure: the training examples]
32
The instance space
[Figure: the full instance space]
33
Overfitting the Data
• Learning a tree that classifies the training data perfectly may not lead to the tree with the best
generalization performance.
– There may be noise in the training data that the tree is fitting
– The algorithm might be making decisions based on very little data
• A hypothesis h is said to overfit the training data if there is another hypothesis h’, such that h has a
smaller error than h’ on the training data but h has larger error on the test data than h’.
[Figure: accuracy vs. complexity of the tree; accuracy on the training data keeps increasing, while accuracy on the testing data peaks and then declines]
34
Reasons for overfitting
• Too much variance in the training data
– Training data is not a representative sample
of the instance space
– We split on features that are actually irrelevant
36
Avoiding Overfitting
How can this be avoided with linear classifiers?
• Two basic approaches
– Pre-pruning: Stop growing the tree at some point during construction when it is determined that there is
not enough data to make reliable choices.
– Post-pruning: Grow the full tree and then remove nodes that seem not to have sufficient evidence.
• Methods for evaluating subtrees to prune
– Cross-validation: Reserve hold-out set to evaluate utility
– Statistical testing: Test if the observed regularity can be dismissed as likely to occur by chance
– Minimum Description Length: Is the additional complexity of the hypothesis smaller than remembering
the exceptions?
• This is related to the notion of regularization that we will see in other contexts – keep the hypothesis
simple.
38
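A minimal sketch of one post-pruning strategy mentioned above (reduced-error pruning against a held-out set). It assumes the (attribute, {value: subtree}) tree representation from the ID3 sketch earlier; all function names are hypothetical, and this is an illustration rather than the lecture's prescribed procedure.

```python
from collections import Counter

def predict(tree, example, default='+'):
    """Walk the tree; `default` is used when a branch value was never seen."""
    while isinstance(tree, tuple):
        attribute, children = tree
        tree = children.get(example.get(attribute), default)
    return tree

def accuracy(tree, xs, ys):
    if not ys:                       # no held-out examples reach this node
        return 1.0
    return sum(predict(tree, x) == y for x, y in zip(xs, ys)) / len(ys)

def prune(tree, train_x, train_y, val_x, val_y):
    """Bottom-up: replace a subtree by a majority-label leaf whenever doing so
    does not hurt accuracy on the validation examples that reach this node."""
    if not isinstance(tree, tuple):
        return tree
    attribute, children = tree
    for v in list(children):
        ti = [i for i, x in enumerate(train_x) if x.get(attribute) == v]
        vi = [i for i, x in enumerate(val_x) if x.get(attribute) == v]
        children[v] = prune(children[v],
                            [train_x[i] for i in ti], [train_y[i] for i in ti],
                            [val_x[i] for i in vi], [val_y[i] for i in vi])
    # candidate replacement: majority training label at this node
    leaf = Counter(train_y).most_common(1)[0][0]
    if accuracy(leaf, val_x, val_y) >= accuracy(tree, val_x, val_y):
        return leaf                  # prune: the leaf does no worse on held-out data
    return tree
```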
The i.i.d. assumption
• Training and test items are independently and identically
distributed (i.i.d.):
– There is a distribution P(X, Y) from which the data D = {(x, y)} is generated.
• Sometimes it’s useful to rewrite P(X, Y) as P(X)P(Y|X)
Usually P(X, Y) is unknown to us (we just know it exists)
– Training and test data are samples drawn from the same P(X, Y): they are
identically distributed
– Each (x, y) is drawn independently from P(X, Y)
42
Overfitting
[Figure: accuracy vs. size of the tree; on training data accuracy keeps increasing, on test data it peaks and then falls]
Why this shape of curves?
(Size of the tree is our measure of model complexity here.)
44
Overfitting
[Figure: empirical (training) error vs. model complexity; empirical error decreases as the model grows more complex]
45
Overfitting
[Figure: expected error vs. model complexity; it decreases at first, then increases]
• Expected error:
What percentage of items drawn from P(x,y) do we expect to
be misclassified by f?
• (That’s what we really care about – generalization)
46
Variance of a learner (informally)
• Informally: how much does the learned hypothesis change when the training sample changes (for samples drawn from the same distribution)?
[Figure: variance grows with model complexity]
47
Bias of a learner (informally)
• Informally: the error that remains even with plenty of data, because the hypotheses the learner can produce are systematically off from the target.
[Figure: bias shrinks with model complexity]
48
Impact of bias and variance
[Figure: expected error vs. model complexity, decomposed into bias (decreasing with complexity) and variance (increasing with complexity)]
49
Model complexity
[Figure: the same bias/variance decomposition; expected error is smallest at an intermediate model complexity]
50
Underfitting and Overfitting
[Figure: expected error vs. model complexity; underfitting (high bias) at low complexity, overfitting (high variance) at high complexity, with the best model in between]
55
Continuous Attributes
• Example:
– Length (L): 10 15 21 28 32 40 50
– Class: - + + - + + -
– Check thresholds: L < 12.5; L < 24.5; L < 45
– Subset of Examples= {…}, Split= k+,j-
57
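A sketch of the standard recipe for a continuous attribute: sort the examples by the attribute's value, take midpoints where the class label changes as candidate thresholds, and score each candidate split by information gain. Applied to the Length example above it produces the thresholds listed on the slide plus one more (30.0); the code is illustrative, not the lecture's.

```python
import math
from collections import Counter

values = [10, 15, 21, 28, 32, 40, 50]
labels = ['-', '+', '+', '-', '+', '+', '-']

def entropy(ys):
    return -sum(c / len(ys) * math.log2(c / len(ys)) for c in Counter(ys).values())

pairs = sorted(zip(values, labels))
# candidate thresholds: midpoints between consecutive values with different labels
candidates = [(a + b) / 2 for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]
print(candidates)    # [12.5, 24.5, 30.0, 45.0]

for t in candidates:
    left  = [y for v, y in pairs if v < t]
    right = [y for v, y in pairs if v >= t]
    expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
    print(f"L < {t}: gain = {entropy(labels) - expected:.3f}")
```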
Missing Values

$$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
• Many times values are not available for all attributes during
training or testing (e.g., medical diagnosis)
59
Missing Values
• Example at evaluation time: Outlook = Sunny, Temp = Hot, Humidity = ???, Wind = Strong. What label does the tree assign?
• The missing Humidity value could be Normal or High; one simple option is to fill it in (e.g., with the most common value among the training examples at that node) and classify as usual.
63
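Another common way to handle a missing value at prediction time is to send the example down every branch, weighted by how often each value was seen in training, and average the resulting predictions. The sketch below assumes a tree representation in which each branch stores its training-set weight (children as {value: (weight, subtree)}); it illustrates the idea and is not the lecture's prescribed method.

```python
def predict_with_missing(tree, example):
    """Return P(label = '+') for `example`, which may lack some attribute values.
    A leaf is a label string; an internal node is (attribute, {value: (weight, subtree)}),
    where the weights over a node's branches sum to 1."""
    if not isinstance(tree, tuple):              # leaf
        return 1.0 if tree == '+' else 0.0
    attribute, children = tree
    if attribute in example:                     # value observed: follow that branch
        _, subtree = children[example[attribute]]
        return predict_with_missing(subtree, example)
    # value missing: weighted average over all branches
    return sum(w * predict_with_missing(sub, example) for w, sub in children.values())
```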
Experimental Machine Learning
• Machine Learning is an Experimental Field and we will spend some time
(in Problem sets) learning how to run experiments and evaluate results
– First hint: be organized; write scripts
• Basics:
– Split your data into three sets:
• Training data (often 70-90%)
• Test data (often 10-20%)
• Development data (10-20%)
• You need to report performance on test data, but you are not allowed to
look at it.
– You are allowed to look at the development data (and use it to tune parameters)
64
Metrics
Methodologies
Statistical Significance
Metrics
• We train on our training data Train = {(x_i, y_i)}, i = 1, …, m
• We test on Test data.
• We often set aside part of the training data as a development set, especially when
the algorithms require tuning.
– In the HW we asked you to present results also on the Training; why?
• When we deal with binary classification, we often measure performance simply using Accuracy:
  Accuracy = (# examples classified correctly) / (# examples) = (TP + TN) / (TP + TN + FP + FN)
67
Example
• 100 examples, 5% are positive.
• Imagine using the classifier to identify the positive cases (e.g., for information retrieval).

Confusion matrix (rows: actual class, columns: predicted class):

                 Predicted positive   Predicted negative
  Actual: Yes           TP                   FN
  Actual: No            FP                   TN

The notion of a confusion matrix can be usefully extended to the multiclass case, where cell (i, j) indicates how many of the i-labeled examples were predicted to be j.
69
Relevant Metrics
• It makes sense to consider Recall and Precision together, or to combine them
into a single metric:
  Precision = TP / (TP + FP),   Recall = TP / (TP + FN)
• Recall-Precision Curve: plots the trade-off between the two as the classifier's
decision threshold varies.
• F-Measure: a measure that combines precision and recall as their harmonic mean:
  F1 = 2 · Precision · Recall / (Precision + Recall)
71
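A quick sketch of these metrics computed from confusion-matrix counts; the counts below are made up to match the "100 examples, 5% positive" scenario above.

```python
# tp + fn = 5 actual positives; fp + tn = 95 actual negatives (illustrative counts)
tp, fp, fn, tn = 4, 6, 1, 89

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)     # of the predicted positives, how many are correct
recall    = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  "
      f"recall={recall:.2f}  F1={f1:.2f}")
```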
N-fold cross validation
• Instead of a single train/test split, partition the data into N folds: each fold in turn is held out as the test set while the model is trained on the remaining N - 1 folds, and the N results are averaged.
[Figure: the data divided into folds, with a different fold used for testing in each run]
72
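A minimal sketch of the procedure (no shuffling or stratification; train_and_evaluate is a hypothetical function that trains on one list of examples and returns accuracy on another):

```python
def n_fold_cross_validation(data, n_folds, train_and_evaluate):
    """Split `data` into `n_folds` contiguous folds and average the accuracies."""
    fold_size = len(data) // n_folds
    accuracies = []
    for i in range(n_folds):
        test  = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        accuracies.append(train_and_evaluate(train, test))
    return sum(accuracies) / len(accuracies), accuracies
```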
Evaluation: significance tests
• You have two different classifiers, A and B
• You train and test them on the same data set using N-fold
cross-validation
• For the n-th fold:
accuracy(A, n), accuracy(B, n)
pn = accuracy(A, n) - accuracy(B, n)
• Is the difference between A and B’s accuracies significant?
73
Hypothesis testing
• You want to show that a hypothesis H is true, based on your data.
  – In practice, you formulate the opposite claim as a null hypothesis H0 and ask whether the data allows you to reject it (next slide).
74
Rejecting H0
• H0 defines a distribution P(M |H0) over some statistic M
– (e.g. M= the difference in accuracy between A and B)
• Select a significance value S
– (e.g. 0.05, 0.01, etc.)
– You can only reject H0 if P(m |H0) ≤ S
• Compute the test statistic m from your data
– e.g. the average difference in accuracy over your N folds
• Compute P(m |H0)
• Refute H0 with p ≤ S if P(m |H0) ≤ S
75
Paired t-test
• Null hypothesis (H0; to be refuted):
– There is no difference between A and B, i.e. the expected accuracies of
A and B are the same
• That is, the expected difference (over all possible data sets)
between their accuracies is 0:
H0: E[pD] = 0
76
Paired t-test
• Null hypothesis H0: E[diff_D] = μ = 0
• m: our estimate of μ based on N samples of diff_D:

  $m = \frac{1}{N}\sum_{n=1}^{N} \mathit{diff}_n$

• The estimated variance S²:

  $S^2 = \frac{1}{N-1}\sum_{n=1}^{N} (\mathit{diff}_n - m)^2$

• Accept the null hypothesis at significance level a if the statistic

  $t = \frac{m}{\sqrt{S^2 / N}}$

  lies in $(-t_{a/2,\,N-1},\ +t_{a/2,\,N-1})$.
77
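A small sketch of this test on hypothetical per-fold accuracies for classifiers A and B (N = 10 folds); the numbers are invented for illustration.

```python
import math

acc_A = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.79]
acc_B = [0.78, 0.80, 0.81, 0.77, 0.80, 0.76, 0.79, 0.82, 0.78, 0.77]

diffs = [a - b for a, b in zip(acc_A, acc_B)]
N  = len(diffs)
m  = sum(diffs) / N                                  # mean difference
S2 = sum((d - m) ** 2 for d in diffs) / (N - 1)      # unbiased variance estimate
t  = m / math.sqrt(S2 / N)                           # paired t statistic

print(f"t = {t:.2f} with {N - 1} degrees of freedom")
# Compare |t| with the critical value t_{a/2, N-1} from a t-table
# (e.g. 2.262 for a = 0.05 and N - 1 = 9); if |t| exceeds it, reject H0.
```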
Decision Trees - Summary
• Hypothesis Space:
– Variable size (contains all functions)
– Deterministic; Discrete and Continuous attributes
• Search Algorithm
– ID3 - batch
– Extensions: missing values
• Issues:
– What is the goal?
– When to stop? How to guarantee good generalization?
• Did not address:
– How are we doing? (Correctness-wise, Complexity-wise)
78