
Introduction to Machine Learning

Decision Trees
Inas A. Yassine
Systems and Biomedical Engineering Department,
Faculty of Engineering - Cairo University
[email protected]
Decision Tree Representation

• Root: the tree's first node
• Branch: the outcome of a test
• Node: a decision variable
• Leaf: a class label



Decision Tree Classifier - Use Cases
• When a series of categorical questions is answered to arrive
at a classification
• Biological species classification
• Checklist of symptoms during a doctor's evaluation of a patient
• When "if-then" conditions are preferred to linear models (see the
short sketch after this list)
• Customer segmentation to predict response rates
• Financial decisions such as loan approval
• Fraud detection
• Short Decision Trees are the most popular "weak learner" in
ensemble learning techniques
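To make the "if-then" flavour concrete, here is a minimal scikit-learn sketch (not part of the lecture; it assumes scikit-learn is installed and uses the built-in Iris data purely for illustration) that fits a shallow tree and prints the rules it learned:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree keeps the rule list short and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Print the learned decision rules as nested "if feature <= threshold" tests.
print(export_text(tree, feature_names=list(iris.feature_names)))
```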



Example: The Credit Prediction Problem



Learning Data …

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

Attribute values:
- Outlook: Sunny, Overcast, Rain
- Temperature: Hot, Mild, Cool
- Humidity: High, Normal
- Wind: Weak, Strong
- Play Tennis: Yes, No



Data to be Classified

Day Outlook Temperature Humidity Wind Play Tennis


1 Overcast Mild High Weak ?
2 Rain Cool Normal Strong ?
3 Sunny Hot High Strong ?



ID3: The Basic Decision Tree Learning Algorithm

(Figure: the 14 training examples D1–D14 before any split.)

What is the "best" attribute to split on first?
["best" = the attribute with the highest information gain]



Deciding whether a pattern is interesting
• Information Theory
• A very large field, originally developed for compressing signals
• More recently, also used for data mining…



Information Gain
• The information gain of a feature F is the expected reduction in entropy resulting from
splitting on this feature:

  Gain(S, F) = Entropy(S) − Σ_{v ∈ Values(F)} (|S_v| / |S|) · Entropy(S_v)

  Entropy(S) = −P+·log2(P+) − P−·log2(P−)

  where S_v is the subset of S having value v for feature F, and the entropy of each
  resulting subset is weighted by its relative size.
• Example: four training instances, 2+ and 2−, so Entropy(S) = 1
  <big, red, circle>: +
  <small, red, circle>: +
  <small, red, square>: −
  <big, blue, circle>: −
  • size:  big → [1+,1−], E = 1;  small → [1+,1−], E = 1;
    Gain = 1 − (0.5·1 + 0.5·1) = 0
  • color: red → [2+,1−], E = 0.918;  blue → [0+,1−], E = 0;
    Gain = 1 − (0.75·0.918 + 0.25·0) = 0.311
  • shape: circle → [2+,1−], E = 0.918;  square → [0+,1−], E = 0;
    Gain = 1 − (0.75·0.918 + 0.25·0) = 0.311
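A minimal Python sketch of these two formulas (the function names and the list-of-tuples data layout are illustrative assumptions, not from the lecture):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits (0*log 0 is treated as 0)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Expected reduction in entropy from splitting on the given feature."""
    n = len(labels)
    # Group the class labels by the value the feature takes in each row.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted
```

Applied to the four-instance example above, this returns 0 for size and about 0.311 for color and shape.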



Information Gain Calculation Example
(The data is the Play-Tennis table shown earlier: 9 positive and 5 negative examples.)

• Entropy for the dataset
  • E(S) = −(9/14)log2(9/14) − (5/14)log2(5/14) = 0.94
• Outlook
  • Sunny [2+,3−], Overcast [4+,0−], Rain [3+,2−]
  • E[S_Sunny] = −(2/5)log2(2/5) − (3/5)log2(3/5) = 0.971
  • E[S_Overcast] = −(4/4)log2(4/4) − (0/4)log2(0/4) = 0
  • E[S_Rain] = −(3/5)log2(3/5) − (2/5)log2(2/5) = 0.971
  • IG(S, Outlook) = 0.94 − [(5/14)E[S_Sunny] + (4/14)E[S_Overcast] + (5/14)E[S_Rain]] = 0.247
• Temperature
  • Hot [2+,2−], Mild [4+,2−], Cool [3+,1−]
  • E[S_Hot] = −(2/4)log2(2/4) − (2/4)log2(2/4) = 1
  • E[S_Mild] = −(4/6)log2(4/6) − (2/6)log2(2/6) = 0.918
  • E[S_Cool] = −(3/4)log2(3/4) − (1/4)log2(1/4) = 0.811
  • IG(S, Temperature) = 0.94 − [(4/14)E[S_Hot] + (6/14)E[S_Mild] + (4/14)E[S_Cool]] = 0.029



Information Gain Calculation Example
• Entropy for the dataset
  • E(S) = −(9/14)log2(9/14) − (5/14)log2(5/14) = 0.94
• Humidity
  • High [3+,4−], Normal [6+,1−]
  • E[S_High] = −(3/7)log2(3/7) − (4/7)log2(4/7) = 0.985
  • E[S_Normal] = −(6/7)log2(6/7) − (1/7)log2(1/7) = 0.592
  • IG(S, Humidity) = 0.94 − [(7/14)E[S_High] + (7/14)E[S_Normal]] = 0.152
• Wind
  • Weak [6+,2−], Strong [3+,3−]
  • E[S_Weak] = −(6/8)log2(6/8) − (2/8)log2(2/8) = 0.811
  • E[S_Strong] = −(3/6)log2(3/6) − (3/6)log2(3/6) = 1
  • IG(S, Wind) = 0.94 − [(8/14)E[S_Weak] + (6/14)E[S_Strong]] = 0.048

IG(S, Outlook) > IG(S, Humidity) > IG(S, Wind) > IG(S, Temperature), so Outlook is chosen as the root.
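These gains can be double-checked with a short, self-contained script (a sketch only; it hard-codes the table above and mirrors the hand calculation attribute by attribute):

```python
import math
from collections import Counter

# Play-Tennis training data: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
features = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
for i, name in enumerate(features):
    groups = {}
    for row in data:
        groups.setdefault(row[i], []).append(row[-1])
    gain = entropy(labels) - sum(len(g) / len(data) * entropy(g) for g in groups.values())
    print(f"Gain(S, {name}) = {gain:.3f}")
# Prints approximately: Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
```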



ID3 (Cont'd)

Splitting on Outlook partitions the training examples:
• Sunny: D1, D2, D8, D9, D11
• Overcast: D3, D7, D12, D13
• Rain: D4, D5, D6, D10, D14


Information Gain Calculation Example
Sunny-branch examples: D1, D2, D8, D9, D11 (2+, 3−).

• Entropy for the Sunny branch
  • E(S_Sunny) = −(2/5)log2(2/5) − (3/5)log2(3/5) = 0.971
• Temperature
  • Hot [0+,2−], Mild [1+,1−], Cool [1+,0−]
  • E[S_Hot] = 0, E[S_Mild] = 1, E[S_Cool] = 0
  • IG(S_Sunny, Temperature) = 0.971 − [(2/5)(0) + (2/5)(1) + (1/5)(0)] = 0.571
• Humidity
  • High [0+,3−], Normal [2+,0−]
  • E[S_High] = 0, E[S_Normal] = 0
  • IG(S_Sunny, Humidity) = 0.971 − [(3/5)(0) + (2/5)(0)] = 0.971
• Wind
  • Weak [1+,2−], Strong [1+,1−]
  • E[S_Weak] = −(1/3)log2(1/3) − (2/3)log2(2/3) = 0.918
  • E[S_Strong] = −(1/2)log2(1/2) − (1/2)log2(1/2) = 1
  • IG(S_Sunny, Wind) = 0.971 − [(3/5)(0.918) + (2/5)(1)] = 0.020

Humidity gives the highest gain, so it is chosen for the Sunny branch.



ID3 (Cont'd)

• Sunny: D1, D2, D8, D9, D11 → split further
• Overcast: D3, D7, D12, D13 → all positive, so this branch becomes a "Yes" leaf
• Rain: D4, D5, D6, D10, D14 → split further

What are the "best" attributes for the remaining branches?
Humidity for the Sunny branch and Wind for the Rain branch.
General Algorithm
• To construct tree T from training set S:
• If all examples in S belong to the same class, or S is sufficiently
"pure", then make a leaf labeled with that class.
• Otherwise:
• select the "most informative" attribute A
• partition S according to A's values
• recursively construct sub-trees T1, T2, ... for the subsets of S
• The details vary according to the specific algorithm – CART, ID3, C4.5 –
but the general idea is the same (a minimal sketch follows below).
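A minimal recursive sketch of this general procedure in Python (illustrative only: the dict-based node representation and helper names are assumptions, and it handles only categorical attributes):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """ID3-style construction; 'features' is a list of usable column indices."""
    # Leaf case: the subset is pure, or there is nothing left to split on.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]

    def gain(f):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        return entropy(labels) - sum(len(g) / len(labels) * entropy(g)
                                     for g in groups.values())

    best = max(features, key=gain)            # most informative attribute
    node = {"attribute": best, "children": {}}
    partitions = {}
    for row, y in zip(rows, labels):          # partition S according to A's values
        partitions.setdefault(row[best], []).append((row, y))
    remaining = [f for f in features if f != best]
    for value, subset in partitions.items():  # recurse on each subset
        node["children"][value] = build_tree([r for r, _ in subset],
                                             [y for _, y in subset], remaining)
    return node
```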

Decision Tree Classifier - Reasons to Choose (+) & Cautions (-)

Reasons to Choose (+):
• Takes any input type (numeric, categorical); in principle, can handle
categorical variables with many distinct values (e.g. ZIP code)
• Robust to redundant and correlated variables
• Naturally handles variable interaction
• Handles variables that have a non-linear effect on the outcome
• Computationally efficient to build
• Easy to score data
• Many algorithms can return a measure of variable importance
• In principle, decision rules are easy to understand

Cautions (-):
• Decision surfaces can only be axis-aligned
• Tree structure is sensitive to small changes in the training data
• A "deep" tree is probably over-fit, because each split reduces the
training data available for subsequent splits
• Not good for outcomes that depend on many variables (related to the
over-fit problem above)
• Doesn't naturally handle missing values; however, most implementations
include a method for dealing with this
• In practice, decision rules can be fairly complex

Ensemble Learning
Random Forest



Motivation
• So far we have seen learning methods that learn a single hypothesis, chosen from a hypothesis
space, and use it to make predictions.
• No Free Lunch theorem: there is no single algorithm that is always the most accurate.
• Instead, generate a group of base learners which, when combined, have higher accuracy:
• Build many models and combine them
• An ensemble model improves accuracy and robustness over single-model methods
• Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to
understand and solve (divide-and-conquer approach)
• Applications:
• distributed computing
• privacy-preserving applications
• large-scale data with reusable models
• multiple sources of data



Strong versus Weak learner
• Strong learner (the objective of machine learning)
• Takes labeled data for training
• Produces a classifier which can be made arbitrarily accurate

• Weak learner
• Takes labeled data for training
• Generates a hypothesis with training accuracy greater than 0.5, i.e., less than 50%
error over any distribution; only slightly more accurate than random guessing
• Strong learners are very difficult to construct
• Constructing weak learners is relatively easy



Bias versus Variance
• Bias is the persistent/systematic error of a learner,
independent of the training set.
• Zero for a learner that always makes the optimal prediction.
• Variance is the error incurred by fluctuations in response to
different training sets.
• Independent of the true value of the predicted variable, and zero
for a learner that always predicts the same class regardless of the
training set.
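A rough simulation of the two quantities (a sketch, assuming NumPy and scikit-learn; the sine-wave data and the depth-2 regression tree are arbitrary illustrations): training the same learner on many independently drawn training sets lets us estimate its squared bias and its variance at a fixed set of test points.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(3 * x)            # ground-truth function (arbitrary choice)
x_test = np.linspace(0, 2, 50)[:, None]     # fixed test points

# Train the same learner on many independently drawn training sets.
preds = []
for _ in range(200):
    x_tr = rng.uniform(0, 2, size=(30, 1))
    y_tr = true_f(x_tr).ravel() + rng.normal(0, 0.3, size=30)
    preds.append(DecisionTreeRegressor(max_depth=2).fit(x_tr, y_tr).predict(x_test))
preds = np.array(preds)

avg_pred = preds.mean(axis=0)
bias_sq  = np.mean((avg_pred - true_f(x_test).ravel()) ** 2)  # systematic error
variance = preds.var(axis=0).mean()                           # sensitivity to the training set
print(f"bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

Increasing max_depth in this sketch typically lowers the bias term and raises the variance term.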



Ensemble Learning Block Diagram

(Diagram: the data is fed to model 1, model 2, …, model k, and their outputs are
combined into a single ensemble model.)



Stories of Success
• Million-dollar prize
• Improve the baseline movie recommendation approach of Netflix by
10% in accuracy
• The top submissions all combine several teams and algorithms as an
ensemble

• Data mining competitions
• Classification problems
• Winning teams employ an ensemble of classifiers

Netflix Prize
• Supervised learning task
• Training data is a set of users and the ratings (1, 2, 3, 4, or 5 stars) those users have given to
movies
• Construct a classifier that, given a user and an unrated movie, correctly predicts that
movie's rating as either 1, 2, 3, 4, or 5 stars
• $1 million prize for a 10% improvement over Netflix's current movie recommender
• Competition
• At first, single-model methods were developed, and performance improved
• However, improvements slowed down
• Later, individuals and teams merged their results, and significant improvements were
observed

Leaderboard

“Our final solution (RMSE = 0.8712) consists of blending 107 individual results.”

“Predictive accuracy is substantially improved when blending multiple
predictors. Our experience is that most efforts should be concentrated in
deriving substantially different approaches, rather than refining a single
technique.”



Different learners
• Subsampling training examples
• Manipulating input features
• Using different learning algorithms
• Varying the parameters of a given algorithm
• Injecting randomness



Achieving Diversity
Diversity from differences in inputs:
1. Divide up the training data among models: different subsets of the training
   examples are given to Classifier A, Classifier B, and Classifier C, and their
   predictions are combined.
2. Different feature weightings: different feature groups (e.g. Ratings, Actors,
   Genres) are given to Classifier A, Classifier B, and Classifier C, and their
   predictions are combined.



Ensemble Mechanisms - Combiners
• Voting
• Averaging (if predictions not 0,1)
• Weighted Averaging
• base weights on confidence in component
• Learning combiner
• Bagging
• Boosting (AdaBoost, RegionBoost)
• piecewise combiner
• Gating, Stacking
• general combiner
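A small sketch of the two simplest combiners, plain voting and confidence-weighted averaging (pure NumPy; the function names and sample data are illustrative assumptions):

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_models, n_samples) array of integer class labels."""
    predictions = np.asarray(predictions)
    return np.array([np.bincount(col).argmax() for col in predictions.T])

def weighted_average(scores, weights):
    """scores: (n_models, n_samples) real-valued predictions; weights reflect
    how much confidence we place in each component model."""
    weights = np.asarray(weights, dtype=float)
    return np.average(scores, axis=0, weights=weights / weights.sum())

# Example: three models predicting four samples.
votes = [[1, 0, 1, 1],
         [1, 1, 0, 1],
         [0, 0, 1, 1]]
print(majority_vote(votes))                     # -> [1 0 1 1]
print(weighted_average([[0.9, 0.2, 0.6, 0.8],
                        [0.7, 0.4, 0.3, 0.9],
                        [0.2, 0.1, 0.7, 0.6]],
                       weights=[0.5, 0.3, 0.2]))
```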



Random Forest
• Ensemble method specifically designed for decision tree
classifiers
• A Random Forest grows many trees
• Ensemble of unpruned decision trees
• Each base classifier classifies a “new” vector of attributes from the
original data
• Final result on classifying a new instance: voting. Forest chooses the
classification result having the most votes (over all the trees in the
forest)



Random Forests



Bagging - Aggregate Bootstrapping
• Given a standard training set D of size n
• For i = 1 .. M:
• Draw a bootstrap sample of size n* ≤ n from D uniformly and with replacement
• Learn classifier Ci on that sample
• The final classifier is a (majority) vote of C1 .. CM
• Increases classifier stability / reduces variance (a minimal sketch follows below)
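A compact from-scratch sketch of this loop (assuming NumPy and scikit-learn decision trees as the base classifiers Ci, and class labels encoded as small non-negative integers so the final vote can use bincount):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, sample_size=None, seed=0):
    """Train n_models trees, each on a bootstrap sample drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = sample_size or n                      # bootstrap sample size n* (<= n)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=m)      # draw uniformly, with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """The final classifier is a plain vote of C1 .. CM."""
    votes = np.array([m.predict(X) for m in models])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```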



Boosting
• Developed to guarantee performance improvements on fitting training data for a weak
learner (Schapire, 1990).
• Instead of sampling (as in bagging) re-weigh examples!
• Revised to be a practical algorithm, AdaBoost, for building ensembles that empirically
improves generalization performance (Freund & Schapire, 1996).
• Examples are given weights. At each iteration, a new hypothesis is learned (a weak
learner) and the examples are reweighted to focus the system on examples that the
most recently learned classifier got wrong.
• Final classification is based on a weighted vote of the weak classifiers (a minimal
sketch follows below).
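A minimal AdaBoost-style sketch with decision stumps as the weak learners (assuming NumPy and scikit-learn, and labels encoded as -1/+1; this follows the standard reweighting rule rather than any particular implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """y must be encoded as -1/+1. Returns the stumps and their vote weights."""
    n = len(X)
    w = np.full(n, 1.0 / n)                       # start with equal example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = max(np.sum(w[pred != y]) / np.sum(w), 1e-10)  # weighted error rate (epsilon)
        if err >= 0.5:                            # the weak learner must beat random guessing
            break
        alpha = 0.5 * np.log((1 - err) / err)     # classifier weight (alpha)
        w = w * np.exp(-alpha * y * pred)         # up-weight the examples it got wrong
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote of the weak classifiers."""
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```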



Random Tree
• Pick N features at random and build your tree
• Bootstrapping:
• Draw points at random (with replacement)
• Build the tree
• Use the remaining (out-of-bag) points for testing
• Decide how confident you are in this decision
• Parallelism:
• Load the data onto different machines



AdaBoost Example*:

Original training set: equal weights to all training samples



AdaBoost Example:
ε = error rate of classifier
α = weight of classifier
ROUND 1



AdaBoost Example:
ROUND 2



AdaBoost Example:
ROUND 3



Random Forest
• Introduce two sources of randomness: “Bagging” and “Random
input vectors”
• Bagging method: each tree is grown using a bootstrap sample of
training data
• Random input vector method: at each node, the best split is chosen from a
random sample of m attributes instead of all attributes (a brief usage
sketch follows below)
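For reference, both sources of randomness are exposed directly in scikit-learn's RandomForestClassifier (a usage sketch on synthetic data; the parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators = number of bootstrapped trees ("bagging"); max_features = the size m
# of the random attribute sample considered at each node ("random input vectors").
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```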



Issues in Ensembles
• Parallelism in Ensembles: Bagging is easily parallelized,
Boosting is not.
• Variants of Boosting to handle noisy data.
• How “weak” should a base-learner for Boosting be?
• What is the theoretical explanation of boosting’s ability to
improve generalization?
• Exactly how does the diversity of an ensemble affect its
generalization performance?
• Combining Boosting and Bagging.



Thank You …
