Lecture 07
Dr. Samana Batool
DECISION TREES
PARAMETRIC ML ALGORITHMS
Assumptions can greatly simplify the learning process, but can also limit what can be learned.
Algorithms that simplify the function to a known form are called parametric machine learning algorithms.
These algorithms involve two steps:
1. Select a form for the function.
2. Learn the coefficients for the function from the training data.
Examples: Logistic Regression, Linear Regression, Linear Discriminant Analysis, Perceptron, Naive
Bayes, Simple Neural Networks
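To make these two steps concrete, here is a minimal sketch (assuming scikit-learn is available; the toy feature matrix X and labels y are hypothetical, not from the lecture): the functional form is fixed to a logistic model, and only its coefficients are learned from the training data.

```python
# A minimal sketch of a parametric learner: the functional form is fixed
# in advance (a linear decision boundary through a sigmoid), and learning
# only estimates its coefficients. X and y are hypothetical toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # one feature
y = np.array([0, 0, 0, 1, 1, 1])                          # binary labels

model = LogisticRegression()          # step 1: choose the form sigmoid(w*x + b)
model.fit(X, y)                       # step 2: learn w and b from the data

print(model.coef_, model.intercept_)  # the learned coefficients
print(model.predict([[2.5], [4.5]]))  # predictions from the fixed form
```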
Benefits of Parametric Machine Learning Algorithms:
•Simpler: Easier to understand and interpret results.
•Speed: Very fast to learn from data.
•Less Data: Do not require as much training data and can work well even if the fit is not perfect.
Limitations of Parametric Machine Learning Algorithms:
•Constrained: By choosing a functional form these methods are highly constrained to the specified
form.
•Limited Complexity: The methods are more suited to simpler problems.
•Poor Fit: In practice the methods are unlikely to match the underlying mapping function.
NON-PARAMETRIC ML ALGORITHMS
Algorithms that do not make strong assumptions about the form of the mapping function are called
nonparametric machine learning algorithms. By not making assumptions, they are free to learn any
functional form from the training data.
Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you
don’t want to worry too much about choosing just the right features.
Examples: k-Nearest Neighbors, Decision Trees, Support Vector Machines
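For contrast, a minimal sketch of a nonparametric learner under the same assumptions (scikit-learn, hypothetical toy data): no functional form is fixed in advance, and the structure of the tree is grown from the training data.

```python
# A minimal sketch of a nonparametric learner: no functional form is fixed
# beforehand; the tree's structure (splits, depth) is learned from the data.
# X and y are hypothetical toy data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 1, 1, 0, 0])   # a pattern a single linear boundary cannot fit

tree = DecisionTreeClassifier()    # no coefficients or form chosen up front
tree.fit(X, y)                     # the tree structure comes from the data

print(tree.get_depth(), tree.get_n_leaves())  # learned structure
print(tree.predict([[3.5], [5.5]]))           # expected predictions: [1, 0]
```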
Benefits of Nonparametric Machine Learning Algorithms:
•Flexibility: Capable of fitting a large number of functional forms.
•Power: No assumptions (or weak assumptions) about the underlying function.
•Performance: Can result in higher performance models for prediction.
Limitations of Nonparametric Machine Learning Algorithms:
•More Data: Require a lot more training data to estimate the mapping function.
•Slower: A lot slower to train as they often have far more parameters to train.
•Overfitting: More of a risk to overfit the training data, and it is harder to explain why specific predictions are made.
CLASSIFICATION
The classification of an unknown input vector is done by traversing the tree from
the root node to a leaf node.
A record enters the tree at the root node.
At the root node, a test is applied to determine which child node the record will
encounter next.
This process is repeated until the record arrives at a leaf node.
All the records that end up at a given leaf of the tree are classified in the same way.
There is a unique path from the root to each leaf.
The path is a rule which is used to classify the records.
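A minimal sketch of this root-to-leaf traversal, using a small hand-built tree (the attribute names, thresholds, and class labels are purely illustrative):

```python
# A minimal sketch of classifying a record by walking a decision tree from
# the root node down to a leaf node. The tree below is hand-built and
# hypothetical: internal nodes hold a test, leaves hold a class label.

class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, attribute, threshold, left, right):
        self.attribute = attribute   # feature tested at this node
        self.threshold = threshold   # test: record[attribute] <= threshold
        self.left = left             # child followed when the test is true
        self.right = right           # child followed when the test is false

def classify(node, record):
    """Follow the unique root-to-leaf path determined by the record."""
    while isinstance(node, Node):
        node = node.left if record[node.attribute] <= node.threshold else node.right
    return node.label                # every record reaching this leaf gets this label

# Hypothetical tree: first test "experience", then "hours_practiced".
root = Node("experience", 5,
            Leaf("no"),
            Node("hours_practiced", 10, Leaf("no"), Leaf("yes")))

print(classify(root, {"experience": 8, "hours_practiced": 12}))  # -> "yes"
```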
[Figure: example split of a parent node with class counts 40/60 into child nodes with counts 28/42 and 12/18]
$\mathrm{Entropy}(T) = -\sum_{l=1}^{k} p_l \log_2 p_l$
Min Entropy = 0 (no impurity)
Max Entropy = 1 (max impurity for binary classes)
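A short sketch of this entropy computation from class counts (the counts used below are illustrative):

```python
# A minimal sketch of Entropy(T) = -sum over classes of p_l * log2(p_l),
# computed from the class counts at a node. Counts below are illustrative.
import math

def entropy(counts):
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c > 0:                    # a class with zero samples contributes 0
            p = c / total
            ent -= p * math.log2(p)
    return ent

print(entropy([50, 50]))   # 1.0    -> maximum impurity for two balanced classes
print(entropy([100, 0]))   # 0.0    -> pure node, no impurity
print(entropy([30, 10]))   # ~0.811 -> somewhere in between
```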
$IG = I - \frac{N_{left}}{N} I_{left} - \frac{N_{right}}{N} I_{right}$
IG – Information Gain
I – Impurity calculated on parent node (Gini or Entropy)
Ileft – Impurity calculated on left child node
Iright – Impurity calculated on right child node
N – Total no. of samples
Nleft – No. of samples at left child node
Nright – No. of samples at right child node
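A minimal sketch of this information-gain formula, using the Gini index as the impurity measure I (the helper names gini and information_gain are illustrative):

```python
# A minimal sketch of IG = I - (N_left / N) * I_left - (N_right / N) * I_right,
# with the Gini index 1 - sum of p_l^2 as the impurity at each node.

def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def information_gain(parent_counts, left_counts, right_counts):
    """Impurity of the parent minus the weighted impurity of the children."""
    n = sum(parent_counts)
    n_left, n_right = sum(left_counts), sum(right_counts)
    return (gini(parent_counts)
            - (n_left / n) * gini(left_counts)
            - (n_right / n) * gini(right_counts))
```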
INFORMATION GAIN FOR A1
At root node: $I = 1 - \left(\frac{29}{64}\right)^2 - \left(\frac{35}{64}\right)^2 = 0.496$
At left node: $I_{left} = 1 - \left(\frac{21}{26}\right)^2 - \left(\frac{5}{26}\right)^2 = 0.310$
At right node: $I_{right} = 1 - \left(\frac{8}{38}\right)^2 - \left(\frac{30}{38}\right)^2 = 0.332$
$IG = I - \frac{N_{left}}{N} I_{left} - \frac{N_{right}}{N} I_{right}$
$IG = 0.496 - \frac{26}{64}(0.310) - \frac{38}{64}(0.332)$
$IG = 0.496 - 0.324$
$IG = 0.172$
INFORMATION GAIN FOR A2
At root node: $I = 1 - \left(\frac{29}{64}\right)^2 - \left(\frac{35}{64}\right)^2 = 0.496$
At left node: $I_{left} = 1 - \left(\frac{18}{51}\right)^2 - \left(\frac{33}{51}\right)^2 = 0.457$
At right node: $I_{right} = 1 - \left(\frac{11}{13}\right)^2 - \left(\frac{2}{13}\right)^2 = 0.260$
$IG = I - \frac{N_{left}}{N} I_{left} - \frac{N_{right}}{N} I_{right}$
$IG = 0.496 - \frac{51}{64}(0.457) - \frac{13}{64}(0.260)$
$IG = 0.496 - 0.417$
$IG = 0.079$
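Assuming the gini and information_gain helpers sketched after the formula above, the two worked examples can be reproduced and compared as follows:

```python
# Parent node: 29 and 35 samples of the two classes.
# Split A1 gives children 21/5 and 8/30; split A2 gives children 18/33 and 11/2.
print(round(information_gain([29, 35], [21, 5], [8, 30]), 3))   # A1: 0.172
print(round(information_gain([29, 35], [18, 33], [11, 2]), 3))  # A2: 0.079
# A1 yields the larger information gain, so it is the better split at this node.
```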
[Figure: plot of error rate (from 0%) on training data vs. evaluation data against performance]
[Figure: Experience axis (0 years, 5 years, 20 years) with regions R1 and R3 marked]
TRAINING DATA EXAMPLE: THE GOAL IS TO PREDICT WHETHER THE PLAYER WILL PLAY TENNIS