CS 171, Lecture 18: Introduction to Learning
Entropy
Information Gain
Cross-validation
Automated Learning
• Types of learning
– Supervised learning
• Learning a mapping from a set of inputs to a target variable
– Classification: target variable is discrete (e.g., spam email)
– Regression: target variable is real-valued (e.g., stock market)
– Unsupervised learning
• No target variable provided
– Clustering: grouping data into K groups
Simple illustrative learning problem
Problem: decide whether to wait for a table at a restaurant, based on a set
of observed attributes of the situation.
• Attributes
– Also known as features, variables, independent variables, covariates
• Target Variable
– Also known as goal predicate, dependent variable, …
• Classification
– Also known as discrimination, supervised classification, …
• Error function
– Objective function, loss function, …
Inductive learning
• Examples:
– h(x; θ) = sign(w1x1 + w2x2 + w3), with parameters θ = (w1, w2, w3)
• Once we decide on the functional form of h and on the error function E,
machine learning typically reduces to a large search or optimization
problem (a minimal sketch of this reduction follows below)
• Additional aspect: we really want to learn an h(..) that will generalize well to new
data, not just memorize training data – will return to this later
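To make the reduction concrete, here is a minimal Python sketch of the linear-threshold hypothesis and classification-error objective above; the toy data set and the crude grid search over θ are illustrative assumptions, not the course's method:

def h(x, theta):
    # linear threshold unit: theta = (w1, w2, w3), input x = (x1, x2)
    w1, w2, w3 = theta
    return 1 if w1 * x[0] + w2 * x[1] + w3 >= 0 else -1

def classification_error(theta, data):
    # E(h): fraction of labeled examples (x, y) that h(x; theta) gets wrong
    return sum(h(x, theta) != y for x, y in data) / len(data)

# "learning" = searching for the theta that minimizes E, here by grid search
data = [((0.0, 1.0), 1), ((1.0, 0.0), -1), ((2.0, 2.0), 1), ((1.5, 0.2), -1)]
grid = [(a, b, c) for a in (-1, 0, 1) for b in (-1, 0, 1) for c in (-1, 0, 1)]
best = min(grid, key=lambda th: classification_error(th, data))
print(best, classification_error(best, data))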
Our training data example (again)
• If all attributes were binary, h(..) could be any arbitrary Boolean function
• Natural error function E(h) to use is classification error, i.e., how many incorrect
predictions does a hypothesis h make
• Observations:
– Huge hypothesis spaces -> directly searching over all functions is impossible
– Given a small data set (n pairs), our learning problem may be underconstrained
• Ockham’s razor: if multiple candidate functions all explain the data
equally well, pick the simplest explanation (least complex function)
• Constrain our search to classes of Boolean functions, e.g.,
– decision trees
– weighted linear sums of inputs (e.g., perceptrons)
Decision Tree Learning
• We have talked about binary variables up until now, but we can trivially
extend to multi-valued variables
Pseudocode for decision tree learning
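A Python sketch in the style of the standard top-down decision-tree learner (Russell & Norvig's DTL); the representation of examples as (attribute-dict, label) pairs, the set of attributes, and all helper names are illustrative assumptions:

import math
from collections import Counter

def entropy(labels):
    # H of the empirical label distribution, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr, examples):
    # entropy reduction from splitting the examples on attr
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        sub = [y for x, y in examples if x[attr] == v]
        remainder += len(sub) / len(examples) * entropy(sub)
    return entropy(labels) - remainder

def plurality_value(examples):
    # most common class label among the examples
    return Counter(y for _, y in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, parent_examples=()):
    if not examples:
        return plurality_value(parent_examples)   # no data: parent's majority class
    if len({y for _, y in examples}) == 1:
        return examples[0][1]                     # all one class: pure leaf
    if not attributes:
        return plurality_value(examples)          # attributes exhausted: majority vote
    best = max(attributes, key=lambda a: information_gain(a, examples))
    tree = {best: {}}
    for v in {x[best] for x, _ in examples}:      # one branch per observed value
        subset = [(x, y) for x, y in examples if x[best] == v]
        tree[best][v] = decision_tree_learning(subset, attributes - {best}, examples)
    return tree

# tiny usage example (attribute values and labels are made up):
examples = [({"Patrons": "Some"}, True), ({"Patrons": "None"}, False),
            ({"Patrons": "Full"}, False), ({"Patrons": "Some"}, True)]
print(decision_tree_learning(examples, {"Patrons"}))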
Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are (ideally)
"all positive" or "all negative"
[Figure: the binary entropy H(p) = -p log2(p) - (1-p) log2(1-p), which is
0 at p = 0 or p = 1 and maximal (1 bit) at p = 0.5]
Information Gain
IG(Patrons) = 1 - [ (2/12) H(0,1) + (4/12) H(1,0) + (6/12) H(2/6, 4/6) ] ≈ 0.541 bits
IG(Type) = 1 - [ (2/12) H(1/2,1/2) + (2/12) H(1/2,1/2) + (4/12) H(2/4,2/4) + (4/12) H(2/4,2/4) ] = 0 bits
(the leading 1 is the entropy of the full training set, H(6/12, 6/12) = 1 bit)
Patrons has the highest IG of all attributes and so is chosen by the learning algorithm
as the root
Information gain is then repeatedly applied at internal nodes until all leaves contain
only examples from one class or the other
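The two gains above are easy to verify; a short check in Python assuming only the split counts shown (Patrons: 2/4/6 examples; Type: 2/2/4/4, each split evenly between classes):

import math

def H(*p):
    # entropy in bits of a distribution given as probabilities
    return -sum(q * math.log2(q) for q in p if q > 0)

# IG(Patrons): the 12 examples split as None=2, Some=4, Full=6
ig_patrons = 1 - (2/12 * H(0, 1) + 4/12 * H(1, 0) + 6/12 * H(2/6, 4/6))
# IG(Type): French=2, Italian=2, Thai=4, Burger=4, each half positive
ig_type = 1 - (2/12 * H(1/2, 1/2) + 2/12 * H(1/2, 1/2)
               + 4/12 * H(2/4, 2/4) + 4/12 * H(2/4, 2/4))
print(round(ig_patrons, 3), round(ig_type, 3))   # -> 0.541 0.0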
Decision Tree Learned
Reasons the learned tree may perform worse on new data than on training data:
- the classifier may not have enough data to fully learn the concept (but
on training data alone we cannot tell)
- for noisy data, the classifier may overfit the training data
With large data sets we can partition our data into two subsets, train and test
- build a model on the training data
- assess performance on the test data
Example of Test Performance
Restaurant problem
- simulate 100 data sets of different sizes
- train on this data, and assess performance on an independent test set
- learning curve: plot accuracy as a function of training set size (a code sketch of this procedure follows below)
- typical “diminishing returns” effect (some nice theory to explain this)
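A sketch of this learning-curve procedure in Python; the majority-class learner and the synthetic data are stand-ins so the example is self-contained (a real run would plug in the decision-tree learner):

import random
from collections import Counter

def train(examples):
    # toy stand-in learner: always predict the majority class
    majority = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: majority

def accuracy(model, examples):
    return sum(model(x) == y for x, y in examples) / len(examples)

def learning_curve(data, sizes, trials=100):
    # mean test accuracy as a function of training-set size
    curve = []
    for m in sizes:
        accs = []
        for _ in range(trials):
            random.shuffle(data)
            train_set, test_set = data[:m], data[m:]
            accs.append(accuracy(train(train_set), test_set))
        curve.append(sum(accs) / trials)
    return curve

data = [((random.random(),), random.random() < 0.7) for _ in range(200)]
print(learning_curve(data, sizes=[10, 50, 100]))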
Overfitting and Underfitting
A Complex Model: Y = high-order polynomial in X
A Much Simpler Model: Y = aX + b + noise
[Figures: the same (X, Y) data fit by each model]
How Overfitting affects Prediction
[Figure: predictive error as a function of model complexity, with an
underfitting region at low complexity, an overfitting region at high
complexity, and an ideal range for model complexity in between]
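One way to see this curve concretely: fit polynomials of increasing degree to noisy linear data (the simple model Y = aX + b + noise from above) and compare training and test error; the degrees, sample size, and noise level are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = 2 * x + 1 + rng.normal(0, 0.3, 30)        # true model: Y = aX + b + noise
x_tr, y_tr, x_te, y_te = x[:15], y[:15], x[15:], y[15:]

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
    err = lambda xs, ys: np.mean((np.polyval(coeffs, xs) - ys) ** 2)
    print(degree, round(err(x_tr, y_tr), 3), round(err(x_te, y_te), 3))
# training error falls as degree grows; test error typically rises past some point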
Training and Validation Data
[Figure: the data is divided into disjoint training and validation sets;
in the 1st partition one block is held out as validation data and the rest
is training data, in the 2nd partition a different disjoint block is held
out, and so on]
• Notes
– cross-validation generates an approximate estimate of how well the
learned model will do on “unseen” data (a minimal k-fold sketch follows below)
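A minimal k-fold cross-validation sketch in Python, matching the disjoint-partition picture above; the majority-class train and the accuracy helper are illustrative stand-ins for a real learner (e.g., a decision tree):

from collections import Counter

def k_fold_cv(data, k, train, accuracy):
    # average validation accuracy over k disjoint held-out blocks
    fold = len(data) // k
    scores = []
    for i in range(k):
        validation = data[i * fold:(i + 1) * fold]          # i-th partition held out
        training = data[:i * fold] + data[(i + 1) * fold:]  # the rest is training data
        scores.append(accuracy(train(training), validation))
    return sum(scores) / k

# usage with a toy majority-class learner standing in for a real classifier:
train = lambda ex: Counter(y for _, y in ex).most_common(1)[0][0]
accuracy = lambda label, ex: sum(y == label for _, y in ex) / len(ex)
data = [((i,), i % 3 != 0) for i in range(90)]
print(k_fold_cv(data, k=5, train=train, accuracy=accuracy))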
• Inductive learning
– Error function, class of hypothesis/models {h}
– Want to minimize E on our training data
– Example: decision tree learning
• Generalization
– Training data error is over-optimistic
– We want to see performance on test data
– Cross-validation is a useful practical approach