CS 171, Lecture 18: Introduction to Learning
Entropy
Information Gain
Cross-validation
Automated Learning
• Types of learning
– Supervised learning
• Learning a mapping from a set of inputs to a target variable
– Classification: target variable is discrete (e.g., spam email)
– Regression: target variable is real-valued (e.g., stock market)
– Unsupervised learning
• No target variable provided
– Clustering: grouping data into K groups
Simple illustrative learning problem
Problem: decide whether to wait for a table at a restaurant, based on a set
of observed attributes of the situation.
• Attributes
– Also known as features, variables, independent variables, covariates
• Target Variable
– Also known as goal predicate, dependent variable, …
• Classification
– Also known as discrimination, supervised classification, …
• Error function
– Objective function, loss function, …
Inductive learning
• Examples:
– h(x; θ) = sign(w1x1 + w2x2 + w3), with parameters θ = (w1, w2, w3)
• Once we decide on the functional form of h and on the error function E,
machine learning typically reduces to a large search or optimization
problem (a minimal sketch of this reduction follows below)
• Additional aspect: we really want to learn an h(..) that will generalize well to new
data, not just memorize training data – will return to this later
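To make the reduction concrete, here is a minimal Python sketch of the linear-threshold hypothesis and classification-error objective above; the toy data set and the crude grid search over θ are illustrative assumptions, not the course's method:

def h(x, theta):
    # linear threshold unit: theta = (w1, w2, w3), input x = (x1, x2)
    w1, w2, w3 = theta
    return 1 if w1 * x[0] + w2 * x[1] + w3 >= 0 else -1

def classification_error(theta, data):
    # E(h): fraction of labeled examples (x, y) that h(x; theta) gets wrong
    return sum(h(x, theta) != y for x, y in data) / len(data)

# "learning" = searching for the theta that minimizes E, here by grid search
data = [((0.0, 1.0), 1), ((1.0, 0.0), -1), ((2.0, 2.0), 1), ((1.5, 0.2), -1)]
grid = [(a, b, c) for a in (-1, 0, 1) for b in (-1, 0, 1) for c in (-1, 0, 1)]
best = min(grid, key=lambda th: classification_error(th, data))
print(best, classification_error(best, data))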
Our training data example (again)
• If all attributes were binary, h(..) could be any arbitrary Boolean function
• Natural error function E(h) to use is classification error, i.e., how many incorrect
predictions does a hypothesis h make
• Observations:
– Huge hypothesis spaces -> directly searching over all functions is impossible
– Given a small data set (n pairs), our learning problem may be underconstrained
• Ockham’s razor: if multiple candidate functions all explain the data
equally well, pick the simplest explanation (least complex function)
• Constrain our search to classes of Boolean functions, e.g.,
– decision trees
– weighted linear sums of inputs (e.g., perceptrons)
Decision Tree Learning
• We have talked about binary variables up until now, but we can trivially
extend to multi-valued variables
Pseudocode for decision tree learning
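A Python sketch in the style of the standard top-down decision-tree learner (Russell & Norvig's DTL); the representation of examples as (attribute-dict, label) pairs, the set of attributes, and all helper names are illustrative assumptions:

import math
from collections import Counter

def entropy(labels):
    # H of the empirical label distribution, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr, examples):
    # entropy reduction from splitting the examples on attr
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        sub = [y for x, y in examples if x[attr] == v]
        remainder += len(sub) / len(examples) * entropy(sub)
    return entropy(labels) - remainder

def plurality_value(examples):
    # most common class label among the examples
    return Counter(y for _, y in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, parent_examples=()):
    if not examples:
        return plurality_value(parent_examples)   # no data: parent's majority class
    if len({y for _, y in examples}) == 1:
        return examples[0][1]                     # all one class: pure leaf
    if not attributes:
        return plurality_value(examples)          # attributes exhausted: majority vote
    best = max(attributes, key=lambda a: information_gain(a, examples))
    tree = {best: {}}
    for v in {x[best] for x, _ in examples}:      # one branch per observed value
        subset = [(x, y) for x, y in examples if x[best] == v]
        tree[best][v] = decision_tree_learning(subset, attributes - {best}, examples)
    return tree

# tiny usage example (attribute values and labels are made up):
examples = [({"Patrons": "Some"}, True), ({"Patrons": "None"}, False),
            ({"Patrons": "Full"}, False), ({"Patrons": "Some"}, True)]
print(decision_tree_learning(examples, {"Patrons"}))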
Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are (ideally)
"all positive" or "all negative"
[Figure: the binary entropy H(p) = -p log2(p) - (1-p) log2(1-p), which is
0 at p = 0 or p = 1 and maximal (1 bit) at p = 0.5]
Information Gain
IG(Patrons) = 1 - [ (2/12) H(0,1) + (4/12) H(1,0) + (6/12) H(2/6, 4/6) ] ≈ 0.541 bits
IG(Type) = 1 - [ (2/12) H(1/2,1/2) + (2/12) H(1/2,1/2) + (4/12) H(2/4,2/4) + (4/12) H(2/4,2/4) ] = 0 bits
(the leading 1 is the entropy of the full training set, H(6/12, 6/12) = 1 bit)
Patrons has the highest IG of all attributes and so is chosen by the learning algorithm
as the root
Information gain is then repeatedly applied at internal nodes until all leaves contain
only examples from one class or the other
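The two gains above are easy to verify; a short check in Python assuming only the split counts shown (Patrons: 2/4/6 examples; Type: 2/2/4/4, each split evenly between classes):

import math

def H(*p):
    # entropy in bits of a distribution given as probabilities
    return -sum(q * math.log2(q) for q in p if q > 0)

# IG(Patrons): the 12 examples split as None=2, Some=4, Full=6
ig_patrons = 1 - (2/12 * H(0, 1) + 4/12 * H(1, 0) + 6/12 * H(2/6, 4/6))
# IG(Type): French=2, Italian=2, Thai=4, Burger=4, each half positive
ig_type = 1 - (2/12 * H(1/2, 1/2) + 2/12 * H(1/2, 1/2)
               + 4/12 * H(2/4, 2/4) + 4/12 * H(2/4, 2/4))
print(round(ig_patrons, 3), round(ig_type, 3))   # -> 0.541 0.0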
Decision Tree Learned
Reasons the learned tree may perform worse on new data than on training data:
- the classifier may not have enough data to fully learn the concept (but
on training data alone we cannot tell)
- for noisy data, the classifier may overfit the training data
With large data sets we can partition our data into two subsets, train and test
- build a model on the training data
- assess performance on the test data
Example of Test Performance
Restaurant problem
- simulate 100 data sets of different sizes
- train on this data, and assess performance on an independent test set
- learning curve: plot accuracy as a function of training set size (a code sketch of this procedure follows below)
- typical “diminishing returns” effect (some nice theory to explain this)
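A sketch of this learning-curve procedure in Python; the majority-class learner and the synthetic data are stand-ins so the example is self-contained (a real run would plug in the decision-tree learner):

import random
from collections import Counter

def train(examples):
    # toy stand-in learner: always predict the majority class
    majority = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: majority

def accuracy(model, examples):
    return sum(model(x) == y for x, y in examples) / len(examples)

def learning_curve(data, sizes, trials=100):
    # mean test accuracy as a function of training-set size
    curve = []
    for m in sizes:
        accs = []
        for _ in range(trials):
            random.shuffle(data)
            train_set, test_set = data[:m], data[m:]
            accs.append(accuracy(train(train_set), test_set))
        curve.append(sum(accs) / trials)
    return curve

data = [((random.random(),), random.random() < 0.7) for _ in range(200)]
print(learning_curve(data, sizes=[10, 50, 100]))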
Overfitting and Underfitting
A Complex Model: Y = high-order polynomial in X
A Much Simpler Model: Y = aX + b + noise
[Figures: the same (X, Y) data fit by each model]
How Overfitting affects Prediction
[Figure: predictive error as a function of model complexity, with an
underfitting region at low complexity, an overfitting region at high
complexity, and an ideal range for model complexity in between]
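One way to see this curve concretely: fit polynomials of increasing degree to noisy linear data (the simple model Y = aX + b + noise from above) and compare training and test error; the degrees, sample size, and noise level are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = 2 * x + 1 + rng.normal(0, 0.3, 30)        # true model: Y = aX + b + noise
x_tr, y_tr, x_te, y_te = x[:15], y[:15], x[15:], y[15:]

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
    err = lambda xs, ys: np.mean((np.polyval(coeffs, xs) - ys) ** 2)
    print(degree, round(err(x_tr, y_tr), 3), round(err(x_te, y_te), 3))
# training error falls as degree grows; test error typically rises past some point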
Training and Validation Data
[Figure: the data is divided into disjoint training and validation sets;
in the 1st partition one block is held out as validation data and the rest
is training data, in the 2nd partition a different disjoint block is held
out, and so on]
• Notes
– cross-validation generates an approximate estimate of how well the
learned model will do on “unseen” data (a minimal k-fold sketch follows below)
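A minimal k-fold cross-validation sketch in Python, matching the disjoint-partition picture above; the majority-class train and the accuracy helper are illustrative stand-ins for a real learner (e.g., a decision tree):

from collections import Counter

def k_fold_cv(data, k, train, accuracy):
    # average validation accuracy over k disjoint held-out blocks
    fold = len(data) // k
    scores = []
    for i in range(k):
        validation = data[i * fold:(i + 1) * fold]          # i-th partition held out
        training = data[:i * fold] + data[(i + 1) * fold:]  # the rest is training data
        scores.append(accuracy(train(training), validation))
    return sum(scores) / k

# usage with a toy majority-class learner standing in for a real classifier:
train = lambda ex: Counter(y for _, y in ex).most_common(1)[0][0]
accuracy = lambda label, ex: sum(y == label for _, y in ex) / len(ex)
data = [((i,), i % 3 != 0) for i in range(90)]
print(k_fold_cv(data, k=5, train=train, accuracy=accuracy))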
• Inductive learning
– Error function, class of hypothesis/models {h}
– Want to minimize E on our training data
– Example: decision tree learning
• Generalization
– Training data error is over-optimistic
– We want to see performance on test data
– Cross-validation is a useful practical approach