Learning
Chapter 18.1-18.3
Why Learn?
• Understand and improve efficiency of human learning
– Use to improve methods for teaching and tutoring people (e.g.,
better computer-aided instruction)
• Discover new things or structure that were previously
unknown to humans
– Examples: data mining, scientific discovery
• Fill in skeletal or incomplete specifications about a domain
– Large, complex AI systems cannot be completely derived by hand
and require dynamic updating to incorporate new information.
– Learning new characteristics expands the domain of expertise and
lessens the “brittleness” of the system
• Build software agents that can adapt to their users or to
other software agents
Major Paradigms of Machine Learning
• Rote learning – One-to-one mapping from inputs to stored
representation. “Learning by memorization.” Association-based
storage and retrieval.
• Induction – Use specific examples to reach general conclusions
• Clustering – Unsupervised identification of natural groups in data
• Analogy – Determine correspondence between two different
representations
• Discovery – Unsupervised, specific goal not given
• Genetic algorithms – “Evolutionary” search techniques, based on
an analogy to “survival of the fittest”
• Reinforcement – Feedback (positive or negative reward) given at
the end of a sequence of steps
Classification Learning: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the
attributes is the class
• Find a model for the class attribute as a function of the
values of the other attributes
• Goal: previously unseen records should be assigned a class
as accurately as possible
– Use a test set to estimate the accuracy of the model
– Often, the given data set is divided into training and test
sets, with the training set used to build the model and the
test set used to validate it
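A rough sketch of this train/test workflow, assuming scikit-learn; the names X, y and the choice of classifier are illustrative, not part of the slides:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def estimate_accuracy(X, y):
    # Hold out 30% of the records as a test set (illustrative split)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier().fit(X_train, y_train)  # build model on training set
    y_pred = model.predict(X_test)                          # apply model to test set
    return accuracy_score(y_test, y_pred)                   # estimate accuracy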
Illustrating Classification Learning
[Figure: records in a Training Set (attributes Attrib1–Attrib3 plus a Class label)
are fed to a learning algorithm, which induces a model; the model is then applied
to a Test Set whose class values are unknown.]
Examples of Classification Task
• Predicting tumor cells as benign or malignant
Model Spaces
• Decision trees
– Partition the instance space into axis-parallel regions, labeled with class
value
• Nearest-neighbor classifiers
– Partition the instance space into regions defined by the stored training
instances (or, for k-NN, by the k nearest instances)
• Bayesian networks (probabilistic dependencies of class on attributes)
– Naïve Bayes: special case of BNs where each attribute is conditionally
independent given the class
• Neural networks
– Nonlinear feed-forward functions of attribute values
• Support vector machines
– Find a separating plane in a high-dimensional feature space
• Association rules (feature values → class)
• First-order logical rules
Learning Decision Trees
• Goal: Build a decision tree to classify
examples as positive or negative
instances of a concept using supervised
learning from a training set
• A decision tree is a tree where
– each non-leaf node has associated with it
an attribute (feature)
– each leaf node has associated with it a
classification (+ or -)
– each arc has associated with it one of the
possible values of the attribute at the node
from which the arc is directed
• Generalization: allow for >2 classes
– e.g., {sell, hold, buy}
Example of a Decision Tree
Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
1  Yes  Single    125K  No
2  No   Married   100K  No
3  No   Single    70K   No
4  Yes  Married   120K  No
5  No   Divorced  95K   Yes
6  No   Married   60K   No
7  Yes  Divorced  220K  No
8  No   Single    85K   Yes
9  No   Married   75K   No
10 No   Single    90K   Yes
Induced tree: split on Refund (Yes → NO); if Refund = No, split on MarSt
(Married → NO); if Single or Divorced, split on TaxInc (< 80K → NO, > 80K → YES)
Apply Model to Test Data
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
• Start at the root: Refund = No, so follow the No branch to MarSt
• MarSt = Married, so follow the Married branch to the leaf NO
• Assign Cheat to “No”
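The traversal above can be written directly as nested conditionals. A sketch that hard-codes the example tree and applies it to the test record (the dictionary keys are illustrative names):

def classify_cheat(record):
    # Root split: Refund = Yes -> NO
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: split on marital status; Married -> NO
    if record["MaritalStatus"] == "Married":
        return "No"
    # Single or Divorced: split on taxable income (< 80K -> NO, otherwise YES)
    return "No" if record["TaxableIncome"] < 80_000 else "Yes"

# Test record from the slides: Refund = No, Married, Taxable Income = 80K
print(classify_cheat({"Refund": "No", "MaritalStatus": "Married",
                      "TaxableIncome": 80_000}))  # -> "No"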
Information Theory
• Information is measured in bits
• Information conveyed by a message depends on its probability
• With n equally probable possible messages, the probability p
of each is 1/n
• Information conveyed by a message is log2(n) = -log2(p) bits
– e.g., with 16 equally likely messages, log2(16) = 4, so we need 4
bits to identify/send each message
• Given probability distribution for n messages P = (p1,p2…pn),
the information conveyed by distribution (aka entropy of P) is:
I(P) = -(p1*log2 (p1) + p2*log2 (p2) + .. + pn*log2 (pn))
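A minimal sketch of this computation in Python (nothing assumed beyond the formula above):

import math

def entropy(probs):
    # I(P) = -(p1*log2(p1) + ... + pn*log2(pn)), measured in bits
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/16] * 16))  # 16 equally likely messages -> 4.0 bits
print(entropy([0.5, 0.5]))   # fair coin -> 1.0 bit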
Using Gain Ratios
• The information gain criterion favors attributes that have a large
number of values
– If we have an attribute D that has a distinct value for each
record, then Info(D,T) is 0, thus Gain(D,T) is maximal
• To compensate for this Quinlan suggests using the following
ratio instead of Gain:
GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)
• SplitInfo(D,T) is the information due to the split of T on the
basis of the value of the categorical attribute D
SplitInfo(D,T) = I(|T1|/|T|, |T2|/|T|, .., |Tm|/|T|)
where {T1, T2, .. Tm} is the partition of T induced by the values of D
Computing Gain Ratio
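The worked example from the original slides is omitted here, but the quantities above are short to compute. A sketch, assuming class labels and an attribute's values are given as parallel lists (the helper names are mine); the example reuses the Refund and Cheat columns from the earlier table:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    # Gain(D,T) = Info(T) - sum_i |Ti|/|T| * Info(Ti), Ti = records with D's i-th value
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def split_info(attr_values):
    # SplitInfo(D,T) = I(|T1|/|T|, ..., |Tm|/|T|)
    n = len(attr_values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(attr_values).values())

def gain_ratio(attr_values, labels):
    return info_gain(attr_values, labels) / split_info(attr_values)

# Refund attribute vs. Cheat class from the decision-tree example table
refund = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(info_gain(refund, cheat), gain_ratio(refund, cheat))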
Choosing the Best Attribute
• The key problem is choosing which attribute to split a given
set of examples
• Some possibilities are:
– Random: Select any attribute at random
– Least-Values: Choose the attribute with the smallest number of
possible values
– Most-Values: Choose the attribute with the largest number of
possible values
– Max-Gain: Choose the attribute that has the largest expected
information gain, i.e., the attribute that will result in the smallest
expected size of the subtrees rooted at its children
• The ID3 algorithm uses the Max-Gain method of selecting
the best attribute
Measuring Model Quality
• How good is a model?
– Predictive accuracy
– False positives / false negatives for a given cutoff threshold
• Loss function (accounts for cost of different types of errors)
– Area under the (ROC) curve
– Minimizing loss can lead to problems with overfitting
• Training error
– Train on all data; measure error on all data
– Subject to overfitting (of course we’ll make good predictions on the data on
which we trained!)
• Regularization
– Attempt to avoid overfitting
– Explicitly minimize the complexity of the function while minimizing loss.
Tradeoff is modeled with a regularization parameter
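As a sketch of the loss-plus-complexity tradeoff described above, using an L2 penalty on the weights as one common choice (function and parameter names are illustrative):

import numpy as np

def regularized_loss(w, X, y, lam):
    # data-fit term (mean squared error) + lam * complexity term (L2 norm of weights)
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)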
Cross-Validation
• Holdout cross-validation:
– Divide data into training set and test set
– Train on training set; measure error on test set
– Better than training error, since we are measuring generalization to
new data
– To get a good estimate, we need a reasonably large test set
– But this gives less data to train on, reducing our model quality!
Cross-Validation, cont.
• k-fold cross-validation:
– Divide data into k folds
– Train on k-1 folds, use the kth fold to measure error
– Repeat k times; use average error to measure generalization
accuracy
– Statistically valid and gives good accuracy estimates
• Leave-one-out cross-validation (LOOCV)
– k-fold cross validation where k=N (test data = 1 instance!)
– Quite accurate, but also quite expensive, since it requires building N
models
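A minimal sketch of k-fold cross-validation, assuming scikit-learn and NumPy arrays X and y (illustrative names, not from the slides):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def kfold_accuracy(X, y, k=10):
    errors = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])   # train on k-1 folds
        errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))  # error on held-out fold
    return 1 - np.mean(errors)  # average over the k folds; k = len(y) gives LOOCV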
Summary: Decision Tree Learning
• Inducing decision trees is one of the most widely used
learning methods in practice
• Can out-perform human experts in many problems
• Strengths include
– Fast
– Simple to implement
– Can convert result to a set of easily interpretable rules
– Empirically valid in many commercial products
– Handles noisy data
• Weaknesses include:
– Univariate splits/partitioning use only one attribute at a time, which limits
the types of possible trees
– Large decision trees may be hard to understand
– Requires fixed-length feature vectors
– Non-incremental (i.e., batch method)