Chapter 7
Instance-Based Learning
Instance-based learning
Learning prototypes
Speed up classification, combat noise
• David Aha’s IB2: save memory, speed up classification
• Work incrementally
• Only incorporate misclassified instances
• Problem: noisy data gets incorporated
• David Aha’s IB3: deal with noise
• Discard instances that do not perform well
• Compute confidence intervals for
1. Each instance’s success rate
2. Default accuracy of the instance’s class
• Accept/reject instances according to performance (see the sketch below)
1. Accept if the lower limit of 1 exceeds the upper limit of 2
2. Reject if the upper limit of 1 is below the lower limit of 2
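A minimal sketch of the accept/reject test, assuming simple normal-approximation confidence intervals for proportions (the z values for the ~90% acceptance and ~70% rejection levels are illustrative, not from the slides):

import math

def proportion_ci(successes, trials, z):
    # Normal-approximation confidence interval for a success rate
    if trials == 0:
        return 0.0, 1.0
    p = successes / trials
    half = z * math.sqrt(p * (1.0 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)

def ib3_status(inst_hits, inst_tries, class_hits, class_tries):
    lo_acc, _ = proportion_ci(inst_hits, inst_tries, z=1.28)   # ~90% level
    _, hi_acc = proportion_ci(class_hits, class_tries, z=1.28)
    if lo_acc > hi_acc:
        return "accept"      # instance clearly beats its class's default accuracy
    _, hi_rej = proportion_ci(inst_hits, inst_tries, z=0.52)   # ~70% level
    lo_rej, _ = proportion_ci(class_hits, class_tries, z=0.52)
    if hi_rej < lo_rej:
        return "reject"      # instance clearly performs worse: discard it
    return "undecided"

print(ib3_status(18, 20, 50, 100))   # a reliable instance in a 50%-accurate class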
Weight attributes
Generalized exemplars
Rectangular generalizations
Separating generalized exemplars
[Figure: exemplars of class 1 and class 2 generalized into rectangles, with a separation line between them]
Generalized distance functions
Discussion and Bibliographic Notes
Extending Linear Models
Support vector machines
The maximum margin hyperplane
Support vectors
• The support vectors define the maximum margin hyperplane
• All other instances can be deleted without changing its position
and orientation
• The hyperplane can be written as
x = b + Σ_{i ∈ support vectors} α_i y_i (a(i) · a)
where a is a test instance, the a(i) are the support vectors, y_i are their class values (−1 or +1), and b and the α_i are numeric parameters determined by the learning algorithm
Finding support vectors
Nonlinear SVMs
• We can create a nonlinear classifier by creating new “pseudo”
attributes from the original attributes in the data
• “Pseudo” attributes represent attribute combinations
• E.g.: all polynomials of degree 2 that can be formed from the original
attributes
• We can learn a linear SVM from this extended data
• The linear SVM in the extended space is a non-linear classifier
in the original attribute space
• Overfitting often not a significant problem with this approach
because the maximum margin hyperplane is stable
• There are often comparatively few support vectors relative to the size
of the training set
• Computation time still an issue
• Each time the dot product is computed, all the “pseudo attributes”
must be included
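As an illustration of the "pseudo"-attribute idea, here is a brief sketch using scikit-learn, assuming its PolynomialFeatures for the explicit degree-2 expansion and LinearSVC for the linear SVM (not the book's own code):

from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

X, y = make_moons(noise=0.2, random_state=0)           # not linearly separable
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # degree-2 "pseudo" attributes
    LinearSVC(max_iter=10000))                         # linear SVM in the extended space
model.fit(X, y)
print(model.score(X, y))   # nonlinear decision boundary in the original space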
A mathematical trick
Other kernel functions
• Mapping is called a “kernel function”
• With a kernel, the model becomes x = b + Σ_{i ∈ support vectors} α_i y_i K(a(i), a)
• Polynomial kernel: K(a(i), a) = (a(i) · a + 1)^n computes the dot product in the space of all attribute products of degree up to n without ever forming the “pseudo” attributes (see the check below)
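A small numeric check of the trick for n = 2 in two dimensions; the explicit feature map phi is written out only to show that the kernel yields the same dot product:

import numpy as np

def phi(v):
    # Explicit feature map whose dot product equals (v . w + 1)^2
    x1, x2 = v
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1*x1, x2*x2, np.sqrt(2)*x1*x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))      # dot product in the extended space: 25.0
print((np.dot(x, z) + 1.0) ** 2)   # same value via the kernel, no expansion needed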
Noise
Sparse data
Support vector regression
• Maximum margin hyperplane only applies to classification
• However, idea of support vectors and kernel functions can
be used for regression
• Basic method is the same as in linear regression: want to
minimize error
• Difference A: ignore errors smaller than ε and use absolute error instead of squared error
• Difference B: simultaneously aim to maximize the flatness of the function
• User-specified parameter ε defines a “tube” around the regression function (see the sketch below)
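A brief sketch using scikit-learn's SVR, whose epsilon parameter is exactly the tube width ε described above (dataset and settings are illustrative):

import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

svr = SVR(kernel="rbf", epsilon=0.1).fit(X, y)   # errors < 0.1 are ignored
print(svr.predict([[2.5]]))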
More on SVM regression
• If there are tubes that enclose all the training points, the flattest
of them is used
• E.g.: the mean is used if 2ε > range of target values
• Model can be written as
x = b + Σ_{i ∈ support vectors} α_i (a(i) · a)
where the coefficients α_i can be negative; as before, dot products can be replaced by kernel functions for nonlinear regression
Examples
[Figure: SVM regression fits of the same data with tube widths ε = 2, ε = 1, and ε = 0.5]
Kernel Ridge Regression
Comments on kernel ridge regression
• Like in an SVM, the predicted class value for a test instance a is expressed as a weighted sum of dot products
• But: all training instances are involved in this sum:
Σ_j α_j (a(j) · a)
instead of the SVM's sum over the support vectors only,
b + Σ_{i ∈ support vectors} α_i y_i (a(i) · a)
(where y is either −1 or +1)
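A minimal numpy sketch of kernel ridge regression as just described (the RBF kernel, its width, and the ridge parameter are illustrative choices):

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, (30, 1))
y = np.sin(X).ravel()

lam = 0.1                                          # ridge parameter
alpha = np.linalg.solve(rbf_kernel(X, X) + lam * np.eye(len(X)), y)

X_test = np.array([[2.5]])
print(rbf_kernel(X_test, X) @ alpha)               # sum over ALL training instances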
The kernel perceptron
• The perceptron's weight vector is a sum of the misclassified training instances, so the weighted sum Σ_i w_i a_i can be written as Σ_i Σ_j y(j) a(j)_i a_i
• Now swap summation signs: Σ_j y(j) Σ_i a(j)_i a_i
• Can be expressed as: Σ_j y(j) (a(j) · a)
• Can replace dot product by kernel: Σ_j y(j) K(a(j), a)
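A compact sketch of the resulting algorithm (class labels in {−1, +1}; the kernel is passed in as a function):

def kernel_perceptron(X, y, kernel, epochs=10):
    # Store indices of misclassified instances instead of a weight vector
    mistakes = []
    def predict(x):
        s = sum(y[j] * kernel(X[j], x) for j in mistakes)
        return 1 if s > 0 else -1
    for _ in range(epochs):
        for i in range(len(X)):
            if predict(X[i]) != y[i]:
                mistakes.append(i)   # "add" instance to the implicit weight vector
    return predict

# usage: clf = kernel_perceptron(X, y, lambda a, b: (a @ b + 1) ** 2); clf(x_new)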
Comments on kernel perceptron
Multilayer perceptrons
Examples
Backpropagation
• How to learn the weights given a network structure?
• Cannot simply use perceptron learning rule because we have
hidden layer(s)
• Function we are trying to minimize: error
• Can use a general function minimization technique called
gradient descent
• Activation function needs to provide gradient information:
can use sigmoid function instead of threshold function
Threshold vs. sigmoid activation function
Gradient descent example
• Function: f(x) = x² + 1
• Derivative: 2x
• Learning rate: 0.1
• Start value: 4
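The slide's example as runnable code; each step multiplies x by 0.8, so it converges toward the minimum at x = 0:

x = 4.0                      # start value
rate = 0.1                   # learning rate
for step in range(20):
    x = x - rate * 2 * x     # derivative of x^2 + 1 is 2x
    print(step, x, x * x + 1)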
Minimizing the error II
• What about the weights for the connections from the input to
the hidden layer? More application of the chain rule…
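A minimal sketch of both chain-rule steps for one hidden layer (sigmoid units, squared error; layer sizes, toy data, and learning rate are all illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = (X[:, 0] > 0).astype(float)              # toy target
W1 = 0.1 * rng.randn(3, 4)                   # input -> hidden weights
W2 = 0.1 * rng.randn(4)                      # hidden -> output weights

for _ in range(200):
    h = sigmoid(X @ W1)                          # hidden activations
    out = sigmoid(h @ W2)                        # network output
    d_out = (out - y) * out * (1 - out)          # chain rule at the output unit
    d_h = np.outer(d_out, W2) * h * (1 - h)      # chain rule pushed to hidden units
    W2 -= 0.5 * (h.T @ d_out)                    # gradient descent updates
    W1 -= 0.5 * (X.T @ d_h)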
Remarks
• The same process works for multiple hidden layers and
multiple output units (e.g., for multiple classes)
• Can update weights after all training instances have been
processed or incrementally:
• batch learning vs. stochastic backpropagation
• Weights are initialized to small random values
• How to avoid overfitting?
• Early stopping: use validation set to check when to stop
• Weight decay: add penalty term to error function
• How to speed up learning?
• Momentum: re-use a proportion of the old weight change (see below)
• Use optimization method that employs 2nd derivative
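A one-function sketch of the momentum heuristic (the momentum factor mu is an illustrative choice):

def momentum_step(w, gradient, prev_delta, rate=0.1, mu=0.9):
    # Re-use a proportion mu of the previous weight change
    delta = -rate * gradient + mu * prev_delta
    return w + delta, delta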
Radial basis function networks
Learning RBF networks
• Parameters to be learned: centers and widths of the RBFs +
weights in output layer
• Can learn the two sets of parameters independently and still get
fairly accurate models
• E.g.: clusters from k-means can be used to form basis functions
• Linear model for output layer can be based on fixed RBFs found using
clustering, which makes learning very efficient
• However, for best accuracy it is best to train the entire network in
a fully supervised manner
• Can use the same methods that are used for training multilayer perceptrons
• Disadvantage of standard RBF networks: no built-in attribute
weighting based on relevance
• But: can introduce attribute weights into the distance function
• RBF networks are related to RBF SVMs, which have a basis
function centered on each support vector
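A sketch of the two-stage approach the slide describes: k-means (via scikit-learn) for the centers, then a least-squares linear output layer on the fixed basis functions; k and the shared width are illustrative:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, (100, 2))
y = np.sin(X[:, 0]) + np.cos(X[:, 1])

centers = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).cluster_centers_
width = 1.0                                              # shared RBF width

def rbf_features(A):
    d2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

w, *_ = np.linalg.lstsq(rbf_features(X), y, rcond=None)  # linear output layer
print(rbf_features(np.array([[0.0, 0.0]])) @ w)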
Stochastic gradient descent
• We have seen gradient descent + stochastic gradient descent for learning weights in a neural network
• Gradient descent is a general-purpose optimization
technique
• Can be applied whenever the objective function is differentiable
• Actually, can be used even when the objective function is not
completely differentiable!
• This is based on the concept of subgradients, which we will not get into here
• One application: learning linear models – e.g. linear SVMs
or logistic regression
• Very fast, simple method for learning from large datasets
Stochastic gradient descent cont.
Loss functions
• For SVMs, the error function (to be minimized) is called the hinge loss:
E(z) = max(0, 1 − z), where z = y f(x) and y is the true class value (−1 or +1)
Optimizing the hinge loss
• In the linearly separable case, the hinge loss is 0 for a
function that successfully separates the data
• The maximum margin hyperplane is given by the smallest weight
vector that achieves 0 hinge loss
• Corresponding optimization problem that needs to be solved:
minimize (λ/2) ||w||² + Σ_i max(0, 1 − y_i f(x_i))
where λ is a user-specified regularization parameter
• But: the hinge loss is not differentiable at z = 1; we cannot compute a gradient for all values of z
• Can use a subgradient – something that resembles a gradient
• Can use 0 at z = 1
• In fact, the loss is 0 for z ≥ 1, so we can focus on z < 1 and proceed as usual with stochastic gradient descent (see the sketch below)
• Also yields a solution if the data is not separable
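A minimal stochastic (sub)gradient descent sketch for this objective (λ and the learning rate are illustrative; labels must be −1 or +1):

import numpy as np

def sgd_hinge(X, y, lam=0.01, rate=0.01, epochs=50):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            z = y[i] * (X[i] @ w + b)
            if z < 1:                        # subgradient of the hinge; zero when z >= 1
                w += rate * (y[i] * X[i] - lam * w)
                b += rate * y[i]
            else:
                w -= rate * lam * w          # only the regularizer contributes
    return w, b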
Discussion and Bibliographic Notes
• SVMs stem from statistical learning theory (Vapnik 1999)
• A good starting point for exploration is a tutorial by Burges (1998)
• Soft-margin SVMs were discussed by Cortes and Vapnik (1995)
• Tutorial on support vector regression: Smola and Schölkopf (2004)
• Schölkopf et al. (1999) present support vector regression with just
one parameter instead of two (C and ε)
• Fletcher (1987) covers constrained quadratic optimization
• The SMO algorithm for training SVMs is due to Platt (1998)
• Ridge regression was introduced by Hoerl and Kennard (1970)
• Hastie et al. (2009) give a good description of kernel ridge regression
• Kernel ridge regression is equivalent to Gaussian process regression,
a Bayesian approach that also provides estimates of uncertainty
Discussion and Bibliographic Notes
• The kernel perceptron is due to Freund and Schapire (1999)
• Cristianini and Shawe-Taylor (2000) provide an introduction
to support vector machines and kernel-based methods
• Shawe-Taylor and Cristianini (2004) and Schölkopf and Smola
(2002) cover kernel-based learning in detail
• Bishop (1995) provides an excellent introduction to both
multilayer perceptrons and RBF networks
• Kivinen et al. (2002), Zhang (2004) and Shalev-Shwartz et al.
(2007) explore gradient methods for SVMs
• Kivinen et al. and Shalev-Shwartz et al. provide heuristics for
setting the learning rate for gradient descent
Numeric Prediction with Local Linear Models
Numeric prediction (aka regression)
Regression trees
Model trees
• Build a regression tree
• Each leaf has a linear regression function
• Smoothing: factor in the predictions of ancestor nodes on the path to the root
• Smoothing formula: p' = (np + kq) / (n + k)
where p is the prediction passed up from below, q is the prediction of the model at this node, n is the number of training instances that reach the node below, and k is a smoothing constant
Nominal attributes
• Convert nominal attributes to binary ones
• Sort attribute values by their average class values
• If attribute has k values,
generate k – 1 binary attributes
• i th attribute is 0 if original nominal value is part of the first i
nominal values in the sorted list, and 1 otherwise
• Treat binary attributes as numeric in linear regression models
and when selecting splits
• Can prove: best SDR split on one of the new binary attributes
is the best (binary) SDR split on original nominal attribute
• In practice this process is not applied at every node of the
tree but globally at the root node of the tree
• Splits are no longer optimal but runtime and potential for overfitting
are reduced this way
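An illustrative sketch of the conversion (the helper name and toy data are hypothetical):

def nominal_to_binary(values, targets):
    # Sort the nominal values by their average target (class) value
    cats = sorted(set(values),
                  key=lambda c: sum(t for v, t in zip(values, targets) if v == c)
                              / values.count(c))
    # i-th binary attribute is 0 if the value is among the first i+1 sorted values
    return [[0 if cats.index(v) <= i else 1 for i in range(len(cats) - 1)]
            for v in values]

values = ["a", "b", "c", "a", "c"]
targets = [1.0, 3.0, 2.0, 1.2, 2.1]
print(nominal_to_binary(values, targets))   # a -> [0, 0], c -> [1, 0], b -> [1, 1]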
Missing values
Surrogate splitting based on class
• Choose split point based on instances with known values
• Split point divides instances into 2 subsets
• L (smaller class average)
• R (larger class average)
• m is the average of the two averages
• For an instance with a missing value:
• Choose L if class value < m
• Otherwise R
• Once full tree is built, replace missing values with averages of
corresponding leaf nodes
• Linear regression models can then be built on the completed
(“imputed”) dataset
Pseudo-code for M5'
• Let us consider the pseudocode for the model tree inducer M5’
• Four methods:
• Main method: MakeModelTree
• Method for splitting: split
• Method for pruning: prune
• Method that computes error: subtreeError
• We will briefly look at each method in turn
• We will assume that the linear regression method performs
attribute subset selection based on error (discussed previously)
• Nominal attributes are replaced globally at the root node
MakeModelTree
MakeModelTree (instances)
{
  SD = sd(instances)    // global standard deviation of the class values
  for each k-valued nominal attribute
    convert into k-1 synthetic binary attributes
  root = newNode
  root.instances = instances
  split(root)           // grow the tree
  prune(root)           // then prune it back
  printTree(root)
}
split
split(node)
{
  if sizeof(node.instances) < 4 or
     sd(node.instances) < 0.05*SD    // too few instances or little variance left
    node.type = LEAF
  else
    node.type = INTERIOR
    for each attribute
      for all possible split positions of the attribute
        calculate the attribute's SDR   // standard deviation reduction
    node.attribute = attribute with maximum SDR
    partition node.instances on node.attribute into
      node.left.instances and node.right.instances
    split(node.left)
    split(node.right)
}
prune
prune(node)
{
  if node.type = INTERIOR then
    prune(node.left)
    prune(node.right)
    node.model = linearRegression(node)   // linear model for this node
    if subtreeError(node) > error(node) then
      node.type = LEAF    // the subtree predicts worse than the node's own model
}
subtreeError
subtreeError(node)
{
  l = node.left; r = node.right
  if node.type = INTERIOR then
    // instance-weighted average of the children's subtree errors
    return (sizeof(l.instances)*subtreeError(l)
          + sizeof(r.instances)*subtreeError(r))
          / sizeof(node.instances)
  else return error(node)
}
Model tree for servo data
Rules from model trees
• PART algorithm generates classification rules by building
partial decision trees
• Can use the same method to build rule sets for regression
• Use model trees instead of decision trees
• Use variance instead of entropy to choose node to expand when
building a partial tree
• Rules that are generated will have linear models on the right-hand side
• Caveat: using smoothed trees may not be appropriate due
to the separate-and-conquer strategy used in rule learning
• Empirical evidence shows that smoothing does not help
• Full trees can be used instead of partial trees at the expense
of runtime
Locally weighted regression
• Locally weighted regression is a numeric prediction method
that combines
• instance-based learning
• linear regression
• It is a “lazy” learning method:
• Computes new regression function for each test instance at
prediction time
• Works incrementally
• Weights training instances
• according to distance to test instance
• builds linear regression model from weighted data
• requires weighted version of linear regression (straightforward)
• Advantage: nonlinear approximation
• Slow if implemented using brute-force search; however, fast
data structures can be used for the nearest-neighbor search
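A sketch of one lazy prediction: Gaussian weighting of the training instances, then weighted linear regression via re-scaled rows (the smoothing parameter tau is an illustrative choice):

import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    # Weight each training instance by its distance to the test instance
    w = np.exp(-((X - x_query) ** 2).sum(axis=1) / (2 * tau ** 2))
    Xb = np.hstack([np.ones((len(X), 1)), X])     # add intercept column
    sw = np.sqrt(w)
    # Weighted linear regression = ordinary least squares on re-scaled data
    beta, *_ = np.linalg.lstsq(Xb * sw[:, None], y * sw, rcond=None)
    return np.array([1.0, *x_query]) @ beta

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, (50, 1))
y = np.sin(X).ravel()
print(lwr_predict(X, y, np.array([2.5])))         # new model per test instance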
Design decisions
• Weighting functions:
• Inverse Euclidean distance
• Gaussian kernel applied to Euclidean distance
• Triangular kernel used the same way
• etc.
• Empirically, performance does not appear to depend much on
the weighting method that is used
• Ideally, weighting function has bounded support so that most
training instances receive weight 0 and can be ignored
• Smoothing parameter is used to scale the distance function for
computation of the weights
• Multiply distance by inverse of this parameter
• Possible choice: distance to the kth nearest training instance (renders
choice of smoothing parameter data dependent)
Discussion and Bibliographic Notes
• Regression trees were introduced in the “classification and regression trees” (CART) system (Breiman et al., 1984)
• The method of handling nominal attributes and the surrogate
device for dealing with missing values were included in CART
• M5 model trees were first described by Quinlan (1992)
• The M5’ version is given by Wang and Witten (1997)
• Using model trees (although not partial trees) for generating
rule sets has been explored by Hall et al. (1999)
• There are many variations of locally weighted learning.
• Statisticians have considered using locally quadratic models
• They have applied locally weighted logistic regression to classification
• Frank et al. (2003) evaluated the use of locally weighted learning in
conjunction with Naïve Bayes
• Atkeson et al. (1997) provide a survey on locally weighted learning