Machine Learning Tutorial
Contents
1. Introduction
2. What is Machine Learning?
2.1 Notation of Dataset
2.2 Training Set and Test Set
2.3 No Free Lunch Rule
2.4 Relationships with Other Disciplines
3. Basic Concepts and Ideals of Machine Learning
3.1 Designing versus Learning
3.2 The Categorization of Machine Learning
3.3 The Structure of Learning
3.4 What are We Seeking?
3.5 The Optimization Criterion of Supervised Learning
3.6 The Strategies of Supervised Learning
4. Principles and Effects of Machine Learning
4.1 The VC Bound and Generalization Error
4.2 Three Learning Effects
4.3 Feature Transform
4.4 Model Selection
4.5 Three Learning Principles
4.6 Practical Usage: The First Glance
5. Techniques of Supervised Learning
5.1 Supervised Learning Overview
5.2 Linear Model (Numerical Functions)
5.2.1 Perceptron Learning Algorithm (PLA) - Classification
5.2.2 From Linear to Nonlinear
5.2.3 Adaptive Perceptron Learning Algorithm (PLA) - Classification
5.2.4 Linear Regression - Regression
5.2.5 Ridge Regression - Regression
5.2.6 Support Vector Machine (SVM) and Support Vector Regression (SVR)
5.2.7 Extension to Multi-class Problems
5.3 Conclusion and Summary
6. Techniques of Unsupervised Learning
7. Practical Usage: Pattern Recognition
8. Conclusion
Notation
General notation:
$a$: scalar
$\mathbf{a}$: vector
$\mathbf{A}$: matrix
$a_i$: the ith entry of $\mathbf{a}$
$a_{ij}$: the entry $(i, j)$ of $\mathbf{A}$
$\mathbf{a}^{(n)}$: the nth vector $\mathbf{a}$ in a dataset
$\mathbf{A}^{(n)}$: the nth matrix $\mathbf{A}$ in a dataset
$\mathbf{b}_k$: the vector corresponding to the kth class in a dataset (or the kth component in a model)
$\mathbf{B}_k$: the matrix corresponding to the kth class in a dataset (or the kth component in a model)
$\mathbf{b}_k^{(i)}$: the ith vector of the kth class in a dataset
$|\mathbf{A}|$, $|\mathbf{A}^{(n)}|$, $|\mathbf{B}_k|$: the number of column vectors in $\mathbf{A}$, $\mathbf{A}^{(n)}$, and $\mathbf{B}_k$
Special notation:
* In some conditions, special notations will be used and described at those places. Ex: $\mathbf{b}_k$ denotes a k-dimensional vector, and $\mathbf{B}_{k \times j}$ denotes a $k \times j$ matrix.
1. Introduction
In the topics of face recognition, face detection, and facial age estimation, machine learning plays an important role and serves as the fundamental technique in much of the existing literature.
For example, in face recognition, many researchers focus on using dimensionality reduction techniques to extract personal features. The most well-known ones are (1) eigenfaces [1], which is based on principal component analysis (PCA), and (2) fisherfaces [2], which is based on linear discriminant analysis (LDA).
In face detection, the popular and efficient technique based on the Adaboost cascade structure [3][4], which drastically reduces the detection time while maintaining comparable accuracy, has made itself available for practical usage. To our knowledge, this technique is the basis of automatic face focusing in digital cameras. Machine learning techniques are also widely used in facial age estimation, both to extract features that are hard to find by hand and to build the mapping from the facial features to the predicted age.
Although machine learning is not the only method in pattern recognition (for example, much research still aims to extract useful features through image and video analysis), it provides theoretical analysis and practical guidelines to refine and improve the recognition performance. In addition, with the fast development of technology and the explosive growth of the Internet, people can now easily take, make, and access large numbers of digital photos and videos, either with their own digital cameras or from popular on-line photo and video collections such as Flickr [5], Facebook [6], and YouTube [7]. Given the large amount of available data and the intrinsic ability to learn knowledge from data, we believe that machine learning techniques will attract even more attention in pattern recognition, data mining, and information retrieval.
In this tutorial, a brief but broad overview of machine learning is given, covering both theoretical and practical aspects. In Section 2, we describe what machine learning is and when it is applicable. In Section 3, the basic concepts of machine learning are presented, including categorization and learning criteria. The principles and effects governing the learning performance are discussed in Section 4, and several supervised and unsupervised learning algorithms are introduced in Sections 5 and 6. In Section 7, a general framework of pattern recognition based on machine learning techniques is provided. Finally, in Section 8, we give a conclusion.
2. What is Machine Learning?
"Optimizing a performance criterion using example data and past experience," as stated by E. Alpaydin [8], gives a brief but faithful description of machine learning. In machine learning, data plays an indispensable role, and the learning algorithm is used to discover and learn knowledge or properties from the data. The quality and quantity of the dataset affect the learning and prediction performance. The textbook (not yet published at the time of writing) by Professor Hsuan-Tien Lin, the machine learning course instructor at National Taiwan University (NTU), is also titled Learning from Data, which emphasizes the importance of data in machine learning. Fig. 1 shows an example of a two-class dataset.
Each sample in the dataset is denoted as a d-dimensional vector $x^{(n)} = [x_1^{(n)}, x_2^{(n)}, \ldots, x_d^{(n)}]^T$ and called a feature vector (or simply a feature).
In machine learning, what we desire is that these learned properties can not only explain the training set but also be used to predict unseen samples or future events. In order to examine the performance of learning, another dataset may be reserved for testing, called the test set or test data. For example, before final exams, the teacher may give students several questions for practice (the training set), and the way to judge their performance is to examine them with another problem set (the test set). In order to distinguish the training set and the test set when they appear together, we use the subscripts train and test to denote them.
We have not yet discussed what kinds of properties can be learned from the dataset or how to estimate the learning performance; the reader can treat these as a black box for now. In Fig. 2, an explanation of the three datasets above is presented, and the first property a machine can learn from a labeled dataset is shown: the separating boundary.
Fig. 2 An explanation of the three labeled datasets. The universal set is assumed to exist but is unknown; through the data acquisition process, only a subset of the universal set is observed and used for training (the training set). Two learned separating lines (the first example of properties a machine can learn in this tutorial) are shown in both the training set and the test set. As you can see, both lines give 100% accuracy on the training set, while they may perform differently on the test set (the curved line shows a higher error rate).
In this inequality, N denotes the size of the training set, and $E_{in}$ and $E_{out}$ describe how the learned properties perform on the training set and on the test set, respectively. For example, if the learned property is a separating boundary, these two quantities usually correspond to the classification errors. Finally, $\epsilon$ is the tolerance gap between $E_{in}$ and $E_{out}$. Details of the Hoeffding inequality [10] are beyond the scope of this tutorial; an extended version of the inequality will be discussed later.
Fig. 3 The no free lunch rule for datasets: (a) is the training set we have, and (b) and (c) are two test sets. As you can see, (c) has a sample distribution different from (a) and (b), so we cannot expect the properties learned from (a) to be useful in (c).
While (1) gives us confidence in applying machine learning, some necessary rules must hold to ensure its applicability. These rules are called the no free lunch rules and are defined on both the dataset and the properties to learn. On the dataset side, the no free lunch rules require the training set and the test set to come from the same distribution (the same universal set). On the property side, the no free lunch rules ask the users to make assumptions on what property to learn and how to model that property. For example, if the separating boundary of a labeled dataset is desired, we also need to define the type of the boundary (ex. a straight line or a curve). On the other hand, if we want to estimate the probability distribution of an unlabeled dataset, the distribution type should also be defined (ex. a Gaussian distribution). Fig. 3 illustrates the no free lunch rules for datasets.
3.2 The Categorization of Machine Learning
There are generally three types of machine learning, depending on the problem at hand and the given dataset: (1) supervised learning, (2) unsupervised learning, and (3) reinforcement learning [14]:
Supervised learning: The training set given for supervised learning is the labeled dataset defined in Section 2.1. Supervised learning tries to find the relationships between the feature set and the label set, which constitute the knowledge and properties we can learn from a labeled dataset. If each feature vector x corresponds to a label $y \in L$, $L = \{l_1, l_2, \ldots, l_c\}$ (where c usually ranges from 2 to around a hundred), the learning problem is called classification. On the other hand, if each feature vector x corresponds to a real value $y \in \mathbb{R}$, the learning problem is called a regression problem. The knowledge extracted by supervised learning is often utilized for prediction and recognition.
Unsupervised learning: The training set given for unsupervised learning is the unlabeled dataset, also defined in Section 2.1. Unsupervised learning aims at clustering [12], probability density estimation, finding associations among features, and dimensionality reduction [13]. In general, an unsupervised algorithm may simultaneously learn more than one of the properties listed above, and the results of unsupervised learning can be further used for supervised learning.
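To make the distinction concrete, the following sketch (Python with NumPy; every variable here is illustrative) shows how the same feature set appears in the supervised and unsupervised settings:

    import numpy as np

    X = np.random.randn(150, 2)                  # feature set: 150 samples, d = 2

    # Supervised, classification: each x pairs with a label from L = {0, 1, 2}
    y_class = np.random.randint(0, 3, size=150)

    # Supervised, regression: each x pairs with a real value y in R
    y_reg = 2.0 * X[:, 0] - X[:, 1] + 0.1 * np.random.randn(150)

    # Unsupervised: only X is given; the goal is to discover structure such as
    # clusters, a probability density, or a lower-dimensional representation.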
3.3 The Structure of Learning
In this subsection, the structure of machine learning is presented. In order to avoid confusion about the variety of unsupervised learning structures, only the supervised learning structure is shown; in later sections, several unsupervised learning techniques will still be mentioned and introduced, and important references for further reading are listed. An overall illustration of the supervised learning structure is given in Fig. 7. Above the horizontal dotted line, an unknown target function f (or target distribution) that maps each feature sample in the universal dataset to its corresponding label is assumed to exist. Below the dotted line, a training set coming from the unknown target function is used to learn or approximate the target function. Because we have no idea what the target function or distribution f looks like (a linear boundary or a circular boundary?), a hypothesis set H needs to be defined, which contains several hypotheses h (mapping functions or distributions).
Fig. 4 Supervised learning: (a) presents a three-class labeled dataset, where the color of each sample shows its corresponding label. After supervised learning, the class-separating boundary can be found, shown as the dotted lines in (b).
Inside the hypothesis set H, the goal of supervised learning is to find the best h, called the final hypothesis g, that in some sense approximates the target function f. In order to do so, we further need to define the learning algorithm A, which includes the objective function (the function to be optimized when searching for g) and the optimization methods. The hypothesis set and the objective function jointly model the "property to learn" of the no free lunch rules, as mentioned in Section 2.3. Finally, the final hypothesis g is expected to approximate f in some way and is used for future prediction. Fig. 8 provides an explanation of how the hypothesis set works with the learning algorithm.
Fig. 5 Unsupervised learning (clustering): (a) shows the same feature set as above but without the label set. After performing the clustering algorithm, three underlying groups are discovered from the data in (b). Users can also perform other kinds of unsupervised learning algorithms to learn different kinds of knowledge (ex. a probability distribution) from the unlabeled dataset.
Fig. 6 Semi-supervised learning: (a) presents a labeled dataset (with red, green, and blue) together with an unlabeled dataset (marked with black). The distribution of the unlabeled dataset can guide the position of the separating boundary. After learning, a boundary different from the one in Fig. 4 is depicted.
There are three general requirements for the learning algorithm. First, the algorithm should find a stable final hypothesis g for the specific d and N of the training set (ex. convergence). Second, it has to search out the correct and optimal g defined through the objective function. Last but not least, the algorithm is expected to be efficient.
Fig. 7 The overall illustration of supervised learning structure: The part above the
dotted line is assumed but inaccessible, and the part below the line is trying to
approximate the unknown target function (f is the true target function and g is the
learned function).
3.4 What are We Seeking?
To evaluate a hypothesis h in classification, we measure its error rate:

$E(h) = \frac{1}{N} \sum_{n=1}^{N} \left[\!\left[\, y^{(n)} \neq h(x^{(n)}) \,\right]\!\right], \quad \text{with } [\![\,\text{true}\,]\!] = 1 \text{ and } [\![\,\text{false}\,]\!] = 0, \qquad (2)$

where $[\![\cdot]\!]$ stands for the indicator function. When the error rate (2) is defined on the training set, it is named the in-sample error $E_{in}(h)$, while the error rate calculated on the universal set, or more practically on the (unknown or reserved) test set, is named the out-of-sample error $E_{out}(h)$. Based on these definitions, the desired final hypothesis g is the one that achieves the lowest out-of-sample error over the whole hypothesis set:

$g = \arg\min_{h \in H} E_{out}(h). \qquad (3)$
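Definition (2) translates directly into code. The sketch below (Python with NumPy; the fixed linear rule used as the hypothesis is purely illustrative) computes the 0/1 error rate of a hypothesis on a dataset:

    import numpy as np

    def error_rate(h, X, y):
        # E(h) = (1/N) * sum of [[ y != h(x) ]], i.e., definition (2)
        predictions = np.array([h(x) for x in X])
        return np.mean(predictions != y)

    # Illustrative hypothesis: a fixed linear separating rule
    h = lambda x: 1 if x[0] + x[1] > 0 else -1

    X_train = np.random.randn(50, 2)
    y_train = np.where(X_train[:, 0] - X_train[:, 1] > 0, 1, -1)  # labels from some target f
    E_in = error_rate(h, X_train, y_train)   # the in-sample error of h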
In the learning phase, however, we can only observe the training set, measure $E_{in}(h)$, and search for g based on the objective function. From this contradiction, readers may ask: what is the connection among the objective function, $E_{in}(g)$, and $E_{out}(g)$, and what should we optimize in the learning phase?
As mentioned in (1), the connection between the knowledge learned from the training set and its validity on the test set can be formulated as a probability inequality. That inequality is directly applicable only when the hypothesis set contains a single hypothesis. For more practical hypothesis sets, which may contain infinitely many hypotheses, an extended version of (1) is introduced:

$E_{out}(g) \leq E_{in}(g) + O\!\left(\sqrt{\frac{d_{VC}}{N} \log N}\right), \quad \text{with probability } 1 - \delta. \qquad (4)$
This inequality is called the VC bound (Vapnik-Chervonenkis bound), where $d_{VC}$ is the VC dimension, used as a measure of model (hypothesis set and objective function) complexity, and N is the training set size. The VC bound listed here is a simplified version, but it provides a valuable relationship between $E_{out}(g)$ and $E_{in}(g)$: a hypothesis g that minimizes $E_{in}(h)$ may induce a low $E_{out}(g)$. The complete definition of the VC dimension is beyond the scope of this tutorial.
Based on the VC bound, a supervised learning strategy called empirical risk minimization (ERM) is proposed to achieve a low $E_{out}(g)$ by minimizing $E_{in}(g)$:
$\tilde{x} = [1, x_1, x_2, \ldots, x_d]^T$ is the extended version of x, also with $(d+1)$ dimensions. The vector w stands for the classifier parameters, and the additional 1 in $\tilde{x}$ is used to compute the offset of the classification line. Based on the goal of ERM introduced in Section 3.4, the objective function is the in-sample error term (or the loss term, counting how many training samples are wrongly predicted), and the optimization method is used to find a linear classifier that minimizes the objective function. Fig. 9 shows a linearly separable training set as well as the corresponding final hypothesis g. As you can see, there are many hypotheses that could achieve zero error.
Fig. 10 The objective (in-sample error rate) considering only a single sample with $y^{(n)} = 1$, based on linear classifiers. The x-axis denotes the inner product of the extended feature vector and the parameter vector of the linear classifier. As shown, the objective function is non-continuous around $w^T \tilde{x}^{(n)} = 0$.
If we simplify (2) and just look at one sample without normalization, the error term becomes:

$loss^{(n)}(g) = \left[\!\left[\, y^{(n)} \neq g(x^{(n)}) \,\right]\!\right]. \qquad (8)$

Fig. 10 shows this one-sample objective function for a sample with $y^{(n)} = 1$. As can be seen, the function is non-continuous around $w^T \tilde{x}^{(n)} = 0$ and flat over the other ranges, so with zero gradients with respect to w, the optimization algorithm has no information for adjusting the current w towards a lower error rate. Fig. 11 illustrates this problem for a linearly separable training set.
Differentiation-based optimization methods are probably the most widely used optimization techniques in machine learning, especially for objective functions that can be written directly as a function of the training samples and the classifier or regressor parameters w (not always in vector form). The popular gradient descent, stochastic gradient descent, Newton's method, coordinate descent, and convex optimization belong to this optimization category. Differentiation-based methods are usually performed in an iterative manner, which may suffer from the local-optimum problem. Besides, some of them cannot even reach the exact local optimum due to convergence concerns, where slow updating and small oscillations usually occur around the exact optimal parameters. Despite these drawbacks, this optimization category is popular because of its intuitive geometrical meaning and because it is usually easy to start with, requiring only simple calculus such as the Taylor expansion.
Fig. 11 Assume that at the current iteration the attained w and one of the desired w's are shown as the black and gray dotted lines; the optimization algorithm may have no information on how to adjust w towards the desired one.
The basic concerns when exploiting this optimization category are the differentiability of the objective function and the continuity of the parameter space. The objective function may have some non-continuous or non-differentiable points, while it should at least be in a piecewise differentiable form. In addition to differentiability, we also expect that the function has non-zero gradients along the path of optimization, and that zero gradients only occur at the desired optimal position. As shown in Fig. 10, the in-sample error rate of linear classifiers is neither continuous nor endowed with non-zero gradients along the optimization path. This non-continuous objective function may still be handled by some other optimization techniques, such as the perceptron learning algorithm, neural evolution, and the genetic algorithm, while these are either much more complicated, require more computational time, or are only applicable under certain convergence-guaranteeing conditions. The objective function should not be confused with the classifier or regressor function: the latter is the function for predicting the label of a feature vector, while the former is the function used to find the optimal parameters of the classifier or regressor.
To make differentiation-based optimization methods available for ERM in the linear classifier case, we need to modify the in-sample error term into some other approximation function that is (piecewise) differentiable and continuous. There are many choices of approximation functions (denoted as $E_{app}(h)$), and the only requirement is that they upper-bound the in-sample error:

$E_{app}(h) \geq E_{in}(h). \qquad (9)$

The learning phase then turns to optimizing $E_{app}(h)$ in place of $E_{in}(h)$. Through searching for the g that optimizes $E_{app}(h)$ under the constraint defined in (9), we expect the final hypothesis g to achieve a low $E_{in}(g)$ as well as a low $E_{out}(g)$, close to those of the hypothesis $g^*$ that directly minimizes the in-sample error ($E_{in}(g) \approx E_{in}(g^*)$ and $E_{out}(g) \approx E_{out}(g^*)$). Table 1 summarizes the supervised learning concept from ERM to the objective functions. The approximation functions are usually defined per sample, and the objective function for the whole training set ($E_{app}(h)$) is just the normalized summation of these one-sample functions.
Fig. 12 Different objective functions for linear classifiers, defined on a sample with $y^{(n)} = 1$. The terms in parentheses are the corresponding algorithm names.
Table 1 The supervised learning concept from ERM to the objective functions.
Original goal: find a final hypothesis g which approaches the target function f by achieving the minimum $E_{out}(h)$.
Given: a labeled training set $\mathcal{D}_{train}$.
In addition to linear classifiers, there are still many kinds of hypothesis sets as well as different objective functions and optimization techniques (as listed in Table 2) for supervised learning on a given training set. Note that both the hypothesis set type and the corresponding objective function affect the VC dimension introduced in (4) for model complexity measurement. And even with the same hypothesis set, different objective functions may result in different final hypotheses g.
hypothesis g from the hypothesis set H. Table 3 lists both the classifier modeling and the optimization criteria, and Fig. 13 illustrates the different learning structures of these two strategies.
Compared to the one-shot strategy, which only outputs the predicted label, the two-stage strategy comes up with a soft decision: the probability of each label given a feature vector. The generative model further discovers the joint distribution between feature vectors and labels, and provides a unified framework for supervised, semi-supervised, and unsupervised learning. Although the two-stage strategy seems to extract more information from the training set, its strong assumption, that the training samples come from a user-defined probability distribution model, may mislead the learning process if the assumption is wrong and result in a poor model. Besides, the optimization of a flexible probability distribution model is usually highly complicated and requires much more computational time and resources. In Table 4, a general comparison of the one-shot and two-stage strategies is presented. In this tutorial, we focus more on the one-shot strategy; readers who are interested in the two-stage strategy can refer to several excellent published books [9][15]. Note that although the two strategies differ in what they seek during the learning phase, in the testing phase both are measured by the classification error for performance evaluation.
Fig. 13 The learning structures of the one-shot and two-stage strategies: (a) the one-shot strategy, and (b) the two-stage strategy, where $\theta$ is the parameter set of the selected probability distribution model.
Table 3 The classifier modeling and optimization criteria for these two strategies.
One-shot (discriminant): classifier modeling $y^* = f(x)$; optimization criterion $g = \arg\min_h \frac{1}{N} \sum_{n=1}^{N} [\![\, y^{(n)} \neq h(x^{(n)}) \,]\!]$
Two-stage (discriminative): classifier modeling $y^* = \arg\max_y P(y \mid x)$; optimization criterion $\theta^* = \arg\max_\theta P(Y \mid X; \theta)$
Two-stage (generative): classifier modeling $y^* = \arg\max_y \frac{P(x \mid y)\, P(y)}{P(x)}$; optimization criterion $\theta^* = \arg\max_\theta P(X, Y; \theta)$
Table 4 Comparisons of the one-shot and two-stage strategies from several aspects.
Model: one-shot - discriminant; two-stage - discriminative or generative.
Advantage: one-shot - fewer assumptions, the model is direct towards the classification goal, and the optimization is direct towards a low error rate; two-stage - more flexible, more discovery power, provides uncertainty, and domain knowledge is easily included.
Disadvantage: one-shot - no probabilistic information; two-stage - more assumptions and higher computational complexity.
Usage: one-shot - usually supervised learning; two-stage - supervised and unsupervised learning.
Symbolic classifiers: one-shot - Adaboost, support vector machines (SVM), multilayer perceptrons (MLP); two-stage - Gaussian discriminant analysis, hidden Markov models (HMM), naive Bayes.
4. Principles and Effects of Machine Learning
In this section, several practical issues of machine learning will be introduced and discussed, especially for classification problems. At the beginning, the VC bound is revisited and explained in more detail, and three effects based on it are introduced. Then, how to select and modify a model (hypothesis set + objective function) for the training set and the problem at hand is discussed. Three principles that we should keep in mind when considering a machine learning problem come afterwards, and finally we take a first glance at some practical issues.
$d_{VC} \uparrow \;\Rightarrow\; O\!\left(\sqrt{\frac{d_{VC}}{N} \log N}\right) \uparrow \qquad (12)$

From the VC bound, we know that the desired quantity to be minimized, $E_{out}(g)$, depends on both terms in (11) and (12), which means that even if $E_{in}(g)$ is small, the additional term $O(\sqrt{\frac{d_{VC}}{N} \log N})$ must also be kept small to ensure a small bound on $E_{out}(g)$. Unfortunately, by changing the elements (hypothesis sets, objective functions, and optimization methods) in the learning structure as well as changing $d_{VC}$ to reduce one term in the VC bound, the other term will increase, and we don't know how $E_{in}(g)$ will vary.

In addition to $d_{VC}$, which strongly depends on the selected model, $E_{in}(g)$ and $O(\sqrt{\frac{d_{VC}}{N} \log N})$ are also affected by the training set characteristics, such as N and d:

$N \uparrow \;\Rightarrow\; O\!\left(\sqrt{\frac{d_{VC}}{N} \log N}\right) \downarrow \qquad (13)$

$d \uparrow \;\Rightarrow\; d_{VC} \uparrow \;\Rightarrow\; O\!\left(\sqrt{\frac{d_{VC}}{N} \log N}\right) \uparrow \qquad (14)$

where d is the dimensionality of the feature vectors. The more samples the training set contains, the higher the credibility of the properties learned from it. Besides, the feature dimensionality has a positive connection with the VC dimension $d_{VC}$: different feature dimensionalities result in hypothesis sets with different dimensionalities or numbers of parameters, which indicates a change in model complexity. So when d increases, $E_{in}(g)$ decreases, while $O(\sqrt{\frac{d_{VC}}{N} \log N})$ becomes larger. Some illustrations below give the reader more details.
Over-fitting versus under-fitting (fixed N): As shown in Fig. 14, this effect illustrates the relation between $d_{VC}$, model complexity, $E_{in}(g)$, and $E_{out}(g)$. Given a training set, if a too simple model is used for learning, then both $E_{in}(g)$ and $E_{out}(g)$ will be rather high, which is called under-fitting. On the other hand, if an over-complicated model is exploited, although a really small $E_{in}(g)$ can probably be achieved, $E_{out}(g)$ will still be high due to a large $O(\sqrt{\frac{d_{VC}}{N} \log N})$, which is called over-fitting. Based on this effect and observation, selecting a suitable model, and hence a moderate $d_{VC}$, plays an important role in machine learning. The VC dimension $d_{VC}$ can be controlled by the type of hypothesis set, the objective function, and the feature dimensionality. Although the feature dimensionality d is given by the training set, several operations can be performed to reduce or increase it when feeding the training set into the learning structure (which will be further discussed in later sections).
Bias versus variance (fixed N): As shown in Fig. 15, the bias versus variance effect has a curve similar to the under-fitting versus over-fitting curve shown in Fig. 14, while the explanation is different and focuses more on statistics and regression analysis. Bias measures how poorly the model can fit the training set (the smaller the better), where stronger assumptions on the training set result in a larger bias. For example, the bias of using linear classifiers is bigger than the bias of using nonlinear classifiers, because the set of nonlinear classifiers contains the set of linear classifiers and is more general. On the other hand, the variance term means the variation of the final hypotheses when different training sets coming from the same universal set are given.
Let us now revisit the Hoeffding inequality and the VC bound mentioned in (1) and (10). Readers may wonder why there is a probability term in these two inequalities; the reason comes from the quality of the training set. The desired goal of machine learning is to find the properties of the universal set, while the only thing we observe during learning is the training set. There is uncertainty about how representative the training set is of the universal set, and the probability term stands for the chance that a poor training set is observed.
As the definitions of bias and variance go, a low-bias model has a strong ability to fit the training set and reach a low $E_{in}(g)$, as mentioned in (2) and (6). If the training set is representative, the final hypothesis g will be really close to the target function f, while if the training set is poor, g can be really dissimilar from f. These effects result in a large variance for a low-bias model. In contrast, a high-bias model has a poor ability to fit the training data, while the variance among the final hypotheses based on different training sets is small, due to the limited variation in the hypothesis set. For statisticians and regression analysis, the balance between bias and variance is the key to judging the learning performance, and the relationship between $d_{VC}$ and the bias versus variance effect is illustrated in Fig. 15. Although the shapes of bias and variance look really similar to those of $E_{in}(g)$ and $O(\sqrt{\frac{d_{VC}}{N} \log N})$, there is no strict, direct relationship among them.
$E_X E_f\!\left[(y - g(x))^2\right] = E_X E_f\!\left[(y - E_f(y) + E_f(y) - g(x))^2\right]$
$\quad = E_X\!\left[E_f\!\left[(y - E_f(y))^2\right] + (E_f(y) - g(x))^2\right]$
$\quad = E_X\!\left[\mathrm{var}(y) + (E_f(y) - E_X[g(x)] + E_X[g(x)] - g(x))^2\right] \qquad (16)$
$\quad = \mathrm{var}(y) + E_X\!\left[(E_f(y) - E_X[g(x)])^2\right] + E_X\!\left[(E_X[g(x)] - g(x))^2\right]$
$\quad = \mathrm{var}(y) + (E_f(y) - E_X[g(x)])^2 + \mathrm{var}(g(x))$
$\quad = \mathrm{var}(y) + \mathrm{bias}(g(x)) + \mathrm{var}(g(x))$

where $E_f$ means the expectation over the target distribution given x, and $E_X$ is the expectation over all possible training sets (poor or representative); the cross terms vanish because $E_f[y - E_f(y)] = 0$ and $E_X[E_X[g(x)] - g(x)] = 0$. The first term in the last row shows the intrinsic data variation coming from the target distribution f, and the two following terms stand for the bias and variance of the selected hypothesis set as well as the objective function.
Learning curve (fixed $d_{VC}$): The learning curve shown in Fig. 16 looks very different from the previous two figures; the relationship considered in this effect is between the VC bound and the training set size N. As mentioned in (13), when N increases, $O(\sqrt{\frac{d_{VC}}{N} \log N})$ decreases, and the learning curve shows how N affects $E_{in}(g)$ and $E_{out}(g)$. When the data size is very small, a selected model has the chance to achieve an extremely low $E_{in}(g)$; for example, if only two distinct 2-D features are included in the training set, then no matter what their labels are, linear classifiers can attain $E_{in}(g) = 0$. As N increases, there will be more and more training samples that the selected model cannot handle, resulting in wrong predictions. But, perhaps surprisingly, $E_{in}(g)$ increases more slowly along N than the generalization gap decreases, which means that increasing N generally improves the learning performance.
To summarize the three effects introduced above, we find that the selection of the model is one of the most important parts of machine learning. A suitable model not only reaches an acceptable $E_{in}(g)$ but also limits the generalization gap as well as $O(\sqrt{\frac{d_{VC}}{N} \log N})$. An over-complicated model with an extremely low $E_{in}(g)$ may cause the over-fitting effect, while a too simple model with an extremely small $O(\sqrt{\frac{d_{VC}}{N} \log N})$ may result in the under-fitting effect. These two cases both degrade the learning performance $E_{out}(g)$. Besides, when the model and feature dimensionality are fixed, increasing the size of the training set generally improves $E_{out}(g)$. Furthermore, when judging whether a model is complicated or not for a given problem, not only $d_{VC}$ but also N should be considered. Table 5 lists these important concepts based on the VC bound.
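The over-fitting and under-fitting effects can be reproduced numerically. The sketch below (Python with NumPy; the target function, noise level, sample sizes, and polynomial degrees are all arbitrary choices for illustration) fits models of increasing complexity and compares the in-sample and out-of-sample errors:

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x)                # the unknown target function

    def sample(n):
        x = rng.uniform(0, 1, n)
        return x, f(x) + 0.2 * rng.standard_normal(n)  # noisy observations

    x_tr, y_tr = sample(15)                            # a small training set
    x_te, y_te = sample(1000)                          # a large held-out sample

    for degree in (1, 3, 12):                          # simple -> complex models
        w = np.polyfit(x_tr, y_tr, degree)             # least-squares polynomial fit
        E_in = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
        E_out = np.mean((np.polyval(w, x_te) - y_te) ** 2)
        print(f"degree {degree:2d}: E_in = {E_in:.3f}, E_out = {E_out:.3f}")

A degree-1 fit typically shows under-fitting (both errors high), while a degree-12 fit drives E_in down but lets E_out grow, matching the curves in Fig. 14.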
Now we extend the learning process to a more generalized procedure. As summarized in Table 1, given a fixed model with its fixed $d_{VC}$, the objective function is minimized over the training set, and the attained final hypothesis g is expected to induce a low $E_{out}(g)$ through the VC bound relationship and is then used for further application. If there are many possible models and a fixed training set at hand, model selection becomes a necessary and important step during learning, which aims at searching for the best model with the lowest generalization error. Note that we cannot perform model selection based on $E_{in}(g)$ or $E_{app}(g)$ alone, because each model has its specific $d_{VC}$. Unfortunately, the only term we can measure during learning is $E_{in}(g)$, with the test set reserved and $O(\sqrt{\frac{d_{VC}}{N} \log N})$ nearly impossible to evaluate (it is just an upper bound, and $d_{VC}$ is usually hard to determine); in fact, $O(\sqrt{\frac{d_{VC}}{N} \log N})$ often serves only as a theoretical guideline in model selection. Furthermore, not only the VC dimension affects the performance of learning: different hypothesis types and different objective functions, having their own specific learning properties, discover various aspects of knowledge and result in different performances for the given problem, even if they are of the same VC dimension $d_{VC}$. Given this diversity of models, a more practical method for selecting suitable models is in high demand. In the next two subsections, several popular methods for generating model diversity and performing model selection are introduced and discussed.
Table 5 Important concepts based on the VC bound.
Over-fitting: a small $E_{in}(g)$ with a high $O(\sqrt{\frac{d_{VC}}{N} \log N})$, caused by a large $d_{VC}$ or a small N.
To maintain the same $O(\sqrt{\frac{d_{VC}}{N} \log N})$: when $d_{VC}$ increases, N should also increase.
Practical usage of $d_{VC}$ for a given N: $N \geq 10\, d_{VC}$ usually performs well.
Hypothesis set types and objective functions (Type I): Different hypothesis set types (ex. KNN, decision trees, and linear classifiers) result in different models. Furthermore, even within the same class, such as linear classifiers, different objective functions (ex. square error and hinge loss) come up with different learning performances.
Model parameters (Type II): Even under the same hypothesis set type and objective function, there are still some free parameters that adjust the hypothesis set. For example, in KNN (K-nearest neighbors), different selections of K may result in different learning performances. The use of SVMs and multilayer perceptrons also requires users to set some parameters before execution. Generally, these parameters are connected to the model complexity and $d_{VC}$.
Feature transform (Type III): Last but not least, changing the dimensionality of the feature vectors results in a different $d_{VC}$ of the model. There are a bunch of methods for modifying the feature vector dimensionality, and the general framework is formulated by basis functions:

$\Phi(x) = [1, \phi_1(x), \phi_2(x), \ldots, \phi_{\tilde{d}}(x)]^T \qquad (17)$

where the added 1 is for the offset when using linear classifiers as in (7). Table 6 and Table 7 list several useful feature transforms and their definitions, and as you can see, we can always perform a feature transform before feeding feature vectors into the learning machine. In addition to these kinds of geometry- or mathematics-driven feature transforms, there are also data-driven feature transforms that define their basis functions from learning (ex. PCA and LDA) and knowledge-driven feature transforms based on the characteristics of the problem (ex. DCT and DFT). We will mention these transforms in later sections.
Based on these three methods for achieving different models as well as different model complexities, we can now generate several models and perform model selection to choose the best among them.
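As a minimal sketch of the basis-function framework in (17), the function below (Python with NumPy) implements a 2nd-order polynomial transform for 2-D feature vectors, in the spirit of the transforms listed in Table 6; the exact set of basis functions is an illustrative assumption:

    import numpy as np

    def phi_2nd_order(x):
        # [1, x1, x2, x1^2, x1*x2, x2^2]; the leading 1 provides the offset
        x1, x2 = x
        return np.array([1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

    x = np.array([0.5, -1.0])
    z = phi_2nd_order(x)   # transformed feature, fed to a linear classifier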
Decision stump: $\Phi_S(x) = [1, x_j]^T$, where $1 \leq j \leq d$
1st-order: $\Phi_1(x) = [1, x_1, x_2]^T$ (for a 2-D feature vector)
than a line boundary. To search for the hypothesis which minimizes $E_{in}$ (or $E_{app}$) while penalizing complexity, the objective function is augmented with a regularization term, $E_{obj}(h) = E_{app}(h) + \lambda\, \Omega(h)$, where the $\Omega(h)$ term is used to penalize hypotheses h with higher complexity. As mentioned in Section 3.4, the objective function is composed of a loss function as well as other pursuits, such as the penalty function $\Omega(h)$; the approximation functions introduced in Section 3.5 are indeed loss functions, because they are defined to measure the classification error on the training set. In fact, regularization searches for the best hypothesis inside a hypothesis set, not among several hypothesis sets. Several widely used penalty functions and the corresponding objective functions are listed in Table 8, where $\lambda$ is the model parameter (Type II defined above) used to balance the classification error on the training set against the penalty term. The reason why the objective function can affect the model complexity, as well as $d_{VC}$, is that the penalty function introduced inside it can control them.
L2: $E_{obj}(h) = E_{app}(h) + \frac{\lambda}{2} \sum_i w_i^2$
Table 9 The validation procedure for model selection
Before learning:
    (1) M candidate models
    (2) A training set: $\mathcal{D}_{train} = \mathcal{D}_{base} \cup \mathcal{D}_{val}$ (for base training and validation)
    (3) $E_{obj}^{(m)}(\cdot, \mathcal{D}_{base})$ is the objective function of model m measured on $\mathcal{D}_{base}$
Training:
    for m = 1 : M
        find $g_m = \arg\min_{h_m} E_{obj}^{(m)}(h_m, \mathcal{D}_{base})$
    end
    (Find the final hypothesis of each model m.)
Validation:
    Find the model l with $l = \arg\min_m \frac{1}{N_{val}} \sum_{n=1}^{N_{val}} E_{val}(g_m, n)$, then retrain.
    (Run these M final hypotheses on the validation set, and select the model with the lowest validation error.)
Retrain:
    find $g_l = \arg\min_{h_l} E_{obj}^{(l)}(h_l, \mathcal{D}_{train})$, and use it for prediction
    (Retrain the selected model on the whole training set.)
The model with the best g (achieving the lowest classification error $E_{val}(g)$) is selected as the winner model for the problem at hand and is expected to perform well on unseen data. Table 9 describes this procedure in more detail. There are generally four kinds of validation strategies: one-shot validation, multi-shot validation, cross validation (CV), and leave-one-out (LOO) validation, as listed in Table 10. Besides LOO and cross validation, the other two strategies typically set the validation set size K to $(10\% \sim 40\%) \times N$ in practice. The soundness of validation rests on some theoretical proofs, which are beyond the scope of this tutorial. In recent pattern recognition research, validation is the most popular method for performance comparison.
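A sketch of the one-shot validation procedure of Table 9 is given below (Python with NumPy). Representing each candidate model as a hypothetical (fit, predict) pair of functions is an assumption of this sketch, not the text's notation:

    import numpy as np

    def one_shot_validation(models, X, y, val_ratio=0.3, seed=0):
        # Split D_train into D_base and D_val, train each model on D_base,
        # and pick the model with the lowest validation error.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        n_val = int(len(X) * val_ratio)
        val, base = idx[:n_val], idx[n_val:]
        errors = []
        for fit, predict in models:
            g = fit(X[base], y[base])                        # train on D_base
            errors.append(np.mean(predict(g, X[val]) != y[val]))
        l = int(np.argmin(errors))                           # the winner model
        fit, predict = models[l]
        return fit(X, y), l                                  # retrain on all of D_train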
Occam's razor: "The simplest model that fits the data is also the most plausible," which means that if two models achieve the same expected $E_{in}(g)$, the simpler one is the more suitable model.
Sampling bias: If the data is sampled in a biased way, then learning will produce a similarly biased outcome. For example, if a survey asking "How does the Internet affect your life?" is performed on-line, the statistical result risks over-estimating the benefit of the Internet, because people who don't like to use the Internet are likely to miss the survey.
Data snooping: If a data set has affected any step in the learning process, it cannot be fully trusted in assessing the outcome. During learning, the hypothesis g which best fits the training data through minimizing the objective function is selected, and during testing, we test how well this learned hypothesis generalizes to the test data. The reason why there exists a generalization gap is that the learned hypothesis g is biased by, and may over-fit, the training data. If a data set affects both the learning and the test phase, we cannot correctly detect the generalization gap and will over-estimate the performance of the model.
Parametric model: For a clear description of the four categories, the parametric model is introduced before the other two methods. A model is called parametric if it is built on a well-defined probabilistic distribution model, which means that once the parameters of the distribution model are learned from the training set, we can discard the training set and reserve only these parameters for testing and prediction. Generally speaking, when the type of probabilistic model is set, if the number of parameters (values the model needs to remember) doesn't change no matter how many samples are in the training set, then the model is called a parametric model.
Readers might also wonder what the relationship is between the one-shot / two-stage strategies mentioned in Section 3.6 and the four categories defined in this section. In Section 3.6, we discussed what information a classifier (regressor) can provide as well as the optimization criteria during learning, while in this section, the four categories are defined based on their basic ideas about hypothesis set types. Indeed, a linear classifier, which is usually categorized as a one-shot method, can also be modified into a probabilistic version based on some probabilistic model assumption, which means that each category in this section may contain both one-shot and two-stage classifiers (regressors). As a consequence, the one-shot and two-stage strategies are not explicitly mentioned in this section.
Learning:
    for t = 1, 2, ...
        randomly pick an $(x^{(n)}, y^{(n)})$ with $h_{w(t)}(x^{(n)}) = \mathrm{sign}(w(t)^T \tilde{x}^{(n)}) \neq y^{(n)}$
        $w(t+1) = w(t) + y^{(n)}\, \tilde{x}^{(n)}$   (key process)
        if $E_{in}(h_{w(t+1)}) = 0$
            $g = h_{w(t+1)}$, break
        end
    end
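A runnable sketch of the PLA pseudocode above (Python with NumPy; labels are assumed to be +1/-1, and the loop only terminates if the training set is linearly separable, which is exactly the motivation for the pocket algorithm that follows):

    import numpy as np

    def pla(X, y, seed=0):
        # Perceptron learning on extended features [1, x1, ..., xd]
        rng = np.random.default_rng(seed)
        X_ext = np.hstack([np.ones((len(X), 1)), X])   # add the offset coordinate
        w = np.zeros(X_ext.shape[1])
        while True:
            wrong = np.flatnonzero(np.sign(X_ext @ w) != y)  # misclassified samples
            if wrong.size == 0:                              # E_in = 0: done
                return w
            n = rng.choice(wrong)                            # randomly pick one
            w = w + y[n] * X_ext[n]                          # the key update step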
    for t = 1 : T
        randomly pick an $(x^{(n)}, y^{(n)})$ with $h_{w(t)}(x^{(n)}) = \mathrm{sign}(w(t)^T \tilde{x}^{(n)}) \neq y^{(n)}$
        $w(t+1) = w(t) + y^{(n)}\, \tilde{x}^{(n)}$
        if $E_{in}(h_{w(t+1)}) < E_{in}(h_{w^*})$
            $w^* = w(t+1)$   (extra "pocket" process)
        end
        if $E_{in}(h_{w^*}) = 0$ or $t = T$
            $g = h_{w^*}$, break
        end
    end
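The pocket variant adds one step to the PLA sketch above: keep the best w seen so far (the same +1/-1 label convention is assumed; T is the user-chosen iteration budget):

    import numpy as np

    def pocket(X, y, T=1000, seed=0):
        rng = np.random.default_rng(seed)
        X_ext = np.hstack([np.ones((len(X), 1)), X])
        e_in = lambda v: np.mean(np.sign(X_ext @ v) != y)    # in-sample error
        w = np.zeros(X_ext.shape[1])
        w_best, e_best = w.copy(), e_in(w)
        for _ in range(T):
            wrong = np.flatnonzero(np.sign(X_ext @ w) != y)
            if wrong.size == 0:
                return w                                     # perfectly separated
            n = rng.choice(wrong)
            w = w + y[n] * X_ext[n]                          # PLA update
            if e_in(w) < e_best:                             # the extra "pocket" step
                w_best, e_best = w.copy(), e_in(w)
        return w_best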
The pocket algorithm can find the best hypothesis reaching the minimum in-sample error within T iterations, while for a training set that is not linearly separable, there is no guarantee of how large T must be before a plausibly low in-sample error is achieved. Besides, because linear classifiers can only generate linear classification boundaries, the pocket algorithm still cannot handle training sets that are not linearly separable very well, especially when the class boundary of the training set is far from a line. To address this problem, we show in the next subsection that performing a feature transform can make the linear classifier available for non-linear boundary cases.
training set that can be separated by a circle $x_1^2 + x_2^2 = 1$ is presented. If the linear
The feature transform does bring linear classifiers into cases that are not linearly separable, but which feature transform should be used remains a question. A higher-order feature transform has a bigger chance of making the training set linearly separable, while it may cause the over-fitting problem. On the other hand, if the transformed training set is still not linearly separable, the pocket algorithm has no guarantee of achieving a plausibly low in-sample error within T iterations, because the updating rule of PLA doesn't ensure a monotonic decrease of the in-sample error. In order to speed up the learning of linear classifiers and ensure its stability, modifying the non-continuous objective function to make other optimization methods available is required.
The loss function of a single sample is shown in (24), and the objective function of the training set is defined as:

$E_{app}(h_w) = \frac{1}{N} \sum_{n=1}^{N} \left(y^{(n)} - w^T \tilde{x}^{(n)}\right)^2 \left[\!\left[\, y^{(n)} \left(w^T \tilde{x}^{(n)}\right) < 1 \,\right]\!\right]$

Table 14 The Adaline algorithm
Presetting:
    $w(1)$, usually set to the (d+1)-dimensional zero vector
    The maximum number of iterations T
    The learning step $\eta$, ex. 0.01
Learning:
    for t = 1 : T
        randomly pick an $(x^{(n)}, y^{(n)})$
        if $y^{(n)} (w(t)^T \tilde{x}^{(n)}) < 1$
            $w(t+1) = w(t) + \eta \left(y^{(n)} - w(t)^T \tilde{x}^{(n)}\right) \tilde{x}^{(n)}$
        end
    end
    $g = h_{w(T+1)}$

$loss^{(n)}(h_w) = \begin{cases} \left(y^{(n)} - w^T \tilde{x}^{(n)}\right)^2, & \text{if } y^{(n)} \left(w^T \tilde{x}^{(n)}\right) < 1 \\ 0, & \text{otherwise} \end{cases} \qquad (24)$

This loss is both continuous and differentiable at any w for a given data pair, so the gradient descent concept summarized in Table 15 becomes applicable.
Table 15 Concept of gradient descent
Presetting:
    Assume a function $E(w)$ is to be minimized, where w is k-dimensional.
    $w^* = \arg\{\nabla E(w) = 0\}$ is a solution, but it is sometimes hard to compute due to coupling across parameters and a large summation caused by the training set size.
    Assume now we are at $w = w_0$, and we want to take a small modification $\Delta w$ which makes $E(w_0 + \Delta w) < E(w_0)$; then the Taylor expansion can be applied.
Taylor expansion:
    scalar w: $E(w_0 + \Delta w) = E(w_0) + E'(w_0)\, \Delta w + \frac{E''(w_0)}{2!}\, \Delta w^2 + \text{H.O.T. (higher-order terms)}$
    vector w: $E(w_0 + \Delta w) = E(w_0) + J(w_0)\, \Delta w + \frac{1}{2!}\, \Delta w^T H(w_0)\, \Delta w + \text{H.O.T.}$
    where $J(w)$ is the Jacobian (gradient) of $E$: $J(w) = \nabla E(w)^T = \left[\frac{\partial E(w)}{\partial w_1}, \ldots, \frac{\partial E(w)}{\partial w_k}\right]$
    and $H(w)$ is the Hessian matrix: $H(w) = \nabla^2 E(w)$, with entries $\frac{\partial^2 E(w)}{\partial w_i \partial w_j}$ for $i, j = 1, \ldots, k$.
    In machine learning, the H.O.T. is often discarded.
Gradient descent update:
    $\Delta w^* = -\eta\, \frac{\nabla E(w_0)}{\|\nabla E(w_0)\|}$, where $\eta$ is set as a small constant for convenience
    $E(w_0 + \Delta w^*) \approx E(w_0) + \nabla E(w_0)^T \Delta w^* = E(w_0) - \eta\, \|\nabla E(w_0)\|$
    $w_0(\text{new}) = w_0(\text{old}) - \eta\, \nabla E(w_0)$
Algorithm of gradient descent (GD):
    for t = 1 : T
        $w(t+1) = w(t) - \eta\, \nabla E(w(t)) = w(t) - \eta \sum_{n=1}^{N} \nabla loss^{(n)}(w(t))$
    end
    $g = h_{w(T+1)}$
Algorithm of stochastic gradient descent (SGD):
    $E(w) = \frac{1}{N} \sum_{n=1}^{N} loss^{(n)}(w) = E\!\left[loss^{(n)}(w)\right]$
    for t = 1 : T
        randomly choose an $n^*$
        $w(t+1) = w(t) - \eta\, \nabla loss^{(n^*)}(w(t))$
    end
    $g = h_{w(T+1)}$
    (SGD only performs gradient descent on one sample at a time, iteratively.)
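The two procedures above can be sketched generically (Python with NumPy). The per-sample quadratic loss used here is only a stand-in so that the example stays self-contained; the step size and iteration count are arbitrary:

    import numpy as np

    # Stand-in per-sample loss: loss_n(w) = (w - c_n)^2, gradient 2 * (w - c_n)
    c = np.random.randn(100)
    grad_n = lambda w, n: 2.0 * (w - c[n])

    eta, T = 0.001, 3000
    w_gd, w_sgd = 0.0, 0.0
    for t in range(T):
        # GD: step along the gradient summed over all N samples
        w_gd -= eta * sum(grad_n(w_gd, n) for n in range(len(c)))
        # SGD: step along the gradient of one randomly chosen sample
        n = np.random.randint(len(c))
        w_sgd -= eta * grad_n(w_sgd, n)

    # Both approach the minimizer of sum_n (w - c_n)^2, i.e., the mean of c
    print(w_gd, w_sgd, c.mean())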
Since the loss function (24) is continuous and piecewise differentiable, differentiation-based optimization methods are now available. The Adaline (adaptive linear neuron) algorithm, a variant of perceptron learning, uses stochastic gradient descent (SGD) to search for the best hypothesis minimizing the objective function (without any penalty term). The pseudocode of the Adaline algorithm is presented in Table 14, and the gradient descent concept and the GD / SGD procedures are described in detail in Table 15 and the algorithm boxes above.
From these tables, the mechanism of using gradient descent for optimization should be clear. There are several adjustable items in the gradient descent algorithm, such as the learning step $\eta$ and the number of iterations T. The learning step should be kept small to meet the requirement of the Taylor expansion. A too small learning step takes more iterations to converge; on the other hand, a large learning step may take fewer iterations to converge, while it has a chance of diverging or reaching a wrong solution. The number of iterations can be defined before or during learning, where the difference between $w(t+1)$ and $w(t)$ is used as a measure of convergence. Generally, SGD takes more iterations than GD to converge.
GD, SGD, and $w^* = \arg\{\nabla E(w) = 0\}$ search for a local minimum of $E(w)$, which means that the achieved final hypothesis g may not actually minimize the objective function. However, if $E(w)$ is a convex function of w, then any local minimum of $E(w)$ is exactly the global minimum of $E(w)$. The definition of convexity can be found in the textbook by S. Boyd et al. [22]. In fact, stochastic gradient descent cannot even reach the local minimum exactly, but oscillates around it after a number of iterations.
Adaline provides a much more stable learning algorithm than PLA. Although the function minimized by Adaline is just an approximated loss function, not directly the in-sample error of the ERM strategy, the in-sample error resulting from the final hypothesis g is usually not far from the minimum value.
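A runnable sketch of Adaline with SGD, following Table 14 (Python with NumPy; +1/-1 labels are assumed, and eta and T correspond to the user-set values of the presetting step):

    import numpy as np

    def adaline_sgd(X, y, eta=0.01, T=10000, seed=0):
        rng = np.random.default_rng(seed)
        X_ext = np.hstack([np.ones((len(X), 1)), X])   # extended feature vectors
        w = np.zeros(X_ext.shape[1])                   # w(1): the zero vector
        for _ in range(T):
            n = rng.integers(len(X))                   # randomly pick a sample
            if y[n] * (w @ X_ext[n]) < 1:              # inside the loss region of (24)
                w = w + eta * (y[n] - w @ X_ext[n]) * X_ext[n]
        return w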
Having defined the regression model, we also need to define the objective function and the optimization method for training. As mentioned in Section 3.4, the most widely used criterion for regression is to minimize the root mean square error (RMSE):

$E(h) = \frac{1}{N} \sum_{n=1}^{N} \left(y^{(n)} - h(x^{(n)})\right)^2 = \frac{1}{N} \sum_{n=1}^{N} \left(y^{(n)} - w^T \tilde{x}^{(n)}\right)^2. \qquad (26)$

This function is continuous and differentiable, so gradient descent or stochastic gradient descent can be applied for optimization. Furthermore, because $(y^{(n)} - w^T \tilde{x}^{(n)})^2$ is a convex function of w and a positively weighted summation of convex functions is still convex [22], the final solution provided by gradient descent is a global minimum solution (or very close to the global minimum, due to the iteration limit).
Besides applying general differentiation-based optimization methods, there exists a closed-form solution for linear regressors with the RMSE criterion. This closed-form solution is quite general and can be applied not only to linear regressors but also to many optimization problems with the RMSE criterion. In Table 17, we summarize this formulation. Note that for linear regressors with other kinds of objective functions, this closed-form solution may not exist. For more understanding of linear regression and other kinds of objective functions, the textbook [16] is recommended.
Table 17 The closed-form solution for linear regression with the RMSE criterion
Presetting:
    A function $E(h) = \frac{1}{N} \sum_{n=1}^{N} (y^{(n)} - h(x^{(n)}))^2 = \frac{1}{N} \sum_{n=1}^{N} (y^{(n)} - w^T \tilde{x}^{(n)})^2$ is to be minimized, where $X = [\tilde{x}^{(1)}, \ldots, \tilde{x}^{(N)}]$ collects the extended feature vectors as columns and $Y = [y^{(1)}, \ldots, y^{(N)}]$ is the row vector of targets.
Derivation:
    $E(w) = \frac{1}{N} \left(Y Y^T - 2 w^T X Y^T + w^T X X^T w\right)$
    $\frac{\partial E(w)}{\partial w} = \frac{1}{N} \left(-2 X Y^T + 2 (X X^T) w\right)$
    $-2 X Y^T + 2 (X X^T) w^* = 0 \;\Rightarrow\; (X X^T) w^* = X Y^T$, where $X X^T$ is $(d+1) \times (d+1)$
    If $X X^T$ is nonsingular (when $N \gg d$, this is usually the case), $w^* = (X X^T)^{-1} X Y^T$.
    If $X X^T$ is singular, other treatments such as the pseudo-inverse or the SVD can be applied.
With L2 regularization (ridge regression):
    $E(w) = \frac{1}{N} \left(Y Y^T - 2 w^T X Y^T + w^T (X X^T + \lambda I) w\right)$
    $\frac{\partial E(w)}{\partial w} = \frac{1}{N} \left(-2 X Y^T + 2 (X X^T + \lambda I) w\right)$
    $-2 X Y^T + 2 (X X^T + \lambda I) w^* = 0$
    $(X X^T + \lambda I) w^* = X Y^T$, where $X X^T$ is $(d+1) \times (d+1)$
    $w^* = (X X^T + \lambda I)^{-1} X Y^T$
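The closed-form solutions of Table 17 translate directly into code (Python with NumPy; here X is arranged with one extended feature vector per column, matching the table's notation, and lam = 0 recovers the plain RMSE solution):

    import numpy as np

    def linear_regression(X, Y, lam=0.0):
        # w* = (X X^T + lam * I)^(-1) X Y^T; lam > 0 gives the ridge variant
        d = X.shape[0]
        A = X @ X.T + lam * np.eye(d)
        # np.linalg.solve is preferred over an explicit inverse for stability
        return np.linalg.solve(A, X @ Y.T)

    # Example: N = 200 samples, d = 3 features plus the offset coordinate
    N = 200
    feats = np.random.randn(N, 3)
    X = np.vstack([np.ones(N), feats.T])               # (d+1) x N, columns = samples
    Y = (2.0 + feats @ np.array([1.0, -2.0, 0.5])
         + 0.1 * np.random.randn(N)).reshape(1, N)
    w_star = linear_regression(X, Y)                   # close to [2, 1, -2, 0.5]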
One-versus-one (OVO): Assume that there are c classes in total. OVO builds a binary classifier for each pair of classes, which means that $c(c-1)/2$ binary classifiers are built in total. When given an input sample x, each classifier predicts a possible class label, and the final predicted label is the one with the most votes among all $c(c-1)/2$ classifiers.
One-versus-all (OVA): OVA builds a binary classifier for each class (positive) against all other classes (negative), which means that c binary classifiers are built in total. When given an input sample x, the class label corresponding to the classifier that gives a positive decision on x is selected as the final predicted label.
In general, the OVA method suffers from two problems. The first is that there may be more than one positive class or no positive class at all, and the second is the class imbalance problem. The imbalance problem means that the number of positive training samples is either much larger or much smaller than the number of negative training samples. In this condition, the trained classifier tends to always predict the class with more training samples, leading to poor performance on unseen samples. For example, if there are 100 positive training samples and 9900 negative samples, always predicting the negative class simply results in a 0.01 in-sample error. Although the OVO method doesn't suffer from these problems, it needs to build more binary classifiers ($(c-1)/2$ times more) than the OVA method, which is computationally expensive, especially when c is large.
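A sketch of the OVA decomposition (Python with NumPy): train_binary stands for any binary linear learner from the previous subsections (for instance the pocket or Adaline sketches) and is a hypothetical placeholder here. Picking the class with the largest signed score, rather than a hard positive decision, sidesteps the "no positive / multiple positives" problem:

    import numpy as np

    def ova_train(X, y, classes, train_binary):
        # One binary classifier per class: that class (+1) vs. the rest (-1)
        return {c: train_binary(X, np.where(y == c, 1, -1)) for c in classes}

    def ova_predict(ws, x_ext):
        # Choose the class whose classifier gives the largest signed score
        return max(ws, key=lambda c: ws[c] @ x_ext)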
There are also other methods to extend binary linear classifiers into multi-class linear classifiers, either based on concepts similar to OVO and OVA or on theoretical modifications. For non-metric, non-parametric, and parametric models, multi-class classification is usually embedded in the basic formulation without extra modifications.
6. Techniques of Unsupervised Learning
In this section, we briefly introduce the techniques of unsupervised learning and their categorization. Compared to supervised learning, unsupervised learning is given only the feature set, not the label set. As mentioned in Section 3.2, the main goals of unsupervised learning can be categorized into clustering, probability density estimation, and dimensionality reduction:
Clustering: Given a set of samples, clustering aims to separate them into several groups based on some kind of similarity or distance measure, and the basic criterion for doing so is to minimize the intra-group distance while maximizing the inter-group distance. Clustering can discover the underlying structure in the samples, which is very important in applications such as business and medicine: separating customers or patients into groups based on their attributes and designing specific strategies or treatments for each group. In addition, the discovered groups can be used as labels for the samples, and then supervised learning techniques can be applied for further applications.
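As a minimal sketch of the clustering criterion, the K-means algorithm listed in Table 19 can be written as follows (Python with NumPy; K and the iteration count are arbitrary choices):

    import numpy as np

    def kmeans(X, K=3, iters=50, seed=0):
        # Alternate nearest-center assignment and re-centering; each step
        # shrinks the total intra-group (sample-to-center) distance.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), K, replace=False)]   # initial centers
        for _ in range(iters):
            d = np.linalg.norm(X[:, None, :] - centers[None], axis=2)
            labels = d.argmin(axis=1)                       # nearest-center groups
            for k in range(K):
                if np.any(labels == k):
                    centers[k] = X[labels == k].mean(axis=0)
        return labels, centers

    X = np.random.randn(300, 2)
    labels, centers = kmeans(X)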
In Table 19, several important techniques for each category as well as the important references are listed for further study.
Table 19 The categories and important techniques of unsupervised learning
Clustering: K-means clustering [8]; spectral clustering [27][28]
Density estimation: Gaussian mixture model (GMM) [9]; graphical models [9][15]
Dimensionality reduction: principal component analysis (PCA) [8]; factor analysis [8]
8. Conclusion
In this tutorial, a broad overview of machine learning covering both theoretical and practical aspects has been presented. Machine learning is generally composed of modeling (hypothesis set + objective function) and optimization, and the necessary ingredient for performing machine learning is a suitable dataset from which to learn knowledge. On the theoretical side, we introduced the basic ideas, categorization, structure, and criteria of machine learning. On the practical side, several principles and techniques of both supervised and unsupervised learning were presented.
9. References
[1] M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 72-86, 1991.
[2] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, 1997.
[3] P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple features, Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 511-518, 2001.
[4] P. Viola and M. Jones, Robust real-time face detection, Int'l Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[5] Flickr: http://www.flickr.com/
[6] Facebook: http://www.facebook.com/
[7] YouTube: http://www.youtube.com/
[8] E. Alpaydin, Introduction to machine learning, 2nd ed., The MIT Press, 2010.
[9] C. M. Bishop, Pattern recognition and machine learning, Springer, 2006.
[10] W. Hoeffding, Probability Inequalities for Sums of Bounded Random
Variables, American Statistical Association Journal, vol. 58, pp. 13-30, March
1963.
[11] K. Sayood, Introduction to Data Compression, Morgan Kaufmann Publishers,
1996.
[12] R. Xu and D. Wunsch II, Survey of clustering algorithms, IEEE Trans. Neural Networks, vol. 16, no. 3, pp. 645-678, 2005.
[13] I.K. Fodor, A survey of dimension reduction techniques, Technical report
UCRL-ID-148494, LLNL, 2002.
[14] L. P. Kaelbling, M. L. Littman, and A. W. Moore, Reinforcement learning: a
survey, J. Artif. Intell. Res. 4, pp. 237-285, 1996.
[15] D. Koller and N. Friedman, Probabilistic Graphical Models, MIT Press, 2009.
[16] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning,
2nd ed., Springer, 2005.
[17] J. L. Schafer, J. W. Graham, Missing data: our view of the state of the art,
Psychological Methods, vol. 7, no. 2, pp. 147-177, 2002.
[18] KDD Cup 2009: http://www.kddcup-orange.com/
[19] I. Guyon, V. Lemaire, G. Dror, and D. Vogel, Analysis of the KDD cup 2009:
Fast scoring on a large orange customer database, JMLR: Workshop and
Conference Proceedings, vol. 7, pp. 1-22, 2009.
[20] H. Y. Lo et al., An ensemble of three classifiers for KDD Cup 2009: expanded linear model, heterogeneous boosting, and selective naive Bayes, JMLR W&CP, vol. 7, KDD Cup 2009, Paris, 2009.
[21] A. Niculescu-Mizil, C. Perlich, G. Swirszcz, V. Sindhwani, Y. Liu, P. Melville, D. Wang, J. Xiao, J. Hu, M. Singh, et al., Winning the KDD Cup Orange Challenge with Ensemble Selection, KDD Cup and Workshop in conjunction with KDD 2009, 2009.
[22] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press,
Cambridge, 2004.
[23] C. C. Chang and C. J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[24] J. Platt, Sequential minimal optimization: A fast algorithm for training support
vector machines, in Advances in Kernel Methods - Support Vector Learning,
MIT Press, pp. 185-208, 1999.
[25] R. E. Fan, P. H. Chen, and C. J. Lin, Working set selection using second order information for training SVM, Journal of Machine Learning Research, vol. 6, pp. 1889-1918, 2005.
[26] C. W. Hsu and C. J. Lin, A comparison of methods for multi-class support
vector machines, IEEE Transactions on Neural Networks, vol. 13, no. 2, pp.
415-425, 2002.
[27] U. von Luxburg, A tutorial on spectral clustering, Tech. Rep. TR-149, Max Planck Institute for Biological Cybernetics, 2006.
[28] A. Ng, M. Jordan, and Y. Weiss, On spectral clustering: analysis and an
algorithm, Advances in Neural Information Processing Systems, vol. 14, 2002.