1 The Learning Problem
If you show a picture to a three-year-old and ask if there is a tree in it, you will
likely get the correct answer. If you ask a thirty-year-old what the definition
of a tree is, you will likely get an inconclusive answer. We didn't learn what
a tree is by studying the mathematical definition of trees. We learned it by
looking at trees. In other words, we learned from 'data'.
Learning from data is used in situations where we don't have an analytic
solution, but we do have data that we can use to construct an empirical solution. This premise covers a lot of territory, and indeed learning from data is
one of the most widely used techniques in science, engineering, and economics,
among other fields.
In this chapter, we present examples of learning from data and formalize
the learning problem. We also discuss the main concepts associated with
learning, and the different paradigms of learning that have been developed.
[Figure 1.1: a model for how a viewer rates a movie, based on matching the viewer's taste factors with the movie's content factors.]
Consider the problem of predicting how a viewer will rate a movie. There is no analytic formula for this task (improving such predictions was even the subject of a million-dollar competition), but we know that the historical rating data reveal a lot about how people rate movies, so we may be able to construct a good empirical solution. There is a great deal of data available to movie rental companies, since they often ask their viewers to rate the movies that they have already seen.
Figure 1.1 illustrates a specific approach that was widely used in the
million-dollar competition. Here is how it works. You describe a movie as
a long array of different factors, e.g., how much comedy is in it, how complicated is the plot, how handsome is the lead actor, etc. Now, you describe
each viewer with corresponding factors; how much do they like comedy, do
they prefer simple or complicated plots, how important are the looks of the
lead actor, and so on. How this viewer will rate that movie is now estimated
based on the match/mismatch of these factors. For example, if the movie is
pure comedy and the viewer hates comedies, the chances are he won't like it.
If you take dozens of these factors describing many facets of a movie's content
and a viewer's taste, the conclusion based on matching all the factors will be
a good predictor of how the viewer will rate the movie.
The power of learning from data is that this entire process can be automated, without any need for analyzing movie content or viewer taste. To do so, the learning algorithm 'reverse-engineers' these factors based solely on previous ratings. It starts with random factors, then tunes these factors to make
them more and more aligned with how viewers have rated movies before, until
they are ultimately able to predict how viewers rate movies in general. The
factors we end up with may not be as intuitive as 'comedy content', and in
fact can be quite subtle or even incomprehensible. After all, the algorithm is
only trying to find the best way to predict how a viewer would rate a movie,
not necessarily explain to us how it is done. This algorithm was part of the
winning solution in the million-dollar competition.
1.1.1 Components of Learning
The movie rating application captures the essence of learning from data, and
so do many other applications from vastly different fields. In order to abstract
the common core of the learning problem, we will pick one application and
use it as a metaphor for the different components of the problem. Let us take
credit approval as our metaphor.
Suppose that a bank receives thousands of credit card applications every
day, and it wants to automate the process of evaluating them. Just as in the
case of movie ratings, the bank knows of no magical formula that can pinpoint
when credit should be approved, but it has a lot of data. This calls for learning
from data, so the bank uses historical records of previous customers to figure
out a good formula for credit approval.
Each customer record has personal information related to credit, such as annual salary, years in residence, outstanding loans, etc. The record also keeps track of whether approving credit for that customer was a good idea, i.e., did
the bank make money on that customer. This data guides the construction of
a successful formula for credit approval that can be used on future applicants.
Let us give names and symbols to the main components of this learning problem. There is the input x (customer information that is used to make a credit decision), the unknown target function f: X → Y (the ideal formula for credit approval), where X is the input space (the set of all possible inputs x), and Y is the output space (the set of all possible outputs, in this case just a yes/no decision). There is a data set D of input-output examples (x1, y1), ..., (xN, yN), where each yn = f(xn) is the credit decision that turned out to be right for a previous customer xn, known in hindsight. Finally, there is the learning algorithm that uses the data set D to pick a hypothesis g: X → Y that approximates f. The algorithm chooses g from a set of candidate hypotheses under consideration, the hypothesis set H.
[Figure 1.2: the basic setup of the learning problem. The training examples (x1, y1), ..., (xN, yN) are fed to a learning algorithm that selects, from the hypothesis set H, a final hypothesis g ≈ f (the learned credit approval formula).]
We will use the setup in Figure 1.2 as our definition of the learning problem.
Later on, we will consider a number of refinements and variations to this basic
setup as needed. However, the essence of the problem will remain the same.
There is a target to be learned. It is unknown to us. We have a set of examples
generated by the target. The learning algorithm uses these examples to look
for a hypothesis that approximates the target.
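To make this setup concrete, here is a minimal sketch of the components just listed, written in Python; all names in it are illustrative choices, not part of the text, and the toy 'learning algorithm' simply picks the best hypothesis from a finite candidate set.

from typing import Callable, List, Tuple

# Input space X: a customer record is a vector of d numeric attributes.
Input = Tuple[float, ...]
# Output space Y: a yes/no credit decision encoded as +1 / -1.
Output = int
# The unknown target f, and every candidate hypothesis, map X to Y.
Hypothesis = Callable[[Input], Output]
# The data set D: input-output examples (x_n, y_n) generated by the target.
DataSet = List[Tuple[Input, Output]]

def learn(data: DataSet, hypothesis_set: List[Hypothesis]) -> Hypothesis:
    # A toy learning algorithm: return the hypothesis in a finite hypothesis
    # set that makes the fewest mistakes on the data set.
    return min(hypothesis_set, key=lambda h: sum(h(x) != y for x, y in data))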
A simple hypothesis set assigns a weight to each attribute of the customer record and bases the credit decision on whether the weighted sum of the attributes exceeds some threshold:

Approve credit if \sum_{i=1}^{d} w_i x_i > threshold,
Deny credit if \sum_{i=1}^{d} w_i x_i < threshold.

This formula can be written more compactly as

h(x) = sign( ( \sum_{i=1}^{d} w_i x_i ) + b ),    (1.1)
where x1, ..., xd are the components of the vector x; h(x) = +1 means 'approve credit' and h(x) = -1 means 'deny credit'; sign(s) = +1 if s > 0 and sign(s) = -1 if s < 0.¹ The weights are w1, ..., wd, and the threshold is determined by the bias term b, since in Equation (1.1) credit is approved if \sum_{i=1}^{d} w_i x_i > -b.
This model of H is called the perceptron, a name that it got in the context of artificial intelligence. The learning algorithm will search H by looking for weights and bias that perform well on the data set. Some of the weights w1, ..., wd may end up being negative, corresponding to an adverse effect on credit approval. For instance, the weight of the 'outstanding debt' field should come out negative since more debt is not good for credit. The bias value b may end up being large or small, reflecting how lenient or stringent the bank should be in extending credit. The optimal choices of weights and bias define the final hypothesis g ∈ H that the algorithm produces.

¹ The value of sign(s) when s = 0 is a simple technicality that we ignore for the moment.
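As a concrete illustration, here is a small sketch of the perceptron hypothesis of Equation (1.1) in Python. The use of NumPy, the two attributes, and the particular weight and bias values are assumptions made for the example, not values from the text.

import numpy as np

def perceptron_decision(x, w, b):
    # Perceptron hypothesis h(x) = sign(w^T x + b):
    # +1 means 'approve credit', -1 means 'deny credit'.
    s = np.dot(w, x) + b
    return 1 if s > 0 else -1   # the case s == 0 is ignored, as in the footnote

# Hypothetical customer with two attributes: annual salary and outstanding debt
# (both in thousands of dollars).
x = np.array([60.0, 5.0])
w = np.array([0.02, -0.10])   # 'outstanding debt' gets a negative weight
b = -1.0                      # the bias controls how lenient the bank is
print(perceptron_decision(x, w, b))   # prints +1 (approve) or -1 (deny)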
Exercise 1.2
Suppose that we use a perceptron to detect spam messages. Let's say that each email message is represented by the frequency of occurrence of keywords, and the output is +1 if the message is considered spam.
(a) Can you think of some keywords that will end up with a large positive weight in the perceptron?
(b) How about keywords that will get a negative weight?
(c) What parameter in the perceptron directly affects how many borderline messages end up being classified as spam?
6
1 . THE LEARNING PROBLEM 1 . 1 . PROBLEM S ETUP
To simplify the notation of the perceptron formula, we will treat the bias b as a weight w0 = b and merge it with the other weights into one vector w = [w0, w1, ..., wd]^T, where ^T denotes the transpose of a vector, so w is a column vector. We also treat the input as a column vector and add a coordinate x0 that is fixed at x0 = 1, so that x = [x0, x1, ..., xd]^T. With this convention, w^T x = \sum_{i=0}^{d} w_i x_i, and so Equation (1.1) can be rewritten in vector form as

h(x) = sign(w^T x).    (1.2)
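In code, this convention amounts to prepending a coordinate fixed at 1 to every input, so that the decision in (1.2) becomes a single dot product. A brief sketch, again assuming NumPy:

import numpy as np

def augment(x):
    # Prepend the fixed coordinate x0 = 1 to the input vector.
    return np.concatenate(([1.0], x))

def h(w, x_aug):
    # Vector form of the perceptron, h(x) = sign(w^T x), with w = [w0, w1, ..., wd]
    # and w0 playing the role of the bias b.
    return np.sign(np.dot(w, x_aug))   # np.sign gives 0 when the sum is exactly 0;
                                       # we ignore that technicality here

x = np.array([60.0, 5.0])
w = np.array([-1.0, 0.02, -0.10])      # [w0 = b, w1, w2]
print(h(w, augment(x)))                # same decision as sign(w1*x1 + w2*x2 + b)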
We now introduce the perceptron learning algorithm (PLA) . The algorithm
will determine what w should be, based on the data. Let us assume that the
data set is linearly separable, which means that there is a vector w that
makes (1.2) achieve the correct decision h(xn) = yn on all the training examples, as shown in Figure 1.3.
Our learning algorithm will find this w using a simple iterative method.
Here is how it works. At iteration t, where t = 0, 1, 2, . . . , there is a current
value of the weight vector, call it w(t) . The algorithm picks an example
from (x1, y1), ..., (xN, yN) that is currently misclassified, call it (x(t), y(t)), and uses it to update w(t). Since the example is misclassified, we have y(t) ≠ sign(w^T(t) x(t)). The update rule is

w(t + 1) = w(t) + y(t) x(t).    (1.3)

This rule moves the boundary in the direction of classifying x(t) correctly, as
depicted in the figure above. The algorithm continues with further iterations
until there are no longer misclassified examples in the data set .
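The iterative procedure just described can be sketched compactly in Python. This is a simplified illustration (assuming NumPy, a zero initial weight vector, and linearly separable data), not the book's reference implementation.

import numpy as np

def pla(X, y, max_iters=10000):
    # Perceptron learning algorithm.
    # X: array of shape (N, d) of inputs; y: array of N labels in {+1, -1}.
    # Returns a weight vector w = [w0, w1, ..., wd] with the bias folded in as w0.
    X_aug = np.column_stack([np.ones(len(X)), X])   # add x0 = 1 to every input
    w = np.zeros(X_aug.shape[1])                    # initial weight vector w(0)
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X_aug @ w) != y)[0]
        if len(misclassified) == 0:                 # no misclassified examples left
            return w
        n = misclassified[0]                        # pick a misclassified example
        w = w + y[n] * X_aug[n]                     # update rule (1.3)
    return w   # give up after max_iters in case the data are not separable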
Exercise 1.3
The weight update rule in (1.3) has the nice interpretation that it moves in the direction of classifying x(t) correctly.
(a) Show that y(t) w^T(t) x(t) < 0. [Hint: x(t) is misclassified by w(t).]
(b) Show that y(t) w^T(t + 1) x(t) > y(t) w^T(t) x(t). [Hint: Use (1.3).]
(c) As far as classifying x(t) is concerned, argue that the move from w(t) to w(t + 1) is a move 'in the right direction'.
Exercise 1.4
Let us create our own target function f and data set D and see how the perceptron learning algorithm works. Take d = 2 so you can visualize the problem, and choose a random line in the plane as your target function, where one side of the line maps to +1 and the other maps to -1. Choose the inputs xn of the data set as random points in the plane, and evaluate the target function on each xn to get the corresponding output yn.
Now, generate a data set of size 20. Try the perceptron learning algorithm on your data set and see how long it takes to converge and how well the final hypothesis g matches your target f. You can find other ways to play with this experiment in Problem 1.4.
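One possible way to set up the experiment in Exercise 1.4 is sketched below (assuming NumPy; the construction of the random line is just one choice among many):

import numpy as np

rng = np.random.default_rng(0)

# Target function f: a random line in the plane; one side maps to +1, the other to -1.
p1, p2 = rng.uniform(-1, 1, size=(2, 2))            # two random points define the line
normal = np.array([p2[1] - p1[1], p1[0] - p2[0]])   # a vector perpendicular to the line

def f(x):
    return 1 if np.dot(normal, x - p1) > 0 else -1

# Data set D of size N = 20: random inputs x_n, labeled by the target.
N = 20
X = rng.uniform(-1, 1, size=(N, 2))
y = np.array([f(x) for x in X])

# Next step: run the perceptron learning algorithm (for example, the pla sketch
# above) on (X, y), count its iterations, and compare the final hypothesis g with f.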
The perceptron learning algorithm succeeds in achieving its goal: finding a hypothesis that classifies all the points in the data set D = {(x1, y1), ..., (xN, yN)} correctly. Does this mean that this hypothesis will also be successful in classifying new data points that are not in D? This turns out to be the key question in the theory of learning, a question that will be thoroughly examined in this book.
[Figure: the learning approach to classifying coins. (a) Coin data plotted by size; (b) a classifier learned from the data.]
[Figure: the design approach. (a) A probabilistic model of the coin data; (b) the classifier inferred from the model.]
2 This is called Bayes optimal decision theory. Some learning models are based on the
same theory by estimating the probability from data.
Exercise 1.5
Which of the following problems are more suited for the learning approach and which are more suited for the design approach?
(a) Determining the age at which a particular medical test should be performed
(b) Classifying numbers into primes and non-primes
(c) Detecting potential fraud in credit card charges
(d) Determining the time it would take a falling object to hit the ground
(e) Determining the optimal cycle for traffic lights in a busy intersection
1.2 Types of Learning
The basic premise of learning from data is the use of a set of observations to
uncover an underlying process. It is a very broad premise, and difficult to fit
into a single framework. As a result, different learning paradigms have arisen
to deal with different situations and different assumptions. In this section, we
introduce some of these paradigms.
The learning paradigm that we have discussed so far is called supervised
learning. It is the most studied and most utilized type of learning, but it is
not the only one. Some variations of supervised learning are simple enough
to be accommodated within the same framework. Other variations are more
profound and lead to new concepts and techniques that take on lives of their
own. The most important variations have to do with the nature of the data
set.