AI Unit V
Learning
An agent is learning if it improves its performance on future tasks after making
observations about the world.
Forms of Learning:
Any component of an agent can be improved by learning from data. The improvements, and
the techniques used to make them, depend on four major factors:
• Which component is to be improved?
• What prior knowledge does the agent already have?
• What representation is used for the data and the component?
• What feedback is available to learn from?
Components to be learned:
The components of these agents include:
1. A direct mapping from conditions on the current state to actions.
2. A means to infer relevant properties of the world from the percept sequence.
3. Information about the way the world evolves and about the results of possible actions the
agent can take.
4. Utility information indicating the desirability of world states.
5. Action-value information indicating the desirability of actions.
6. Goals that describe classes of states whose achievement maximizes the agent’s utility.
Forms of Learning or Feedback to learn from:
There are three types of feedback that determine the three main types of learning:
1) In unsupervised learning the agent learns patterns in the input even though no explicit
feedback is supplied. The most common unsupervised learning task is clustering: detecting
potentially useful clusters of input examples. For example, a taxi agent might gradually develop
a concept of “good traffic days” and “bad traffic days” without ever being given labeled
examples of each by a teacher.
2) In supervised learning the agent observes some example input–output pairs and learns a
function that maps from input to output. In component 1 above, the inputs are percepts and the
outputs are provided by a teacher who says “Brake!” or “Turn left.” In component 2, the inputs
are camera images and the outputs again come from a teacher who says “that’s a bus.” In 3, the
theory of braking is a function from states and braking actions to stopping distance in feet. In
this case the output value is available directly from the agent’s percepts (after the fact); the
environment is the teacher.
3) In semi-supervised learning we are given a few labeled examples and must make what we
can of a large collection of unlabelled examples. Even the labels themselves may not be the
oracular truths that we hope for. Imagine that you are trying to build a system to guess a
person’s age from a photo. You gather some labeled examples by snapping pictures of people
and asking their age.
Inductive Learning:
An algorithm for deterministic supervised learning is given as input the correct value of the
unknown function for particular inputs and must try to recover the unknown function or
something close to it. More formally, we say that an example is a pair (x, f(x)), where x is
the input and f(x) is the output of the function applied to x. The task of pure inductive
inference (or induction) is this:
Given a collection of examples of f, return a function h that approximates f.
The function h is called a hypothesis. The reason that learning is difficult, from a conceptual
point of view, is that it is not easy to tell whether any particular h is a good approximation of
f. A good hypothesis will generalize well; that is, it will predict unseen examples correctly. This
is the fundamental problem of induction.
Example: Figure 18.1 shows a familiar example: fitting a function of a single variable to some
data points. The examples are (x, f(x)) pairs, where both x and f(x) are real numbers. We
choose the hypothesis space H (the set of hypotheses we will consider) to be the set of
polynomials of degree at most k, such as ax + b, ax^2 + bx + c, and so on. Figure 18.1(a) shows some
data with an exact fit by a straight line (a polynomial of degree 1). The line is
called a consistent hypothesis because it agrees with all the data. Figure 18.1(b) shows a high-
degree polynomial that is also consistent with the same data.
This illustrates the first issue in inductive learning: how do we choose from among multiple
consistent hypotheses?
Ockham’s razor: One answer is Ockham's razor: prefer the simplest hypothesis consistent with
the data. Intuitively, this makes sense, because hypotheses that are no simpler than the data
themselves are failing to extract any pattern from the data. Defining simplicity is not easy, but
it seems reasonable to say that a degree-1 polynomial is simpler than a degree-12 polynomial.
Figure 18.1(c) shows a second data set. There is no consistent straight line for this data set; in
fact, it requires a degree-6 polynomial for an exact fit. There are just 7 data points, so a
polynomial with 7 parameters does not seem to be finding any pattern in the data and we do
not expect it to generalize well. A straight line that is not consistent with any of the data points,
but might generalize fairly well for unseen values of x, is also shown in (c). In general, there
is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses
that may generalize better. In Figure 18.1(d) we expand the hypothesis space H to allow
polynomials over both x and sin(x), and find that the data in (c) can be fitted exactly by a simple
function of the form ax + b + c sin(x).
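To make this tradeoff concrete, here is a minimal sketch (using NumPy, which the notes do not otherwise assume) that fits a degree-1 and a degree-6 polynomial to seven invented data points and compares training error with error on held-out points.

```python
import numpy as np

# Seven invented (x, f(x)) training points and a few held-out test points.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 6, 7)
y_train = 0.5 * x_train + np.sin(x_train) + rng.normal(0, 0.1, 7)
x_test = np.linspace(0.5, 5.5, 6)
y_test = 0.5 * x_test + np.sin(x_test)

for degree in (1, 6):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

The high-degree fit typically matches the training points almost exactly yet does worse on the held-out points, which is the overfitting behaviour the text describes.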
We say that a learning problem is realizable if the hypothesis space contains the true function.
Unfortunately, we cannot always tell whether a given learning problem is realizable, because
the true function is not known.
Supervised learning can be done by choosing the hypothesis h* that is most probable given
the data:
h* = argmax over h in H of P(h | data) = argmax over h in H of P(data | h) P(h)
(by Bayes' rule, dropping the constant P(data)).
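As a toy illustration of choosing h* = argmax P(data | h) P(h), the sketch below scores three hypothetical hypotheses about a coin against a few invented observations; the priors and likelihoods are made up purely for illustration.

```python
import math

# Hypothetical hypotheses: each predicts the probability that a coin lands heads.
priors = {"fair": 0.6, "biased_heads": 0.2, "biased_tails": 0.2}
p_heads = {"fair": 0.5, "biased_heads": 0.9, "biased_tails": 0.1}

data = ["H", "H", "T", "H", "H"]  # invented observations

def log_posterior(h):
    # log P(h) + sum of log P(d | h) over the observations
    ll = sum(math.log(p_heads[h] if d == "H" else 1 - p_heads[h]) for d in data)
    return math.log(priors[h]) + ll

h_star = max(priors, key=log_posterior)
print("most probable hypothesis:", h_star)
```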
Example: As an example, we will build a decision tree to decide whether to wait for a table
at a restaurant. The aim here is to learn a definition for the goal predicate WillWait. First we
list the attributes that we will consider as part of the input: Alternate, Bar, Fri/Sat, Hungry,
Patrons, Price, Raining, Reservation, Type, and WaitEstimate.
For a Boolean decision tree, each example consists of an (x, y) pair, where x is a vector of values for the input
attributes and y is a single Boolean output value. A training set of 12 examples is shown in
Figure 18.3. The positive examples are the ones in which the goal WillWait is true (x1, x3, . .
.); the negative examples are the ones in which it is false (x2, x5, . . .).
Expressiveness of decision tree:
A Boolean decision tree is logically equivalent to the assertion that the goal attribute is true if
and only if the input attributes satisfy one of the paths leading to a leaf with value true. Writing
this out in propositional logic, we have
Goal ⇔ (Path1 ∨ Path2 ∨ · · ·),
where each Path is a conjunction of attribute-value tests along a path from the root to a true leaf.
Choosing the best attribute: The greedy search used in decision tree learning is designed to
approximately minimize the depth of the final tree. The idea is to pick the attribute that goes
as far as possible toward providing an exact classification of the examples. A perfect attribute
divides the examples into sets each of which is all positive or all negative, and these sets will
thus become leaves of the tree.
We use entropy, the fundamental quantity in information theory, for choosing the best attribute.
Entropy is a measure of the uncertainty of a random variable; acquisition of information
corresponds to a reduction in entropy. For a random variable V with values vk, each having
probability P(vk), the entropy is
H(V) = −Σk P(vk) log2 P(vk).
Information gain measures how much an attribute test reduces this uncertainty. The information gain from
the attribute test on A is the expected reduction in entropy:
Gain(A) = B(p/(p + n)) − Remainder(A),
where B(q) is the entropy of a Boolean variable that is true with probability q, p and n are the
numbers of positive and negative examples, and Remainder(A) = Σk (pk + nk)/(p + n) · B(pk/(pk + nk))
is the weighted entropy remaining after splitting on each value k of A. The information gain
ranges from 0 to 1. For example, if Gain(Patrons) = 0.541 and Gain(Type) = 0.2, then Patrons
is the best attribute to split on. The decision tree learned by using information gain as the
splitting criterion is shown in Figure 18.6.
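The following is a minimal sketch of these computations, assuming a Boolean goal attribute. The counts in the last line are meant to correspond to the Patrons split of the 12 restaurant examples (2 None, 4 Some, 6 Full), which should reproduce the Gain(Patrons) = 0.541 quoted above.

```python
import math

def entropy(probs):
    """H = -sum p * log2 p, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def boolean_entropy(p, n):
    """Entropy B of a Boolean variable given p positive and n negative examples."""
    total = p + n
    return entropy([p / total, n / total]) if total else 0.0

def information_gain(p, n, subsets):
    """Gain(A) = B(p/(p+n)) - sum_k (p_k+n_k)/(p+n) * B(p_k/(p_k+n_k)).

    `subsets` is a list of (p_k, n_k) counts, one per value of attribute A."""
    remainder = sum((pk + nk) / (p + n) * boolean_entropy(pk, nk) for pk, nk in subsets)
    return boolean_entropy(p, n) - remainder

# Split of 6 positive / 6 negative examples into three attribute values.
print(information_gain(6, 6, [(0, 2), (4, 0), (2, 4)]))   # about 0.541
```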
The decision tree learning algorithm: The learning algorithm to construct a decision tree is
shown in Figure 18.5.
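Figure 18.5 gives the book's pseudocode; the following is only a rough Python sketch of the same recursive scheme (choose the attribute with the highest gain, split, and recurse), using its own made-up representation of examples as dictionaries and a tiny invented training set.

```python
from collections import Counter
import math

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(attr, examples, labels):
    total = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [l for e, l in zip(examples, labels) if e[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def decision_tree_learning(examples, labels, attributes, parent_labels=None):
    if not examples:
        return Counter(parent_labels).most_common(1)[0][0]   # plurality of parent examples
    if len(set(labels)) == 1:
        return labels[0]                                     # all examples classified the same
    if not attributes:
        return Counter(labels).most_common(1)[0][0]          # no attributes left: plurality
    best = max(attributes, key=lambda a: info_gain(a, examples, labels))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        idx = [i for i, e in enumerate(examples) if e[best] == value]
        sub_ex = [examples[i] for i in idx]
        sub_lab = [labels[i] for i in idx]
        rest = [a for a in attributes if a != best]
        tree[best][value] = decision_tree_learning(sub_ex, sub_lab, rest, labels)
    return tree

# Tiny made-up training set: predict WillWait from two attributes.
examples = [{"Patrons": "Some", "Hungry": "Yes"},
            {"Patrons": "Full", "Hungry": "No"},
            {"Patrons": "None", "Hungry": "No"},
            {"Patrons": "Full", "Hungry": "Yes"}]
labels = ["Yes", "No", "No", "Yes"]
print(decision_tree_learning(examples, labels, ["Patrons", "Hungry"]))
```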
Learning Curve: We can evaluate the accuracy of a learning algorithm with a learning curve,
as shown in Figure 18.7. We have 100 examples at our disposal, which we split into a training
set and a test set. We learn a hypothesis h with the training set and measure its accuracy with
the test set. We do this starting with a training set of size 1 and increasing one at a time up to
size 99. For each size we repeat the process of randomly splitting 20 times and average
the results of the 20 trials. The curve shows that as the training set size grows, the accuracy
increases. (For this reason, learning curves are also called happy graphs.) In this graph we
reach 95% accuracy, and it looks like the curve might continue to increase with more data.
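A minimal sketch of how such a curve could be produced: for each training-set size, repeat a random split several times, train a learner, and average the test accuracy. To keep the sketch self-contained it uses invented one-dimensional data and a simple 1-nearest-neighbor learner rather than the restaurant data and decision trees.

```python
import random

random.seed(1)
# Synthetic stand-in dataset: 100 one-dimensional examples with a noisy threshold rule.
data = [(x, int(x > 0.5) if random.random() < 0.9 else int(x < 0.5))
        for x in [random.random() for _ in range(100)]]

def predict_1nn(train_set, x):
    # Classify x by the label of its single nearest training example.
    return min(train_set, key=lambda ex: abs(ex[0] - x))[1]

def learning_curve(data, trials=20):
    curve = []
    for size in range(1, len(data)):           # training-set sizes 1 .. 99
        accs = []
        for _ in range(trials):                 # average over random splits
            shuffled = random.sample(data, len(data))
            train_set, test_set = shuffled[:size], shuffled[size:]
            correct = sum(predict_1nn(train_set, x) == y for x, y in test_set)
            accs.append(correct / len(test_set))
        curve.append((size, sum(accs) / trials))
    return curve

for size, acc in learning_curve(data)[::20]:    # print every 20th point
    print(f"training size {size:2d}: accuracy {acc:.2f}")
```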
For decision trees, a technique called decision tree pruning combats overfitting. Pruning
works by eliminating nodes that are not clearly relevant. We start with a full tree, as generated
by DECISION-TREE-LEARNING. We then look at a test node that has only leaf nodes as
descendants. If the test appears to be irrelevant (detecting only noise in the data), we
eliminate the test, replacing it with a leaf node. We repeat this process, considering each test
with only leaf descendants, until each one has either been pruned or accepted as is.
ENSEMBLE LEARNING:
The idea of ensemble learning methods is to select a whole collection, or ensemble, of
hypotheses from the hypothesis space and combine their predictions. For example, we might
generate a hundred different decision trees from the same training set and have them vote on
the best classification for a new example. The motivation for ensemble learning is simple.
Consider an ensemble of M = 5 hypotheses and suppose that we combine their predictions
using simple majority voting. For the ensemble to misclassify a new example, at least three of
the five hypotheses have to misclassify it. The hope is that this is much less likely than a
misclassification by a single hypothesis.
Suppose we assume that each hypothesis hi in the ensemble has an error of p; that is, the
probability that a randomly chosen example is misclassified by hi is p. Furthermore, suppose
we assume that the errors made by each hypothesis are independent. In that case, if p is small,
then the probability of a large number of misclassifications occurring is minuscule. For
example, a simple calculation (Exercise 18.14) shows that using an ensemble of five
hypotheses reduces an error rate of 1 in 10 down to an error rate of less than 1 in 100.
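That calculation is easy to reproduce: with five independent hypotheses, each wrong with probability p = 0.1, the majority vote is wrong only when at least three of them are wrong. A quick sketch using the binomial distribution:

```python
from math import comb

def majority_error(p, m):
    """Probability that a majority of m independent hypotheses (error rate p) are wrong."""
    need = m // 2 + 1                      # smallest number of wrong votes that loses the vote
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(need, m + 1))

print(majority_error(0.1, 1))   # single hypothesis: 0.1
print(majority_error(0.1, 5))   # five hypotheses: about 0.0086, i.e. less than 1 in 100
```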
Example:
Figure 18.8 shows how this can result in a more expressive hypothesis space. If the original
hypothesis space allows for a simple and efficient learning algorithm, then the ensemble
method provides a way to learn a much more expressive class of hypotheses without incurring
much additional computational or algorithmic complexity.
The most widely used ensemble method is called boosting. To understand how it works, we
need first to explain the idea of a weighted training set. In such a training set, each example
has an associated weight wj > 0. The higher the weight of an example, the higher is the
importance attached to it during the learning of a hypothesis. It is straightforward to modify
the learning algorithms we have seen so far to operate with weighted training sets.
Boosting starts with wj = 1 for all the examples (i.e., a normal training set). From this set, it
generates the first hypothesis, h1. This hypothesis will classify some of the training examples
correctly and some incorrectly. We would like the next hypothesis to do better on the
misclassified examples, so we increase their weights while decreasing the weights of the
correctly classified examples. From this new weighted training set, we generate hypothesis h2.
The process continues in this way until we have generated M hypotheses, where M is an input
to the boosting algorithm. The final ensemble hypothesis is a weighted-majority combination
of all the M hypotheses, each weighted according to how well it performed on the training set.
Figure 18.9 shows how the algorithm works conceptually.
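A compact sketch of this reweighting loop in the AdaBoost style (it does not reproduce Figure 18.10 verbatim): the weak learner is a one-dimensional decision stump, weights of correctly classified examples are multiplied by error/(1 − error) and renormalized after each round, and each hypothesis gets a vote proportional to log((1 − error)/error). The data are invented.

```python
import math

def stump_learn(xs, ys, weights):
    """Best threshold stump on 1-D data: predicts +1 on one side of a threshold."""
    best = None
    for threshold in xs:
        for sign in (+1, -1):
            preds = [sign if x >= threshold else -sign for x in xs]
            err = sum(w for p, y, w in zip(preds, ys, weights) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    err, threshold, sign = best
    return lambda x: sign if x >= threshold else -sign, err

def adaboost(xs, ys, M):
    n = len(xs)
    weights = [1.0 / n] * n
    hypotheses, z = [], []
    for _ in range(M):
        h, err = stump_learn(xs, ys, weights)
        err = max(err, 1e-9)                      # avoid division by zero
        # Decrease weights of correctly classified examples, then renormalize.
        weights = [w * (err / (1 - err)) if h(x) == y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        hypotheses.append(h)
        z.append(math.log((1 - err) / err))       # hypothesis (vote) weight
    def ensemble(x):
        return 1 if sum(zm * h(x) for zm, h in zip(z, hypotheses)) >= 0 else -1
    return ensemble

# Invented 1-D data that no single stump can classify perfectly.
xs = [1, 2, 3, 4, 5, 6]
ys = [+1, +1, -1, -1, +1, +1]
h = adaboost(xs, ys, M=5)
print([h(x) for x in xs])
```

With a handful of rounds the weighted vote can fit patterns that no single stump can, and increasing M generally drives the training error down, as the text describes next.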
The boosting algorithm is shown in Figure 18.10.
Example: Let us see how well boosting does on the restaurant data. We will choose as our
original hypothesis space the class of decision stumps, which are decision trees with just one
test at the root. The lower curve in Figure 18.11(a) shows that unboosted decision stumps are
not very effective for this data set, reaching a prediction performance of only 81% on 100
training examples. When boosting is applied (with M = 5), the performance is better, reaching
93% after 100 examples.
An interesting thing happens as the ensemble size M increases. Figure 18.11(b) shows the
training set performance (on 100 examples) as a function of M. Notice that the error reaches
zero (as the boosting theorem tells us) when M is 20; that is, a weighted-majority combination
of 20 decision stumps suffices to fit the 100 examples exactly. As more stumps are added to
the ensemble, the error remains at zero. The graph also shows that the test set performance
continues to increase long after the training set error has reached zero. At M = 20, the test
performance is 0.95 (or 0.05 error), and the performance increases to 0.98 as late as M = 137,
before gradually dropping to 0.95.
Learning decision lists:
A decision list is a logical expression of a restricted form. It consists of a series of tests, each
of which is a conjunction of literals. If a test succeeds when applied to an example description,
the decision list specifies the value to be returned. If the test fails, processing continues with
the next test in the list. Decision lists resemble decision trees, but their overall structure is
simpler. In contrast, the individual tests are more complex. Figure 18.13 shows a decision list
that represents the following hypothesis:
WillWait ⇔ (Patrons = Some) ∨ (Patrons = Full ∧ Fri/Sat).
It would seem reasonable to prefer small tests that match large sets of uniformly classified
examples, so that the overall decision list will be as compact as possible. The simplest strategy
is to find the smallest test t that matches any uniformly classified subset, regardless of the size
of the subset. Even this approach works quite well, as Figure 18.15 suggests.
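A decision list can be stored simply as an ordered list of (test, outcome) pairs and evaluated top to bottom. The sketch below does that, together with a naive version of the greedy strategy just described (find any small test that matches a uniformly classified subset); the dictionary-based example encoding and the four training examples are my own invention.

```python
from itertools import combinations

def matches(test, example):
    return all(example.get(a) == v for a, v in test)

def evaluate(decision_list, example, default=False):
    """Return the outcome of the first test whose literals all hold in the example."""
    for literals, outcome in decision_list:
        if matches(literals, example):
            return outcome
    return default

def learn_decision_list(examples, labels, max_literals=2):
    """Greedy sketch: repeatedly find a small test matching a uniformly labeled subset."""
    remaining = list(zip(examples, labels))
    candidates = sorted({(a, v) for e in examples for a, v in e.items()})
    dlist = []
    while remaining:
        found = None
        for size in range(1, max_literals + 1):
            for test in combinations(candidates, size):
                matched = [(e, y) for e, y in remaining if matches(test, e)]
                if matched and len({y for _, y in matched}) == 1:
                    found = (test, matched)
                    break
            if found:
                break
        if not found:
            break                     # no suitable test; a real learner would report failure
        test, matched = found
        dlist.append((list(test), matched[0][1]))
        remaining = [(e, y) for e, y in remaining if (e, y) not in matched]
    return dlist

# Tiny invented training set loosely in the spirit of the WillWait hypothesis above.
examples = [{"Patrons": "Some"}, {"Patrons": "Full", "FriSat": "Yes"},
            {"Patrons": "Full", "FriSat": "No"}, {"Patrons": "None"}]
labels = [True, True, False, False]
dl = learn_decision_list(examples, labels)
print(dl)
print([evaluate(dl, e) for e in examples])
```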
Statistical Learning: Instance-Based Learning:
In contrast to parametric learning, nonparametric learning methods allow the hypothesis
complexity to grow with the data. The more data we have, the wigglier the hypothesis can be.
We will look at two very simple families of nonparametric instance-based learning (or
memory-based learning) methods, so called because they construct hypotheses directly from
the training instances themselves.
Nearest-neighbor models:
The key idea of nearest-neighbor models is that the properties of any particular input point x
are likely to be similar to those of points in the neighborhood of x. For example, if we want to
do density estimation, that is, estimate the value of an unknown probability density at x, then
we can simply measure the density with which points are scattered in the neighborhood of x.
This sounds very simple, until we realize that we need to specify exactly what we mean by
"neighborhood." If the neighborhood is too small, it won't contain any data points; too large,
and it may include all the data points, resulting in a density estimate that is the same
everywhere. One solution is to define the neighborhood to be just big enough to include k
points, where k is large enough to ensure a meaningful estimate. For fixed k, the size of the
neighborhood varies: where data are sparse, the neighborhood is large, but where data are
dense, the neighborhood is small.
Example: Figure 20.12(a) shows an example for data scattered in two dimensions. Figure
20.13 shows the results of k-nearest-neighbor density estimation from these data with k = 3,
10, and 40 respectively. For k = 3, the density estimate at any point is based on only 3
neighboring points and is highly variable. For k = 10, the estimate provides a good
reconstruction of the true density shown in Figure 20.12(b). For k = 40, the neighborhood
becomes too large and the structure of the data is altogether lost. In practice, using a value of k
somewhere between 5 and 10 gives good results for most low-dimensional data sets. A good
value of k can also be chosen by using cross-validation.
To identify the nearest neighbors of a query point, we need a distance metric, D(x1, x2). It is
also possible to use the nearest-neighbor idea for direct supervised learning. Given a test
example with input x, the output y = h(x) is obtained from the y-values of the k nearest
neighbors of x. In the discrete case, we can obtain a single prediction by majority vote. In the
continuous case, we can average the k values or do local linear regression, fitting a hyperplane
to the k points and predicting the value at x according to the hyperplane.
Advantage: The k-nearest-neighbor learning algorithm is very simple to implement, requires
little in the way of tuning, and often performs quite well.
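A minimal k-nearest-neighbor classifier along the lines just described, using Euclidean distance and majority voting; the two-dimensional points are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Invented 2-D training data: two loose clusters labeled "A" and "B".
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.3), "A"),
         ((4.0, 4.2), "B"), ((4.3, 3.9), "B"), ((3.8, 4.1), "B")]

print(knn_predict(train, (1.1, 1.0), k=3))   # expected "A"
print(knn_predict(train, (4.1, 4.0), k=3))   # expected "B"
```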
Neural Networks:
A neuron is a cell in the brain whose principal function is the collection, processing, and
dissemination of electrical signals. The brain's information-processing capacity is thought to
emerge primarily from networks of such neurons. For this reason, some of the earliest AI work
aimed to create artificial neural networks. (Other names for the field include connectionism,
parallel distributed processing, and neural computation.) Figure 20.15 shows a simple
mathematical model of the neuron devised by McCulloch and Pitts (1943). Roughly speaking,
it "fires" when a linear combination of its inputs exceeds some threshold.
Notice that we have included a bias weight W0,i connected to a fixed input a0 = −1. The
activation function g is designed to meet two criteria. First, we want the unit to be "active" (near
+1) when the "right" inputs are given, and "inactive" (near 0) when the "wrong" inputs are
given. Second, the activation needs to be nonlinear; otherwise the entire neural network
collapses into a simple linear function. Two choices for g are shown in Figure 20.16: the
threshold function and the sigmoid function (also known as the logistic function). The
sigmoid function has the advantage of being differentiable, which we will see later is important
for the weight-learning algorithm. Notice that both functions have a threshold (either hard or
soft) at zero; the bias weight W0,i sets the actual threshold for the unit, in the sense that the unit
is activated when the weighted sum of the "real" inputs exceeds W0,i.
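A small sketch of this unit model: a weighted sum that includes the bias weight W0 attached to a fixed input a0 = −1, passed through either the hard threshold or the sigmoid. The particular weights and inputs are arbitrary illustrative values.

```python
import math

def threshold(z):
    return 1 if z > 0 else 0

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def unit_output(weights, inputs, g, bias_weight):
    # in_i = sum_j W_j,i * a_j, with a_0 = -1 carrying the bias weight W_0,i
    z = bias_weight * (-1) + sum(w * a for w, a in zip(weights, inputs))
    return g(z)

inputs = [1.0, 0.0]                     # example input activations
weights = [0.6, 0.6]                    # example connection weights
print(unit_output(weights, inputs, threshold, bias_weight=0.5))   # fires, since 0.6 > 0.5
print(unit_output(weights, inputs, sigmoid, bias_weight=0.5))     # soft version, about 0.52
```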
Network Structures: There are two main categories of neural network structures: acyclic or
feed-forward networks and cyclic or recurrent networks. A feed-forward network
represents a function of its current input; thus, it has no internal state other than the weights
themselves. A recurrent network, on the other hand, feeds its outputs back into its own inputs.
This means that the activation levels of the network form a dynamical system that may reach a
stable state or exhibit oscillations or even chaotic behaviour.
Feed-Forward Network: Let us look more closely into the assertion that a feed-forward
network represents a function of its inputs. Consider the simple network shown in Figure 20.18,
which has two input units, two hidden units, and an output unit. (To keep things simple, we
have omitted the bias units in this example.) Given an input vector x = (x1, x2), the activations
of the input units are set to (a1, a2) = (x1, x2) and the network computes
a5 = g(W3,5 a3 + W4,5 a4) = g(W3,5 g(W1,3 a1 + W2,3 a2) + W4,5 g(W1,4 a1 + W2,4 a2)).
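That nested expression can be evaluated directly. Below is a tiny forward-pass sketch for this 2-input, 2-hidden-unit, 1-output network, with arbitrary example weights; W[j][i] stands for the weight on the link from unit j to unit i.

```python
import math

def g(z):                      # sigmoid activation
    return 1 / (1 + math.exp(-z))

# Arbitrary example weights, indexed W[from_unit][to_unit].
W = {1: {3: 0.4, 4: -0.2},
     2: {3: 0.7, 4: 0.9},
     3: {5: 1.5},
     4: {5: -1.1}}

def forward(x1, x2):
    a1, a2 = x1, x2                                  # input activations
    a3 = g(W[1][3] * a1 + W[2][3] * a2)              # hidden unit 3
    a4 = g(W[1][4] * a1 + W[2][4] * a2)              # hidden unit 4
    a5 = g(W[3][5] * a3 + W[4][5] * a4)              # output unit 5
    return a5

print(forward(1.0, 0.0))
```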
A neural network can be used for classification or regression. For Boolean classification with
continuous outputs (e.g., with sigmoid units), it is traditional to have a single output unit, with
a value over 0.5 interpreted as one class and a value below 0.5 as the other. For k-way
classification, one could divide the single output unit's range into k portions, but it is more
common to have k separate output units, with the value of each one representing the relative
likelihood of that class given the current input.
Single layer feed-forward neural networks (perceptrons): A network with all the inputs
connected directly to the outputs is called a single-layer neural network, or a perceptron
network. Since each output unit is independent of the others (each weight affects only one of
the outputs), we can limit our study to perceptrons with a single output unit, as shown in
Figure 20.19(a).
The threshold perceptron returns 1 if and only if the weighted sum of its inputs (including the
bias) is positive, that is, if and only if W · x > 0.
Now, the equation W · x = 0 defines a hyperplane in the input space, so the perceptron returns
1 if and only if the input is on one side of that hyperplane. For this reason, the threshold
perceptron is called a linear separator. Figure 18.21(a) and (b) show
this hyperplane (a line, in two dimensions) for the perceptron representations of the AND and
OR functions of two inputs. The perceptron can represent these functions because there is some
line that separates all the white dots from all the black dots. Such functions are called linearly
separable.
Figure 18.21(c) shows an example of a function that is not linearly separable: the XOR function.
Clearly, there is no way for a threshold perceptron to learn this function. In general, threshold
perceptrons can represent only linearly separable functions.
A gradient descent learning algorithm is used for perceptron learning. The perceptron
learning algorithm is outlined in Figure 20.21.
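A sketch of the classic perceptron weight-update rule Wj ← Wj + α (y − h(x)) xj (the gradient-descent-style rule referred to above, not a transcription of Figure 20.21), trained on the linearly separable OR function; the learning rate and epoch count are arbitrary.

```python
def perceptron_output(weights, x):
    # x includes a leading constant 1 so that weights[0] acts as the bias term
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

def train_perceptron(data, alpha=0.1, epochs=25):
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            err = y - perceptron_output(weights, x)
            weights = [w + alpha * err * xi for w, xi in zip(weights, x)]
    return weights

# OR function of two inputs, each example prefixed with the constant input 1.
or_data = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = train_perceptron(or_data)
print(w, [perceptron_output(w, x) for x, _ in or_data])   # should reproduce OR

# XOR, by contrast, is not linearly separable, so no weight vector can fit it exactly.
```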
Figure 20.22 shows the learning curve for a perceptron on two different problems. On the left,
we show the curve for learning the majority function with 11 Boolean inputs (i.e., the function
outputs a 1 if 6 or more inputs are 1). On the right, we have the restaurant example. The WillWait
problem is easily represented as a decision tree, but is not linearly separable. The best plane
through the data correctly classifies only 65%.
Multilayer feed-forward neural networks:
A multilayer feed-forward neural network contains one or more hidden layers. The most common case
involves a single hidden layer, as in Figure 20.24. The advantage of adding hidden layers is that
it enlarges the space of hypotheses that the network can represent. With more hidden units, we
can produce more bumps of different sizes in more places. In fact, with a single, sufficiently
large hidden layer, it is possible to represent any continuous function of the inputs with
arbitrary accuracy; with two layers, even discontinuous functions can be represented.
Unfortunately, for any particular network structure, it is harder to characterize exactly which
functions can be represented and which ones cannot.
Learning algorithms for multilayer networks are similar to the perceptron learning algorithm
shown in Figure 20.21. One minor difference is that we may have several outputs, so we have
an output vector hw(x) rather than a single value, and each example has an output vector y. The
major difference is that, whereas the error y − hw(x) at the output layer is clear, the error at the
hidden layers seems mysterious because the training data do not say what values the hidden
nodes should have. It turns out that we can back-propagate the error from the output layer to
the hidden layers. The back-propagation process emerges directly from a derivation of the
overall error gradient. First, we will describe the process with an intuitive justification; then,
we will show the derivation.
For the mathematically inclined, we will now derive the back-propagation equations from first
principles. The squared error on a single example is defined as
E = (1/2) Σk (yk − ak)²,
where the sum is over the output nodes k, yk is the target value, and ak is the network's output.
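For concreteness, here is a compact sketch of a few back-propagation steps on a 2-input, 2-hidden, 1-output sigmoid network minimizing this squared error: the output delta is (y − a5) · g'(in5), the hidden deltas are obtained by passing it back through the outgoing weights, and every weight is nudged by α · aj · Δi. The initial weights, the single training example, and the learning rate are arbitrary.

```python
import math

def g(z):
    return 1 / (1 + math.exp(-z))

# Arbitrary initial weights W[(j, i)] for a 2-input, 2-hidden (units 3, 4), 1-output (unit 5) net.
W = {(1, 3): 0.3, (2, 3): -0.1, (1, 4): 0.2, (2, 4): 0.4, (3, 5): 0.5, (4, 5): -0.3}
alpha = 0.5                                   # learning rate (arbitrary)
x1, x2, y = 1.0, 0.0, 1.0                     # one training example (arbitrary)

for step in range(3):                         # a few gradient steps
    # Forward pass
    a1, a2 = x1, x2
    a3 = g(W[(1, 3)] * a1 + W[(2, 3)] * a2)
    a4 = g(W[(1, 4)] * a1 + W[(2, 4)] * a2)
    a5 = g(W[(3, 5)] * a3 + W[(4, 5)] * a4)
    # Backward pass: output delta, then hidden deltas via the outgoing weights
    d5 = (y - a5) * a5 * (1 - a5)
    d3 = a3 * (1 - a3) * W[(3, 5)] * d5
    d4 = a4 * (1 - a4) * W[(4, 5)] * d5
    # Weight updates: W[(j, i)] <- W[(j, i)] + alpha * a_j * delta_i
    for (j, i), a_j, d_i in [((1, 3), a1, d3), ((2, 3), a2, d3),
                             ((1, 4), a1, d4), ((2, 4), a2, d4),
                             ((3, 5), a3, d5), ((4, 5), a4, d5)]:
        W[(j, i)] += alpha * a_j * d_i
    print(f"step {step}: output {a5:.3f}, squared error {0.5 * (y - a5) ** 2:.4f}")
```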
Figure 20.26 shows how a network with a single hidden layer performs on the restaurant problem. In Figure
20.26, we show two curves. The first is a training curve, which shows the mean squared error
on a given training set of 100 restaurant examples during the weight-updating process. This
demonstrates that the network does indeed converge to a perfect fit to the training data. The
second curve is the standard learning curve for the restaurant data. The neural network does
learn well, although not quite as fast as decision-tree learning; this is perhaps not surprising,
because the data were generated from a simple decision tree in the first place.
Advantage: Neural networks are, of course, capable of far more complex learning tasks,
although it must be said that a certain amount of twiddling is needed to get the network structure
right and to achieve convergence to something close to the global optimum in weight space.
Learning neural network structures: We also need to understand how to find the best
network structure. If we choose a network that is too big, it will be able to memorize all the
examples by forming a large lookup table, but will not necessarily generalize well to inputs
that have not been seen before. In other words, like all statistical models, neural networks are
subject to overfitting when there are too many parameters in the model.
If we stick to fully connected networks, the only choices to be made concern the number of
hidden layers and their sizes. The usual approach is to try several and keep the best. The cross-
validation techniques of Chapter 18 are needed if we are to avoid peeking at the test set. That
is, we choose the network architecture that gives the highest prediction accuracy on the
validation sets.
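A schematic sketch of this "try several sizes and keep the best" procedure using k-fold cross-validation. The training routine is a placeholder argument (a hypothetical train_fn), standing in for whatever network implementation is available; only the selection logic is shown.

```python
def cross_validation_score(train_fn, data, hidden_size, folds=5):
    """Average validation accuracy of a network with `hidden_size` hidden units."""
    fold_len = len(data) // folds
    scores = []
    for i in range(folds):
        val = data[i * fold_len:(i + 1) * fold_len]
        train = data[:i * fold_len] + data[(i + 1) * fold_len:]
        model = train_fn(train, hidden_size)          # hypothetical training routine
        correct = sum(model(x) == y for x, y in val)
        scores.append(correct / len(val))
    return sum(scores) / folds

def choose_architecture(train_fn, data, candidate_sizes=(2, 4, 8, 16)):
    # Keep the hidden-layer size with the best cross-validated accuracy.
    return max(candidate_sizes,
               key=lambda h: cross_validation_score(train_fn, data, h))

# Tiny demo with a dummy "train" function that ignores the data (purely illustrative).
if __name__ == "__main__":
    dummy_data = [((i,), i % 2) for i in range(20)]
    dummy_train = lambda train, hidden_size: (lambda x: x[0] % 2)
    print(choose_architecture(dummy_train, dummy_data))
```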
If we want to consider networks that are not fully connected, then we need to find some
effective search method through the very large space of possible connection topologies. The
optimal brain damage algorithm begins with a fully connected network and removes
connections from it. After the network is trained for the first time, an information-theoretic
approach identifies an optimal selection of connections that can be dropped. The network is
then retrained, and if its performance has not decreased then the process is repeated. In addition
to removing connections, it is also possible to remove units that are not contributing much to
the result.
Several algorithms have been proposed for growing a larger network from a smaller one. One,
the tiling algorithm, resembles decision-list learning. The idea is to start with a single unit that
does its best to produce the correct output on as many of the training examples as possible.
Subsequent units are added to take care of the examples that the first unit got wrong. The
algorithm adds only as many units as are needed to cover all the examples.