AI Unit V
Learning
An agent is learning if it improves its performance on future tasks after making
observations about the world.
Forms of Learning:
Any component of an agent can be improved by learning from data. The improvements, and
the techniques used to make them, depend on four major factors:
• Which component is to be improved?
• What prior knowledge does the agent already have?
• What representation is used for the data and the component?
• What feedback is available to learn from?
Components to be learned:
The components of these agents include:
1. A direct mapping from conditions on the current state to actions.
2. A means to infer relevant properties of the world from the percept sequence.
3. Information about the way the world evolves and about the results of possible actions the
agent can take.
4. Utility information indicating the desirability of world states.
5. Action-value information indicating the desirability of actions.
6. Goals that describe classes of states whose achievement maximizes the agent’s utility.
Forms of Learning or Feedback to learn from:
There are three types of feedback that determine the three main types of learning:
1) In unsupervised learning the agent learns patterns in the input even though no explicit
feedback is supplied. The most common unsupervised learning task is clustering: detecting
potentially useful clusters of input examples. For example, a taxi agent might gradually develop
a concept of “good traffic days” and “bad traffic days” without ever being given labeled
examples of each by a teacher.
2) In supervised learning the agent observes some example input–output pairs and learns a
function that maps from input to output. In component 1 above, the inputs are percepts and the
outputs are provided by a teacher who says “Brake!” or “Turn left.” In component 2, the inputs
are camera images and the outputs again come from a teacher who says “that’s a bus.” In 3, the
theory of braking is a function from states and braking actions to stopping distance in feet. In
this case the output value is available directly from the agent’s percepts (after the fact); the
environment is the teacher.
3) In semi-supervised learning we are given a few labeled examples and must make what we
can of a large collection of unlabelled examples. Even the labels themselves may not be the
oracular truths that we hope for. Imagine that you are trying to build a system to guess a
person’s age from a photo. You gather some labeled examples by snapping pictures of people
and asking their age.
Inductive Learning:
An algorithm for deterministic supervised learning is given as input the correct value of the
unknown function for particular inputs and must try to recover the unknown function or
something close to it. More formally, we say that an example is a pair (x, f(x)), where x is
the input and f(x) is the output of the function applied to x. The task of pure inductive
inference (or induction) is this:
Given a collection of examples of f, return a function h that approximates f.
The function h is called a hypothesis. The reason that learning is difficult, from a conceptual
point of view, is that it is not easy to tell whether any particular h is a good approximation of
f. A good hypothesis will generalize well; that is, it will predict unseen examples correctly. This
is the fundamental problem of induction.
Example: Figure 18.1 shows a familiar example: fitting a function of a single variable to some
data points. The examples are (x, f(x)) pairs, where both x and f(x) are real numbers. We
choose the hypothesis space H (the set of hypotheses we will consider) to be the set of
polynomials of degree at most k, such as ax + b, ax^2 + bx + c, and so on. Figure 18.1(a) shows some
data with an exact fit by a straight line (a polynomial of degree 1). The line is
called a consistent hypothesis because it agrees with all the data. Figure 18.1(b) shows a high-
degree polynomial that is also consistent with the same data.
This illustrates the first issue in inductive learning: how do we choose from among multiple
consistent hypotheses?
Ockham’s razor: One answer is Ockham's razor: prefer the simplest hypothesis consistent with
the data. Intuitively, this makes sense, because hypotheses that are no simpler than the data
themselves are failing to extract any pattern from the data. Defining simplicity is not easy, but
it seems reasonable to say that a degree-1 polynomial is simpler than a degree-12 polynomial.
Figure 18.1(c) shows a second data set. There is no consistent straight line for this data set; in
fact, it requires a degree-6 polynomial for an exact fit. There are just 7 data points, so a
polynomial with 7 parameters does not seem to be finding any pattern in the data and we do
not expect it to generalize well. A straight line that is not consistent with any of the data points,
but might generalize fairly well for unseen values of x, is also shown in (c). In general, there
is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses
that may generalize better. In Figure 18.1(d) we expand the hypothesis space H to allow
polynomials over both x and sin(x), and find that the data in (c) can be fitted exactly by a simple
function of the form ax + b + c sin(x).
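To make this tradeoff concrete, here is a minimal sketch (using NumPy, which the notes do not otherwise assume) that fits a degree-1 and a degree-6 polynomial to seven invented data points and compares training error with error on held-out points.

```python
import numpy as np

# Seven invented (x, f(x)) training points and a few held-out test points.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 6, 7)
y_train = 0.5 * x_train + np.sin(x_train) + rng.normal(0, 0.1, 7)
x_test = np.linspace(0.5, 5.5, 6)
y_test = 0.5 * x_test + np.sin(x_test)

for degree in (1, 6):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

The high-degree fit typically matches the training points almost exactly yet does worse on the held-out points, which is the overfitting behaviour the text describes.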
We say that a learning problem is realizable if the hypothesis space contains the true function.
Unfortunately, we cannot always tell whether a given learning problem is realizable, because
the true function is not known.
Supervised learning can be done by choosing the hypothesis h* that is most probable given
the data:
h* = argmax over h in H of P(h | data) = argmax over h in H of P(data | h) P(h)
(by Bayes' rule, dropping the constant P(data)).
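As a toy illustration of choosing h* = argmax P(data | h) P(h), the sketch below scores three hypothetical hypotheses about a coin against a few invented observations; the priors and likelihoods are made up purely for illustration.

```python
import math

# Hypothetical hypotheses: each predicts the probability that a coin lands heads.
priors = {"fair": 0.6, "biased_heads": 0.2, "biased_tails": 0.2}
p_heads = {"fair": 0.5, "biased_heads": 0.9, "biased_tails": 0.1}

data = ["H", "H", "T", "H", "H"]  # invented observations

def log_posterior(h):
    # log P(h) + sum of log P(d | h) over the observations
    ll = sum(math.log(p_heads[h] if d == "H" else 1 - p_heads[h]) for d in data)
    return math.log(priors[h]) + ll

h_star = max(priors, key=log_posterior)
print("most probable hypothesis:", h_star)
```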
Example: As an example, we will build a decision tree to decide whether to wait for a table
at a restaurant. The aim here is to learn a definition for the goal predicate WillWait. First we
list the attributes that we will consider as part of the input: Alternate, Bar, Fri/Sat, Hungry,
Patrons, Price, Raining, Reservation, Type, and WaitEstimate.
For a Boolean decision tree, each example consists of an (x, y) pair, where x is a vector of values for the input
attributes and y is a single Boolean output value. A training set of 12 examples is shown in
Figure 18.3. The positive examples are the ones in which the goal WillWait is true (x1, x3, . .
.); the negative examples are the ones in which it is false (x2, x5, . . .).
Expressiveness of decision tree:
A Boolean decision tree is logically equivalent to the assertion that the goal attribute is true if
and only if the input attributes satisfy one of the paths leading to a leaf with value true. Writing
this out in propositional logic, we have
Goal ⇔ (Path1 ∨ Path2 ∨ · · ·),
where each Path is a conjunction of attribute-value tests along a path from the root to a true leaf.
Choosing the best attribute: The greedy search used in decision tree learning is designed to
approximately minimize the depth of the final tree. The idea is to pick the attribute that goes
as far as possible toward providing an exact classification of the examples. A perfect attribute
divides the examples into sets each of which is all positive or all negative, and these sets will
thus become leaves of the tree.
We use entropy, the fundamental quantity in information theory, for choosing the best attribute.
Entropy is a measure of the uncertainty of a random variable; acquisition of information
corresponds to a reduction in entropy. For a random variable V with values vk, each having
probability P(vk), the entropy is
H(V) = −Σk P(vk) log2 P(vk).
Information gain measures how much an attribute test reduces this uncertainty. The information gain from
the attribute test on A is the expected reduction in entropy:
Gain(A) = B(p/(p + n)) − Remainder(A),
where B(q) is the entropy of a Boolean variable that is true with probability q, p and n are the
numbers of positive and negative examples, and Remainder(A) = Σk (pk + nk)/(p + n) · B(pk/(pk + nk))
is the weighted entropy remaining after splitting on each value k of A. The information gain
ranges from 0 to 1. For example, if Gain(Patrons) = 0.541 and Gain(Type) = 0.2, then Patrons
is the best attribute to split on. The decision tree learned by using information gain as the
splitting criterion is shown in Figure 18.6.
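The following is a minimal sketch of these computations, assuming a Boolean goal attribute. The counts in the last line are meant to correspond to the Patrons split of the 12 restaurant examples (2 None, 4 Some, 6 Full), which should reproduce the Gain(Patrons) = 0.541 quoted above.

```python
import math

def entropy(probs):
    """H = -sum p * log2 p, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def boolean_entropy(p, n):
    """Entropy B of a Boolean variable given p positive and n negative examples."""
    total = p + n
    return entropy([p / total, n / total]) if total else 0.0

def information_gain(p, n, subsets):
    """Gain(A) = B(p/(p+n)) - sum_k (p_k+n_k)/(p+n) * B(p_k/(p_k+n_k)).

    `subsets` is a list of (p_k, n_k) counts, one per value of attribute A."""
    remainder = sum((pk + nk) / (p + n) * boolean_entropy(pk, nk) for pk, nk in subsets)
    return boolean_entropy(p, n) - remainder

# Split of 6 positive / 6 negative examples into three attribute values.
print(information_gain(6, 6, [(0, 2), (4, 0), (2, 4)]))   # about 0.541
```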
The decision tree learning algorithm: The learning algorithm to construct a decision tree is
shown in Figure 18.5.
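Figure 18.5 gives the book's pseudocode; the following is only a rough Python sketch of the same recursive scheme (choose the attribute with the highest gain, split, and recurse), using its own made-up representation of examples as dictionaries and a tiny invented training set.

```python
from collections import Counter
import math

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(attr, examples, labels):
    total = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [l for e, l in zip(examples, labels) if e[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def decision_tree_learning(examples, labels, attributes, parent_labels=None):
    if not examples:
        return Counter(parent_labels).most_common(1)[0][0]   # plurality of parent examples
    if len(set(labels)) == 1:
        return labels[0]                                     # all examples classified the same
    if not attributes:
        return Counter(labels).most_common(1)[0][0]          # no attributes left: plurality
    best = max(attributes, key=lambda a: info_gain(a, examples, labels))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        idx = [i for i, e in enumerate(examples) if e[best] == value]
        sub_ex = [examples[i] for i in idx]
        sub_lab = [labels[i] for i in idx]
        rest = [a for a in attributes if a != best]
        tree[best][value] = decision_tree_learning(sub_ex, sub_lab, rest, labels)
    return tree

# Tiny made-up training set: predict WillWait from two attributes.
examples = [{"Patrons": "Some", "Hungry": "Yes"},
            {"Patrons": "Full", "Hungry": "No"},
            {"Patrons": "None", "Hungry": "No"},
            {"Patrons": "Full", "Hungry": "Yes"}]
labels = ["Yes", "No", "No", "Yes"]
print(decision_tree_learning(examples, labels, ["Patrons", "Hungry"]))
```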
Learning Curve: We can evaluate the accuracy of a learning algorithm with a learning curve,
as shown in Figure 18.7. We have 100 examples at our disposal, which we split into a training
set and a test set. We learn a hypothesis h with the training set and measure its accuracy with
the test set. We do this starting with a training set of size 1 and increasing one at a time up to
size 99. For each size we repeat the process of randomly splitting 20 times and average
the results of the 20 trials. The curve shows that as the training set size grows, the accuracy
increases. (For this reason, learning curves are also called happy graphs.) In this graph we
reach 95% accuracy, and it looks like the curve might continue to increase with more data.
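A minimal sketch of how such a curve could be produced: for each training-set size, repeat a random split several times, train a learner, and average the test accuracy. To keep the sketch self-contained it uses invented one-dimensional data and a simple 1-nearest-neighbor learner rather than the restaurant data and decision trees.

```python
import random

random.seed(1)
# Synthetic stand-in dataset: 100 one-dimensional examples with a noisy threshold rule.
data = [(x, int(x > 0.5) if random.random() < 0.9 else int(x < 0.5))
        for x in [random.random() for _ in range(100)]]

def predict_1nn(train_set, x):
    # Classify x by the label of its single nearest training example.
    return min(train_set, key=lambda ex: abs(ex[0] - x))[1]

def learning_curve(data, trials=20):
    curve = []
    for size in range(1, len(data)):           # training-set sizes 1 .. 99
        accs = []
        for _ in range(trials):                 # average over random splits
            shuffled = random.sample(data, len(data))
            train_set, test_set = shuffled[:size], shuffled[size:]
            correct = sum(predict_1nn(train_set, x) == y for x, y in test_set)
            accs.append(correct / len(test_set))
        curve.append((size, sum(accs) / trials))
    return curve

for size, acc in learning_curve(data)[::20]:    # print every 20th point
    print(f"training size {size:2d}: accuracy {acc:.2f}")
```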
For decision trees, a technique called decision tree pruning combats overfitting. Pruning
works by eliminating nodes that are not clearly relevant. We start with a full tree, as generated
by DECISION-TREE-LEARNING. We then look at a test node that has only leaf nodes as
descendants. If the test appears to be irrelevant (detecting only noise in the data), we
eliminate the test, replacing it with a leaf node. We repeat this process, considering each test
with only leaf descendants, until each one has either been pruned or accepted as is.
ENSEMBLE LEARNING:
The idea of ensemble learning methods is to select a whole collection, or ensemble, of
hypotheses from the hypothesis space and combine their predictions. For example, we might
generate a hundred different decision trees from the same training set and have them vote on
the best classification for a new example. The motivation for ensemble learning is simple.
Consider an ensemble of M = 5 hypotheses and suppose that we combine their predictions
using simple majority voting. For the ensemble to misclassify a new example, at least three of
the five hypotheses have to misclassify it. The hope is that this is much less likely than a
misclassification by a single hypothesis.
Suppose we assume that each hypothesis hi in the ensemble has an error of p; that is, the
probability that a randomly chosen example is misclassified by hi is p. Furthermore, suppose
we assume that the errors made by each hypothesis are independent. In that case, if p is small,
then the probability of a large number of misclassifications occurring is minuscule. For
example, a simple calculation (Exercise 18.14) shows that using an ensemble of five
hypotheses reduces an error rate of 1 in 10 down to an error rate of less than 1 in 100.
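That calculation is easy to reproduce: with five independent hypotheses, each wrong with probability p = 0.1, the majority vote is wrong only when at least three of them are wrong. A quick sketch using the binomial distribution:

```python
from math import comb

def majority_error(p, m):
    """Probability that a majority of m independent hypotheses (error rate p) are wrong."""
    need = m // 2 + 1                      # smallest number of wrong votes that loses the vote
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(need, m + 1))

print(majority_error(0.1, 1))   # single hypothesis: 0.1
print(majority_error(0.1, 5))   # five hypotheses: about 0.0086, i.e. less than 1 in 100
```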
Example:
Figure 18.8 shows how this can result in a more expressive hypothesis space. If the original
hypothesis space allows for a simple and efficient learning algorithm, then the ensemble
method provides a way to learn a much more expressive class of hypotheses without incurring
much additional computational or algorithmic complexity.
The most widely used ensemble method is called boosting. To understand how it works, we
need first to explain the idea of a weighted training set. In such a training set, each example
has an associated weight wj > 0. The higher the weight of an example, the higher is the
importance attached to it during the learning of a hypothesis. It is straightforward to modify
the learning algorithms we have seen so far to operate with weighted training sets.
Boosting starts with wj = 1 for all the examples (i.e., a normal training set). From this set, it
generates the first hypothesis, h1. This hypothesis will classify some of the training examples
correctly and some incorrectly. We would like the next hypothesis to do better on the
misclassified examples, so we increase their weights while decreasing the weights of the
correctly classified examples. From this new weighted training set, we generate hypothesis h2.
The process continues in this way until we have generated M hypotheses, where M is an input
to the boosting algorithm. The final ensemble hypothesis is a weighted-majority combination
of all the M hypotheses, each weighted according to how well it performed on the training set.
Figure 18.9 shows how the algorithm works conceptually.
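A compact sketch of this reweighting loop in the AdaBoost style (it does not reproduce Figure 18.10 verbatim): the weak learner is a one-dimensional decision stump, weights of correctly classified examples are multiplied by error/(1 − error) and renormalized after each round, and each hypothesis gets a vote proportional to log((1 − error)/error). The data are invented.

```python
import math

def stump_learn(xs, ys, weights):
    """Best threshold stump on 1-D data: predicts +1 on one side of a threshold."""
    best = None
    for threshold in xs:
        for sign in (+1, -1):
            preds = [sign if x >= threshold else -sign for x in xs]
            err = sum(w for p, y, w in zip(preds, ys, weights) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    err, threshold, sign = best
    return lambda x: sign if x >= threshold else -sign, err

def adaboost(xs, ys, M):
    n = len(xs)
    weights = [1.0 / n] * n
    hypotheses, z = [], []
    for _ in range(M):
        h, err = stump_learn(xs, ys, weights)
        err = max(err, 1e-9)                      # avoid division by zero
        # Decrease weights of correctly classified examples, then renormalize.
        weights = [w * (err / (1 - err)) if h(x) == y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        hypotheses.append(h)
        z.append(math.log((1 - err) / err))       # hypothesis (vote) weight
    def ensemble(x):
        return 1 if sum(zm * h(x) for zm, h in zip(z, hypotheses)) >= 0 else -1
    return ensemble

# Invented 1-D data that no single stump can classify perfectly.
xs = [1, 2, 3, 4, 5, 6]
ys = [+1, +1, -1, -1, +1, +1]
h = adaboost(xs, ys, M=5)
print([h(x) for x in xs])
```

With a handful of rounds the weighted vote can fit patterns that no single stump can, and increasing M generally drives the training error down, as the text describes next.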
The boosting algorithm is shown in Figure 18.10.
Example: Let us see how well boosting does on the restaurant data. We will choose as our
original hypothesis space the class of decision stumps, which are decision trees with just one
test at the root. The lower curve in Figure 18.11(a) shows that unboosted decision stumps are
not very effective for this data set, reaching a prediction performance of only 81% on 100
training examples. When boosting is applied (with M = 5), the performance is better, reaching
93% after 100 examples.
An interesting thing happens as the ensemble size M increases. Figure 18.11(b) shows the
training set performance (on 100 examples) as a function of M. Notice that the error reaches
zero (as the boosting theorem tells us) when M is 20; that is, a weighted-majority combination
of 20 decision stumps suffices to fit the 100 examples exactly. As more stumps are added to
the ensemble, the error remains at zero. The graph also shows that the test set performance
continues to increase long after the training set error has reached zero. At M = 20, the test
performance is 0.95 (or 0.05 error), and the performance increases to 0.98 as late as M = 137,
before gradually dropping to 0.95.
Learning decision lists:
A decision list is a logical expression of a restricted form. It consists of a series of tests, each
of which is a conjunction of literals. If a test succeeds when applied to an example description,
the decision list specifies the value to be returned. If the test fails, processing continues with
the next test in the list. Decision lists resemble decision trees, but their overall structure is
simpler. In contrast, the individual tests are more complex. Figure 18.13 shows a decision list
that represents the following hypothesis:
WillWait ⇔ (Patrons = Some) ∨ (Patrons = Full ∧ Fri/Sat).
It would seem reasonable to prefer small tests that match large sets of uniformly classified
examples, so that the overall decision list will be as compact as possible. The simplest strategy
is to find the smallest test t that matches any uniformly classified subset, regardless of the size
of the subset. Even this approach works quite well, as Figure 18.15 suggests.
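A decision list can be stored simply as an ordered list of (test, outcome) pairs and evaluated top to bottom. The sketch below does that, together with a naive version of the greedy strategy just described (find any small test that matches a uniformly classified subset); the dictionary-based example encoding and the four training examples are my own invention.

```python
from itertools import combinations

def matches(test, example):
    return all(example.get(a) == v for a, v in test)

def evaluate(decision_list, example, default=False):
    """Return the outcome of the first test whose literals all hold in the example."""
    for literals, outcome in decision_list:
        if matches(literals, example):
            return outcome
    return default

def learn_decision_list(examples, labels, max_literals=2):
    """Greedy sketch: repeatedly find a small test matching a uniformly labeled subset."""
    remaining = list(zip(examples, labels))
    candidates = sorted({(a, v) for e in examples for a, v in e.items()})
    dlist = []
    while remaining:
        found = None
        for size in range(1, max_literals + 1):
            for test in combinations(candidates, size):
                matched = [(e, y) for e, y in remaining if matches(test, e)]
                if matched and len({y for _, y in matched}) == 1:
                    found = (test, matched)
                    break
            if found:
                break
        if not found:
            break                     # no suitable test; a real learner would report failure
        test, matched = found
        dlist.append((list(test), matched[0][1]))
        remaining = [(e, y) for e, y in remaining if (e, y) not in matched]
    return dlist

# Tiny invented training set loosely in the spirit of the WillWait hypothesis above.
examples = [{"Patrons": "Some"}, {"Patrons": "Full", "FriSat": "Yes"},
            {"Patrons": "Full", "FriSat": "No"}, {"Patrons": "None"}]
labels = [True, True, False, False]
dl = learn_decision_list(examples, labels)
print(dl)
print([evaluate(dl, e) for e in examples])
```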
Statistical Learning: Instance-Based Learning:
In contrast to parametric learning, nonparametric learning methods allow the hypothesis
complexity to grow with the data. The more data we have, the wigglier the hypothesis can be.
We will look at two very simple families of nonparametric instance-based learning (or
memory-based learning) methods, so called because they construct hypotheses directly from
the training instances themselves.
Nearest-neighbor models:
The key idea of nearest-neighbor models is that the properties of any particular input point x
are likely to be similar to those of points in the neighborhood of x. For example, if we want to
do density estimation, that is, estimate the value of an unknown probability density at x, then
we can simply measure the density with which points are scattered in the neighborhood of x.
This sounds very simple, until we realize that we need to specify exactly what we mean by
"neighborhood." If the neighborhood is too small, it won't contain any data points; too large,
and it may include all the data points, resulting in a density estimate that is the same
everywhere. One solution is to define the neighborhood to be just big enough to include k
points, where k is large enough to ensure a meaningful estimate. For fixed k, the size of the
neighborhood varies: where data are sparse, the neighborhood is large, but where data are
dense, the neighborhood is small.
Example: Figure 20.12(a) shows an example for data scattered in two dimensions. Figure
20.13 shows the results of k-nearest-neighbor density estimation from these data with k = 3,
10, and 40 respectively. For k = 3, the density estimate at any point is based on only 3
neighboring points and is highly variable. For k = 10, the estimate provides a good
reconstruction of the true density shown in Figure 20.12(b). For k = 40, the neighborhood
becomes too large and the structure of the data is altogether lost. In practice, using a value of k
somewhere between 5 and 10 gives good results for most low-dimensional data sets. A good
value of k can also be chosen by using cross-validation.
To identify the nearest neighbors of a query point, we need a distance metric, D(x1, x2). It is
also possible to use the nearest-neighbor idea for direct supervised learning. Given a test
example with input x, the output y = h(x) is obtained from the y-values of the k nearest
neighbors of x. In the discrete case, we can obtain a single prediction by majority vote. In the
continuous case, we can average the k values or do local linear regression, fitting a hyperplane
to the k points and predicting the value at x according to the hyperplane.
Advantage: The k-nearest-neighbor learning algorithm is very simple to implement, requires
little in the way of tuning, and often performs quite well.
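A minimal k-nearest-neighbor classifier along the lines just described, using Euclidean distance and majority voting; the two-dimensional points are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Invented 2-D training data: two loose clusters labeled "A" and "B".
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.3), "A"),
         ((4.0, 4.2), "B"), ((4.3, 3.9), "B"), ((3.8, 4.1), "B")]

print(knn_predict(train, (1.1, 1.0), k=3))   # expected "A"
print(knn_predict(train, (4.1, 4.0), k=3))   # expected "B"
```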
Neural Networks:
A neuron is a cell in the brain whose principal function is the collection, processing, and
dissemination of electrical signals. The brain's information-processing capacity is thought to
emerge primarily from networks of such neurons. For this reason, some of the earliest AI work
aimed to create artificial neural networks. (Other names for the field include connectionism,
parallel distributed processing, and neural computation.) Figure 20.15 shows a simple
mathematical model of the neuron devised by McCulloch and Pitts (1943). Roughly speaking,
it "fires" when a linear combination of its inputs exceeds some threshold.
Notice that we have included a bias weight W0,i connected to a fixed input a0 = −1. The
activation function g is designed to meet two criteria. First, we want the unit to be "active" (near
+1) when the "right" inputs are given, and "inactive" (near 0) when the "wrong" inputs are
given. Second, the activation needs to be nonlinear; otherwise the entire neural network
collapses into a simple linear function. Two choices for g are shown in Figure 20.16: the
threshold function and the sigmoid function (also known as the logistic function). The
sigmoid function has the advantage of being differentiable, which we will see later is important
for the weight-learning algorithm. Notice that both functions have a threshold (either hard or
soft) at zero; the bias weight W0,i sets the actual threshold for the unit, in the sense that the unit
is activated when the weighted sum of the "real" inputs exceeds W0,i.
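A small sketch of this unit model: a weighted sum that includes the bias weight W0 attached to a fixed input a0 = −1, passed through either the hard threshold or the sigmoid. The particular weights and inputs are arbitrary illustrative values.

```python
import math

def threshold(z):
    return 1 if z > 0 else 0

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def unit_output(weights, inputs, g, bias_weight):
    # in_i = sum_j W_j,i * a_j, with a_0 = -1 carrying the bias weight W_0,i
    z = bias_weight * (-1) + sum(w * a for w, a in zip(weights, inputs))
    return g(z)

inputs = [1.0, 0.0]                     # example input activations
weights = [0.6, 0.6]                    # example connection weights
print(unit_output(weights, inputs, threshold, bias_weight=0.5))   # fires, since 0.6 > 0.5
print(unit_output(weights, inputs, sigmoid, bias_weight=0.5))     # soft version, about 0.52
```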
Network Structures: There are two main categories of neural network structures: acyclic or
feed-forward networks and cyclic or recurrent networks. A feed-forward network
represents a function of its current input; thus, it has no internal state other than the weights
themselves. A recurrent network, on the other hand, feeds its outputs back into its own inputs.
This means that the activation levels of the network form a dynamical system that may reach a
stable state or exhibit oscillations or even chaotic behaviour.
Feed-Forward Network: Let us look more closely into the assertion that a feed-forward
network represents a function of its inputs. Consider the simple network shown in Figure 20.18,
which has two input units, two hidden units, and an output unit. (To keep things simple, we
have omitted the bias units in this example.) Given an input vector x = (x1, x2), the activations
of the input units are set to (a1, a2) = (x1, x2) and the network computes
a5 = g(W3,5 a3 + W4,5 a4) = g(W3,5 g(W1,3 a1 + W2,3 a2) + W4,5 g(W1,4 a1 + W2,4 a2)).
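That nested expression can be evaluated directly. Below is a tiny forward-pass sketch for this 2-input, 2-hidden-unit, 1-output network, with arbitrary example weights; W[j][i] stands for the weight on the link from unit j to unit i.

```python
import math

def g(z):                      # sigmoid activation
    return 1 / (1 + math.exp(-z))

# Arbitrary example weights, indexed W[from_unit][to_unit].
W = {1: {3: 0.4, 4: -0.2},
     2: {3: 0.7, 4: 0.9},
     3: {5: 1.5},
     4: {5: -1.1}}

def forward(x1, x2):
    a1, a2 = x1, x2                                  # input activations
    a3 = g(W[1][3] * a1 + W[2][3] * a2)              # hidden unit 3
    a4 = g(W[1][4] * a1 + W[2][4] * a2)              # hidden unit 4
    a5 = g(W[3][5] * a3 + W[4][5] * a4)              # output unit 5
    return a5

print(forward(1.0, 0.0))
```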
A neural network can be used for classification or regression. For Boolean classification with
continuous outputs (e.g., with sigmoid units), it is traditional to have a single output unit, with
a value over 0.5 interpreted as one class and a value below 0.5 as the other. For k-way
classification, one could divide the single output unit's range into k portions, but it is more
common to have k separate output units, with the value of each one representing the relative
likelihood of that class given the current input.
Single layer feed-forward neural networks (perceptrons): A network with all the inputs
connected directly to the outputs is called a single-layer neural network, or a perceptron
network. Since each output unit is independent of the others (each weight affects only one of
the outputs), we can limit our study to perceptrons with a single output unit, as shown in
Figure 20.19(a).
The threshold perceptron returns 1 if and only if the weighted sum of its inputs (including the
bias) is positive, that is, if and only if W · x > 0.
Now, the equation W · x = 0 defines a hyperplane in the input space, so the perceptron returns
1 if and only if the input is on one side of that hyperplane. For this reason, the threshold
perceptron is called a linear separator. Figure 18.21(a) and (b) show
this hyperplane (a line, in two dimensions) for the perceptron representations of the AND and
OR functions of two inputs. The perceptron can represent these functions because there is some
line that separates all the white dots from all the black dots. Such functions are called linearly
separable.
Figure 18.21(c) shows an example of a function that is not linearly separable: the XOR function.
Clearly, there is no way for a threshold perceptron to learn this function. In general, threshold
perceptrons can represent only linearly separable functions.
A gradient descent learning algorithm is used for perceptron learning. The perceptron
learning algorithm is outlined in Figure 20.21.
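A sketch of the classic perceptron weight-update rule Wj ← Wj + α (y − h(x)) xj (the gradient-descent-style rule referred to above, not a transcription of Figure 20.21), trained on the linearly separable OR function; the learning rate and epoch count are arbitrary.

```python
def perceptron_output(weights, x):
    # x includes a leading constant 1 so that weights[0] acts as the bias term
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

def train_perceptron(data, alpha=0.1, epochs=25):
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            err = y - perceptron_output(weights, x)
            weights = [w + alpha * err * xi for w, xi in zip(weights, x)]
    return weights

# OR function of two inputs, each example prefixed with the constant input 1.
or_data = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = train_perceptron(or_data)
print(w, [perceptron_output(w, x) for x, _ in or_data])   # should reproduce OR

# XOR, by contrast, is not linearly separable, so no weight vector can fit it exactly.
```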
Figure 20.22 shows the learning curve for a perceptron on two different problems. On the left,
we show the curve for learning the majority function with 11 Boolean inputs (i.e., the function
outputs a 1 if 6 or more inputs are 1). On the right, we have the restaurant example. The WillWait
problem is easily represented as a decision tree, but is not linearly separable. The best plane
through the data correctly classifies only 65%.
Multilayer feed-forward neural networks:
A multilayer feed-forward neural network contains one or more hidden layers. The most common case
involves a single hidden layer, as in Figure 20.24. The advantage of adding hidden layers is that
it enlarges the space of hypotheses that the network can represent. With more hidden units, we
can produce more bumps of different sizes in more places. In fact, with a single, sufficiently
large hidden layer, it is possible to represent any continuous function of the inputs with
arbitrary accuracy; with two layers, even discontinuous functions can be represented.
Unfortunately, for any particular network structure, it is harder to characterize exactly which
functions can be represented and which ones cannot.
Learning algorithms for multilayer networks are similar to the perceptron learning algorithm
shown in Figure 20.21. One minor difference is that we may have several outputs, so we have
an output vector hw(x) rather than a single value, and each example has an output vector y. The
major difference is that, whereas the error y − hw(x) at the output layer is clear, the error at the
hidden layers seems mysterious because the training data do not say what values the hidden
nodes should have. It turns out that we can back-propagate the error from the output layer to
the hidden layers. The back-propagation process emerges directly from a derivation of the
overall error gradient. First, we will describe the process with an intuitive justification; then,
we will show the derivation.
For the mathematically inclined, we will now derive the back-propagation equations from first
principles. The squared error on a single example is defined as
E = (1/2) Σk (yk − ak)²,
where the sum is over the output nodes k, yk is the target value, and ak is the network's output.
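For concreteness, here is a compact sketch of a few back-propagation steps on a 2-input, 2-hidden, 1-output sigmoid network minimizing this squared error: the output delta is (y − a5) · g'(in5), the hidden deltas are obtained by passing it back through the outgoing weights, and every weight is nudged by α · aj · Δi. The initial weights, the single training example, and the learning rate are arbitrary.

```python
import math

def g(z):
    return 1 / (1 + math.exp(-z))

# Arbitrary initial weights W[(j, i)] for a 2-input, 2-hidden (units 3, 4), 1-output (unit 5) net.
W = {(1, 3): 0.3, (2, 3): -0.1, (1, 4): 0.2, (2, 4): 0.4, (3, 5): 0.5, (4, 5): -0.3}
alpha = 0.5                                   # learning rate (arbitrary)
x1, x2, y = 1.0, 0.0, 1.0                     # one training example (arbitrary)

for step in range(3):                         # a few gradient steps
    # Forward pass
    a1, a2 = x1, x2
    a3 = g(W[(1, 3)] * a1 + W[(2, 3)] * a2)
    a4 = g(W[(1, 4)] * a1 + W[(2, 4)] * a2)
    a5 = g(W[(3, 5)] * a3 + W[(4, 5)] * a4)
    # Backward pass: output delta, then hidden deltas via the outgoing weights
    d5 = (y - a5) * a5 * (1 - a5)
    d3 = a3 * (1 - a3) * W[(3, 5)] * d5
    d4 = a4 * (1 - a4) * W[(4, 5)] * d5
    # Weight updates: W[(j, i)] <- W[(j, i)] + alpha * a_j * delta_i
    for (j, i), a_j, d_i in [((1, 3), a1, d3), ((2, 3), a2, d3),
                             ((1, 4), a1, d4), ((2, 4), a2, d4),
                             ((3, 5), a3, d5), ((4, 5), a4, d5)]:
        W[(j, i)] += alpha * a_j * d_i
    print(f"step {step}: output {a5:.3f}, squared error {0.5 * (y - a5) ** 2:.4f}")
```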
Figure 20.26 shows how a network with a single hidden layer performs on the restaurant problem. In Figure
20.26, we show two curves. The first is a training curve, which shows the mean squared error
on a given training set of 100 restaurant examples during the weight-updating process. This
demonstrates that the network does indeed converge to a perfect fit to the training data. The
second curve is the standard learning curve for the restaurant data. The neural network does
learn well, although not quite as fast as decision-tree learning; this is perhaps not surprising,
because the data were generated from a simple decision tree in the first place.
Advantage: Neural networks are, of course, capable of far more complex learning tasks,
although it must be said that a certain amount of twiddling is needed to get the network structure
right and to achieve convergence to something close to the global optimum in weight space.
Learning neural network structures: We also need to understand how to find the best
network structure. If we choose a network that is too big, it will be able to memorize all the
examples by forming a large lookup table, but will not necessarily generalize well to inputs
that have not been seen before. In other words, like all statistical models, neural networks are
subject to overfitting when there are too many parameters in the model.
If we stick to fully connected networks, the only choices to be made concern the number of
hidden layers and their sizes. The usual approach is to try several and keep the best. The cross-
validation techniques of Chapter 18 are needed if we are to avoid peeking at the test set. That
is, we choose the network architecture that gives the highest prediction accuracy on the
validation sets.
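A schematic sketch of this "try several sizes and keep the best" procedure using k-fold cross-validation. The training routine is a placeholder argument (a hypothetical train_fn), standing in for whatever network implementation is available; only the selection logic is shown.

```python
def cross_validation_score(train_fn, data, hidden_size, folds=5):
    """Average validation accuracy of a network with `hidden_size` hidden units."""
    fold_len = len(data) // folds
    scores = []
    for i in range(folds):
        val = data[i * fold_len:(i + 1) * fold_len]
        train = data[:i * fold_len] + data[(i + 1) * fold_len:]
        model = train_fn(train, hidden_size)          # hypothetical training routine
        correct = sum(model(x) == y for x, y in val)
        scores.append(correct / len(val))
    return sum(scores) / folds

def choose_architecture(train_fn, data, candidate_sizes=(2, 4, 8, 16)):
    # Keep the hidden-layer size with the best cross-validated accuracy.
    return max(candidate_sizes,
               key=lambda h: cross_validation_score(train_fn, data, h))

# Tiny demo with a dummy "train" function that ignores the data (purely illustrative).
if __name__ == "__main__":
    dummy_data = [((i,), i % 2) for i in range(20)]
    dummy_train = lambda train, hidden_size: (lambda x: x[0] % 2)
    print(choose_architecture(dummy_train, dummy_data))
```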
If we want to consider networks that are not fully connected, then we need to find some
effective search method through the very large space of possible connection topologies. The
optimal brain damage algorithm begins with a fully connected network and removes
connections from it. After the network is trained for the first time, an information-theoretic
approach identifies an optimal selection of connections that can be dropped. The network is
then retrained, and if its performance has not decreased then the process is repeated. In addition
to removing connections, it is also possible to remove units that are not contributing much to
the result.
Several algorithms have been proposed for growing a larger network from a smaller one. One,
the tiling algorithm, resembles decision-list learning. The idea is to start with a single unit that
does its best to produce the correct output on as many of the training examples as possible.
Subsequent units are added to take care of the examples that the first unit got wrong. The
algorithm adds only as many units as are needed to cover all the examples.