AI Module V Part 2
Learning
Syllabus
Uncertain Knowledge: Uncertainty: Acting under uncertainty, basic probability notation, the axioms of
probability, inference using full joint distributions, independence, Bayes' rule and its use, the wumpus world
revisited. Learning: Learning from Observations: Forms of learning, inductive learning, learning decision
trees, ensemble learning. Why Learning Works: Computational learning theory.
CSEN2031: ARTIFICIAL INTELLIGENCE Module V LECTURE NOTES
In (3), the theory of braking is a function from states and braking actions to, say, stopping distance in feet.
Notice that in cases (1) and (2), a teacher provided the correct output value of the examples; in the third, the
output value was available directly from the agent's percepts. For fully observable environments, it will
always be the case that an agent can observe the effects of its actions and hence can use supervised learning
methods to learn to predict them. For partially observable environments, the problem is more difficult,
because the immediate effects might be invisible.
Reinforcement learning:
Rather than being told what to do by a teacher, a reinforcement learning agent must learn from
reinforcement: occasional rewards or punishments that indicate how well it is doing.
The representation of the learned information plays a role in determining how the learning algorithm
must work. Any of the components of an agent can be represented using any of the available
representation schemes, such as linear weighted polynomials, propositional and first-order logical
sentences, and probabilistic descriptions.
The availability of prior knowledge also plays a major role in the design of a learning system.
1.2 Inductive Learning:
Consider an example pair (x, f (x)), where x is the input and f(x) is the output of the function applied to x.
The task of pure inductive inference is “Given a collection of examples of f, return a function h that
approximates f”. The function h is called a hypothesis.
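The idea of pure inductive inference can be sketched in a few lines of Python: search a hypothesis space for an h that agrees with every observed (x, f(x)) pair. The target function, the example pairs, and the tiny hypothesis space below are assumptions for illustration, not from the text.

```python
# Sketch of pure inductive inference: return a hypothesis h that is
# consistent with (agrees with f on) every example pair.

def consistent(h, examples):
    """True if h agrees with the target output on every example."""
    return all(h(x) == y for x, y in examples)

def induce(hypothesis_space, examples):
    """Return the first consistent hypothesis, or None if there is none."""
    for h in hypothesis_space:
        if consistent(h, examples):
            return h
    return None

# Examples drawn from the (unknown) target f(x) = 2x + 1.
examples = [(0, 1), (1, 3), (2, 5)]

H = [lambda x: x + 1,         # degree-1 candidate (inconsistent)
     lambda x: 2 * x + 1,     # degree-1 candidate (the true f)
     lambda x: x * x + 1]     # degree-2 candidate (inconsistent)

h = induce(H, examples)
print(h(3))  # the induced hypothesis predicts f(3) = 7
```

The induced h then serves as a stand-in for f on inputs that were never observed.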
Inductive learning involves finding a consistent hypothesis that agrees with the examples.
Figure 1.2.1 shows a familiar example: fitting a function of a single variable to some data points. The
examples are (x, f(x)) pairs, where both x and f(x) are real numbers. We choose the hypothesis
space H (the set of hypotheses we will consider) to be the set of polynomials of degree at most k.
Figure (a) shows some data with an exact fit by a straight line (a polynomial of degree 1). The line is
called a consistent hypothesis because it agrees with all the data. Figure (b) shows a high-degree
polynomial that is also consistent with the same data. This illustrates the first issue in inductive
learning: how do we choose from among multiple consistent hypotheses?
Ockham’s razor suggests choosing the simplest hypothesis that is consistent with the data.
Figure (c) shows a second data set. There is no consistent straight line for this data set; in fact, it
requires a degree-6 polynomial (with 7 parameters) for an exact fit. There are just 7 data points, so the
polynomial has as many parameters as there are data points: thus, it does not seem to be finding any
pattern in the data and we do not expect it to generalize well. It might be better to fit a simple straight
line that is not exactly consistent but might make reasonable predictions.
Figure (d) shows that the data in (c) can be fit exactly by a simple function of the form ax + b + c sin
x. This example shows the importance of the choice of hypothesis space. A hypothesis space
consisting of polynomials of finite degree cannot represent sinusoidal functions accurately, so a
learner using that hypothesis space will not be able to learn from sinusoidal data.
A learning problem is realizable if the hypothesis space contains the true function; otherwise, it is
unrealizable.
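Realizability can be demonstrated with a small Boolean sketch (the XOR target and the two hypothesis spaces below are assumptions for illustration, not from the text): XOR of two inputs is unrealizable in a space whose hypotheses may only look at one attribute, but becomes realizable once the space is enlarged.

```python
# Sketch of realizability: a learning problem is realizable only if the
# hypothesis space contains the true function.
from itertools import product

# All four labelled examples of the target f(a, b) = a XOR b.
examples = [((a, b), a ^ b) for a, b in product([0, 1], repeat=2)]

# H1: hypotheses that test a single attribute (or its negation).
H1 = [lambda x: x[0], lambda x: x[1],
      lambda x: 1 - x[0], lambda x: 1 - x[1]]

# H2: the same space enlarged to include XOR itself.
H2 = H1 + [lambda x: x[0] ^ x[1]]

def realizable(H, examples):
    """True if some hypothesis in H agrees with every example."""
    return any(all(h(x) == y for x, y in examples) for h in H)

print(realizable(H1, examples))  # False: XOR is not in the space
print(realizable(H2, examples))  # True: the space contains f
```

The same phenomenon is at work in Figure 1.2.1(d): no finite-degree polynomial space contains the sinusoidal target, so the problem is unrealizable there.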
Figure 1.2.1 (a) Example (x, f (x)) pairs and a consistent, linear hypothesis. (b) A consistent,
degree-7 polynomial hypothesis for the same data set. (c) A different data set that admits an exact
degree-6 polynomial fit or an approximate linear fit. (d) A simple, exact sinusoidal fit to the same
data set.
1.3 Learning Decision Trees:
A decision tree takes as input an object or situation described by a set of attributes and returns a
"decision": the predicted output value for the input.
The input attributes and the output values can be discrete or continuous.
Learning a discrete-valued function is called classification, whereas learning a continuous-valued
function is called regression.
A decision tree reaches its decision by performing a sequence of tests.
Each internal node in the tree corresponds to a test of the value of one of the properties.
The branches from the node are labelled with the possible values of the test.
Each leaf node in the tree specifies the value to be returned if that leaf is reached.
Consider an example problem described by a list of attributes: should we wait for a table at a restaurant?
Decision tree:
Here, Attributes are processed by the tree starting at the root and following the appropriate branch until a leaf
is reached. For instance, an example with Patrons = Full and Wait Estimate = 0-10 will be classified as
positive (i.e., yes, we will wait for a table).
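This root-to-leaf walk is easy to sketch in Python. The tree below encodes only a fragment of the restaurant tree as a nested structure; the attribute names, values, and branch outcomes are assumed from the text's example, not a full transcription of the figure.

```python
# Sketch of decision-tree classification: internal nodes are
# (attribute, branches) pairs; leaves are decision strings.

tree = ("Patrons", {
    "None": "No",
    "Some": "Yes",
    "Full": ("WaitEstimate", {        # second test on the Full branch
        ">60":   "No",
        "30-60": "No",                # (the full tree tests more here)
        "10-30": "Yes",
        "0-10":  "Yes",
    }),
})

def classify(node, example):
    """Follow the branch for each tested attribute until a leaf is reached."""
    while not isinstance(node, str):          # leaves are plain strings
        attribute, branches = node
        node = branches[example[attribute]]
    return node

print(classify(tree, {"Patrons": "Full", "WaitEstimate": "0-10"}))  # Yes
```

The example from the text, Patrons = Full and WaitEstimate = 0-10, follows the Full branch and then the 0-10 branch, reaching a positive leaf.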
Logically, the tree can be expressed as an assertion of the form
Goal ⇔ (P1(s) ∨ P2(s) ∨ ... ∨ Pn(s)),
where each condition Pi(s) is a conjunction of tests corresponding to a path from the root of the tree to a leaf
with a positive outcome.
Decision trees can express any function of the input attributes. For Boolean functions, each row of the truth
table corresponds to a path to a leaf.
If the function is the parity function, which returns 1 if and only if an even number of inputs are 1, then an
exponentially large decision tree will be needed. It is also difficult to use a decision tree to represent a
majority function, which returns 1 if more than half of its inputs are 1.
The truth table has 2^n rows, because each input case is described by n attributes. We can consider the
"answer" column of the table as a 2^n-bit number that defines the function.
Inducing Decision Trees from Examples:
An example for a Boolean decision tree consists of a vector of input attributes, X, and a single Boolean
output value y. A set of examples (X1, y1), ..., (X12, y12) is shown in the following figure.
The positive examples are the ones in which the goal WillWait is true (X1, X3, ...); the negative
examples are the ones in which it is false (X2, X5, ...).
A trivial solution is to construct a decision tree that has one path to a leaf for each example, where the
path tests each attribute in turn and follows the value for the example, and the leaf has the
classification of the example. When given the same example again, the decision tree will come up
with the right classification; unfortunately, such a tree merely memorizes the observations and is
unlikely to generalize to unseen examples.
Algorithm:
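The recursive algorithm can be sketched in Python as follows. This is a minimal version of decision-tree learning: the attribute-choice step below simply takes the first remaining attribute, whereas the full algorithm picks the most informative one (see the information-gain discussion later), and the toy data set is an assumption for illustration.

```python
# Sketch of recursive decision-tree learning over (attributes, label) examples.
from collections import Counter

def plurality(examples):
    """Most common output value among the examples (majority vote)."""
    return Counter(y for _, y in examples).most_common(1)[0][0]

def dtl(examples, attributes, parent_examples):
    if not examples:                     # no examples left: use parent's majority
        return plurality(parent_examples)
    classes = {y for _, y in examples}
    if len(classes) == 1:                # all remaining examples agree
        return classes.pop()
    if not attributes:                   # noise or missing attributes: majority vote
        return plurality(examples)
    a = attributes[0]                    # placeholder choice; use information gain in practice
    rest = attributes[1:]
    values = {x[a] for x, _ in examples}
    branches = {v: dtl([(x, y) for x, y in examples if x[a] == v], rest, examples)
                for v in values}
    return (a, branches)                 # internal node: (attribute, subtrees)

# Toy data (assumed): wait only when Patrons == "Some".
data = [({"Patrons": "Some"}, "Yes"),
        ({"Patrons": "None"}, "No"),
        ({"Patrons": "Full"}, "No")]

attribute, branches = dtl(data, ["Patrons"], data)
print(attribute)                          # Patrons
print(branches["Some"], branches["Full"])  # Yes No
```

The four cases handled by the recursion correspond exactly to the numbered cases discussed below.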
Splitting the examples by testing on attributes. (a) Splitting on Type brings us no nearer to distinguishing
between positive and negative examples. (b) Splitting on Patrons does a good job of separating positive and
negative examples. After splitting on Patrons, Hungry is a fairly good second test.
2. If all the remaining examples are positive (or all negative), then we are done: we can answer Yes or No.
Figure (b) shows examples of this in the None and Some cases.
3. If there are no examples left, it means that no such example has been observed, and we return a default
value calculated from the majority classification at the node's parent.
4. If there are no attributes left, but both positive and negative examples, we have a problem. It means that
these examples have exactly the same description, but different classifications. This happens when some of
the data are incorrect; we say there is noise in the data. It also happens either when the attributes do not give
enough information to describe the situation fully, or when the domain is truly nondeterministic. One simple
way out of the problem is to use a majority vote.
Choosing Attribute Tests:
One suitable measure is the expected amount of information provided by the attribute.
We can think of an attribute test as giving an answer to a question; the amount of information
contained in the answer depends on one's prior knowledge.
If the possible answers vi have probabilities P(vi), then the information content I of the actual answer
is given by

I(P(v1), ..., P(vn)) = Σi −P(vi) log2 P(vi)

Suppose the training set contains p positive examples and n negative examples. Then an estimate of
the information contained in a correct answer is

I(p/(p+n), n/(p+n)) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))

A test on an attribute A with v distinct values divides the training set into subsets E1, ..., Ev, where
subset Ei has pi positive and ni negative examples. After the test we still need

Remainder(A) = Σi ((pi + ni)/(p + n)) · I(pi/(pi+ni), ni/(pi+ni))

bits of information to classify an example. The information gain from the attribute test is the
difference between the original information requirement and the new requirement:

Gain(A) = I(p/(p+n), n/(p+n)) − Remainder(A)

The heuristic is to choose the attribute with the largest gain.
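The entropy and gain computations can be sketched as follows. The restaurant counts (p = n = 6 overall; Patrons splitting the examples as 0+/2−, 4+/0−, 2+/4−; Type splitting them evenly) are assumed from the worked example in the figure.

```python
# Sketch of information content and information gain for attribute tests.
from math import log2

def I(p, n):
    """Information content (in bits) of a p-positive / n-negative split."""
    total = p + n
    result = 0.0
    for k in (p, n):
        if k:  # 0 * log2(0) is taken to be 0
            result -= (k / total) * log2(k / total)
    return result

def gain(p, n, splits):
    """Information gain: I before the test minus the expected I after.
    splits = [(p_i, n_i), ...], one pair of counts per attribute value."""
    remainder = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in splits)
    return I(p, n) - remainder

# Patrons: None 0+/2-, Some 4+/0-, Full 2+/4-.
print(round(gain(6, 6, [(0, 2), (4, 0), (2, 4)]), 3))   # 0.541 bits
# Type splits the examples 1+/1-, 1+/1-, 2+/2-, 2+/2-: gain is (near) 0 bits.
print(round(gain(6, 6, [(1, 1), (1, 1), (2, 2), (2, 2)]), 3))
```

Patrons therefore wins by a wide margin over Type, matching the intuition from the splitting figure.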
A learning curve for the decision tree algorithm on 100 randomly generated examples in the
restaurant domain. The graph summarizes 20 trials.