
Module 5 - AI

Module V of CST 401 focuses on machine learning, covering learning from examples, forms of learning, supervised learning, decision trees, and evaluating hypotheses. It discusses various learning types, including supervised, unsupervised, and reinforcement learning, as well as techniques like decision tree induction and the importance of generalization to avoid overfitting. The module emphasizes the need for effective evaluation methods such as k-fold cross-validation to ensure the accuracy of learned hypotheses.

Uploaded by

zoro96437

Module V

CST 401 Artificial Intelligence

MODULE V

MACHINE LEARNING
SYLLABUS:
Learning from Examples – Forms of Learning, Supervised Learning, Learning
Decision Trees, Evaluating and choosing the best hypothesis, Regression and
classification with Linear models.

LEARNING FROM EXAMPLES


Here we describe agents that can improve their behavior through diligent study of
their own experiences.
An agent is learning if it improves its performance on future tasks after making
observations about the world.

FORMS OF LEARNING
Any component of an agent can be improved by learning from data. The
improvements, and the techniques used to make them, depend on four major factors:
• Which component is to be improved.
• What prior knowledge the agent already has.
• What representation is used for the data and the component.
• What feedback is available to learn from.

Components to be learned
The components of agents include:
1. A direct mapping from conditions on the current state to actions.
2. A means to infer relevant properties of the world from the percept sequence.
3. Information about the way the world evolves and about the results of possible
actions the agent can take.
4. Utility information indicating the desirability of world states.
5. Action-value information indicating the desirability of actions.

Universal Engineering College Page | 1



6. Goals that describe classes of states whose achievement maximizes the agent’s
utility.
Each of these components can be learned.

Representation and prior knowledge


Examples of representations for agent components include propositional and
first-order logical sentences for the components of a logical agent, and Bayesian
networks for the inferential components of a decision-theoretic agent.
There is another way to look at the various types of learning. Learning a (possibly
incorrect) general function or rule from specific input–output pairs is called
inductive learning. Analytical or deductive learning means going from a known
general rule to a new rule that is logically entailed, but is useful because it allows
more efficient processing.

Feedback to learn from


The type of feedback available determines the three main types of learning:
In unsupervised learning the agent learns patterns in the input even though no
explicit feedback is supplied. The most common unsupervised learning task is
clustering: detecting potentially useful clusters of input examples. For example, a
taxi agent might gradually develop a concept of “good traffic days” and “bad traffic
days” without ever being given labelled examples of each by a teacher.
In reinforcement learning the agent learns from a series of reinforcements (rewards
or punishments). For example, the lack of a tip at the end of the journey gives the taxi
agent an indication that it did something wrong, and the two points for a win at the
end of a chess game tell the agent it did something right. It is up to the agent to decide
which of the actions prior to the reinforcement were most responsible for it.
In supervised learning the agent observes some example input–output pairs and
learns a function that maps from input to output.
In practice these distinctions are not always crisp. In semi-supervised learning we are
given a few labelled examples and must make what we can of a large collection of
unlabelled examples.


SUPERVISED LEARNING

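The task of supervised learning is this: given a training set of N example
input–output pairs (x1, y1), (x2, y2), ..., (xN, yN), where each yj was generated by an
unknown function y = f(x), discover a function h that approximates the true function f.
The function h is a hypothesis. Learning is a search through the space of possible
hypotheses for one that will perform well, even on new examples beyond the
training set. To measure the accuracy of a hypothesis we give it a test set of
examples that are distinct from the training set. We say a hypothesis generalizes
well if it correctly predicts the value of y for novel examples.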
When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the
learning problem is called classification, and is called Boolean or binary
classification if there are only two values. When y is a number (such as tomorrow’s
temperature), the learning problem is called regression.
Figure 18.1 shows a familiar example: fitting a function of a single variable to some
data points. The examples are points in the (x, y) plane, where y = f(x). We don’t
know what f is, but we will approximate it with a function h selected from a
hypothesis space, H.

Figure 18.1(b) shows a high-degree polynomial that is also consistent with the same
data. This illustrates a fundamental problem in inductive learning: how do we choose
from among multiple consistent hypotheses? The answer is to prefer the simplest
hypothesis consistent with the data. This principle is called Ockham’s razor. In
machine learning it suggests that, when two models fit the data equally well, we
should prefer the simpler one (for example, a polynomial with fewer coefficients)
over a more complex one (for example, a large ensemble).

LEARNING DECISION TREES


Decision tree induction is one of the simplest and yet most successful forms of
machine learning.

The decision tree representation


A decision tree represents a function that takes as input a vector of attribute values
and returns a “decision”: a single output value. The input and output values can be
discrete or continuous. For now we will concentrate on problems where the inputs
have discrete values and the output has exactly two possible values; this is Boolean
classification, where each example input will be classified as true (a positive
example) or false (a negative example).
A decision tree reaches its decision by performing a sequence of tests. Each internal
node in the tree corresponds to a test of the value of one of the input attributes, A_i,
and the branches from the node are labelled with the possible values of the attribute,
A_i = v_ik. Each leaf node in the tree specifies a value to be returned by the function.
As an example, we will build a decision tree to decide whether to wait for a table at a
restaurant. The aim here is to learn a definition for the goal predicate, WillWait. First
we list the attributes that we will consider as part of the input:
1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar: whether the restaurant has a comfortable bar area to wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (values are None, Some, and Full).
6. Price: the restaurant’s price range ($, $$, $$$).
7. Raining: whether it is raining outside.
8. Reservation: whether we made a reservation.


9. Type: the kind of restaurant (French, Italian, Thai, or burger).
10. WaitEstimate: the wait estimated by the host (0–10 minutes, 10–30, 30–60, or >60).
One possible decision tree is shown in Figure 18.2.

Inducing decision trees from examples


An example for a Boolean decision tree consists of an (x, y) pair, where x is a vector
of values for the input attributes, and y is a single Boolean output value. A training
set of 12 examples is shown in Figure 18.3. The positive examples are the ones in
which the goal WillWait is true (x1, x3, . . .); the negative examples are the ones in
which it is false (x2, x5, . . .).


The DECISION-TREE-LEARNING algorithm adopts a greedy divide-and-conquer
strategy: always test the most important attribute first. This test divides the problem
up into smaller sub-problems that can then be solved recursively. By “most
important attribute,” we mean the one that makes the most difference to the
classification of an example. That way, we hope to get to the correct classification
with a small number of tests, meaning that all paths in the tree will be short and the
tree as a whole will be shallow.

Figure 18.4(a) shows that Type is a poor attribute, because it leaves us with four
possible outcomes, each of which has the same number of positive as negative
examples. On the other hand, in (b) we see that Patrons is a fairly important
attribute, because if the value is None or Some, then we are left with example sets for
which we can answer definitively (No and Yes, respectively). If the value is Full, we
are left with a mixed set of examples. In general, after the first attribute test splits up
the examples, each outcome is a new decision tree learning problem in itself, with
fewer examples and one less attribute. There are four cases to consider for these
recursive problems:


1. If the remaining examples are all positive (or all negative), then we are done: we
can answer Yes or No. Figure 18.4(b) shows examples of this happening in the None
and Some branches.
2. If there are some positive and some negative examples, then choose the best
attribute to split them. Figure 18.4(b) shows Hungry being used to split the
remaining examples.
3. If there are no examples left, it means that no example has been observed for this
combination of attribute values, and we return a default value calculated from the
plurality classification of all the examples that were used in constructing the node’s
parent. These are passed along in the variable parent_examples.
4. If there are no attributes left, but both positive and negative examples, it means
that these examples have exactly the same description, but different classifications.
This can happen because there is an error or noise in the data; because the domain is
nondeterministic; or because we can’t observe an attribute that would distinguish
the examples. The best we can do is to return the plurality classification of the
remaining examples.

The DECISION-TREE-LEARNING algorithm is shown in Figure 18.5. The output of
the learning algorithm on our sample training set is shown in Figure 18.6. The tree is
clearly different from the original tree shown in Figure 18.2. One might conclude
that the learning algorithm is not doing a very good job of learning the correct
function. This would be the wrong conclusion to draw, however. The learning
algorithm looks at the examples, not at the correct function, and in fact, its hypothesis
(see Figure 18.6) not only is consistent with all the examples, but is considerably
simpler than the original tree! The learning algorithm has no reason to include tests
for Raining and Reservation, because it can classify all the examples without them.
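
The four cases above can be sketched in Python as follows. This is a minimal illustration, not a reproduction of Figure 18.5: the tree representation (a leaf label, or a pair of an attribute and a dictionary of subtrees), the helper names, and the six-example toy data set are all assumptions made for the example. Attribute importance is measured with information gain, which these notes define under "Choosing attribute tests".

```python
import math
from collections import Counter

def plurality_value(examples):
    """Most common output label among the examples (ties broken arbitrarily)."""
    return Counter(y for _, y in examples).most_common(1)[0][0]

def entropy(examples):
    """Entropy, in bits, of the output labels of the examples."""
    n = len(examples)
    counts = Counter(y for _, y in examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def importance(attr, examples):
    """Information gain obtained by splitting the examples on attr."""
    n = len(examples)
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder

def decision_tree_learning(examples, attributes, domains, parent_examples=()):
    """Return a leaf label, or a pair (attribute, {value: subtree})."""
    if not examples:                       # case 3: no examples left
        return plurality_value(parent_examples)
    labels = {y for _, y in examples}
    if len(labels) == 1:                   # case 1: all positive or all negative
        return labels.pop()
    if not attributes:                     # case 4: no attributes left
        return plurality_value(examples)
    # case 2: mixed examples, so split on the most important attribute
    best = max(attributes, key=lambda a: importance(a, examples))
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in domains[best]:
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = decision_tree_learning(subset, rest, domains, examples)
    return (best, branches)

def classify(tree, x):
    """Follow the tests from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[x[attr]]
    return tree

# Hypothetical toy data with two attributes (not the 12 examples of Figure 18.3).
domains = {"Patrons": ["None", "Some", "Full"], "Hungry": [True, False]}
examples = [
    ({"Patrons": "Some", "Hungry": True}, True),
    ({"Patrons": "Some", "Hungry": False}, True),
    ({"Patrons": "None", "Hungry": False}, False),
    ({"Patrons": "None", "Hungry": True}, False),
    ({"Patrons": "Full", "Hungry": True}, True),
    ({"Patrons": "Full", "Hungry": False}, False),
]
tree = decision_tree_learning(examples, list(domains), domains)
print(tree)
```

On this toy data the root test is Patrons, and only the Full branch needs a second test on Hungry, which mirrors the behaviour described for Figure 18.4(b).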


We can evaluate the accuracy of a learning algorithm with a learning curve, as
shown in Figure 18.7. The curve shows that as the training set size grows, the
accuracy increases. (For this reason, learning curves are also called happy graphs.)
In this graph we reach 95% accuracy, and it looks like the curve might continue to
increase with more data.


Choosing attribute tests


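A perfect attribute divides the examples into sets, each of which is all positive or all
negative and thus will be a leaf of the tree. We will use the notion of information
gain, which is defined in terms of entropy.
Entropy is a measure of the uncertainty of a random variable. The entropy of a
random variable V with values v_k, each with probability P(v_k), is defined as,

\[ H(V) = \sum_k P(v_k) \log_2 \frac{1}{P(v_k)} = -\sum_k P(v_k) \log_2 P(v_k) \]

For a Boolean random variable that is true with probability q, we write the entropy as

\[ B(q) = -\left( q \log_2 q + (1 - q) \log_2 (1 - q) \right) \]

If a training set contains p positive and n negative examples, the entropy of the goal
attribute on the whole set is B(p/(p+n)).
The information gain from the attribute test on A is the expected reduction in
entropy:

\[ Gain(A) = B\!\left(\frac{p}{p+n}\right) - Remainder(A) \]

Where,

\[ Remainder(A) = \sum_{k=1}^{d} \frac{p_k + n_k}{p + n} \, B\!\left(\frac{p_k}{p_k + n_k}\right) \]

Here the attribute A has d distinct values, and p_k and n_k are the numbers of positive
and negative examples in the subset for which A takes its k-th value.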

(IMPORTANT: PRACTISE NUMERICAL QUESTIONS BASED ON DECISION TREE)
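
As practice for such numerical questions, the sketch below computes information gain from per-value counts of positive and negative examples. The counts for Patrons and Type are an assumption: they are taken from the standard 12-example restaurant data set (6 positive, 6 negative) that Figure 18.4 is based on, since the figure itself is not reproduced in these notes. The function names `B` and `gain` are chosen for this example only.

```python
import math

def B(q):
    """Entropy, in bits, of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0          # by convention 0 * log2(0) is taken as 0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def gain(splits):
    """Information gain of an attribute, given (p_k, n_k) counts for each value."""
    p = sum(pk for pk, _ in splits)
    n = sum(nk for _, nk in splits)
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)
    return B(p / (p + n)) - remainder

# (positive, negative) counts per attribute value, assumed from the restaurant data
patrons = [(0, 2), (4, 0), (2, 4)]            # None, Some, Full
rtype = [(1, 1), (1, 1), (2, 2), (2, 2)]      # French, Italian, Thai, burger

print(round(gain(patrons), 3))   # 0.541
print(gain(rtype))               # approximately 0: Type tells us nothing
```

The result agrees with the discussion of Figure 18.4: Patrons has a clearly positive gain, while every value of Type leaves an even split and so gives no reduction in entropy.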


Generalization and Overfitting


Generalization is a term used to describe a model’s ability to handle new data. That
is, after being trained on a training set, a model can take in new data and make
accurate predictions. A model’s ability to generalize is central to its success. If a
model fits the training data too closely, including its noise, it will be unable to
generalize: it will make inaccurate predictions on new data even though it predicts
the training data accurately. This is called overfitting. The opposite problem is
underfitting, which happens when a model is too simple or has not been trained
enough to capture the pattern in the data. An underfitted model is just as useless,
because it cannot make accurate predictions even on the training data.

The figure demonstrates the three concepts discussed above. On the left, the blue
line represents a model that is underfitting. The model notes that there is some trend
in the data, but it is not specific enough to capture relevant information. It is unable
to make accurate predictions for training or new data. In the middle, the blue line
represents a model that is balanced. This model notes there is a trend in the data,
and accurately models it. This middle model will be able to generalize successfully.
On the right, the blue line represents a model that is overfitting. The model notes a
trend in the data, and accurately models the training data, but it is too specific. It will
fail to make accurate predictions with new data because it learned the training data
too well.
For decision trees, a technique called decision tree pruning combats overfitting.
Pruning works by eliminating nodes that are not clearly relevant. We start with a full
tree, as generated by DECISION-TREE-LEARNING. We then look at a test node that
has only leaf nodes as descendants. If the test appears to be irrelevant (detecting
only noise in the data), then we eliminate the test, replacing it with a leaf node. We
repeat this process, considering each test with only leaf descendants, until each one
has either been pruned or accepted as is.

EVALUATING AND CHOOSING THE BEST HYPOTHESIS


We want to learn a hypothesis that fits the future data best. To define “best fit”, we
define the error rate of a hypothesis as the proportion of mistakes it makes, that is, the
proportion of times that h(x) ≠ y for an (x, y) example. Now, just because a
hypothesis h has a low error rate on the training set does not mean that it will
generalize well. To get an accurate evaluation of a hypothesis, we need to test it on a
set of examples it has not seen yet.
The simplest approach is the one we have seen already: randomly split the available
data into a training set from which the learning algorithm produces h and a test set
on which the accuracy of h is evaluated. This method, sometimes called holdout
cross-validation, has the disadvantage that it fails to use all the available data; if we
use half the data for the test set, then we are only training on half the data, and we
may get a poor hypothesis. On the other hand, if we reserve only 10% of the data for
the test set, then we may, by statistical chance, get a poor estimate of the actual
accuracy.
We can squeeze more out of the data and still get an accurate estimate using a
technique called k-fold cross-validation. The idea is that each example serves
double duty—as training data and test data. First we split the data into k equal
subsets. We then perform k rounds of learning; on each round 1/k of the data is held
out as a test set and the remaining examples are used as training data. The average
test set score of the k rounds should then be a better estimate than a single score.
Popular values for k are 5 and 10. The extreme is k = n, also known as leave-one-out
cross-validation or LOOCV.
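
The k-fold procedure can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: a learner is taken to be any function that maps a list of (x, y) training examples to a hypothesis h, and `majority_learner` is a hypothetical baseline invented for the example.

```python
import random

def k_fold_cross_validation(learner, examples, k, seed=0):
    """Average held-out error rate of `learner` over k rounds."""
    data = list(examples)
    random.Random(seed).shuffle(data)          # random split, reproducible via seed
    folds = [data[i::k] for i in range(k)]     # k subsets of (nearly) equal size
    errors = []
    for i in range(k):
        held_out = folds[i]                    # 1/k of the data is the test set
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h = learner(training)
        mistakes = sum(1 for x, y in held_out if h(x) != y)
        errors.append(mistakes / len(held_out))
    return sum(errors) / k

def majority_learner(training):
    """Hypothetical baseline: always predict the most common training label."""
    labels = [y for _, y in training]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess

# Toy data: 10 positive and 20 negative examples
examples = [(i, i % 3 == 0) for i in range(30)]
print(k_fold_cross_validation(majority_learner, examples, k=5))
print(k_fold_cross_validation(majority_learner, examples, k=len(examples)))  # LOOCV
```

Setting k equal to the number of examples gives leave-one-out cross-validation, at the cost of training the learner once per example.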
Despite the best efforts of statistical methodologists, users frequently invalidate their
results by inadvertently peeking at the test data. Peeking can happen like this: A
learning algorithm has various “knobs” that can be twiddled to tune its behavior,
for example, various different criteria for choosing the next attribute in decision tree


learning. The researcher generates hypotheses for various different settings of the
knobs, measures their error rates on the test set, and reports the error rate of the best
hypothesis. Alas, peeking has occurred! The reason is that the hypothesis was
selected on the basis of its test set error rate, so information about the test set has leaked
into the learning algorithm.
Peeking is a consequence of using test-set performance to both choose a hypothesis
and evaluate it. The way to avoid this is to really hold the test set out—lock it away
until you are completely done with learning and simply wish to obtain an
independent evaluation of the final hypothesis. (And then, if you don’t like the
results, you have to obtain, and lock away, a completely new test set if you want to
go back and find a better hypothesis.) If the test set is locked away, but you still want
to measure performance on unseen data as a way of selecting a good hypothesis,
then divide the available data (without the test set) into a training set and a
validation set.

Model selection: Complexity versus goodness of fit


• We can think of finding the best hypothesis as two tasks:
o Model selection defines the hypothesis space
o Optimization finds the best hypothesis within that space
• How do we select among models that are parameterized by size?
o With polynomials, we have size = 1 for linear functions, size = 2 for
quadratics and so on.
o For decision trees, the size could be the number of nodes in the tree
• We want to find the value of the size parameter that balances underfitting and
overfitting to give the best test set accuracy.
• An algorithm to perform model selection and optimization is shown in
Figure 18.8.


• A wrapper takes a learning algorithm as an argument
• The wrapper enumerates models according to the size parameter
• For each size, it uses cross-validation on the learner to compute the average error
rate on the training and held-out (validation) sets
• We start with the smallest, simplest models and iterate, considering more
complex models at each step, until the model starts to overfit
• Cross-validation then picks the value of size with the lowest validation set error
• We then generate a hypothesis of that size using all the data
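
A simplified version of such a wrapper is sketched below. It is not the algorithm of Figure 18.8: it tries every size in a given range instead of stopping once overfitting is detected, and it scores hypotheses by mean squared error, not error rate, because the toy problem is a regression. The learner `binned_learner` (a piecewise-constant fit with `size` equal-width bins) and the noisy sine data are hypothetical, chosen only so that the example is self-contained.

```python
import math
import random

def binned_learner(size, training):
    """Hypothetical learner: piecewise-constant fit with `size` equal bins on [0, 1)."""
    sums, counts = [0.0] * size, [0] * size
    for x, y in training:
        b = min(int(x * size), size - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(y for _, y in training) / len(training)
    # Empty bins fall back to the overall mean of the training outputs.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * size), size - 1)]

def mean_squared_error(h, examples):
    return sum((y - h(x)) ** 2 for x, y in examples) / len(examples)

def cross_validation_wrapper(learner, sizes, examples, k=5):
    """Pick the size with the lowest average validation error, then refit on all data."""
    folds = [examples[i::k] for i in range(k)]
    best_size, best_err = None, math.inf
    for size in sizes:
        err = 0.0
        for i in range(k):
            training = [ex for j, f in enumerate(folds) if j != i for ex in f]
            err += mean_squared_error(learner(size, training), folds[i]) / k
        if err < best_err:
            best_size, best_err = size, err
    return best_size, learner(best_size, examples)

# Toy data: noisy samples of one period of a sine wave
rng = random.Random(0)
data = [(x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.2))
        for x in (rng.random() for _ in range(200))]

best_size, h = cross_validation_wrapper(binned_learner, range(1, 41), data)
print(best_size, mean_squared_error(h, data))
```

Very small sizes underfit (a single bin predicts the mean everywhere), while very large sizes leave only a few examples per bin and start fitting noise, so the validation error is expected to be lowest somewhere in between.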

From error rates to loss

• Consider the problem of classifying emails as spam or non-spam


• It is worse to classify non-spam as spam than to classify spam as non-spam


• So a classifier with a 1% error rate, where almost all errors were classifying spam as
non-spam, would be better than a classifier with only a 0.5% error rate, if most of
those errors were classifying non-spam as spam
• Utility is what learners should maximize
• In machine learning, utility is usually expressed by means of a loss function
• The loss function L(x, y, 𝑦̂) is defined as the amount of utility lost by predicting
h(x) = 𝑦̂ when the correct answer is f(x) = y

\[ L(x, y, \hat{y}) = Utility(\text{result of using } y \text{ given an input } x) - Utility(\text{result of using } \hat{y} \text{ given an input } x) \]

• This is the most general formulation of the loss function. Often a simplified
version which is independent of x is used, i.e., L(y, 𝑦̂)
• Note that L(y, y) is always zero
• In general, small errors are better than large ones. Two functions that implement
that idea are the absolute value of the difference (called the 𝐿1 loss), and the
square of the difference (called the 𝐿2 loss).
• If we are content with the idea of minimizing error rate, we can use the L0/1 loss
function, which has a loss of 1 for an incorrect answer and is appropriate for
discrete-valued outputs

\[ L_1(y, \hat{y}) = |y - \hat{y}|, \qquad L_2(y, \hat{y}) = (y - \hat{y})^2, \qquad L_{0/1}(y, \hat{y}) = 0 \text{ if } y = \hat{y}, \text{ else } 1 \]

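• Let P(x, y) be a prior probability distribution over examples
• Let E be the set of all possible input-output examples
• Then the expected generalization loss for a hypothesis h (with respect to the loss
function L) is:

\[ GenLoss_L(h) = \sum_{(x, y) \in E} L(y, h(x)) \, P(x, y) \]

and the best hypothesis ℎ* is the one with the minimum expected generalization
loss

\[ h^* = \operatorname{argmin}_{h \in H} GenLoss_L(h) \]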


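• Because P(x, y) is not known, the learning agent can only estimate generalization
loss with the empirical loss on a set of examples E (here E denotes the N available
examples, not all possible ones):

\[ EmpLoss_{L,E}(h) = \frac{1}{N} \sum_{(x, y) \in E} L(y, h(x)) \]

and the estimated best hypothesis is the one with the minimum empirical loss

\[ \hat{h}^* = \operatorname{argmin}_{h \in H} EmpLoss_{L,E}(h) \]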
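
The three loss functions and the empirical loss are simple enough to write out directly. The sketch below is illustrative only; the function names and the three-example data set are assumptions made for this example.

```python
def l1_loss(y, y_hat):
    """Absolute value of the difference."""
    return abs(y - y_hat)

def l2_loss(y, y_hat):
    """Square of the difference."""
    return (y - y_hat) ** 2

def l01_loss(y, y_hat):
    """0 for a correct answer, 1 for an incorrect one."""
    return 0 if y == y_hat else 1

def empirical_loss(loss, h, examples):
    """Average loss of hypothesis h over a list of (x, y) examples."""
    return sum(loss(y, h(x)) for x, y in examples) / len(examples)

# Hypothetical data: h is exact on the first two examples and off by 1 on the third
examples = [(1, 2.0), (2, 4.0), (3, 7.0)]
h = lambda x: 2 * x

print(empirical_loss(l1_loss, h, examples))    # (0 + 0 + 1) / 3
print(empirical_loss(l2_loss, h, examples))    # (0 + 0 + 1) / 3
print(empirical_loss(l01_loss, h, examples))   # one mistake out of three
```

With the 0/1 loss, the empirical loss is exactly the error rate defined earlier in this section.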

Regularization:
• In the earlier section we did model selection with cross-validation on model size
• An alternative approach is to search for a hypothesis that directly minimizes the
weighted sum of empirical loss and the complexity of the hypothesis, which we
call the total cost.

\[ Cost(h) = EmpLoss(h) + \lambda \, Complexity(h) \]
\[ \hat{h}^* = \operatorname{argmin}_{h \in H} Cost(h) \]

• Here, λ is a parameter, a positive number that serves as a conversion rate
between loss and hypothesis complexity
• This process of explicitly penalizing complex hypotheses is called regularization

REGRESSION AND CLASSIFICATION WITH LINEAR MODELS

Regression and classification algorithms are both supervised learning algorithms. Both
are used for prediction in machine learning and work with labelled datasets. The
difference between them lies in the kind of machine learning problem each is used for.

The main difference between regression and classification algorithms is that
regression algorithms are used to predict continuous values such as price,


salary, age, etc., and classification algorithms are used to predict (classify)
discrete values such as Male or Female, True or False, Spam or Not Spam, etc.

Classification is the process of finding a function that divides the dataset
into classes based on different parameters. In classification, a computer program is
trained on the training dataset and, based on that training, it categorizes the data into
different classes.

Regression is the process of finding the relationship between dependent and
independent variables. It helps in predicting continuous variables, for example
market trends or house prices.

Univariate linear regression


The space defined by all possible settings of the weights is known as the weight
space. For univariate linear regression, the weight space defined by w0 and w1 is
two-dimensional, so we can graph the loss as a function of w0 and w1 in a 3D plot,
see Figure 18.13(b).
In this case, because we are trying to minimize the loss, we will use gradient
descent. We choose any starting point in weight space (here, a point in the (w0, w1)
plane) and then move to a neighbouring point that is downhill, repeating until we
converge on the minimum possible loss:

\[ w_i \leftarrow w_i - \alpha \frac{\partial}{\partial w_i} Loss(\mathbf{w}) \]

The parameter α, the step size, is usually called the learning rate when we are trying
to minimize loss in a learning problem.
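
A minimal sketch of gradient descent for univariate linear regression follows. It assumes the usual hypothesis form h(x) = w1 * x + w0 and a squared-error loss, which are not spelled out in the surviving text of these notes, and it averages the loss over the examples so that one learning rate works for any data set size. The function name, default step size, and toy data are likewise assumptions.

```python
def gradient_descent(points, alpha=0.05, steps=5000):
    """Fit h(x) = w1 * x + w0 by batch gradient descent on mean squared loss."""
    w0, w1 = 0.0, 0.0                       # arbitrary starting point in weight space
    n = len(points)
    for _ in range(steps):
        # Partial derivatives of (1/n) * sum (y - (w1 * x + w0))^2
        g0 = sum(-2 * (y - (w1 * x + w0)) for x, y in points) / n
        g1 = sum(-2 * (y - (w1 * x + w0)) * x for x, y in points) / n
        # Move downhill: step against the gradient, scaled by the learning rate
        w0 -= alpha * g0
        w1 -= alpha * g1
    return w0, w1

points = [(0, 1), (1, 3), (2, 5), (3, 7)]   # generated by y = 2x + 1
w0, w1 = gradient_descent(points)
print(w0, w1)                                # close to 1 and 2
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, convergence takes many more steps.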

Multivariate linear regression


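We can easily extend to multivariate linear regression problems, in which each
example xj is an n-element vector. Our hypothesis space is the set of functions of the
form

\[ h_{\mathbf{w}}(\mathbf{x}_j) = w_0 + w_1 x_{j,1} + \cdots + w_n x_{j,n} = w_0 + \sum_{i} w_i x_{j,i} \]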


With univariate linear regression we didn’t have to worry about overfitting. But
with multivariate linear regression in high-dimensional spaces it is possible that
some dimension that is actually irrelevant appears by chance to be useful, resulting
in overfitting. Thus, it is common to use regularization on multivariate linear
functions to avoid overfitting.

Linear classifiers with a hard threshold


Linear functions can be used to do classification as well as regression. A decision
boundary is a line (or a surface, in higher dimensions) that separates the
two classes. In the figure below, the decision boundary is a straight line. A linear
decision boundary is called a linear separator and data that admit such a separator
are called linearly separable.

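Alternatively, we can think of h as the result of passing the linear function w · x
through a threshold function:

\[ h_{\mathbf{w}}(\mathbf{x}) = Threshold(\mathbf{w} \cdot \mathbf{x}), \quad \text{where } Threshold(z) = 1 \text{ if } z \ge 0 \text{ and } 0 \text{ otherwise} \]

The threshold function is shown in the figure below.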
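
A hard-threshold linear classifier can be sketched as below, together with the perceptron learning rule w_i ← w_i + α (y − h_w(x)) x_i that the next section mentions. The rule is the standard one but is not stated in these notes, so treat it as an assumption. Each input vector carries a dummy first component x[0] = 1 so that w[0] acts as the bias weight; the function names and the logical-AND data are illustrative.

```python
def threshold(z):
    """Hard threshold: 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0

def predict(w, x):
    """h_w(x) = Threshold(w . x), where x[0] = 1 is a dummy input for the bias."""
    return threshold(sum(wi * xi for wi, xi in zip(w, x)))

def perceptron_train(examples, alpha=1.0, epochs=100):
    """Apply the perceptron rule to each example in turn, for a fixed number of passes."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            error = y - predict(w, x)        # 0 if correct, +1 or -1 if wrong
            w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable toy data: logical AND of the last two inputs
examples = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
w = perceptron_train(examples)
print(w, [predict(w, x) for x, _ in examples])
```

Because the data are linearly separable, the rule stops changing the weights once every example is classified correctly; on data that are not separable it would keep oscillating.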


Linear classification with logistic regression


We have seen that passing the output of a linear function through the threshold
function creates a linear classifier; yet the hard nature of the threshold causes some
problems: the hypothesis ℎw(x) is not differentiable and is in fact a discontinuous
function of its inputs and its weights; this makes learning with the perceptron rule a
very unpredictable adventure. Furthermore, the linear classifier always announces a
completely confident prediction of 1 or 0, even for examples that are very close to the
boundary; in many situations, we really need more gradated predictions.
All of these issues can be resolved to a large extent by softening the threshold
function, that is, approximating the hard threshold with a continuous, differentiable
function. The function used here is the logistic function, given by

\[ Logistic(z) = \frac{1}{1 + e^{-z}} \]

so that the hypothesis becomes

\[ h_{\mathbf{w}}(\mathbf{x}) = Logistic(\mathbf{w} \cdot \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}} \]

The function is shown in the figure below.

Notice that the output, being a number between 0 and 1, can be interpreted as a
probability of belonging to the class labelled 1. The hypothesis forms a soft boundary
in the input space and gives a probability of 0.5 for any input at the centre of the
boundary region, and approaches 0 or 1 as we move away from the boundary.


The process of fitting the weights of this model to minimize loss on a data set is
called logistic regression. There is no easy closed-form solution to find the optimal
value of w with this model, but the gradient descent computation is straightforward.
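
The gradient descent computation can be sketched as follows. This is one possible formulation, assuming a squared-error (L2) loss on the logistic output, for which the per-example update is w_i ← w_i + α (y − h_w(x)) h_w(x) (1 − h_w(x)) x_i; many libraries minimize log loss instead. The function names, learning rate, and one-dimensional toy data are assumptions made for this example.

```python
import math

def logistic(z):
    """Logistic function: a soft, differentiable version of the hard threshold."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(examples, alpha=0.1, epochs=5000):
    """Stochastic gradient descent on squared loss for h_w(x) = Logistic(w . x)."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            h = logistic(sum(wi * xi for wi, xi in zip(w, x)))
            # h * (1 - h) is the derivative of the logistic function at w . x
            step = alpha * (y - h) * h * (1 - h)
            w = [wi + step * xi for wi, xi in zip(w, x)]
    return w

# Toy data: x[0] = 1 is the dummy bias input; class 1 when the second input is large
examples = [((1, 0.5), 0), ((1, 1.0), 0), ((1, 1.5), 0),
            ((1, 2.5), 1), ((1, 3.0), 1), ((1, 3.5), 1)]
w = train_logistic(examples)

for x, y in examples:
    p = logistic(sum(wi * xi for wi, xi in zip(w, x)))
    print(x[1], y, round(p, 3))              # p is the estimated probability of class 1
```

Unlike the hard-threshold classifier, the outputs here move smoothly from near 0 to near 1 as the input crosses the boundary, so examples close to the boundary receive less confident predictions.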
