Unit 1
A30528–MACHINE LEARNING
Introduction to Machine Learning: What is Machine Learning,
Examples of Various Learning Paradigms, Designing the Learning
System, Perspectives and Issues, Version Spaces, Finite and Infinite
Hypothesis Spaces, PAC Learning
Supervised Learning - I: Learning a Class from Examples, Linear, Non-linear,
Multi-class and multi-label classification
Generalization error bounds: VC Dimension,
Decision Trees: ID3, Classification, and Regression Trees,
Regression: Linear Regression, Multiple Linear Regression, Logistic Regression.
Design of a learning system
1. Type of training experience
2. Choosing the Target Function
3. Choosing a representation for the Target Function
4. Choosing an approximation algorithm for the Target Function
5. The final Design
1. Task T: To play checkers
2. Performance measure P: Percent of games won in the tournament.
3. Training experience E: A set of games played against itself
• Type of training experience —
During the design of the checkers learning system, the type of training
experience available will have a significant effect on the success or
failure of the learning.
1. Direct or Indirect training experience — In the case of direct training
experience, an individual board states and correct move for each board state are
given. In case of indirect training experience, the move sequences for a game and
the final result (win, loss or draw) are given for a number of games. How to assign
credit or blame to individual moves is the credit assignment problem.
2. Teacher or Not — Supervised: the training experience is labeled, i.e., every
board state is labeled with the correct move, so learning takes place in the
presence of a supervisor or teacher. Unsupervised: the training experience is
unlabeled, i.e., the board states carry no moves, so the learner generates
random games and plays against itself with no supervision or teacher
involvement.
Semi-supervised: the learner generates game states and asks the teacher for
help in finding the correct move when a board state is confusing.
3. Is the training experience good — Do the training examples represent the
distribution of examples over which the final system performance will be
measured? Performance is best when training examples and test examples are
from the same or a similar distribution.
Choosing the Target Function: When you are playing checkers, at any moment
you must choose the best move from several possibilities, applying the
learning you have gained from experience.
• Here there are 2 considerations —
• direct and indirect experience:
• With direct experience, the checkers learning system needs only to learn
how to choose the best move within some large search space. We need a
target function that helps us choose the best move among the alternatives.
Let us call this function ChooseMove and use the notation ChooseMove : B → M
to indicate that it accepts as input any board from the set of legal board
states B and produces as output some move from the set of legal moves M.
• With indirect experience, it becomes difficult to learn such a function
directly. Instead, we can assign a real-valued score to each board state.
Let us therefore define the target value V(b) for an arbitrary board state b in B, as
follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b’),
where b’ is the best final board state that can be achieved starting from b and playing
optimally until the end of the game.
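The four cases above can be written down directly. The sketch below encodes only the final-state cases; `is_won`, `is_lost`, and `is_drawn` are hypothetical predicates on a board state, and the recursive case for non-final states is left unimplemented because it presumes optimal play to the end of the game (which is exactly why V is called nonoperational).

```python
def target_value(b, is_won, is_lost, is_drawn):
    """Return the ideal target value V(b) for a final board state b."""
    if is_won(b):
        return 100
    if is_lost(b):
        return -100
    if is_drawn(b):
        return 0
    # Non-final state: V(b) = V(b'), where b' is the best final board state
    # reachable from b under optimal play -- not directly computable.
    raise NotImplementedError("V is nonoperational for non-final states")
```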
Choosing a representation for the Target Function
• Now that we have specified the ideal target function V, we must choose a representation that
the learning program will use to describe the function ^V that it will learn. As with earlier
design choices, we again have many options.
• Let us choose a simple representation: for any given board state, the function ^V will be
calculated as a linear combination of the following board features:
• x1(b) — number of black pieces on board b
• x2(b) — number of red pieces on b
• x3(b) — number of black kings on b
• x4(b) — number of red kings on b
• x5(b) — number of red pieces threatened by black (i.e., which can be taken on black’s next
turn)
• x6(b) — number of black pieces threatened by red .
• ^V(b) = w0 + w1 · x1(b) + w2 · x2(b) + w3 · x3(b) + w4 · x4(b) + w5 · x5(b) + w6 · x6(b)
• Where w0 through w6 are numerical coefficients or weights to be obtained by a learning
algorithm.
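Evaluating ^V for a board is then just a weighted sum of the six feature values. The sketch below shows this; the particular weight and feature values are illustrative placeholders, not learned coefficients.

```python
def v_hat(weights, features):
    """Evaluate ^V(b) = w0 + w1*x1(b) + ... + w6*x6(b).

    weights  -- the list [w0, w1, ..., w6]
    features -- the feature values [x1(b), ..., x6(b)] for a board b
    """
    w0, *ws = weights
    return w0 + sum(w * x for w, x in zip(ws, features))

weights = [0.5, 1.0, -1.0, 2.0, -2.0, 1.5, -1.5]   # w0..w6 (illustrative)
features = [3, 0, 1, 0, 0, 0]                      # x1..x6 for some board
print(v_hat(weights, features))                    # 5.5
```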
• Specification of the Machine Learning Problem at this time — Till now we
worked on choosing the type of training experience, choosing the target function
and its representation. The checkers learning task can be summarized as below.
• Task T : Play Checkers
• Performance Measure : % of games won in world tournament
• Training Experience E : opportunity to play against itself
• Target Function : V : Board → R
• Target Function Representation :
^V(b) = w0 + w1 · x1(b) + w2 · x2(b) + w3 · x3(b) + w4 · x4(b) + w5 · x5(b) + w6 · x6(b)
Choosing an approximation algorithm for the Target Function
• Generating training data — To train our learning program, we need a set of
training data, each describing a specific board state b and the training value
V_train(b) for b. Each training example is an ordered pair <b, V_train(b)>.
For example, a training example may be
• <(x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100>
• Let Successor(b) denote the next board state following b for which it is
again the program’s turn to move, and let ^V be the learner’s current
approximation to V. Using this information, assign the training value
V_train(b) for any intermediate board state b as: V_train(b) ← ^V(Successor(b))
• Adjusting the weights: One common approach is to define the best hypothesis
as that which minimizes the squared error E between the training values and
the values predicted by the hypothesis ^V.
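One common gradient-style rule for reducing the squared error (V_train(b) − ^V(b))² is to adjust each weight in proportion to the error and the corresponding feature value (the LMS update). The sketch below assumes a small learning rate `eta` and illustrative feature values; it is one plausible realization, not the only weight-tuning method.

```python
def lms_update(weights, features, v_train, eta=0.1):
    """One LMS step: w_i <- w_i + eta * (V_train - ^V(b)) * x_i(b).

    weights  -- [w0, w1, ..., w6]; features -- [x1(b), ..., x6(b)]
    """
    xs = [1.0] + list(features)            # x0 = 1 pairs with the bias w0
    v_hat = sum(w * x for w, x in zip(weights, xs))
    error = v_train - v_hat                # training value minus prediction
    return [w + eta * error * x for w, x in zip(weights, xs)]
```

Repeated application of this update on the same example drives the prediction toward the training value, since each step shrinks the error by a constant factor when `eta` is small enough.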
Final Design for Checkers Learning system
• The final design of our checkers learning system can be naturally described by
four distinct program modules that represent the central components in many
learning systems.
• 1. The Performance System — Takes a new board as input and outputs a trace of
the game it played against itself.
• 2. The Critic — Takes the trace of a game as an input and outputs a set of training
examples of the target function.
• 3. The Generalizer — Takes training examples as input and outputs a hypothesis
that estimates the target function. Good generalization to new cases is crucial.
• 4. The Experiment Generator — Takes the current hypothesis (currently learned
function) as input and outputs a new problem (an initial board state) for the
performance system to explore.
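The interaction of the four modules can be sketched as a simple loop. The module implementations passed in here are hypothetical stand-ins, not Mitchell’s actual checkers program; only the data flow between modules follows the description above.

```python
def learning_loop(perf_system, critic, generalizer, experiment_generator,
                  hypothesis, n_iterations):
    """Wire the four modules together for a fixed number of iterations."""
    for _ in range(n_iterations):
        board = experiment_generator(hypothesis)   # new initial problem
        trace = perf_system(board, hypothesis)     # trace of a self-played game
        examples = critic(trace)                   # training examples from trace
        hypothesis = generalizer(examples)         # revised hypothesis
    return hypothesis
```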
Perspectives and Issues
• Perspectives in Machine Learning
• One useful perspective on machine learning is that it involves searching a very
large space of possible hypotheses to determine one that best fits the observed
data and any prior knowledge held by the learner.
• For example, consider the space of hypotheses that could in principle be output
by the above checkers learner. This hypothesis space consists of all evaluation
functions that can be represented by some choice of values for the weights w0
through w6. The learner's task is thus to search through this vast space to locate
the hypothesis that is most consistent with the available training examples.
Issues in Machine Learning
• What algorithms exist for learning general target functions from specific
training examples? In what settings will particular algorithms converge to the
desired function, given sufficient training data? Which algorithms perform
best for which types of problems and representations?
• How much training data is sufficient? What general bounds can be found to
relate the confidence in learned hypotheses to the amount of training
experience and the character of the learner's hypothesis space?
• When and how can prior knowledge held by the learner guide the process of
generalizing from examples? Can prior knowledge be helpful even when it is
only approximately correct?
• What is the best strategy for choosing a useful next training experience, and
how does the choice of this strategy alter the complexity of the learning
problem?
Issues in Machine Learning
• What is the best way to reduce the learning task to one or more
function approximation problems? Put another way, what specific
functions should the system attempt to learn? Can this process itself
be automated?
• How can the learner automatically alter its representation to
improve its ability to represent and learn the target function?
Version Spaces
• Definition (Version space)
• A concept is complete if it covers all positive examples.
• A concept is consistent if it covers none of the negative examples. The version
space is the set of all complete and consistent concepts. This set is convex and is
fully defined by its least and most general elements.
• The key idea in the CANDIDATE-ELIMINATION algorithm is to output a description
of the set of all hypotheses consistent with the training examples
Representation
• The Candidate-Elimination algorithm finds all describable
hypotheses that are consistent with the observed training
examples. In order to define this algorithm precisely, we begin with a
few basic definitions. First, let us say that a hypothesis is consistent
with the training examples if it correctly classifies these examples.
• Definition: A hypothesis h is consistent with a set of training examples
D if and only if h(x) = c(x) for each example (x, c(x)) in D.
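This definition translates directly into code, and the version space is then just the subset of hypotheses that pass the test. In the sketch below a hypothesis is a tuple of attribute constraints with '?' meaning "any value"; the tiny two-attribute space is an illustrative assumption, and this brute-force enumeration is the naive "list-then-eliminate" approach rather than CANDIDATE-ELIMINATION itself.

```python
from itertools import product

def consistent(h, examples):
    """h is consistent with D iff h(x) = c(x) for every (x, c(x)) in D."""
    covers = lambda h, x: all(c == '?' or c == v for c, v in zip(h, x))
    return all(covers(h, x) == label for x, label in examples)

def version_space(hypotheses, examples):
    """All hypotheses consistent with the training examples."""
    return [h for h in hypotheses if consistent(h, examples)]

# Toy space: Sky in {Sunny, Rainy}, Temp in {Warm, Cold}, plus '?' wildcards.
H = list(product(['Sunny', 'Rainy', '?'], ['Warm', 'Cold', '?']))
D = [(('Sunny', 'Warm'), 1), (('Rainy', 'Cold'), 0)]
print(version_space(H, D))
# [('Sunny', 'Warm'), ('Sunny', '?'), ('?', 'Warm')]
```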
Concept Learning
• Inducing general functions from specific training examples is a
main issue of machine learning.
• Concept Learning: Acquiring the definition of a general category
from given sample positive and negative training examples of the
category.
• Concept Learning can be seen as a problem of searching through a
predefined space of potential hypotheses for the hypothesis
that best fits the training examples.
• The hypothesis space has a general-to-specific ordering of
hypotheses, and the search can be efficiently organized by taking
advantage of a naturally occurring structure over the hypothesis
space.
Concept Learning
• A Formal Definition for Concept Learning:
[Table of example days omitted: each day is described by six ATTRIBUTES,
and EnjoySport is the target CONCEPT.]
• A set of example days, each described by six attributes.
• The task is to learn to predict the value of EnjoySport for an arbitrary day.
EnjoySport – Hypothesis
Representation
• Each hypothesis consists of a conjunction of constraints on the
instance attributes.
• Each hypothesis will be a vector of six constraints, specifying the values of
the six attributes
– (Sky, AirTemp, Humidity, Wind, Water, and Forecast).
• Each attribute constraint will be one of:
? – indicating any value is acceptable for the attribute (don’t care)
single value – specifying a single required value (e.g. Warm) (specific)
0 – indicating no value is acceptable for the attribute (no value)
Hypothesis Representation
• A hypothesis:
Sky AirTemp Humidity Wind Water Forecast
< Sunny, ? , ? , Strong , ? , Same >
• The most general hypothesis – that every day is a positive example
<?, ?, ?, ?, ?, ?>
• The most specific hypothesis – that no day is a positive example
<0, 0, 0, 0, 0, 0>
• EnjoySport concept learning task requires learning the sets of days
for which EnjoySport=yes, describing this set by a conjunction of
constraints over the instance attributes.
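Checking a hypothesis against an instance under these three constraint forms can be sketched as follows: '?' accepts any value, '0' accepts no value, and any other constraint requires an exact match. The example hypothesis and instance are taken from the slide above.

```python
def satisfies(hypothesis, instance):
    """True iff the instance meets every attribute constraint."""
    return all(c != '0' and (c == '?' or c == v)
               for c, v in zip(hypothesis, instance))

h = ('Sunny', '?', '?', 'Strong', '?', 'Same')
x = ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')
print(satisfies(h, x))           # True
print(satisfies(('0',) * 6, x))  # most specific hypothesis: always False
print(satisfies(('?',) * 6, x))  # most general hypothesis: always True
```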
EnjoySport Concept Learning
Task
Given
– Instances X : set of all possible days, each described by the attributes
• Sky – (values: Sunny, Cloudy, Rainy)
• AirTemp – (values: Warm, Cold)
• Humidity – (values: Normal, High)
• Wind – (values: Strong, Weak)
• Water – (values: Warm, Cold)
• Forecast – (values: Same, Change)
– Target Concept (Function) c : EnjoySport : X → {0,1}
– Hypotheses H : Each hypothesis is described by a conjunction of constraints on the
attributes.
– Training Examples D : positive and negative examples of the target function
Determine
– A hypothesis h in H such that h(x) = c(x) for all x in D.
The Inductive Learning
Hypothesis
• Although the learning task is to determine a hypothesis h identical to
the target concept c over the entire set of instances X, the only
information available about c is its value over the training examples.
– Inductive learning algorithms can at best guarantee that the output hypothesis fits the
target concept over the training data.
– Lacking any further information, our assumption is that the best hypothesis regarding
unseen instances is the hypothesis that best fits the observed training data. This is
the fundamental assumption of inductive learning.
• Now consider the sets of instances that are classified positive by h1 and by h2.
– Because h2 imposes fewer constraints on the instance, it classifies more instances as
positive.
– In fact, any instance classified positive by h1 will also be classified positive by h2.
– Therefore, we say that h2 is more general than h1.
More-General-Than
Relation
• For any instance x in X and hypothesis h in H, we say that x satisfies
h if and only if h(x) = 1.
• More-General-Than-Or-Equal Relation:
Let h1 and h2 be two boolean-valued functions defined over X.
Then h1 is more-general-than-or-equal-to h2 (written h1
≥ h2) if and only if
any instance that satisfies h2 also satisfies h1.
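For conjunctive hypotheses over attribute constraints, the "every instance satisfying h2 also satisfies h1" condition reduces to a constraint-by-constraint check, as sketched below. The '0' clause reflects that a hypothesis with any '0' constraint satisfies no instance, so anything is more general than or equal to it; the three-attribute example is illustrative.

```python
def more_general_or_equal(h1, h2):
    """h1 >= h2: every instance satisfying h2 also satisfies h1."""
    return all(c2 == '0' or c1 == '?' or c1 == c2
               for c1, c2 in zip(h1, h2))

h1 = ('Sunny', '?', '?')
h2 = ('Sunny', 'Warm', '?')
print(more_general_or_equal(h1, h2))  # True: h1 covers everything h2 covers
print(more_general_or_equal(h2, h1))  # False: h2 rejects (Sunny, Cold, ...)
```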
• From the first two examples, S2 : <?, Warm, Normal, Strong, Cool, Change>
• This is inconsistent with the third example, and there are no hypotheses
consistent with all three examples.
PROBLEM: We have biased the learner to consider only conjunctive hypotheses.
We require a more expressive hypothesis space.
Inductive Bias - An
Unbiased Learner
• The obvious solution to the problem of assuring that the target concept
is in the hypothesis space H is to provide a hypothesis space capable
of representing every teachable concept.
– Every possible subset of the instances X: the power set of X.
CANDIDATE-ELIMINATION: New instances are classified only in the case where all
members of the current version space agree on the classification. Otherwise, the
system refuses to classify the new instance.
Inductive Bias: the target concept can be represented in its hypothesis space.
FIND-S: This algorithm, described earlier, finds the most specific hypothesis consistent
with the training examples. It then uses this hypothesis to classify all subsequent
instances.
Inductive Bias: the target concept can be represented in its hypothesis space, and all
instances are negative instances unless the opposite is entailed by its other
knowledge.
Concept Learning -
Summary
• Concept learning can be seen as a problem of searching through a large
predefined space of potential hypotheses.
• The general-to-specific partial ordering of hypotheses provides a
useful structure for organizing the search through the hypothesis
space.
• The FIND-S algorithm utilizes this general-to-specific ordering,
performing a specific-to-general search through the hypothesis space
along one branch of the partial ordering, to find the most specific
hypothesis consistent with the training examples.
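The FIND-S search described above can be sketched as follows: start from the most specific hypothesis <0, ..., 0> and minimally generalize it to cover each positive example, ignoring negatives. The training data below are the classic EnjoySport examples used earlier in the notes.

```python
def find_s(examples, n_attrs):
    """Return the most specific conjunctive hypothesis covering all positives."""
    h = ['0'] * n_attrs                    # most specific hypothesis
    for x, label in examples:
        if label != 1:
            continue                        # FIND-S ignores negative examples
        for i, v in enumerate(x):
            if h[i] == '0':
                h[i] = v                    # first positive: copy its values
            elif h[i] != v:
                h[i] = '?'                  # conflict: minimally generalize
    return tuple(h)

D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 1),
     (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 1),
     (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 0),
     (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 1)]
print(find_s(D, 6))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```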
• The CANDIDATE-ELIMINATION algorithm utilizes this general-to-
specific ordering to compute the version space (the set of all hypotheses
consistent with the training data) by incrementally computing the sets
of maximally specific (S) and maximally general (G) hypotheses.
Concept Learning -
Summary
• Because the S and G sets delimit the entire set of hypotheses
consistent with the data, they provide the learner with a description of
its uncertainty regarding the exact identity of the target concept. This
version space of alternative hypotheses can be examined
– to determine whether the learner has converged to the target concept,
– to determine when the training data are inconsistent,
– to generate informative queries to further refine the version space, and
– to determine which unseen instances can be unambiguously classified based on the
partially learned concept.
• The CANDIDATE-ELIMINATION algorithm is not robust to noisy
data or to situations in which the unknown target concept is not
expressible in the provided hypothesis space.
Concept Learning -
Summary
• Inductive learning algorithms are able to classify unseen examples
only because of their implicit inductive bias for selecting one
consistent hypothesis over another.
• If the hypothesis space is enriched to the point where there is a
hypothesis corresponding to every possible subset of instances (the
power set of the instances), this will remove any inductive bias from
the CANDIDATE-ELIMINATION algorithm.
– Unfortunately, this also removes the ability to classify any instance beyond the observed
training examples.
– An unbiased learner cannot make inductive leaps to classify unseen examples.