ML - Unit 1 - Part I
UNIT-1
INTRODUCTION
Ever since computers were invented, we have wondered whether they might be made to
learn. If we could understand how to program them to learn-to improve automatically with
experience-the impact would be dramatic.
Imagine computers learning from medical records which treatments are most effective
for new diseases
Houses learning from experience to optimize energy costs based on the particular usage
patterns of their occupants.
Personal software assistants learning the evolving interests of their users in order to
highlight especially relevant stories from the online morning newspaper
A successful understanding of how to make computers learn would open up many new uses
of computers and new levels of competence and customization
Some tasks cannot be defined well, except by examples (e.g., recognizing people).
Relationships and correlations can be hidden within large amounts of data. Machine
Learning/Data Mining may be able to find these relationships.
Human designers often produce machines that do not work as well as desired in the
environments in which they are used.
The amount of knowledge available about certain tasks might be too large for explicit
encoding by humans (e.g., medical diagnostics).
Environments change over time.
New knowledge about tasks is constantly being discovered by humans. It may be
difficult to continuously re-design systems “by hand”.
WELL-POSED LEARNING PROBLEMS
Definition: A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.
Examples
1. Checkers game: A computer program that learns to play checkers might improve its
performance as measured by its ability to win at the class of tasks involving playing
checkers games, through experience obtained by playing games againstitself.
DESIGNING A LEARNING SYSTEM
The basic design issues and approaches to machine learning are illustrated by designing a
program to learn to play checkers, with the goal of entering it in the world checkers
tournament
1. Choosing the Training Experience
2. Choosing the Target Function
3. Choosing a Representation for the Target Function
4. Choosing a Function Approximation Algorithm
1. Estimating training values
2. Adjusting the weights
5. The Final Design
The first design choice is to choose the type of training experience from which the
system will learn.
The type of training experience available can have a significant impact on success or
failure of the learner.
There are three attributes which impact on success or failure of the learner
1. Whether the training experience provides direct or indirect feedback regarding the
choices made by the performance system.
Indirect training examples consisting of the move sequences and final outcomes of
various games played. The information about the correctness of specific moves early
in the game must be inferred indirectly from the fact that the game was eventually won
or lost.
Here the learner faces an additional problem of credit assignment, or determining the
degree to which each move in the sequence deserves credit or blame for the final
outcome. Credit assignment can be a particularly difficult problem because the game
can be lost even when early moves are optimal, if these are followed later by poor
moves.
Hence, learning from direct training feedback is typically easier than learning from
indirect feedback.
2. The degree to which the learner controls the sequence of training examples
The learner might rely on the teacher to select informative board states and to provide
the correct move for each. Alternatively, the learner might itself propose board states
that it finds particularly confusing and ask the teacher for the correct move.
The learner may have complete control over both the board states and (indirect) training
classifications, as it does when it learns by playing against itself with no teacher present.
3. How well it represents the distribution of examples over which the final system
performance P must be measured
The next design choice is to determine exactly what type of knowledge will be learned and
how this will be used by the performance program.
Let’s consider a checkers-playing program that can generate the legal moves from any board
state.
The program needs only to learn how to choose the best move from among these legal moves.
Since we must learn to choose among the legal moves, the most obvious choice for the type
of information to be learned is a program, or function, that chooses the best move for any
given board state:
ChooseMove : B → M
which indicates that this function accepts as input any board from the set of legal board
states B and produces as output some move from the set of legal moves M.
ChooseMove is an obvious choice for the target function in the checkers example, but this
function will turn out to be very difficult to learn given the kind of indirect training
experience available to our system. An alternative target function, which turns out to be
easier to learn, is an evaluation function that assigns a numerical score to any given
board state:
V : B → ℝ
which denotes that V maps any legal board state from the set B to some real value.
We intend for this target function V to assign higher scores to better board states. If the
system can successfully learn such a target function V, then it can easily use it to select
the best move from any current board position.
Let us define the target value V(b) for an arbitrary board state b in B, as follows:
If b is a final board state that is won, then V(b) =100
If b is a final board state that is lost, then V(b) =-100
If b is a final board state that is drawn, then V(b) =0
If b is not a final state in the game, then V(b) = V(b'), where b' is the best final board
state that can be achieved starting from b and playing optimally until the end of the game.
Let’s choose a simple representation: for any given board state, the function V̂ will be
calculated as a linear combination of the following board features:
x1: the number of black pieces on the board
x2: the number of red pieces on the board
x3: the number of black kings on the board
x4: the number of red kings on the board
x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
x6: the number of red pieces threatened by black
Thus, the learning program will represent V̂(b) as a linear function of the form
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
where,
w0 through w6 are numerical coefficients, or weights, to be chosen by the learning
algorithm.
Learned values for the weights w1 through w6 will determine the relative importance
of the various board features in determining the value of the board
The weight w0 will provide an additive constant to the board value
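As a minimal sketch (the weight and feature values below are invented for illustration, not taken from the text), the linear evaluation function can be computed directly from the weights and board features:

```python
# Sketch of the linear board-evaluation function
# V̂(b) = w0 + w1*x1 + ... + w6*x6 for checkers.
# The weights and feature values here are made-up examples.

def v_hat(weights, features):
    """weights = [w0, w1, ..., w6]; features = [x1, ..., x6]."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

weights = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]
features = [3, 0, 1, 0, 0, 0]    # e.g., 3 black pieces, 1 black king
print(v_hat(weights, features))  # 0.5 + 1.0*3 + 2.0*1 = 5.5
```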
In order to learn the target function V we require a set of training examples, each describing a
specific board state b and the training value Vtrain(b) for b.
For instance, the following training example describes a board state b in which black has won
the game (note x2 = 0 indicates that red has no remaining pieces) and for which the target
function value Vtrain(b) is therefore +100:
((x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100)
The function approximation procedure requires two steps:
1. Derive training examples from the indirect training experience available to the learner
2. Adjust the weights wi to best fit these training examples
A simple approach for estimating training values for intermediate board states is to
assign the training value of Vtrain(b) for any intermediate board state b to
be V̂(Successor(b))
Where ,
V̂ is the learner's current approximation toV
Successor(b) denotes the next board state following b for which it is again the
program's turn to move
Vtrain(b) ← V̂ (Successor(b))
To learn the weights we must first define what it means to best fit the training examples.
One common approach is to choose the weights that minimize the squared error E between the
training values and the values predicted by the hypothesis V̂:
E ≡ Σ (Vtrain(b) − V̂(b))², summed over all training examples (b, Vtrain(b))
Several algorithms are known for finding weights of a linear function that minimize E.
One such algorithm is called the least mean squares, or LMS, training rule. For each
observed training example it adjusts the weights a small amount in the direction that
reduces the error on this training example.
LMS weight update rule :- For each training example (b, Vtrain(b))
Use the current weights to calculate V̂ (b)
For each weight wi, update it as
wi ← wi + ƞ (Vtrain(b) − V̂(b)) xi
Here ƞ is a small constant (e.g., 0.1) that moderates the size of the weight update.
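The LMS rule above can be sketched in a few lines of Python, under the assumption that the features x1..x6 are supplied as a list and the convention x0 = 1 gives w0 its additive-constant update (the training value below is an invented example):

```python
# LMS weight update for the linear evaluation function: each weight
# moves by eta * (Vtrain(b) - V̂(b)) * xi, with x0 fixed at 1.

def v_hat(weights, features):
    return sum(w * x for w, x in zip(weights, [1.0] + features))

def lms_update(weights, features, v_train, eta=0.1):
    error = v_train - v_hat(weights, features)
    xs = [1.0] + features          # x0 = 1 so w0 absorbs the constant term
    return [w + eta * error * x for w, x in zip(weights, xs)]

w = [0.0] * 7                      # start with all weights zero
w = lms_update(w, [3, 0, 1, 0, 0, 0], v_train=100.0)
print(w)
```

With zero initial weights the error is 100, so each weight moves by eta times 100 times its feature value.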
The Final Design
1. The Performance System is the module that must solve the given performance task by using
the learned target function(s). It takes an instance of a new problem (new game) as
input and produces a trace of its solution (game history) as output.
2. The Critic takes as input the history or trace of the game and produces as output a set
of training examples of the target function
3. The Generalizer takes as input the training examples and produces an output
hypothesis that is its estimate of the target function. It generalizes from the specific
training examples, hypothesizing a general function that covers these examples and
other cases beyond the training examples.
4. The Experiment Generator takes as input the current hypothesis and outputs a new
problem (i.e., initial board state) for the Performance System to explore. Its role is to
pick new practice problems that will maximize the learning rate of the overall system.
The sequence of design choices made for the checkers program is summarized in the figure below.
ISSUES IN MACHINE LEARNING
The field of machine learning raises many open questions, including:
When and how can prior knowledge held by the learner guide the process of generalizing
from examples? Can prior knowledge be helpful even when it is only approximately
correct?
What is the best strategy for choosing a useful next training experience, and how does
the choice of this strategy alter the complexity of the learning problem?
What is the best way to reduce the learning task to one or more function
approximation problems?
Put another way, what specific functions should the system attempt to learn? Can this
process itself be automated?
How can the learner automatically alter its representation to improve its ability to
represent and learn the target function?
CONCEPT LEARNING
Learning involves acquiring general concepts from specific training examples. Example:
People continually learn general concepts or categories such as "bird," "car," "situations in
which I should study more in order to pass the exam,"etc.
Each such concept can be viewed as describing some subset of objects or events defined
over a larger set.
Alternatively, each concept can be thought of as a Boolean-valued function defined over this
larger set. (Example: A function defined over all animals, whose value is true for birds and
false for other animals.)
Consider the example task of learning the target concept "Days on which Aldo enjoys
his favorite water sport”
Table: Positive and negative training examples for the target concept EnjoySport.
Example | Sky   | AirTemp | Humidity | Wind   | Water | Forecast | EnjoySport
1       | Sunny | Warm    | Normal   | Strong | Warm  | Same     | Yes
2       | Sunny | Warm    | High     | Strong | Warm  | Same     | Yes
3       | Rainy | Cold    | High     | Strong | Warm  | Change   | No
4       | Sunny | Warm    | High     | Strong | Cool  | Change   | Yes
The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the
values of its other attributes.
Each hypothesis is represented by a conjunction of constraints on the six attributes, where
each constraint may be "?" (any value is acceptable), a specific value (e.g., Warm), or "Ø"
(no value is acceptable).
If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive
example (h(x) = 1).
The hypothesis that PERSON enjoys his favorite sport only on cold days with high humidity
is represented by the expression
(?, Cold, High, ?, ?, ?)
Notation
The set of items over which the concept is defined is called the set of instances, which is
denoted by X.
Example: X is the set of all possible days, each represented by the attributes: Sky, Air Temp,
Humidity, Wind, Water, and Forecast
The concept or function to be learned is called the target concept, which is denoted by c.
c can be any Boolean-valued function defined over the instances X:
c : X → {0, 1}
Example: The target concept corresponds to the value of the attribute EnjoySport
(i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
The symbol H denotes the set of all possible hypotheses that the learner may consider
regarding the identity of the target concept. Each hypothesis h in H represents a Boolean-
valued function defined over X:
h : X → {0, 1}
The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.
Determine:
A hypothesis h in H such that h(x) = c(x) for all x in X.
The inductive learning hypothesis: Any hypothesis found to approximate the target function
well over a sufficiently large set of training examples will also approximate the target
function well over other unobserved examples.
Concept learning can be viewed as the task of searching through a large space of
hypotheses implicitly defined by the hypothesis representation.
The goal of this search is to find the hypothesis that best fits the training examples.
Example:
Consider the instances X and hypotheses H in the EnjoySport learning task. The attribute
Sky has three possible values, and AirTemp, Humidity, Wind, Water, and Forecast each have
two possible values, so the instance space X contains exactly
3 · 2 · 2 · 2 · 2 · 2 = 96 distinct instances.
Since each attribute in a hypothesis can additionally take the values "?" and "Ø", there are
5 · 4 · 4 · 4 · 4 · 4 = 5120 syntactically distinct hypotheses within H.
Every hypothesis containing one or more "Ø" symbols represents the empty set of instances;
that is, it classifies every instance as negative. Therefore, the number of semantically
distinct hypotheses is only
1 + (4 · 3 · 3 · 3 · 3 · 3) = 973.
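These counts are easy to verify mechanically; the short sketch below just multiplies out the attribute domain sizes:

```python
# Verify the instance and hypothesis counts for EnjoySport.
# Sky has 3 values; the other five attributes have 2 each.

values = [3, 2, 2, 2, 2, 2]

instances = 1
for v in values:
    instances *= v              # number of distinct instances

syntactic = 1
for v in values:
    syntactic *= v + 2          # each attribute also allows '?' and 'Ø'

semantic = 1
for v in values:
    semantic *= v + 1           # attribute values plus '?' ...
semantic += 1                   # ... plus the single all-'Ø' hypothesis

print(instances, syntactic, semantic)  # 96 5120 973
```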
Consider the two hypotheses h1 = (Sunny, ?, ?, Strong, ?, ?) and h2 = (Sunny, ?, ?, ?, ?, ?),
and the sets of instances that are classified positive by h1 and by h2.
Because h2 imposes fewer constraints on the instance, it classifies more instances as positive,
and any instance classified positive by h1 will also be classified positive by h2. Therefore,
h2 is more general than h1.
Given hypotheses hj and hk, hj is more general than or equal to hk if and only if any instance
that satisfies hk also satisfies hj.
Definition: Let hj and hk be Boolean-valued functions defined over X. Then hj is more-general-
than-or-equal-to hk (written hj ≥g hk) if and only if
(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
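The more-general-than-or-equal-to relation is easy to check mechanically for this attribute-vector representation. The sketch below is a minimal illustration; treating a hypothesis containing "Ø" as matching no instance (so anything is vacuously more general than it) is an assumption that follows from the definition:

```python
# hj is more-general-than-or-equal-to hk iff every instance
# satisfying hk also satisfies hj.

def more_general_or_equal(hj, hk):
    if 'Ø' in hk:                # hk matches no instance: vacuously true
        return True
    # Otherwise each hj constraint must be '?' or match hk's constraint.
    return all(a == '?' or a == b for a, b in zip(hj, hk))

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')
print(more_general_or_equal(h2, h1))  # True: h2 is more general than h1
print(more_general_or_equal(h1, h2))  # False
```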
FIND-S Algorithm
FIND-S:
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   For each attribute constraint ai in h
      If the constraint ai is satisfied by x, then do nothing
      Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
To illustrate this algorithm, assume the learner is given the sequence of training examples
from the EnjoySport task
The first step of FIND-S is to initialize h to the most specific hypothesis in H:
h ← (Ø, Ø, Ø, Ø, Ø, Ø)
Observing the first training example, it is clear that hypothesis h is too specific. None
of the "Ø" constraints in h are satisfied by this example, so each is replaced by the next
more general constraint that fits the example
h1 = <Sunny Warm Normal Strong Warm Same>
The second training example forces the algorithm to further generalize h, this time
substituting a "?" in place of any attribute value in h that is not satisfied by the new
example
h2 = <Sunny Warm ? Strong Warm Same>
Upon encountering the third training example, the algorithm makes no change to h: the
FIND-S algorithm simply ignores every negative example.
h3 = <Sunny Warm ? Strong Warm Same>
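The trace above can be reproduced with a compact sketch of FIND-S, run on the three training examples just shown (attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast):

```python
# FIND-S: start from the most specific hypothesis and generalize it
# just enough to cover each positive example; ignore negatives.

def find_s(examples):
    h = ['Ø'] * len(examples[0][0])
    for instance, label in examples:
        if label != 'Yes':              # negative examples are ignored
            continue
        for i, value in enumerate(instance):
            if h[i] == 'Ø':
                h[i] = value            # first positive: copy its values
            elif h[i] != value:
                h[i] = '?'              # generalize a mismatched constraint
    return h

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), 'No'),
]
print(find_s(data))  # ['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']
```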
Questions left unanswered by FIND-S:
• Has the learner converged to the correct target concept?
• Why prefer the most specific hypothesis?
• Are the training examples consistent?
• What if there are several maximally specific consistent hypotheses?
VERSION SPACES AND THE CANDIDATE-ELIMINATION ALGORITHM
Representation
Definition: A hypothesis h is consistent with a set of training examples D if and only if
h(x) = c(x) for each example (x, c(x)) in D.
Definition: The version space, denoted VSH,D, with respect to hypothesis space H and
training examples D, is the subset of hypotheses from H consistent with the training
examples in D.
The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain all
hypotheses in H and then eliminates any hypothesis found inconsistent with any training
example.
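Because this hypothesis space is tiny, LIST-THEN-ELIMINATE can be brute-forced directly. The sketch below assumes Sky's three values are Sunny, Cloudy, and Rainy, and uses the standard four EnjoySport training examples:

```python
# LIST-THEN-ELIMINATE: enumerate every hypothesis, then discard
# those inconsistent with any training example.
from itertools import product

domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'),
           ('Normal', 'High'), ('Strong', 'Weak'),
           ('Warm', 'Cool'), ('Same', 'Change')]

def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def consistent(h, example):
    x, label = example
    return satisfies(h, x) == (label == 'Yes')

# Every syntactically distinct hypothesis (values plus '?' and 'Ø').
hypotheses = list(product(*[d + ('?', 'Ø') for d in domains]))
print(len(hypotheses))          # 5120

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), 'Yes'),
]
version_space = [h for h in hypotheses
                 if all(consistent(h, d) for d in data)]
print(len(version_space))       # 6 hypotheses survive
```

Enumeration is only feasible here because the space has 5120 hypotheses; realistic hypothesis spaces are far too large, which is what motivates the boundary-set representation below.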
The version space is represented by its most general and least general members. These
members form general and specific boundary sets that delimit the version space within the
partially ordered hypothesis space.
Definition: The general boundary G, with respect to hypothesis space H and training data D,
is the set of maximally general members of H consistent with D
Definition: The specific boundary S, with respect to hypothesis space H and training data D,
is the set of minimally general (i.e., maximally specific) members of H consistent with D.
The version space representation theorem states that
VSH,D = {h ∈ H | (∃s ∈ S) (∃g ∈ G) (g ≥g h ≥g s)}
To prove:
1. Every h satisfying the right-hand side of the above expression is in VSH,D
2. Every member of VSH,D satisfies the right-hand side of the expression
Sketch of proof:
1. Let g, h, s be arbitrary members of G, H, S respectively, with g ≥g h ≥g s.
By the definition of S, s must be satisfied by all positive examples in D. Because h ≥g s,
h must also be satisfied by all positive examples in D.
By the definition of G, g cannot be satisfied by any negative example in D, and because
g ≥g h, h cannot be satisfied by any negative example in D. Because h is satisfied by all
positive examples in D and by no negative examples in D, h is consistent with D, and
therefore h is a member of VSH,D.
2. It can be proven by assuming some h in VSH,D that does not satisfy the right-hand side
of the expression, then showing that this leads to an inconsistency.
CANDIDATE-ELIMINATION algorithm: for each training example d, do:
• If d is a positive example
  • Remove from G any hypothesis inconsistent with d
  • For each hypothesis s in S that is not consistent with d
    • Remove s from S
    • Add to S all minimal generalizations h of s such that
      h is consistent with d, and some member of G is more general than h
    • Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example
  • Remove from S any hypothesis inconsistent with d
  • For each hypothesis g in G that is not consistent with d
    • Remove g from G
    • Add to G all minimal specializations h of g such that
      h is consistent with d, and some member of S is more specific than h
    • Remove from G any hypothesis that is less general than another hypothesis in G
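The "minimal specializations" step for a negative example can be sketched as follows: each candidate replaces one "?" in g with an attribute value the negative instance does not have. The attribute domains below are assumed as in the EnjoySport task, with Sky taking Sunny, Cloudy, or Rainy:

```python
# Minimal specializations of a hypothesis g against a negative
# instance: replace one '?' at a time with any domain value the
# negative instance does not have.

def minimal_specializations(g, domains, negative):
    specs = []
    for i, constraint in enumerate(g):
        if constraint != '?':
            continue
        for value in domains[i]:
            if value != negative[i]:
                specs.append(g[:i] + (value,) + g[i + 1:])
    return specs

domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'),
           ('Normal', 'High'), ('Strong', 'Weak'),
           ('Warm', 'Cool'), ('Same', 'Change')]
g0 = ('?',) * 6
neg = ('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change')
print(len(minimal_specializations(g0, domains, neg)))  # 7 candidates
```

The algorithm then discards every candidate for which no member of S is more specific, which is why only a few of these specializations survive into the G boundary in the worked example.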
An Illustrative Example
We begin by initializing the S boundary set to contain the most specific (least general)
hypothesis, and the G boundary set to contain the most general hypothesis in H:
S0 = <Ø, Ø, Ø, Ø, Ø, Ø>
G0 = <?, ?, ?, ?, ?, ?>
When the first training example (a positive example) is presented, S is generalized just
enough to cover it, leaving G unchanged (G1 = G0). When the second training example is
observed, it has a similar effect of generalizing S further to S2, leaving G again
unchanged, i.e., G2 = G1 = G0.
Consider the third training example. This negative example reveals that the G boundary of
the version space is overly general; that is, the hypothesis in G incorrectly predicts that
this new example is a positive example.
The hypothesis in the G boundary must therefore be specialized until it correctly
classifies this new negative example.
Given that there are six attributes that could be specified to specialize G2, why are there
only three new hypotheses in G3?
For example, the hypothesis h = (?, ?, Normal, ?, ?, ?) is a minimal specialization of G2
that correctly labels the new example as a negative example, but it is not included in G3.
The reason this hypothesis is excluded is that it is inconsistent with the previously
encountered positive examples.
Consider the fourth training example. This positive example further generalizes the S
boundary of the version space. It also results in removing one member of the G boundary,
because this member fails to cover the new positive example.
After processing these four examples, the boundary sets S4 and G4 delimit the version
space of all hypotheses consistent with the set of incrementally observed training examples.