Module 1: ML
Figure 1-5. A labeled training set for supervised learning (e.g., spam classification)
Regression (Supervised Learning)
• Another typical task is to predict a target numeric value, such as the price
of a car, given a set of features (mileage, age, brand, etc.) called
predictors. This sort of task is called regression (Figure 1-6). To train the
system, you need to give it many examples of cars, including both their
predictors and their labels (i.e., their prices).
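As a rough illustration of such a regression task, the following sketch fits a linear model to a handful of made-up car records with scikit-learn; the feature columns, values, and prices are invented for illustration and are not from the book.

```python
# Minimal regression sketch with scikit-learn (illustrative data only).
from sklearn.linear_model import LinearRegression

# Each row is one car: [mileage_km, age_years]; prices are made-up labels.
X = [[120_000, 8], [45_000, 3], [80_000, 5], [15_000, 1]]
y = [4_500, 13_000, 8_500, 18_000]   # target numeric value: price

model = LinearRegression()
model.fit(X, y)                      # learn from predictors + labels
print(model.predict([[60_000, 4]]))  # predict the price of an unseen car
```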
Figure 1-10. Anomaly detection
Advantages of Unsupervised learning
• It does not require training data to be labeled.
• Dimensionality reduction can be easily accomplished using
unsupervised learning.
• Capable of finding previously unknown patterns in data.
• Unsupervised learning can help you gain insights from unlabeled data
that you might not have been able to get otherwise.
• Unsupervised learning is good at finding patterns and relationships in
data without being told what to look for. This can help you learn new
things about your data.
Disadvantages of Unsupervised learning
• Difficult to measure accuracy or effectiveness due to lack of
predefined answers during training.
• The results often have lower accuracy.
• The user needs to spend time interpreting and labeling the classes that result from the classification.
• Unsupervised learning can be sensitive to data quality, including
missing values, outliers, and noisy data.
• Without labeled data, it can be difficult to evaluate the performance
of unsupervised learning models, making it challenging to assess their
effectiveness.
Semi-supervised learning
• Semi-supervised learning is a type of machine learning that falls in
between supervised and unsupervised learning. It is a method that
uses a small amount of labeled data and a large amount of unlabeled
data to train a model.
• The goal of semi-supervised learning is to learn a function that can
accurately predict the output variable based on the input variables,
similar to supervised learning.
Example for Semi-supervised learning
• Some photo-hosting services, such as Google Photos, are good
examples of this.
• Once you upload all your family photos to the service, it automatically
recognizes that the same person A shows up in photos 1, 5, and 11,
while another person B shows up in photos 2, 5, and 7.
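A minimal sketch of the idea, assuming scikit-learn is available: a label-propagation model is given mostly unlabeled points (marked -1) plus a couple of labels, and it spreads those labels to the rest. The synthetic blobs stand in for the clusters of similar photos; this is only an illustration, not the actual Google Photos pipeline.

```python
# Semi-supervised sketch: a few labeled points plus many unlabeled ones
# (marked with -1), using scikit-learn's LabelSpreading. Data is synthetic.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.full(100, -1)          # -1 means "unlabeled"
y[0], y[50] = 0, 1            # only two labeled examples

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)               # labels propagate to the unlabeled points
print(model.transduction_[:5], model.transduction_[95:])
```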
3. Poor-Quality Data:
• Data plays a significant role in the machine learning process. Unclean and noisy data can
make the whole process extremely exhausting. We don’t want our algorithm to make
inaccurate or faulty predictions.
• We need to ensure that the process of data preprocessing, which includes removing outliers, handling missing values, and removing unwanted features, is done with great care (a minimal sketch follows).
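A minimal preprocessing sketch, assuming pandas; the column names and values are made up:

```python
# Toy preprocessing: fill a missing value and drop an obvious outlier.
import pandas as pd

df = pd.DataFrame({"mileage": [50_000, 60_000, None, 55_000, 2_000_000],
                   "price":   [9_000, 8_000, 7_500, 8_200, 9_500]})

df["mileage"] = df["mileage"].fillna(df["mileage"].median())   # fill missing value
q1, q3 = df["mileage"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["mileage"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # standard IQR rule
df = df[mask]                                                  # drop the outlier row
print(df)
```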
Main Challenges of Machine Learning
4. Irrelevant Features:
• The system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on.
• This process, called feature engineering, involves:
– Feature selection: selecting the most useful features to train on among existing features (see the sketch below).
– Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).
– Creating new features by gathering new data.
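A short feature-selection sketch, assuming scikit-learn; the synthetic dataset and the choice of SelectKBest with an F-test score are illustrative assumptions, not the only way to do it:

```python
# Feature-selection sketch: keep the most informative features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, n_redundant=0, random_state=0)
selector = SelectKBest(score_func=f_classif, k=3)
X_reduced = selector.fit_transform(X, y)        # keep the 3 best features
print(selector.get_support(indices=True))       # indices of selected features
```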
5. Overfitting the Training Data:
• Say you are visiting a foreign country and the taxi driver rips you off. You might be
tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is
something that we humans do all too often, and unfortunately machines can fall into the
same trap if we are not careful.
• In Machine Learning this is called overfitting: it means that the model performs well on
the training data, but it does not generalize well. Figure 1-22 shows an example of a
high-degree polynomial life satisfaction model that strongly overfits the training data.
• Constraining a model to make it simpler and reduce the risk of overfitting is called
regularization.
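The sketch below contrasts an unconstrained high-degree polynomial fit with the same model under an L2 penalty (Ridge). It is a toy illustration on synthetic data, not a reproduction of Figure 1-22:

```python
# Regularization sketch: degree-15 polynomial with and without an L2 penalty.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 20)

overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
regular = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0))
overfit.fit(X, y)
regular.fit(X, y)                                 # the penalty constrains the weights
print(overfit.score(X, y), regular.score(X, y))   # training fit only
```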
Designing a Learning System
1. Choosing the Training Experience
• There are three attributes that impact the success or failure of the learner.
1. The first attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system. In checkers, for example, the system might learn only from indirect training examples consisting of the move sequences and final outcomes of various games played.
• The information about the correctness of specific moves early in the game must be inferred indirectly from
the fact that the game was eventually won or lost.
• Here the learner faces an additional problem of credit assignment, or determining the degree to which each
move in the sequence deserves credit or blame for the final outcome.
• Credit assignment can be a particularly difficult problem because the game can be lost even when early
moves are optimal, if these are followed later by poor moves.
• Hence, learning from direct training feedback is typically easier than learning from indirect feedback.
2. A second important attribute of the training experience is the degree to which the
learner controls the sequence of training examples
For example, in checkers game:
• The learner might depend on the teacher to select informative board states and to provide the correct move for each.
• Alternatively, the learner might itself propose board states that it finds particularly confusing and ask the
teacher for the correct move.
• The learner may have complete control over both the board states and (indirect) training classifications, as it
does when it learns by playing against itself with no teacher present.
• Notice in this last case the learner may choose between experimenting with novel board states that it has not
yet considered, or refining its skill by playing minor variations of lines of play it currently finds
most promising.
3. A third attribute of the training experience is how well it represents the
distribution of examples over which the final system performance P must be
measured.
Learning is most reliable when the training examples follow a distribution similar to that of future test
examples.
• If its training experience E consists only of games played against itself, there is a danger that this training experience might not be fully representative of the distribution of situations over which it will later be tested.
For example, the learner might never encounter certain crucial board states that are very likely to be played
by the human checkers champion.
• In practice, it is often necessary to learn from a distribution of examples that is somewhat different from those on which the final system will be evaluated. Such situations are problematic because mastery of one distribution of examples will not necessarily lead to strong performance over some other distribution.
2. Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be
learned and how this will be used by the performance program.
• Let's begin with a checkers-playing program that can generate the legal moves
from any board state.
• The program needs only to learn how to choose the best move from among these
legal moves. This learning task is representative of a large class of tasks for
which the legal moves that define some large search space are known a priori, but
for which the best search strategy is not known.
Given this setting, where we must learn to choose among the legal moves, the most obvious choice for the type of information to be learned is a program, or function, that chooses the best move for any given board state. Let us call this function ChooseMove and use the notation
ChooseMove : B → M
to indicate that it accepts as input any board from the set of legal board states B and produces as output some move from the set of legal moves M.
ChooseMove is an obvious choice for the target function in the checkers example, but this function will turn out to be very difficult to learn given the kind of indirect training experience available to our system.
2. An alternative target function is an evaluation function that assigns a numerical
score to any given board state
Let us call this target function V and use the notation
V : B → ℝ
to denote that V maps any legal board state from the set B to some real value.
We intend for this target function V to assign higher scores to better board states. If
the system can successfully learn such a target function V, then it can easily use it to
select the best move from any current board position.
Let us define the target value V(b) for an arbitrary board state b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b'),
where b' is the best final board state that can be achieved starting from b and
playing optimally until the end of the game
3. Choosing a Representation for the
Target Function
Let us choose a simple representation: for any given board state, the function V̂ will be calculated as a linear combination of the following board features:
• x1: the number of black pieces on the board
• x2: the number of red pieces on the board
• x3: the number of black kings on the board
• x4: the number of red kings on the board
• x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
• x6: the number of red pieces threatened by black
Thus, the learning program will represent V̂(b) as a linear function of the form
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
Where,
• w0 through w6 are numerical coefficients, or weights, to be chosen by the
learning algorithm.
• Learned values for the weights w1 through w6 will determine the relative
importance of the various board features in determining the value of the board
• The weight w0 will provide an additive constant to the board value
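A minimal sketch of this linear representation in Python; the weight values and board features below are arbitrary, and extracting x1 through x6 from a real checkers position is assumed to happen elsewhere:

```python
# Sketch of the linear evaluation function V_hat(b) = w0 + w1*x1 + ... + w6*x6.
# A board is represented here simply by its feature values x1..x6.
def v_hat(features, weights):
    w0, ws = weights[0], weights[1:]
    return w0 + sum(w * x for w, x in zip(ws, features))

weights = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]   # arbitrary example weights
board_features = [3, 0, 1, 0, 0, 0]                # e.g. black has won (x2 = 0)
print(v_hat(board_features, weights))
```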
Partial design of a checkers learning program:
• Task T: playing checkers
• Performance measure P: percent of games won in the world tournament
• Training experience E: games played against itself
• Target function: V : Board → ℝ
• Target function representation: V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
The first three items above correspond to the specification of the learning task, whereas the final two items constitute design choices for the implementation of the learning program.
4. Choosing a Function Approximation
Algorithm
• In order to learn the target function f we require a set of training examples, each
describing a specific board state b and the training value Vtrain(b) for b.
• For instance, the following training example describes a board state b in which black has won the game (note x2 = 0 indicates that red has no remaining pieces) and for which the target function value Vtrain(b) is therefore +100:
⟨⟨x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0⟩, +100⟩
A simple approach for estimating training values for intermediate board states is
to assign the training value of Vtrain(b) for any intermediate board state b to be
V̂(Successor(b))
Where ,
V̂ is the learner's current approximation to V
Successor(b) denotes the next board state following b for which it is again the
program's turn to move
Vtrain(b) ← V̂ (Successor(b))
2. Adjusting the weights
Specify the learning algorithm for choosing the weights wi to best fit the set of
training examples {(b, Vtrain(b))}
A first step is to define what we mean by the best fit to the training data.
• One common approach is to define the best hypothesis, or set of weights, as that which minimizes the squared error E between the training values and the values predicted by the hypothesis:
E ≡ Σ (Vtrain(b) − V̂(b))², where the sum is taken over all training examples ⟨b, Vtrain(b)⟩.
• Several algorithms are known for finding weights of a linear function that
minimize E.
In our case, we require an algorithm that will incrementally refine the weights as
new training examples become available and that will be robust to errors in these
estimated training values
One such algorithm is called the least mean squares, or LMS training rule. For
each observed training example it adjusts the weights a small amount in the
direction that reduces the error on this training example
LMS weight update rule: For each training example ⟨b, Vtrain(b)⟩
• Use the current weights to calculate V̂(b)
• For each weight wi, update it as
wi ← wi + η (Vtrain(b) − V̂(b)) xi
where η is a small constant (e.g., 0.1) that moderates the size of the weight update.
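A minimal sketch of the LMS rule in Python; the learning rate, feature vectors, and Vtrain values are illustrative assumptions:

```python
# Sketch of the LMS weight update: nudge the weights toward reducing the error
# on each training example. eta is the learning rate; x0 = 1 pairs with w0.
def lms_update(weights, features, v_train, eta=0.1):
    x = [1.0] + list(features)
    v_hat = sum(w * xi for w, xi in zip(weights, x))
    error = v_train - v_hat
    return [w + eta * error * xi for w, xi in zip(weights, x)]

weights = [0.0] * 7
training = [([3, 0, 1, 0, 0, 0], 100.0),    # final board that black has won
            ([4, 3, 0, 1, 2, 0], 15.0)]     # Vtrain(b) taken from V_hat(Successor(b))
for feats, v_train in training:
    weights = lms_update(weights, feats, v_train)
print(weights)
```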
The final design of the checkers learning system can be described by four distinct program modules:
1. The Performance System is the module that must solve the given performance task (here, playing checkers) by using the learned target function(s). It takes an instance of a new problem (a new game) as input and produces a trace of its solution (the game history) as output.
2. The Critic takes as input the history or trace of the game and produces as output a set of training examples of the target function. As shown in the diagram, each training example in this case corresponds to some game state in the trace, along with an estimate Vtrain of the target function value for this example.
3. The Generalizer takes as input the training examples and produces an output hypothesis that is its estimate of the target function.
It generalizes from the specific training examples, hypothesizing a general function that covers these examples and other cases beyond the training examples.
In our example, the Generalizer corresponds to the LMS algorithm, and the output hypothesis is the function V̂ described by the learned weights w0, . . . , w6.
4. The Experiment Generator takes as input the current hypothesis and outputs a
new problem (i.e., initial board state) for the Performance System to explore. Its
role is to pick new practice problems that will maximize the learning rate of the
overall system.
In our example, the Experiment Generator always proposes the same initial game
board to begin a new game.
The sequence of design choices made for the checkers program is summarized in
below figure
Issues in Machine Learning
• What algorithms exist for learning general target functions from specific training
examples? In what settings will particular algorithms converge to the desired
function, given sufficient training data? Which algorithms perform best for
which types of problems and representations?
• How much training data is sufficient? What general bounds can be found to
relate the confidence in learned hypotheses to the amount of training experience
and the character of the learner's hypothesis space?
• When and how can prior knowledge held by the learner guide the process of
generalizing from examples? Can prior knowledge be helpful even when it is
only approximately correct?
• What is the best strategy for choosing a useful next training experience, and how
does the choice of this strategy alter the complexity of the learning problem?
• What is the best way to reduce the learning task to one or more function
approximation problems? Put another way, what specific functions should the
system attempt to learn? Can this process itself be automated?
• How can the learner automatically alter its representation to improve its ability to
represent and learn the target function?
Concept Learning
• Learning involves acquiring general concepts from specific training
examples. Example: People continually learn general concepts or
categories such as "bird," "car," "situations in which I should study
more in order to pass the exam," etc.
• Each such concept can be viewed as describing some subset of
objects or events defined over a larger set
• Alternatively, each concept can be thought of as a Boolean-
valued function defined over this larger set. (Example: A function
defined over all animals, whose value is true for birds and false for
other animals).
• Concept learning - Inferring a Boolean-valued function from training
examples of its input and output
A Concept Learning Task
Consider the example task of learning the target concept
"Days on which my friend Aldo enjoys his favorite water sport."
Example   Sky     AirTemp   Humidity   Wind     Water   Forecast   EnjoySport
1         Sunny   Warm      Normal     Strong   Warm    Same       Yes
2         Sunny   Warm      High       Strong   Warm    Same       Yes
3         Rainy   Cold      High       Strong   Warm    Change     No
4         Sunny   Warm      High       Strong   Cool    Change     Yes
The attribute EnjoySport indicates whether or not a Person enjoys his favorite water sport on this day.
What hypothesis representation is provided to the learner?
Let each hypothesis be a vector of six constraints, specifying the values of the six
attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.
The hypothesis that PERSON enjoys his favorite sport only on cold days with high
humidity (independent of the values of the other attributes) is represented by the
expression
(?, Cold, High, ?, ?, ?)
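For this representation, checking whether an instance satisfies a hypothesis is a simple attribute-by-attribute comparison; a minimal sketch in which "?" matches any value and any other constraint must match exactly:

```python
# Sketch: how a hypothesis such as (?, Cold, High, ?, ?, ?) classifies an instance.
def satisfies(instance, hypothesis):
    return all(h == "?" or h == x for h, x in zip(hypothesis, instance))

h = ("?", "Cold", "High", "?", "?", "?")
x = ("Sunny", "Cold", "High", "Strong", "Cool", "Change")
print(satisfies(x, h))   # True: the instance is classified as a positive example
```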
The concept or function to be learned is called the target concept, which we denote
by c.
c can be any Boolean valued function defined over the instances X
c : X → {0, 1}
Example: The target concept corresponds to the value of the attribute EnjoySport
(i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
• Instances for which c(x) = 1 are called positive examples, or members of the
target concept.
• Instances for which c(x) = 0 are called negative examples, or non-members of
the target concept.
• The ordered pair (x, c(x)) is used to describe the training example consisting of the instance x and its target concept value c(x).
• The symbol D denotes the set of available training examples.
• The symbol H denotes the set of all possible hypotheses that the learner may consider regarding the identity of the target concept. Each hypothesis h in H represents a Boolean-valued function defined over X:
h : X → {0, 1}
• The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in
X.
• Consider the two hypotheses h1 = (Sunny, ?, ?, Strong, ?, ?) and h2 = (Sunny, ?, ?, ?, ?, ?), and the sets of instances that are classified positive by h1 and by h2.
• Because h2 imposes fewer constraints on the instance, it classifies more instances as positive. So, any instance classified positive by h1 will also be classified positive by h2. Therefore, h2 is more general than h1.
General-to-Specific Ordering of Hypotheses
• Definition: Given hypotheses hj and hk, hj is more-general-than-or-equal-to hk if and only if any instance that satisfies hk also satisfies hj.
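For this hypothesis language the ordering can be tested directly: hj is more general than or equal to hk when every constraint of hj is either "?" or identical to the corresponding constraint of hk. A minimal sketch, using h1 and h2 from above:

```python
# Sketch of the more-general-than-or-equal-to test for this hypothesis language.
def more_general_or_equal(hj, hk):
    return all(a == "?" or a == b for a, b in zip(hj, hk))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
print(more_general_or_equal(h2, h1))   # True: h2 is more general than h1
print(more_general_or_equal(h1, h2))   # False: h1 is not more general than h2
```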
Representation
• Definition: A hypothesis h is consistent with a set of training examples D if and only if
h(x) = c(x) for each example (x, c(x)) in D.
Consistent(h, D) ≡ (∀⟨x, c(x)⟩ ∈ D) h(x) = c(x)
Definition: The version space, denoted VSH,D with respect to hypothesis space H
and training examples D, is the subset of hypotheses from H consistent with the
training examples in D
VSH,D ≡ {h ∈ H | Consistent(h, D)}
The LIST-THEN-ELIMINATE Algorithm
• The LIST-THEN-ELIMINATE algorithm first initializes the version space to
contain all hypotheses in H and then eliminates any hypothesis found inconsistent
with any training example.
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example, (x, c(x))
remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace
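A minimal sketch of LIST-THEN-ELIMINATE, feasible only for very small hypothesis spaces; the two-attribute domain and training examples are made up for illustration:

```python
# Sketch of LIST-THEN-ELIMINATE on a deliberately tiny hypothesis space:
# enumerate every hypothesis, then drop each one inconsistent with some example.
from itertools import product

DOMAINS = [("Sunny", "Rainy", "?"), ("Warm", "Cold", "?")]   # toy two-attribute H
H = list(product(*DOMAINS))                                   # all 9 hypotheses

def classifies_positive(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

D = [(("Sunny", "Warm"), True),     # (instance, c(x)) pairs, made up
     (("Rainy", "Cold"), False)]

version_space = [h for h in H
                 if all(classifies_positive(h, x) == label for x, label in D)]
print(version_space)   # hypotheses in H consistent with D
```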
Sketch of proof:
1. Let g, h, s be arbitrary members of G, H, S respectively, with g ≥g h ≥g s.
By the definition of S, s must be satisfied by all positive examples in D. Because h ≥g s, h must also be satisfied by all positive examples in D.
By the definition of G, g cannot be satisfied by any negative example in D, and because g ≥g h, h cannot be satisfied by any negative example in D. Because h is satisfied by all positive examples in D and by no negative examples in D, h is consistent with D, and therefore h is a member of VSH,D.
2. It can be proven by assuming some h in VSH,D,that does not satisfy the right-hand side of the
expression, then showing that this leads to an inconsistency
The CANDIDATE-ELIMINATION Learning Algorithm
• Initialize G to the set of maximally general hypotheses in H
• Initialize S to the set of maximally specific hypotheses in H
• For each training example d, do:
• If d is a positive example
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d
• Remove s from S
• Add to S all minimal generalizations h of s such that
• h is consistent with d, and some member of G is more general than h
• Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
• Remove g from G
• Add to G all minimal specializations h of g such that
• h is consistent with d, and some member of S is more specific than h
• Remove from G any hypothesis that is less general than another hypothesis in G
An Illustrative Example
The boundary sets are first initialized to G0 and S0, the most general and most specific hypotheses in H:
S0: {⟨∅, ∅, ∅, ∅, ∅, ∅⟩}
G0: {⟨?, ?, ?, ?, ?, ?⟩}
For training example d1 = ⟨(Sunny, Warm, Normal, Strong, Warm, Same), EnjoySport = Yes⟩:
S1: {⟨Sunny, Warm, Normal, Strong, Warm, Same⟩}        G1 = G0
For training example d2 = ⟨(Sunny, Warm, High, Strong, Warm, Same), EnjoySport = Yes⟩:
S2: {⟨Sunny, Warm, ?, Strong, Warm, Same⟩}             G2 = G1
For training example d3 = ⟨(Rainy, Cold, High, Strong, Warm, Change), EnjoySport = No⟩:
S3 = S2
G3: {⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩, ⟨?, ?, ?, ?, ?, Same⟩}
For training example d4 = ⟨(Sunny, Warm, High, Strong, Cool, Change), EnjoySport = Yes⟩:
S4: {⟨Sunny, Warm, ?, Strong, ?, ?⟩}
G4: {⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩}
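The trace above can be reproduced with a short program. This is a minimal sketch, not the textbook's implementation: the attribute domains, the encoding of constraints ("?" for any value, None for the empty constraint ∅), and the helper functions are assumptions made for illustration.

```python
# Minimal CANDIDATE-ELIMINATION sketch for the conjunctive hypothesis language.
DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]

def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general(h1, h2):
    # h1 >=_g h2: every instance satisfied by h2 is also satisfied by h1.
    if any(c is None for c in h2):          # h2 satisfies no instance at all
        return True
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def min_generalize(s, x):
    # Minimal generalization of s that covers the positive example x.
    return tuple(v if c is None else (c if c == v else "?") for c, v in zip(s, x))

def min_specializations(g, x):
    # Minimal specializations of g that exclude the negative example x.
    out = []
    for i, (c, v) in enumerate(zip(g, x)):
        if c == "?":
            out.extend(g[:i] + (val,) + g[i + 1:] for val in DOMAINS[i] if val != v)
    return out

def candidate_elimination(examples):
    S = [tuple([None] * len(DOMAINS))]      # S0: most specific boundary
    G = [tuple(["?"] * len(DOMAINS))]       # G0: most general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            S = [min_generalize(s, x) for s in S]
            S = [s for s in S if any(more_general(g, s) for g in G)]
        else:
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):        # g is already consistent with d
                    new_G.append(g)
                else:                       # specialize g minimally
                    new_G.extend(h for h in min_specializations(g, x)
                                 if any(more_general(h, s) for s in S))
            G = [g for g in new_G           # drop less general members of G
                 if not any(h != g and more_general(h, g) for h in new_G)]
    return S, G

examples = [(("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
            (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
            (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
            (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True)]
S, G = candidate_elimination(examples)
print("S:", S)   # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print("G:", G)   # [('Sunny', '?', ...), ('?', 'Warm', ...)]
```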
• If we apply the CANDIDATE-ELIMINATION algorithm as before to a target concept that this conjunctive hypothesis space cannot represent (such as "Sky = Sunny or Sky = Cloudy"), we end up with an empty version space.
• After generalizing on the positive examples, the new most specific hypothesis is overly general: it incorrectly covers the third (negative) training example!
• A more expressive hypothesis space that allows disjunctions avoids this problem. For instance, the target concept "Sky = Sunny or Sky = Cloudy" could then be described as
(Sunny, ?, ?, ?, ?, ?) ∨ (Cloudy, ?, ?, ?, ?, ?)
Definition:
Consider a concept learning algorithm L for the set of instances X.
• Let c be an arbitrary concept defined over X
• Let Dc = {( x , c(x))} be an arbitrary set of training examples of c.
• Let L(xi, Dc) denote the classification assigned to the instance xi by L after training on the
data Dc.
• The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc
(∀xi ∈ X) [(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]
where y ⊢ z indicates that z follows deductively from y.