
HKBK College of Engineering
Department of Artificial Intelligence & Machine Learning

Subject Name: Machine Learning
Subject Code: 21AI63
Course Coordinator: Prof. Shruthi V Kulkarni


Introduction
• Ever since computers were invented, we have wondered whether they might be made to learn. If we could understand how to program them to learn, that is, to improve automatically with experience, the impact would be dramatic.
• Imagine computers learning from medical records which treatments are most effective for new diseases.
• Houses learning from experience to optimize energy costs based on the particular usage patterns of their occupants.
• Personal software assistants learning the evolving interests of their users in order to highlight especially relevant stories from the online morning newspaper.
What Is Machine Learning?
• Machine Learning is the science (and art) of programming computers so
they can learn from data.

Here is a slightly more general definition:


• [Machine Learning is the] field of study that gives computers the ability to
learn without being explicitly programmed. —Arthur Samuel, 1959

And a more engineering-oriented one:


• A computer program is said to learn from experience E with respect to
some task T and some performance measure P, if its performance on T, as
measured by P, improves with experience E. —Tom Mitchell, 1997
Why Use Machine Learning?
• Consider how you would write a spam filter using traditional programming techniques

Figure 1-1 The traditional approach


The traditional approach
1. First you would look at what spam typically looks like. You might
notice that some words or phrases (such as “4U,” “credit card,”
“free,” and “amazing”) tend to come up a lot in the subject. Perhaps
you would also notice a few other patterns in the sender’s name,
the email’s body, and so on.
2. You would write a detection algorithm for each of the patterns that
you noticed, and your program would flag emails as spam if a
number of these patterns are detected.
3. You would test your program, and repeat steps 1 and 2 until it is
good enough.
Machine Learning approach
• In contrast, a spam filter based on Machine Learning techniques automatically learns which words and phrases are good predictors of spam by detecting unusually frequent word patterns in the spam examples compared to the ham examples (Figure 1-2). ("Ham" refers to legitimate, non-spam messages.)
• The program is much shorter, easier to maintain, and most likely more
accurate.
• Moreover, if spammers notice that all their emails containing “4U” are
blocked, they might start writing “For U” instead.
• A spam filter using traditional programming techniques would need to be
updated to flag “For U” emails. If spammers keep working around your
spam filter, you will need to keep writing new rules forever.
Figure 1-2. Machine Learning approach

Machine Learning approach
• In contrast, a spam filter based on Machine Learning techniques
automatically notices that “For U” has become unusually frequent in
spam flagged by users, and it starts flagging them without your
intervention (Figure 1-3).
Figure 1-3. Automatically adapting to change
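A minimal sketch of this learning-based approach (scikit-learn assumed; the tiny email dataset and the naive Bayes choice are illustrative, not the text's prescribed method):

# Learn word-frequency patterns from labeled examples instead of
# hand-writing detection rules. Dataset invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "4U free credit card amazing offer",    # spam
    "free entry win amazing prize 4U",      # spam
    "meeting agenda for tomorrow morning",  # ham
    "lunch plans this weekend?",            # ham
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
classifier = MultinomialNB().fit(X, labels)

# New phrasings such as "For U" are handled by retraining on freshly
# flagged examples rather than by writing new rules.
test = vectorizer.transform(["For U: free credit card"])
print(classifier.predict(test))  # expected: [1] (spam)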
Machine Learning can help humans learn
• Finally, Machine Learning can help humans learn (Figure 1-4): ML
algorithms can be inspected to see what they have learned. For
instance, once the spam filter has been trained on enough spam, it
can easily be inspected to reveal the list of words and combinations
of words that it believes are the best predictors of spam.
• Sometimes this will reveal unsuspected correlations or new trends,
and thereby lead to a better understanding of the problem.
• Applying ML techniques to dig into large amounts of data can help
discover patterns that were not immediately apparent. This is called
data mining.
Figure 1-4. Machine Learning can help humans learn
Machine Learning is great for
• Problems for which existing solutions require a lot of hand-tuning or
long lists of rules: one Machine Learning algorithm can often simplify
code and perform better.
• Complex problems for which there is no good solution at all using a
traditional approach: the best Machine Learning techniques can find
a solution.
• Fluctuating environments: a Machine Learning system can adapt to
new data.
• Getting insights about complex problems and large amounts of data.
Types of Machine Learning Systems
• Whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised, and reinforcement learning)
• Whether or not they can learn incrementally on the fly (online versus
batch learning)
• Whether they work by simply comparing new data points to known
data points, or instead detect patterns in the training data and build a
predictive model, much like scientists do (instance-based versus
model-based learning)
Supervised learning
• Supervised learning is a type of machine learning algorithm that learns
from labeled data. Labeled data is data that has been tagged with a correct
answer or classification.
• In supervised machine learning, the machine mainly focuses on regression and classification types of problems.
• We know the correct output and the relationship between input and output; the algorithms deal with labeled datasets.
• For example, a labeled dataset of images of elephants, camels, and cows would have each image tagged with "Elephant", "Camel", or "Cow".
• Supervised learning is classified into two categories of algorithms:
• Regression: a regression problem is when the output variable is a real value, such as "dollars" or "weight".
• Classification: a classification problem is when the output variable is a category, such as "red" or "blue", or "disease" or "no disease".
Classification Supervised learning
• The spam filter is a good example of this: it is trained with many
example emails along with their class (spam or ham), and it must
learn how to classify new emails.

Figure 1-5. A labeled training set for supervised learning (e.g., spam classification)
Regression Supervised learning
• Another typical task is to predict a target numeric value, such as the price
of a car, given a set of features (mileage, age, brand, etc.) called
predictors. This sort of task is called regression (Figure 1-6). To train the
system, you need to give it many examples of cars, including both their
predictors and their labels (i.e., their prices).

Figure 1-6. Regression
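A minimal sketch of such a regression task (scikit-learn assumed; the car data below is invented for illustration):

# Predictors in (mileage, age), numeric target out (price).
from sklearn.linear_model import LinearRegression

X = [[50_000, 3], [20_000, 1], [120_000, 8], [80_000, 5]]  # mileage, age
y = [15_000, 22_000, 5_000, 9_000]                         # prices (labels)

reg = LinearRegression().fit(X, y)
print(reg.predict([[60_000, 4]]))  # predicted price for an unseen car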


Advantages of Supervised learning
• Supervised learning allows collecting data and producing outputs based on previous experience.
• Helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-
world computation problems.
• It performs classification and regression tasks.
• It allows estimating or mapping the result to a new sample.
• We have complete control over choosing the number of classes we
want in the training data.
Disadvantages of Supervised learning
• Classifying big data can be challenging.
• Training a supervised model requires a lot of computation time and resources.
• Supervised learning cannot handle every complex task in Machine Learning.
• It requires a labelled dataset.
• It requires a training process.
Unsupervised learning
• Unsupervised learning allows the model to discover patterns and
relationships in unlabeled data.
• Clustering algorithms group similar data points together based on
their inherent characteristics.
• Feature extraction captures essential information from the data,
enabling the model to make meaningful distinctions.
• Label association assigns categories to the clusters based on the
extracted patterns and characteristics.

Figure 1-7. An unlabeled training set for unsupervised learning
Unsupervised learning algorithms fall into several categories
1. Clustering: in clustering, the data is divided into segments and meaningful groups. Each group has its own pattern through which the data is arranged and segmented.
• For example, say you have a lot of data about your blog’s visitors. You
may want to run a clustering algorithm to try to detect groups of similar
visitors (Figure 1-8). At no point do you tell the algorithm which group a
visitor belongs to: it finds those connections without your help. For
example, it might notice that 40% of your visitors are males who love
comic books and generally read your blog in the evening, while 20% are
young sci-fi lovers who visit during the weekends, and so on.
Figure 1-8. Clustering
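A minimal clustering sketch for the visitor example (scikit-learn assumed; the two features, say age and visits per week, are invented):

from sklearn.cluster import KMeans

# Each row is one visitor: [age, visits per week] (illustrative values).
visitors = [[18, 5], [20, 6], [45, 1], [50, 2], [19, 7], [48, 1]]
labels = KMeans(n_clusters=2, n_init=10).fit_predict(visitors)
print(labels)  # e.g. [0 0 1 1 0 1]: groups found without any labels given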

2. Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted (Figure 1-9).

Figure 1-9. Example of a t-SNE visualization highlighting semantic clusters
3. Anomaly detection—detecting unusual credit card transactions to
prevent fraud, catching manufacturing defects, or automatically
removing outliers from a dataset before feeding it to another learning
algorithm. The system is shown mostly normal instances during
training, so it learns to recognize them and when it sees a new instance
it can tell whether it looks like a normal one or whether it is likely an
anomaly (Figure 1-10).

Figure 1-10. Anomaly detection
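A minimal anomaly-detection sketch (scikit-learn assumed; Isolation Forest is one possible detector, and the transaction amounts are invented):

from sklearn.ensemble import IsolationForest

# Train on mostly normal transaction amounts (illustrative values).
normal = [[50], [60], [55], [45], [52], [58], [49], [61]]
detector = IsolationForest(random_state=0).fit(normal)
print(detector.predict([[54], [5_000]]))  # 1 = normal, -1 = likely anomaly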
Advantages of Unsupervised learning
• It does not require training data to be labeled.
• Dimensionality reduction can be easily accomplished using
unsupervised learning.
• Capable of finding previously unknown patterns in data.
• Unsupervised learning can help you gain insights from unlabeled data
that you might not have been able to get otherwise.
• Unsupervised learning is good at finding patterns and relationships in
data without being told what to look for. This can help you learn new
things about your data.
Disadvantages of Unsupervised learning
• Difficult to measure accuracy or effectiveness due to lack of
predefined answers during training.
• The results are often less accurate.
• The user needs to spend time interpreting and labeling the classes that result from the clustering.
• Unsupervised learning can be sensitive to data quality, including
missing values, outliers, and noisy data.
• Without labeled data, it can be difficult to evaluate the performance
of unsupervised learning models, making it challenging to assess their
effectiveness.
Semi-supervised learning
• Semi-supervised learning is a type of machine learning that falls in
between supervised and unsupervised learning. It is a method that
uses a small amount of labeled data and a large amount of unlabeled
data to train a model.
• The goal of semi-supervised learning is to learn a function that can
accurately predict the output variable based on the input variables,
similar to supervised learning.
Example for Semi-supervised learning
• Some photo-hosting services, such as Google Photos, are good
examples of this.
• Once you upload all your family photos to the service, it automatically
recognizes that the same person A shows up in photos 1, 5, and 11,
while another person B shows up in photos 2, 5, and 7.

Figure 1-11. Semi-supervised learning
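A minimal semi-supervised sketch (scikit-learn assumed; self-training with a k-NN base classifier is one illustrative approach, the data is invented, and unlabeled points are marked -1):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.1], [4.9], [1.1], [5.2]])
y = np.array([0, 0, -1, 1, 1, -1, -1, -1])  # -1 marks unlabeled points

# Self-training: confident predictions on unlabeled data become labels.
model = SelfTrainingClassifier(KNeighborsClassifier(n_neighbors=2)).fit(X, y)
print(model.predict([[1.05], [4.95]]))  # expected: [0 1]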


Reinforcement learning
• Reinforcement learning is an area of Machine Learning. It is about
taking suitable action to maximize reward in a particular situation.
• The learning system, called an agent in this context, can observe the
environment, select and perform actions, and get rewards in return
(or penalties in the form of negative rewards, as in Figure 1-12).
• It must then learn by itself what is the best strategy, called a policy, to
get the most reward over time.
• A policy defines what action the agent should choose when it is in a
given situation.
Figure 1-12. Reinforcement Learning
Advantages of Reinforcement learning
• Reinforcement learning can be used to solve very complex problems that
cannot be solved by conventional techniques.
• The model can correct the errors that occurred during the training process.
• In RL, training data is obtained via the direct interaction of the agent with the
environment
• Reinforcement learning can handle environments that are non-deterministic,
meaning that the outcomes of actions are not always predictable. This is useful
in real-world applications where the environment may change over time or is
uncertain.
• Reinforcement learning can be used to solve a wide range of problems,
including those that involve decision making, control, and optimization.
• Reinforcement learning is a flexible approach that can be combined with other
machine learning techniques, such as deep learning, to improve performance.
Disadvantages of Reinforcement learning
• Reinforcement learning is not preferable for solving simple problems.
• Reinforcement learning needs a lot of data and a lot of computation.
• Reinforcement learning is highly dependent on the quality of the
reward function. If the reward function is poorly designed, the agent
may not learn the desired behavior.
• Reinforcement learning can be difficult to debug and interpret. It is
not always clear why the agent is behaving in a certain way, which can
make it difficult to diagnose and fix problems.
Batch and Online Learning
• Batch Learning
• During batch learning, data is gathered over time. The machine learning
model is then periodically trained using this accumulated data in batches.
Because the model is unable to learn progressively from a stream of real-
time data, it is the exact reverse of online learning.
• In batch learning, the machine learning algorithm does not modify its
parameters until batches of fresh data have been consumed.
• Large batches of accumulated data are used to train models, which requires more time and resources such as CPU, memory, and disk input/output.
• Additionally, it requires more time to deploy models into production
because this can only be done periodically depending on how well the
model performs after being trained with fresh data.
• A model that was learned using batch learning must be retrained using
the fresh dataset if it has to learn about new data.
Batch and Online Learning
• Online Learning
• Online machine learning is a sort of machine learning in which the best
predictor for future data is updated at each step using data that is received
sequentially.
• In online machine learning the best prediction model for future data is
updated continuously and sequentially, as new data keeps arriving. Thus
every time new data arrives, the model parameters get updated based on
the new data.
• At each stage the training is quite fast and cheap, also the model is always
up to date because parameters associated with the model adjust
themselves based on the new data.
• The application of online learning might be in stock market prediction or
weather forecasting. Also, if computational resources are a concern, you
can go for online learning.
• Online machine learning is also a good choice in scenarios when a model
has to learn from feedback. Online learning also saves storage space,
because you keep discarding the data from which it has learned already.
Figure 1-13. Online learning
Figure 1-14. Using online learning to handle huge datasets
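A minimal online-learning sketch (scikit-learn assumed; SGDRegressor's partial_fit updates the model one mini-batch at a time, and the streaming data is synthetic):

import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])

for _ in range(200):                     # data arriving as a stream
    X_batch = rng.normal(size=(32, 3))
    y_batch = X_batch @ true_w + rng.normal(0, 0.1, size=32)
    model.partial_fit(X_batch, y_batch)  # incremental update, no retraining
    # each batch can now be discarded, saving storage space

print(model.coef_)  # should approach [1, -2, 0.5]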
Instance-Based Versus Model-Based Learning
• Instance-Based Learning
• Instance-based learning involves using the entire dataset to make
predictions. The machine learns by storing all instances of data and
then using these instances to make predictions on new data. The
machine compares the new data to the instances it has seen before
and uses the closest match to make a prediction.
• In instance-based learning, no model is created. Instead, the machine
stores all of the training data and uses this data to make predictions
based on new data. Instance-based learning is often used in pattern
recognition, clustering, and anomaly detection.
Figure 1-15. Instance-based learning
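A minimal instance-based sketch (scikit-learn assumed; k-nearest neighbors simply stores the training instances and predicts from the closest matches, and the points are invented):

from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [8, 8], [9, 8]]
y_train = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)     # "training" just stores the instances
print(knn.predict([[2, 1]]))  # compared to stored points; expected: [0]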
Instance-Based Versus Model-Based Learning
• Model-Based Learning
• Model-based learning involves creating a mathematical model that can
predict outcomes based on input data. The model is trained on a large
dataset and then used to make predictions on new data. The model can be
thought of as a set of rules that the machine uses to make predictions.
• In model-based learning, the training data is used to create a model that
can be generalized to new data. The model is typically created using
statistical algorithms such as linear regression, logistic regression, decision
trees, and neural networks. These algorithms use the training data to
create a mathematical model that can be used to predict outcomes.
Figure 1-16. Model-based learning
Advantages of Model-Based Learning
• Faster predictions: Model-based learning is typically faster than
instance-based learning because the model is already created and can
be used to make predictions quickly.
• More accurate predictions: Model-based learning can often make
more accurate predictions than instance-based learning because the
model is trained on a large dataset and can generalize to new data.
• Better understanding of data: Model-based learning allows you to gain a better understanding of the relationships between input and output variables. This can help identify which variables are most important in making predictions.
Disadvantages of Model-Based Learning
• Requires a large dataset: model-based learning requires a large dataset to train
the model. This can be a disadvantage if you have a small dataset.
• Requires expert knowledge: Model-based learning requires expert knowledge of
statistical algorithms and mathematical modeling. This can be a disadvantage if
you don’t have the expertise to create the model.
• Example of Model-Based Learning
• An example of model-based learning is predicting the price of a house based on
its size, number of rooms, location, and other features. In this case, a model could
be created using linear regression to predict the price of the house based on
these features. The model would be trained on a dataset of house prices and
features and then used to make predictions on new data.
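A minimal sketch of that house-price example (scikit-learn assumed; the feature values and prices are invented):

from sklearn.linear_model import LinearRegression

# Columns: size (sq ft), number of rooms, distance to city centre (km).
X = [[1200, 3, 5], [1500, 4, 3], [800, 2, 10], [2000, 5, 2]]
y = [200_000, 280_000, 120_000, 390_000]  # prices

model = LinearRegression().fit(X, y)
print(model.coef_)                    # one learned weight per feature
print(model.predict([[1400, 3, 4]]))  # price estimate for a new house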
Advantages of Instance-Based Learning
• No need for model creation: Instance-based learning doesn’t require
creating a model, which can be an advantage if you don’t have the
expertise to create the model.
• Can handle small datasets: Instance-based learning can handle small
datasets because it doesn’t require a large dataset to create a model.
• More flexibility: Instance-based learning can be more flexible than
model-based learning because the machine stores all instances of
data and can use this data to make predictions.
Disadvantages of Instance-Based Learning
• Slower predictions: Instance-based learning is typically slower than
model-based learning because the machine has to compare the new
data to all instances of data in order to make a prediction.
• Less accurate predictions: Instance-based learning can often make
less accurate predictions than model-based learning because it
doesn’t have a mathematical model to generalize from.
• Limited understanding of data: Instance-based learning doesn’t
provide as much insight into the relationships between input and
output variables as model-based learning does.
Main Challenges of Machine Learning
1. Insufficient Quantity of Training Data:
• Most Machine Learning algorithms require a lot of data to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples.
2. Non representative Training Data:
• In order to generalize well, it is crucial that your training data be representative of the
new cases you want to generalize to. This is true whether you use instance-based learning
or model-based learning.
• For example, the set of countries we used earlier for training the linear model was not
perfectly representative; a few countries were missing. Figure 1-21 shows what the data
looks like when you add the missing countries.
• If you train a linear model on this data, you get the solid line, while the old model is
represented by the dotted line. As you can see, not only does adding a few missing
countries significantly alter the model, but it makes it clear that such a simple linear
model is probably never going to work well.
Main Challenges of Machine Learning
Figure 1-21. A more representative training sample

3. Poor-Quality Data:
• Data plays a significant role in the machine learning process. Unclean and noisy data can
make the whole process extremely exhausting. We don’t want our algorithm to make
inaccurate or faulty predictions.
• We need to ensure that data preprocessing, which includes removing outliers, filtering missing values, and removing unwanted features, is done with the utmost care.
Main Challenges of Machine Learning
4. Irrelevant Features:
• System will only be capable of learning if the training data contains enough relevant
features and not too many irrelevant ones. A critical part of the success of a Machine
Learning project is coming up with a good set of features to train on.
• This process, called feature engineering, involves:
  Feature selection: selecting the most useful features to train on among existing features.
  Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).
  Creating new features by gathering new data.
5. Overfitting the Training Data:
• Say you are visiting a foreign country and the taxi driver rips you off. You might be
tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is
something that we humans do all too often, and unfortunately machines can fall into the
same trap if we are not careful.
• In Machine Learning this is called overfitting: it means that the model performs well on
the training data, but it does not generalize well. Figure 1-22 shows an example of a
high-degree polynomial life satisfaction model that strongly overfits the training data.
• Constraining a model to make it simpler and reduce the risk of overfitting is called
regularization.
Main Challenges of Machine Learning

Figure 1-22. Overfitting the training data
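A minimal sketch of overfitting and regularization (scikit-learn assumed; the synthetic data and the degree-15 polynomial are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(0, 0.2, 20)  # linear trend + noise

overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0))
overfit.fit(X, y)
regularized.fit(X, y)

# The unregularized polynomial nearly memorizes the noisy training data;
# the Ridge-constrained model trades training fit for better generalization.
print(overfit.score(X, y), regularized.score(X, y))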

6. Underfitting the Training Data:

• Underfitting occurs when the model is unable to establish an accurate relationship between the input and output variables. It is like trying to fit into undersized jeans: the model is too simple to capture the precise relationship in the data.
• To overcome this issue: increase the training time, increase the complexity of the model, add more features to the data, or reduce the regularization parameters.
Main Challenges of Machine Learning
7. Stepping Back:
• Machine Learning is about making machines get better at some task by learning from
data, instead of having to explicitly code rules.
• There are many different types of ML systems: supervised or not, batch or online,
instance-based or model-based, and so on.
• In an ML project you gather data in a training set, and you feed the training set to a learning algorithm. If the algorithm is model-based, it tunes some parameters to fit the model to the training set (i.e., to make good predictions on the training set itself), and then hopefully it will be able to make good predictions on new cases as well. If the algorithm is instance-based, it just learns the examples by heart and generalizes to new instances by comparing them to the learned instances using a similarity measure.
• The system will not perform well if your training set is too small, or if the data is not
representative, noisy, or polluted with irrelevant features (garbage in, garbage out).
Lastly, the model needs to be neither too simple (in which case it will underfit) nor too
complex (in which case it will overfit).
Designing a Learning System
1. Choosing the Training Experience
2. Choosing the Target Function
3. Choosing a Representation for the Target Function
4. Choosing a Function Approximation Algorithm
1. Estimating training values
2. Adjusting the weights
5. The Final Design
1. Choosing the Training Experience
• The first design choice is to choose the type of training experience from
which the system will learn.
• The type of training experience available can have a significant impact on
success or failure of the learner.

• There are three attributes that impact the success or failure of the learner:

1. Whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
2. The degree to which the learner controls the sequence of training examples.
3. How well it represents the distribution of examples over which the final system performance P must be measured.
1. Whether the training experience provides direct or indirect feedback regarding
the choices made by the performance system.
For example, in checkers game:
• In learning to play checkers, the system might learn from direct training examples consisting of individual
checkers board states and the correct move for each.

• Indirect training examples consisting of the move sequences and final outcomes of various games played.

• The information about the correctness of specific moves early in the game must be inferred indirectly from
the fact that the game was eventually won or lost.

• Here the learner faces an additional problem of credit assignment, or determining the degree to which each
move in the sequence deserves credit or blame for the final outcome.

• Credit assignment can be a particularly difficult problem because the game can be lost even when early
moves are optimal, if these are followed later by poor moves.

• Hence, learning from direct training feedback is typically easier than learning from indirect feedback.
2. A second important attribute of the training experience is the degree to which the
learner controls the sequence of training examples
For example, in checkers game:
• The learner might depend on the teacher to select informative board states and to provide the correct move for each.

• Alternatively, the learner might itself propose board states that it finds particularly confusing and ask the
teacher for the correct move.

• The learner may have complete control over both the board states and (indirect) training classifications, as it
does when it learns by playing against itself with no teacher present.

• Notice in this last case the learner may choose between experimenting with novel board states that it has not
yet considered, or refining its skill by playing minor variations of lines of play it currently finds
most promising.
3. A third attribute of the training experience is how well it represents the
distribution of examples over which the final system performance P must be
measured.
Learning is most reliable when the training examples follow a distribution similar to that of future test
examples.

For example, in checkers game:


• In checkers learning scenario, the performance metric P is the percent of games the system wins in the world
tournament.

• If its training experience E consists only of games played against itself, there is a danger that this training experience might not be fully representative of the distribution of situations over which it will later be tested.
For example, the learner might never encounter certain crucial board states that are very likely to be played
by the human checkers champion.

• In practice, it is often necessary to learn from a distribution of examples that is somewhat different from the one on which the final system will be evaluated. Such situations are problematic because mastery of one distribution of examples will not necessarily lead to strong performance over some other distribution.
2. Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be
learned and how this will be used by the performance program.
• Lets begin with a checkers-playing program that can generate the legal moves
from any board state.
• The program needs only to learn how to choose the best move from among these
legal moves. This learning task is representative of a large class of tasks for
which the legal moves that define some large search space are known a priori, but
for which the best search strategy is not known.
Given this setting where we must learn to choose among the legal moves, the most
obvious choice for the type of information to be learned is a program, or function,
that chooses the best move for any given board state.

1. Let ChooseMove be the target function, with notation

ChooseMove : B → M

which indicates that this function accepts as input any board from the set of legal board states B and produces as output some move from the set of legal moves M.

ChooseMove is an obvious choice for the target function in the checkers example, but this function will turn out to be very difficult to learn given the kind of indirect training experience available to our system.
2. An alternative target function is an evaluation function that assigns a numerical score to any given board state.
Let the target function be V, with notation

V : B → ℝ

which denotes that V maps any legal board state from the set B to some real value.

We intend for this target function V to assign higher scores to better board states. If
the system can successfully learn such a target function V, then it can easily use it to
select the best move from any current board position.
Let us define the target value V(b) for an arbitrary board state b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game.
3. Choosing a Representation for the
Target Function
Let us choose a simple representation: for any given board state, the function V̂ will be calculated as a linear combination of the following board features:

xl: the number of black pieces on the board


x2: the number of red pieces on the board
x3: the number of black kings on the board
x4: the number of red kings on the board
x5: the number of black pieces threatened by red (i.e., which can be
captured on red's next turn)
x6: the number of red pieces threatened by black
Thus, the learning program will represent V̂(b) as a linear function of the form

V̂(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6

where:
• w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm.
• Learned values for the weights w1 through w6 will determine the relative importance of the various board features in determining the value of the board.
• The weight w0 will provide an additive constant to the board value.
Partial design of a checkers learning program:

• Task T: playing checkers
• Performance measure P: percent of games won in the world tournament
• Training experience E: games played against itself
• Target function: V : Board → ℝ
• Target function representation: V̂(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6

The first three items above correspond to the specification of the learning task, whereas the final two items constitute design choices for the implementation of the learning program.
4. Choosing a Function Approximation
Algorithm
• In order to learn the target function f we require a set of training examples, each
describing a specific board state b and the training value Vtrain(b) for b.

• Each training example is an ordered pair of the form (b, Vtrain(b)).

• For instance, the following training example describes a board state b in which
black has won the game (note x2 = 0 indicates that red has no remaining
pieces) and for which the target function value Vtrain(b) is therefore +100.

((x1=3, x2=0, x3=1, x4=0, x5=0, x6=0), +100)


Function Approximation Procedure
1. Derive training examples from the indirect training experience available to the
learner
2. Adjusts the weights wi to best fit these training examples
1. Estimating training values

A simple approach for estimating training values for intermediate board states is to assign the training value Vtrain(b) for any intermediate board state b to be V̂(Successor(b)),

where
V̂ is the learner's current approximation to V, and
Successor(b) denotes the next board state following b for which it is again the program's turn to move.

Rule for estimating training values:

Vtrain(b) ← V̂(Successor(b))
2. Adjusting the weights

Specify the learning algorithm for choosing the weights wi to best fit the set of
training examples {(b, Vtrain(b))}

A first step is to define what we mean by the best fit to the training data.
• One common approach is to define the best hypothesis, or set of weights, as that
which minimizes the squared error E between the training values and the values
predicted by the hypothesis.

• Several algorithms are known for finding weights of a linear function that
minimize E.
In our case, we require an algorithm that will incrementally refine the weights as
new training examples become available and that will be robust to errors in these
estimated training values

One such algorithm is called the least mean squares, or LMS training rule. For
each observed training example it adjusts the weights a small amount in the
direction that reduces the error on this training example

LMS weight update rule: For each training example (b, Vtrain(b)):
• Use the current weights to calculate V̂(b)
• For each weight wi, update it as

wi ← wi + η (Vtrain(b) − V̂(b)) xi

Here η is a small constant (e.g., 0.1) that moderates the size of the weight update.

Working of weight update rule

• When the error (Vtrain(b)- V̂(b)) is zero, no weights are changed.


• When (Vtrain(b) - V̂(b)) is positive (i.e., when V̂(b) is too low), then each weight
is increased in proportion to the value of its corresponding feature. This will
raise the value of V̂(b), reducing the error.
• If the value of some feature xi is zero, then its weight is not altered regardless of
the error, so that the only weights updated are those whose features actually
occur on the training example board.
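A small self-contained sketch of this update rule (plain Python; the feature encoding and the example board come from the text, while the helper names are illustrative):

def features(b):
    # Prepend x0 = 1 so that w0 acts as the additive constant.
    return (1,) + tuple(b)

def v_hat(weights, b):
    # Linear evaluation function: V̂(b) = w0 + w1*x1 + ... + w6*x6
    return sum(w * x for w, x in zip(weights, features(b)))

def lms_update(weights, b, v_train, eta=0.1):
    # wi <- wi + eta * (Vtrain(b) - V̂(b)) * xi
    error = v_train - v_hat(weights, b)
    return [w + eta * error * x for w, x in zip(weights, features(b))]

# Example: the winning board state from the text, with Vtrain(b) = +100.
weights = [0.0] * 7
b = (3, 0, 1, 0, 0, 0)  # x1=3, x2=0, x3=1, x4=0, x5=0, x6=0
weights = lms_update(weights, b, v_train=100)
print(weights)  # only weights whose features are nonzero have changed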
5. The Final Design
The final design of checkers learning system can be described by four distinct
program modules that represent the central components in many learning systems
1. The Performance System is the module that must solve the given performance
task by using the learned target function(s).
It takes an instance of a new problem (new game) as input and produces a trace of
its solution (game history) as output.
In checkers game, the strategy used by the Performance System to select its next
move at each step is determined by the learned V̂ evaluation function. Therefore, we
expect its performance to improve as this evaluation function becomes increasingly
accurate.

2. The Critic takes as input the history or trace of the game and produces as output a set of training examples of the target function. As shown in the diagram, each training example in this case corresponds to some game state in the trace, along with an estimate Vtrain of the target function value for this example.
3. The Generalizer takes as input the training examples and produces an output hypothesis that is its estimate of the target function. It generalizes from the specific training examples, hypothesizing a general function that covers these examples and other cases beyond them. In our example, the Generalizer corresponds to the LMS algorithm, and the output hypothesis is the function V̂ described by the learned weights w0, …, w6.

4. The Experiment Generator takes as input the current hypothesis and outputs a
new problem (i.e., initial board state) for the Performance System to explore. Its
role is to pick new practice problems that will maximize the learning rate of the
overall system.
In our example, the Experiment Generator always proposes the same initial game
board to begin a new game.
The sequence of design choices made for the checkers program is summarized in the figure below.
Issues in Machine Learning
• What algorithms exist for learning general target functions from specific training
examples? In what settings will particular algorithms converge to the desired
function, given sufficient training data? Which algorithms perform best for
which types of problems and representations?

• How much training data is sufficient? What general bounds can be found to
relate the confidence in learned hypotheses to the amount of training experience
and the character of the learner's hypothesis space?

• When and how can prior knowledge held by the learner guide the process of
generalizing from examples? Can prior knowledge be helpful even when it is
only approximately correct?
• What is the best strategy for choosing a useful next training experience, and how
does the choice of this strategy alter the complexity of the learning problem?

• What is the best way to reduce the learning task to one or more function
approximation problems? Put another way, what specific functions should the
system attempt to learn? Can this process itself be automated?

• How can the learner automatically alter its representation to improve its ability to
represent and learn the target function?
Concept Learning
• Learning involves acquiring general concepts from specific training
examples. Example: People continually learn general concepts or
categories such as "bird," "car," "situations in which I should study
more in order to pass the exam," etc.
• Each such concept can be viewed as describing some subset of
objects or events defined over a larger set
• Alternatively, each concept can be thought of as a Boolean-valued function defined over this larger set. (Example: a function defined over all animals, whose value is true for birds and false for other animals.)
• Concept learning - Inferring a Boolean-valued function from training
examples of its input and output
A Concept Learning Task
Consider the example task of learning the target concept
"Days on which my friend Aldo enjoys his favorite water sport."
Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

Table: a set of example days, each represented by a set of attributes. The attribute EnjoySport indicates whether or not a person enjoys his favorite water sport on this day.
What hypothesis representation is provided to the learner?

Let’s consider a simple representation in which each hypothesis consists of a


conjunction of constraints on the instance attributes.

Let each hypothesis be a vector of six constraints, specifying the values of the six
attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.

For each attribute, the hypothesis will either
• indicate by a "?" that any value is acceptable for this attribute,
• specify a single required value (e.g., Warm) for the attribute, or
• indicate by a "Φ" that no value is acceptable.

If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive example (h(x) = 1).

The hypothesis that PERSON enjoys his favorite sport only on cold days with high
humidity (independent of the values of the other attributes) is represented by the
expression
(?, Cold, High, ?, ?, ?)

The most general hypothesis, that every day is a positive example, is represented by
(?, ?, ?, ?, ?, ?)

The most specific possible hypothesis, that no day is a positive example, is represented by
(Φ, Φ, Φ, Φ, Φ, Φ)
1. Notation
The set of items over which the concept is defined is called the set of instances,
which we denote by X.
Example: X is the set of all possible days, each represented by the attributes: Sky,
AirTemp, Humidity, Wind, Water, and Forecast

The concept or function to be learned is called the target concept, which we denote
by c.
c can be any Boolean-valued function defined over the instances X:
c : X → {0, 1}

Example: The target concept corresponds to the value of the attribute EnjoySport
(i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
• Instances for which c(x) = 1 are called positive examples, or members of the
target concept.
• Instances for which c(x) = 0 are called negative examples, or non-members of
the target concept.
• The ordered pair (x, c(x)) to describe the training example consisting of the
instance x and its target concept value c(x).
• D to denote the set of available training examples
• The symbol H denotes the set of all possible hypotheses that the learner may consider regarding the identity of the target concept. Each hypothesis h in H represents a Boolean-valued function defined over X:
h : X → {0, 1}

• The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in
X.
Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

2. The Inductive Learning Hypothesis
Any hypothesis found to approximate the target function well over a sufficiently
large set of training examples will also approximate the target function well over
other unobserved examples.
Concept learning as Search
• Concept learning can be viewed as the task of searching through a large space of
hypotheses implicitly defined by the hypothesis representation.
• The goal of this search is to find the hypothesis that best fits the training
examples.

Example, the instances X and hypotheses H in the EnjoySport learning task.


The attribute Sky has three possible values, and AirTemp, Humidity, Wind, Water, and Forecast each have two possible values, so the instance space X contains exactly
• 3·2·2·2·2·2 = 96 distinct instances
• 5·4·4·4·4·4 = 5120 syntactically distinct hypotheses within H (each attribute can additionally take "?" or "Φ").
Every hypothesis containing one or more "Φ" symbols represents the empty set of instances; that is, it classifies every instance as negative. Hence there are
• 1 + (4·3·3·3·3·3) = 973 semantically distinct hypotheses.
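A quick arithmetic check of these counts (plain Python, purely illustrative):

import math

instances = math.prod([3, 2, 2, 2, 2, 2])     # 96 distinct instances
syntactic = math.prod([5, 4, 4, 4, 4, 4])     # values + '?' + 'Φ' per attribute
semantic = 1 + math.prod([4, 3, 3, 3, 3, 3])  # one empty hypothesis + '?' option
print(instances, syntactic, semantic)         # 96 5120 973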
General-to-Specific Ordering of Hypotheses
• Consider the two hypotheses
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)

• Consider the sets of instances that are classified positive by hl and by h2.
• h2 imposes fewer constraints on the instance, it classifies more instances as
positive. So, any instance classified positive by hl will also be classified positive
by h2. Therefore, h2 is more general than hl.
General-to-Specific Ordering of Hypotheses
• Given hypotheses hj and hk, hj is more-general-than-or-equal-to hk if and only if any instance that satisfies hk also satisfies hj.

Definition: Let hj and hk be Boolean-valued functions defined over X. Then hj is more-general-than-or-equal-to hk (written hj ≥g hk) if and only if

(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
• In the figure, the box on the left
represents the set X of all
instances, the box on the right the
set H of all hypotheses.

• Each hypothesis corresponds to


some subset of X-the subset of
instances that it classifies positive.

• The arrows connecting hypotheses represent the more-general-than relation, with the arrow pointing toward the less general hypothesis.

• Note that the subset of instances characterized by h2 subsumes the subset characterized by h1; hence h2 is more-general-than h1.
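A small sketch of this ordering for conjunctive hypotheses (plain Python; the helper names are illustrative):

def satisfies(h, x):
    # Instance x satisfies h if every constraint is '?' or matches x.
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    # hj >=g hk iff every constraint of hj is '?' or equals hk's
    # (a 'Φ' constraint in hk is satisfied by nothing, so it is covered).
    return all(cj == '?' or cj == ck or ck == 'Φ'
               for cj, ck in zip(hj, hk))

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')
print(more_general_or_equal(h2, h1))  # True: h2 is more general than h1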
FIND-S: Finding a Maximally Specific Hypothesis
FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
For each attribute constraint ai in h
If the constraint ai is satisfied by x
Then do nothing
Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
To illustrate this algorithm, assume the learner is given the sequence of training
examples from the EnjoySport task
Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

The first step of FIND-S is to initialize h to the most specific hypothesis in H:

h ← (Ø, Ø, Ø, Ø, Ø, Ø)
x1 = <Sunny Warm Normal Strong Warm Same>, +
Observing the first training example, it is clear that our hypothesis is too specific. In
particular, none of the "Ø" constraints in h are satisfied by this example, so each is
replaced by the next more general constraint that fits the example
h1 = <Sunny Warm Normal Strong Warm Same>
This h is still very specific; it asserts that all instances are negative except for the
single positive training example

x2 = <Sunny, Warm, High, Strong, Warm, Same>, +

The second training example forces the algorithm to further generalize h, this time substituting a "?" in place of any attribute value in h that is not satisfied by the new example:
h2 = <Sunny, Warm, ?, Strong, Warm, Same>
x3 = <Rainy, Cold, High, Strong, Warm, Change>, −

Upon encountering the third training example, the algorithm makes no change to h. The FIND-S algorithm simply ignores every negative example.
h3 = <Sunny, Warm, ?, Strong, Warm, Same>

x4 = <Sunny, Warm, High, Strong, Cool, Change>, +

The fourth example leads to a further generalization of h:
h4 = <Sunny, Warm, ?, Strong, ?, ?>
The key property of the FIND-S algorithm is
• FIND-S is guaranteed to output the most specific hypothesis within H that is
consistent with the positive training examples
• FIND-S algorithm’s final hypothesis will also be consistent with the negative
examples provided the correct target concept is contained in H, and provided the
training examples are correct.
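A minimal sketch of FIND-S for these conjunctive hypotheses (plain Python; the data is the EnjoySport table above):

def find_s(examples):
    h = None  # stands for the most specific hypothesis (Ø, ..., Ø)
    for x, label in examples:
        if label != 'Yes':  # FIND-S ignores every negative example
            continue
        if h is None:       # first positive example: copy it directly
            h = list(x)
        else:               # generalize each mismatching constraint to '?'
            h = [hc if hc == xc else '?' for hc, xc in zip(h, x)]
    return h

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), 'Yes'),
]
print(find_s(data))  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']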
Questions left unanswered by FIND-S

1. Has the learner converged to the correct target concept?
2. Why prefer the most specific hypothesis?
3. Are the training examples consistent?
4. What if there are several maximally specific consistent hypotheses?
Version Space and CANDIDATE-ELIMINATION Algorithm
The key idea in the CANDIDATE-ELIMINATION algorithm is to output a description of
the set of all hypotheses consistent with the training examples

Representation
• Definition: A hypothesis h is consistent with a set of training examples D if and only if
h(x) = c(x) for each example (x, c(x)) in D.
Consistent(h, D) ≡ (∀⟨x, c(x)⟩ ∈ D) h(x) = c(x)

Note the difference between the definitions of consistent and satisfies:
• an example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is a positive or negative example of the target concept.
• an example x is said to be consistent with hypothesis h iff h(x) = c(x).
Version Space
A representation of the set of all hypotheses which are consistent with D

Definition: The version space, denoted VSH,D, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D:

VSH,D ≡ {h ∈ H | Consistent(h, D)}
The LIST-THEN-ELIMINATE Algorithm
• The LIST-THEN-ELIMINATE algorithm first initializes the version space to
contain all hypotheses in H and then eliminates any hypothesis found inconsistent
with any training example.
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example, (x, c(x))
remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace

• LIST-THEN-ELIMINATE works in principle, so long as the version space is finite.
• However, since it requires exhaustive enumeration of all hypotheses, it is not feasible in practice.
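A minimal sketch of LIST-THEN-ELIMINATE on the EnjoySport task (plain Python; the all-Φ hypotheses, which classify everything negative, are omitted for brevity):

from itertools import product

def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def list_then_eliminate(examples, attribute_values):
    # 1. Enumerate every conjunctive hypothesis (a value or '?' per attribute).
    choices = [vals + ['?'] for vals in attribute_values]
    version_space = list(product(*choices))
    # 2. Remove any hypothesis inconsistent with any training example.
    for x, label in examples:
        positive = (label == 'Yes')
        version_space = [h for h in version_space
                         if satisfies(h, x) == positive]
    return version_space

attribute_values = [
    ['Sunny', 'Cloudy', 'Rainy'], ['Warm', 'Cold'], ['Normal', 'High'],
    ['Strong', 'Weak'], ['Warm', 'Cool'], ['Same', 'Change'],
]
data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), 'Yes'),
]
print(len(list_then_eliminate(data, attribute_values)))  # 6 hypotheses remain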
A More Compact Representation for Version Spaces
• The version space is represented by its most general and least general members.
• These members form general and specific boundary sets that delimit the version
space within the partially ordered hypothesis space.
• A version space with its general and specific boundary sets.
• The version space includes all six hypotheses shown here, but can be represented more simply by S and G.
• Arrows indicate instances of the more-general-than relation.
• This is the version space for the EnjoySport concept learning problem and the training examples described in the table below.

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes


Definition: The general boundary G, with respect to hypothesis space H and training data D, is the set of maximally general members of H consistent with D:

G ≡ {g ∈ H | Consistent(g, D) ∧ (¬∃g' ∈ H)[(g' >g g) ∧ Consistent(g', D)]}

Definition: The specific boundary S, with respect to hypothesis space H and training data D, is the set of minimally general (i.e., maximally specific) members of H consistent with D:

S ≡ {s ∈ H | Consistent(s, D) ∧ (¬∃s' ∈ H)[(s >g s') ∧ Consistent(s', D)]}
Version Space representation theorem

Theorem: Let X be an arbitrary set of instances and let H be a set of Boolean-valued hypotheses defined over X. Let c : X → {0, 1} be an arbitrary target concept defined over X, and let D be an arbitrary set of training examples {⟨x, c(x)⟩}. For all X, H, c, and D such that S and G are well defined,

VSH,D = {h ∈ H | (∃s ∈ S)(∃g ∈ G)(g ≥g h ≥g s)}

To prove:
1. Every h satisfying the right-hand side of the above expression is in VSH,D.
2. Every member of VSH,D satisfies the right-hand side of the expression.

Sketch of proof:
1. Let g, h, and s be arbitrary members of G, H, and S respectively, with g ≥g h ≥g s.
By the definition of S, s must be satisfied by all positive examples in D. Because h ≥g s, h must also be satisfied by all positive examples in D.
By the definition of G, g cannot be satisfied by any negative example in D, and because g ≥g h, h cannot be satisfied by any negative example in D. Because h is satisfied by all positive examples in D and by no negative examples in D, h is consistent with D, and therefore h is a member of VSH,D.

2. It can be proven by assuming some h in VSH,D that does not satisfy the right-hand side of the expression, then showing that this leads to an inconsistency.
The CANDIDATE-ELIMINATION Learning
Algorithm

The CANDIDATE-ELIMINATION algorithm computes the version space containing all hypotheses from H that are consistent with an observed sequence of training examples.
Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
• If d is a positive example
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d
• Remove s from S
• Add to S all minimal generalizations h of s such that
• h is consistent with d, and some member of G is more general than h
• Remove from S any hypothesis that is more general than another hypothesis in S

• If d is a negative example
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
• Remove g from G
• Add to G all minimal specializations h of g such that
• h is consistent with d, and some member of S is more specific than h
• Remove from G any hypothesis that is less general than another hypothesis in G
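A sketch of the two core operations this algorithm needs for conjunctive hypotheses (plain Python; the surrounding bookkeeping over S and G follows the pseudocode above, and the helper names are illustrative):

def min_generalization(s, x):
    # Minimally generalize s so that it covers the positive example x.
    return tuple(xi if si in ('Ø', xi) else '?'
                 for si, xi in zip(s, x))

def min_specializations(g, x, attribute_values):
    # Minimally specialize g so that it excludes the negative example x:
    # for each '?' position, substitute every value that differs from x.
    specs = []
    for i, gi in enumerate(g):
        if gi == '?':
            for v in attribute_values[i]:
                if v != x[i]:
                    specs.append(g[:i] + (v,) + g[i + 1:])
    return specs

# From the trace below: specializing G2 = (?,?,?,?,?,?) against the
# negative example yields candidates including (Sunny,?,?,?,?,?),
# (?,Warm,?,?,?,?) and (?,?,?,?,?,Same); only those more general than
# a member of S are kept, giving G3.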
An Illustrative Example
The boundary sets are first initialized to G0 and S0, the most general and most specific hypotheses in H:

S0: ⟨Ø, Ø, Ø, Ø, Ø, Ø⟩
G0: ⟨?, ?, ?, ?, ?, ?⟩

For training example d = ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩, +:

S0: ⟨Ø, Ø, Ø, Ø, Ø, Ø⟩
S1: ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩
G0, G1: ⟨?, ?, ?, ?, ?, ?⟩

For training example d = ⟨Sunny, Warm, High, Strong, Warm, Same⟩, +:

S1: ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩
S2: ⟨Sunny, Warm, ?, Strong, Warm, Same⟩
G1, G2: ⟨?, ?, ?, ?, ?, ?⟩

For training example d = ⟨Rainy, Cold, High, Strong, Warm, Change⟩, −:

S2, S3: ⟨Sunny, Warm, ?, Strong, Warm, Same⟩
G3: ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩, ⟨?, ?, ?, ?, ?, Same⟩
G2: ⟨?, ?, ?, ?, ?, ?⟩

For training example d = ⟨Sunny, Warm, High, Strong, Cool, Change⟩, +:

S3: ⟨Sunny, Warm, ?, Strong, Warm, Same⟩
S4: ⟨Sunny, Warm, ?, Strong, ?, ?⟩
G4: ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩
G3: ⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩, ⟨?, ?, ?, ?, ?, Same⟩

This yields the final version space for the EnjoySport concept learning problem and the training examples described earlier.
Inductive Bias
The fundamental questions for inductive inference

• What if the target concept is not contained in the hypothesis space?


• Can we avoid this difficulty by using a hypothesis space that includes every possible
hypothesis?
• How does the size of this hypothesis space influence the ability of the algorithm to
generalize to unobserved instances?
• How does the size of the hypothesis space influence the number of training examples
that must be observed?
Effect of incomplete hypothesis space
The preceding algorithms work if the target function is in H; they will generally not work if the target function is not in H.

Consider the following examples, which represent the target function "sky = sunny or sky = cloudy":

Sunny   Warm  Normal  Strong  Cool  Change  Y
Cloudy  Warm  Normal  Strong  Cool  Change  Y
Rainy   Warm  Normal  Strong  Cool  Change  N

If we apply the Candidate Elimination algorithm as before, we end up with an empty version space.

After the first two training examples:

S2 = ⟨?, Warm, Normal, Strong, Cool, Change⟩

This new hypothesis is overly general: it covers the third (negative) training example! Our H does not include the appropriate c.

An Unbiased Learner
Incomplete hypothesis space
• If c not in H, then consider generalizing representation of H to contain c
• The size of the instance space X of days described by the six available attributes is 96. The number of distinct subsets that can be defined over a set X containing |X| elements (i.e., the size of the power set of X) is 2^|X|.
• Recall that there are 96 instances in EnjoySport; hence there are 2^96 possible hypotheses in the full space H.
• This full space can be obtained by using full propositional calculus with AND, OR, and NOT.
• Hence H defined only by conjunctions of attributes is biased (containing only 973 hypotheses).
• Let us reformulate the Enjoysport learning task in an unbiased way by defining a new
hypothesis space H' that can represent every subset of instances; that is, let H' correspond
to the power set of X.
• One way to define such an H' is to allow arbitrary disjunctions, conjunctions, and
negations of our earlier hypotheses.

For instance, the target concept "Sky = Sunny or Sky = Cloudy" could then be described as
⟨Sunny, ?, ?, ?, ?, ?⟩ ∨ ⟨Cloudy, ?, ?, ?, ?, ?⟩
Definition:
Consider a concept learning algorithm L for the set of instances X.
• Let c be an arbitrary concept defined over X
• Let Dc = {( x , c(x))} be an arbitrary set of training examples of c.
• Let L(xi, Dc) denote the classification assigned to the instance xi by L after training on the
data Dc.
• The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc:

(∀xi ∈ X) [(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]


Modelling inductive systems by
equivalent deductive systems.
The input-output behavior of the
CANDIDATE-ELIMINATION
algorithm using a hypothesis space H
is identical to that of a deductive
theorem prover utilizing the assertion
"H contains the target concept." This
assertion is therefore called the
inductive bias of the CANDIDATE-
ELIMINATION algorithm.
Characterizing inductive systems by their inductive bias allows modelling them by their equivalent deductive systems. This provides a way to compare inductive systems according to their policies for generalizing beyond the observed training data.
End
