ML m1-m5 NOTES

Machine Learning 21AI63

MODULE 1
Introduction: The Machine Learning Landscape

What is Machine learning?


Machine Learning is the science and art of programming computers so they can learn from
data.
• Arthur Samuel, 1959: Machine Learning is the field of study that gives computers
the ability to learn without being explicitly programmed.
• Tom Mitchell, 1997: A computer program is said to learn from experience E with
respect to some task T and some performance measure P, if its performance on T, as
measured by P, improves with experience E.

Example: Spam filter

• It is a Machine Learning program that can learn to flag spam given examples of spam
emails (flagged by users) and examples of regular emails.
• The examples that the system uses to learn are called the training set. Each training
example is called a training instance (or sample).
• In this case, the task T is to flag spam for new emails, the experience E is the training
data, and the performance measure P needs to be defined; for example, you can use the
ratio of correctly classified emails. This particular performance measure is called
accuracy and it is often used in classification tasks.
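The accuracy measure described above can be sketched in a few lines. This is a minimal illustration, assuming labels are plain strings; the function name `accuracy` and the five example emails are made up for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels for five emails (made-up data, for illustration only).
true_labels = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]
print(accuracy(true_labels, predicted))  # 4 of 5 correct
```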

Why Use Machine learning?


Machine Learning is used for:
• Problems for which existing solutions require a lot of hand-tuning or long lists of
rules: one Machine Learning algorithm can often simplify code and perform better.
• Complex problems for which there is no good solution at all using a traditional
approach: the best Machine Learning techniques can find a solution.
• Fluctuating environments: a Machine Learning system can adapt to new data.
• Getting insights about complex problems and large amounts of data

1 Deepak D, Asst. Prof., Dept. of AIML, Canara Engineering College, Mangaluru



Example 1: Spam filtering


Consider a spam filter program which is written using traditional programming techniques
(Figure 1-1):

1. First, look at what spam typically looks like. You might notice that some words or
phrases (such as “4U,” “credit card,” “free,” and “amazing”) tend to come up a lot in the
subject line. You might also notice a few other patterns in the sender’s name, the
email’s body, and so on.
2. Then write a detection algorithm for each of the patterns that you noticed; the program
would flag emails as spam if a number of these patterns are detected.
3. Next, test the program, and repeat steps 1 and 2 until it is good enough.
Since the problem is not trivial, the program will likely become a long list of complex rules
that is pretty hard to maintain.
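A hand-written filter of this kind might look like the sketch below. The phrase list comes from step 1; the function name `is_spam_by_rules` and the threshold of two matches are assumptions chosen only for illustration.

```python
# Phrases noticed by hand in step 1 (lower-cased for matching).
SUSPICIOUS_PHRASES = ["4u", "credit card", "free", "amazing"]

def is_spam_by_rules(subject, threshold=2):
    """Flag an email when at least `threshold` hand-picked phrases appear in the subject."""
    text = subject.lower()
    hits = sum(1 for phrase in SUSPICIOUS_PHRASES if phrase in text)
    return hits >= threshold

print(is_spam_by_rules("Amazing FREE credit card 4U"))  # several rules fire
print(is_spam_by_rules("Amazing deal For U"))           # only one rule fires
```

Every new spam pattern needs another hand-written rule, which is why such a program grows into a long list that is hard to maintain.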

A spam filter based on Machine Learning techniques automatically learns which words and
phrases are good predictors of spam by detecting unusually frequent patterns of words in the
spam examples compared to the ham examples (Figure 1-2). The program is much shorter,
easier to maintain, and most likely more accurate.

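A machine learning-based spam filter also adapts to evolving spam tactics. If spammers
notice that "4U" is being blocked and start writing "For U" instead, the filter notices that
"For U" has become unusually frequent in spam flagged by users and starts flagging it
without any manual rule update. This saves time and effort and keeps the filter effective
against emerging spam techniques.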
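The idea of detecting words that are unusually frequent in spam compared to ham can be sketched as follows. This is a deliberately simple frequency-ratio heuristic, not the method of any particular library; the function name `learn_spam_words`, the add-one smoothing, the ratio threshold, and the example subjects are all assumptions for illustration.

```python
from collections import Counter

def learn_spam_words(spam_subjects, ham_subjects, min_ratio=3.0):
    """Return words that are unusually frequent in spam relative to ham."""
    # Count in how many subjects each word appears (once per subject).
    spam_counts = Counter(w for s in spam_subjects for w in set(s.lower().split()))
    ham_counts = Counter(w for s in ham_subjects for w in set(s.lower().split()))
    learned = set()
    for word, count in spam_counts.items():
        spam_rate = count / len(spam_subjects)
        # Add-one smoothing keeps the ratio finite for words never seen in ham.
        ham_rate = (ham_counts[word] + 1) / (len(ham_subjects) + 1)
        if spam_rate / ham_rate >= min_ratio:
            learned.add(word)
    return learned

# Made-up training examples.
spam = ["free prize for u", "credit card offer for u", "amazing deal for u"]
ham = ["notes for the meeting", "lunch on friday", "project report attached"]
learned_words = learn_spam_words(spam, ham)
print(learned_words)
```

Retraining on newly flagged emails updates the learned words, with no rule written by hand.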


Another area where Machine Learning shines is for problems that either are too complex
for traditional approaches or have no known algorithm.

Example 2: Speech recognition


• If one wants to start simple and write a program capable of distinguishing the words
“one” and “two,” one might notice that the word “two” starts with a high-pitch sound
(“T”), so one could hardcode an algorithm that measures high-pitch sound intensity
and use that to distinguish ones and twos.
• Obviously, this technique will not scale to thousands of words spoken by millions of
very different people in noisy environments and in dozens of languages.
• The best solution is to write an algorithm that learns by itself, given many example
recordings for each word.

Finally, Machine Learning can help humans learn (Figure 1-4): ML algorithms can be
inspected to see what they have learned. Sometimes it will reveal unsuspected correlations or
new trends, and thereby lead to a better understanding of the problem.


Types of Machine learning


There are many different types of Machine Learning systems, so it is useful to classify them
into broad categories based on:
• Whether or not they are trained with human supervision
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning
• Whether or not they can learn incrementally on the fly
1. Online learning
2. Batch learning
• Whether they work by simply comparing new data points to known data points, or
instead detect patterns in the training data and build a predictive model
1. Instance-based learning
2. Model-based learning

1. Machine Learning systems can be classified according to the amount and type of
supervision they get during training.
There are four major categories:
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning.

1. Supervised learning: The training data fed to the algorithm includes the desired
solutions, called labels.

A typical Supervised learning task is Classification.


• The spam filter is a good example of classification: it is trained with many example
emails along with their class (spam or ham), and it must learn how to classify new
emails.


Another Supervised learning task is Regression


• Here, the task is to predict a target numeric value, such as the price of a car, given a set of
features (mileage, age, brand, etc.) called predictors. This sort of task is called
regression (Figure 1-6)
• To train the system, you need to give it many examples of cars, including both their
predictors and their labels (i.e., their prices).
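As a minimal sketch of regression, the code below fits a straight line to one predictor with ordinary least squares and uses it to predict a new value. The function name `fit_line` and the car data (age in years, price in thousands) are made up for illustration; a real system would use several predictors.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training examples: predictor (age) and label (price).
ages = [1, 2, 3, 4]
prices = [9, 8, 7, 6]
slope, intercept = fit_line(ages, prices)
predicted_price = slope * 5 + intercept  # prediction for a five-year-old car
print(slope, intercept, predicted_price)
```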

Here are some of the most important supervised learning algorithms


• k-Nearest Neighbors
• Linear Regression
• Logistic Regression
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks

2. Unsupervised learning: In unsupervised learning the training data is unlabeled (Figure
1-7). The system tries to learn without a teacher.

Here are some of the most important unsupervised learning algorithms:

Clustering
• K-Means
• DBSCAN
• Hierarchical Cluster Analysis (HCA)

Anomaly detection and novelty detection
• One-class SVM
• Isolation Forest

Visualization and dimensionality reduction
• Principal Component Analysis (PCA)
• Kernel PCA
• Locally-Linear Embedding (LLE)
• t-distributed Stochastic Neighbor Embedding (t-SNE)

Association rule learning
• Apriori
• Eclat


For example:
• Suppose you have a lot of data about your blog’s visitors. You may want to run a
clustering algorithm to try to detect groups of similar visitors (Figure 1-8). At no point
do you tell the algorithm which group a visitor belongs to; it finds those connections
by itself.
• It might notice that 40% of your visitors are males who love comic books and
generally read your blog in the evening, while 20% are young sci-fi lovers who visit
during the weekends, and so on.
• If you use a hierarchical clustering algorithm, it may also subdivide each group into
smaller groups. This may help you target your posts for each group.
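To make clustering concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) on one-dimensional data. The single feature (hour of the visit), the starting centers, and the function name `kmeans_1d` are assumptions for illustration; no labels are given to the algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given initial centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Made-up data: hour of day at which six visitors read the blog.
visit_hours = [19, 20, 21, 9, 10, 11]
centers, clusters = kmeans_1d(visit_hours, centers=[0, 24])
print(centers, clusters)
```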

3. Semisupervised learning: Some algorithms can deal with partially labeled training data,
usually a lot of unlabeled data and a little bit of labeled data. This is called
semisupervised learning (Figure 1-11).

Example: Google Photos


Once users upload all their family photos to the service, it automatically recognizes that the
same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2,
5, and 7. This constitutes the unsupervised part of the algorithm (clustering). Now all the
system needs is for users to tell it who these people are. Just one label per person, and it is
able to name everyone in every photo, which is useful for searching photos.


4. Reinforcement Learning: The learning system, called an agent in this context, can
observe the environment, select and perform actions, and get rewards in return (or
penalties in the form of negative rewards, as in Figure 1-12). It must then learn by itself
what is the best strategy, called a policy, to get the most reward over time. A policy
defines what action the agent should choose when it is in a given situation.

Example: Robots implement Reinforcement Learning algorithms to learn how to walk.
DeepMind’s AlphaGo program is also a good example of Reinforcement Learning.

2. Machine Learning systems are also classified based on whether or not the system can
learn incrementally from a stream of incoming data.

1. Batch learning: In batch learning, the system is incapable of learning incrementally: it
must be trained using all the available data. This will generally take a lot of time and
computing resources, so it is typically done offline. First the system is trained, and then it
is launched into production and runs without learning anymore; it just applies what it has
learned. This is also called offline learning.


• If one desires a batch learning system to know about new data, one must train a new
version of the system from scratch on the full dataset, then stop the old system and
replace it with the new one.
• Fortunately, the whole process of training, evaluating, and launching a Machine
Learning system can be automated fairly easily, so even a batch learning system can
adapt to change. Simply update the data and train a new version of the system from
scratch as often as needed.
• This solution is simple and often works fine, but training using the full set of data can
take many hours, so you would typically train a new system only every 24 hours or
even just weekly. If your system needs to adapt to rapidly changing data (e.g., to
predict stock prices), then you need a more reactive solution.

2. Online learning: In online learning, the system is trained incrementally by feeding it data
instances sequentially, either individually or in small groups called mini-batches. Each
learning step is fast and cheap, so the system can learn about new data on the fly, as it
arrives (see Figure 1-13).

• Online learning is great for systems that receive data as a continuous flow (e.g., stock
prices) and need to adapt to change rapidly or autonomously. It is also a good option if
you have limited computing resources: once an online learning system has learned about
new data instances, it does not need them anymore, so you can discard them. This can
save a huge amount of space.
• Online learning algorithms can also be used to train systems on huge datasets that
cannot fit in one machine’s main memory (this is called out-of-core learning). The
algorithm loads part of the data, runs a training step on that data, and repeats the
process until it has run on all of the data.
• One important parameter of online learning systems is how fast they should adapt to
changing data: this is called the learning rate. If you set a high learning rate, then your
system will rapidly adapt to new data, but it will also tend to quickly forget the old
data.
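The effect of the learning rate can be illustrated with a very small online learner that tracks a running estimate of a data stream, one instance at a time. This is only a sketch: the function name `online_mean`, the update form, and the stream values are assumptions chosen to show the trade-off, not a production algorithm.

```python
def online_mean(stream, learning_rate):
    """Track a running estimate, nudging it toward each new value as it arrives."""
    estimate = 0.0
    for value in stream:
        # Each instance is used once and can then be discarded.
        estimate += learning_rate * (value - estimate)
    return estimate

# Made-up stream: a long stretch of old data near 10, then a few new values near 20.
stream = [10.0] * 50 + [20.0] * 5
fast = online_mean(stream, learning_rate=0.9)   # adapts quickly, forgets the old data
slow = online_mean(stream, learning_rate=0.05)  # adapts slowly, still close to the old data
print(fast, slow)
```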


3. Machine Learning systems can also be classified by how they generalize.
There are two main approaches to generalization:

1. Instance-based learning: The system learns the examples, then generalizes to new
cases by comparing them to the learned examples using a similarity measure.
For example, in Figure 1-15 the new instance would be classified as a triangle
because the majority of the most similar instances belong to that class.
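A minimal sketch of this idea is the k-Nearest Neighbors classifier below: it stores the training instances and classifies a new point by majority vote among the most similar ones. Squared Euclidean distance as the similarity measure, k = 3, the function name `knn_classify`, and the points themselves are assumptions for illustration.

```python
from collections import Counter

def knn_classify(training, new_point, k=3):
    """training: list of ((x, y), label). Majority vote among the k nearest instances."""
    def squared_distance(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    nearest = sorted(training, key=lambda item: squared_distance(item[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up stored examples of two classes.
training = [
    ((1, 1), "triangle"), ((1, 2), "triangle"), ((2, 1), "triangle"),
    ((5, 5), "square"), ((6, 5), "square"), ((5, 6), "square"),
]
print(knn_classify(training, (2, 2)))
print(knn_classify(training, (5, 4)))
```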


2. Model-based learning: Build a model from a set of examples, then use that model
to make predictions.

Main challenges of Machine learning

In ML, the main task is to select a learning algorithm and train it on some data, so the two
things that can go wrong are “bad algorithm” and “bad data”. The main challenges of ML are:

1. Insufficient Quantity of Training Data
2. Nonrepresentative Training Data
3. Poor-Quality Data
4. Irrelevant Features
5. Overfitting the Training Data
6. Underfitting the Training Data
7. Testing and Validating
8. Hyperparameter Tuning and Model Selection

1. Insufficient Quantity of Training Data: It takes a lot of data for most Machine
Learning algorithms to work properly. Even for very simple problems you typically
need thousands of examples, and for complex problems such as image or speech
recognition you may need millions of examples.
2. Nonrepresentative Training Data: In order to generalize well, it is crucial that the
training data be representative of the new cases one aims to generalize to. A model
trained on a nonrepresentative training set is unlikely to make accurate
predictions.
3. Poor-Quality Data: If the training data is full of errors, outliers, and noise (e.g., due to
poor-quality measurements), it will make it harder for the system to detect the
underlying patterns, so the system is less likely to perform well. It is often well worth
the effort to spend time cleaning up the training data. The truth is, most data scientists
spend a significant part of their time doing just that.
4. Irrelevant Features: The system will only be capable of learning if the training data
contains enough relevant features and not too many irrelevant ones. A critical part of
the success of a Machine Learning project is coming up with a good set of features to
train on. This process is called feature engineering.
Feature engineering includes:
• Feature selection: selecting the most useful features to train on among existing
features.
• Feature extraction: combining existing features to produce a more useful one
• Creating new features by gathering new data.

5. Overfitting the Training Data: It means that the model performs well on the training
data, but it does not generalize well. Complex models such as deep neural networks
can detect subtle patterns in the data, but if the training set is noisy, or if it is too small
then the model is likely to detect patterns in the noise itself. Obviously, these patterns
will not generalize to new instances.
Overfitting happens when the model is too complex relative to the amount and
noisiness of the training data. The possible solutions are:
• To simplify the model by selecting one with fewer parameters, by reducing the
number of attributes in the training data, or by constraining the model
• To gather more training data
• To reduce the noise in the training data (e.g., fix data errors and remove
outliers)
Constraining a model to make it simpler and reduce the risk of overfitting is called
regularization.
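One common way to constrain a model is an L2 (ridge) penalty on its weights. The sketch below assumes the simplest possible model, y ≈ w · x with a single weight, for which minimizing the squared error plus `reg_strength * w**2` has the closed form w = Σxy / (Σx² + reg_strength). The function name, the penalty form, and the data are illustrative assumptions.

```python
def fit_slope(xs, ys, reg_strength=0.0):
    """Least-squares slope of y ~ w * x with an L2 (ridge) penalty reg_strength * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + reg_strength)

# Made-up data following y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(fit_slope(xs, ys))                     # unconstrained fit
print(fit_slope(xs, ys, reg_strength=14.0))  # the penalty shrinks the weight toward 0
```

A larger `reg_strength` gives a flatter, simpler model; the right amount is a hyperparameter that has to be tuned.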

6. Underfitting the Training Data: It occurs when a model is too simple to learn the
underlying structure of the data.
Example: Attempting to fit a linear model to predict housing prices based solely on
the number of bedrooms. While bedrooms may have some correlation with price, a
linear model oversimplifies the relationship between various features (such as square
footage, location, amenities, etc.) and housing prices. As a result, the model would
likely perform poorly in accurately predicting housing prices, both on the training data
and unseen data, because it fails to capture the complexities of the housing market.
The main options to fix this problem are:
• Selecting a more powerful model, with more parameters
• Feeding better features to the learning algorithm (feature engineering)
• Reducing the constraints on the model (e.g., reducing the regularization
hyperparameter)


7. Testing and Validating: The data is split into two sets: the training set and the test
set. The model is trained using the training set, and it is tested using the test set. The
error rate on new cases is called the generalization error, and by evaluating the model
on the test set, an estimate of this error is obtained. This value indicates how well the
model will perform on instances it has never seen before. If the training error is low
but the generalization error is high, it means that the model is overfitting the training
data.
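The split and the error estimate can be sketched as below. The 80/20 ratio, the fixed seed, the helper names `train_test_split` and `error_rate`, and the toy data are assumptions for illustration (the helper here is hand-written, not a library function).

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and hold out the first `test_ratio` share as the test set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(model, examples):
    """Share of (x, label) examples that the model gets wrong."""
    wrong = sum(1 for x, label in examples if model(x) != label)
    return wrong / len(examples)

# Made-up labelled data: the label is True when x is at least 50.
data = [(x, x >= 50) for x in range(100)]
train_set, test_set = train_test_split(data)

def model(x):
    # Stand-in for a trained model; here it happens to match the true rule.
    return x >= 50

print(len(train_set), len(test_set), error_rate(model, test_set))
```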

8. Hyperparameter Tuning and Model Selection:


• Suppose the linear model generalizes better, but regularization is desired to avoid
overfitting. The challenge arises in selecting the appropriate regularization
hyperparameter value. One approach is to train numerous models with varied
hyperparameter values, selecting the one yielding the lowest generalization error.
However, relying solely on test set performance for hyperparameter tuning may
lead to reduced performance on new data due to overfitting to the test set.
• A common solution to this problem is called holdout validation: part of the
training set is simply held out to evaluate several candidate models and select the
best one. The new heldout set is called the validation set. Multiple models with
various hyperparameters are trained on the reduced training set and the model that
performs best on the validation set is selected. After this holdout validation
process, the best model is trained on the full training set, and this gives the final
model. Lastly, this final model is evaluated on the test set to get an estimate of the
generalization error.
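The holdout validation procedure can be sketched end to end as follows. The model (a single-weight ridge fit), the candidate hyperparameter values, the hand-made split, and all function names are assumptions for illustration.

```python
def fit_slope(xs, ys, reg):
    """Slope of y ~ w * x with L2 penalty reg * w**2 (closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg)

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def holdout_select(train, valid, candidates):
    """Return the candidate regularization value with the lowest validation error."""
    def validation_error(reg):
        w = fit_slope(train[0], train[1], reg)
        return mse(w, valid[0], valid[1])
    return min(candidates, key=validation_error)

# Made-up data following y = 2x, split by hand into three sets.
train = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # reduced training set
valid = ([4.0, 5.0], [8.0, 10.0])            # validation set
test = ([6.0], [12.0])                       # test set, used only once at the end

best_reg = holdout_select(train, valid, candidates=[0.0, 1.0, 10.0])
# Retrain the selected model on the full training set (reduced set + validation set).
full_x, full_y = train[0] + valid[0], train[1] + valid[1]
final_w = fit_slope(full_x, full_y, best_reg)
test_error = mse(final_w, test[0], test[1])
print(best_reg, final_w, test_error)
```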


Concept Learning and Learning Problems


Well-Posed Learning Problems

Definition: A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.
To have a well-defined learning problem, three features need to be identified:
1. The class of tasks
2. The measure of performance to be improved
3. The source of experience

Examples
1. Checkers game: A computer program that learns to play checkers might improve its
performance as measured by its ability to win at the class of tasks involving playing
checkers games, through experience obtained by playing games against itself.

Fig: Checker game board


A checkers learning problem:
• Task T: playing checkers
• Performance measure P: percent of games won against opponents
• Training experience E: playing practice games against itself

2. A handwriting recognition learning problem:


• Task T: recognizing and classifying handwritten words within images
• Performance measure P: percent of words correctly classified
• Training experience E: a database of handwritten words with given
classifications
3. A robot driving learning problem:
• Task T: driving on public four-lane highways using vision sensors
• Performance measure P: average distance travelled before an error (as judged
by human overseer)
• Training experience E: a sequence of images and steering commands recorded
while observing a human driver


Designing a Learning System

The basic design issues and approaches to machine learning are illustrated by designing a
program to learn to play checkers, with the goal of entering it in the world checkers
tournament
1. Choosing the Training Experience
2. Choosing the Target Function
3. Choosing a Representation for the Target Function
4. Choosing a Function Approximation Algorithm
1. Estimating training values
2. Adjusting the weights
5. The Final Design

1. Choosing the Training Experience


• The first design choice is to choose the type of training experience from which the
system will learn.
• The type of training experience available can have a significant impact on success or
failure of the learner.

Three attributes of the training experience affect the success or failure of the learner:

1. Whether the training experience provides direct or indirect feedback regarding the
choices made by the performance system.
For example, in checkers game:
• In learning to play checkers, the system might learn from direct training examples
consisting of individual checkers board states and the correct move for each.
• Alternatively, it might have available only indirect training examples consisting of the
move sequences and final outcomes of various games played. In this case, information
about the correctness of specific moves early in the game must be inferred indirectly
from the fact that the game was eventually won or lost.
• Here the learner faces an additional problem of credit assignment, or determining
the degree to which each move in the sequence deserves credit or blame for the final
outcome. Credit assignment can be a particularly difficult problem because the
game can be lost even when early moves are optimal, if these are followed later by
poor moves.
• Hence, learning from direct training feedback is typically easier than learning from
indirect feedback.

2. The degree to which the learner controls the sequence of training examples
For example, in checkers game:
• The learner might depend on the teacher to select informative board states and to
provide the correct move for each.


• Alternatively, the learner might itself propose board states that it finds particularly
confusing and ask the teacher for the correct move.
• The learner may have complete control over both the board states and (indirect)
training classifications, as it does when it learns by playing against itself with no
teacher present.

3. How well it represents the distribution of examples over which the final system
performance P must be measured
For example, in the checkers game:
• In the checkers learning scenario, the performance metric P is the percent of games the
system wins in the world tournament.
• If its training experience E consists only of games played against itself, there is a
danger that this training experience might not be fully representative of the
distribution of situations over which it will later be tested.
• In practice, it is often necessary to learn from a distribution of examples that is
somewhat different from those on which the final system will be evaluated. This is
problematic because mastery of one distribution of examples does not necessarily
lead to strong performance over another distribution.

2. Choosing the Target Function


• The next design choice is to determine exactly what type of knowledge will be
learned and how this will be used by the performance program.
• Let’s consider a checkers-playing program that can generate the legal moves from
any board state.
• The program needs only to learn how to choose the best move from among these
legal moves.
• Since we must learn to choose among the legal moves, the most obvious choice for the
type of information to be learned is a program, or function, that chooses the best
move for any given board state.

1. Let ChooseMove be the target function and the notation is

ChooseMove : B→ M
which indicates that this function accepts as input any board from the set of legal board
states B and produces as output some move from the set of legal moves M.

ChooseMove is a choice for the target function in checkers example, but this function
will turn out to be very difficult to learn given the kind of indirect training experience
available to our system

2. An alternative target function is an evaluation function that assigns a numerical score
to any given board state.
Let V be this target function, with the notation
V : B → R
which denotes that V maps any legal board state from the set B to some real value.
We intend for this target function V to assign higher scores to better board states. If the
system can successfully learn such a target function V, then it can easily use it to
select the best move from any current board position.

Let us define the target value V(b) for an arbitrary board state b in B, as follows:
• If b is a final board state that is won, then V(b) = 100
• If b is a final board state that is lost, then V(b) = -100
• If b is a final board state that is drawn, then V(b) = 0
• If b is not a final state in the game, then V(b) = V(b'),

Where b' is the best final board state that can be achieved starting from b and playing
optimally until the end of the game
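The recursive definition of V(b) can be traced on a toy game tree. The sketch below reads "playing optimally" as both sides playing optimally, i.e. a minimax search, which is an interpretation rather than something stated in these notes; the tiny tree, the state names, and the function name `target_value` are hypothetical. On real checkers this search is far too expensive, which is why V is only approximated.

```python
def target_value(state, tree, outcomes, our_turn=True):
    """V(b): value of the final state reached from `state` when both sides play optimally."""
    if state in outcomes:  # final board state
        return {"won": 100, "lost": -100, "drawn": 0}[outcomes[state]]
    values = [target_value(s, tree, outcomes, not our_turn) for s in tree[state]]
    # We pick the successor that is best for us; the opponent picks the one worst for us.
    return max(values) if our_turn else min(values)

# Hypothetical game tree: from b0 we can move to b1 or b2, then the opponent moves.
tree = {"b0": ["b1", "b2"], "b1": ["w", "l"], "b2": ["d1", "d2"]}
outcomes = {"w": "won", "l": "lost", "d1": "drawn", "d2": "drawn"}
print(target_value("b0", tree, outcomes))
```

Moving to b1 lets the opponent force a loss, so the best achievable outcome from b0 is a draw.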

3. Choosing a Representation for the Target Function

Let’s choose a simple representation: for any given board state, the function V̂ (the
learned approximation to V) will be calculated as a linear combination of the following
board features:

• x1: the number of black pieces on the board
• x2: the number of red pieces on the board
• x3: the number of black kings on the board
• x4: the number of red kings on the board
• x5: the number of black pieces threatened by red (i.e., which can be captured on red's
next turn)
• x6: the number of red pieces threatened by black

Thus, the learning program will represent V̂(b) as a linear function of the form

V̂(b) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5 + w6 x6

Where,
• w0 through w6 are numerical coefficients, or weights, to be chosen by the learning
algorithm.
• Learned values for the weights w1 through w6 will determine the relative importance
of the various board features in determining the value of the board
• The weight w0 will provide an additive constant to the board value
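A linear combination of the six board features with weights w0 through w6 can be computed as in the sketch below. The function name `v_hat` and the weight values are arbitrary, chosen only to show the arithmetic; they are not learned values.

```python
def v_hat(weights, features):
    """Board value w0 + w1*x1 + ... + w6*x6; weights = [w0..w6], features = [x1..x6]."""
    if len(weights) != len(features) + 1:
        raise ValueError("expected one more weight than features")
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

# Arbitrary illustrative weights and a board with 3 black pieces and 1 black king.
weights = [0.0, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]
board_features = [3, 0, 1, 0, 0, 0]
print(v_hat(weights, board_features))  # 0 + 1*3 + 2*1
```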


4. Choosing a Function Approximation Algorithm

In order to learn the target function V we require a set of training examples, each describing a
specific board state b and the training value Vtrain(b) for b.

Each training example is an ordered pair of the form (b, Vtrain(b)).

For instance, the following training example describes a board state b in which black has won
the game (note x2 = 0 indicates that red has no remaining pieces) and for which the target
function value Vtrain(b) is therefore +100.

((x1=3, x2=0, x3=1, x4=0, x5=0, x6=0), +100)

Function Approximation Procedure

1. Derive training examples from the indirect training experience available to the learner
2. Adjust the weights wi to best fit these training examples

1. Estimating training values

A simple approach for estimating training values for intermediate board states is to
assign the training value of Vtrain(b) for any intermediate board state b to be
V̂(Successor(b))

Where ,
• V̂ is the learner's current approximation to V
• Successor(b) denotes the next board state following b for which it is again the
program's turn to move

Rule for estimating training values

Vtrain(b) ← V̂ (Successor(b))

2. Adjusting the weights


Specify the learning algorithm for choosing the weights wi to best fit the set of training
examples {(b, Vtrain(b))}
A first step is to define what we mean by the best fit to the training data.
One common approach is to define the best hypothesis, or set of weights, as that which
minimizes the squared error E between the training values and the values predicted by
the hypothesis.
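Written out, and assuming V̂ is the linear function of the weights wi and features xi used in these notes, the squared error and its gradient with respect to one weight are:

```latex
E \equiv \sum_{\langle b,\, V_{train}(b) \rangle \,\in\, \text{training examples}}
      \bigl( V_{train}(b) - \hat{V}(b) \bigr)^{2}
\qquad
\frac{\partial E}{\partial w_i}
  = -2 \sum \bigl( V_{train}(b) - \hat{V}(b) \bigr)\, x_i
```

Because ∂V̂/∂wi = xi for a linear V̂, a small step against this gradient for a single training example is proportional to (Vtrain(b) − V̂(b)) xi.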


Several algorithms are known for finding weights of a linear function that minimize E.
One such algorithm is called the least mean squares, or LMS training rule. For each
observed training example it adjusts the weights a small amount in the direction that
reduces the error on this training example

LMS weight update rule :- For each training example (b, Vtrain(b))
Use the current weights to calculate V̂ (b)
For each weight wi, update it as

wi ← wi + ƞ (Vtrain (b) - V̂(b)) xi

Here ƞ is a small constant (e.g., 0.1) that moderates the size of the weight update.

Working of weight update rule

• When the error (Vtrain(b)- V̂(b)) is zero, no weights are changed.


• When (Vtrain(b) - V̂(b)) is positive (i.e., when V̂(b) is too low), then each weight
is increased in proportion to the value of its corresponding feature. This will
raise the value of V̂(b), reducing the error.
• If the value of some feature xi is zero, then its weight is not altered regardless of
the error, so that the only weights updated are those whose features actually
occur on the training example board.
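One step of the LMS rule can be sketched as follows. The code assumes a constant feature x0 = 1 paired with w0 so that the same update applies to every weight (a common convention, not stated in these notes); the function name `lms_update`, the starting weights, and η = 0.01 are illustrative choices.

```python
def lms_update(weights, features, v_train, eta=0.1):
    """One LMS step: w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i."""
    xs = [1.0] + list(features)  # x0 = 1 pairs with the constant weight w0 (assumption)
    v_hat = sum(w * x for w, x in zip(weights, xs))
    error = v_train - v_hat
    return [w + eta * error * x for w, x in zip(weights, xs)]

# Start from all-zero weights and use the won-board example (x1=3, x3=1, V_train=+100).
weights = [0.0] * 7
new_weights = lms_update(weights, [3, 0, 1, 0, 0, 0], v_train=100, eta=0.01)
print(new_weights)
```

Only w0, w1 and w3 change here, because the other features are zero on this board.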

5. The Final Design


The final design of checkers learning system can be described by four distinct program
modules that represent the central components in many learning systems


1. The Performance System is the module that must solve the given performance task
by using the learned target function(s). It takes an instance of a new problem (new
game) as input and produces a trace of its solution (game history) as output.

2. The Critic takes as input the history or trace of the game and produces as output a set
of training examples of the target function

3. The Generalizer takes as input the training examples and produces an output
hypothesis that is its estimate of the target function. It generalizes from the specific
training examples, hypothesizing a general function that covers these examples and
other cases beyond the training examples.

4. The Experiment Generator takes as input the current hypothesis and outputs a new
problem (i.e., initial board state) for the Performance System to explore. Its role is to
pick new practice problems that will maximize the learning rate of the overall system.

The sequence of design choices made for the checkers program is summarized in the figure
below.


PERSPECTIVES AND ISSUES IN MACHINE LEARNING

Issues in Machine Learning


The field of machine learning is concerned with answering questions such as the following
• What algorithms exist for learning general target functions from specific training
examples? In what settings will particular algorithms converge to the desired function,
given sufficient training data? Which algorithms perform best for which types of
problems and representations?
• How much training data is sufficient? What general bounds can be found to relate the
confidence in learned hypotheses to the amount of training experience and the
character of the learner's hypothesis space?
• When and how can prior knowledge held by the learner guide the process of
generalizing from examples? Can prior knowledge be helpful even when it is only
approximately correct?
• What is the best strategy for choosing a useful next training experience, and how does
the choice of this strategy alter the complexity of the learning problem?
• What is the best way to reduce the learning task to one or more function
approximation problems? Put another way, what specific functions should the system
attempt to learn? Can this process itself be automated?
• How can the learner automatically alter its representation to improve its ability to
represent and learn the target function?


CONCEPT LEARNING

• Learning involves acquiring general concepts from specific training examples. Example:
People continually learn general concepts or categories such as "bird," "car," "situations in
which I should study more in order to pass the exam," etc.
• Each such concept can be viewed as describing some subset of objects or events defined
over a larger set
• Alternatively, each concept can be thought of as a Boolean-valued function defined over
this larger set. (Example: A function defined over all animals, whose value is true for birds
and false for other animals).

Definition: Concept learning - Inferring a Boolean-valued function from training examples
of its input and output.

A CONCEPT LEARNING TASK

Consider the example task of learning the target concept "Days on which Aldo enjoys
his favorite water sport."

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

Table: Positive and negative training examples for the target concept EnjoySport.

The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the
values of its other attributes.

What hypothesis representation is provided to the learner?

• Let's consider a simple representation in which each hypothesis consists of a
conjunction of constraints on the instance attributes.
• Let each hypothesis be a vector of six constraints, specifying the values of the six
attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.


For each attribute, the hypothesis will either:
• Indicate by a "?" that any value is acceptable for this attribute,
• Specify a single required value (e.g., Warm) for the attribute, or
• Indicate by a "Φ" that no value is acceptable.

If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive
example (h(x) = 1).

The hypothesis that Aldo enjoys his favorite sport only on cold days with high humidity
is represented by the expression
(?, Cold, High, ?, ?, ?)

The most general hypothesis (that every day is a positive example) is represented by
(?, ?, ?, ?, ?, ?)

The most specific possible hypothesis (that no day is a positive example) is represented by
(Φ, Φ, Φ, Φ, Φ, Φ)
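The representation above can be sketched in Python. This is a minimal illustration, assuming hypotheses and instances are stored as 6-tuples of strings; the helper name satisfies and the constants ANY and NONE are hypothetical names chosen for this example, not part of the notes.

```python
# Hypotheses and instances are 6-tuples over
# (Sky, AirTemp, Humidity, Wind, Water, Forecast).
ANY = "?"    # any value is acceptable
NONE = "Φ"   # no value is acceptable


def satisfies(h, x):
    """Return True if instance x meets every constraint of h, i.e. h(x) = 1."""
    # A "Φ" constraint never equals an attribute value, so it always fails.
    return all(c == ANY or c == v for c, v in zip(h, x))


h = (ANY, "Cold", "High", ANY, ANY, ANY)   # cold days with high humidity
most_general = (ANY,) * 6
most_specific = (NONE,) * 6

day = ("Rainy", "Cold", "High", "Strong", "Warm", "Change")
print(satisfies(h, day))              # True
print(satisfies(most_general, day))   # True
print(satisfies(most_specific, day))  # False
```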

Notation

• The set of items over which the concept is defined is called the set of instances, which is
denoted by X.

Example: X is the set of all possible days, each represented by the attributes: Sky, AirTemp,
Humidity, Wind, Water, and Forecast

• The concept or function to be learned is called the target concept, which is denoted by c.
c can be any Boolean valued function defined over the instances X

c: X → {0, 1}

Example: The target concept corresponds to the value of the attribute EnjoySport
(i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).

• Instances for which c(x) = 1 are called positive examples, or members of the target
concept.
• Instances for which c(x) = 0 are called negative examples, or non-members of the target
concept.
• The ordered pair (x, c(x)) describes a training example consisting of the instance x and
its target concept value c(x).
• D denotes the set of available training examples.


• H denotes the set of all possible hypotheses that the learner may consider
regarding the identity of the target concept. Each hypothesis h in H represents a
Boolean-valued function defined over X:
h: X → {0, 1}

The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.

• Given:
  • Instances X: Possible days, each described by the attributes
    • Sky (with possible values Sunny, Cloudy, and Rainy),
    • AirTemp (with values Warm and Cold),
    • Humidity (with values Normal and High),
    • Wind (with values Strong and Weak),
    • Water (with values Warm and Cool),
    • Forecast (with values Same and Change).
  • Hypotheses H: Each hypothesis is described by a conjunction of constraints on the
    attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast. The constraints may be
    "?" (any value is acceptable), "Φ" (no value is acceptable), or a specific value.
  • Target concept c: EnjoySport : X → {0, 1}
  • Training examples D: Positive and negative examples of the target function

• Determine:
  • A hypothesis h in H such that h(x) = c(x) for all x in X.

Table: The EnjoySport concept learning task.

The inductive learning hypothesis

Any hypothesis found to approximate the target function well over a sufficiently large set of
training examples will also approximate the target function well over other unobserved
examples.


CONCEPT LEARNING AS SEARCH

• Concept learning can be viewed as the task of searching through a large space of
hypotheses implicitly defined by the hypothesis representation.
• The goal of this search is to find the hypothesis that best fits the training examples.

Example:
Consider the instances X and hypotheses H in the EnjoySport learning task. The attribute Sky
has three possible values, and AirTemp, Humidity, Wind, Water, and Forecast each have two
possible values, so the instance space X contains exactly
3 · 2 · 2 · 2 · 2 · 2 = 96 distinct instances.
Each attribute constraint may also be "?" or "Φ", so there are
5 · 4 · 4 · 4 · 4 · 4 = 5120 syntactically distinct hypotheses within H.

Every hypothesis containing one or more "Φ" symbols represents the empty set of instances;
that is, it classifies every instance as negative. Counting all of these as a single
hypothesis leaves
1 + (4 · 3 · 3 · 3 · 3 · 3) = 973 semantically distinct hypotheses.
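The three counts can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
from math import prod

# Number of possible values per attribute:
# Sky has 3, the other five attributes have 2 each.
values_per_attribute = [3, 2, 2, 2, 2, 2]

# Every combination of attribute values is one instance.
instances = prod(values_per_attribute)

# Syntactically, each slot may hold a value, "?" or "Φ".
syntactic = prod(n + 2 for n in values_per_attribute)

# Semantically, all hypotheses containing "Φ" collapse into one empty
# hypothesis; the remaining slots hold a value or "?".
semantic = 1 + prod(n + 1 for n in values_per_attribute)

print(instances, syntactic, semantic)  # 96 5120 973
```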

General-to-Specific Ordering of Hypotheses

Consider the two hypotheses
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)

• Consider the sets of instances that are classified positive by h1 and by h2.
• Because h2 imposes fewer constraints on the instance, it classifies more instances as positive.
So, any instance classified positive by h1 will also be classified positive by h2.
Therefore, h2 is more general than h1.

Given hypotheses hj and hk, hj is more-general-than-or-equal-to hk if and only if any instance
that satisfies hk also satisfies hj.

Definition: Let hj and hk be Boolean-valued functions defined over X. Then hj is
more-general-than-or-equal-to hk (written hj ≥ hk) if and only if

(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
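The definition can be checked directly by enumerating the 96 instances of the EnjoySport task. The sketch below assumes the tuple representation used for illustration in these notes; the names DOMAINS, satisfies and more_general_or_equal are hypothetical.

```python
from itertools import product

# Possible values of Sky, AirTemp, Humidity, Wind, Water, Forecast.
DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),
    ("Warm", "Cold"),
    ("Normal", "High"),
    ("Strong", "Weak"),
    ("Warm", "Cool"),
    ("Same", "Change"),
]


def satisfies(h, x):
    """h(x) = 1 when every constraint is "?" or equals the attribute value."""
    return all(c == "?" or c == v for c, v in zip(h, x))


def more_general_or_equal(hj, hk):
    """hj ≥ hk: every instance that satisfies hk also satisfies hj."""
    return all(satisfies(hj, x) for x in product(*DOMAINS) if satisfies(hk, x))


h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
print(more_general_or_equal(h2, h1))  # True
print(more_general_or_equal(h1, h2))  # False
```

Because the relation is only a partial order, two hypotheses can also be incomparable, in which case the function returns False in both directions.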


• In the figure, the box on the left represents the set X of all instances, and the box on the
right the set H of all hypotheses.
• Each hypothesis corresponds to some subset of X: the subset of instances that it
classifies positive.
• The arrows connecting hypotheses represent the more-general-than relation, with the
arrow pointing toward the less general hypothesis.
• Note that the subset of instances characterized by h2 subsumes the subset characterized by
h1; hence h2 is more-general-than h1.

FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS

FIND-S Algorithm

1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
     For each attribute constraint ai in h
       If the constraint ai is satisfied by x
         Then do nothing
         Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
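The steps above can be sketched in Python for conjunctive hypotheses. This is a minimal illustration, not a reference implementation: the function name find_s and the data layout (a list of (instance, label) pairs) are assumptions made for the example. For a single attribute, the "next more general constraint" is the observed value when the constraint is "Φ", and "?" when it is a different value.

```python
def find_s(examples, n_attributes=6):
    """Return the maximally specific conjunctive hypothesis covering all positives."""
    h = ["Φ"] * n_attributes              # step 1: most specific hypothesis
    for x, is_positive in examples:       # step 2
        if not is_positive:
            continue                      # FIND-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] == "Φ":
                h[i] = value              # Φ generalizes to the observed value
            elif h[i] != value:
                h[i] = "?"                # conflicting values generalize to "?"
    return tuple(h)                       # step 3


# EnjoySport training examples: (instance, EnjoySport == Yes)
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

print(find_s(EXAMPLES))  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```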


To illustrate this algorithm, assume the learner is given the following sequence of training
examples from the EnjoySport task:

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

• The first step of FIND-S is to initialize h to the most specific hypothesis in H
h0 = (Φ, Φ, Φ, Φ, Φ, Φ)

• Consider the first training example
x1 = (Sunny, Warm, Normal, Strong, Warm, Same), +

Observing the first training example, it is clear that hypothesis h is too specific. None
of the "Φ" constraints in h are satisfied by this example, so each is replaced by the next
more general constraint that fits the example
h1 = (Sunny, Warm, Normal, Strong, Warm, Same)

• Consider the second training example
x2 = (Sunny, Warm, High, Strong, Warm, Same), +

The second training example forces the algorithm to further generalize h, this time
substituting a "?" in place of any attribute value in h that is not satisfied by the new
example
h2 = (Sunny, Warm, ?, Strong, Warm, Same)

• Consider the third training example
x3 = (Rainy, Cold, High, Strong, Warm, Change), -

Upon encountering the third training example, the algorithm makes no change to h. The FIND-S
algorithm simply ignores every negative example.
h3 = (Sunny, Warm, ?, Strong, Warm, Same)

• Consider the fourth training example
x4 = (Sunny, Warm, High, Strong, Cool, Change), +

The fourth example leads to a further generalization of h
h4 = (Sunny, Warm, ?, Strong, ?, ?)


The key property of the FIND-S algorithm

• FIND-S is guaranteed to output the most specific hypothesis within H that is
consistent with the positive training examples.
• The final hypothesis of FIND-S will also be consistent with the negative
examples, provided the correct target concept is contained in H and the
training examples are correct.

Unanswered by FIND-S

1. Has the learner converged to the correct target concept?
2. Why prefer the most specific hypothesis?
3. Are the training examples consistent?
4. What if there are several maximally specific consistent hypotheses?


VERSION SPACES AND THE CANDIDATE-ELIMINATION ALGORITHM

The key idea in the CANDIDATE-ELIMINATION algorithm is to output a description of the
set of all hypotheses consistent with the training examples.

Representation

Definition: consistent - A hypothesis h is consistent with a set of training examples D if and
only if h(x) = c(x) for each example (x, c(x)) in D.

Consistent(h, D) ≡ (∀(x, c(x)) ∈ D) h(x) = c(x)

Note the difference between the definitions of consistent and satisfies:
• An example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is
a positive or negative example of the target concept.
• An example x is said to be consistent with hypothesis h iff h(x) = c(x).

Definition: version space - The version space, denoted VS_{H,D}, with respect to hypothesis
space H and training examples D, is the subset of hypotheses from H consistent with the
training examples in D.

VS_{H,D} ≡ {h ∈ H | Consistent(h, D)}

The LIST-THEN-ELIMINATE algorithm

The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain all
hypotheses in H and then eliminates any hypothesis found inconsistent with any training
example.

1. VersionSpace ← a list containing every hypothesis in H
2. For each training example, (x, c(x))
     remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace

The LIST-THEN-ELIMINATE Algorithm

• LIST-THEN-ELIMINATE works in principle, so long as the hypothesis space H is finite.
• However, since it requires exhaustive enumeration of all hypotheses in H, it is
not feasible in practice for all but the most trivial hypothesis spaces.
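For the tiny EnjoySport hypothesis space (973 semantically distinct hypotheses) the enumeration is still cheap, so the algorithm can be sketched directly. The names below (DOMAINS, EXAMPLES, classify, list_then_eliminate) are illustrative assumptions, not part of the notes.

```python
from itertools import product

DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
    ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change"),
]

# EnjoySport training examples: (instance, c(x))
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def classify(h, x):
    """h(x): True when x satisfies every constraint of h."""
    return all(c == "?" or c == v for c, v in zip(h, x))


def list_then_eliminate(domains, examples):
    # Step 1: list every hypothesis (values or "?" per slot, plus the empty one).
    version_space = list(product(*[d + ("?",) for d in domains]))
    version_space.append(("Φ",) * len(domains))
    # Step 2: drop every hypothesis that misclassifies some training example.
    for x, label in examples:
        version_space = [h for h in version_space if classify(h, x) == label]
    # Step 3: what remains is the version space.
    return version_space


vs = list_then_eliminate(DOMAINS, EXAMPLES)
print(len(vs))  # 6
for h in vs:
    print(h)
```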


A More Compact Representation for Version Spaces

The version space is represented by its most general and least general members. These
members form general and specific boundary sets that delimit the version space within the
partially ordered hypothesis space.

Definition: The general boundary G, with respect to hypothesis space H and training data
D, is the set of maximally general members of H consistent with D.

G ≡ {g ∈ H | Consistent(g, D) ∧ (¬∃g' ∈ H) [(g' > g) ∧ Consistent(g', D)]}

Definition: The specific boundary S, with respect to hypothesis space H and training data D,
is the set of minimally general (i.e., maximally specific) members of H consistent with D.

S ≡ {s ∈ H | Consistent(s, D) ∧ (¬∃s' ∈ H) [(s > s') ∧ Consistent(s', D)]}

Here a > b means that a is strictly more general than b (a ≥ b but not b ≥ a).

Theorem: Version Space representation theorem

Let X be an arbitrary set of instances and let H be a set of Boolean-valued
hypotheses defined over X. Let c: X → {0, 1} be an arbitrary target concept defined over X,
and let D be an arbitrary set of training examples {(x, c(x))}. For all X, H, c, and D such that
S and G are well defined,

VS_{H,D} = {h ∈ H | (∃s ∈ S) (∃g ∈ G) (g ≥ h ≥ s)}

To Prove:
1. Every h satisfying the right-hand side of the above expression is in VS_{H,D}
2. Every member of VS_{H,D} satisfies the right-hand side of the expression

Sketch of proof:
1. Let g, h, s be arbitrary members of G, H, S respectively with g ≥ h ≥ s.
• By the definition of S, s must be satisfied by all positive examples in D. Because h ≥ s,
h must also be satisfied by all positive examples in D.
• By the definition of G, g cannot be satisfied by any negative example in D, and
because g ≥ h, h cannot be satisfied by any negative example in D. Because h is
satisfied by all positive examples in D and by no negative examples in D, h is
consistent with D, and therefore h is a member of VS_{H,D}.
2. This can be proven by assuming some h in VS_{H,D} that does not satisfy the right-hand side
of the expression, then showing that this leads to an inconsistency.


CANDIDATE-ELIMINATION Learning Algorithm

The CANDIDATE-ELIMINATION algorithm computes the version space containing all
hypotheses from H that are consistent with an observed sequence of training examples.

Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
• If d is a positive example
  • Remove from G any hypothesis inconsistent with d
  • For each hypothesis s in S that is not consistent with d
    • Remove s from S
    • Add to S all minimal generalizations h of s such that
      h is consistent with d, and some member of G is more general than h
    • Remove from S any hypothesis that is more general than another hypothesis in S

• If d is a negative example
  • Remove from S any hypothesis inconsistent with d
  • For each hypothesis g in G that is not consistent with d
    • Remove g from G
    • Add to G all minimal specializations h of g such that
      h is consistent with d, and some member of S is more specific than h
    • Remove from G any hypothesis that is less general than another hypothesis in G

CANDIDATE-ELIMINATION algorithm using version spaces
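The algorithm can be sketched in Python for the conjunctive hypothesis space used in these notes. This is an illustrative sketch under stated assumptions: hypotheses are tuples of attribute values, "?" or "Φ", the generality test is done syntactically on those tuples, and all function names are hypothetical.

```python
DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
    ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change"),
]
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def covers(h, x):
    """h(x) = 1: every constraint is "?" or matches the attribute value."""
    return all(c == "?" or c == v for c, v in zip(h, x))


def more_general_or_equal(a, b):
    """Syntactic test for a ≥ b on conjunctive hypotheses."""
    if "Φ" in b:
        return True  # b covers no instance at all
    return all(ca == "?" or ca == cb for ca, cb in zip(a, b))


def min_generalization(s, x):
    """The single minimal generalization of s that covers x."""
    return tuple(v if c == "Φ" else (c if c == v else "?") for c, v in zip(s, x))


def min_specializations(g, x, domains):
    """All minimal specializations of g that no longer cover x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, c in enumerate(g) if c == "?"
            for v in domains[i] if v != x[i]]


def candidate_elimination(examples, domains):
    S = {("Φ",) * len(domains)}
    G = {("?",) * len(domains)}
    for x, positive in examples:
        if positive:
            G = {g for g in G if covers(g, x)}
            S = {s if covers(s, x) else min_generalization(s, x) for s in S}
            S = {s for s in S if any(more_general_or_equal(g, s) for g in G)}
            # Drop members of S that are more general than another member.
            S = {s for s in S
                 if not any(t != s and more_general_or_equal(s, t) for t in S)}
        else:
            S = {s for s in S if not covers(s, x)}
            new_G = set()
            for g in G:
                if not covers(g, x):
                    new_G.add(g)
                else:
                    new_G.update(h for h in min_specializations(g, x, domains)
                                 if any(more_general_or_equal(h, s) for s in S))
            # Drop members of G that are less general than another member.
            G = {g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)}
    return S, G


S, G = candidate_elimination(EXAMPLES, DOMAINS)
print(S)  # {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
print(sorted(G))
```

On the four EnjoySport examples, the returned G contains the two hypotheses (Sunny, ?, ?, ?, ?, ?) and (?, Warm, ?, ?, ?, ?).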


An Illustrative Example

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

The CANDIDATE-ELIMINATION algorithm begins by initializing the version space to the set of
all hypotheses in H:

Initializing the G boundary set to contain the most general hypothesis in H
G0 = {(?, ?, ?, ?, ?, ?)}

Initializing the S boundary set to contain the most specific (least general) hypothesis
S0 = {(Φ, Φ, Φ, Φ, Φ, Φ)}

• When the first training example is presented, the CANDIDATE-ELIMINATION algorithm
checks the S boundary and finds that it is overly specific: it fails to cover the positive
example.
• The boundary is therefore revised by moving it to the least more general hypothesis that
covers this new example:
S1 = {(Sunny, Warm, Normal, Strong, Warm, Same)}
• No update of the G boundary is needed in response to this training example because G0
correctly covers this example, so G1 = G0.


• When the second training example is observed, it has a similar effect of generalizing S
further to S2, leaving G again unchanged, i.e., G2 = G1 = G0:
S2 = {(Sunny, Warm, ?, Strong, Warm, Same)}

• Consider the third training example. This negative example reveals that the G
boundary of the version space is overly general, that is, the hypothesis in G incorrectly
predicts that this new example is a positive example.
• The hypothesis in the G boundary must therefore be specialized until it correctly
classifies this new negative example:
G3 = {(Sunny, ?, ?, ?, ?, ?), (?, Warm, ?, ?, ?, ?), (?, ?, ?, ?, ?, Same)}, with S3 = S2.

Given that there are six attributes that could be specified to specialize G2, why are there only
three new hypotheses in G3?
For example, the hypothesis h = (?, ?, Normal, ?, ?, ?) is a minimal specialization of
G2 that correctly labels the new example as a negative example, but it is not included
in G3. The reason this hypothesis is excluded is that it is inconsistent with the
previously encountered positive examples.


• Consider the fourth training example.

• This positive example further generalizes the S boundary of the version space. It also
results in removing one member of the G boundary, because this member fails to
cover the new positive example:
S4 = {(Sunny, Warm, ?, Strong, ?, ?)}
G4 = {(Sunny, ?, ?, ?, ?, ?), (?, Warm, ?, ?, ?, ?)}

After processing these four examples, the boundary sets S4 and G4 delimit the version space
of all hypotheses consistent with the set of incrementally observed training examples.


INDUCTIVE BIAS

The fundamental questions for inductive inference

1. What if the target concept is not contained in the hypothesis space?
2. Can we avoid this difficulty by using a hypothesis space that includes every possible
hypothesis?
3. How does the size of this hypothesis space influence the ability of the algorithm to
generalize to unobserved instances?
4. How does the size of the hypothesis space influence the number of training examples
that must be observed?

These fundamental questions are examined in the context of the CANDIDATE-ELIMINATION
algorithm.

A Biased Hypothesis Space

• Suppose the target concept is not contained in the hypothesis space H. The obvious
solution is to enrich the hypothesis space to include every possible hypothesis.
• Consider the EnjoySport example in which the hypothesis space is restricted to
include only conjunctions of attribute values. Because of this restriction, the
hypothesis space is unable to represent even simple disjunctive target concepts such as
"Sky = Sunny or Sky = Cloudy."
• Given the following three training examples of this disjunctive concept, the algorithm would
find that there are zero hypotheses in the version space:

Sky     AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny   Warm     Normal    Strong  Cool   Change    Yes
Cloudy  Warm     Normal    Strong  Cool   Change    Yes
Rainy   Warm     Normal    Strong  Cool   Change    No

• If the CANDIDATE-ELIMINATION algorithm is applied, it ends up with an empty version
space. After the first two training examples,
S2 = (?, Warm, Normal, Strong, Cool, Change)

• This hypothesis, although it is the most specific hypothesis in H consistent with the first
two examples, is overly general: it incorrectly covers the third (negative)
training example. So H does not include the target concept c.
• In this case, a more expressive hypothesis space is required.
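The empty version space can be confirmed by brute force over the conjunctive hypothesis space. The sketch below reuses the illustrative tuple representation; the helper names are assumptions made for this example.

```python
from itertools import product

DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
    ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change"),
]

# Training examples of the disjunctive concept "Sky = Sunny or Sky = Cloudy".
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Cool", "Change"), True),
    (("Cloudy", "Warm", "Normal", "Strong", "Cool", "Change"), True),
    (("Rainy", "Warm", "Normal", "Strong", "Cool", "Change"), False),
]


def classify(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))


def consistent_hypotheses(examples):
    """All conjunctive hypotheses that agree with every given example."""
    hypotheses = list(product(*[d + ("?",) for d in DOMAINS]))
    hypotheses.append(("Φ",) * len(DOMAINS))
    return [h for h in hypotheses
            if all(classify(h, x) == label for x, label in examples)]


print(len(consistent_hypotheses(EXAMPLES[:2])))  # 32
print(len(consistent_hypotheses(EXAMPLES)))      # 0
```

Every hypothesis that covers both positive examples must have "?" for Sky, so it also covers the Rainy day, which is why nothing survives the third example.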


An Unbiased Learner

• The solution to the problem of assuring that the target concept is in the hypothesis space
H is to provide a hypothesis space capable of representing every teachable concept, that is,
every possible subset of the instances X.
• The set of all subsets of a set X is called the power set of X.

• In the EnjoySport learning task, the size of the instance space X of days described by
the six attributes is 96 instances.
• Thus, there are 2^96 distinct target concepts that could be defined over this instance
space and that the learner might be called upon to learn.
• The conjunctive hypothesis space is able to represent only 973 of these: a biased
hypothesis space indeed.

• Let us reformulate the EnjoySport learning task in an unbiased way by defining a new
hypothesis space H' that can represent every subset of instances.
• The target concept "Sky = Sunny or Sky = Cloudy" could then be described as

(Sunny, ?, ?, ?, ?, ?) ∨ (Cloudy, ?, ?, ?, ?, ?)

The Futility of Bias-Free Learning

Inductive learning requires some form of prior assumptions, or inductive bias.

Definition:
Consider a concept learning algorithm L for the set of instances X.
• Let c be an arbitrary concept defined over X.
• Let Dc = {(x, c(x))} be an arbitrary set of training examples of c.
• Let L(xi, Dc) denote the classification assigned to the instance xi by L after training on
the data Dc.
• The inductive bias of L is any minimal set of assertions B such that for any target
concept c and corresponding training examples Dc

(∀xi ∈ X) [(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]

where y ⊢ z means that z follows deductively from y.


The figure below illustrates the following:
• Modelling inductive systems by equivalent deductive systems.
• The input-output behavior of the CANDIDATE-ELIMINATION algorithm using a
hypothesis space H is identical to that of a deductive theorem prover utilizing the
assertion "H contains the target concept." This assertion is therefore called the
inductive bias of the CANDIDATE-ELIMINATION algorithm.
• Characterizing inductive systems by their inductive bias allows modelling them by
their equivalent deductive systems. This provides a way to compare inductive systems
according to their policies for generalizing beyond the observed training data.


QUESTION BANK

1. What is Machine learning?
2. Why use Machine learning? Explain with an example.
3. Explain how you would categorize Machine Learning systems based on the amount
and type of supervision they receive during training. Provide examples for each
category.
4. Demonstrate how you would categorize Machine Learning systems based on their
ability to learn incrementally from a stream of incoming data. Provide examples for
each type.
5. Demonstrate how Machine Learning systems are classified based on their
generalization capabilities. Provide examples for each classification.
6. Illustrate the main challenges of ML that can arise when selecting a learning algorithm
and training it on data, specifically focusing on "bad algorithm" and "bad data."
Provide examples to support the explanation.
7. Explain the concepts of overfitting and underfitting in the context of training data in
Machine Learning
8. Illustrate some of the basic design issues and approaches to machine learning.
9. What are the issues in Machine Learning?
10. Discuss the concept learning with example.
11. Explain the General-to-Specific Ordering of Hypotheses
12. Explain the FIND-S algorithm with example.
13. Write the candidate elimination algorithm and illustrate with example
14. Write the final version space for the below mentioned training example using
candidate elimination algorithm


MODULE 2
Introduction: End-to-End Machine Learning Project
This chapter walks through an example project end to end, imagining that you are a
recently hired data scientist in a real estate company. Here are the main steps to go through:

1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.

Look at the Big Picture


Welcome to Machine Learning Housing Corporation!
• The first task is to build a model of housing prices in California using the California
census data.
• This data includes metrics such as population, median income, median housing price,
and more for each block group in California.
• Block groups are the smallest geographical units for which the US Census Bureau
publishes sample data, typically having a population of 600 to 3,000 people. These
will be referred to as “districts” for simplicity.
• The model should learn from this data and be able to predict the median housing price
in any district based on all the other metrics

Frame the Problem

Each of the following questions helps in framing and understanding the machine learning
project more effectively.
1. What exactly is the business objective? - This question aims to clarify the ultimate
goal of the project and how the company expects to benefit from the model.

2. How does the company expect to use and benefit from this model? - This is important
because it will determine how to frame the problem, which algorithms to select, which
performance measure to use to evaluate the model, and how much effort to
spend tweaking it.


Here, the model’s output (a prediction of a district’s median housing price) will be fed
to another Machine Learning system (Figure 2-2), along with many other signals. This
downstream system will determine whether it is worth investing in a given area or not.
Getting this right is critical, as it directly affects revenue.

3. What does the current solution look like (if any)? - It will give a reference
performance, as well as insights on how to solve the problem.

With all this information, you are ready to start designing the system:

1. Is it supervised, unsupervised, or reinforcement learning? - This is a supervised
learning task because we have labeled training examples where each instance comes
with the expected output, i.e., the district's median housing price.

2. Is it a classification task, a regression task, or something else? - It is a regression task
because we are asked to predict a continuous value (the median housing price). More
specifically, it is a multiple regression problem since the system will use multiple
features to make a prediction (such as population, median income, etc.). It is also a
univariate regression problem since we are only trying to predict a single value for
each district. If we were trying to predict multiple values per district, it would be a
multivariate regression problem.

3. Should you use batch learning or online learning techniques? - Batch learning should
be chosen because there is no continuous flow of new data, no immediate need to
adjust to changing data, and the data is small enough to fit in memory.


Select a Performance Measure

1. Root Mean Square Error (RMSE)

A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It
gives an idea of how much error the system typically makes in its predictions, with a higher
weight for large errors.

RMSE(X, h) = √( (1/m) Σ_{i=1..m} (h(x(i)) − y(i))² )

This equation introduces several very common Machine Learning notations:
• m is the number of instances in the dataset. For example, if you are evaluating the
RMSE on a validation set of 2,000 districts, then m = 2,000.
• x(i) is a vector of all the feature values (excluding the label) of the ith instance in the
dataset, and y(i) is its label (the desired output value for that instance).
• X is a matrix containing all the feature values (excluding labels) of all instances in the
dataset.
• h is called a hypothesis. When the system is given an instance's feature vector x(i), it
outputs a predicted value ŷ(i) = h(x(i)) for that instance.
• RMSE(X, h) is the cost function measured on the set of examples using hypothesis h.

2. Mean Absolute Error (Average Absolute Deviation)

This performance measure is preferred when there are many outliers, because it weights
large errors less heavily than the RMSE does.

MAE(X, h) = (1/m) Σ_{i=1..m} |h(x(i)) − y(i)|

Both the RMSE and the MAE are ways to measure the distance between two vectors: the
vector of predictions and the vector of target values.
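Both measures can be sketched in a few lines of plain Python. The function names rmse and mae and the sample numbers are illustrative assumptions (they are not taken from the housing dataset); in practice a library routine would normally be used.

```python
from math import sqrt


def rmse(predictions, targets):
    """Root Mean Square Error between predicted and target values."""
    m = len(targets)
    return sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / m)


def mae(predictions, targets):
    """Mean Absolute Error between predicted and target values."""
    m = len(targets)
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / m


# Made-up values: the last prediction is an outlier-sized miss.
targets = [200.0, 150.0, 300.0, 250.0]
predictions = [210.0, 140.0, 300.0, 150.0]

print(mae(predictions, targets))   # 30.0
print(rmse(predictions, targets))  # larger than the MAE: the big error dominates
```

The single large error pulls the RMSE well above the MAE, which is the sense in which the RMSE is more sensitive to outliers.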


Get the Data

Create the Workspace

• First, ensure Python is installed.
• Next, create a workspace directory for your Machine Learning code and datasets.
• Open a terminal and type the following commands:

• A number of Python modules are needed: Jupyter, NumPy, Pandas, Matplotlib, and
Scikit-Learn.
• These can be installed with the system's packaging system (e.g., apt-get on Ubuntu, or
MacPorts or HomeBrew on MacOS), with a Scientific Python distribution such as Anaconda and
its packaging system, or with Python's own packaging system, pip.
• All the required modules and their dependencies can now be installed using this
simple pip command.

• To check your installation, try to import every module like this:

There should be no output and no error.

• Now you can fire up Jupyter by typing:

A Jupyter server is now running in your terminal, listening on port 8888.
• Now create a new Python notebook by clicking on the New button and selecting the
appropriate Python version.

Download the Data

• For this project, you just download a single compressed file, housing.tgz, which contains a
comma-separated value (CSV) file called housing.csv with all the data.
• A simple method is to use a web browser to download it, decompress the file, and extract
the CSV file.
• It is preferable, however, to write a small function or script to download the data. This is
particularly useful if the data changes regularly, since the script can be run whenever you
need to fetch the latest data.


Here is the function to fetch the data:

Now when you call fetch_housing_data(), it creates a datasets/housing directory in
the workspace, downloads the housing.tgz file, and extracts housing.csv from it into this
directory.
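The original listing is not reproduced in these notes, so the following is only a sketch of what such a function could look like, using the Python standard library. The download URL is an assumption (the notes do not give one) and may need to be replaced with wherever housing.tgz is actually hosted; the default path mirrors the datasets/housing directory described above.

```python
import tarfile
import urllib.request
from pathlib import Path

# ASSUMPTION: the notes do not state the download location; adjust as needed.
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
HOUSING_PATH = Path("datasets") / "housing"


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz into housing_path and extract its contents there."""
    housing_path = Path(housing_path)
    housing_path.mkdir(parents=True, exist_ok=True)   # create datasets/housing
    tgz_path = housing_path / "housing.tgz"
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the archive
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)      # extract housing.csv
```

Calling fetch_housing_data() with no arguments would then download and unpack the archive into datasets/housing under the current working directory.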

Now let’s load the data using Pandas.

This function returns a Pandas DataFrame object containing all the data.
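The loading function itself is also missing from these notes. A minimal sketch, assuming the CSV was extracted to datasets/housing/housing.csv and that pandas is installed; the function name load_housing_data is an assumption.

```python
from pathlib import Path

import pandas as pd


def load_housing_data(housing_path=Path("datasets") / "housing"):
    """Read housing.csv from housing_path into a pandas DataFrame."""
    csv_path = Path(housing_path) / "housing.csv"
    return pd.read_csv(csv_path)
```

A typical call would be housing = load_housing_data(), after which housing.head(), housing.info() and housing.describe() can be used to inspect the data.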

Take a Quick Look at the Data Structure

1. head(): Let's take a look at the top five rows using the DataFrame's head() method.
Each row represents one district. There are 10 attributes: longitude, latitude,
housing_median_age, total_rooms, total_bedrooms, population, households,
median_income, median_house_value, and ocean_proximity.


2. info(): The info() method is useful to get a quick description of the data, in particular
the total number of rows, and each attribute’s type and number of non-null values

3. value_counts(): You can find out what categories exist and how many districts belong
to each category by using the value_counts() method:


4. describe(): The describe() method shows a summary of the numerical attributes

• The count, mean, min, and max rows are self-explanatory. Note that the null values are
ignored (so, for example, count of total_bedrooms is 20,433, not 20,640).
• The std row shows the standard deviation, which measures how dispersed the values
are.
• The 25%, 50%, and 75% rows show the corresponding percentiles: a percentile
indicates the value below which a given percentage of observations in a group of
observations falls.
• For example, 25% of the districts have a housing_median_age lower than 18, while
50% are lower than 29 and 75% are lower than 37. These are often called the 25th
percentile (or 1st quartile), the median, and the 75th percentile (or 3rd quartile).
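The four inspection methods can be tried on any DataFrame. The sketch here uses a tiny made-up table (the values are illustrative only, not the real housing data) so that the effect of a missing value on count is easy to see.

```python
import numpy as np
import pandas as pd

# A tiny made-up stand-in for the housing DataFrame (illustrative values only).
housing = pd.DataFrame({
    "housing_median_age": [41.0, 21.0, 52.0, 52.0, 30.0, 18.0],
    "total_bedrooms": [129.0, 1106.0, np.nan, 235.0, 280.0, 410.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND",
                        "INLAND", "INLAND", "<1H OCEAN"],
})

print(housing.head())                        # top five rows
housing.info()                               # row count, dtypes, non-null counts
category_counts = housing["ocean_proximity"].value_counts()
summary = housing.describe()                 # numerical attributes only
print(category_counts)
print(summary)
```

In this toy table, the count row of describe() reports 5 for total_bedrooms and 6 for housing_median_age, because the missing value is ignored.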

Create a Test Set

When splitting the data into training and test sets, it's important to ensure that the test set
remains consistent across different runs of the program.

The Problem: If the dataset is randomly split into training and test sets each time the
program is run, different test sets will be generated each time. Over time, the model might see
the entire dataset, which defeats the purpose of having a separate test set.

Solution 1: Saving the Test Set

One way to address the issue of different test sets on each run is to save the test set when it is
first created. Then, load this saved test set in future runs. However, this approach has
limitations, especially if there is a need to update the dataset.

Solution 2: Using a Random Seed

Another option is to set the random number generator’s seed (e.g., np.random.seed(42)) so
that it always generates the same shuffled indices.

But both these solutions will break next time you fetch an updated dataset.

A more robust approach is to use each instance's unique identifier to determine whether it
should be in the test set. This way, even if the dataset is refreshed, the split remains consistent.

Here is how it can be done:


• Compute a hash of each instance’s identifier.
• Put the instance in the test set if the hash value is below a certain threshold (e.g., 20%
of the maximum hash value).
This method ensures that the test set contains approximately 20% of the data and remains
consistent across runs, even when the dataset is updated: an instance that was once in the
training set never moves to the test set.
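The identifier-hashing idea can be sketched in a few lines of standard-library Python. The choice of SHA-256 and the helper names is_in_test_set and split_by_id are assumptions made for this illustration; any stable hash function would serve.

```python
import hashlib


def is_in_test_set(identifier, test_ratio=0.2):
    """Return True if the identifier's hash falls in the lowest test_ratio share."""
    digest = hashlib.sha256(str(identifier).encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:4], "big")    # first 32 bits as an unsigned integer
    return hashed < test_ratio * 2**32            # threshold = 20% of the maximum hash


def split_by_id(identifiers, test_ratio=0.2):
    """Split identifiers into (train, test) lists using only each identifier's hash."""
    train = [i for i in identifiers if not is_in_test_set(i, test_ratio)]
    test = [i for i in identifiers if is_in_test_set(i, test_ratio)]
    return train, test


train_ids, test_ids = split_by_id(range(10_000))
print(len(test_ids) / 10_000)    # close to 0.2, though not exactly 0.2
```

Because the decision depends only on the identifier, adding new instances later never changes the set to which an existing instance belongs.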

Discover and Visualize the Data to Gain Insights

Visualizing Geographical Data

Since the dataset has geographical information (latitude and longitude), it is a good idea to
create a scatterplot of all districts to visualize the data.

The above plot looks like California, but other than that it is hard to see any particular
pattern. Setting the alpha option to 0.1 makes it much easier to visualize the places where
there is a high density of data points

Now let’s look at the housing prices. The radius of each circle represents the district’s
population (option s), and the color represents the price (option c). We will use a predefined
color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices).

Looking for Correlations

To compute the standard correlation coefficient (also called Pearson’s r) between every pair
of attributes use the corr() method:

Now let’s look at how much each attribute correlates with the median house value:

• The correlation coefficient ranges from –1 to 1.
• When it is close to 1, it means that there is a strong positive correlation; for example,
the median house value tends to go up when the median income goes up.
• When the coefficient is close to –1, it means that there is a strong negative correlation.
For example, there is a small negative correlation between the latitude and the median
house value (i.e., prices have a slight tendency to go down when you go north).
• Coefficients close to zero mean that there is no linear correlation.
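To make the three cases concrete, here is a small sketch using corr() on made-up numbers. The columns attr_down and attr_flat are hypothetical, and the values are deliberately exaggerated so that the coefficients come out at the extremes; real attributes rarely correlate this cleanly.

```python
import pandas as pd

# Made-up, exaggerated values chosen only to show the three cases.
df = pd.DataFrame({
    "median_income":      [1.0, 2.0, 3.0, 4.0, 5.0],
    "median_house_value": [100.0, 200.0, 300.0, 400.0, 500.0],  # rises with income
    "attr_down":          [5.0, 4.0, 3.0, 2.0, 1.0],            # falls as value rises
    "attr_flat":          [1.0, -1.0, 0.0, -1.0, 1.0],          # no linear trend
})

corr_matrix = df.corr()     # Pearson's r between every pair of attributes
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```

Here median_income has a coefficient of 1 with median_house_value, attr_down has –1, and attr_flat has 0.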

The below figure shows various plots along with the correlation coefficient between their
horizontal and vertical axes.

Figure: Standard correlation coefficient of various datasets

Another way to check for correlation between attributes is to use Pandas’ scatter_matrix
function, which plots every numerical attribute against every other numerical attribute

The main diagonal (top left to bottom right) would be full of straight lines if Pandas plotted
each variable against itself, which would not be very useful. So instead, Pandas displays a
histogram of each attribute

The most promising attribute to predict the median house value is the median income, so let’s
zoom in on their correlation scatterplot.

Figure: Median income versus median house value

Prepare the Data for Machine Learning Algorithms

Data Cleaning
Start by cleaning the training set. Let’s separate the predictors and the labels, since we don’t
necessarily want to apply the same transformations to the predictors and the target values.

Missing Features: Most Machine Learning algorithms cannot work with missing features. If
an attribute has some missing values, there are three options to handle them:
• Get rid of the corresponding districts.
• Get rid of the whole attribute.
• Set the values to some value (zero, the mean, the median, etc.).

These can be accomplished easily by using DataFrame’s dropna(), drop(), and fillna() methods:

If option 3 is chosen, compute the median value on the training set and use it to fill the
missing values in the training set. Save the computed median value, as it will be needed later
to replace missing values in the test set for system evaluation, and also to handle missing
values in new data once the system goes live.
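The three options map onto dropna(), drop(), and fillna() roughly as follows. This is a sketch on a made-up four-row table; the column values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up training data with one missing total_bedrooms value.
housing = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
})

option1 = housing.dropna(subset=["total_bedrooms"])   # option 1: drop the districts (rows)
option2 = housing.drop("total_bedrooms", axis=1)      # option 2: drop the whole attribute
median = housing["total_bedrooms"].median()           # option 3: learn the median ...
option3 = housing.assign(
    total_bedrooms=housing["total_bedrooms"].fillna(median)   # ... and fill with it
)
```

The variable median is the value to keep for later use on the test set and on new data.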

Scikit-Learn provides a class to take care of missing values: SimpleImputer

Since the median can only be computed on numerical attributes, we need to create a copy of
the data without the text attribute ocean_proximity:

Now, fit the imputer instance to the training data using the fit() method:

The imputer has simply computed the median of each attribute and stored the result in its
statistics_ instance variable.

Now you can use this “trained” imputer to transform the training set by replacing missing
values by the learned medians:

The result is a plain NumPy array containing the transformed features. If you want to put it
back into a Pandas DataFrame, it’s simple:
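The imputer steps described in this section can be sketched as follows, again on a small made-up numerical table standing in for the housing data without ocean_proximity.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numerical attributes (illustrative values only).
housing_num = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)                 # computes the median of each attribute
print(imputer.statistics_)               # the learned medians

X = imputer.transform(housing_num)       # plain NumPy array with no missing values
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)
```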

Handling Text and Categorical Attributes

To convert categories from text to numbers, we can use Scikit-Learn’s OrdinalEncoder

Another way is to create one binary attribute per category: one attribute equal to 1 when the
category is “<1H OCEAN” (and 0 otherwise), another attribute equal to 1 when the category
is “INLAND” (and 0 otherwise), and so on. This is called one-hot encoding, because only
one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are
sometimes called dummy attributes. Scikit-Learn provides a OneHotEncoder class to convert
categorical values into one-hot vectors.

By default, the OneHotEncoder class returns a sparse array, but we can convert it to a dense
array if needed by calling the toarray() method:
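Both encoders can be compared on a small made-up column. The variable names here are chosen for illustration and are not taken from the notes.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up categorical column with three distinct categories.
housing_cat = pd.DataFrame(
    {"ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR BAY", "INLAND"]}
)

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)  # one number per category
print(ordinal_encoder.categories_)

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)   # sparse result by default
housing_cat_dense = housing_cat_1hot.toarray()              # dense NumPy array
print(housing_cat_dense)
```

Each row of housing_cat_dense contains exactly one 1, in the column of that row's category.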

Feature Scaling
Machine Learning algorithms don’t perform well when the input numerical attributes have
very different scales.

There are two common ways to get all attributes to have the same scale:
1. Min-max scaling: In min-max scaling (normalization) the values are shifted and
rescaled so that they end up ranging from 0 to 1.

Scikit-Learn provides a transformer called MinMaxScaler for this.

2. Standardization (Z-score Normalization): Standardization first subtracts the mean
value from x (so standardized values always have a zero mean), and then divides by the
standard deviation so that the resulting distribution has unit variance.

Scikit-Learn provides a transformer called StandardScaler for standardization
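As a quick check of the two definitions, the sketch here applies both scalers to a made-up column and compares the results with the formulas (x - min) / (max - min) and (x - mean) / std computed by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])   # made-up single attribute

# Min-max scaling: (x - min) / (max - min), so values end up between 0 and 1.
X_minmax = MinMaxScaler().fit_transform(X)
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, giving zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
manual_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax.ravel())
print(X_std.ravel())
```

Note how the outlier 10.0 squeezes the other min-max values toward 0, while standardization is less affected by it.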

Transformation Pipelines
• There are many data transformation steps that need to be executed in the right order.
Scikit-Learn provides the Pipeline class to help with such sequences of
transformations.
• Here is a small pipeline for the numerical attributes:

First line imports the necessary classes from the sklearn library. Pipeline is used to create a
sequence of data processing steps.
StandardScaler is used to standardize features by removing the mean and scaling to unit
variance.
This code defines a pipeline named num_pipeline consisting of three steps:

1. 'imputer': Uses SimpleImputer to handle missing values by replacing them with the
median value of the column. This is specified by strategy="median".
2. 'attribs_adder': Uses a custom transformer CombinedAttributesAdder(), which is
assumed to be defined elsewhere. This step adds new attributes to the dataset based on
existing ones.
3. 'std_scaler': Uses StandardScaler to standardize the numerical attributes.
Standardization rescales each feature so that it has a mean of 0 and a standard
deviation of 1.

The last line applies the pipeline to the housing_num data. The fit_transform method first fits
the pipeline to the data i.e., it computes the necessary statistics such as median values for
imputation and mean/standard deviation for scaling and then transforms the data according to
the fitted pipeline.
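The pipeline listing itself is not reproduced in these notes. A sketch consistent with the description follows; the CombinedAttributesAdder shown here is a hypothetical stand-in that only appends one ratio column, and the input array is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class CombinedAttributesAdder(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: appends column 0 divided by column 1 as a new attribute."""

    def fit(self, X, y=None):
        return self                      # nothing to learn

    def transform(self, X):
        ratio = X[:, 0] / X[:, 1]
        return np.c_[X, ratio]


num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

# Made-up numerical data: total_rooms and households, with one missing value.
housing_num = np.array([[880.0, 126.0], [7099.0, 1138.0],
                        [np.nan, 177.0], [1274.0, 219.0]])
housing_num_tr = num_pipeline.fit_transform(housing_num)
print(housing_num_tr.shape)
```

The steps run in the listed order, so the added ratio is computed after imputation and is standardized along with the original columns.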

Select and Train a Model

Training and Evaluating on the Training Set

Let’s first train a Linear Regression model.

Let’s try it out on a few instances from the training set:

Let’s measure this regression model’s RMSE on the whole training set using Scikit-Learn’s
mean_squared_error function:

• This score is better than nothing but clearly not a great score: most districts’
median_house_value values range between $120,000 and $265,000, so a typical prediction
error of $68,628 is not very satisfying.
• This is an example of a model underfitting the training data. When this happens it can
mean that the features do not provide enough information to make good predictions, or
that the model is not powerful enough.
• The main ways to fix underfitting are to select a more powerful model, to feed the
training algorithm with better features, or to reduce the constraints on the model.
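The train-then-measure-RMSE steps can be sketched as follows. The data is synthetic (a noisy straight line), standing in for the prepared housing data, so the numbers it prints are unrelated to the $68,628 quoted in these notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data: one feature, linear signal plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 2.0, 200)

lin_reg = LinearRegression()
lin_reg.fit(X, y)

predictions = lin_reg.predict(X)
lin_mse = mean_squared_error(y, predictions)
lin_rmse = np.sqrt(lin_mse)          # RMSE is expressed in the units of the label
print(lin_rmse)
```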

Better Evaluation Using Cross-Validation

One way to evaluate the Decision Tree model would be to use the train_test_split function to
split the training set into a smaller training set and a validation set, then train models against
the smaller training set and evaluate them against the validation set.

K-Fold Cross-Validation
• A more efficient alternative is using Scikit-Learn’s K-fold cross-validation.
• This method splits the training set into k distinct subsets, called folds.
• The model is trained and evaluated k times, each time using a different fold for
evaluation and the remaining k-1 folds for training.
• This results in an array of k evaluation scores.

Image Source: https://docs.ultralytics.com/guides/kfold-cross-validation/#introduction
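K-fold cross-validation of a Decision Tree can be sketched with cross_val_score. The data is synthetic and the choice of cv=10 is an assumption for illustration; the scoring function returns negative MSE values, so the sign is flipped before taking the square root.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: one feature, linear signal plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 2.0, 200)

tree_reg = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)      # flip the sign, then take the root

print("Scores:", tree_rmse_scores)
print("Mean:", tree_rmse_scores.mean())
print("Standard deviation:", tree_rmse_scores.std())
```

The mean estimates the model's performance, and the standard deviation indicates how precise that estimate is.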

Insights from Cross-Validation


• The Decision Tree model might not perform as well as expected when using cross-
validation. For example, it may perform worse than the Linear Regression model.
• Cross-validation not only estimates the model’s performance but also gives a measure
of its precision (standard deviation).

Overfitting in Decision Tree Model


• If the Decision Tree performs worse than expected, it might be overfitting.
• Overfitting occurs when the model learns the training data too well, including noise
and outliers, which reduces its performance on new data.

Trying the RandomForestRegressor


• Random Forests train many Decision Trees on random subsets of the features and
average their predictions. Building a model on top of many other models is known as
Ensemble Learning, and it often enhances the performance of machine learning models.
• Although Random Forests show promising results, they can still overfit: this is the case
when the error on the training set is much lower than the error on the validation sets.

To handle overfitting, consider the following:


• Simplify the model.
• Apply regularization techniques to constrain the model.
• Obtain more training data to improve the model's generalization.

Exploring More Models


• Before finalizing on Random Forests or any other model, experiment with a variety of
models.
• Try models from different categories of machine learning algorithms, such as: Support
Vector Machines with different kernels and Neural networks.

The goal is to identify a shortlist of two to five promising models without spending too much
time on hyperparameter tweaking. By following these steps, you can ensure a thorough
evaluation of machine learning models, leading to better performance and reliability in real-
world applications.

Fine-Tune Your Model

Grid Search
• With Scikit-Learn’s GridSearchCV, you tell it which hyperparameters you want it to
experiment with and what values to try out, and it evaluates all the possible combinations
of hyperparameter values, using cross-validation.
• For example, the following code searches for the best combination of hyperparameter
values for the RandomForestRegressor:

This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of
n_estimators and max_features hyperparameter values specified in the first dict, then try all
2 × 3 = 6 combinations of hyperparameter values in the second dict, but this time with the
bootstrap hyperparameter set to False instead of True.
The grid search will explore 12 + 6 = 18 combinations of RandomForestRegressor
hyperparameter values, and it will train each model five times. In other words, there will be
18 × 5 = 90 rounds of training!
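The grid search listing is not reproduced in these notes. The sketch here matches the 3 × 4 and 2 × 3 structure described, but the specific candidate values in param_grid are assumptions, and housing_prepared and housing_labels are replaced by synthetic stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Candidate values are assumed; only the 3 x 4 and 2 x 3 shapes come from the notes.
param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
n_combinations = len(ParameterGrid(param_grid))     # 12 + 6 = 18

# Synthetic stand-ins for housing_prepared and housing_labels.
rng = np.random.default_rng(42)
housing_prepared = rng.normal(size=(100, 8))
housing_labels = 2.0 * housing_prepared[:, 0] + rng.normal(scale=0.1, size=100)

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

print(n_combinations, "combinations x 5 folds =", n_combinations * 5, "fits")
print(grid_search.best_params_)
```

After fitting, grid_search.best_params_ holds the best combination found and grid_search.best_estimator_ the corresponding retrained model.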

Randomized Search

• When the hyperparameter search space is large, it is often preferable to use
RandomizedSearchCV instead.
• It evaluates a given number of random combinations by selecting a random value for
each hyperparameter at every iteration.

Ensemble Methods

Another way to fine-tune your system is to try to combine the models that perform best. The
group (or “ensemble”) will often perform better than the best individual model, especially if
the individual models make very different types of errors.

Analyze the Best Models and Their Errors


The RandomForestRegressor can indicate the relative importance of each attribute for
making accurate predictions.

With this information, you may want to try dropping some of the less useful features. You
should also look at the specific errors that your system makes, then try to understand why it
makes them and what could fix the problem.

Evaluate Your System on the Test Set


Get the predictors and the labels from the test set, run your full_pipeline to transform the data
(call transform(), not fit_transform(), so that nothing is fitted on the test set), and evaluate
the final model on the test set.

Launch, Monitor, and Maintain Your System

• Production Readiness: Integrate the production input data sources into your system
and write necessary tests to ensure everything functions correctly.
• Performance Monitoring: Develop code to monitor your system’s live performance
regularly and trigger alerts if there is a performance drop, to catch both sudden
breakage and gradual performance degradation.
• Human Evaluation: Implement a pipeline for human analysis of your system’s
predictions, involving field experts or crowdsourcing platforms, to evaluate and
improve system accuracy.
• Input Data Quality Check: Regularly evaluate the quality of the system’s input data
to detect issues early, preventing minor problems from escalating and affecting system
performance.
• Automated Training: Automate the process of training models with fresh data
regularly to maintain consistent performance and save snapshots of the system's state
for easy rollback in online learning systems.

Explanation of Grid Search (Additional Concept)

GridSearchCV: This is a tool from Scikit-Learn that performs an exhaustive search over
specified parameter values for an estimator. It helps in finding the best combination of
hyperparameters for a given model.

• param_grid: This is a list of dictionaries, where each dictionary defines a set of
hyperparameters to search over.
o n_estimators: This parameter specifies the number of trees in the forest.
o max_features: This parameter specifies the maximum number of features to
consider when looking for the best split.
o The first dictionary searches over different combinations of n_estimators
and max_features with the default setting of bootstrap=True.
o The second dictionary adds an additional setting to search over:
bootstrap=False, with its own combinations of n_estimators and
max_features.

• forest_reg: This creates an instance of the RandomForestRegressor, which is the
model we want to tune.

• grid_search: This initializes GridSearchCV with several parameters:
o forest_reg: The estimator (model) to be tuned.
o param_grid: The parameter grid defined earlier, specifying the
hyperparameters to search over.
o cv=5: This sets the cross-validation strategy to 5-fold cross-validation. This
means the data will be split into 5 parts, and the model will be trained and
validated 5 times, each time using a different part of the data for validation and
the remaining parts for training.

o scoring='neg_mean_squared_error': This sets the scoring metric to
negative mean squared error. GridSearchCV will use this metric to evaluate
the performance of each combination of hyperparameters. The negative sign is
used because Scikit-Learn expects higher values to be better, but for mean
squared error, lower values are better.
o return_train_score=True: This ensures that the training scores for each
fold and parameter combination are stored in the results.

• fit: This method trains the GridSearchCV object using the prepared housing data
(housing_prepared) and the corresponding labels (housing_labels).

Classification
MNIST
• The MNIST dataset is a set of 70,000 small images of digits handwritten by high
school students and employees of the US Census Bureau.
• Each image is labeled with the digit it represents.
• This set has been studied so much that it is often called the “Hello World” of Machine
Learning: whenever people come up with a new classification algorithm, they are
curious to see how it will perform on MNIST.

The following code fetches the MNIST dataset

from sklearn.datasets import fetch_openml


# Fetch the MNIST dataset from OpenML
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()

Output

dict_keys(['data', 'target', 'frame', 'categories',
           'feature_names', 'target_names', 'DESCR', 'details', 'url'])

Datasets loaded by Scikit-Learn generally have a dictionary structure including:


• A DESCR key describing the dataset
• A data key containing an array with one row per instance and one column per feature
• A target key containing an array with the labels

X, y = mnist["data"], mnist["target"]
X.shape
y.shape

Output
(70000, 784)
(70000,)

There are 70,000 images, and each image has 784 features. This is because each image is
28×28 pixels, and each feature simply represents one pixel’s intensity, from 0 (white) to 255
(black).

Let’s look at one digit from the dataset. Fetch an instance’s feature vector, reshape it to a
28×28 array, and display it using Matplotlib’s imshow() function:

import matplotlib.pyplot as plt

# Convert the target to integers
y = y.astype(int)

# Select an instance (e.g., the first instance)
some_digit = X.iloc[0]

# Reshape the feature vector to a 28x28 array
some_digit_image = some_digit.values.reshape(28, 28)

# Display the digit using Matplotlib
plt.imshow(some_digit_image, cmap='gray')
plt.title(f"Label: {y[0]}")
plt.axis('off')
plt.show()
Output

The below figure shows a few more images from the MNIST dataset to give you a feel for the
complexity of the classification task.

The MNIST dataset is actually already split into a training set (the first 60,000 images) and a
test set (the last 10,000 images):

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Training a Binary Classifier

• To simplify the problem, focus on identifying a single digit, such as the number 5.
• This '5-detector' will serve as an example of a binary classifier, distinguishing
between two classes: 5 and not-5.
• Let's create the target vectors for this classification task:

y_train_5 = (y_train == 5) # True for all 5s, False for all other digits.
y_test_5 = (y_test == 5)

• Let’s pick a classifier and train it. Consider the Stochastic Gradient Descent (SGD)
classifier, using Scikit-Learn’s SGDClassifier class.
• This classifier has the advantage of being capable of handling very large datasets
efficiently.

from sklearn.linear_model import SGDClassifier


sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

Detect images of the number 5:


sgd_clf.predict([some_digit])

Output
array([ True])

The classifier guesses that this image represents a 5 (True)

Performance Measures

Measuring Accuracy Using Cross-Validation

• Let’s use the cross_val_score() function to evaluate your SGDClassifier model using
K-fold cross-validation, with three folds.
• K-fold cross-validation means splitting the training set into K folds (here, three), then
making predictions and evaluating them on each fold using a model trained on the
remaining folds.

from sklearn.model_selection import cross_val_score


cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Output
array([0.95035, 0.96035, 0.9604 ])

• Sometimes accuracy is not the preferred performance measure for classifiers,
especially when dealing with skewed datasets, i.e., when some classes are much more
frequent than others.

Confusion Matrix

• It is a table that is used to evaluate the performance of a classification algorithm.


• The general idea is to count the number of times instances of class A are classified as
class B.
• It provides a comprehensive breakdown of the predictions made by the model and
compares them to the actual outcomes. The matrix helps to understand how well the
classifier is performing, especially in distinguishing between different classes.

Components of a Confusion Matrix


A confusion matrix has the following components for a binary classification problem:

1. True Positives (TP): The number of instances correctly predicted as positive.


2. True Negatives (TN): The number of instances correctly predicted as negative.
3. False Positives (FP): The number of instances incorrectly predicted as positive (Type
I error).
4. False Negatives (FN): The number of instances incorrectly predicted as negative
(Type II error).

from sklearn.model_selection import cross_val_predict


y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

from sklearn.metrics import confusion_matrix


confusion_matrix(y_train_5, y_train_pred)

Output
array([[53892,   687],
       [ 1891,  3530]])

• Each row in a confusion matrix represents an actual class, while each column
represents a predicted class.
• The first row of this matrix considers non-5 images (the negative class): 53,892 of
them were correctly classified as non-5s (they are called true negatives), 687 were
wrongly classified as 5s (false positives).
• The second row considers the images of 5s (the positive class): 1,891 were wrongly
classified as non-5s (false negatives), while the remaining 3,530 were correctly
classified as 5s (true positives).
• A perfect classifier would have only true positives and true negatives, so its confusion
matrix would have nonzero values only on its main diagonal (top left to bottom right)

y_train_perfect_predictions = y_train_5 # pretend we reached perfection


confusion_matrix(y_train_5, y_train_perfect_predictions)

Output
array([[54579,     0],
       [    0,  5421]])

Precision

• Precision is a performance metric that measures the accuracy of the positive predictions
made by the model.
• Precision is defined as the ratio of true positive predictions to the total number of
positive predictions:

Precision = TP / (TP + FP)

Where:
• TP (True Positives) is the number of correctly predicted positive instances.
• FP (False Positives) is the number of instances incorrectly predicted as positive.

Recall (Sensitivity or True Positive Rate)


• Recall is defined as the ratio of positive instances that are correctly detected by the
classifier:

Recall = TP / (TP + FN)

Where:
• TP (True Positives) is the number of correctly predicted positive instances.
• FN (False Negatives) is the number of actual positive instances that were
incorrectly predicted as negative.

from sklearn.metrics import precision_score, recall_score


precision_score(y_train_5, y_train_pred)

Output
0.8370879772350012

recall_score(y_train_5, y_train_pred)

Output
0.6511713705958311

When it claims an image represents a 5, it is correct only 83.7% of the time. Moreover, it
only detects 65.1% of the 5s.

F1 score
• The F1 score is a metric used to evaluate the performance of a binary classification
model.
• It is the harmonic mean of precision and recall, providing a single metric that balances
both the false positives and false negatives.
• The F1 score is useful when you need to take both precision and recall into account
and is helpful when dealing with imbalanced datasets.

from sklearn.metrics import f1_score


f1_score(y_train_5, y_train_pred)

Output
0.7325171197343846
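The three scores reported for the 5-detector can be reproduced by hand from the counts in its confusion matrix (53,892 true negatives, 687 false positives, 1,891 false negatives, 3,530 true positives). The sketch uses the standard harmonic-mean form of the F1 score.

```python
# Counts taken from the 5-detector's confusion matrix in these notes.
tn, fp = 53892, 687
fn, tp = 1891, 3530

precision = tp / (tp + fp)                           # TP / (TP + FP)
recall = tp / (tp + fn)                              # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(round(precision, 4), round(recall, 4), round(f1, 4))   # 0.8371 0.6512 0.7325
```

Because the harmonic mean gives more weight to low values, the F1 score is high only when both precision and recall are high.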

The ROC Curve


• The Receiver operating characteristic (ROC) curve is another common tool used with
binary classifiers.
• The ROC curve plots the true positive rate (recall) against the false positive rate.
• The FPR is the ratio of negative instances that are incorrectly classified as positive.
FPR = 1 - TNR
• The True Negative Rate (TNR), which is the ratio of negative instances that are
correctly classified as negative. The TNR is also called specificity. Hence the ROC
curve plots sensitivity (recall) versus 1 – specificity.

How to Read a ROC Curve


• Diagonal Line: A ROC curve that lies on the diagonal line (from bottom left to top
right) represents a classifier with no discriminative power, equivalent to random
guessing.
• Above the Diagonal: The area above the diagonal represents better-than-random
performance. The closer the ROC curve is to the top-left corner, the better the model is
at distinguishing between the positive and negative classes.
• Below the Diagonal: Curves below the diagonal indicate worse-than-random
performance.

Area under the curve (AUC): One way to compare classifiers is to measure the area under
the curve. The AUC value ranges from 0 to 1.

• AUC = 1: Perfect classifier.


• AUC = 0.5: No discriminative power, equivalent to random guessing.
• AUC < 0.5: Indicates a model that is performing worse than random guessing.
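As a small illustration, roc_curve and roc_auc_score can be applied to four made-up instances. One of the two negative instances receives a higher score than one of the positives, so three of the four positive/negative pairs are ranked correctly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])                 # made-up labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8])      # made-up decision scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_scores)
print(auc)                                           # 0.75
```

An AUC of 0.75 lies between random guessing (0.5) and a perfect ranking (1).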

Let’s consider RandomForestClassifier and compare its ROC curve and ROC AUC score to
the SGDClassifier

The RandomForestClassifier’s ROC curve looks much better than the SGDClassifier’s. It
comes much closer to the top-left corner.

Multiclass Classification

• Binary classifiers distinguish between two classes, whereas multiclass classifiers
(also called multinomial classifiers) can distinguish between more than two classes.
• Some algorithms can handle multiple classes directly (direct multiclass algorithms).
Ex: Random Forest, Naive Bayes.
• Some algorithms are strictly binary classifiers. Ex: Support Vector Machine (SVM),
Linear classifiers.

Consider a system that can classify the digit images into 10 classes (from 0 to 9). There are
two multiclass strategies for building it from binary classifiers:
One-versus-All (OvA) Strategy:
• Train 10 binary classifiers, one for each digit (0 to 9).
• Classify an image by selecting the class with the highest score.
• Example: Train a 0-detector, 1-detector, etc.
One-versus-One (OvO) Strategy:
• Train a binary classifier for every pair of digits: one to distinguish 0s and 1s,
another to distinguish 0s and 2s, another for 1s and 2s, and so on.
• If there are N classes, you need N × (N − 1) / 2 classifiers.
• For 10 classes, train 45 classifiers. Classify an image by determining which class
wins the most pairwise duels.
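The classifier counts for the two strategies can be checked with Scikit-Learn's OneVsRestClassifier and OneVsOneClassifier wrappers. This sketch uses the small 8×8 digits dataset bundled with Scikit-Learn as a stand-in for MNIST; that substitution is an assumption made to keep the example quick.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Small bundled digits dataset (10 classes), used here instead of MNIST.
X, y = load_digits(return_X_y=True)

ova_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
ova_clf.fit(X, y)

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X, y)

n_classes = len(ova_clf.classes_)
print(len(ova_clf.estimators_))      # one binary classifier per class
print(len(ovo_clf.estimators_))      # n_classes * (n_classes - 1) / 2 pairwise classifiers
```

With 10 classes, the OvA wrapper trains 10 binary classifiers and the OvO wrapper trains 45.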

Algorithm Selection
• OvO Preferred: For algorithms like SVM that scale poorly with large training sets.
• OvA Preferred: For most binary classification algorithms.
• Scikit-Learn Default: Automatically applies OvA for binary classifiers, OvO for SVM.

Error Analysis

Assume we have a promising model and aim to improve it by analysing the errors it makes.
Start by looking at the confusion matrix.

• Use cross_val_predict() to make predictions


• Generate the confusion matrix
• Convert the confusion matrix into an image for better visualization

This confusion matrix looks fairly good, since most images are on the main diagonal, which
means that they were classified correctly. The 5s look slightly darker than the other digits,
which could mean that there are fewer images of 5s in the dataset or that the classifier does
not perform as well on 5s as on other digits.

Error Rate Analysis


• Normalize the confusion matrix to compare error rates.
• Focus on the errors by filling the diagonal with zeros
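The two error-rate steps can be sketched with NumPy on a made-up 3-class confusion matrix (rows are actual classes, columns are predicted classes). Dividing each row by its sum turns counts into rates, so classes with many images do not look unfairly bad.

```python
import numpy as np

# Made-up 3-class confusion matrix: rows are actual classes, columns are predictions.
conf_mx = np.array([[50,  2,  8],
                    [ 1, 40,  9],
                    [ 3,  2, 85]])

row_sums = conf_mx.sum(axis=1, keepdims=True)   # number of images in each actual class
norm_conf_mx = conf_mx / row_sums               # error rates instead of absolute counts
np.fill_diagonal(norm_conf_mx, 0)               # keep only the errors

print(norm_conf_mx)
```

In this toy matrix the last column has the largest values, meaning that images of the other classes are most often misclassified into that class.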

Insights from Error Analysis


• Rows represent actual classes, and columns represent predicted classes.
• A bright column indicates that many images are misclassified into that class.
• For instance, the column for class 8 is bright, so many images are misclassified as 8s,
while the row for class 8 shows that actual 8s are generally classified correctly.

Improving the Classifier


• Gather more training data for digits that look like 8s but are not.
• Engineer new features that would help the classifier, for example by writing an algorithm
to count the number of closed loops in digits (e.g., 8 has two, 6 has one, 5 has none).
• Preprocess the images (e.g., using Scikit-Image, Pillow, or OpenCV) to make some
patterns stand out more, such as closed loops.

Analyzing individual errors can also give insights into what the classifier is doing and why it
is failing, but it is more difficult and time-consuming.
Ex: let’s plot examples of 3s and 5s.

The two 5×5 blocks on the left show digits classified as 3s, and the two 5×5 blocks on the
right show images classified as 5s. Some of the digits that the classifier gets wrong (i.e., in
the bottom-left and top-right blocks) are so badly written that even a human would have
trouble classifying them (e.g., the 5 on the 1st row and 2nd column truly looks like a badly
written 3).

Understanding the Classifier’s Errors


• Misclassifications might be due to badly written digits or similarities between digits.
• The linear model (SGDClassifier) assigns weights to pixels, making it sensitive to
image shifting and rotation.
• Preprocess images to ensure they are well-centred and aligned to reduce errors.


Multilabel Classification
• Multilabel Classification is a type of classification where each instance can belong to
multiple classes simultaneously.
• For example, in a face-recognition system, a picture with multiple known faces
should result in multiple outputs. If a classifier can recognize Alice, Bob, and Charlie,
and sees a picture of Alice and Charlie, it should output [1, 0, 1], meaning "Alice yes,
Bob no, Charlie yes".

Example: Multilabel Classification with Digits


• To understand multilabel classification, consider an example that assigns two labels to
each digit: the first indicates whether or not the digit is large (7, 8, or 9) and the
second indicates whether or not it is odd.

Steps to Implement Multilabel Classification


1. Create Target Labels:
• y_train_large indicates if a digit is large (7, 8, or 9).
• y_train_odd indicates if a digit is odd.
• Combine these labels into a y_multilabel array.

import numpy as np


y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

2. Train the Classifier:


• Use a KNeighborsClassifier which supports multilabel classification

from sklearn.neighbors import KNeighborsClassifier


knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

3. Make Predictions:
• Predict using the trained classifier and output multiple labels.

knn_clf.predict([some_digit])

Output
array([[False, True]])
The digit 5 is indeed not large (False) and odd (True).


Multioutput Classification
• Multioutput Classification (multioutput-multiclass classification) is a generalization of
multilabel classification. In this type of classification, each label can have multiple
values, not just binary options.
• For example, each label can represent different pixel intensities ranging from 0 to 255.

Example: Removing Noise from Images


• Illustrate this with an example where the goal is to remove noise from digit images.
• The input will be a noisy image, and the output will be a clean image of the digit.

Steps to Implement Multioutput Classification


1. Create Training and Test Sets:
• Add noise to the original MNIST digit images using NumPy's randint() function.
• The noisy images are the input, and the original images are the target

2. Visualize Noisy and Clean Images:


• Before training, visualize a noisy image and its corresponding clean image. This
step helps to understand the task visually

3. Train the Classifier, Make Predictions and Clean the Image:


• Use a KNeighborsClassifier to train on the noisy images and their clean
counterparts.
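
The steps above can be sketched as follows. This is a minimal sketch, not the original listing: it assumes Scikit-Learn's built-in digits dataset (`load_digits`, 8×8 images with pixel values 0 to 16) as a stand-in for MNIST, and the noise range and the names `X_train_mod`, `y_train_mod` and `clean_digit` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

# Assumption: load_digits (8x8 images, pixel values 0..16) stands in for MNIST.
X, _ = load_digits(return_X_y=True)
X_train, X_test = X[:1500], X[1500:]

rng = np.random.RandomState(42)
# 1. Noisy images are the inputs; the original images are the targets.
X_train_mod = X_train + rng.randint(0, 5, X_train.shape)
X_test_mod = X_test + rng.randint(0, 5, X_test.shape)
y_train_mod = X_train.astype(int)   # one label (pixel intensity) per pixel

# 3. Train, then predict a cleaned-up version of one noisy test image.
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])
print(clean_digit.reshape(8, 8))
```

The classifier outputs one label per pixel, and each label can take many values, which is what makes this a multioutput problem.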


MODULE 3
Introduction: Training Models
Introduction to Training Machine Learning Models
• In earlier chapters, Machine Learning (ML) algorithms were used a lot without
knowing the details of how the models work internally. This approach works well in
many situations, but understanding the inner workings of ML models can be
beneficial.

Why Understand the Inner Workings?


• Choosing Models and Algorithms: Helps in selecting the right model and training
algorithm.
• Hyperparameters: Aids in setting appropriate hyperparameters for better performance.
• Debugging and Error Analysis: Makes it easier to debug and analyze errors.

Linear Regression

• Linear regression is a type of supervised machine learning algorithm that computes the
linear relationship between the dependent variable and one or more independent
features by fitting a linear equation to observed data.
• When there is only one independent feature, it is known as Simple Linear Regression,
and when there is more than one feature, it is known as Multiple Linear Regression.
• Similarly, when there is only one dependent variable, it is considered Univariate
Linear Regression, whereas when there is more than one dependent variable, it is
known as Multivariate Regression.

Definition: A linear model makes a prediction by simply computing a weighted sum of the
input features, plus a constant called the bias term (intercept term).
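
Written out, the definition reads as follows. The notation is assumed here because the original equation image is missing: ŷ is the predicted value, n the number of features, x_i the i-th feature value, θ_0 the bias term and θ_1 to θ_n the feature weights.

```latex
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
```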


This can be written in vectorized form
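
A sketch of the vectorized form, with notation assumed because the original equation image is missing: θ is the parameter vector (θ_0, ..., θ_n), x is the feature vector (x_0, ..., x_n) with x_0 always equal to 1, and h_θ is the hypothesis function.

```latex
\hat{y} = h_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta} \cdot \mathbf{x} = \boldsymbol{\theta}^{\top}\mathbf{x}
```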

In linear regression, the goal is to find a relationship between the dependent variable (Y) and
one or more independent variables (X). This relationship is represented by a line, known as
the best-fit line, which can be used to predict Y from X. Linear regression involves learning a
function from the given data that minimizes the error between predicted and actual values.

What is the Best Fit Line?


The best-fit line is a straight line that best represents the data on a scatter plot. It minimizes
the error between the predicted values (ŷ) and the actual values (Y). This line shows how
much Y changes with a unit change in X.
• Dependent Variable (Y): The value we want to predict.
• Independent Variable (X): The value used to make predictions.

To find the best-fit line, we need to determine the best values for the intercept θ0 and the
slope θ1. This is done using the cost function, which measures how well the model predicts
the actual values. In linear regression, we commonly use the Mean Squared Error (MSE) as
the cost function:
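
For a single feature, the MSE can be written as below. The notation is an assumption, since the original equation image is missing: the intercept is written θ_0 and the slope θ_1 (to match the bias term θ0 used elsewhere in these notes), m is the number of training instances, and x^(i), y^(i) are the i-th input and target.

```latex
J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^{2},
\qquad \hat{y}^{(i)} = \theta_0 + \theta_1 x^{(i)}
```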


Note: Interactive Visualization of Linear Regression


https://observablehq.com/@yizhe-ang/interactive-visualization-of-linear-regression

Training the Linear Regression Model


• Training a model means setting its parameters so that the model best fits the training
set. So, we need a measure of how well (or poorly) the model fits the training data.
• To train a Linear Regression model, find the value of θ that minimizes the Mean Squared
Error (MSE).

The MSE of a Linear Regression hypothesis hθ on a training set X is calculated using
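
In the general (multi-feature) case the formula is usually written as below. The notation is assumed, since the original equation image is missing: m is the number of training instances, x^(i) the feature vector of the i-th instance (including x_0 = 1) and y^(i) its target value.

```latex
\mathrm{MSE}(\mathbf{X}, h_{\boldsymbol{\theta}}) = \frac{1}{m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^{\top}\mathbf{x}^{(i)} - y^{(i)} \right)^{2}
```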

The Normal Equation: Finding the Best Parameters for Linear Regression
To find the best parameters (θ) for a linear regression model, use a mathematical formula
called the Normal Equation. This equation directly computes the optimal values for θ that
minimize the cost function (usually the Mean Square Error).
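
The Normal Equation in its standard form, with notation assumed because the original equation image is missing: X is the matrix of training inputs (one row per instance, with a leading column of 1s), y is the vector of target values, and θ̂ is the value of θ that minimizes the cost function.

```latex
\hat{\boldsymbol{\theta}} = \left( \mathbf{X}^{\top}\mathbf{X} \right)^{-1} \mathbf{X}^{\top} \mathbf{y}
```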


Example: Generating Data and Using the Normal Equation


1. Generate Data:
import numpy as np

# Generate random linear-looking data


X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

2. Compute θ Using the Normal Equation


# Add x0 = 1 to each instance
X_b = np.c_[np.ones((100, 1)), X]

# Calculate θ using the Normal Equation


theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

3. Check the Results: The original function we used to generate the data was
y = 4 + 3x1 + Gaussian noise.

Let’s see what θ the equation found:


print(theta_best)

Output (exact values vary between runs because the data is random):
[[4.21509616]
 [2.77011339]]

4. Make Predictions:
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]

y_predict = X_new_b.dot(theta_best)
print(y_predict)

Output:
[[4.21509616]
 [9.75532293]]

5. Plot the Results


import matplotlib.pyplot as plt

plt.plot(X_new, y_predict, "r-")


plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()


Gradient Descent

Gradient Descent is a powerful optimization algorithm used to find the minimum value of a
function. It's widely used in machine learning to minimize the cost function of models like
linear regression.

Conceptual Explanation
Imagine you are lost in the mountains on a foggy day. You want to get to the lowest point in
the valley, but you can only feel the slope of the ground under your feet. The best way to
reach the bottom is to keep moving downhill in the direction where the slope is the steepest.
This is how Gradient Descent works: it adjusts the parameters step by step to minimize the
cost function, similar to how you would move downhill to minimize your altitude.

Working of Gradient Descent


• Initialization: Start with random values for the parameters (θ).
• Compute the Gradient: Measure the slope (gradient) of the cost function with respect
to the parameters.
• Update Parameters: Adjust the parameters in the direction that reduces the cost
function the most. This is done using the learning rate, which determines the size of
the steps.
• Iterate: Repeat the process until the parameters converge to the minimum value.


An important parameter in Gradient Descent is the size of the steps, determined by the
learning rate hyperparameter. If the learning rate is too small, the algorithm will take very
small steps, and it will take a long time to reach the minimum.

If the learning rate is too high, the algorithm might overshoot the minimum, causing it to
diverge and fail to find the optimal solution.
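
The effect of the learning rate can be seen on a one-dimensional toy problem. This is a minimal sketch under stated assumptions: the function f(x) = x², the starting point and the three learning rates are chosen for illustration only.

```python
def gradient_descent(eta, n_steps=50, x0=5.0):
    """Minimize f(x) = x**2 (gradient 2*x) with a fixed learning rate eta."""
    x = x0
    for _ in range(n_steps):
        x = x - eta * 2 * x
    return x

x_small = gradient_descent(eta=0.001)  # tiny steps: still far from the minimum at 0
x_good = gradient_descent(eta=0.1)     # reasonable steps: very close to 0
x_large = gradient_descent(eta=1.1)    # overshoots more at every step: diverges

print(x_small, x_good, x_large)
```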

Two main challenges with Gradient Descent


• Local Minima: Some cost functions have multiple local minima. If the algorithm
starts in a region that leads to a local minimum, it might not find the global minimum.
• Plateaus: Some regions of the cost function may be nearly flat, so the algorithm takes
a long time to cross them; if you stop too early, you will never reach the global
minimum.


Convex Cost Function


For linear regression, the Mean Squared Error (MSE) cost function is convex, meaning it has
a single global minimum and no local minima. This makes Gradient Descent effective
because it will always converge to the global minimum, given a properly chosen learning
rate.

Feature Scaling
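The cost function has the shape of a bowl, but it can be an elongated bowl if the features have
very different scales. The figure below shows Gradient Descent on a training set where features 1 and 2
have the same scale (on the left), and on a training set where feature 1 has much smaller
values than feature 2 (on the right). With features on very different scales, Gradient Descent
still reaches the minimum but takes much longer, so make sure all features have a similar
scale before training.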
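
One common way to bring features to a similar scale is standardization. The sketch below uses a made-up training set (the values are hypothetical); Scikit-Learn's StandardScaler performs the same transformation.

```python
import numpy as np

# Hypothetical training set: feature 1 is in the thousands, feature 2 is below 1.
X = np.array([[1000.0, 0.1],
              [2000.0, 0.3],
              [3000.0, 0.2],
              [4000.0, 0.4]])

# Standardization: subtract each column's mean and divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)
```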


Batch Gradient Descent

• To implement Gradient Descent, calculate the gradient of the cost function with respect to
each model parameter θj.
• In other words, calculate how much the cost function will change if you change θj just a
little bit. This is known as a partial derivative.

Instead of computing these partial derivatives individually, compute them all in one go. The
gradient vector, ∇θ MSE(θ), contains all the partial derivatives of the cost function (one for
each model parameter).
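
For the MSE cost function, the partial derivatives and the gradient vector work out as below. The notation is assumed, since the original equation images are missing: m is the number of training instances, X the input matrix (with x_0 = 1 for every instance) and y the target vector.

```latex
\frac{\partial}{\partial \theta_j} \mathrm{MSE}(\boldsymbol{\theta})
  = \frac{2}{m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^{\top}\mathbf{x}^{(i)} - y^{(i)} \right) x_j^{(i)},
\qquad
\nabla_{\boldsymbol{\theta}} \mathrm{MSE}(\boldsymbol{\theta})
  = \frac{2}{m} \mathbf{X}^{\top} \left( \mathbf{X}\boldsymbol{\theta} - \mathbf{y} \right)
```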

The gradient vector points uphill (the direction of steepest ascent), so to minimize the cost
function, move in the opposite direction (downhill).

Gradient Descent Step


To move downhill, update the parameters using the following steps:
• Calculate the Gradient Vector: Find ∇θMSE(θ)
• Determine Step Size: Multiply the gradient vector by the learning rate 𝜂.
• Update Parameters: Subtract this value from the current parameter values.
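
The three steps above can be sketched in NumPy. This is a minimal sketch under stated assumptions: the data is generated the same way as in the Normal Equation example, and the learning rate (0.1) and number of iterations (1000) are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]          # add x0 = 1 to each instance

eta = 0.1                                 # learning rate (assumed value)
n_iterations = 1000
theta = rng.standard_normal((2, 1))       # random initialization

for _ in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # gradient vector of the MSE
    theta = theta - eta * gradients                  # step in the downhill direction

print(theta)
```

With these settings the result matches the least-squares solution found by the Normal Equation on the same data.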


The figure below shows the first 10 steps of Gradient Descent using three different learning
rates. The dashed line represents the starting point.

Stochastic Gradient Descent

• The main problem with Batch Gradient Descent is that it uses the entire training set to
compute the gradients at each step, which makes it very slow for large
training sets because it involves a lot of data manipulation at each iteration.
• In Stochastic Gradient Descent (SGD), instead of using the whole training set, SGD
picks a random instance from the training set at each step and computes the gradients
based on that single instance.
• This makes SGD much faster because it deals with very little data at every iteration.
Only one instance needs to be in memory at a time, which allows training on huge
datasets.
Characteristics of SGD
• Due to its stochastic (random) nature, the cost function does not decrease smoothly.
Instead, it bounces up and down but generally decreases over time.
• Once the algorithm stops, the final parameter values are close to the optimal but not
exactly at the minimum.
• The randomness helps SGD jump out of local minima, increasing the chance of
finding the global minimum compared to Batch Gradient Descent.


Learning Rate and Learning Schedule


• Randomness is good to escape from local optima, but bad because it means that the
algorithm can never settle at the minimum. One solution to this dilemma is to
gradually reduce the learning rate.
• The Learning Rate determines the size of the steps taken towards the minimum.
• The steps start out large, which helps make quick progress and escape local minima, then
get smaller and smaller, allowing the algorithm to settle at the global minimum.
• The function that determines the learning rate at each iteration is called the learning
schedule. If the learning rate is reduced too quickly, the algorithm may get stuck in a local
minimum. If the learning rate is reduced too slowly, it may jump around the minimum
for a long time and end up with a suboptimal solution if you halt training too early.
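
Stochastic Gradient Descent with a simple learning schedule can be sketched as below. This is a minimal sketch under stated assumptions: the data matches the earlier linear example, and the schedule `t0 / (t + t1)` with `t0 = 5`, `t1 = 50` is one illustrative choice among many.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]               # add x0 = 1 to each instance

n_epochs = 50
t0, t1 = 5, 50                                 # learning schedule hyperparameters (assumed values)

def learning_schedule(t):
    return t0 / (t + t1)                       # learning rate shrinks as t grows

theta = rng.standard_normal((2, 1))            # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = rng.integers(m)         # pick one random training instance
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)   # gradient on that single instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

print(theta)
```

The result should land near the true parameters (4 and 3), but not exactly on the least-squares solution.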

Mini-batch Gradient Descent


• Mini-batch Gradient Descent is a variation of Gradient Descent. It combines ideas
from Batch Gradient Descent and Stochastic Gradient Descent.
• Mini-batch GD computes the gradients on small random sets of instances called
mini-batches.
• The main advantage of Mini-batch GD is that it benefits from hardware optimization of
matrix operations, especially when using GPUs, which speeds up computations.
• Mini-batch GD may have a harder time escaping local minima compared to SGD.
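
A minimal Mini-batch Gradient Descent sketch follows. The batch size (20), learning rate (0.05) and number of epochs (100) are assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]               # add x0 = 1 to each instance

eta, n_epochs, batch_size = 0.05, 100, 20     # assumed hyperparameter values
theta = rng.standard_normal((2, 1))

for epoch in range(n_epochs):
    shuffled = rng.permutation(m)             # new random order every epoch
    for start in range(0, m, batch_size):
        idx = shuffled[start:start + batch_size]
        xb, yb = X_b[idx], y[idx]
        gradients = 2 / len(idx) * xb.T @ (xb @ theta - yb)   # gradient on the mini-batch
        theta = theta - eta * gradients

print(theta)
```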


Polynomial Regression

Sometimes, data is more complex than a straight line can represent. A linear model can still be
used to fit nonlinear data by adding powers of each feature as new features. This technique is
called Polynomial Regression.

Example: Let’s generate some nonlinear data, based on a simple quadratic equation plus
some noise.

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

• m = 100 data points.


• X: Random data points between -3 and 3.
• y: Quadratic equation with noise.

Use Scikit-Learn’s PolynomialFeatures class to add polynomial features.

from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=2, include_bias=False)


X_poly = poly_features.fit_transform(X)

X[0] # Original feature


# Output: array([-0.75275929])

X_poly[0] # Original feature and its square


# Output: array([-0.75275929, 0.56664654])


• degree=2: Adds the square of each feature.


• include_bias=False: Excludes the bias term from the transformation.
• X_poly: Contains original features plus their squares.

Train a LinearRegression model on the extended dataset.

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_

# Output: (array([1.78134581]), array([[0.93366893, 0.56456263]]))


Regularized Linear Models

Why Regularize?
• Prevent Overfitting: Regularization helps reduce overfitting by constraining the
model, making it less flexible and less likely to overfit the data.
• Degrees of Freedom: The fewer degrees of freedom a model has, the less likely it is to
fit the noise in the data.

Types of Regularized Models


• Ridge Regression
• Lasso Regression
• Elastic Net

1. Ridge Regression
• Ridge Regression, also known as Tikhonov regularization, adds a regularization term
to the Linear Regression cost function.
• The regularization term added is proportional to the sum of the squared weights.
This term forces the model to keep the weights 𝜃 as small as possible.
• Purpose: To fit the data while keeping the model simple by having smaller weights.

Ridge Regression Cost Function:

The Mean Squared Error (MSE) plus a term that penalizes large weights. The bias term 𝜃0 is
not regularized.
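
Written out under a common convention (an assumption here, since the original equation image is missing): α is the regularization strength, n is the number of features, and the penalty carries a factor of 1/2.

```latex
J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \, \frac{1}{2} \sum_{i=1}^{n} \theta_i^{2}
```

The sum starts at i = 1, which is why the bias term θ0 is not regularized. With α = 0 this reduces to plain Linear Regression.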


Lasso Regression (Least Absolute Shrinkage and Selection Operator Regression)


Like Ridge Regression, it adds a regularization term to the Linear Regression cost function,
but with a key difference in how it penalizes the weights.
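
The Lasso cost function in its usual form, with notation assumed because the original equation image is missing (α is the regularization strength, n the number of features):

```latex
J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \sum_{i=1}^{n} \lvert \theta_i \rvert
```

The key difference is that the penalty uses the absolute values of the weights rather than their squares. In practice this tends to push the weights of the least important features to exactly zero, which is where the "selection" in the name comes from.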


Elastic Net
• Elastic Net is a middle ground between Ridge Regression and Lasso Regression.
• The regularization term is a simple mix of both Ridge and Lasso’s regularization
terms, and you can control the mix ratio r.
• When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is
equivalent to Lasso Regression
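
The Elastic Net cost function can be written as below. The notation is an assumption (the original equation image is missing) and follows the same convention as the Ridge and Lasso penalties: α is the regularization strength and r the mix ratio.

```latex
J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta})
  + r \, \alpha \sum_{i=1}^{n} \lvert \theta_i \rvert
  + \frac{1 - r}{2} \, \alpha \sum_{i=1}^{n} \theta_i^{2}
```

Setting r = 0 leaves only the squared (Ridge) penalty, and setting r = 1 leaves only the absolute-value (Lasso) penalty.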

Elastic Net is generally preferred over Lasso, since Lasso may behave erratically when the
number of features is greater than the number of training instances or when several features
are strongly correlated.

Early Stopping
Early stopping is a regularization technique used in iterative learning algorithms like Gradient
Descent to prevent overfitting. Instead of running the algorithm for a fixed number of
iterations or until the cost function converges, you monitor the model's performance on a
validation set and stop training when the validation error stops improving.

How Does Early Stopping Work?


Training and Validation Error:
• As the model trains, the error (e.g., Root Mean Square Error, RMSE) on the training
set decreases. Similarly, the error on the validation set also decreases initially.
Detection of Overfitting:
• After some time, the validation error stops decreasing and starts to increase, indicating
that the model is beginning to overfit the training data.
• Overfitting means the model is too closely tailored to the training data, losing its
ability to generalize to new data.
Stopping Training:
• With early stopping, you halt the training process when the validation error reaches its
minimum, before it starts increasing again.
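
The procedure above can be sketched with plain NumPy. This is a minimal sketch under stated assumptions: the quadratic data, the degree-10 polynomial features, the learning rate and the 40/20 train/validation split are all illustrative. Rather than interrupting the loop, it keeps a copy of the parameters with the lowest validation error, which amounts to rolling back to that point.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 60
x = 6 * rng.random((m, 1)) - 3
y = 0.5 * x**2 + x + 2 + rng.standard_normal((m, 1))

# High-degree polynomial features (degree 10 is an assumed value), standardized.
def features(x, mean=None, std=None):
    P = np.hstack([x**d for d in range(1, 11)])
    if mean is None:
        mean, std = P.mean(axis=0), P.std(axis=0)
    return np.c_[np.ones((len(x), 1)), (P - mean) / std], mean, std

X_train, mean, std = features(x[:40])
X_val, _, _ = features(x[40:], mean, std)     # reuse the training-set statistics
y_train, y_val = y[:40], y[40:]

theta = np.zeros((X_train.shape[1], 1))
eta = 0.01
best_val_rmse, best_epoch, best_theta = np.inf, None, None

for epoch in range(2000):
    gradients = 2 / len(X_train) * X_train.T @ (X_train @ theta - y_train)
    theta = theta - eta * gradients
    val_rmse = np.sqrt(np.mean((X_val @ theta - y_val) ** 2))
    if val_rmse < best_val_rmse:           # keep the best model seen so far
        best_val_rmse, best_epoch, best_theta = val_rmse, epoch, theta.copy()

print(best_epoch, best_val_rmse)
```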


Logistic Regression

Logistic Regression is used to estimate the probability that an instance belongs to a particular
class. If the estimated probability is greater than 50%, then the model predicts that the
instance belongs to that class (positive class or 1), or else it predicts that it does not (negative
class or “0”). This makes it a binary classifier. (outliers, output <1 or >1)

Estimating Probabilities
A Logistic Regression model computes a weighted sum of the input features plus a bias term,
and it outputs the logistic of the result.
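
In the standard notation (assumed here because the original equation image is missing), the estimated probability p̂ and the logistic (sigmoid) function σ are:

```latex
\hat{p} = h_{\boldsymbol{\theta}}(\mathbf{x}) = \sigma\!\left( \boldsymbol{\theta}^{\top}\mathbf{x} \right),
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}}
```

The sigmoid maps any real number to a value between 0 and 1, which is why its output can be read as a probability.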

Once the Logistic Regression model has estimated the probability P=hθ(x) that an instance x
belongs to the positive class, it can make its prediction ŷ easily
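
The prediction rule, written out with the 50% threshold described above. Assigning a probability of exactly 0.5 to the positive class is a convention assumed here.

```latex
\hat{y} =
\begin{cases}
0 & \text{if } \hat{p} < 0.5 \\
1 & \text{if } \hat{p} \ge 0.5
\end{cases}
```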


Training and Cost Function

The objective of training is to set the parameter vector θ so that the model estimates high
probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).
This idea is captured by the cost function for a single training instance x.
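
The usual form of this per-instance cost, with notation assumed because the original equation image is missing (p̂ is the estimated probability that x is positive):

```latex
c(\boldsymbol{\theta}) =
\begin{cases}
-\log(\hat{p}) & \text{if } y = 1 \\
-\log(1 - \hat{p}) & \text{if } y = 0
\end{cases}
```

The cost grows very large when the model assigns a probability close to 0 to a positive instance, or close to 1 to a negative instance.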

The cost function over the whole training set is simply the average cost over all training
instances. It can be written in a single expression called the log loss.
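
The log loss in its standard form, with notation assumed because the original equation image is missing (m is the number of training instances, y^(i) the label and p̂^(i) the estimated probability for the i-th instance):

```latex
J(\boldsymbol{\theta}) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log\!\left( \hat{p}^{(i)} \right)
       + \left( 1 - y^{(i)} \right) \log\!\left( 1 - \hat{p}^{(i)} \right) \right]
```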

The partial derivative of the cost function with regard to the jth model parameter θj is the
average, over all training instances, of the prediction error multiplied by the jth feature value.
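
The gradient of the log loss can be checked numerically. The sketch below is illustrative only: the toy data is random, and the helper names `sigmoid`, `log_loss` and `log_loss_gradient` are hypothetical. The analytic gradient uses the standard formula stated in the code comment and is compared against a finite-difference estimate.

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def log_loss(theta, X, y):
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_loss_gradient(theta, X, y):
    # j-th component: (1/m) * sum_i (sigmoid(theta^T x_i) - y_i) * x_ij
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

rng = np.random.default_rng(0)
X = np.c_[np.ones(20), rng.standard_normal((20, 2))]   # x0 = 1 plus two toy features
y = (rng.random(20) < 0.5).astype(float)                # random 0/1 labels
theta = rng.standard_normal(3)

# Finite-difference check of each partial derivative.
eps = 1e-6
numeric = np.array([
    (log_loss(theta + eps * e, X, y) - log_loss(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(log_loss_gradient(theta, X, y), numeric)
```

Because this gradient has the same shape as the one for Linear Regression, the Gradient Descent variants described earlier can be used to train the model.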


Softmax Regression

• The Logistic Regression model can be generalized to support multiple classes directly,
without having to train and combine multiple binary classifiers. This is called Softmax
Regression, or Multinomial Logistic Regression
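
A minimal sketch of how Softmax Regression turns class scores into probabilities follows. It uses the standard definition (score s_k(x) = θ_k · x per class, then the softmax function), which is assumed here because the original equations are missing; the parameter matrix `Theta` and the two instances are made-up values.

```python
import numpy as np

def softmax(scores):
    """Turn one row of class scores per instance into class probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

# Hypothetical parameters: 3 classes, 2 features plus the bias input x0 = 1.
Theta = np.array([[ 0.5, -1.0,  2.0],
                  [ 0.0,  1.0, -1.0],
                  [-0.5,  0.5,  0.5]])      # one row of parameters per class
X = np.array([[1.0,  2.0, 0.5],
              [1.0, -1.0, 3.0]])            # two instances

scores = X @ Theta.T                        # s_k(x) = theta_k . x
probas = softmax(scores)
y_pred = probas.argmax(axis=1)              # predict the class with the highest probability
print(probas, y_pred)
```

Each row of `probas` sums to 1, and the predicted class is simply the one with the highest score.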


Support Vector Machines


• A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning
model, capable of performing linear or nonlinear classification, regression, and even
outlier detection.
• SVMs are well suited for classification of complex but small- or medium-sized
datasets.

Linear SVM Classification

Hyperplane
A hyperplane is a decision boundary that separates data points with
different class labels. The SVM classifier separates data points using the hyperplane with the
largest margin. This hyperplane is known as the maximum margin hyperplane,
and the linear classifier it defines is known as the maximum margin classifier.

Support Vectors
Support vectors are the sample data points that are closest to the hyperplane. These data
points determine the position of the separating line or hyperplane, because the margin is
measured from them.

Margin
The margin is the separation gap between the two lines that pass through the closest data
points of each class. It is calculated as the perpendicular distance from the separating line to
the support vectors (the closest data points). In SVMs, we try to maximize this separation
gap so that we get the maximum margin.


In SVMs, the main objective is to select a hyperplane with the maximum possible margin
between support vectors in the given dataset.
SVM searches for the maximum margin hyperplane in the following two-step process:

1. Generate hyperplanes that segregate the classes in the best possible way. There are
many hyperplanes that might classify the data. We should look for the best hyperplane,
the one that represents the largest separation, or margin, between the two classes.

2. So, choose the hyperplane so that distance from it to the support vectors on each side
is maximized. If such a hyperplane exists, it is known as the maximum margin
hyperplane and the linear classifier it defines is known as a maximum margin
classifier.

The following diagram illustrates the concept of maximum margin and maximum margin
hyperplane.

source: https://www.kaggle.com/code/prashant111/svm-classifier-tutorial#1.-Introduction-to-Support-Vector-Machines-

Soft Margin Classification


Soft Margin Classification is an approach used in Support Vector Machines (SVMs) to handle
cases where the data is not perfectly linearly separable. Whereas Hard Margin Classification
requires a clear separation without any misclassified points, Soft Margin Classification allows
some misclassifications to achieve a better overall model.

Hard Margin Classification:


• Requires all data points to be correctly classified with a clear margin. Only works if
the data is perfectly linearly separable.
• Very sensitive to outliers. Even a single outlier can make it impossible to find a
suitable decision boundary.


Soft Margin Classification:


• Allows some points to be within the margin or even misclassified to create a more
robust model.
• Introduces a balance between maximizing the margin and minimizing classification
errors.

Soft Margin Classification Working


• The SVM algorithm finds a compromise between maximizing the margin and
allowing some errors (margin violations).
• This is controlled by a hyperparameter C, which determines the trade-off between a
larger margin and fewer margin violations.
• A small 𝐶 value encourages a larger margin, even if it means more margin violations
(points inside the margin or on the wrong side of it).
• A large 𝐶 value aims to classify all training examples correctly, resulting in a smaller
margin.
• In the figure, on the left, using a low C value (C = 1) the margin is quite large, but many
instances end up on the street. On the right, using a high C value the classifier makes fewer
margin violations but ends up with a smaller margin.
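
The effect of C can be observed on toy data. This is a minimal sketch under stated assumptions: the two overlapping clusters are randomly generated, the two C values (0.001 and 100) are illustrative, and the street width is computed as 2 / ‖w‖ from the fitted weight vector.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Hypothetical toy data: two overlapping 2D clusters.
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

street_width = {}
for C in (0.001, 100):
    svm_clf = SVC(kernel="linear", C=C).fit(X, y)
    street_width[C] = 2 / np.linalg.norm(svm_clf.coef_)   # distance between the margin lines
    print(f"C={C}: street width = {street_width[C]:.2f}, "
          f"support vectors = {svm_clf.n_support_.sum()}")
```

The small C value yields the wider street, at the cost of more instances ending up inside it.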

The objective function in Soft Margin Classification includes a penalty for misclassified
points. This penalty is proportional to the distance of the points from the correct side of the
margin. Mathematically, this is expressed by adding a term to the cost function that penalizes
errors, weighted by C.
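
One standard way to write this objective is shown below. All symbols are assumptions, since the notes give no formula: w is the weight vector, b the bias, t^(i) ∈ {−1, +1} the class of the i-th instance, and ζ^(i) ≥ 0 a slack variable measuring how much that instance is allowed to violate the margin.

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\zeta}} \;\; \frac{1}{2}\mathbf{w}^{\top}\mathbf{w} + C \sum_{i=1}^{m} \zeta^{(i)}
\quad \text{subject to} \quad
t^{(i)}\left( \mathbf{w}^{\top}\mathbf{x}^{(i)} + b \right) \ge 1 - \zeta^{(i)}, \;\; \zeta^{(i)} \ge 0
```

Minimizing the first term widens the margin, while the second term penalizes margin violations; C sets the balance between the two.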


Nonlinear SVM Classification

Linear SVM classifiers are efficient and work well in many cases, but some datasets are not
linearly separable. One way to handle these nonlinear datasets is to add more features, like
polynomial features, which can sometimes make the dataset linearly separable.

Example: Consider a dataset with one feature x1. This dataset is not linearly separable (as
shown in the left plot of the figure). However, if you add a second feature x2 = (x1)², the
resulting 2D dataset becomes linearly separable.

Consider a dataset for binary classification in which the data points are shaped as two
interleaving half circles.
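
Such a dataset can be generated with Scikit-Learn's `make_moons`, and the polynomial-features approach can be sketched as a pipeline. The degree (3), C value (10), noise level and pipeline step names below are assumed values for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Two interleaving half circles with a little noise.
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),   # add polynomial features
    ("scaler", StandardScaler()),                      # SVMs are sensitive to feature scales
    ("svm_clf", LinearSVC(C=10, max_iter=10_000)),     # linear SVM on the expanded features
])
polynomial_svm_clf.fit(X, y)
train_accuracy = polynomial_svm_clf.score(X, y)
print(train_accuracy)
```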


Polynomial Kernel

• Adding polynomial features is simple to implement and can work great with all sorts
of Machine Learning algorithms, but at a low polynomial degree it cannot deal with
very complex datasets, and with a high polynomial degree it creates a huge number of
features, making the model too slow.

• The kernel trick is a powerful technique used in Support Vector Machines (SVMs) to
handle nonlinear datasets without explicitly mapping the data to a higher-dimensional
space. Instead, it uses kernel functions to compute the similarity between data points
in this higher dimensional space directly, saving computational resources and
simplifying the process.
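
With Scikit-Learn, the kernel trick is available through the `SVC` class. The sketch below is illustrative: the moons data, the 3rd-degree polynomial kernel and the values `coef0=1` and `C=5` are assumptions, not settings taken from the notes.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# kernel="poly" applies the kernel trick: no polynomial features are ever created.
poly_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=3, coef0=1, C=5),
)
poly_kernel_svm_clf.fit(X, y)
train_accuracy = poly_kernel_svm_clf.score(X, y)
print(train_accuracy)
```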

Adding Similarity Features

• Another way to tackle nonlinear problems is to add features computed using a similarity
function that measures how much each instance resembles a particular landmark.
• For example, let’s take the one-dimensional dataset and add two landmarks to it at
x1 = –2 and x1 = 1. Next, let’s define the similarity function to be the Gaussian Radial
Basis Function (RBF) with γ = 0.3

• It is a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at
the landmark). Now we are ready to compute the new features.
• For example, let’s look at the instance x1 = –1: it is located at a distance of 1 from the
first landmark, and 2 from the second landmark.


• Therefore, its new features are x2 = exp(–0.3 × 1²) ≈ 0.74 and x3 = exp(–0.3 × 2²) ≈ 0.30.
The plot on the right of the figure shows the transformed dataset (dropping the original
features). As you can see, it is now linearly separable.
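
The worked example above can be reproduced in a few lines. The sketch assumes the usual form of the Gaussian RBF, exp(−γ × distance²), since the formula image is missing from the notes; the helper name `gaussian_rbf` is hypothetical.

```python
import math

def gaussian_rbf(x, landmark, gamma):
    """Similarity between a 1D instance and a landmark: 1 at the landmark, near 0 far away."""
    return math.exp(-gamma * (x - landmark) ** 2)

x1 = -1.0
x2 = gaussian_rbf(x1, landmark=-2.0, gamma=0.3)   # distance 1 from the first landmark
x3 = gaussian_rbf(x1, landmark=1.0, gamma=0.3)    # distance 2 from the second landmark
print(round(x2, 2), round(x3, 2))   # 0.74 0.3
```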

Gaussian RBF Kernel

The models are trained with different values of hyperparameters gamma (γ) and C. Increasing
gamma makes the bell-shape curve narrower (see the above left plot of Figure), and as a
result each instance’s range of influence is smaller: the decision boundary ends up being
more irregular, wiggling around individual instances. Conversely, a small gamma value
makes the bell-shaped curve wider, so instances have a larger range of influence, and the
decision boundary ends up smoother. So γ acts like a regularization hyperparameter: if your
model is overfitting, you should reduce it, and if it is underfitting, you should increase it.
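
The influence of γ and C can be explored with Scikit-Learn's `SVC`. This is a minimal sketch under stated assumptions: the moons data and the two (γ, C) pairs are illustrative values, one heavily regularized and one barely regularized.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

scores = {}
for gamma, C in [(0.1, 0.001), (5, 1000)]:
    rbf_svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=C))
    rbf_svm_clf.fit(X, y)
    scores[(gamma, C)] = rbf_svm_clf.score(X, y)   # accuracy on the training set
print(scores)
```

The large γ and C pair fits the training set much more closely, which is the behavior to rein in when a model overfits.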

Figure --- SVM classifiers using an RBF kernel


SVM Regression

Support Vector Machine (SVM) regression is a versatile method that supports both linear and
nonlinear regression. The primary goal in SVM regression is to fit as many data points as
possible within a predefined margin while limiting margin violations (i.e., points outside the
margin)
• In SVM classification, the objective is to maximize the margin between classes.
However, in SVM regression, the aim is to fit as many data points as possible within a
margin (referred to as the "street").
• The width of this street is controlled by the hyperparameter 𝜖. Only points that fall
outside this margin affect the model. A larger ϵ results in a wider street, leading to
fewer points outside the margin, while a smaller 𝜖 results in a narrower street.

Below figure shows two linear SVM Regression models trained on some random linear data,
one with a large margin (ϵ = 1.5) and the other with a small margin (ϵ = 0.5).

• Linear SVM Regression: The model tries to find a linear function that fits within the
margin.
• For nonlinear regression tasks, kernelized SVM models are used. The kernel trick
allows the SVM to operate in a higher-dimensional space without explicitly
transforming the data.

Figure shows SVM Regression on a random quadratic training set, using a 2nd-degree
polynomial kernel. There is little regularization on the left plot (i.e., a large C value), and
much more regularization on the right plot (i.e., a small C value).
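
Both kinds of SVM regression are available in Scikit-Learn. The sketch below is illustrative: the random linear and quadratic data are generated here, and the hyperparameter values (ϵ = 1.5 for the linear model; degree 2, C = 100, ϵ = 0.1 for the kernelized one) are assumptions.

```python
import numpy as np
from sklearn.svm import SVR, LinearSVR

rng = np.random.default_rng(42)

# Linear SVM regression: epsilon sets the width of the street.
X = 2 * rng.random((50, 1))
y_lin = (4 + 3 * X + rng.standard_normal((50, 1))).ravel()
svm_reg = LinearSVR(epsilon=1.5, max_iter=10_000, random_state=42).fit(X, y_lin)

# Kernelized SVM regression on quadratic data with a 2nd-degree polynomial kernel.
X_q = 2 * rng.random((50, 1)) - 1
y_q = (0.2 + 0.1 * X_q + 0.5 * X_q**2 + rng.standard_normal((50, 1)) / 10).ravel()
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1).fit(X_q, y_q)

lin_pred = svm_reg.predict([[1.0]])
poly_pred = svm_poly_reg.predict([[0.5]])
print(lin_pred, poly_pred)
```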


MODULE 4
Decision Trees
Introduction to Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification
and regression. The goal is to create a model that predicts the value of a target variable by
learning simple decision rules inferred from the data features

Key Terms in Decision Trees:


• Root Node: The starting point of the tree.
• Splitting: Dividing a node into multiple sub-nodes.
• Decision Node: A node that splits into more sub-nodes.
• Leaf Node: A node that does not split further and represents an outcome.
• Pruning: Removing sub-nodes to simplify the tree.
• Branch: A part of the tree with multiple nodes

Figure - Decision tree Structure

Figure: A decision tree for the concept PlayTennis


How Decision Trees Operate:

• A decision tree looks like an upside-down tree, starting with a root node.
• From the root node, the tree splits into decision nodes, and these further split until they
reach leaf nodes, which show the outcomes.
• Each decision node represents a condition, and the branches represent the possible
answers

Types of decision trees in machine learning


Decision trees in machine learning can either be classification trees or regression trees.

1. Classification trees determine whether an event happened or not. This involves a
“yes” or “no” outcome. It deals with categories.
Example: Is an animal a reptile or mammal?

2. Regression trees predict continuous values based on previous data or information
sources. It deals with numerical outcomes.
Example: Predicting house prices based on features like size and location.


Training and Visualizing a Decision Tree


To understand Decision Trees, let’s just build one and take a look at how it makes
predictions.

Example Using the Iris Dataset:


The iris dataset is a famous dataset used to practice classification algorithms. It contains data
about different iris flowers, including their petal lengths and widths, which we will use to
classify them into different species

Step-by-Step Process:
1. Load the Dataset and Train the Model

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()
X = iris.data[:, 2:]  # Use petal length and width as features
y = iris.target       # Target variable (species)

# Initialize and train the Decision Tree Classifier
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

2. Visualize the Decision Tree


You can visualize the trained Decision Tree by first using the export_graphviz()
function to output a graph definition file called iris_tree.dot.

from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],  # Petal length and width
    class_names=iris.target_names,         # Species names
    rounded=True,
    filled=True
)


3. Convert the .dot File to an Image


Use the Graphviz tool to convert the .dot file to a more accessible format like PNG

$ dot -Tpng iris_tree.dot -o iris_tree.png

Decision tree

Making Predictions

Suppose you find an iris flower and you want to classify it.
→ Start at the Root Node:
• Ask: is the petal length smaller than 2.45 cm?
• If yes, move to the left child node.
• If no, move to the right child node
→ Left Child Node (Depth 1, Left):
• This node is a leaf node, meaning it does not ask any more questions.
• Prediction: The flower is classified as Iris-Setosa
→ Right Child Node (Depth 1, Right):
• This node is not a leaf node, so it asks another question: is the petal width
smaller than 1.75 cm?
• If yes, move to the left child node (Depth 2, Left).
• If no, move to the right child node (Depth 2, Right).
→ Depth 2, Left Node:
• Prediction: The flower is classified as Iris-Versicolor.
→ Depth 2, Right Node:
• Prediction: The flower is classified as Iris-Virginica.


Attributes of a Node:
Samples: Number of training instances it applies to.
• Example: 100 training instances have a petal length > 2.45 cm (Depth 1, Right), and
among them, 54 have a petal width < 1.75 cm (Depth 2, Left).

Value: Number of training instances of each class in that node.


• Example: The bottom-right node (Depth 2, Right) applies to 0 Iris-Setosa, 1 Iris-
Versicolor, and 45 Iris-Virginica.

Gini Impurity: Measures the node's impurity. A pure node (all instances belong to one class)
has a Gini score of 0.
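
As a rough illustration, a node's Gini score can be computed from its value counts as 1 minus the sum of the squared class ratios. The sketch below assumes that standard definition; the helper name gini_from_counts is hypothetical, and the first two count lists are the depth-2 node values quoted in this section.

```python
def gini_from_counts(counts):
    """Gini impurity of a node, given the number of training instances per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

left_gini = gini_from_counts([0, 49, 5])    # depth-2 left node
right_gini = gini_from_counts([0, 1, 45])   # depth-2 right node
pure_gini = gini_from_counts([50, 0, 0])    # a hypothetical pure node

print(round(left_gini, 3), round(right_gini, 3), pure_gini)
```

The nearly pure right node gets a much lower score than the left node, which still mixes two classes.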

Estimating Class Probabilities


• A Decision Tree can also estimate the probability that an instance belongs to a
particular class k.
• First it traverses the tree to find the leaf node for this instance, and then it returns the
ratio of training instances of class k in this node.
• For example, consider a flower whose petals are 5 cm long and 1.5 cm wide. The
corresponding leaf node is the depth-2 left node, so the Decision Tree should output
the following probabilities

• 0% for Iris-Setosa (0/54)


• 90.7% for Iris-Versicolor (49/54)


• and 9.3% for Iris-Virginica (5/54)
The predicted class is Iris-Versicolor (class 1) since it has the highest probability
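
The same estimate can be checked in Scikit-Learn with predict_proba() and predict(). This is a minimal sketch that retrains the depth-2 tree from the earlier example; random_state=42 is an assumption added here only to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]   # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

flower = [[5, 1.5]]    # petal length 5 cm, petal width 1.5 cm
probabilities = tree_clf.predict_proba(flower)[0]
predicted_class = tree_clf.predict(flower)[0]

print(probabilities)                       # estimated probability per class
print(iris.target_names[predicted_class])  # name of the predicted species
```

Note that every instance that lands in the same leaf receives the same probabilities, regardless of where it sits inside that region.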

The CART Training Algorithm

CART (Classification and Regression Trees) is a decision tree technique employed in
machine learning to address both classification and regression tasks. It identifies patterns and
relationships within a dataset and constructs a tree structure based on the variable values
present in the data.

Algorithm
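• The algorithm first splits the training set into two subsets using a single feature k and a
threshold tk (e.g., “petal length ≤ 2.45 cm”).
• To choose k and tk, it searches for the pair (k, tk) that produces the purest subsets
(weighted by their size).
• The cost function that the algorithm tries to minimize is given by

$$J(k, t_k) = \frac{m_{left}}{m} G_{left} + \frac{m_{right}}{m} G_{right}$$

where G_left and G_right measure the impurity of the left and right subsets, m_left and
m_right are the numbers of instances in those subsets, and m is the number of instances in
the node being split.

• Once it has successfully split the training set in two, it splits the subsets using the same
logic, then the sub-subsets, and so on, recursively.
• It stops recursing once it reaches the maximum depth, or if it cannot find a split that will
reduce impurity.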


Algorithm

Step 1. Initialization:
• Begin with the entire dataset as the root node.
Step 2. Splitting Criteria:
• For each node, consider all possible splits for each feature.
• The goal is to find the split that minimizes the impurity (for classification) or the
variance (for regression) in the resulting sub-nodes.

Classification: The impurity measure used is the Gini impurity.

Regression: The variance measure typically used is the mean squared error (MSE).

Step 3. Choose the Best Split


• Evaluate all possible splits for all features.
• Choose the split that results in the lowest Gini impurity (for classification) or the
lowest variance (for regression).
Step 4. Split the Node
• Split the dataset into two subsets based on the best split.
• Create two new child nodes and assign the data points to these nodes based on the
split criteria.
Step 5. Repeat
For each child node, repeat steps 2 to 4 until one of the stopping conditions is met:
• Maximum depth of the tree is reached.
• Minimum number of samples in a node is reached.
• No further reduction in impurity or variance can be achieved.
• Node contains only samples of a single class (for classification).
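
Steps 2 and 3 can be sketched for classification as an exhaustive search over features and thresholds. This is an illustrative sketch, not Scikit-Learn's implementation: the function names gini and best_split and the toy arrays are hypothetical, and candidate thresholds are assumed to be the midpoints between consecutive distinct feature values.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def best_split(X, y):
    """Return (feature index, threshold, cost) of the split with the lowest weighted Gini."""
    m, n = X.shape
    best = (None, None, float("inf"))
    for k in range(n):
        values = np.unique(X[:, k])
        # Candidate thresholds: midpoints between consecutive distinct values
        thresholds = (values[:-1] + values[1:]) / 2
        for t in thresholds:
            left = y[X[:, k] <= t]
            right = y[X[:, k] > t]
            # Impurity of the two subsets, weighted by their size
            cost = len(left) / m * gini(left) + len(right) / m * gini(right)
            if cost < best[2]:
                best = (k, t, cost)
    return best

# Hypothetical toy data: feature 0 separates the two classes, feature 1 does not
X_toy = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

feature, threshold, cost = best_split(X_toy, y_toy)
print(feature, threshold, cost)
```

Steps 4 and 5 then apply the same search to each of the two resulting subsets until a stopping condition is met.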


Computational Complexity

• Making predictions requires traversing the Decision Tree from the root to a leaf. Decision
Trees are generally approximately balanced, so this traversal goes through roughly
O(log2(m)) nodes, where m is the number of training instances. Each node checks only one
feature, which makes predictions very fast, even for large training sets.
• The prediction complexity is therefore O(log2(m)), independent of the number of features
in your data.
• During training, the algorithm compares all features on all samples at each node. This
results in a training complexity of O(n × m log(m)), where n is the number of
features and m is the number of training instances.

Gini Impurity or Entropy?

Information gain
• Information gain measures how well a given attribute separates the training examples
according to their target classification.
• It is used to select among the candidate attributes at each step while growing the tree.
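• Information gain is the expected reduction in entropy caused by partitioning the
examples according to this attribute.
• The information gain, Gain(S, A), of an attribute A relative to a collection of examples
S is defined as

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)$$

where Values(A) is the set of possible values of attribute A, and S_v is the subset of S for
which attribute A has value v.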

Entropy
• Entropy measures the impurity of a collection of examples.
• Given a collection S, containing positive and negative examples of some target
concept, the entropy of S relative to this Boolean classification is

$$Entropy(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$$

Where,
p+ is the proportion of positive examples in S
p- is the proportion of negative examples in S.
• The entropy is 0 if all members of S belong to the same class
• The entropy is 1 when the collection contains an equal number of positive and
negative examples
• If the collection contains unequal numbers of positive and negative examples, the
entropy is between 0 and 1
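
The two definitions above can be sketched in a few lines of Python. The function names and the example counts are assumptions chosen for illustration: a collection with 9 positive and 5 negative examples, which some attribute partitions into subsets with 6 positive / 2 negative and 3 positive / 3 negative examples.

```python
from math import log2

def entropy(counts):
    """Entropy of a collection, given the number of examples in each class."""
    total = sum(counts)
    # Classes with zero examples contribute nothing (0 * log2(0) is taken as 0)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subsets_counts):
    """Expected reduction in entropy after partitioning the parent into subsets."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets_counts)
    return entropy(parent_counts) - weighted

s_entropy = entropy([9, 5])
gain = information_gain([9, 5], [[6, 2], [3, 3]])

print(round(s_entropy, 3), round(gain, 3))
print(entropy([7, 7]), entropy([14, 0]))   # balanced collection vs. pure collection
```

When growing a tree, the attribute with the highest gain would be chosen for the split.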


Gini Index
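The Gini Index, or Gini Impurity, measures the probability that a randomly chosen instance in
a node would be misclassified if it were labeled at random according to the class distribution
in that node. It is used to determine the best feature to split the data at each node in the
tree. For a node in which a fraction p_k of the instances belongs to class k, it is computed as

$$Gini = 1 - \sum_{k} p_k^2$$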

• Gini = 0: All elements in the node belong to a single class, hence the node is pure.
• Gini > 0: There are multiple classes present in the node, indicating impurity.
• Maximum Impurity (0.5): For a binary classification, this occurs when the classes are
perfectly split (50% each).

At each node of decision tree, the algorithm calculates the Gini Index for all possible splits
and chooses the split that results in the lowest Gini Index for the child nodes, indicating the
purest possible nodes.

So should you use Gini impurity or entropy?


• Gini Impurity: Slightly faster to compute, so it is a good default choice. It tends to
isolate the most frequent class in its own branch of the tree.
• Entropy: Tends to produce slightly more balanced trees, but is slightly slower to compute
because it requires logarithms. Most of the time, both criteria lead to similar trees.

Regularization Hyperparameters in Decision Trees

• Decision Trees: They make minimal assumptions about the training data, allowing the
tree structure to adapt closely to the data. This can lead to overfitting.
• Nonparametric Models: These models have an undefined number of parameters
before training, meaning they can fit closely to the training data. Decision Trees are an
example of this.
• Parametric Models: These models have a predetermined number of parameters,
reducing the risk of overfitting but increasing the risk of underfitting.

To avoid overfitting the training data, you need to restrict the Decision Tree’s freedom
during training. This is called regularization, and it is done by setting hyperparameters that
limit the tree's growth and complexity.


Regularization Hyperparameters in Scikit-Learn: The DecisionTreeClassifier class has several
parameters that restrict the shape of the Decision Tree:

1. max_depth: Limits the maximum depth of the tree. Reducing max_depth restricts
how deep the tree can grow, thus controlling overfitting.
2. min_samples_split: The minimum number of samples required to split an internal
node. Increasing this value means nodes must have more samples to split, which
reduces tree complexity.
3. min_samples_leaf: The minimum number of samples a leaf node must have. Prevents
creating leaf nodes with very few samples, reducing overfitting.
4. min_weight_fraction_leaf: Similar to min_samples_leaf but expressed as a fraction
of the total number of weighted instances. Ensures leaf nodes contain a minimum
fraction of the dataset, adding regularization
5. max_leaf_nodes: Limits the maximum number of leaf nodes. Restricts the overall size
and complexity of the tree
6. max_features: The maximum number of features to consider for splitting at each
node. Limits the number of features evaluated, simplifying the model and reducing the
risk of overfitting.

Adjusting max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf,
max_leaf_nodes, and max_features helps balance the model's complexity and its ability to
generalize to new data.
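
To see the effect of one of these hyperparameters, the sketch below trains an unrestricted tree and a regularized tree on the same data and compares their size. The dataset (make_moons with the settings shown), the value min_samples_leaf=4, and the random seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Assumed toy data: a small, noisy two-class dataset
X_moons, y_moons = make_moons(n_samples=100, noise=0.25, random_state=53)

# Default hyperparameters: the tree keeps growing until every leaf is pure
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_moons, y_moons)

# Regularized: every leaf must contain at least 4 training instances
reg_tree = DecisionTreeClassifier(min_samples_leaf=4, random_state=42).fit(X_moons, y_moons)

print("unrestricted:", deep_tree.get_depth(), "levels,", deep_tree.get_n_leaves(), "leaves")
print("regularized: ", reg_tree.get_depth(), "levels,", reg_tree.get_n_leaves(), "leaves")
```

The unrestricted tree fits the training set perfectly, which is exactly the behavior that tends to generalize poorly on noisy data.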


Regression

Decision Trees are not only useful for classification tasks but also for regression tasks.
Regression with Decision Trees involves predicting continuous values instead of classes.

Building a Regression Tree


Use the DecisionTreeRegressor class from Scikit-Learn.

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)

Each leaf node in the regression tree predicts a continuous value. The predicted value is the
average of the target values of all training instances in that leaf node.

Example:
• Suppose you want to predict for a new instance with x1 = 0.6. Then start at the root of
the tree and traverse it according to the feature values until you reach a leaf node.
• The leaf node predicts value = 0.1106, which is the average target value of the training
instances in that node. This prediction results in a Mean Squared Error (MSE) of
0.0151 over these 110 instances.

This model’s predictions are represented in the figure below. With max_depth=2 the
predictions are coarse; increasing max_depth to 3 produces more, smaller regions and
therefore more detailed predictions. The predicted value for each region is always the
average target value of the instances in that region.
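
The claim that a leaf predicts the average target value of its training instances can be checked directly. This is a minimal sketch on assumed data: a noisy quadratic curve with a single feature, generated here only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data: a noisy quadratic curve with a single feature x1
rng = np.random.default_rng(42)
X_quad = rng.random((200, 1))
y_quad = 4 * (X_quad[:, 0] - 0.5) ** 2 + rng.normal(0, 0.1, 200)

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X_quad, y_quad)

x_new = np.array([[0.6]])
prediction = tree_reg.predict(x_new)[0]

# Find the training instances that fall in the same leaf as x_new
leaf_of_x_new = tree_reg.apply(x_new)[0]
same_leaf = tree_reg.apply(X_quad) == leaf_of_x_new
leaf_mean = y_quad[same_leaf].mean()

print(prediction, leaf_mean)  # the two values agree up to rounding
```

Here apply() returns the index of the leaf that each instance ends up in, which makes it easy to group the training instances by region.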


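For regression, the CART (Classification and Regression Tree) algorithm splits the training
set to minimize the Mean Squared Error (MSE) rather than the impurity. The cost function that
the algorithm tries to minimize is

$$J(k, t_k) = \frac{m_{left}}{m} MSE_{left} + \frac{m_{right}}{m} MSE_{right}$$

$$MSE_{node} = \frac{1}{m_{node}} \sum_{i \in node} \left(\hat{y}_{node} - y^{(i)}\right)^2, \qquad \hat{y}_{node} = \frac{1}{m_{node}} \sum_{i \in node} y^{(i)}$$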

Decision Trees are prone to overfitting when dealing with regression tasks. Without any
regularization (i.e., using the default hyperparameters), you get the predictions shown on
the left of the figure below, which clearly overfit the training set. Just setting
min_samples_leaf=10 results in a much more reasonable model, shown on the right of the
figure.


Question Bank

1. Define Decision Tree. How do Decision Trees operate?
2. Explain the types of decision trees in machine learning
3. Describe the process of training and visualizing decision tree using an example
4. Explain CART algorithm to train decision trees
5. Discuss the computational complexity of making predictions with Decision Trees
6. Compare Gini Impurity and Entropy in the context of decision trees with example
7. Define Information Gain and explain its significance in decision trees with an example
8. Describe the concept of Entropy and how it measures impurity in a dataset and
consider an appropriate example
9. Explain the Gini Index and its role in determining the best feature for splitting data.
10. Discuss the importance of regularization hyperparameters in decision trees.
11. Explain the process and purpose of "Pruning" in decision trees.
12. Differentiate between classification trees and regression trees in machine learning with
an example


Ensemble Learning and Random Forests


Introduction to Ensemble Learning

Ensemble Learning is a powerful technique in machine learning that combines the predictions
of multiple models to produce a better overall prediction

The Concept of Ensemble Learning

Wisdom of the Crowd:


• Imagine asking a complex question to thousands of random people and then
combining their answers. Often, this combined answer is better than the answer from a
single expert. This is known as the "wisdom of the crowd."
• Similarly, combining the predictions from multiple models (classifiers or regressors)
often results in more accurate predictions than any individual model.

Definition:
• A group of predictors (models) is called an ensemble.
• The technique of combining multiple models is called Ensemble Learning.
• An algorithm that implements Ensemble Learning is called an Ensemble Method.

Example - Random Forest:


• Train multiple Decision Tree classifiers on different random subsets of the training
data. To make a prediction, get predictions from all the individual trees and choose the
class that gets the most votes. This ensemble of Decision Trees is called a Random
Forest.

Types of Ensemble Methods

1. Bagging (Bootstrap Aggregating): Train multiple models on different random subsets
of the training data and combine their predictions.
2. Boosting: Train multiple models sequentially, each trying to correct the errors of the
previous model. Combine their predictions to improve accuracy.
3. Stacking: Train multiple models and then train a meta-model to combine their
predictions.
4. Random Forests: A specific type of bagging that uses Decision Trees. Combines the
predictions of multiple Decision Trees to improve accuracy and robustness.


Voting Classifiers

• Voting Classifiers are a type of ensemble learning method that combine the
predictions of multiple classifiers to improve accuracy
• Imagine you have trained several classifiers, each achieving around 80% accuracy. For
example, you might have a Logistic Regression classifier, an SVM classifier, a
Random Forest classifier, and a K-Nearest Neighbors classifier. A simple way to
create an even better classifier is to combine their predictions.

Hard Voting:
• The most straightforward method is to aggregate the predictions from each classifier
and predict the class that gets the most votes. This is known as a hard voting
classifier.
• Surprisingly, this majority-vote classifier often achieves higher accuracy than the best
individual classifier in the ensemble. If each classifier is a weak learner (meaning it
does only slightly better than random guessing), the ensemble can still be a strong
learner (achieving high accuracy), provided there are a sufficient number of weak
learners and they are sufficiently diverse.
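
A hard voting classifier can be sketched with Scikit-Learn's VotingClassifier. The dataset (make_moons), the choice of three base classifiers, and the random seeds below are assumptions made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumed toy data
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(random_state=42)),
    ],
    voting="hard",  # predict the class that gets the most votes
)
voting_clf.fit(X_train, y_train)

# Compare each fitted classifier with the ensemble on the test set
for name, clf in voting_clf.named_estimators_.items():
    print(name, clf.score(X_test, y_test))
print("voting", voting_clf.score(X_test, y_test))
```

Whether the ensemble beats its best member depends on how diverse the classifiers' errors are, so the scores are worth comparing rather than assuming.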


Example:
• Suppose you have a slightly biased coin that has a 51% chance of coming up heads,
and 49% chance of coming up tails. If you toss it 1,000 times, you will get more or
less 510 heads and 490 tails, and hence a majority of heads.
• You will find that the probability of obtaining a majority of heads after 1,000
tosses is close to 75%
• The probability of getting a majority of heads increases with the number of tosses due
to the law of large numbers. With 10,000 tosses, the probability of a majority heads is
over 97%.
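
These probabilities follow from the binomial distribution and can be checked numerically. The sketch below uses a hypothetical helper, prob_majority_heads, which counts a strict majority (more heads than tails); if a 500-500 tie is excluded in this way, the 1,000-toss value comes out slightly below the 75% quoted above. Log-probabilities are used so that very small terms do not underflow.

```python
from math import exp, lgamma, log

def prob_majority_heads(n_tosses, p_heads):
    """Probability of getting strictly more heads than tails in n_tosses."""
    total = 0.0
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        # Log of the binomial probability of exactly k heads
        log_pmf = (lgamma(n_tosses + 1) - lgamma(k + 1) - lgamma(n_tosses - k + 1)
                   + k * log(p_heads) + (n_tosses - k) * log(1 - p_heads))
        total += exp(log_pmf)
    return total

p_1000 = prob_majority_heads(1000, 0.51)
p_10000 = prob_majority_heads(10000, 0.51)
print(p_1000, p_10000)
```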

Figure shows 10 series of biased coin tosses. You can see that as the number of tosses
increases, the ratio of heads approaches 51%. Eventually all 10 series end up so close to 51%
that they are consistently above 50%.

• Similarly, suppose you build an ensemble containing 1,000 classifiers that are
individually correct only 51% of the time. If you predict the majority voted class, you
can hope for up to 75% accuracy!
• However, this high accuracy assumes all classifiers are perfectly independent and
make uncorrelated errors. In reality, since they are trained on the same data, they are
likely to make similar errors, reducing the ensemble’s overall accuracy.


Bagging and Pasting

Bagging and Pasting are ensemble learning techniques used to improve the accuracy and
robustness of machine learning models.

Creating Diverse Predictors:


• One way to create a diverse set of classifiers is to use different training algorithms.
• Another approach is to use the same training algorithm for every predictor but train
them on different random subsets of the training set.

Bagging (Bootstrap Aggregating):


• Each predictor is trained on a random subset of the training set where sampling is done
with replacement, so some instances may be repeated in the same subset. This method is
called bagging.
• Once all predictors are trained, their predictions are aggregated. For classification
tasks, the most frequent prediction is chosen (like a hard voting classifier). For
regression tasks, the average prediction is used.
• Each individual predictor has a higher bias than if it were trained on the full training
set, but aggregation reduces both bias and variance. The ensemble typically ends up with a
similar bias but a lower variance than a single predictor trained on the full training set.

Pasting:
• Each predictor is trained on a random subset of the training set where sampling is done
without replacement, so instances are not repeated in the same subset. This method is
called pasting.
• Similar to bagging, predictions from all predictors are aggregated to make the final
prediction.
As the figure above shows, predictors can be trained in parallel, using different CPU
cores or servers, which makes both bagging and pasting scalable. Predictions can also
be made in parallel, which speeds up the overall process.


Advantages of Bagging and Pasting


• Reduction in Variance: By training on different subsets, the overall model's variance
is reduced compared to a single predictor trained on the entire training set.
• Scalability: The ability to train and predict in parallel makes these methods suitable
for large datasets and complex models.

Example of Bagging and Pasting


• Bagging: Imagine training 10 decision tree classifiers. Each tree is trained on a subset
of the training data created by randomly sampling with replacement. The final
prediction for a new instance is based on the majority vote of these 10 trees.
• Pasting: Similarly, you train 10 decision tree classifiers, but this time each tree is
trained on a subset created by random sampling without replacement. The final
prediction is also based on the majority vote.
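
The difference between the two sampling schemes can be seen by drawing instance indices with NumPy. This is an illustrative sketch: the training set size, the pasting subset size, and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000            # size of a hypothetical training set
indices = np.arange(m)

# Bagging: draw m instances with replacement (some repeat, some are left out)
bagging_sample = rng.choice(indices, size=m, replace=True)
unique_fraction = len(np.unique(bagging_sample)) / m

# Pasting: draw a subset without replacement (no instance repeats)
pasting_sample = rng.choice(indices, size=m // 2, replace=False)
has_duplicates = len(np.unique(pasting_sample)) < len(pasting_sample)

print(unique_fraction)   # fraction of distinct instances in the bagging sample
print(has_duplicates)    # False for pasting
```

The fraction of distinct instances in the bagging sample is the share of the training set that a single predictor actually sees.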

Bagging and Pasting in Scikit-Learn

• Scikit-Learn offers a simple API for both bagging and pasting with the
BaggingClassifier class (or BaggingRegressor for regression).
• The following code trains an ensemble of 500 Decision Tree classifiers, each trained
on 100 training instances randomly sampled from the training set with replacement
(bagging). To use pasting instead, just set bootstrap=False. The n_jobs
parameter tells Scikit-Learn the number of CPU cores to use for training and
predictions.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Create a bagging classifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)

# Train the bagging classifier
bag_clf.fit(X_train, y_train)

# Make predictions
y_pred = bag_clf.predict(X_test)

BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
max_samples=100, bootstrap=True, n_jobs=-1): This creates an ensemble of 500
Decision Trees. Each tree is trained on 100 random samples from the training set with
replacement.


n_jobs=-1: Utilizes all available CPU cores for parallel processing, making training faster.
The figure below compares the decision boundary of a single Decision Tree with the decision
boundary of a bagging ensemble of 500 trees, both trained on the moons dataset.

• A single Decision Tree might have a complex decision boundary that overfits the
training data.
• A bagging ensemble of trees will generally have a smoother decision boundary, which
reduces variance and improves generalization.

Out-of-Bag Evaluation
• In bagging, each predictor is trained on a random subset of the training set with
replacement. On average, about 63% of the training instances are used for training
each predictor. The remaining 37% of the instances are not used and are called out-of-
bag (OOB) instances. Each predictor has a different set of OOB instances.
• OOB instances are not seen by the predictor during training, so they can be used to
evaluate the predictor's performance. This provides a way to evaluate the ensemble
without needing a separate validation set. The overall performance of the ensemble
can be assessed by averaging the OOB evaluations of all predictors.
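
In Scikit-Learn, an out-of-bag evaluation can be requested by setting oob_score=True when creating a BaggingClassifier; the result is then available in the oob_score_ attribute after training. The dataset, ensemble size, and seeds in this sketch are assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed toy data
X_train, y_train = make_moons(n_samples=500, noise=0.30, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=200,
    bootstrap=True, oob_score=True, random_state=42)
bag_clf.fit(X_train, y_train)

print(bag_clf.oob_score_)  # accuracy estimated on the out-of-bag instances
```

This score can be compared with the accuracy on a held-out test set to see how close the out-of-bag estimate is.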


Random Patches and Random Subspaces

Bagging isn't limited to sampling training instances but it can also sample features. The
BaggingClassifier class supports sampling the features. This is controlled by two
hyperparameters: max_features and bootstrap_features.
• max_features: Specifies the number of features to sample.
• bootstrap_features: If set to True, features are sampled with replacement

Random Patches: Sampling both training instances and features is called the Random
Patches method. This method is used when dealing with high-dimensional data (e.g., images).

In Scikit-Learn:
• bootstrap=True: Sample training instances with replacement.
• max_samples=<fraction>: Fraction of training instances to sample.
• bootstrap_features=True: Sample features with replacement.
• max_features=<fraction>: Fraction of features to sample.

Random Subspaces: All training instances are used, but features are sampled.

In Scikit-Learn:
• bootstrap=False: Sample training instances without replacement.
• max_samples=1.0: Use all training instances.
• bootstrap_features=True: Sample features with replacement.
• max_features=<fraction>: Fraction of features to sample.

Benefits
• Increased Diversity: Sampling features adds more diversity to the predictors.
• Bias-Variance Tradeoff: This approach generally increases bias but decreases
variance, improving generalization.
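
As a sketch of the Random Subspaces settings listed above, the following code keeps all training instances and samples half of the features for each predictor. The iris dataset, the ensemble size, and max_features=0.5 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)   # 4 features

# Random Subspaces: keep all training instances, sample half of the features
subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=10,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42)
subspace_clf.fit(X_iris, y_iris)

# Feature indices drawn for the first three predictors
for features in subspace_clf.estimators_features_[:3]:
    print(features)
```

Setting bootstrap=True and max_samples below 1.0 in the same call would turn this into the Random Patches method.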


Random Forests
Random forest is a supervised learning algorithm that combines multiple Decision Trees to
improve the accuracy and stability of predictions. The “forest” it builds is an ensemble of
decision trees, trained with the bagging method.

How Does a Random Forest Work?


Training Phase:
• Create multiple subsets of the original training data using bootstrapping (sampling
with replacement).
• Train a Decision Tree on each subset. Each tree is grown to its maximum depth.
• At each split in a tree, a random subset of features is considered for splitting instead of
considering all features. This adds randomness and reduces correlation between trees.
Prediction Phase:
• For a new data point, each tree in the forest makes a prediction.
• For classification tasks, the final prediction is made by taking the majority vote of all
the trees.
• For regression tasks, the final prediction is made by averaging the predictions of all
the trees.

Example Code - Here’s how to train a Random Forest classifier with 500 trees, each limited
to a maximum of 16 leaf nodes, using all available CPU cores:

from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)

# Train the classifier
rnd_clf.fit(X_train, y_train)

# Make predictions
y_pred_rf = rnd_clf.predict(X_test)

With a few exceptions, a RandomForestClassifier has all the hyperparameters of a
DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a
BaggingClassifier to control the ensemble itself.

The Random Forest algorithm introduces extra randomness when growing trees; instead of
searching for the very best feature when splitting a node, it searches for the best feature
among a random subset of features. This results in a greater tree diversity, which trades a
higher bias for a lower variance, generally yielding an overall better model.


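The following BaggingClassifier is roughly equivalent to the above
RandomForestClassifier:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

max_features="sqrt": At each split, only a random subset of about √n features (where n is
the number of features) is considered, which reproduces the extra randomness of a Random
Forest. Note that splitter="random" does something different: it randomizes the candidate
thresholds rather than restricting the features considered.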


Boosting

Boosting refers to any Ensemble method that can combine several weak learners into a strong
learner. The general idea of most boosting methods is to train predictors sequentially, each
trying to correct its predecessor.

The most popular boosting methods are


• AdaBoost (Adaptive Boosting)
• Gradient Boosting

AdaBoost

• AdaBoost is a machine learning algorithm that combines multiple weak learners to
create a strong learner. A weak learner is a model that performs slightly better than
random guessing. Each new learner pays more attention to the training instances that its
predecessors misclassified, which is achieved by increasing the weights of those instances.
• For example, to build an AdaBoost classifier, a first base classifier (such as a Decision
Tree) is trained and used to make predictions on the training set. The relative weight of
misclassified training instances is then increased. A second classifier is trained using
the updated weights, it again makes predictions on the training set, the weights are
updated, and so on.

The figure below shows the decision boundaries of five consecutive predictors on the moons
dataset. The first classifier gets many instances wrong, so their weights get boosted. The
second classifier therefore does a better job on these instances, and so on. The plot on the
right represents the same sequence of predictors except that the learning rate is halved.


Once all predictors are trained, the ensemble makes predictions very much like bagging or
pasting, except that predictors have different weights depending on their overall accuracy on
the weighted training set.

AdaBoost algorithm
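• Each instance weight w(i) is initially set to 1/m. A first predictor is trained and its
weighted error rate r1 is computed on the training set (Equation 1).

Equation 1 - Weighted error rate of the jth predictor:

$$r_j = \frac{\sum_{i:\ \hat{y}_j^{(i)} \neq y^{(i)}} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}$$

where ŷj(i) is the jth predictor's prediction for the ith instance.

• The predictor’s weight αj is then computed using Equation 2, where η is the learning
rate hyperparameter (defaults to 1). The more accurate the predictor is, the higher its
weight will be. If it is just guessing randomly, then its weight will be close to zero.
However, if it is most often wrong (i.e., less accurate than random guessing), then its
weight will be negative.

Equation 2 - Predictor weight:

$$\alpha_j = \eta \log \frac{1 - r_j}{r_j}$$

• Next the instance weights are updated using Equation 3: the misclassified instances are
boosted. All the instance weights are then normalized (i.e., divided by their sum).

Equation 3 - Weight update rule, for i = 1, 2, ..., m:

$$w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\ w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)} \end{cases}$$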


• Finally, a new predictor is trained using the updated weights, and the whole process is
repeated.
• The algorithm stops when the desired number of predictors is reached, or when a
perfect predictor is found.

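To make predictions, AdaBoost computes the predictions of all the predictors and weighs
them using the predictor weights αj. The predicted class is the one that receives the majority
of weighted votes (see Equation 4).

Equation 4 - AdaBoost predictions:

$$\hat{y}(x) = \underset{k}{\operatorname{argmax}} \sum_{j:\ \hat{y}_j(x) = k} \alpha_j$$

where the sum runs over the predictors that predict class k for the instance x.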
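
One round of the weight computations described above can be traced with NumPy. This is a sketch on made-up labels and predictions (10 instances, 3 of them misclassified), not a full AdaBoost implementation; the formulas used are the weighted error rate, the predictor weight η log((1 − r) / r), and the boost-then-normalize update.

```python
import numpy as np

m = 10
eta = 1.0                                   # learning rate
weights = np.full(m, 1 / m)                 # initial instance weights
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1])   # hypothetical predictions, 3 mistakes

misclassified = y_pred != y_true

# Weighted error rate of the predictor
r = weights[misclassified].sum() / weights.sum()

# Predictor weight: larger when the predictor is more accurate
alpha = eta * np.log((1 - r) / r)

# Boost the misclassified instances, then normalize the weights
weights[misclassified] *= np.exp(alpha)
weights /= weights.sum()

print(r, alpha)
print(weights)
```

With η = 1, the misclassified instances end up holding half of the total weight after normalization, so the next predictor cannot ignore them.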

Gradient Boosting

• Gradient Boosting is a powerful machine learning technique used for regression and
classification tasks. It builds an ensemble of models sequentially, each one correcting
the errors of its predecessor.
• Models are added to the ensemble one at a time, and each new model corrects
the errors of the combined previous models.
• Instead of adjusting instance weights like AdaBoost, Gradient Boosting fits new
models to the residual errors (the difference between the actual and predicted values)
of the existing ensemble. Typically, Decision Trees are used as the base learners in
Gradient Boosting.
• The Learning Rate hyperparameter scales the contribution of each new model,
controlling the step size of each iteration. Setting a low learning rate usually improves
generalization by preventing overfitting, though it requires more iterations.
• To avoid overfitting, training can be stopped early when the validation error stops
improving.
• Stochastic Gradient Boosting: This technique introduces randomness by training each
model on a random subset of the data, which can reduce overfitting and improve
performance.
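The effect of the learning rate can be sketched without any library. The following is a minimal illustration, assuming one-dimensional inputs, squared-error loss, and one-split regression stumps as base learners. The helper names fit_stump, gradient_boost and mse and the toy data are hypothetical, and starting the ensemble from the mean of the targets is a choice made for this sketch.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs (one split, two leaf means)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # assumes at least two distinct x values
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees, learning_rate):
    """Fit n_trees stumps, each one on the residuals of the current ensemble."""
    base = sum(ys) / len(ys)                 # initial constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink the new model's contribution by the learning rate.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]
few_trees = gradient_boost(xs, ys, n_trees=5, learning_rate=0.1)
many_trees = gradient_boost(xs, ys, n_trees=200, learning_rate=0.1)
```

With a learning rate of 0.1, five stumps leave a large training error and many more are needed to fit the data, which is the trade-off described above. Scikit-Learn's GradientBoostingRegressor exposes the same idea through its learning_rate and n_estimators parameters.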


Steps to Implement Gradient Boosting

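# Initial model: fit the first tree on the original targets.
# X and y are assumed to hold the training inputs and targets.
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

# Second model: fit a tree on the residual errors of the first tree.
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

# Third model: fit a tree on the residual errors of the second tree.
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

# Ensemble prediction: sum the predictions of all three trees.
# X_new is assumed to hold the new instances to predict.
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))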

The figure below shows the predictions of these three trees in the left column, and the
ensemble’s predictions in the right column. In the first row, the ensemble has just one tree, so
its predictions are exactly the same as the first tree’s predictions. In the second row, a new
tree is trained on the residual errors of the first tree. On the right you can see that the
ensemble’s predictions are equal to the sum of the predictions of the first two trees. Similarly,
in the third row another tree is trained on the residual errors of the second tree, and the
ensemble’s predictions are the sum of the predictions of all three trees.


Figure - Gradient Boosting


Stacking

Stacking (or stacked generalization) is an ensemble method that combines multiple models to
improve prediction accuracy. Unlike simple aggregation methods like voting, stacking uses
another model to learn how to best combine the outputs of the base models.

How Stacking Works

• First Layer - Base Models: Multiple models are trained on the training data. These
can be any type of model (e.g., decision trees, linear regression, etc.).
• Hold-out Set: The training dataset is split into two subsets: one for training the base
models and one that is held out. The trained base models then make predictions on the
hold-out set, producing one prediction per base model for each held-out instance.
• Second Layer - Blender (Meta Learner): The predictions from the base models are
used as input features to train a new model called the blender or meta learner. The
target values for training the blender are the actual values from the hold-out set.
• Making Predictions: To make a prediction on new data, the input is first passed
through the base models. The predictions from the base models are then fed into the
blender, which makes the final prediction.


Key Points
• Blender Training: The blender learns to optimize the combination of base model
predictions.
• Hold-out Set: Ensures that the predictions used to train the blender are independent
and not biased by the training data of the base models.
• Multiple Layers: It's possible to add multiple layers of models, but this increases
complexity and computational cost.
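The hold-out procedure can be sketched as follows. This is a deliberately small illustration, assuming a one-dimensional regression task, two simple base models (a mean predictor and a least-squares line), and a blender that only learns a single mixing weight on the hold-out predictions. All function names and the toy data are hypothetical.

```python
def fit_mean(xs, ys):
    """Base model 1: always predicts the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """Base model 2: ordinary least-squares line y = a + b * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

def fit_blender(base_preds, targets, steps=100):
    """Blender: pick the mixing weight w that minimizes squared error on the hold-out set."""
    def sse(w):
        return sum((w * p1 + (1 - w) * p2 - t) ** 2
                   for (p1, p2), t in zip(base_preds, targets))
    w = min((i / steps for i in range(steps + 1)), key=sse)
    return lambda p1, p2: w * p1 + (1 - w) * p2

# Toy data following y = 2x + 1.
xs = [float(i) for i in range(20)]
ys = [2 * x + 1 for x in xs]
train_x, train_y = xs[::2], ys[::2]      # subset 1: trains the base models
hold_x, hold_y = xs[1::2], ys[1::2]      # subset 2: hold-out set for the blender

base_models = [fit_mean(train_x, train_y), fit_line(train_x, train_y)]
hold_preds = [(base_models[0](x), base_models[1](x)) for x in hold_x]
blender = fit_blender(hold_preds, hold_y)

def stacked_predict(x):
    """New data goes through the base models first, then through the blender."""
    return blender(base_models[0](x), base_models[1](x))
```

On this toy data the line fits perfectly, so the blender learns to put all of its weight on the second base model. A real blender would usually be a full model (for example a linear or tree-based regressor) rather than a single weight.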


Question Bank

1. Define ensemble learning and explain the concept of the "wisdom of the crowd" in the
context of machine learning.
2. What is an ensemble method, and how does it improve prediction accuracy?
3. Describe the differences between bagging, boosting, and stacking.
4. Explain how a random forest combines multiple decision trees to make predictions.
5. What is a hard voting classifier? Describe how it works with an example.
6. Explain why combining multiple classifiers that are weak learners can still result in a
strong learner.
7. Differentiate between bagging and pasting in terms of their sampling methods.
8. Provide a real-world example where bagging could be used effectively.
9. Write a Python code snippet to implement a bagging classifier using Scikit-Learn with
decision trees as base estimators.
10. What is out-of-bag (OOB) evaluation, and how is it useful in bagging methods?
11. Explain how OOB evaluation provides an estimate of the ensemble's performance
without using a separate validation set.
12. How do the random patches and random subspaces methods increase diversity among
predictors?
13. Describe the benefits and potential drawbacks of using random patches in high-
dimensional data.
14. Explain the process of training a random forest, highlighting the role of bootstrapping
and feature selection.
15. Write a Python code snippet to train a random forest classifier using Scikit-Learn.
16. Compare and contrast a single decision tree with a random forest in terms of bias,
variance, and overall prediction accuracy.
17. Describe the AdaBoost algorithm and how it adjusts weights during training to
improve prediction accuracy.
18. What are the advantages and disadvantages of using AdaBoost compared to other
ensemble methods?
19. Explain how gradient boosting differs from AdaBoost and provide a step-by-step
explanation of its training process.
20. Write a Python code snippet to implement a basic gradient boosting model using
Scikit-Learn.
21. What is stacking in ensemble learning, and how does it differ from simple aggregation
methods like voting?
22. Describe the role of the meta-learner in stacking and its importance in improving
prediction accuracy.


MODULE 5
BAYESIAN LEARNING
Bayesian reasoning provides a probabilistic approach to inference. It is based on the
assumption that the quantities of interest are governed by probability distributions and that
optimal decisions can be made by reasoning about these probabilities together with observed
data

INTRODUCTION
Bayesian learning methods are relevant to the study of machine learning for two reasons.
1. First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses,
such as the naive Bayes classifier, are among the most practical approaches to certain
types of learning problems.
2. Second, they provide a useful perspective for understanding many
learning algorithms that do not explicitly manipulate probabilities.

Features of Bayesian Learning Methods


• Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct. This provides a more flexible approach to
learning than algorithms that completely eliminate a hypothesis if it is found to be
inconsistent with any single example
• Prior knowledge can be combined with observed data to determine the final probability
of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting (1) a
prior probability for each candidate hypothesis, and (2) a probability distribution over
observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions
• New instances can be classified by combining the predictions of multiple hypotheses,
weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can
provide a standard of optimal decision making against which other practical methods
can be measured.

Practical difficulty in applying Bayesian methods


1. One practical difficulty in applying Bayesian methods is that they typically require
initial knowledge of many probabilities. When these probabilities are not known in
advance they are often estimated based on background knowledge, previously available
data, and assumptions about the form of the underlying distributions.
2. A second practical difficulty is the significant computational cost required to determine
the Bayes optimal hypothesis in the general case. In certain specialized situations, this
computational cost can be significantly reduced.


BAYES THEOREM

Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior
probability, the probabilities of observing various data given the hypothesis, and the observed
data itself.
Notations
• P(h) prior probability of h, reflects any background knowledge about the chance that h
is correct
• P(D) prior probability of D, probability that D will be observed
• P(D|h) probability of observing D given a world in which h holds
• P(h|D) posterior probability of h, reflects confidence that h holds after D has been
observed

Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to
calculate the posterior probability P(h|D), from the prior probability P(h), together with P(D)
and P(D|h):

P(h|D) = P(D|h) P(h) / P(D)

• P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem.
• P(h|D) decreases as P(D) increases, because the more probable it is that D will be
observed independent of h, the less evidence D provides in support of h.

Maximum a Posteriori (MAP) Hypothesis

• In many learning scenarios, the learner considers some set of candidate hypotheses H
and is interested in finding the most probable hypothesis h ∈ H given the observed data
D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP)
hypothesis.
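• Using Bayes theorem to calculate the posterior probability of each candidate hypothesis,
hMAP is a MAP hypothesis provided

hMAP = argmax_{h ∈ H} P(h|D)
     = argmax_{h ∈ H} P(D|h) P(h) / P(D)
     = argmax_{h ∈ H} P(D|h) P(h)

• In the final step P(D) is dropped, because it is a constant independent of h.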


Maximum Likelihood (ML) Hypothesis

• In some cases, it is assumed that every hypothesis in H is equally probable a priori
(P(hi) = P(hj) for all hi and hj in H).
• In this case the expression for hMAP can be simplified, and we need only consider the
term P(D|h) to find the most probable hypothesis:

hML = argmax_{h ∈ H} P(D|h)

P(D|h) is often called the likelihood of the data D given h, and any hypothesis that maximizes
P(D|h) is called a maximum likelihood (ML) hypothesis.

Example
• Consider a medical diagnosis problem in which there are two alternative hypotheses:
(1) that the patient has a particular form of cancer, and (2) that the patient does not. The
available data is from a particular laboratory test with two possible outcomes: +
(positive) and - (negative).
• We have prior knowledge that over the entire population only 0.008 of people have this
disease. Furthermore, the lab test is only an imperfect indicator of the disease.
• The test returns a correct positive result in only 98% of the cases in which the disease is
actually present and a correct negative result in only 97% of the cases in which the
disease is not present. In other cases, the test returns the opposite result.
• The above situation can be summarized by the following probabilities:

P(cancer) = 0.008         P(¬cancer) = 0.992
P(+|cancer) = 0.98        P(−|cancer) = 0.02
P(+|¬cancer) = 0.03       P(−|¬cancer) = 0.97

Suppose a new patient is observed for whom the lab test returns a positive (+) result.
Should we diagnose the patient as having cancer or not? The MAP hypothesis is found by
comparing

P(+|cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

Thus hMAP = ¬cancer.

The exact posterior probabilities can also be determined by normalizing the above quantities
so that they sum to 1, e.g., P(cancer|+) = 0.0078 / (0.0078 + 0.0298) = 0.21.
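The same calculation can be checked numerically. The sketch below uses only the probabilities stated in the example; the function name posterior and the dictionary keys are hypothetical.

```python
def posterior(priors, likelihoods):
    """Return normalized posteriors P(h|D) from priors P(h) and likelihoods P(D|h)."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(joint.values())       # P(D), by the theorem of total probability
    return {h: joint[h] / evidence for h in joint}

priors = {"cancer": 0.008, "no_cancer": 0.992}
positive_given = {"cancer": 0.98, "no_cancer": 0.03}    # P(+|h)

post = posterior(priors, positive_given)
h_map = max(post, key=post.get)
print(h_map, round(post["cancer"], 2))   # no_cancer 0.21
```

Even after a positive test, the posterior probability of cancer is only about 0.21, because the prior probability of the disease is so low.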


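Basic formulas for calculating probabilities are summarized in the table below.

Product rule:                  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
Sum rule:                      P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Bayes theorem:                 P(h|D) = P(D|h) P(h) / P(D)
Theorem of total probability:  if events A1, ..., An are mutually exclusive and their
                               probabilities sum to 1, then P(B) = Σ_{i=1..n} P(B|Ai) P(Ai)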


BAYES THEOREM AND CONCEPT LEARNING

What is the relationship between Bayes theorem and the problem of concept learning?

Since Bayes theorem provides a principled way to calculate the posterior probability of each
hypothesis given the training data, we can use it as the basis for a straightforward learning
algorithm that calculates the probability for each possible hypothesis, then outputs the most
probable.

Brute-Force Bayes Concept Learning

Consider the concept learning problem


• Assume the learner considers some finite hypothesis space H defined over the instance
space X, in which the task is to learn some target concept c : X → {0,1}.
• The learner is given some sequence of training examples ((x1, d1) . . . (xm, dm)) where xi is
some instance from X and where di is the target value of xi (i.e., di = c(xi)).
• The sequence of target values is written as D = (d1 . . . dm).

We can design a straightforward concept learning algorithm to output the maximum a posteriori
hypothesis, based on Bayes theorem, as follows:

BRUTE-FORCE MAP LEARNING algorithm:

1. For each hypothesis h in H, calculate the posterior probability

P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability

hMAP = argmax_{h ∈ H} P(h|D)

To specify a learning problem for the BRUTE-FORCE MAP LEARNING algorithm, we
must specify what values are to be used for P(h) and for P(D|h).

Let’s choose P(h) and P(D|h) to be consistent with the following assumptions:
• The training data D is noise free (i.e., di = c(xi))
• The target concept c is contained in the hypothesis space H
• We have no a priori reason to believe that any hypothesis is more probable than any
other.


What values should we specify for P(h)?

• Given no prior knowledge that one hypothesis is more likely than another, it is
reasonable to assign the same prior probability to every hypothesis h in H.
• Assume the target concept is contained in H and require that these prior probabilities
sum to 1. Together these constraints imply

P(h) = 1/|H| for all h in H

What choice shall we make for P(D|h)?

• P(D|h) is the probability of observing the target values D = (d1 . . . dm) for the fixed set
of instances (x1 . . . xm), given a world in which hypothesis h holds.
• Since we assume noise-free training data, the probability of observing classification di
given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,

P(D|h) = 1 if di = h(xi) for all di in D
P(D|h) = 0 otherwise

Given these choices for P(h) and for P(D|h) we now have a fully-defined problem for the above
BRUTE-FORCE MAP LEARNING algorithm.

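Recalling Bayes theorem, we have

P(h|D) = P(D|h) P(h) / P(D)

Consider the case where h is inconsistent with the training data D. Then P(D|h) = 0, so

P(h|D) = 0 · P(h) / P(D) = 0

The posterior probability of a hypothesis inconsistent with D is zero.

Consider the case where h is consistent with D. Then P(D|h) = 1 and P(h) = 1/|H|, so

P(h|D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VSH,D| / |H|) = 1/|VSH,D|

where VSH,D is the subset of hypotheses from H that are consistent with D. Here
P(D) = |VSH,D| / |H| follows from the theorem of total probability, because only the
consistent hypotheses contribute a non-zero term P(D|hi) P(hi) = 1/|H|.

To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed
P(h) and P(D|h) is

P(h|D) = 1/|VSH,D| if h is consistent with D
P(h|D) = 0 otherwise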
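The brute-force procedure is easy to run on a tiny hypothesis space. The sketch below assumes a uniform prior and noise-free data as stated above; the hypothesis space of threshold concepts and the two training examples are hypothetical, chosen only to make the result easy to inspect.

```python
from fractions import Fraction

def brute_force_map(hypotheses, examples):
    """Posterior over a finite hypothesis space with a uniform prior and noise-free data."""
    prior = Fraction(1, len(hypotheses))
    # P(D|h) is 1 if h agrees with every training example, otherwise 0.
    joint = {name: prior * int(all(h(x) == d for x, d in examples))
             for name, h in hypotheses.items()}
    # P(D) = |VS| / |H|; assumes at least one hypothesis is consistent with the data.
    evidence = sum(joint.values())
    return {name: j / evidence for name, j in joint.items()}

# Hypothetical hypothesis space: threshold concepts h_t(x) = 1 if x >= t, for t = 0..5.
hypotheses = {f"x>={t}": (lambda x, t=t: int(x >= t)) for t in range(6)}
examples = [(1, 0), (4, 1)]              # (instance, target value)

posteriors = brute_force_map(hypotheses, examples)
h_map = max(posteriors, key=posteriors.get)
```

Three of the six hypotheses (t = 2, 3, 4) are consistent with both examples, so each receives posterior probability 1/3 and the others receive 0. Any of the three is a MAP hypothesis; max simply returns the first one.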


The Evolution of Probabilities Associated with Hypotheses

• Figure (a): initially, all hypotheses have the same probability.
• Figures (b) and (c): as training data accumulates, the posterior probability for
inconsistent hypotheses becomes zero, while the total probability (which sums to 1) is
shared equally among the remaining consistent hypotheses.

MAP Hypotheses and Consistent Learners

• A learning algorithm is a consistent learner if it outputs a hypothesis that commits zero
errors over the training examples.
• Every consistent learner outputs a MAP hypothesis, if we assume a uniform prior
probability distribution over H (P(hi) = P(hj) for all i, j) and deterministic, noise-free
training data (P(D|h) = 1 if D and h are consistent, and 0 otherwise).

Example:
• Because FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis under the
probability distributions P(h) and P(D|h) defined above.
• Are there other probability distributions for P(h) and P(D|h) under which FIND-S
outputs MAP hypotheses? Yes.
• Because FIND-S outputs a maximally specific hypothesis from the version space, its
output hypothesis will be a MAP hypothesis relative to any prior probability distribution
that favours more specific hypotheses.

Note
• The Bayesian framework is a way to characterize the behaviour of learning algorithms.
• By identifying probability distributions P(h) and P(D|h) under which the output is an
optimal hypothesis, the implicit assumptions of the algorithm (its inductive bias) can be
characterized.
• Inductive inference is modelled by an equivalent probabilistic reasoning system based
on Bayes theorem.


MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES

Consider the problem of learning a continuous-valued target function such as neural network
learning, linear regression, and polynomial curve fitting

A straightforward Bayesian analysis will show that under certain assumptions any learning
algorithm that minimizes the squared error between the output hypothesis predictions and the
training data will output a maximum likelihood (ML) hypothesis

• Learner L considers an instance space X and a hypothesis space H consisting of some
class of real-valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R].
• The problem faced by L is to learn an unknown target function f : X → R.
• A set of m training examples is provided, where the target value of each example is
corrupted by random noise drawn according to a Normal probability distribution with
zero mean.
• Each training example is a pair of the form (xi, di) where di = f(xi) + ei.
– Here f(xi) is the noise-free value of the target function and ei is a random variable
representing the noise.
– It is assumed that the values of the ei are drawn independently and that they are
distributed according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis or, equivalently, a
MAP hypothesis assuming all hypotheses are equally probable a priori.

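Using the definition of hML we have (p denotes a probability density, because the di are
continuous)

hML = argmax_{h ∈ H} p(D|h)

Assuming training examples are mutually independent given h, we can write p(D|h) as the
product of the various p(di|h)

hML = argmax_{h ∈ H} ∏_{i=1..m} p(di|h)

Given the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each
di must also obey a Normal distribution with variance σ² centred on the true target value f(xi).
Because we are writing the expression for p(D|h), we assume h is the correct description of f.
Hence, µ = f(xi) = h(xi), and

hML = argmax_{h ∈ H} ∏_{i=1..m} (1/√(2πσ²)) exp(−(di − h(xi))² / (2σ²))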


We maximize the less complicated logarithm of this likelihood, which is justified because
ln p is a monotonic function of p

hML = argmax_{h ∈ H} Σ_{i=1..m} [ ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²) ]

The first term in this expression is a constant independent of h, and can therefore be
discarded, yielding

hML = argmax_{h ∈ H} Σ_{i=1..m} −(di − h(xi))² / (2σ²)

Maximizing this negative quantity is equivalent to minimizing the corresponding positive
quantity

hML = argmin_{h ∈ H} Σ_{i=1..m} (di − h(xi))² / (2σ²)

Finally, discard constants that are independent of h.

hML = argmin_{h ∈ H} Σ_{i=1..m} (di − h(xi))²

Thus, the above equation shows that the maximum likelihood hypothesis hML is the one that
minimizes the sum of the squared errors between the observed training values di and the
hypothesis predictions h(xi)
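The equivalence can be checked numerically on a small example. The sketch below assumes a one-parameter hypothesis space h(x) = w · x searched over a grid of candidate slopes, and Gaussian noise with σ = 1; the data points and function names are made up for illustration.

```python
import math

def log_likelihood(w, data, sigma=1.0):
    """Gaussian log-likelihood of the data under the hypothesis h(x) = w * x."""
    return sum(math.log(1 / math.sqrt(2 * math.pi * sigma ** 2))
               - (d - w * x) ** 2 / (2 * sigma ** 2)
               for x, d in data)

def squared_error(w, data):
    """Sum of squared errors between the targets d and the predictions w * x."""
    return sum((d - w * x) ** 2 for x, d in data)

# Made-up noisy samples of f(x) = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
candidates = [i / 100 for i in range(100, 301)]      # slopes 1.00 .. 3.00

h_ml = max(candidates, key=lambda w: log_likelihood(w, data))
h_lse = min(candidates, key=lambda w: squared_error(w, data))
```

Both searches select the same slope, because the log-likelihood differs from the negative squared error only by a constant and a positive scale factor.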

Note:
Why is it reasonable to choose the Normal distribution to characterize noise?
• It is a good approximation of many types of noise in physical systems.
• The Central Limit Theorem shows that the sum of a sufficiently large number of
independent, identically distributed random variables itself approximately obeys a Normal
distribution.
Limitation: only noise in the target value is considered, not noise in the attributes describing
the instances themselves.


MAXIMUM LIKELIHOOD HYPOTHESES FOR PREDICTING PROBABILITIES

• Consider the setting in which we wish to learn a nondeterministic (probabilistic)
function f : X → {0, 1}, which has two discrete output values.
• We want a function approximator whose output is the probability that f(x) = 1. In other
words, learn the target function f′ : X → [0, 1] such that f′(x) = P(f(x) = 1)

How can we learn f′ using a neural network?

• A brute-force way would be to first collect the observed frequencies of 1's and 0's
for each possible value of x and to then train the neural network to output the target
frequency for each x.

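What criterion should we optimize in order to find a maximum likelihood hypothesis for f′ in
this setting?
• First obtain an expression for P(D|h)
• Assume the training data D is of the form D = {(x1, d1) . . . (xm, dm)}, where di is the
observed 0 or 1 value for f (xi).
• Treating both xi and di as random variables, and assuming that each training example is
drawn independently, we can write P(D|h) as

P(D|h) = ∏_{i=1..m} P(xi, di|h)

Applying the product rule, and noting that the probability of encountering xi does not
depend on h,

P(D|h) = ∏_{i=1..m} P(di|h, xi) P(xi)

The probability P(di|h, xi) is

P(di|h, xi) = h(xi) if di = 1
P(di|h, xi) = 1 − h(xi) if di = 0

Re-express it in a more mathematically manipulable form, as

P(di|h, xi) = h(xi)^di (1 − h(xi))^(1 − di)

Substituting this form of P(di|h, xi) into the product-rule expression for P(D|h), we obtain

P(D|h) = ∏_{i=1..m} h(xi)^di (1 − h(xi))^(1 − di) P(xi)

We write an expression for the maximum likelihood hypothesis

hML = argmax_{h ∈ H} ∏_{i=1..m} h(xi)^di (1 − h(xi))^(1 − di) P(xi)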


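The last term, P(xi), is a constant independent of h, so it can be dropped

hML = argmax_{h ∈ H} ∏_{i=1..m} h(xi)^di (1 − h(xi))^(1 − di)

It is easier to work with the log of the likelihood, yielding

hML = argmax_{h ∈ H} Σ_{i=1..m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]

The sum in this expression, denoted G(h, D), is the quantity that must be maximized in order
to obtain the maximum likelihood hypothesis in our current problem setting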

Gradient Search to Maximize Likelihood in a Neural Net

• Derive a weight-training rule for neural network learning that seeks to maximize G(h,D)
using gradient ascent
• The gradient of G(h,D) is given by the vector of partial derivatives of G(h,D) with
respect to the various network weights that define the hypothesis h represented by the
learned network
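• In this case, the partial derivative of G(h, D) with respect to weight wjk from input k to
unit j is

∂G(h,D)/∂wjk = Σ_{i=1..m} ∂G(h,D)/∂h(xi) · ∂h(xi)/∂wjk
             = Σ_{i=1..m} ∂[di ln h(xi) + (1 − di) ln(1 − h(xi))]/∂h(xi) · ∂h(xi)/∂wjk
             = Σ_{i=1..m} (di − h(xi)) / (h(xi)(1 − h(xi))) · ∂h(xi)/∂wjk        (1)

• Suppose our neural network is constructed from a single layer of sigmoid units. Then,

∂h(xi)/∂wjk = σ′(xi) xijk = h(xi)(1 − h(xi)) xijk

where xijk is the kth input to unit j for the ith training example, and σ′(x) is the derivative
of the sigmoid squashing function.

• Finally, substituting this expression into Equation (1), we obtain a simple expression for
the derivatives that constitute the gradient

∂G(h,D)/∂wjk = Σ_{i=1..m} (di − h(xi)) xijk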


Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent rather
than gradient descent search. On each iteration of the search the weight vector is adjusted in
the direction of the gradient, using the weight update rule

wjk ← wjk + Δwjk, where Δwjk = η Σ_{i=1..m} (di − h(xi)) xijk

and η is a small positive constant that determines the step size of the gradient ascent search.
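The update rule can be tried on a single sigmoid unit. The sketch below is an illustration under stated assumptions: one unit, batch updates of the form w_k ← w_k + η Σ_i (d_i − h(x_i)) x_ik, a constant first input acting as a bias, and made-up training data. The function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """h(x): output of a single sigmoid unit."""
    return sigmoid(sum(w * xk for w, xk in zip(weights, x)))

def log_likelihood(weights, data):
    """G(h, D) = sum over examples of d * ln h(x) + (1 - d) * ln(1 - h(x))."""
    return sum(d * math.log(predict(weights, x))
               + (1 - d) * math.log(1 - predict(weights, x))
               for x, d in data)

def gradient_ascent_step(weights, data, eta):
    """Move each weight in the direction of the gradient of G(h, D)."""
    grads = [sum((d - predict(weights, x)) * x[k] for x, d in data)
             for k in range(len(weights))]
    return [w + eta * g for w, g in zip(weights, grads)]

# Made-up data; the first input is a constant 1.0 that plays the role of a bias.
data = [((1.0, 0.0), 0), ((1.0, 1.0), 0), ((1.0, 1.5), 0),
        ((1.0, 1.5), 1), ((1.0, 2.0), 1), ((1.0, 3.0), 1)]

weights = [0.0, 0.0]
history = [log_likelihood(weights, data)]
for _ in range(200):
    weights = gradient_ascent_step(weights, data, eta=0.1)
    history.append(log_likelihood(weights, data))
```

The recorded values of G(h, D) increase from their starting point as the weights are updated, and the learned weight on the second input is positive, reflecting that larger inputs make the output 1 more probable.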


MINIMUM DESCRIPTION LENGTH PRINCIPLE

The Minimum Description Length (MDL) principle is a concept from information theory
applied to model selection in machine learning. It helps in choosing the best hypothesis (model)
by focusing on how concisely the hypothesis and the data it explains can be encoded.

The Minimum Description Length principle is motivated by interpreting the definition of hMAP
in the light of basic concepts from information theory.

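Consider the MAP criterion, which finds the hypothesis h that maximizes the product of the
likelihood P(D|h) and the prior P(h):

hMAP = argmax_{h ∈ H} P(D|h) P(h)

In terms of logarithms, this becomes

hMAP = argmax_{h ∈ H} [ log2 P(D|h) + log2 P(h) ]

Alternatively, it can be expressed as minimizing the negative of this quantity

hMAP = argmin_{h ∈ H} [ −log2 P(D|h) − log2 P(h) ]        (1)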

Designing an Optimal Code for Transmitting Messages

• Suppose there are a number of different messages that need to be sent over a
communication channel. Each message i has a certain probability pi of being sent.
• The goal is to design a code that minimizes the average (expected) number of bits needed
to transmit a message.
• The strategy is to assign shorter codes to more frequent messages. This way, we
save space by using fewer bits for common messages and more bits for rare ones.
• The optimal number of bits to encode message i is −log2(pi):
o If a message is more likely, it gets a shorter code.
o If a message is less likely, it gets a longer code.
• The number of bits needed to encode message i using a code C is called the
description length of the message with respect to C, denoted LC(i).
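A short numerical illustration of these code lengths, using made-up message probabilities (the names description_length, probs and expected_bits are hypothetical):

```python
import math

def description_length(p):
    """Optimal code length in bits for a message that occurs with probability p."""
    return -math.log2(p)

# Made-up message probabilities that sum to 1.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

lengths = {msg: description_length(p) for msg, p in probs.items()}
expected_bits = sum(p * lengths[msg] for msg, p in probs.items())
```

The most frequent message A gets a 1-bit code, the rare messages C and D get 3-bit codes, and the expected message length is 1.75 bits, compared with 2 bits for a fixed-length code over four messages.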


Equation (1) can be interpreted as a statement that short hypotheses are preferred, assuming
a particular representation scheme for encoding hypotheses and data.

• −log2P(h): the description length of h under the optimal encoding for the hypothesis
space H, LCH(h) = −log2P(h), where CH is the optimal code for hypothesis space H.
• −log2P(D|h): the description length of the training data D given hypothesis h, under its
optimal encoding, LCD|h(D|h) = −log2P(D|h), where CD|h
is the optimal code for describing data D assuming that both the sender and receiver
know the hypothesis h.
• Rewrite Equation (1) to show that hMAP is the hypothesis h that minimizes the sum given
by the description length of the hypothesis plus the description length of the data given
the hypothesis.

hMAP = argmin_{h ∈ H} [ LCH(h) + LCD|h(D|h) ]

where CH and CD|h are the optimal encodings for H and for D given h.

The Minimum Description Length (MDL) principle recommends choosing the hypothesis that
minimizes the sum of these two description lengths.

Minimum Description Length principle: choose hMDL where

hMDL = argmin_{h ∈ H} [ LC1(h) + LC2(D|h) ]

where C1 and C2 are the codes used to represent the hypothesis and the data given the
hypothesis, respectively.

The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses CH,
and if we choose C2 to be the optimal encoding CD|h, then hMDL = hMAP.
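The equality hMDL = hMAP under optimal codes can be checked on a toy example. The priors and likelihoods below are made-up numbers for three hypothetical hypotheses; the description lengths are computed as −log2 of the corresponding probabilities.

```python
import math

# Made-up priors P(h) and likelihoods P(D|h) for three hypotheses.
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihood = {"h1": 0.01, "h2": 0.10, "h3": 0.05}

def total_description_length(h):
    """Bits to encode h plus bits to encode D given h, using optimal codes."""
    return -math.log2(prior[h]) - math.log2(likelihood[h])

h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
h_mdl = min(prior, key=total_description_length)
```

Here h2 maximizes P(D|h) P(h) and also has the shortest total description length, so both criteria select the same hypothesis. With codes other than the optimal ones, the two choices need not coincide.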


Application to Decision Tree Learning

Apply the MDL principle to the problem of learning decision trees from some training data.
What should we choose for the representations C1 and C2 of hypotheses and data?
• For C1: C1 might be some obvious encoding, in which the description length grows with
the number of nodes and with the number of edges
• For C2: Suppose that the sequence of instances (x1 . . .xm) is already known to both the
transmitter and receiver, so that we need only transmit the classifications (f (x1) . . . f
(xm)).
• Now if the training classifications (f (x1) . . . f(xm)) are identical to the predictions of the
hypothesis, then there is no need to transmit any information about these examples. The
description length of the classifications given the hypothesis is zero.
• If examples are misclassified by h, then for each misclassification we need to transmit
a message that identifies which example is misclassified as well as its correct
classification
• The hypothesis hMDL under the encoding C1 and C2 is just the one that minimizes the
sum of these description lengths.


BAYES OPTIMAL CLASSIFIER

When working with machine learning, there are two important questions we often ask:

1. What is the most probable hypothesis given the training data?


2. What is the most probable classification of a new instance given the training data?

Although it might seem that we can answer the second question by just using the most probable
hypothesis, known as the Maximum A Posteriori (MAP) hypothesis, there is actually a better
way.


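The better way is to combine the predictions of all hypotheses, weighted by their posterior
probabilities. If the classification of the new instance can take any value vj from a set V, then

P(vj|D) = Σ_{hi ∈ H} P(vj|hi) P(hi|D)

and the most probable classification is

argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)

Any system that classifies new instances according to this rule is called a Bayes optimal
classifier, or Bayes optimal learner.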
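A small example shows how the MAP hypothesis and the Bayes optimal classification can disagree. The sketch assumes the weighted-vote rule argmax over v of Σ_h P(v|h) P(h|D), with P(v|h) equal to 1 when h predicts v and 0 otherwise. The three posteriors and predictions are made-up illustrative values.

```python
# Made-up posteriors P(h|D) and the label each hypothesis predicts for a new instance.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

def bayes_optimal(posterior, prediction, classes=("+", "-")):
    """Weighted vote: each hypothesis votes for its label with weight P(h|D)."""
    votes = {v: sum(p for h, p in posterior.items() if prediction[h] == v)
             for v in classes}
    return max(votes, key=votes.get), votes

h_map = max(posterior, key=posterior.get)
map_label = prediction[h_map]
optimal_label, votes = bayes_optimal(posterior, prediction)
```

The MAP hypothesis h1 predicts "+", but the hypotheses predicting "-" together hold posterior probability 0.6, so the Bayes optimal classification is "-".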


Gibbs Algorithm

The Bayes optimal classifier is the best possible way to classify new data based on the training
data. However, it is computationally expensive because it calculates and combines the
probabilities of all hypotheses in the hypothesis space 𝐻. The Gibbs algorithm offers a simpler
but less optimal alternative

Algorithm:
• Choose a hypothesis h from H at random, according to the posterior probability
distribution over H.
• Use h to predict the classification of the next instance x.

This means that for each new instance to be classified, the Gibbs algorithm randomly picks one
hypothesis based on the current probabilities and uses it to make the prediction.
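The two steps of the algorithm can be sketched as follows. The posteriors and predictions are made-up illustrative values, the function name gibbs_classify is hypothetical, and a seeded random generator is used only to make the run repeatable.

```python
import random

def gibbs_classify(posterior, prediction, rng):
    """Sample one hypothesis according to P(h|D) and return its prediction."""
    names = list(posterior)
    h = rng.choices(names, weights=[posterior[n] for n in names], k=1)[0]
    return prediction[h]

# Made-up posteriors P(h|D) and the label each hypothesis predicts for instance x.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

rng = random.Random(0)
labels = [gibbs_classify(posterior, prediction, rng) for _ in range(10000)]
fraction_plus = labels.count("+") / len(labels)
```

Over many draws the Gibbs algorithm answers "+" about 40% of the time, matching the posterior mass of the hypotheses that predict "+", whereas a weighted vote over all hypotheses would answer "-" every time.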

Performance of the Gibbs Algorithm


• Under certain conditions, the Gibbs algorithm's expected misclassification error is at
most twice that of the Bayes optimal classifier. This means that, on average, the Gibbs
algorithm makes no more than twice as many mistakes as the Bayes optimal classifier.

Implications for Concept Learning


• If the learner assumes a uniform prior distribution over the hypothesis space 𝐻, and if
the actual target concepts follow this distribution, then using the Gibbs algorithm will
result in an expected error at most twice that of the Bayes optimal classifier.
• This shows how a Bayesian analysis can provide insights into the performance of a
simpler, non-Bayesian algorithm.


NAIVE BAYES CLASSIFIER

• The naive Bayes classifier applies to learning tasks where each instance x is described
by a conjunction of attribute values and where the target function f (x) can take on any
value from some finite set V.
• A set of training examples of the target function is provided, and a new instance is
presented, described by the tuple of attribute values (a1, a2, ..., an).
• The learner is asked to predict the target value, or classification, for this new instance.

The Bayesian approach to classifying the new instance is to assign the most probable target
value, vMAP, given the attribute values (a1, a2, ..., an) that describe the instance

vMAP = argmax_{vj ∈ V} P(vj | a1, a2, ..., an)

Use Bayes theorem to rewrite this expression as

vMAP = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
     = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj)        (1)

• The naive Bayes classifier is based on the assumption that the attribute values are
conditionally independent given the target value. That is, the assumption is that given
the target value of the instance, the probability of observing the conjunction (a1, a2, ..., an)
is just the product of the probabilities for the individual attributes:

P(a1, a2, ..., an | vj) = ∏_i P(ai | vj)

Substituting this into Equation (1),

Naive Bayes classifier:

vNB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)        (2)

where vNB denotes the target value output by the naive Bayes classifier.


An Illustrative Example
• Let us apply the naive Bayes classifier to a concept learning problem i.e., classifying
days according to whether someone will play tennis.
• The below table provides a set of 14 training examples of the target concept PlayTennis,
where each day is described by the attributes Outlook, Temperature, Humidity, and
Wind

Day   Outlook    Temperature  Humidity  Wind    PlayTennis
D1    Sunny      Hot          High      Weak    No
D2    Sunny      Hot          High      Strong  No
D3    Overcast   Hot          High      Weak    Yes
D4    Rain       Mild         High      Weak    Yes
D5    Rain       Cool         Normal    Weak    Yes
D6    Rain       Cool         Normal    Strong  No
D7    Overcast   Cool         Normal    Strong  Yes
D8    Sunny      Mild         High      Weak    No
D9    Sunny      Cool         Normal    Weak    Yes
D10   Rain       Mild         Normal    Weak    Yes
D11   Sunny      Mild         Normal    Strong  Yes
D12   Overcast   Mild         High      Strong  Yes
D13   Overcast   Hot          Normal    Weak    Yes
D14   Rain       Mild         High      Strong  No

• Use the naive Bayes classifier and the training data from this table to classify the
following novel instance:
< Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong >

• Our task is to predict the target value (yes or no) of the target concept PlayTennis for
this new instance

The probabilities of the different target values can easily be estimated based on their
frequencies over the 14 training examples
• P(PlayTennis = yes) = 9/14 = 0.64
• P(PlayTennis = no) = 5/14 = 0.36


Similarly, estimate the conditional probabilities. For example, those for Wind = strong
• P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
• P(Wind = strong | PlayTennis = no) = 3/5 = 0.60

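Calculate VNB according to Equation (1):

$$V_{NB} = \underset{v_j \in \{yes,\, no\}}{\operatorname{argmax}} \; P(v_j)\, P(Outlook = sunny \mid v_j)\, P(Temperature = cool \mid v_j)\, P(Humidity = high \mid v_j)\, P(Wind = strong \mid v_j)$$

Using the frequencies counted from the table:
• P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = (9/14)(2/9)(3/9)(3/9)(3/9) = 0.0053
• P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = (5/14)(3/5)(1/5)(4/5)(3/5) = 0.0206

Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this new
instance, based on the probability estimates learned from the training data.

By normalizing the above quantities to sum to one, we can calculate the conditional probability that
the target value is no, given the observed attribute values:

0.0206 / (0.0206 + 0.0053) = 0.795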
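The calculation for the novel instance can be reproduced by counting frequencies over the 14 training examples. The following is a minimal sketch, not a reference implementation; the names DATA, train and score are illustrative assumptions, and probabilities are plain relative frequencies with no smoothing.

```python
from collections import Counter, defaultdict

# Training examples from the PlayTennis table:
# (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def train(data):
    """Count class frequencies and (attribute index, value, class) frequencies."""
    class_counts = Counter(row[-1] for row in data)
    cond_counts = defaultdict(int)
    for *attrs, label in data:
        for i, value in enumerate(attrs):
            cond_counts[(i, value, label)] += 1
    return class_counts, cond_counts


def score(instance, label, class_counts, cond_counts):
    """Compute P(v_j) * prod_i P(a_i | v_j) from relative frequencies."""
    result = class_counts[label] / sum(class_counts.values())
    for i, value in enumerate(instance):
        result *= cond_counts[(i, value, label)] / class_counts[label]
    return result


class_counts, cond_counts = train(DATA)
new_instance = ("Sunny", "Cool", "High", "Strong")
scores = {v: score(new_instance, v, class_counts, cond_counts) for v in class_counts}
v_nb = max(scores, key=scores.get)             # class with the largest product
p_no = scores["No"] / sum(scores.values())     # normalized probability of "No"
print(scores, v_nb, round(p_no, 3))
```

The two products come out at roughly 0.0053 for yes and 0.0206 for no, so the sketch selects PlayTennis = no with a normalized probability of about 0.795.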

Estimating Probabilities

• We have estimated probabilities by the fraction of times the event is observed to occur
over the total number of opportunities.
• For example, in the above case we estimated P(Wind = strong | PlayTennis = no) by
the fraction nc /n where n = 5 is the total number of training examples for which
PlayTennis = no, and nc = 3 is the number of these for which Wind = strong.
• When nc = 0, nc /n is zero, and this probability term dominates the classifier, because the quantity
calculated in Equation (2) requires multiplying all the other probability terms by this
zero value.
• To avoid this difficulty we can adopt a Bayesian approach to estimating the probability,
using the m-estimate defined as follows.

m-estimate of probability:

$$\frac{n_c + m\,p}{n + m}$$

• Here nc and n are defined as before, p is our prior estimate of the probability we wish to determine, and m is a constant
called the equivalent sample size, which determines how heavily to weight p relative
to the observed data.
• A typical method for choosing p in the absence of other information is to assume uniform
priors; that is, if an attribute has k possible values we set p = 1/k.
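As a small illustration, the m-estimate (nc + m p) / (n + m) can be compared with the raw fraction nc / n. This is a sketch only; the choice m = 2 is an arbitrary assumption made for the example, and p = 1/2 follows from the uniform prior because Wind has k = 2 possible values.

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)


# P(Wind = strong | PlayTennis = no): n_c = 3 of n = 5 examples.
raw = 3 / 5
smoothed = m_estimate(n_c=3, n=5, p=0.5, m=2)   # m = 2 is an assumed value

# With n_c = 0 the raw fraction would be zero; the m-estimate stays positive.
zero_case = m_estimate(n_c=0, n=5, p=0.5, m=2)

# With m = 0 the m-estimate reduces to the raw fraction n_c / n.
no_prior = m_estimate(n_c=3, n=5, p=0.5, m=0)

print(raw, smoothed, zero_case, no_prior)
```

Larger values of m pull the estimate more strongly toward the prior p, which is why m is described as an equivalent sample size.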


BAYESIAN BELIEF NETWORKS

• The naive Bayes classifier makes significant use of the assumption that the values of the
attributes a1 . . .an are conditionally independent given the target value v.
• This assumption dramatically reduces the complexity of learning the target function.

A Bayesian belief network describes the probability distribution governing a set of variables
by specifying a set of conditional independence assumptions along with a set of conditional
probabilities.

Bayesian belief networks allow stating conditional independence assumptions that apply to
subsets of the variables.

Notation
• Consider an arbitrary set of random variables Y1 . . . Yn , where each variable Yi can
take on the set of possible values V(Yi).
• We define the joint space of the set of variables Y to be the cross product V(Y1) × V(Y2) × . . . ×
V(Yn).
• In other words, each item in the joint space corresponds to one of the possible
assignments of values to the tuple of variables (Y1 . . . Yn). The probability distribution
over this joint space is called the joint probability distribution.
• The joint probability distribution specifies the probability for each of the possible
variable bindings for the tuple (Y1 . . . Yn).
• A Bayesian belief network describes the joint probability distribution for a set of
variables.

Conditional Independence

Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of
Y given Z if the probability distribution governing X is independent of the value of Y given a
value for Z, that is, if

$$(\forall x_i, y_j, z_k) \quad P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$$

Where, x_i ∈ V(X), y_j ∈ V(Y), and z_k ∈ V(Z).

The above expression is written in abbreviated form as

P(X | Y, Z) = P(X | Z)
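The definition can be checked numerically on a small joint distribution. The sketch below builds a joint table over three boolean variables as P(Z) P(X | Z) P(Y | Z), so X is conditionally independent of Y given Z by construction. All probability values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical parameters (assumptions for this example only).
p_z = {0: 0.3, 1: 0.7}           # P(Z = z)
p_x_given_z = {0: 0.2, 1: 0.9}   # P(X = 1 | Z = z)
p_y_given_z = {0: 0.5, 1: 0.4}   # P(Y = 1 | Z = z)


def bern(p_true, value):
    """Probability of a binary value when P(value = 1) = p_true."""
    return p_true if value == 1 else 1.0 - p_true


# Joint distribution P(X, Y, Z) stored as a table keyed by (x, y, z).
joint = {
    (x, y, z): p_z[z] * bern(p_x_given_z[z], x) * bern(p_y_given_z[z], y)
    for x, y, z in product((0, 1), repeat=3)
}


def prob(**fixed):
    """Marginal probability of the given values, e.g. prob(x=1, z=0)."""
    total = 0.0
    for (x, y, z), p in joint.items():
        entry = {"x": x, "y": y, "z": z}
        if all(entry[name] == value for name, value in fixed.items()):
            total += p
    return total


def is_cond_independent(tol=1e-12):
    """Check P(X | Y, Z) = P(X | Z) for every combination of values."""
    for x, y, z in product((0, 1), repeat=3):
        lhs = prob(x=x, y=y, z=z) / prob(y=y, z=z)
        rhs = prob(x=x, z=z) / prob(z=z)
        if abs(lhs - rhs) > tol:
            return False
    return True


print(is_cond_independent())
print(prob(x=1, y=1) / prob(y=1), prob(x=1))   # P(X=1 | Y=1) versus P(X=1)
```

The second line printed shows that X and Y are not independent once Z is ignored: conditional independence given Z does not imply unconditional independence.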

Conditional independence can be extended to sets of variables. The set of variables X1 . . . Xl
is conditionally independent of the set of variables Y1 . . . Ym given the set of variables Z1 . . .
Zn if

$$P(X_1 \ldots X_l \mid Y_1 \ldots Y_m, Z_1 \ldots Z_n) = P(X_1 \ldots X_l \mid Z_1 \ldots Z_n)$$


The naive Bayes classifier assumes that the instance attribute A1 is conditionally independent
of instance attribute A2 given the target value V. This allows the naive Bayes classifier to
calculate P(A1, A2 | V) as follows:

$$P(A_1, A_2 \mid V) = P(A_1 \mid A_2, V)\, P(A_2 \mid V) = P(A_1 \mid V)\, P(A_2 \mid V)$$

Representation

A Bayesian belief network represents the joint probability distribution for a set of variables.
Bayesian networks (BN) are represented by directed acyclic graphs.

The Bayesian network in the above figure represents the joint probability distribution over the
boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.

A Bayesian network (BN) represents the joint probability distribution by specifying a set of
conditional independence assumptions.
• A BN is represented by a directed acyclic graph, together with sets of local conditional
probabilities.
• Each variable in the joint space is represented by a node in the Bayesian network.
• The network arcs represent the assertion that the variable is conditionally independent
of its non-descendants in the network given its immediate predecessors in the network.
• A conditional probability table (CPT) is given for each variable, describing the
probability distribution for that variable given the values of its immediate predecessors.

The joint probability for any desired assignment of values (y1, . . . , yn) to the tuple of network
variables (Y1 . . . Yn) can be computed by the formula

$$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$$


Where, Parents(Yi) denotes the set of immediate predecessors of Yi in the network.

Example:
Consider the node Campfire. The network nodes and arcs represent the assertion that Campfire
is conditionally independent of its non-descendants Lightning and Thunder, given its
immediate parents Storm and BusTourGroup.

This means that once we know the value of the variables Storm and BusTourGroup, the
variables Lightning and Thunder provide no additional information about Campfire.
Each entry in the conditional probability table associated with the variable Campfire
expresses one conditional probability of this kind. For example, one entry expresses the assertion

P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4
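To make the product-of-CPT-entries formula concrete, the sketch below computes joint probabilities for a three-node fragment of the network (Storm and BusTourGroup as parents of Campfire). Only the entry 0.4 comes from the notes; the two priors and the other three Campfire entries are hypothetical values assumed for illustration.

```python
from itertools import product

P_STORM = 0.2   # hypothetical prior P(Storm = True)
P_BUS = 0.5     # hypothetical prior P(BusTourGroup = True)

# P(Campfire = True | Storm, BusTourGroup), keyed by (storm, bus).
P_CAMPFIRE = {
    (True, True): 0.4,     # the entry quoted in the notes
    (True, False): 0.1,    # hypothetical
    (False, True): 0.8,    # hypothetical
    (False, False): 0.2,   # hypothetical
}


def bernoulli(p_true, value):
    """Probability of a boolean value when P(value = True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(storm, bus, campfire):
    """P(storm, bus, campfire) as the product of P(y_i | Parents(Y_i))."""
    return (
        bernoulli(P_STORM, storm)                           # Storm has no parents
        * bernoulli(P_BUS, bus)                             # BusTourGroup has no parents
        * bernoulli(P_CAMPFIRE[(storm, bus)], campfire)     # Campfire given its parents
    )


p_all_true = joint(True, True, True)
total = sum(joint(s, b, c) for s, b, c in product((True, False), repeat=3))
print(p_all_true, total)
```

Under these assumed numbers, the all-true assignment has probability 0.2 × 0.5 × 0.4 = 0.04, and the eight joint entries sum to one, as any joint distribution must.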

Inference

• Use a Bayesian network to infer the value of some target variable (e.g., ForestFire) given
the observed values of the other variables.
• Inference can be straightforward if values for all of the other variables in the network
are known exactly.
• A Bayesian network can be used to compute the probability distribution for any subset
of network variables given the values or distributions for any subset of the remaining
variables.
• Exact inference of probabilities in an arbitrary Bayesian network is known to be NP-hard.
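For a very small network, inference can be carried out by brute-force enumeration of the joint distribution. The sketch below does this for a three-node fragment (Storm and BusTourGroup as parents of Campfire). The function name query, the priors, and all CPT entries other than 0.4 are hypothetical assumptions. The loop visits every assignment, so its cost grows exponentially with the number of variables, which is consistent with exact inference being hard in general.

```python
from itertools import product

VARIABLES = ("Storm", "BusTourGroup", "Campfire")

P_STORM = 0.2   # hypothetical prior P(Storm = True)
P_BUS = 0.5     # hypothetical prior P(BusTourGroup = True)
P_CAMPFIRE = {  # P(Campfire = True | Storm, BusTourGroup)
    (True, True): 0.4,     # the entry quoted in the notes
    (True, False): 0.1,    # hypothetical
    (False, True): 0.8,    # hypothetical
    (False, False): 0.2,   # hypothetical
}


def bernoulli(p_true, value):
    """Probability of a boolean value when P(value = True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(a):
    """Joint probability of a full assignment a (dict: variable name -> bool)."""
    parents = (a["Storm"], a["BusTourGroup"])
    return (
        bernoulli(P_STORM, a["Storm"])
        * bernoulli(P_BUS, a["BusTourGroup"])
        * bernoulli(P_CAMPFIRE[parents], a["Campfire"])
    )


def query(target, evidence):
    """P(target = True | evidence), summing the joint over consistent assignments."""
    numerator = denominator = 0.0
    for values in product((True, False), repeat=len(VARIABLES)):
        a = dict(zip(VARIABLES, values))
        if any(a[name] != value for name, value in evidence.items()):
            continue  # assignment contradicts the observed evidence
        p = joint(a)
        denominator += p
        if a[target]:
            numerator += p
    return numerator / denominator


p_storm_given_fire = query("Storm", {"Campfire": True})
print(p_storm_given_fire)
```

When every parent of the target is observed, as in query("Campfire", {"Storm": True, "BusTourGroup": True}), the answer is read directly from the CPT, which matches the remark that inference is straightforward when all other variables are known.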

Learning Bayesian Belief Networks

Effective algorithms can be devised for learning Bayesian belief networks from training
data. Several different settings for this learning problem can be considered:


➢ First, the network structure might be given in advance, or it might have to be inferred from
the training data.
➢ Second, all the network variables might be directly observable in each training example,
or some might be unobservable.
• In the case where the network structure is given in advance and the variables are fully
observable in the training examples, learning the conditional probability tables is
straightforward: we simply estimate the conditional probability table entries from the observed frequencies.
• In the case where the network structure is given but only some of the variable values
are observable in the training data, the learning problem is more difficult. The learning
problem can be compared to learning weights for an ANN.
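For the fully observable case, estimating a CPT amounts to counting. The sketch below estimates P(Campfire = True | Storm, BusTourGroup) from a handful of training examples. The examples are hypothetical, made up only to show the counting, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical fully observed examples: (Storm, BusTourGroup, Campfire).
examples = [
    (True, True, True), (True, True, False), (True, True, False),
    (True, False, False),
    (False, True, True), (False, True, True), (False, True, False),
    (False, False, False), (False, False, True), (False, False, False),
]


def estimate_cpt(examples):
    """Estimate P(Campfire = True | parents) as a relative frequency."""
    parent_counts = Counter()   # how often each parent configuration occurs
    true_counts = Counter()     # how often Campfire is True for that configuration
    for storm, bus, campfire in examples:
        parent_counts[(storm, bus)] += 1
        if campfire:
            true_counts[(storm, bus)] += 1
    return {
        parents: true_counts[parents] / count
        for parents, count in parent_counts.items()
    }


cpt = estimate_cpt(examples)
for parents, p in sorted(cpt.items()):
    print(parents, round(p, 3))
```

A parent configuration with few examples gives an unreliable estimate, and one where the child value never occurs gives zero, so an m-estimate could be substituted for the raw frequency.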

Gradient Ascent Training of Bayesian Network

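The gradient ascent rule maximizes P(D|h) by following the gradient of ln P(D|h) with
respect to the parameters that define the conditional probability tables of the Bayesian network.

Let wijk denote a single entry in one of the conditional probability tables. In particular, wijk
denotes the conditional probability that the network variable Yi will take on the value yij, given
that its immediate parents Ui take on the values given by uik.

The gradient of ln P(D|h) is given by the derivatives ∂ ln P(D|h) / ∂wijk for each of the wijk.
Each of these derivatives can be calculated as

$$\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P_h(Y_i = y_{ij}, U_i = u_{ik} \mid d)}{w_{ijk}}$$

Here the subscript h marks a probability computed under the hypothesis h. In the same way, we write the abbreviation Ph(D) to represent P(D|h).

To derive the gradient defined by the set of derivatives for all i, j, and k, assume the
training examples d in the data set D are drawn independently. We can then write this derivative as

$$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \frac{\partial}{\partial w_{ijk}} \ln \prod_{d \in D} P_h(d) = \sum_{d \in D} \frac{\partial \ln P_h(d)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial P_h(d)}{\partial w_{ijk}}$$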
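In the standard textbook treatment, this gradient leads to a two-step training rule: first update each entry as wijk ← wijk + η Σd Ph(yij, uik | d) / wijk, where η is a small learning rate, then renormalize so that the entries for each parent configuration k sum to one. The sketch below applies one such step to the table of a single variable. It assumes the posteriors Ph(yij, uik | d) have already been computed by some inference routine, and the function name and example numbers are hypothetical.

```python
def gradient_ascent_step(w, posteriors, eta):
    """One gradient ascent update of CPT entries w[j][k] = P(Y = y_j | U = u_k).

    posteriors[d][j][k] is assumed to hold P_h(Y = y_j, U = u_k | d) for
    training example d, computed beforehand by an inference routine.
    """
    n_values, n_parent_configs = len(w), len(w[0])

    # Step 1: move each entry along the gradient of ln P(D|h).
    updated = [
        [
            w[j][k] + eta * sum(p[j][k] for p in posteriors) / w[j][k]
            for k in range(n_parent_configs)
        ]
        for j in range(n_values)
    ]

    # Step 2: renormalize each column so the entries remain valid probabilities.
    for k in range(n_parent_configs):
        column_sum = sum(updated[j][k] for j in range(n_values))
        for j in range(n_values):
            updated[j][k] /= column_sum
    return updated


# Hypothetical example: a boolean variable with one boolean parent.
w = [[0.5, 0.5], [0.5, 0.5]]
posteriors = [
    [[1.0, 0.0], [0.0, 0.0]],   # example 1: y_0 observed together with u_0
    [[0.0, 0.0], [0.0, 1.0]],   # example 2: y_1 observed together with u_1
]
new_w = gradient_ascent_step(w, posteriors, eta=0.1)
print(new_w)
```

In this toy case the entries supported by the data grow (w[0][0] and w[1][1] rise above 0.5) while each column still sums to one. In a real network the posteriors must be recomputed after every step, since they depend on the current table entries.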


THE EM ALGORITHM

The EM algorithm can be used even for variables whose value is never directly observed,
provided the general form of the probability distribution governing these variables is known.

Estimating Means of k Gaussians

• Consider a problem in which the data D is a set of instances generated by a probability
distribution that is a mixture of k distinct Normal distributions.

• This problem setting is illustrated in Figure for the case where k = 2 and where the
instances are the points shown along the x axis.
• Each instance is generated using a two-step process.
• First, one of the k Normal distributions is selected at random.
• Second, a single random instance xi is generated according to this selected
distribution.
• This process is repeated to generate a set of data points as shown in the figure.
• To simplify, consider the special case


• The selection of the single Normal distribution at each step is based on choosing
each with uniform probability.
• Each of the k Normal distributions has the same variance σ², and the value of σ² is known.
• The learning task is to output a hypothesis h = (μ1 , . . . ,μk) that describes the means of
each of the k distributions.
• We would like to find a maximum likelihood hypothesis for these means; that is, a
hypothesis h that maximizes p(D |h).
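The two-step generation process in this special case can be simulated directly. The sketch below is illustrative only; the function name, the means 0 and 5, σ = 1 and the sample size are assumptions chosen for the example. It records which distribution produced each point, although in the learning problem that index is hidden from the learner.

```python
import random


def sample_mixture(means, sigma, n, seed=0):
    """Draw n points from a uniform mixture of Normal distributions.

    Returns (x, j) pairs, where j is the index of the distribution that
    generated x. In the learning problem j is not observed.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        j = rng.randrange(len(means))    # step 1: pick one of the k distributions uniformly
        x = rng.gauss(means[j], sigma)   # step 2: draw x from the selected distribution
        points.append((x, j))
    return points


data = sample_mixture(means=[0.0, 5.0], sigma=1.0, n=1000)

# If j were observed, each mean could be estimated by a per-component sample mean.
for j in (0, 1):
    xs = [x for x, component in data if component == j]
    print(j, len(xs), sum(xs) / len(xs))
```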

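It is easy to calculate the maximum likelihood hypothesis for the mean of a single Normal
distribution, given the observed instances x1, . . . , xm drawn from it. The maximum likelihood
hypothesis is the one that minimizes the sum of squared errors over the m training instances:

$$\mu_{ML} = \underset{\mu}{\operatorname{argmin}} \sum_{i=1}^{m} (x_i - \mu)^2$$

In this case, the sum of squared errors is minimized by the sample mean:

$$\mu_{ML} = \frac{1}{m} \sum_{i=1}^{m} x_i$$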

• Our problem here, however, involves a mixture of k different Normal distributions, and
we cannot observe which instances were generated by which distribution.
• Consider the full description of each instance as the triple (xi, zi1, zi2), where xi is the
observed value of the ith instance and zi1 and zi2 indicate which of the two Normal
distributions was used to generate the value xi.
• In particular, zij has the value 1 if xi was created by the jth Normal distribution and 0
otherwise.
• Here xi is the observed variable in the description of the instance, and zi1 and zi2 are
hidden variables.
• If the values of zi1 and zi2 were observed, we could use the sample mean, computed
separately over the instances generated by each distribution, to solve for the means μ1 and μ2.
• Because they are not, we will instead use the EM algorithm.

EM algorithm
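In its standard form for this problem, the EM algorithm starts from an initial hypothesis h = (μ1, μ2) and repeats two steps until the means stop changing. Step 1 (estimation) computes the expected value E[zij] of each hidden variable under the current means, which is proportional to exp(-(xi - μj)² / 2σ²). Step 2 (maximization) replaces each μj by the average of the xi weighted by E[zij]. The sketch below is a minimal illustration under the assumptions already stated (uniform mixing, shared known σ); the function name, data and starting means are hypothetical.

```python
import math
import random


def em_means(xs, sigma, initial_means, n_iterations=50):
    """Estimate the means of a uniform mixture of Normals with known sigma."""
    means = list(initial_means)
    for _ in range(n_iterations):
        # E-step: E[z_ij] is proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
        expected_z = []
        for x in xs:
            exponents = [-((x - mu) ** 2) / (2 * sigma ** 2) for mu in means]
            shift = max(exponents)   # subtract the max to avoid numerical underflow
            weights = [math.exp(e - shift) for e in exponents]
            total = sum(weights)
            expected_z.append([w / total for w in weights])

        # M-step: each new mean is the E[z_ij]-weighted average of the x_i.
        means = [
            sum(ez[j] * x for ez, x in zip(expected_z, xs))
            / sum(ez[j] for ez in expected_z)
            for j in range(len(means))
        ]
    return means


# Hypothetical data: 1000 points from a uniform mixture of N(0, 1) and N(5, 1).
rng = random.Random(0)
xs = [rng.gauss(rng.choice([0.0, 5.0]), 1.0) for _ in range(1000)]

estimated = em_means(xs, sigma=1.0, initial_means=[1.0, 4.0])
print(estimated)
```

On data like this, with well-separated components, the estimates should settle close to the true means. EM converges to a local maximum of the likelihood, so the result can depend on the starting means, and a fixed iteration count is used here instead of a convergence test to keep the sketch short.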


Question Bank

1. Define Bayes theorem. What are the relevance and features of Bayes theorem?
2. Explain the practical difficulties of applying Bayes theorem.
3. Define the Maximum a Posteriori (MAP) and Maximum Likelihood (ML) hypotheses. Derive
the relation for hMAP and hML using Bayes theorem.
4. Consider a medical diagnosis problem in which there are two alternative hypotheses: (1) that
the patient has a particular form of cancer (+), and (2) that the patient does not (-). A patient
takes a lab test and the result comes back positive. The test returns a correct positive result
in only 98% of the cases in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of
the entire population have this cancer. Determine whether the patient has cancer or not
using the MAP hypothesis.
5. Explain brute-force Bayes concept learning.
6. What are consistent learners?
7. Discuss the maximum likelihood and least-squared-error hypotheses.
8. Describe the maximum likelihood hypothesis for predicting probabilities.
9. Explain the gradient search to maximize likelihood in a neural net.
10. Describe the concept of MDL. Obtain the equation for hMDL.
11. Explain the Bayes optimal classifier.
12. Explain the Gibbs algorithm.
13. Explain the naive Bayes classifier with an example.
14. What are Bayesian belief nets? Where are they used?
15. Explain Bayesian belief networks and conditional independence with an example.
16. Explain gradient ascent training of Bayesian networks.
17. Explain the concept of the EM algorithm. Discuss what Gaussian mixtures are.
