Unit I - Mlfinal
Review of Linear Algebra for machine learning; Introduction and motivation for machine
learning; Examples of machine learning applications, Vapnik-Chervonenkis (VC) dimension,
Probably Approximately Correct (PAC) learning, Hypothesis spaces, Inductive bias,
Generalization, Bias variance trade-off.
The term Linear Algebra was introduced in the early 18th century to find the unknowns in linear equations and solve them easily; it is an important branch of mathematics that helps in studying data. Linear Algebra is undoubtedly a primary requirement for working with Machine Learning applications, and it is a prerequisite for learning Machine Learning and data science. Linear algebra plays a vital role as a key foundation of machine learning, and it enables ML algorithms to run on huge datasets.
The concepts of linear algebra are widely used in developing algorithms in machine learning. Although it is used in almost every concept of Machine Learning, it can specifically perform the following tasks:
o Optimization of data.
o Applicable in loss functions, regularisation, covariance matrices, Singular Value
Decomposition (SVD), Matrix Operations, and support vector machine
classification.
o Implementation of Linear Regression in Machine Learning.
Besides the above uses, linear algebra is also used in neural networks and the data
science field.
Basic mathematics principles and concepts like Linear algebra are the foundation of
Machine Learning and Deep Learning systems.
Linear Algebra is to Machine Learning what flour is to a bakery: just as a cake is based on flour, every Machine Learning model is based on Linear Algebra. Further, just as a cake needs more ingredients such as eggs, sugar, cream, and soda, Machine Learning also requires more concepts such as vector calculus, probability, and optimization theory. So, we can say that Machine Learning creates a useful model with the help of the above-mentioned mathematical concepts.
Below are some benefits of learning Linear Algebra before Machine learning:
o Better Graphic experience
o Improved Statistics
o Creating better Machine Learning algorithms
o Estimating the forecast of Machine Learning
o Easy to Learn
Better Graphic Experience:
Linear Algebra helps to provide better graphical processing in Machine Learning, such as image, audio, video, and edge detection. These are the various graphical representations supported by Machine Learning projects that you can work on. Further, parts of the given dataset are trained based on their categories by classifiers provided by machine learning algorithms. These classifiers also remove the errors from the trained data.
Moreover, Linear Algebra helps to solve and compute large and complex datasets through matrix decomposition techniques. The two most popular matrix decomposition techniques are as follows (a short example is given after this list):
o QR decomposition
o LU decomposition
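As a quick illustration (assuming NumPy and SciPy are available; the matrix values below are arbitrary), both decompositions can be computed and verified in a few lines:

import numpy as np
from scipy.linalg import lu   # LU decomposition lives in SciPy

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

# QR decomposition: A = Q R, with Q orthogonal and R upper triangular
Q, R = np.linalg.qr(A)
print(np.allclose(Q @ R, A))        # True

# LU decomposition: A = P L U, with P a permutation matrix,
# L lower triangular and U upper triangular
P, L, U = lu(A)
print(np.allclose(P @ L @ U, A))    # True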
Improved Statistics:
Linear Algebra also helps to create better supervised as well as unsupervised Machine
Learning algorithms.
A few supervised learning algorithms that can be created using Linear Algebra are as follows:
o Logistic Regression
o Linear Regression
o Decision Trees
o Support Vector Machines (SVM)
Further, several unsupervised learning algorithms (for example, clustering and Principal Component Analysis) can also be created with the help of linear algebra.
With the help of Linear Algebra concepts, you can also self-customize the various parameters of a live project and gain the in-depth understanding needed to deliver results with more accuracy and precision.
If you are working on a Machine Learning project, you need to be broad-minded and able to bring in more perspectives. Hence, you should increase your awareness of and affinity for Machine Learning concepts. You can begin by setting up different graphs and visualizations, using various parameters for diverse machine learning algorithms, or taking up things that others around you might find difficult to understand.
Easy to Learn:
Notation in linear algebra enables you to read algorithm descriptions in papers, books,
and websites to understand the algorithm's working. Even if you use for-loops rather than
matrix operations, you will be able to piece things together.
Operations:
Working at a higher level of abstraction with vectors and matrices can make concepts clearer, and it also helps in describing, coding, and even thinking about algorithms. In linear algebra, it is necessary to learn the basic operations such as addition, multiplication, inversion, and transposition of matrices and vectors.
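For instance (a small NumPy sketch with made-up values), the basic operations mentioned above look like this:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])
v = np.array([1.0, 2.0])

print(A + B)             # element-wise addition
print(A @ B)             # matrix multiplication
print(A @ v)             # matrix-vector product
print(A.T)               # transpose
print(np.linalg.inv(A))  # inverse (exists because det(A) = -2 is non-zero)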
Matrix Factorization:
One of the most recommended areas of linear algebra is matrix factorization, specifically matrix decomposition methods such as SVD and QR.
o Recommender Systems
o One-hot encoding
o Regularization
o Principal Component Analysis
o Images and Photographs
o Singular-Value Decomposition
o Deep Learning
o Latent Semantic Analysis
Each machine learning project works on the dataset, and we fit the machine learning model
using this dataset.
Each dataset resembles a table-like structure consisting of rows and columns, where each row represents an observation and each column represents a feature/variable. This dataset is handled as a matrix, which is a key data structure in Linear Algebra.
Further, when this dataset is divided into input and output for the supervised learning model,
it represents a Matrix(X) and Vector(y), where the vector is also an important concept of
linear algebra.
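A minimal sketch of this idea (the numbers below are made up purely for illustration):

import numpy as np

# Rows are observations, columns are features; the last column is the target.
data = np.array([
    [5.1, 3.5, 1.4, 0],
    [4.9, 3.0, 1.4, 0],
    [6.2, 3.4, 5.4, 1],
    [5.9, 3.0, 5.1, 1],
])

X = data[:, :-1]   # input matrix X (one row per example)
y = data[:, -1]    # output vector y (one label per example)
print(X.shape, y.shape)   # (4, 3) (4,)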
In machine learning, images/photographs are used for computer vision applications. Each image is an example of a matrix from linear algebra, because an image is a grid of pixels with a given height and width.
Moreover, different operations on images, such as cropping, scaling, resizing, etc., are
performed using notations and operations of Linear Algebra.
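For example (a made-up 4x4 grayscale "image"), common image operations reduce to simple matrix manipulations:

import numpy as np

image = np.arange(16).reshape(4, 4)   # a grayscale image is a matrix of pixel intensities

cropped = image[1:3, 1:3]   # cropping = slicing the matrix
flipped = image[:, ::-1]    # flipping = reversing the column order
scaled  = 0.5 * image       # dimming  = scalar multiplication
print(cropped.shape, flipped.shape, scaled.max())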
In machine learning, sometimes, we need to work with categorical data. These categorical
variables are encoded to make them simpler and easier to work with, and the popular
encoding technique to encode these variables is known as one-hot encoding.
In the one-hot encoding technique, a table is created that shows a variable with one column
for each category and one row for each example in the dataset. Further, each row is
encoded as a binary vector, which contains either zero or one value. This is an example
of sparse representation, which is a subfield of Linear Algebra.
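A small sketch of one-hot encoding without any library helpers (the category values are made up):

import numpy as np

colors = ['red', 'green', 'blue', 'green']
categories = sorted(set(colors))               # ['blue', 'green', 'red']

# one column per category, one row per example, a single 1 in each row
one_hot = np.zeros((len(colors), len(categories)), dtype=int)
for row, value in enumerate(colors):
    one_hot[row, categories.index(value)] = 1

print(categories)
print(one_hot)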
4. Linear Regression
Linear regression is a technique borrowed from statistics and used in machine learning to predict numerical values. The most common way to solve linear regression problems is Least Squares Optimization, which is solved with the help of matrix factorization methods. Some commonly used matrix factorization methods are LU decomposition and Singular-Value Decomposition, which are concepts of linear algebra.
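As a sketch (made-up data roughly following y = 2x + 1), the least-squares problem X w ≈ y can be solved directly with a factorization-based routine:

import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])          # first column of ones models the intercept
y = np.array([3.1, 4.9, 7.2, 8.8])

# np.linalg.lstsq solves the least-squares problem via an SVD-based routine
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(w)   # approximately [1.15, 1.94] -> intercept and slope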
5. Regularization
In machine learning, we usually look for the simplest possible model to achieve the best
outcome for the specific problem. Simpler models generalize well, ranging from specific
examples to unknown datasets. These simpler models are often considered models with
smaller coefficient values.
A technique used to minimize the size of coefficients of a model while it is being fit on
data is known as regularization. Common regularization techniques are L1 and L2
regularization. Both of these forms of regularization are, in fact, a measure of the
magnitude or length of the coefficients as a vector and are methods lifted directly from
linear algebra called the vector norm.
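For example, the two penalties are just the L1 and L2 vector norms of the coefficient vector (the coefficients below are made up):

import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])   # a model's coefficient vector

l1 = np.linalg.norm(w, 1)   # L1 norm: sum of absolute values -> 4.7
l2 = np.linalg.norm(w, 2)   # L2 norm: Euclidean length       -> about 3.27
print(l1, l2)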
6. Principal Component Analysis
Generally, a dataset may contain thousands of features, and fitting a model to such a large dataset is one of the most challenging tasks of machine learning. Moreover, a model built with irrelevant features is less accurate than a model built with relevant features. There are several methods in machine learning that automatically reduce the number of columns of a dataset, and these methods are known as Dimensionality Reduction. The most commonly used dimensionality reduction method in machine learning is Principal Component Analysis (PCA). This technique makes projections of high-dimensional data, both for visualization and for training models. PCA uses the matrix factorization method from linear algebra.
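A minimal PCA sketch using the SVD (the 5x3 data matrix is made up):

import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.2],
              [2.2, 2.9, 0.3],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1]])

Xc = X - X.mean(axis=0)                           # centre each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

X_2d = Xc @ Vt[:2].T    # project onto the first 2 principal components
print(X_2d.shape)       # (5, 2)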
7. Singular-Value Decomposition
Natural Language Processing or NLP is a subfield of machine learning that works with
text and spoken words.
NLP represents a text document as large matrices with the occurrence of words. For
example, the matrix column may contain the known vocabulary words, and rows may
contain sentences, paragraphs, pages, etc., with cells in the matrix marked as the count
or frequency of the number of times the word occurred. It is a sparse matrix
representation of text. Documents processed in this way are much easier to compare,
query, and use as the basis for a supervised machine learning model.
This form of data preparation is called Latent Semantic Analysis, or LSA for short, and
is also known by the name Latent Semantic Indexing or LSI.
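A tiny LSA-style sketch (the vocabulary and counts are invented): build a document-term count matrix and take a truncated SVD to obtain a low-dimensional "topic" representation of each document:

import numpy as np

vocab = ['machine', 'learning', 'linear', 'algebra', 'cake']
counts = np.array([          # rows = documents, columns = vocabulary words
    [2, 2, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 2, 2, 0],
    [0, 0, 0, 0, 3],
], dtype=float)

U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2                              # keep only the top-k singular values
docs_topics = U[:, :k] * s[:k]     # each document as a point in k latent dimensions
print(docs_topics.shape)           # (4, 2)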
9. Recommender System
Artificial Neural Networks (ANNs) are non-linear ML algorithms that process information and transfer it from one layer to another in a way similar to the brain.
A Machine Learning process begins by feeding the machine lots of
data. By using this data, the machine is trained to detect hidden insights and trends.
These insights are then used to build a Machine Learning Model by using an algorithm
in order to solve a problem.
There are three main types of Machine Learning:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Learning
Supervised learning is a technique in which we teach or train the machine using data
which is well labeled.
Supervised Learning
Consider the above figure. Here we’re feeding the machine images of Tom and Jerry
and the goal is to identify and classify the images into two groups (Tom images and
Jerry images). The training data set that is fed to the model is labeled, as in, we’re telling
the machine, ‘this is how Tom looks and this is Jerry’. By doing so you’re training the
machine by using labeled data. In Supervised Learning, there is a well-defined training
phase done with the help of labeled data.
Unsupervised Learning
Unsupervised learning involves training by using unlabeled data and allowing the
model to act on that information without guidance.
Think of unsupervised learning as a smart kid that learns without any guidance. In this
type of Machine Learning, the model is not fed with labeled data, as in the model has no
clue that ‘this image is Tom and this is Jerry’, it figures out patterns and the differences
between Tom and Jerry on its own by taking in tons of data.
Unsupervised Learning
For example, it identifies prominent features of Tom such as pointy ears, bigger size,
etc, to understand that this image is of type 1. Similarly, it finds such features in Jerry
and knows that this image is of type 2. Therefore, it classifies the images into two different classes without knowing who Tom or Jerry is.
Reinforcement Learning
Consider the example of training a dog. The goal of reinforcement learning in this case is to train the dog (agent) to complete a task within an environment, which includes the surroundings of the dog as well as the trainer.
First, the trainer issues a command, which the dog observes (observation). The dog then responds by taking an action. If the action is close to the desired behavior, the trainer will likely provide a reward, such as a food treat; otherwise, no reward or a negative reward will be provided.
At the beginning of training, the dog will likely take more random actions like rolling
over when the command given is “Sit”, as it is trying to associate specific observations
with actions and rewards. This association, or mapping, between observations and actions is called the policy.
From the dog's perspective, the ideal case would be one in which it would respond correctly to every command, so that it gets as many treats as possible.
So, the whole meaning of reinforcement learning training is to "tune" the dog's policy so that it learns the desired behaviors that will maximize some reward. After training is complete, the dog should be able to observe the owner and take the appropriate action, for example, sitting when commanded to "sit", by using the internal policy it has developed.
Let's assume that you have been given a problem that needs to be solved by using
Machine Learning.
The problem is to predict the occurrence of rain in your local area by using Machine
Learning.
Step 1: Define the Objective of the Problem
At this step, we must understand what exactly needs to be predicted. In our case, the
objective is to predict the possibility of rain by studying weather conditions. At this
stage, it is also essential to take mental notes on what kind of data can be used to solve
this problem or the type of approach you must follow to get to the solution.
Step 2: Data Gathering
Once you know the type of data that is required, you must understand how you can
derive this data. Data collection can be done manually or by web scraping. However, if
you’re a beginner and you’re just looking to learn Machine Learning you don’t have to
worry about getting the data. There are 1000s of data resources on the web, you can just
download the data set and get going.
Coming back to the problem at hand, the data needed for weather forecasting includes
measures such as humidity level, temperature, pressure, locality, whether or not you live
in a hill station, etc. Such data must be collected and stored for analysis.
Step 3: Data Preparation
The data you collected is almost never in the right format. You will encounter a lot of
inconsistencies in the data set such as missing values, redundant variables, duplicate
values, etc. Removing such inconsistencies is very essential because they might lead to
wrongful computations and predictions. Therefore, at this stage, you scan the data set for
any inconsistencies and you fix them then and there.
Step 4: Exploratory Data Analysis
Grab your detective glasses because this stage is all about diving deep into data and
finding all the hidden data mysteries. EDA or Exploratory Data Analysis is the
brainstorming stage of Machine Learning. Data Exploration involves understanding the
patterns and trends in the data. At this stage, all the useful insights are drawn and
correlations between the variables are understood.
For example, in the case of predicting rainfall, we know that there is a strong possibility
of rain if the temperature has fallen low. Such correlations must be understood and
mapped at this stage.
Step 5: Building a Machine Learning Model
All the insights and patterns derived during Data Exploration are used to build the
Machine Learning Model. This stage always begins by splitting the data set into two
parts, training data, and testing data. The training data will be used to build and analyze
the model. The logic of the model is based on the Machine Learning Algorithm that is
being implemented.
In the case of predicting rainfall, since the output will be in the form of True (if it will
rain tomorrow) or False (no rain tomorrow), we can use a Classification Algorithm such
as Logistic Regression.
Choosing the right algorithm depends on the type of problem you’re trying to solve, the
data set and the level of complexity of the problem. In the upcoming sections, we will
discuss the different types of problems that can be solved by using Machine Learning.
Step 6: Model Evaluation & Optimization
After building a model by using the training data set, it is finally time to put the model to
a test. The testing data set is used to check the efficiency of the model and how
accurately it can predict the outcome. Once the accuracy is calculated, any further
improvements in the model can be implemented at this stage. Methods like parameter
tuning and cross-validation can be used to improve the performance of the model.
Step 7: Predictions
Once the model is evaluated and improved, it is finally used to make predictions. The
final output can be a Categorical variable (eg. True or False) or it can be a Continuous
Quantity (eg. the predicted value of a stock).
In our case, for predicting the occurrence of rainfall, the output will be a categorical
variable.
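The last three steps can be sketched in a few lines with scikit-learn. The weather features and labels below are synthetic stand-ins, not real data, and the feature names are only assumptions for illustration:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((200, 3))                            # e.g. humidity, temperature, pressure (scaled)
y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)     # synthetic "rain tomorrow" label

# Step 5: split the data and build the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Step 6: evaluate the model on the held-out test set
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 7: predict for a new observation
print("rain tomorrow?", bool(model.predict([[0.8, 0.6, 0.4]])[0]))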
So that was the entire Machine Learning process. Now it’s time to learn about the
different ways in which Machines can learn.
Step 1- Choosing the Training Experience: The very important and first task is to
choose the training data or training experience which will be fed to the Machine
Learning Algorithm. The training experience will be able to provide direct or indirect
feedback regarding choices.
For example, while playing chess, the training experience provides feedback such as: if this move is chosen instead of that one, the chances of success increase.
The second important attribute is the degree to which the learner controls the sequence of training examples.
The third important attribute is how well the training experience represents the distribution of examples over which the final system performance will be measured.
Step 2- Choosing target function: The next important step is choosing the target
function. It means according to the knowledge fed to the algorithm the machine learning
will choose NextMove function which will describe what type of legal moves should be
taken
Step 3- Choosing Representation for Target function: When the machine algorithm
will know all the possible legal moves the next step is to choose the optimized move
using any representation i.e. using linear Equations, Hierarchical Graph Representation,
Tabular form etc.
Step 5- Final Design: The final design is created at last when the system has gone through a number of examples, failures and successes, and correct and incorrect decisions, and has learned what the next step should be. Example: Deep Blue, an ML-based intelligent computer, won a chess game against the chess expert Garry Kasparov and became the first computer to beat a human chess expert.
The learning process, whether by a human or a machine, can be divided into four
components, namely, data storage, abstraction, generalization and evaluation. Figure 1.1
illustrates the various components and the steps involved in the learning process.
1. Data storage
Facilities for storing and retrieving huge amounts of data are an important
component of the learning process. Humans and computers alike utilize data storage
as a foundation for advanced reasoning.
In a human being, the data is stored in the brain and data is retrieved using electro
chemical signals.
Computers use hard disk drives, flash memory, random access memory and similar
devices to store data and use cables and other technology to retrieve data.
2. Abstraction
3. Generalization
4. Evaluation
2. In finance, banks analyze their past data to build models to use in credit
applications, fraud detection, and the stock market.
3. In manufacturing, learning models are used for optimization, control,
and troubleshooting.
4. In medicine, learning programs are used for medical diagnosis.
3. VC Dimension
Shattering Instances
A hypothesis space is said to shatter a set of instances iff for every partition of the
instances into positive and negative, there is a hypothesis that produces that
partition.
For example, consider 2 instances described using a single real-valued feature being
shattered by intervals.
VC Dimension of a straight line with two points, three points and four points
For two points: There are four possible training-set labelings to consider. Draw a line that separates the two classes (+, -); we need to show that there are values of the line parameters a, b which realize all four possible dichotomies (+, +), (-, -), (+, -), (-, +).
Here N = 2, so there are 2^N = 2^2 = 4 dichotomies, and a straight line can realize all of them.
Consider a straight line as the classification model (a perceptron). The line should separate positive and negative data points. There exist sets of 3 non-collinear points that can indeed be shattered by this model. Since there are 3 points and 2 classes (+, -), we can have 2^3 = 8 possible labelings, shown below in the x-y plane.
Figure showing how a linear model can shatter all dichotomies of 3 points on a 2D plane
It can be seen that a straight line can shatter 3 points, but it cannot shatter 4 points. Thus the VC dimension of the straight-line model in the 2D plane is 3.
Take two data points: the data points inside the rectangle are positive and those outside the rectangle are negative. Hence an axis-aligned rectangle can shatter two points in the two-dimensional space R^2.
For three data points, an axis-aligned rectangle can likewise shatter the three points in R^2.
For four data points (for example, four points arranged in a diamond), an axis-aligned rectangle can shatter the four points in R^2.
Since 4 is the maximum number of data points that can be shattered by an axis-aligned rectangle, the VC dimension of axis-aligned rectangles in R^2 is 4.
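The shattering argument can also be checked by brute force: for every labeling of a point set, the minimal bounding rectangle of the positive points must contain no negative point. This is a small sketch of that check (the point coordinates are chosen arbitrarily):

from itertools import product

def rectangle_shatters(points):
    # Check every +/- labeling of the points.
    for labels in product([0, 1], repeat=len(points)):
        pos = [p for p, l in zip(points, labels) if l == 1]
        neg = [p for p, l in zip(points, labels) if l == 0]
        if not pos:
            continue   # the all-negative labeling is realized by an "empty" rectangle
        xmin = min(x for x, _ in pos); xmax = max(x for x, _ in pos)
        ymin = min(y for _, y in pos); ymax = max(y for _, y in pos)
        if any(xmin <= x <= xmax and ymin <= y <= ymax for x, y in neg):
            return False   # this dichotomy cannot be realized
    return True

# Four points in a diamond arrangement can be shattered ...
print(rectangle_shatters([(0, 1), (1, 0), (2, 1), (1, 2)]))            # True
# ... but adding a fifth (centre) point makes shattering impossible.
print(rectangle_shatters([(0, 1), (1, 0), (2, 1), (1, 2), (1, 1)]))    # False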
4. Probably Approximately Correct (PAC) Learning
In computer science, computational learning theory (or just learning theory) is a subfield
of artificial intelligence devoted to studying the design and analysis of machine learning
algorithms. In computational learning theory, probably approximately correct learning
(PAC learning) is a framework for mathematical analysis of machine learning
algorithms. It was proposed in 1984 by Leslie Valiant.
In this framework, the learner (that is, the algorithm) receives samples and must select a
hypothesis from a certain class of hypotheses. The goal is that, with high probability (the
“probably” part), the selected hypothesis will have low generalization error (the
“approximately correct” part).
4(a) PAC-learnability
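A standard result in this framework (stated here for a consistent learner and a finite hypothesis space H; ε is the error tolerance, the "approximately" part, and δ the failure probability, the "probably" part) bounds the number of training examples m needed:

m ≥ (1/ε) (ln |H| + ln (1/δ))

With at least this many examples, any hypothesis consistent with the training data has error at most ε with probability at least 1 − δ.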
Hypothesis spaces
Definition
1. Hypothesis
In a binary classification problem, a hypothesis is a statement or a proposition
purporting to explain a given set of facts or observations.
2. Hypothesis space
The hypothesis space for a binary classification problem is the set of hypotheses for the problem that might possibly be returned by the learning algorithm.
Let x be an example in a binary classification problem and let c(x) denote the class label
assigned to x (c(x) is 1 or 0). Let D be a set of training examples for the problem. Let h
be a hypothesis for the problem and h(x) be the class label assigned to x by the
hypothesis h.
(a) We say that the hypothesis h is consistent with the set of training examples D
if h(x) = c(x) for all x ∈ D
Examples
1. Consider the set of observations of a variable x with the associated class labels given in
Table 2.1:
x 27 15 23 20 25 17 12 30 6 10
Class 1 0 1 1 1 0 0 1 0 0
Figure 2.1: Data in Table 2.1 with hollow dots representing positive examples and solid dots representing negative examples
Looking at Figure 2.1, it appears that the class labeling has been done based on the following
rule.
h′ : IF x ≥ 20 THEN “1” ELSE “0”. (2.1)
Note that h′ is consistent with the training examples in Table 2.1. It also assigns class labels to new observations; for example, h′(5) = 0 and h′(28) = 1.
The hypothesis h′ explains the data. The following proposition also explains the data:
h′′ : IF x ≥ 18 THEN "1" ELSE "0".
It is not enough that the hypothesis explains the given data; it must also predict correctly the
class label of future observations. So we consider a set of such hypotheses and choose the
“best” one. The set of hypotheses can be defined using a parameter, say m, as given below:
hm : IF x ≥ m THEN "1" ELSE "0".
The set of all hypotheses obtained by assigning different values to m constitutes the hypothesis space H.
For the same data, we can have different hypothesis spaces. For example, for the data in
Table 2.1, we may also consider the hypothesis space defined by the following proposition:
h′m : IF x ≤ m THEN "0" ELSE "1".
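A small sketch that checks consistency over the data in Table 2.1, using the first family hm: IF x ≥ m THEN "1" ELSE "0" described above:

# Data from Table 2.1
xs     = [27, 15, 23, 20, 25, 17, 12, 30, 6, 10]
labels = [ 1,  0,  1,  1,  1,  0,  0,  1, 0,  0]

def h(m, x):
    # Hypothesis h_m: IF x >= m THEN 1 ELSE 0
    return 1 if x >= m else 0

def consistent(m):
    # h_m is consistent with the data if it reproduces every class label
    return all(h(m, x) == c for x, c in zip(xs, labels))

# Every integer threshold strictly above the largest negative value (17) and
# not above the smallest positive value (20) gives a consistent hypothesis.
print([m for m in range(0, 31) if consistent(m)])   # [18, 19, 20]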
Consistent Hypothesis
For Example:
Solution:
There are five attributes and 'buy' is the target variable. Now check whether h1 and h2 are consistent with all the training examples.
h1 = (?, ?, No, ?, Many) – check its consistency with the training examples above.
For example 1: 'No' in h1 matches 'No' (In Library); '?' matches 'Affordable' (Price); but 'Many' does not match 'One', so h1 does not cover example 1 and classifies it as negative. Since example 1 is in fact negative (buy = no), h1 is consistent with it.
For example 2, all the attribute values match, so h1 classifies it as positive, which is the expected class. So h1 is consistent with example 1 and also with example 2.
h1 = (?, ?, No, ?, Many) is a consistent hypothesis, as it is consistent with all the training examples.
Version space
Definition
The version space VSH,D is the subset of hypotheses from H that are consistent with the training examples in D.
Consider a binary classification problem. Let D be a set of training examples and H a hypothesis space for
the problem. The version space for the problem with respect to the set D and the space H is the set of
hypotheses from H consistent with D; that is, it is the set
VSH,D = {h ∈ H ∶ h(x) = c(x) for all x ∈ D}.
Examples
Example2:
Consider the problem of finding a rule for determining days on which one can enjoy water sport. The rule is to
depend on a few attributes like “temp”, ”humidity”, etc. Suppose we have the following data to help us devise the
rule. In the data, a value of “1” for “enjoy” means “yes” and a value of “0” indicates ”no”.
Find the hypothesis space and the version space for the problem.
List-Then-Eliminate algorithm
1. Initialize VersionSpace to a list containing every hypothesis in H.
2. For each training example <x, c(x)>, remove from VersionSpace any hypothesis h for which h(x) != c(x).
3. Output the list of hypotheses in VersionSpace.
Example:
F1 – > A, B
F2 – > X, Y
Here F1 and F2 are two features (attributes) with two possible values for each feature or
attribute.
Instance Space: (A, X), (A, Y), (B, X), (B, Y) – 4 Examples
Hypothesis Space: (A, X), (A, Y), (A, ø), (A, ?), (B, X), (B, Y), (B, ø), (B, ?), (ø, X), (ø, Y),
(ø, ø), (ø, ?), (?, X), (?, Y), (?, ø), (?, ?) – 16 Hypothesis
A hypothesis containing ø never matches any instance; here we have 7 hypotheses containing ø, and they are all semantically equivalent (each classifies every instance as negative), so we keep a single representative (ø, ø) and remove the other null hypotheses.
Semantically Distinct Hypotheses: (A, X), (A, Y), (A, ?), (B, X), (B, Y), (B, ?), (?, X), (?, Y), (?, ?), (ø, ø) – 10
Initial Version Space: (A, X), (A, Y), (A, ?), (B, X), (B, Y), (B, ?), (?, X), (?, Y), (?, ?), (ø, ø)
•Training Instances
F1 F2 Target
A X Yes
A Y Yes
Find all hypotheses which are consistent here, considering one hypothesis at a time. Check it against the training examples; if it is consistent with all of them, retain it, otherwise remove it from the VersionSpace.
Step 1: Take (A, X). For example 1, A matches A and X matches X, so it predicts Yes, which is correct. But for example 2, A matches A while X does not match Y, so it predicts No although the label is Yes. Hence (A, X) is inconsistent; remove it from the version space.
After Step 1: Version Space: (A, Y), (A, ?), (B, X), (B, Y), (B, ?), (?, X), (?, Y), (?, ?), (ø, ø)
Step 2: (A, Y) is inconsistent with example 1 (it predicts No for (A, X), whose label is Yes); remove it.
After Step 2: Version Space: (A, ?), (B, X), (B, Y), (B, ?), (?, X), (?, Y), (?, ?), (ø, ø)
Step 3: (A, ?): A matches A and ? matches X, so it is consistent with example 1; similarly A matches A and ? matches Y, so it is consistent with example 2. Retain the hypothesis.
After Step 3: Version Space: (A, ?), (B, X), (B, Y), (B, ?), (?, X), (?, Y), (?, ?), (ø, ø)
Steps 4, 5 and 6: (B, X), (B, Y) and (B, ?) all predict No for both examples, whose labels are Yes, so all three are inconsistent; remove them.
After Step 6: Version Space: (A, ?), (?, X), (?, Y), (?, ?), (ø, ø)
Steps 7 and 8: (?, X) matches example 1 but not example 2, and (?, Y) does not match example 1, so both are inconsistent; remove them from the version space.
After Step 8: Version Space: (A, ?), (?, ?), (ø, ø)
Step 9: (?, ?) matches both example 1 and example 2 and predicts positive, as expected, so it is consistent; retain it.
After Step 9: Version Space: (A, ?), (?, ?), (ø, ø)
Step 10: (ø, ø) never matches any instance, so it predicts No for both examples although their labels are Yes; it is inconsistent, remove it.
After Step 10: Version Space: (A, ?), (?, ?)
Consistent hypotheses: (A, ?) and (?, ?). In List-Then-Eliminate, what we do is first list all the hypotheses in the version space and then eliminate every hypothesis that is inconsistent with some training example.
Enumerating all the hypotheses in this way is rather inefficient, because listing every hypothesis is a waste of time.
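The same elimination can be written directly as a short program. This is a minimal sketch of List-Then-Eliminate for the two-feature example above (the helper names are my own, not from the notes):

def covers(h, x):
    # A hypothesis covers an instance if every attribute is '?' or equal to
    # the instance's value; an attribute 'ø' never matches anything.
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

# the 10 semantically distinct hypotheses over F1 in {A, B} and F2 in {X, Y}
hypotheses = [(f1, f2) for f1 in ('A', 'B', '?') for f2 in ('X', 'Y', '?')] + [('ø', 'ø')]

training = [(('A', 'X'), True), (('A', 'Y'), True)]

# keep only the hypotheses consistent with every training example
version_space = [h for h in hypotheses
                 if all(covers(h, x) == label for x, label in training)]
print(version_space)    # [('A', '?'), ('?', '?')]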
Hypothesis:
It is usually represented with an ‘h’. In supervised machine learning, a
hypothesis is a function that best characterizes the target.
Specific Hypothesis:
If a hypothesis h covers none of the negative examples, and there is no other hypothesis h′ that also covers none of the negative examples such that h is strictly more general than h′, then h is said to be the most specific hypothesis.
Find-S:
The find-S algorithm is a machine learning concept learning algorithm.
The find-S technique identifies the hypothesis that best matches all of the
positive cases. The find-S algorithm considers only positive cases.
The Find-S method starts with the most specific hypothesis and generalizes it whenever it fails to classify an observed positive training example.
Representations:
1. Initialize the value of the hypothesis for all attributes with the most
specific one. That is,
h0 = < ϕ, ϕ, ϕ, ϕ...........>
2. Take the next example, if the taken example is negative leave them and
move on to another example without changing our hypothesis for the step.
For each attribute, check if the value of the attribute is equal to that of the
value we took in our hypothesis.
If the value is equal then we’ll use the same value for the attribute in our
hypothesis and move to another attribute.
If the value of the attribute is not equal to that of the value in our specific
hypothesis then change the value of our attribute in a specific hypothesis to
the most general hypothesis (?).
Let’s have a look at an example to see how Find-S works.
Consider the following data set, which contains information about the best
day for a person to enjoy their preferred sport.
Sky Air temp Humidity Wind Water Forecast EnjoySport
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
Now initializing the value of the hypothesis for all attributes with the most
specific one.
h0 = < ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>
Consider example 1, The attribute values are < Sunny, Warm, Normal,
Strong, Warm, Same>. Since its target class(EnjoySport) value is yes, it is
considered as a positive
example.
Now, we can see that our first hypothesis is too specific, and we must generalize it in this case. As a result, the hypothesis is:
h1 = < Sunny, Warm, Normal, Strong, Warm, Same>
The second training example (also positive in this case) compels the algorithm to generalize h further, this time by replacing with a "?" any attribute value in h that is not satisfied by the new example. The attribute values of this example are < Sunny, Warm, High, Strong, Warm, Same>, so the hypothesis becomes:
h2 = < Sunny, Warm, ?, Strong, Warm, Same>
Consider example 3, The attribute values are < Rainy, Cold, High, Strong,
Warm, Change>. But since the target class value is No, it is considered as
a negative example.
h3 = < Sunny, Warm, ?, Strong, Warm, Same > (Same as that of h2)
Every negative example is simply ignored by the FIND-S algorithm. As a
result, no changes to h will be necessary in reaction to any unfavorable
case.
FIND-S will always return the most specific hypothesis inside H that
matches the positive training instances.
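A minimal Find-S sketch on the EnjoySport data above (the helper names and the data layout are my own assumptions for illustration):

def find_s(examples):
    n = len(examples[0][0])
    h = ['φ'] * n                         # start with the most specific hypothesis
    for x, label in examples:
        if label != 'Yes':
            continue                      # Find-S ignores negative examples
        # generalize each attribute just enough to cover the positive example
        h = [xv if hv in ('φ', xv) else '?' for hv, xv in zip(h, x)]
    return h

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
print(find_s(data))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']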
Find the maximally general hypothesis and maximally specific hypothesis for the training examples given in the table using the candidate elimination algorithm.
Training Examples: (the same EnjoySport data given in the table above)
Step 1:
Initialize G and S as the most general and the most specific hypotheses respectively.
G = {<'?', '?', '?', '?', '?', '?'>}
S = {'φ', 'φ', 'φ', 'φ', 'φ', 'φ'}
Step 2:
For each +ve example, make the specific hypothesis more general. After the first positive instance,
S = {'sunny', 'warm', 'normal', 'strong', 'warm', 'same'}
Step 3:
Compare with another positive instance for each attribute.
if (attribute value = hypothesis value) do nothing.
else
replace the hypothesis value with more general constraint '?'.
Since instance 2 is also positive, we compare with it. In instance 2 the humidity attribute changes, so we generalize that attribute:
S = {'sunny', 'warm', '?', 'strong', 'warm', 'same'}
Step 4:
Instance 3 is negative, so for each negative example we make the general hypothesis more specific. We do this by comparing each attribute of the negative instance with the current specific hypothesis; wherever an attribute value is found to be different, we create a dedicated (specialized) hypothesis for that attribute.
G ={<'sunny', '?','?','?', '?','?'> , <'?', 'warm','?','?', '?','?'> , <'?', '?','Normal','?', '?','?'> ,
< '?', '?','?','?', '?','same'>}
Step 5:
Instance 4 is positive so repeat step 3:
S={'sunny', 'warm','?', 'Strong', '?', '?'}
Discard the general hypotheses which contradict the resultant specific hypothesis; here the humidity and forecast hypotheses are contradicting.
G = {<'sunny', '?', '?', '?', '?', '?'>, <'?', 'warm', '?', '?', '?', '?'>}
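A simplified Candidate-Elimination sketch for this conjunctive representation (following Mitchell's version of the algorithm; the intermediate G boundary can differ slightly from the hand-worked steps above, but the final S and G agree). The function and variable names are my own:

def candidate_elimination(examples):
    n = len(examples[0][0])
    S = ['φ'] * n           # most specific boundary (a single hypothesis here)
    G = [['?'] * n]         # most general boundary

    def covers(h, x):
        return all(hv in ('?', xv) for hv, xv in zip(h, x))

    for x, label in examples:
        if label == 'Yes':
            # minimally generalize S to cover the positive example
            S = [xv if sv in ('φ', xv) else '?' for sv, xv in zip(S, x)]
            # drop members of G that do not cover the positive example
            G = [g for g in G if covers(g, x)]
        else:
            # specialize every member of G that wrongly covers the negative example
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                for i in range(n):
                    if g[i] == '?' and S[i] not in ('?', 'φ') and S[i] != x[i]:
                        spec = list(g)
                        spec[i] = S[i]
                        new_G.append(spec)
            G = new_G
    return S, G

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
S, G = candidate_elimination(data)
print('S =', S)   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
print('G =', G)   # [['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]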
Inductive Bias in Machine Learning
The phrase “inductive bias” refers to a collection of (explicit or implicit) assumptions
made by a learning algorithm in order to conduct induction or generalize a limited set of
observations (training data) into a general model of the domain.
Why Inductive Bias?
As we know that in Candidate-Elimination Algorithm, we get two hypotheses, one
specific and one general at the end as a final solution.
Now, we also need to check if the hypothesis we got from the algorithm is actually
correct or not, also make decisions like what training examples should the machine learn
next.
Induction would be impossible without such a bias, because observations may generally
be extended in a variety of ways. Predictions for new scenarios could not be formed if all
of these options were treated equally, that is, without any bias in the sense of a preference
for certain forms of generalization (representing previous information about the target
function to be learned).
The idea of inductive bias is to let the learner generalize beyond the observed training
examples to deduce new examples.
' > ' means "inductively inferred from".
For example,
x > y means that y is inductively inferred from x.
Nearest neighbors: Assume that most of the cases in a small neighborhood in feature space are from the same class. If the class of a case is unknown, assume that it belongs to the same class as the majority of the cases in its immediate neighborhood. The k-nearest neighbors algorithm employs this bias: cases that are close to each other are assumed to belong to the same class.
Generalisation
How well a model trained on the training set predicts the right output for new instances is
called generalization.
Generalization refers to how well the concepts learned by a machine learning model
apply to specific examples not seen by the model when it was learning. The goal of a
good machine learning model is to generalize well from the training data to any data from
the problem domain.
This allows us to make predictions in the future on data the model has never seen.
Overfitting and underfitting are the two biggest causes of poor performance of machine learning algorithms. The model with the best generalisation should be selected; this is the case when both of these problems are avoided.
• Underfitting
Underfitting is the production of a machine learning model that is not complex enough to accurately capture relationships between a dataset's features and a target variable.
• Overfitting
o Overfitting is the production of an analysis which corresponds too closely or
exactly to a particular set of data, and may therefore fail to fit additional data or
predict future observations reliably.
Consider a dataset shown in Figure (a). Let it be required to fit a regression model to the data. The graph of a model which looks "just right" is shown in Figure (b). In Figure (c) we have a linear regression model for the same dataset, and this model does not seem to capture the essential features of the dataset, so it suffers from underfitting. In Figure (d) we have a regression model which corresponds too closely to the given dataset; it fits even the small random noise in the dataset, and hence it suffers from overfitting.
We can measure the generalization ability of a hypothesis, namely, the quality of its inductive bias, if
we have access to data outside the training set. We simulate this by dividing the training set we
have into two parts. We use one part for training (that is, to find a hypothesis), and the remaining
part is called the validation set and is used to test the generalization ability. Assuming large
enough training and validation sets, the hypothesis that is the most accurate on the validation set
is the best one (the one that has the best inductive bias). This process is called cross-validation.
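A small sketch of this idea (synthetic 1-D data; NumPy's polynomial fitting stands in for "hypotheses" of different complexity): fit several candidate models on the training part and keep the one with the lowest error on the validation part:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)

idx = rng.permutation(len(x))
train, val = idx[:20], idx[20:]        # hold out one third as a validation set

for degree in (1, 3, 7):
    coeffs = np.polyfit(x[train], y[train], degree)                 # fit on the training part
    val_error = np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2) # test on the validation part
    print(degree, round(val_error, 3))
# The degree with the lowest validation error is selected; very low degrees
# tend to underfit and very high degrees tend to overfit.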
Bias-Variance Tradeoff
Whenever we discuss model prediction, it’s important to understand prediction errors (bias and
variance). There is a tradeoff between a model’s ability to minimize bias and variance. Gaining a
proper understanding of these errors would help us not only to build accurate models but also to
avoid the mistake of overfitting and underfitting.
Bias
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.
Variance
Variance is the variability of model prediction for a given data point, or a value which tells us the spread of our data. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.
Mathematically
Let the variable we are trying to predict be Y and the other covariates be X. We assume there is a relationship between the two such that: Y = f(X) + e
where e is the error term, normally distributed with a mean of 0.
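With these definitions, the expected squared prediction error at a point x decomposes in the standard way (this decomposition is implied by, but not written out in, the notes above):

Err(x) = E[(Y − f̂(x))^2]
       = (E[f̂(x)] − f(x))^2 + E[(f̂(x) − E[f̂(x)])^2] + σe^2
       = Bias^2 + Variance + Irreducible error

where f̂ is the fitted model and σe^2 is the variance of the noise term e; the irreducible error cannot be removed no matter how good the model is.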
First, let's take a simple definition. The Bias-Variance Trade-off refers to the property of a machine learning model whereby, as the bias of the model increases, the variance reduces, and as the bias reduces, the variance increases. The problem is therefore to determine the amount of bias and variance that makes the model optimal.
Sources of Error
We recall the problem of underfitting and overfitting when trying to fit a regression line through a set of data points. In the case of underfitting, the bias is an error arising from a faulty assumption in the learning algorithm: when the bias is too large, the algorithm is not able to correctly model the relationship between the features and the target outputs.
In the case of overfitting, variance is an error resulting from fluctuations in the training dataset. A high variance causes the algorithm to capture most of the training data points but not generalize well enough to capture new data points. This is overfitting.
The trade-off means that a model must be chosen carefully so that it both correctly captures the regularities in the training data and, at the same time, is general enough to correctly classify new observations.
Assuming you have several training data sets for the same population:
Training Data 1
Training Data 2
Training Data 3
Let's also assume that you pass different values of x (x1, x2 and x3) into the same model. Instead of getting different outputs, you get the same output y. In this case, the algorithm is said to have a high bias error, which results in the problem of underfitting. This is illustrated in Figure 3 below:
High variance means that the algorithm has become too specific. High bias means that the algorithm has failed to understand the pattern in the input data. It is generally not possible to minimize both errors simultaneously, since high-bias models tend to have low variance, whereas low-bias models tend to have high variance.
The graph in Figure 3 is a typical plot of the bias/variance trade-off which we would briefly
examine.
The bias/variance graph shows a plot of Error against Model Complexity. It also shows:
Relationship of variance and Model Complexity: As the model complexity increases, the variance increases.
Relationship of bias and Model Complexity: As the model complexity increases, the bias decreases.
Relationship of variance and Error: As the variance increases, the error increases.
Relationship of bias and Error: As the bias increases, the error increases.