
Machine Learning
G. Hari Krishna, M.Tech

UNIT I

Introduction:

Definition of Machine Learning: Arthur Samuel, an early American leader in


the field of computer gaming and artificial intelligence, coined the term
“Machine Learning” in 1959 while at IBM. He defined machine learning as “the
field of study that gives computers the ability to learn without being explicitly
programmed”. However, there is no universally accepted definition for machine
learning. Different authors define the term differently. We give below two more
definitions.
 Machine learning is programming computers to optimize a performance
criterion using example data or past experience. We have a model defined
up to some parameters, and learning is the execution of a computer
program to optimize the parameters of the model using the training data or
past experience. The model may be predictive to make predictions in the
future, or descriptive to gain knowledge from data.
 The field of study known as machine learning is concerned with the question
of how to construct computer programs that automatically improve with
experience.
 Arthur Samuel, a pioneer in the field of artificial intelligence and computer gaming, coined the term “Machine Learning”. He defined machine learning as a “field of study that gives computers the capability to learn without being explicitly programmed”. In layman's terms, Machine Learning (ML) can be explained as automating and improving the learning process of computers based on their experiences, without their being explicitly programmed, i.e. without any human assistance. The process starts with feeding good-quality data and then training our machines (computers) by building machine learning models using the data and different algorithms. The choice of algorithms depends on what type of data we have and what kind of task we are trying to automate. Example: the training of students during exams. While preparing for exams, students don't simply cram the subject but try to learn it with complete understanding. Before the examination, they feed their machine (brain) with a good amount of high-quality data (questions and answers from different books, teachers' notes, or online video lectures). In effect, they are training their brain with both input and output, i.e. what kind of approach or logic to use to solve different kinds of questions. Each time they solve practice test papers, they measure their performance (accuracy/score) by comparing answers with the given answer key. Gradually, the performance keeps increasing and they gain more confidence in the adopted approach. That is how models are actually built: train the machine with data (both inputs and outputs are given to the model), and when the time comes, test it on new data (with input only) and obtain the model's score by comparing its answers with the actual outputs, which were not fed in during training. Researchers are working assiduously to improve algorithms and techniques so that these models perform even better.

Basic Difference between ML and Traditional Programming

 Traditional Programming: We feed in DATA (input) + PROGRAM (logic), run it on the machine, and get the output.
 Machine Learning: We feed in DATA (input) + OUTPUT, run it on the machine during training, and the machine creates its own program (logic), which can then be evaluated during testing. (A code sketch of this contrast follows below.)
 What exactly does learning mean for a computer? A computer is said to be learning from experience with respect to some class of tasks if its performance at those tasks improves with experience.
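To make the contrast between traditional programming and ML concrete, here is a minimal sketch. The spam example, the hand-coded keyword rule, and the use of scikit-learn's CountVectorizer and MultinomialNB are illustrative assumptions, not part of the original notes:

    # Traditional programming: DATA + PROGRAM (hand-written logic) -> output
    def is_spam_rule_based(email: str) -> bool:
        # the "program" is a rule we coded ourselves (hypothetical rule)
        return "free money" in email.lower()

    # Machine learning: DATA + OUTPUT -> the machine induces its own "program"
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    emails = ["win free money now", "meeting at 10 am",
              "free money offer", "project status update"]
    labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam (the OUTPUT)

    vec = CountVectorizer()
    X = vec.fit_transform(emails)              # the DATA (input)
    model = MultinomialNB().fit(X, labels)     # training: the learned "program"

    print(model.predict(vec.transform(["free offer today"])))  # evaluated while testing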

 A computer program is said to learn from experience E with respect to some


class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Example: playing checkers. E = the experience of playing many games of checkers; T = the task of playing checkers; P = the probability that the program will win the next game. In general, any machine learning problem can be assigned to one of two broad classifications: supervised learning and unsupervised learning. How things work in reality:
 Talking about online shopping: there are millions of users with an unlimited range of interests with respect to brands, colors, price ranges, and more. While shopping online, buyers tend to search for a number of products. Searching for a product frequently will make the buyer's Facebook feed, web pages, search engines, and online stores start recommending or showing offers on that particular product. No one is sitting there coding such a task for each and every user; all of this is completely automatic. Here, ML plays its role. Researchers, data scientists, and machine learning engineers build models using a huge amount of good-quality data, and now their machines perform automatically and even improve with more and more experience and time. Traditionally, advertisement was done only using newspapers, magazines, and radio, but now technology has made us smart enough to do targeted advertisement (online ad systems), which is a far more efficient method of targeting the most receptive audience.
 In health care too, ML is doing a fabulous job. Researchers and scientists have prepared models to train machines to detect cancer just by looking at slide-cell images. For humans, performing this task would take a lot of time. But now, with no more delay, machines predict the chances of having or not having cancer with some accuracy, and doctors just have to give an assurance call. The answer to how this is possible is very simple: all that is required is a high-computation machine, a large amount of good-quality image data, and an ML model with good algorithms to achieve state-of-the-art results. Doctors are even using ML to diagnose patients based on different parameters under consideration.
 You might have used IMDb (Internet Movie Database) ratings, Google Photos, where it recognizes faces, Google Lens, where the ML image-text recognition model can extract text from the images you feed in, and Gmail, which categorizes e-mail as social, promotion, updates, or forums using text classification, which is a part of ML.
How does ML work?
 Gathering past data in any form suitable for processing. The better the quality of the data, the more suitable it will be for modeling.
 Data processing: sometimes the data collected is in raw form and needs to be pre-processed. Example: some tuples may have missing values for certain attributes, and in this case the missing values have to be filled with suitable values in order to perform machine learning or any form of data mining. Missing values for numerical attributes, such as the price of a house, may be replaced with the mean value of the attribute, whereas missing values for categorical attributes may be replaced with the most frequent value (the mode). This invariably depends on the types of filters we use. If the data is in the form of text or images, then converting it to numerical form, be it a list, array, or matrix, will be required. Simply put, the data has to be made relevant and consistent, and converted into a format understandable by the machine.
 Divide the input data into training, cross-validation, and test sets. A common ratio between the respective sets is 6:2:2.
 Build models with suitable algorithms and techniques on the training set.
 Test the conceptualized model with data that was not fed to the model at the time of training, and evaluate its performance using metrics such as F1 score, precision, and recall. (A sketch of these steps follows below.)
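A minimal sketch of the steps above, assuming scikit-learn and a small hypothetical table with a numerical 'price' column, a categorical 'colour' column, and a binary 'label' column (all invented for illustration):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import precision_score, recall_score, f1_score

    df = pd.DataFrame({
        "price":  [100.0, None, 250.0, 300.0, None, 150.0, 220.0, 180.0, 90.0, 310.0],
        "colour": ["red", "blue", None, "red", "blue", "red", "blue", "red", "red", None],
        "label":  [0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    })

    # Data processing: fill numerical gaps with the mean, categorical with the mode
    df["price"] = df["price"].fillna(df["price"].mean())
    df["colour"] = df["colour"].fillna(df["colour"].mode()[0])
    X = pd.get_dummies(df[["price", "colour"]])      # convert to numerical form
    y = df["label"]

    # 6:2:2 split: 60% training, 20% cross-validation, 20% test
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)   # build on the training set
    pred = model.predict(X_test)                             # test on unseen data
    print(precision_score(y_test, pred, zero_division=0),
          recall_score(y_test, pred, zero_division=0),
          f1_score(y_test, pred, zero_division=0))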
Pre-requisites for learning ML:
 Linear Algebra
 Statistics and Probability
 Calculus
 Graph theory
 Programming Skills – Languages such as Python, R, MATLAB, C++, or Octave

Well Posed Learning Problem – A computer program is said to learn from experience E in the context of some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
A problem can be classified as a well-posed learning problem if it has three features:

 Task
 Performance Measure
 Experience
Some examples that illustrate well-posed learning problems are:
1. To better filter emails as spam or not
 Task – Classifying emails as spam or not
 Performance Measure – The fraction of emails accurately classified as spam
or not spam
 Experience – Observing you label emails as spam or not spam
2. A checkers learning problem
 Task – Playing checkers game
 Performance Measure – percentage of games won against an opponent
 Experience – playing practice games against itself

3. A Robot Driving Problem


 Task – driving on public four-lane highways using vision sensors
 Performance Measure – average distance traveled before an error
 Experience – a sequence of images and steering commands recorded while observing a human driver
4. Fruit Prediction Problem
 Task – recognizing and forecasting different fruits
 Performance Measure – the ability to predict the maximum variety of fruits
 Experience – training the machine with the largest datasets of fruit images
5. Face Recognition Problem
 Task – predicting different types of faces
 Performance Measure – the ability to predict the maximum number of types of faces
 Experience – training the machine with the maximum amount of datasets of different face images
6. Automatic Translation of Documents
 Task – translating text in a document from one language to another
 Performance Measure – the ability to convert one language to another efficiently
 Experience – training the machine with a large dataset of different types of languages

Designing a Learning System

 According to Arthur Samuel “Machine Learning enables a Machine to


Automatically learn from Data, Improve performance from an Experience
and predict things without explicitly programmed.”
 In simple words, when we feed the training data to a machine learning algorithm, the algorithm produces a mathematical model, and with the help of that mathematical model the machine makes predictions and takes decisions without being explicitly programmed. Also, the more the machine works with the training data, the more experience it gains and the more efficient the results it produces.
 Example: in a driverless car, the training data (how to drive the car on highways and on busy and narrow streets, with factors like speed limits, parking, stopping at signals, etc.) is fed to the algorithm. A logical and mathematical model is then created on that basis, and after that the car works according to the logical model. Also, the more data is fed, the more efficient the output produced.
Steps for Designing Learning System are:

 Step 1) Choosing the Training Experience: The first and very important task is to choose the training data or training experience that will be fed to the machine learning algorithm. It is important to note that the data or experience we feed to the algorithm has a significant impact on the success or failure of the model, so the training data or experience should be chosen wisely.
 Below are the attributes which impact the success or failure of the model:
 The training experience should be able to provide direct or indirect feedback regarding choices. For example: while playing chess, the training data will provide feedback to itself, such as: instead of this move, if that one is chosen, the chances of success increase.
 The second important attribute is the degree to which the learner controls the sequence of training examples. For example: when training data is first fed to the machine, its accuracy is very low, but as it gains experience by playing again and again against itself or an opponent, the algorithm gets feedback and controls the chess game accordingly.
 The third important attribute is how well the training experience represents the distribution of examples over which performance will be measured. For example, a machine learning algorithm gains experience by going through a number of different cases and examples. Thus, the algorithm gains more and more experience by passing through more and more examples, and hence its performance increases.
 Step 2 - Choosing the target function: The next important step is choosing the target function. It means that, according to the knowledge fed to the algorithm, the machine learning algorithm will choose a NextMove function which describes what type of legal moves should be taken. For example: while playing chess with an opponent, when the opponent makes a move, the machine learning algorithm decides which of the possible legal moves to take in order to succeed.
 Step 3 - Choosing a representation for the target function: Once the machine algorithm knows all the possible legal moves, the next step is to choose an optimized move using some representation, i.e. linear equations, hierarchical graph representation, tabular form, etc. The NextMove function then selects the target move, i.e. out of these moves, the one that provides the highest success rate. For example: if the chess machine has 4 possible moves, it will choose the optimized move that provides success.
 Step 4 - Choosing a function approximation algorithm: An optimized move cannot be chosen from the training data alone. The training data has to go through a set of examples, and through these examples the learner approximates which steps should be chosen, after which the machine provides feedback on them. For example: when training data of playing chess is fed to the algorithm, the machine algorithm will at first fail or succeed, and from that failure or success it will estimate, for the next move, which step should be chosen and what its success rate is.
 Step 5 - Final Design: The final design is created at last, when the system has gone through a number of examples, failures and successes, correct and incorrect decisions, and what the next step should be. Example: Deep Blue, an intelligent ML-based computer, won a chess game against the chess expert Garry Kasparov, becoming the first computer to beat a human chess expert.
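The classic illustration of Steps 3 and 4 represents the target function for checkers as a linear combination of board features, V(b) = w0 + w1*x1 + ... + wn*xn, and tunes the weights with the LMS (least mean squares) update rule. Below is a minimal sketch; the particular board features and the learning rate are assumptions for illustration, not values from these notes:

    def v_hat(weights, features):
        # linear approximation of the target function V(board)
        return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

    def lms_update(weights, features, v_train, lr=0.01):
        # Step 4: nudge every weight toward the training value V_train(b)
        error = v_train - v_hat(weights, features)
        weights[0] += lr * error
        for i, x in enumerate(features):
            weights[i + 1] += lr * error * x
        return weights

    weights = [0.0] * 4                 # w0 plus three feature weights
    board_features = [12, 12, 0]        # e.g. black pieces, red pieces, kings (assumed)
    weights = lms_update(weights, board_features, v_train=100.0)
    print(weights)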

Perspectives and issues in machine learning


In machine learning, there is a process of analyzing data to build or train models. It is just everywhere; from Amazon product recommendations to self-driving cars, it holds great value throughout. As per recent research, the global machine learning market is expected to grow by 43% by 2024. This revolution has greatly enhanced the demand for machine learning professionals. AI and machine learning jobs have observed a significant growth rate of 75% over the past four years, and the industry is growing continuously. A career in the machine learning domain offers job satisfaction, excellent growth, and a very high salary, but it is a complex and challenging process.
There are a lot of challenges that machine learning professionals face in building ML skills and creating an application from scratch. What are these challenges? Below, we discuss seven major challenges faced by machine learning professionals.

1. Poor Quality of Data

Data plays a significant role in the machine learning process. One of the
significant issues that machine learning professionals face is the absence of
good quality data. Unclean and noisy data can make the whole process
extremely exhausting. We don’t want our algorithm to make inaccurate or faulty
predictions. Hence the quality of data is essential to enhance the output.
Therefore, we need to ensure that the process of data preprocessing which
includes removing outliers, filtering missing values, and removing unwanted
features, is done with the utmost level of perfection.

2. Underfitting of Training Data

Underfitting occurs when a model is unable to establish an accurate relationship between input and output variables. It is like trying to fit into undersized jeans. It signifies that the model is too simple to establish a precise relationship. To overcome this issue:

 Increase the training time of the model
 Enhance the complexity of the model
 Add more features to the data
 Reduce regularization parameters

3. Overfitting of Training Data

Overfitting refers to a machine learning model trained with a massive amount of data that negatively affects its performance. It is like trying to fit into oversized jeans. Unfortunately, this is one of the significant issues faced by machine learning professionals. It means that the algorithm is trained with noisy and biased data, which will affect its overall performance. Let's understand this with the help of an example: consider a model trained to differentiate between a cat, a rabbit, a dog, and a tiger. The training data contains 1000 cats, 1000 dogs, 1000 tigers, and 4000 rabbits. Then there is a considerable probability that it will identify a cat as a rabbit. In this example, we had a vast amount of data, but it was biased, hence the prediction was negatively affected.

We can tackle this issue by:

 Analyzing the data with the utmost level of perfection
 Using data augmentation techniques
 Removing outliers from the training set
 Selecting a model with fewer features

4. Machine Learning is a Complex Process

The machine learning industry is young and continuously changing. Rapid hit-and-trial experiments are being carried out. The process is transforming, and hence there are high chances of error, which makes learning complex. It includes analyzing the data, removing data bias, training the data, applying complex mathematical calculations, and a lot more. Hence it is a really complicated process, which is another big challenge for machine learning professionals.

5. Lack of Training Data

The most important task in the machine learning process is to train the data to achieve an accurate output. Too little training data will produce inaccurate or overly biased predictions. Let us understand this with the help of an example: consider training a machine learning algorithm the way you would teach a child. One day you decide to explain to a child how to distinguish between an apple and a watermelon. You take an apple and a watermelon and show him the difference between both based on their color, shape, and taste. In this way, he soon attains perfection in differentiating between the two. A machine learning algorithm, on the other hand, needs a lot of data to make that distinction; for complex problems, it may even require millions of examples to be trained on. Therefore, we need to ensure that machine learning algorithms are trained with sufficient amounts of data.

6. Slow Implementation
This is one of the common issues faced by machine learning professionals. Machine learning models are highly efficient in providing accurate results, but doing so takes a tremendous amount of time. Slow programs, data overload, and excessive requirements usually mean it takes a lot of time to produce accurate results. Further, it requires constant monitoring and maintenance to deliver the best output.

7. Imperfections in the Algorithm When Data Grows

So you have found quality data, trained your model amazingly, and the predictions are really precise and accurate. Yay, you have learned how to create a machine learning algorithm!! But wait, there is a twist: the model may become useless in the future as the data grows. The best model of the present may become inaccurate in the coming future and require further adjustment. So you need regular monitoring and maintenance to keep the algorithm working. This is one of the most exhausting issues faced by machine learning professionals.

Conclusion: Machine learning is all set to bring a big bang transformation in


technology. It is one of the most rapidly growing technologies used in medical
diagnosis, speech recognition, robotic training, product recommendations,
video surveillance, and this list goes on. This continuously evolving domain
offers immense job satisfaction, excellent opportunities, global exposure, and
exorbitant salary. It is a high-risk and high-return technology. Before starting your machine learning journey, ensure that you carefully examine the challenges mentioned above. To learn this fantastic technology, you need to plan carefully, stay patient, and maximize your efforts. Once you win this battle, you can conquer the future of work and land your dream job!

General-To-Specific Ordering of Hypothesis

Concept Learning Task:

Hypotheses can be sorted from the most specific to the most general. This allows the machine learning algorithm to thoroughly investigate the hypothesis space without having to enumerate each and every hypothesis in it, which would be impossible if the hypothesis space were infinitely vast.

Now we'll discuss general-to-specific ordering and how to use it to impose a sense of order on the hypothesis space in any concept learning problem.

Let us have a look at our EnjoySport example:

Task T: determine the value of EnjoySport for any given day, based on the values of the day's attributes.

Performance measure P: the total proportion of days for which EnjoySport is accurately predicted.

Experience E: a collection of days with pre-determined labels (EnjoySport: Yes/No).

Each hypothesis can be considered as a set of six constraints, with the values
of the six attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast
specified.

Sky     AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny   Warm     Normal    Strong  Warm   Same      Yes
Sunny   Warm     High      Strong  Warm   Same      Yes
Rainy   Cold     High      Strong  Warm   Change    No
Sunny   Warm     High      Strong  Cool   Change    Yes

Take a look at the following two hypotheses:

h1 = <Rainy, Warm, Strong>

h2 = <Rainy, ?, Strong>
The question is how many, and which, examples are classified as positive by each of these hypotheses (i.e., satisfy them). Only example 4 satisfies h1; however, both examples 3 and 4 satisfy, and are classified as positive by, h2.

What is the reason behind this? What makes these two hypotheses so different? The answer lies in the strictness with which each of these hypotheses imposes constraints. As you can see, h1 places more restrictions than h2! Naturally, h2 can classify more positive cases than h1. In this case, we may assert the following:

“If an example meets h1, it will almost certainly meet h2, but not the other way
around.”

This is due to the fact that h2 is more general than h1, as the following example shows: h2 allows a wider range of choices than h1. If an instance has the values <Rainy, Freezing, Strong>, h2 will classify it as positive, but h1 will not be satisfied.

However, if h1 identifies an occurrence as positive, such as <Rainy, Warm,


Strong>, h2 will almost certainly categorise it as positive as well.

In fact, each case that is categorised as positive by h1 is likewise classed as


positive by h2. As a result, we might conclude that h2 is more generic than h1.

For each instance x in X and hypothesis h in H, we say that x satisfies h if and only if h(x) = 1.

Definition:

Let hj and hk be boolean-valued functions defined over X. Then hj is more general than or equal to hk (written hj ≥g hk) if and only if

(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]

The letter g stands for “general”. One hypothesis can also be strictly more general than another (hj >g hk): this holds when hj ≥g hk and hk is not ≥g hj.

Because every instance that satisfies h1 also satisfies h2, hypothesis h2 is more general than h1. In the same way, h2 is more general than h3.

Note that neither h1 nor h3 is more general than the other; although the instances satisfied by the two hypotheses overlap, neither set subsumes the other.

A handful of key algorithms explore the hypothesis space H by making use of the ≥g relation. Find-S is one such method, with S standing for “specific”, implying that the purpose is to identify the most specific hypothesis.

We can observe that all the instances that satisfy h1, and all that satisfy h3, also satisfy h2; thus we can conclude that h2 ≥g h1 and h2 ≥g h3.
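A minimal sketch of the ≥g relation in code, assuming hypotheses are tuples in which each slot is either a required value or '?' (the Φ constraint is omitted for simplicity):

    def satisfies(h, x):
        # h(x) = 1 iff every constraint in h is '?' or equals the attribute value
        return all(hc == "?" or hc == xc for hc, xc in zip(h, x))

    def more_general_or_equal(hj, hk):
        # hj >=g hk iff every constraint of hj is '?' or matches hk's constraint,
        # so any instance satisfying hk must also satisfy hj
        return all(a == "?" or a == b for a, b in zip(hj, hk))

    h1 = ("Rainy", "Warm", "Strong")
    h2 = ("Rainy", "?", "Strong")
    print(more_general_or_equal(h2, h1))   # True: h2 is more general than h1
    print(more_general_or_equal(h1, h2))   # False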

FINDING S Algorithm

introduction :

The Find-S algorithm is a basic concept learning algorithm in machine learning. It finds the most specific hypothesis that fits all the positive examples. Note that the algorithm considers only the positive training examples. Find-S starts with the most specific hypothesis and generalizes it each time it fails to classify an observed positive training example. Hence, the Find-S algorithm moves from the most specific hypothesis toward the most general hypothesis.
Important Representation :

1. ? indicates that any value is acceptable for the attribute.
2. A single specified value (e.g., Cold) indicates that exactly that value is required for the attribute.
3. Φ indicates that no value is acceptable.
4. The most general hypothesis is represented by: {?, ?, ?, ?, ?, ?}
5. The most specific hypothesis is represented by: {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}
Steps Involved In Find-S :

1. Start with the most specific hypothesis.


h = {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}
2. Take the next example; if it is negative, no changes are made to the hypothesis.
3. If the example is positive and we find that our current hypothesis is too specific, we update the hypothesis to a more general condition.
4. Keep repeating the above steps until all the training examples have been processed.
5. After we have processed all the training examples, we have the final hypothesis, which we can use to classify the new examples.
Example:
Consider the following data set (given as a figure in the original; its rows appear in the walkthrough below) describing which particular seeds are poisonous.

First, we take the hypothesis to be the most specific hypothesis. Hence, for this four-attribute data set, our hypothesis would be:
h = {ϕ, ϕ, ϕ, ϕ}

Consider example 1 :
The data in example 1 is { GREEN, HARD, NO, WRINKLED }. We see that our
initial hypothesis is more specific and we have to generalize it for this example.
Hence, the hypothesis becomes :
h = { GREEN, HARD, NO, WRINKLED }
Consider example 2 :
Here we see that this example has a negative outcome. Hence we neglect this
example and our hypothesis remains the same.
h = { GREEN, HARD, NO, WRINKLED }

Consider example 3 :
Here we see that this example has a negative outcome. Hence we neglect this
example and our hypothesis remains the same.
h = { GREEN, HARD, NO, WRINKLED }
Consider example 4 :
The data present in example 4 is { ORANGE, HARD, NO, WRINKLED }. We
compare every single attribute with the initial data and if any mismatch is found
we replace that particular attribute with a general case ( ” ? ” ). After doing the
process the hypothesis becomes :
h = { ?, HARD, NO, WRINKLED }
Consider example 5 :
The data present in example 5 is { GREEN, SOFT, YES, SMOOTH }. We
compare every single attribute with the initial data and if any mismatch is found
we replace that particular attribute with a general case ( ” ? ” ). After doing the
process the hypothesis becomes:
h = { ?, ?, ?, ? }
Since we have reached a point where all the attributes in our hypothesis have the general condition, examples 6 and 7 would result in the same hypothesis with all general attributes.
h = { ?, ?, ?, ? }
Hence, for the given data, the final hypothesis would be:
Final Hypothesis: h = { ?, ?, ?, ? }

Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
       For each attribute constraint ai in h
           If the constraint ai is satisfied by x
               Then do nothing
           Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
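A runnable sketch of Find-S, applied here to the EnjoySport table given earlier (the seed data set above was only partially recoverable, so the fully listed EnjoySport rows are used instead):

    examples = [
        (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "Yes"),
        (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   "Yes"),
        (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
        (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
    ]

    def find_s(examples):
        h = None                    # the most specific hypothesis {ϕ, ϕ, ..., ϕ}
        for x, label in examples:
            if label != "Yes":
                continue            # negative examples are ignored
            if h is None:
                h = list(x)         # first positive example: copy it verbatim
            else:
                # generalize every constraint that the example violates
                h = [hc if hc == xc else "?" for hc, xc in zip(h, x)]
        return h

    print(find_s(examples))         # ['Sunny', 'Warm', '?', 'Strong', '?', '?']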

Version Spaces
A version space is a hierarchical representation of knowledge that enables you
to keep track of all the useful information supplied by a sequence of learning
examples without remembering any of the examples.

The version space method is a concept learning process accomplished by


managing multiple models within a version space.

Version Space Characteristics

Tentative (not certain) heuristics (mental shortcuts) are represented using version spaces.

A version space represents all the alternative plausible descriptions of a heuristic. A plausible description is one that is applicable to all known positive examples and to no known negative example.

A version space description consists of two complementary trees:

1. One that contains nodes connected to overly general models, and


2. One that contains nodes connected to overly specific models.

Node values/attributes are discrete.

Fundamental Assumptions

1. The data is correct; there are no erroneous instances.


2. A correct description is a conjunction of some of the attributes with
values.
Diagrammatical Guidelines

There is a generalization tree and a specialization tree. Each node is connected to a model. Nodes in the generalization tree are connected to a model that matches everything in its subtree. Nodes in the specialization tree are connected to a model that matches only one thing in its subtree.

Links between nodes and their models denote

 generalization relations in a generalization tree, and


 specialization relations in a specialization tree.

Diagram of a Version Space

In the original diagram (omitted here), the specialization tree is colored red and the generalization tree is colored green.

Generalization and Specialization Lead to Version Space Convergence


The key idea in version space learning is that specialization of the general
models and generalization of the specific models may ultimately lead to just
one correct model that matches all observed positive examples and does not
match any negative examples.

That is, each time a negative example is used to specialize the general
models, those specific models that match the negative example are eliminated
and each time a positive example is used to generalize the specific models,
those general models that fail to match the positive example are eliminated.
Eventually, the positive and negative examples may be such that only one
general model and one identical specific model survive.

Version Space Method Learning Algorithm: Candidate-Elimination

The version space method handles positive and negative examples


symmetrically.

Given:

 A representation language.
 A set of positive and negative examples expressed in that language.

Compute: a concept description that is consistent with all the positive examples
and none of the negative examples.

Method:

 Initialize G, the set of maximally general hypotheses, to contain one


element: the null description (all features are variables).
 Initialize S, the set of maximally specific hypotheses, to contain one
element: the first positive example.
 Accept a new training example.
1. If the example is positive:

a. Generalize all the specific models to match the positive example, but ensure the following:

 The new specific models involve minimal changes.
 Each new specific model is a specialization of some general model.
 No new specific model is a generalization of some other specific model.

b. Cut away all the general models that fail to match the positive example.

2. If the example is negative:

a. Specialize all general models to prevent a match with the negative example, but ensure the following:

 The new general models involve minimal changes.
 Each new general model is a generalization of some specific model.
 No new general model is a specialization of some other general model.

b. Cut away all the specific models that match the negative example.

3. If S and G are both singleton sets, then:

 If they are identical, output their value and halt.
 If they are different, the training cases were inconsistent. Output this result and halt.
 Else continue accepting new training examples.

The algorithm stops when:

1. It runs out of data.


2. The number of hypotheses remaining is:
o 0 - no consistent description for the data in the language.
o 1 - answer (version space converges).
o 2+ - all descriptions in the language are implicitly included.

Comments on the Version Space Method

The version space method is still a trial and error method.


The program does not base its choice of examples, or its learned heuristics, on
an analysis of what works or why it works, but rather on the simple assumption
that what works will probably work again.

Unlike the decision tree ID3 algorithm,

 Candidate-elimination searches an incomplete set of hypotheses (i.e. only a subset of the potentially teachable concepts is included in the hypothesis space).
 Candidate-elimination finds every hypothesis that is consistent with the training data, meaning it searches the hypothesis space completely.
 Candidate-elimination's inductive bias is a consequence of how well it can represent the subset of possible hypotheses it will search. In other words, the bias is a product of its search space.
 No additional bias is introduced through candidate-elimination's search strategy.

Advantages of the version space method:

 Can describe all the possible hypotheses in the language consistent with
the data.
 Fast (close to linear).

Disadvantages of the version space method:

 Inconsistent data (noise) may cause the target concept to be pruned.


 Learning disjunctive concepts is challenging.

The candidate elimination algorithm:


The candidate elimination algorithm incrementally builds the version space
given a hypothesis space H and a set E of examples. The examples are added
one by one; each example possibly shrinks the version space by removing the
hypotheses that are inconsistent with the example. The candidate elimination
algorithm does this by updating the general and specific boundary for each new
example.

 You can consider this as an extended form of the Find-S algorithm.
 It considers both positive and negative examples.
 Positive examples are used as in the Find-S algorithm (generalizing the specific boundary).
 Negative examples are used to specialize the general boundary.
Terms Used:

 Concept learning: basically, the learning task of the machine (learning from training data).

 General Hypothesis: does not specify features; G = {'?', '?', '?', ...}, with one '?' per attribute.

 Specific Hypothesis: specifies features; S = {'ϕ', 'ϕ', 'ϕ', ...}, with one 'ϕ' per attribute.

 Version Space: intermediate between the general hypothesis and the specific hypothesis. It is not just one hypothesis but the set of all possible hypotheses consistent with the training data set.

Algorithm:

Step 1: Load the data set.
Step 2: Initialize the general hypothesis G and the specific hypothesis S.
Step 3: For each training example:
Step 4: If the example is positive:
            if attribute_value == hypothesis_value:
                do nothing
            else:
                replace the attribute value in S with '?' (generalizing it)
Step 5: If the example is negative:
            make the general hypotheses more specific.
Example:
Consider the data set given below (shown as a figure in the original; it is the EnjoySport data used earlier):

Algorithmic steps:

Initially: G = [[?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?],
                [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?]]
           S = [Null, Null, Null, Null, Null, Null]

For instance 1: <'sunny', 'warm', 'normal', 'strong', 'warm', 'same'> and positive output.
           G1 = G
           S1 = ['sunny', 'warm', 'normal', 'strong', 'warm', 'same']

For instance 2: <'sunny', 'warm', 'high', 'strong', 'warm', 'same'> and positive output.
           G2 = G
           S2 = ['sunny', 'warm', ?, 'strong', 'warm', 'same']

For instance 3: <'rainy', 'cold', 'high', 'strong', 'warm', 'change'> and negative output.
           G3 = [['sunny', ?, ?, ?, ?, ?], [?, 'warm', ?, ?, ?, ?], [?, ?, ?, ?, ?, ?],
                 [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, 'same']]
           S3 = S2

For instance 4: <'sunny', 'warm', 'high', 'strong', 'cool', 'change'> and positive output.
           G4 = G3
           S4 = ['sunny', 'warm', ?, 'strong', ?, ?]

At last, by synchronizing G4 and S4, the algorithm produces the output.

Output:
G = [['sunny', ?, ?, ?, ?, ?], [?, 'warm', ?, ?, ?, ?]]
S = ['sunny', 'warm', ?, 'strong', ?, ?]
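A minimal runnable sketch of candidate elimination on the same EnjoySport data, restricted (as in the trace above) to conjunctive hypotheses whose slots are either a required value or '?'; the all-'?' placeholder rows of the trace and duplicate handling are omitted for brevity:

    examples = [
        (("sunny", "warm", "normal", "strong", "warm", "same"),   "yes"),
        (("sunny", "warm", "high",   "strong", "warm", "same"),   "yes"),
        (("rainy", "cold", "high",   "strong", "warm", "change"), "no"),
        (("sunny", "warm", "high",   "strong", "cool", "change"), "yes"),
    ]
    n = 6

    def consistent(h, x):
        # h matches x iff every constraint is '?' or equal to the attribute value
        return all(hc == "?" or hc == xc for hc, xc in zip(h, x))

    S = list(examples[0][0])   # most specific boundary: the first positive example
    G = [["?"] * n]            # most general boundary: the null description

    for x, label in examples:
        if label == "yes":
            # generalize S minimally; drop general models that miss the example
            S = [sc if sc == xc else "?" for sc, xc in zip(S, x)]
            G = [g for g in G if consistent(g, x)]
        else:
            # replace each matching g by its minimal specializations that still
            # exclude x and remain more general than S
            new_G = []
            for g in G:
                if not consistent(g, x):
                    new_G.append(g)
                    continue
                for i in range(n):
                    if g[i] == "?" and S[i] != "?" and S[i] != x[i]:
                        spec = list(g)
                        spec[i] = S[i]
                        new_G.append(spec)
            G = new_G

    print("S =", S)   # ['sunny', 'warm', '?', 'strong', '?', '?']
    print("G =", G)   # [['sunny', '?', ...], ['?', 'warm', '?', ...]]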

INDUCTIVE BIAS

 The inductive bias (also known as learning bias) of a learning algorithm is


the set of assumptions that the learner uses to predict outputs of given
inputs that it has not encountered.
 In machine learning, one aims to construct algorithms that are able
to learn to predict a certain target output. To achieve this, the learning
algorithm is presented some training examples that demonstrate the
intended relation of input and output values. Then the learner is supposed to
approximate the correct output, even for examples that have not been
shown during training. Without any additional assumptions, this problem
cannot be solved since unseen situations might have an arbitrary output
value. The kind of necessary assumptions about the nature of the target
function are subsumed in the phrase inductive bias.
 A classical example of an inductive bias is Occam's razor, assuming that the
simplest consistent hypothesis about the target function is actually the best.
Here consistent means that the hypothesis of the learner yields correct
outputs for all of the examples that have been given to the algorithm.
 Approaches to a more formal definition of inductive bias are based
on mathematical logic. Here, the inductive bias is a logical formula that,
together with the training data, logically entails the hypothesis generated by
the learner. However, this strict formalism fails in many practical cases,
where the inductive bias can only be given as a rough description (e.g. in the
case of artificial neural networks), or not at all.
Types of Biases

The following is a list of common inductive biases in machine learning


algorithms.

 Maximum conditional independence: if the hypothesis can be cast in


a Bayesian framework, try to maximize conditional independence. This is the
bias used in the Naive Bayes classifier.
 Minimum cross-validation error: when trying to choose among hypotheses,
select the hypothesis with the lowest cross-validation error. Although cross-
validation may seem to be free of bias, the "no free lunch" theorems show
that cross-validation must be biased.
 Maximum margin: when drawing a boundary between two classes, attempt
to maximize the width of the boundary. This is the bias used in support vector
machines. The assumption is that distinct classes tend to be separated by
wide boundaries.
 Minimum description length: when forming a hypothesis, attempt to minimize
the length of the description of the hypothesis.
 Minimum features: unless there is good evidence that a feature is useful, it
should be deleted. This is the assumption behind feature
selection algorithms.
 Nearest neighbors: assume that most of the cases in a small neighborhood
in feature space belong to the same class. Given a case for which the class
is unknown, guess that it belongs to the same class as the majority in its
immediate neighborhood. This is the bias used in the k-nearest neighbors
algorithm. The assumption is that cases that are near each other tend to
belong to the same class.

Decision Tree Introduction

 Decision tree algorithm falls under the category of supervised learning. They can be used to
solve both regression and classification problems.
 Decision tree uses the tree representation to solve the problem in which each leaf node
corresponds to a class label and attributes are represented on the internal node of the tree.
 We can represent any Boolean function on discrete attributes using the decision tree.

Below are some assumptions that we made while using decision tree:
 At the beginning, we consider the whole training set as the root.
 Feature values are preferred to be categorical. If the values are continuous then they are
discretized prior to building the model.
 On the basis of attribute values records are distributed recursively.
 We use statistical methods for ordering attributes as root or the internal node.

Decision trees work on the Sum of Products (SOP) form, also known as Disjunctive Normal Form (the illustrating image, omitted here, predicted the use of computers in people's daily lives). In a decision tree, the major challenge is the identification of the attribute for the root node at each level. This process is known as attribute selection. We have two popular attribute selection measures:

1. Information Gain
2. Gini Index
1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A. Then

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)

Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the higher the information content.
Definition: If the instances in S belong to classes occurring with proportions p1, ..., pn, then

Entropy(S) = − Σ_i pi · log2(pi)
Example:
For the set X = {a, a, a, b, b, b, b, b}:

Total instances: 8; instances of a: 3; instances of b: 5

Entropy(X) = −[(3/8) · log2(3/8) + (5/8) · log2(5/8)]
           = −[0.375 · (−1.415) + 0.625 · (−0.678)]
           = −(−0.531 − 0.424)
           = 0.954

Building Decision Tree using Information Gain


The essentials:
 Start with all training instances associated with the root node
 Use info gain to choose which attribute to label each node with
 Note: No root-to-leaf path should contain the same discrete attribute twice
 Recursively construct each subtree on the subset of training instances that would be
classified down that path in the tree.
The border cases:
 If all positive or all negative training instances remain, label that node “yes” or “no”
accordingly
 If no attributes remain, label with a majority vote of training instances left at that node
 If no instances remain, label with a majority vote of the parent’s training instances
Example:
Now, let's draw a decision tree for the following data using information gain.
Training set: 3 features and 2 classes

X  Y  Z  |  C
1  1  1  |  I
1  1  0  |  I
0  0  1  |  II
1  0  0  |  II

Here, we have 3 features and 2 output classes.


To build a decision tree using Information gain. We will take each of the feature and calculate
the information for each feature.

Split on feature X

Split on feature Y

Split on feature Z
From the above images we can see that the information gain is maximum when we make a split
on feature Y. So, for the root node best suited feature is feature Y. Now we can see that while
splitting the dataset by feature Y, the child contains pure subset of the target variable. So we
don’t need to further split the dataset.
The final tree for the above dataset would be look like this:
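A short sketch that recomputes these gains (log base 2 assumed, as is standard for entropy in bits):

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    rows = [((1, 1, 1), "I"), ((1, 1, 0), "I"), ((0, 0, 1), "II"), ((1, 0, 0), "II")]
    labels = [c for _, c in rows]

    def gain(feature_index):
        # Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv)
        total = entropy(labels)
        for v in {x[feature_index] for x, _ in rows}:
            subset = [c for x, c in rows if x[feature_index] == v]
            total -= (len(subset) / len(rows)) * entropy(subset)
        return total

    for i, name in enumerate("XYZ"):
        print(name, round(gain(i), 3))   # X 0.311, Y 1.0, Z 0.0 -> split on Y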

2. Gini Index
 Gini index is a metric that measures how often a randomly chosen element would be incorrectly identified.
 An attribute with a lower Gini index should be preferred.
 Scikit-learn supports the “gini” criterion for the Gini index, and it takes the “gini” value by default.
 The formula for the calculation of the Gini index is:

Gini = 1 − Σ_i (pi)²
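A one-function sketch of this formula:

    from collections import Counter

    def gini(labels):
        # Gini = 1 - sum of squared class proportions
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    print(gini(["I", "I", "II", "II"]))   # 0.5: maximally impure for two classes
    print(gini(["I", "I", "I", "I"]))     # 0.0: a pure node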

What are appropriate problems for Decision tree learning?

Although a variety of decision tree learning methods have been developed with
somewhat differing capabilities and requirements, decision tree learning is
generally best suited to problems with the following characteristics:

1. Instances are represented by attribute-value pairs.


“Instances are described by a fixed set of attributes (e.g., Temperature) and
their values (e.g., Hot). The easiest situation for decision tree learning is when
each attribute takes on a small number of disjoint possible values (e.g., Hot,
Mild, Cold). However, extensions to the basic algorithm allow handling real-
valued attributes as well (e.g., representing Temperature numerically).”
2. The target function has discrete output values.
“The decision tree is usually used for Boolean classification
(e.g., yes or no) kind of example. Decision tree methods easily extend to
learning functions with more than two possible output values. A more
substantial extension allows learning target functions with real-valued outputs,
though the application of decision trees in this setting is less common.”
3. Disjunctive descriptions may be required.
Decision trees naturally represent disjunctive (OR-connected) expressions.

4. The training data may contain errors.

“Decision tree learning methods are robust to errors, both errors in


classifications of the training examples and errors in the attribute values that
describe these examples.”

5. The training data may contain missing attribute values.


“Decision tree methods can be used even when some training examples have
unknown values (e.g., if the Humidity of the day is known for only some of the
training examples).”

Decision Tree Learning Algorithm

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Decision tree learning is one of the most widely used and practical methods for inductive inference.

Decision tree learning algorithms have been successfully used in expert systems for capturing knowledge. The main task performed in these systems is applying inductive methods to the given attribute values of an unknown object to determine an appropriate classification according to decision tree rules.

Decision trees classify instances by traversing them from the root node to a leaf node. We start from the root node of the decision tree, test the attribute specified by this node, then move down the tree branch according to the attribute value in the given set. This process is then repeated at the subtree level.

What is the decision tree learning algorithm suited for?

1. Instances are represented by attribute-value pairs. For example, the attribute 'Temperature' and its values 'hot', 'mild', 'cool'. Extensions of the basic algorithm also handle continuous-valued (numeric) attribute values.

2. The target function has discrete output values. It can easily deal with instances that are assigned a boolean decision, such as 'true' and 'false', or 'p (positive)' and 'n (negative)'. It is also possible to extend the target to real-valued outputs.

3. The training data may contain errors. This can be dealt with using pruning techniques, which we will not cover here.

The three widely used decision tree learning algorithms are: ID3, ASSISTANT and C4.5.

In most supervised machine learning algorithms, our main goal is to find, from the hypothesis space, a hypothesis that maps the inputs to the proper outputs. (The original included a figure of this common method of searching the hypothesis space; it is omitted here.)

Hypothesis Space (H):

The hypothesis space is the set of all possible legal hypotheses. This is the set from which the machine learning algorithm determines the single best hypothesis that describes the target function or the outputs.
Hypothesis (h):
A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm comes up with depends upon the data and also upon the restrictions and bias that we have imposed on the data. (The original illustrated this with a series of scatter-plot figures, omitted here: given a distribution of labeled points in a coordinate plane and some unlabeled test points, the plane can be divided by many different legal boundaries, each of which yields different predicted outcomes for the test data. How the coordinate plane is divided depends on the data, the algorithm, and the constraints.)

 All the legal possible ways in which we can divide the coordinate plane to predict the outcome of the test data compose the hypothesis space.
 Each individual possible way is known as a hypothesis.
Inductive Learning Algorithm :

Inductive Learning Algorithm (ILA) is an iterative and inductive machine learning algorithm used for generating a set of classification rules of the form “IF-THEN”, producing rules at each iteration and appending them to the rule set. Basic idea: there are basically two methods for knowledge extraction: from domain experts, and with machine learning. For a very large amount of data, domain experts are not very useful or reliable, so we move toward the machine learning approach. One machine learning method is to replicate the expert's logic in the form of algorithms, but this work is tedious, time-consuming, and expensive. So we move toward inductive algorithms, which generate the strategy for performing a task themselves and need not be instructed separately at each step. Need for ILA in the presence of other machine learning algorithms: ILA is a newer algorithm that was needed even when other inductive learning algorithms like ID3 and AQ were available.
 The need was due to the pitfalls present in the previous algorithms; one of the major pitfalls was the lack of generalisation of rules.
 ID3 and AQ used the decision tree production method, which was too specific, difficult to analyse, and very slow for basic short classification problems.
 The decision tree-based algorithms were unable to work on a new problem if some attributes were missing.
 ILA uses the method of producing a general set of rules instead of decision trees, which overcomes the above problems.
THE ILA ALGORITHM

General requirements at the start of the algorithm:
1. List the examples in the form of a table ‘T’ where each row corresponds to an example and each column contains an attribute value.
2. Create a set of m training examples, each example composed of k attributes and a class attribute with n possible decisions.
3. Create a rule set, R, having the initial value false.
4. Initially, all rows in the table are unmarked.

Steps in the algorithm:
Step 1: Divide the table ‘T’ containing m examples into n sub-tables (t1, t2, ..., tn), one table for each possible value of the class attribute (repeat steps 2-8 for each sub-table).
Step 2: Initialize the attribute combination count ‘j’ = 1.
Step 3: For the sub-table being worked on, divide the attribute list into distinct combinations, each combination with ‘j’ distinct attributes.
Step 4: For each combination of attributes, count the number of occurrences of attribute values that appear under the same combination of attributes in unmarked rows of the sub-table under consideration, and at the same time do not appear under the same combination of attributes of other sub-tables. Call the first combination with the maximum number of occurrences the max-combination ‘MAX’.
Step 5: If ‘MAX’ == null, increase ‘j’ by 1 and go to Step 3.
Step 6: Mark all rows of the sub-table being worked on, in which the values of ‘MAX’ appear, as classified.
Step 7: Add a rule (IF attribute = “XYZ” THEN decision is YES/NO) to R whose left-hand side has the attribute names of ‘MAX’ with their values separated by AND, and whose right-hand side contains the decision attribute value associated with the sub-table.
Step 8: If all rows are marked as classified, then move on to process another sub-table and go to Step 2; else, go to Step 4. If no sub-tables are available, exit with the set of rules obtained till then.

An example showing the use of ILA: suppose an example set has attributes Place type, Weather, Location, Decision, and seven examples; our task is to generate a set of rules determining, under what condition, what the decision is.

Example no.  Place type  Weather  Location  Decision
I            hilly       winter   kullu     yes
II           mountain    windy    mumbai    no
III          mountain    windy    shimla    yes
IV           beach       windy    mumbai    no
V            beach       warm     goa       yes
VI           beach       windy    goa       no
VII          beach       warm     shimla    yes

Step 1: subset 1 (decision = yes)

s.no  Place type  Weather  Location  Decision
1     hilly       winter   kullu     yes
2     mountain    windy    shimla    yes
3     beach       warm     goa       yes
4     beach       warm     shimla    yes

subset 2 (decision = no)

s.no  Place type  Weather  Location  Decision
5     mountain    windy    mumbai    no
6     beach       windy    mumbai    no
7     beach       windy    goa       no

Steps (2-8):
 At iteration 1, rows 3 & 4, column Weather (value warm) are selected, and rows 3 & 4 are marked. The rule added to R: IF the weather is warm THEN the decision is yes.
 At iteration 2, row 1, column Place type (value hilly) is selected, and row 1 is marked. The rule added to R: IF place type is hilly THEN the decision is yes.
 At iteration 3, row 2, column Location (value Shimla) is selected, and row 2 is marked. The rule added to R: IF location is Shimla THEN the decision is yes.
 At iteration 4, rows 5 & 6, column Location (value Mumbai) are selected, and rows 5 & 6 are marked. The rule added to R: IF location is Mumbai THEN the decision is no.
 At iteration 5, row 7, columns Place type & Weather are selected, and row 7 is marked. The rule added to R: IF place type is beach AND the weather is windy THEN the decision is no.

Finally, we get the rule set:
 Rule 1: IF the weather is warm THEN the decision is yes.
 Rule 2: IF place type is hilly THEN the decision is yes.
 Rule 3: IF location is Shimla THEN the decision is yes.
 Rule 4: IF location is Mumbai THEN the decision is no.
 Rule 5: IF place type is beach AND the weather is windy THEN the decision is no.
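A runnable sketch of ILA on this table; the attribute and value spellings are taken from the example above, while the function layout itself is just one possible implementation:

    from itertools import combinations

    ATTRS = ["place_type", "weather", "location"]
    examples = [
        {"place_type": "hilly",    "weather": "winter", "location": "kullu",  "decision": "yes"},
        {"place_type": "mountain", "weather": "windy",  "location": "mumbai", "decision": "no"},
        {"place_type": "mountain", "weather": "windy",  "location": "shimla", "decision": "yes"},
        {"place_type": "beach",    "weather": "windy",  "location": "mumbai", "decision": "no"},
        {"place_type": "beach",    "weather": "warm",   "location": "goa",    "decision": "yes"},
        {"place_type": "beach",    "weather": "windy",  "location": "goa",    "decision": "no"},
        {"place_type": "beach",    "weather": "warm",   "location": "shimla", "decision": "yes"},
    ]

    def ila(examples, attrs, class_attr="decision"):
        rules = []
        # Step 1: one sub-table per decision value, in order of first appearance
        for cls in dict.fromkeys(ex[class_attr] for ex in examples):
            sub = [ex for ex in examples if ex[class_attr] == cls]
            others = [ex for ex in examples if ex[class_attr] != cls]
            unmarked = list(range(len(sub)))
            j = 1                                        # Step 2
            while unmarked and j <= len(attrs):
                best_combo, best_rows = None, []
                for combo in combinations(attrs, j):     # Steps 3-4
                    counts = {}
                    for i in unmarked:
                        vals = tuple(sub[i][a] for a in combo)
                        # the value combination must not occur in any other sub-table
                        if any(tuple(o[a] for a in combo) == vals for o in others):
                            continue
                        counts.setdefault(vals, []).append(i)
                    for vals, rows in counts.items():
                        if len(rows) > len(best_rows):   # first max-combination wins
                            best_combo, best_rows = (combo, vals), rows
                if best_combo is None:
                    j += 1                               # Step 5: MAX is null
                    continue
                combo, vals = best_combo
                rules.append((dict(zip(combo, vals)), cls))             # Step 7
                unmarked = [i for i in unmarked if i not in best_rows]  # Step 6
        return rules

    for cond, cls in ila(examples, ATTRS):
        print("IF", " AND ".join(f"{a} = {v}" for a, v in cond.items()),
              "THEN decision is", cls)

Run as written, this reproduces the five rules listed above.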

Issues in decision tree learning.


A decision tree, as we've already discussed, is a method for approximating discrete-valued target functions, and it falls under the category of supervised learning. Decision trees can be used to address problems involving regression and classification.

Practical issues in learning decision trees:

Practical issues in learning decision trees include: determining how deep to grow the decision tree, handling continuous attributes, choosing an appropriate attribute selection measure, handling training data with missing attribute values, handling attributes with differing costs, and improving computational efficiency.

Let’s have a look at each one of them briefly,

Overfitting the Data:

A model is regarded as a good machine learning model if, as we create it, it generalizes in an appropriate manner to any new input data from the problem domain.

Each branch of the tree is grown just deep enough by the algorithm to properly
categorize the training instances.
In reality, when there is noise in the data or when the number of training
instances is insufficient to provide a representative sample of the underlying
target function, it might cause problems.

This basic technique may yield trees that overfit the training samples in either
instance.

The formal definition of overfitting is: “Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists another hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has smaller error over the full distribution of instances.”

The effect can be seen in a typical accuracy-versus-tree-size plot (the graphic itself is omitted here): as the tree is built, the horizontal axis shows the total number of nodes in the decision tree, and the vertical axis indicates the accuracy of the tree's predictions.

The solid line depicts the decision tree’s accuracy over the training instances,
whereas the broken line depicts accuracy over a separate set of test cases not
included in the training set.

The tree’s accuracy over the training instances grows in a linear fashion as it
matures. The accuracy assessed over the independent test cases, on the other
hand, increases at first, then falls.

As can be observed, once the tree size reaches about 25 nodes, additional
elaboration reduces the tree’s accuracy on the test cases while boosting it on
the training examples.
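A small sketch reproducing this effect, assuming scikit-learn and a synthetic noisy dataset (the dataset parameters are invented for illustration): training accuracy keeps climbing with tree size, while test accuracy typically peaks and then falls.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for depth in [1, 2, 4, 8, None]:          # None lets the tree grow fully
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
        print(depth, tree.tree_.node_count,
              round(tree.score(X_tr, y_tr), 2),   # accuracy on training instances
              round(tree.score(X_te, y_te), 2))   # accuracy on held-out test cases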

What is Underfitting:

When a machine learning system fails to capture the underlying trend of the
data, it is considered to be underfitting. Our machine learning model’s accuracy
is ruined by underfitting.
Its occurrence merely indicates that our model or method does not adequately fit the data. Underfitting may be prevented by collecting additional data and by employing feature selection.

Both of the errors usually occur when the training example contains errors or
noise.

What is Noise?

Real-world data contains noise, which is unnecessary or nonsensical data that


may dramatically impair various data analyses. Classification, grouping, and
association analysis are examples of machine learning tasks.

Even when the training data is noise-free, overfitting can occur, especially
when tiny numbers of samples are connected with leaf nodes.

In this scenario, coincidental regularities are possible, in which some attribute, despite being unrelated to the actual target function, happens to divide the examples quite effectively.

There is a risk of overfitting whenever such accidental regularities emerge.

What can we do to avoid overfitting? Here are a few examples of frequent


heuristics:

 Don't try to fit all of the training examples; instead, stop growing the tree early.
 After fitting all of the instances, prune the resulting tree.

In decision tree learning, there are numerous methods for preventing


overfitting.

These may be divided into two categories:

 Techniques that stop growing the tree before it reaches the point where it perfectly classifies the training data.
 Approaches that allow the tree to overfit the data and then post-prune the tree.

Despite the fact that the first strategy appears more straightforward, the second approach of post-pruning overfit trees has proven more effective in practice. Criteria used to determine the correct final tree size:

 To assess the usefulness of post-pruning nodes from the tree, use a separate
set of examples from the training examples.
 Use all available data for training, but do a statistical test to see if extending (or
pruning) a specific node would result in a better result than the training set.
 A chi-square test is performed to see if enlarging a node would increase
performance throughout the full instance distribution or only on the current
sample of training data.
 When encoding the training samples and the decision tree, use an explicit measure of complexity, halting the tree's growth when this encoding size is minimized. This method is based on the Minimum Description Length heuristic.
