ML Unit I
UNIT I
Introduction:
A well-posed learning problem is defined in terms of three features:
Task (T)
Performance Measure (P)
Experience (E)
Some examples that illustrate a well-posed learning problem are:
1. To better filter emails as spam or not
Task – Classifying emails as spam or not
Performance Measure – The fraction of emails correctly classified as spam
or not spam
Experience – Observing you label emails as spam or not spam
2. A checkers learning problem
Task – Playing checkers game
Performance Measure – Percent of games won against opponents
Experience – Playing practice games against itself
The more data the machine works with, the more experience it gains, and the
more accurate the results it produces.
Example: In a driverless car, the training data fed to the algorithm covers how
to drive the car on highways and on busy and narrow streets, with factors like
speed limits, parking, and stopping at signals. A logical and mathematical
model is then created on the basis of that data, and afterwards the car
operates according to this model. Again, the more data that is fed, the more
accurate the output produced.
Steps for Designing Learning System are:
Step 1) Choosing the Training Experience: The first and most important task
is to choose the training data or training experience that will be fed to the
machine learning algorithm. It is important to note that the data or experience
we feed to the algorithm has a significant impact on the success or failure of
the model, so the training data or experience should be chosen wisely.
Below are the attributes that impact the success or failure of the model:
The training experience should provide direct or indirect feedback regarding
choices. For example, while playing chess the training data provides feedback
such as: if this move is chosen instead of that one, the chances of success
increase.
The second important attribute is the degree to which the learner controls the
sequence of training examples. For example, when training data is first fed to
the machine its accuracy is very low, but as it gains experience by playing
again and again against itself or an opponent, the algorithm receives feedback
and controls the chess game accordingly.
The third important attribute is how well the training experience represents the
distribution of examples over which performance will be measured. A machine
learning algorithm gains experience by going through a number of different
cases and examples; by passing through more and more examples, it gains
more and more experience and hence its performance improves.
Step 2) Choosing the Target Function: The next important step is choosing the
target function. According to the knowledge fed to the algorithm, the machine
learning system will choose a NextMove function that describes what type of
legal move should be taken. For example, while playing chess with an
opponent, when the opponent plays, the machine learning algorithm decides
which of the possible legal moves to take in order to succeed.
Step 3) Choosing a Representation for the Target Function: Once the machine
knows all the possible legal moves, the next step is to choose a representation
through which the optimized move can be selected, e.g. linear equations, a
hierarchical graph representation, tabular form, etc. Using this representation,
the NextMove function picks, out of the candidate moves, the one with the
highest success rate. For example, if the machine has four possible chess
moves, it chooses the optimized move that leads to success.
Step 4) Choosing a Function Approximation Algorithm: An optimized move
cannot be chosen from the training data alone. The learner has to work
through a set of examples, and from these examples it approximates which
steps should be chosen, receiving feedback on each choice. For example,
when training data for playing chess is fed to the algorithm, the machine does
not know in advance whether a move will fail or succeed; from each failure or
success it estimates, for the next move, which step should be chosen and
what its success rate is.
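To make this step concrete, here is a minimal Python sketch of the LMS (least mean squares) weight-update rule that Tom Mitchell uses for exactly this step in his checkers design, which this section parallels. The linear evaluation function, the three board features, and the training value 10.0 are assumptions made purely for illustration.

def v_hat(w, x):
    # linear evaluation function: w0 + w1*x1 + ... + wn*xn
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def lms_update(w, x, v_train, lr=0.01):
    # nudge every weight in the direction that reduces (v_train - v_hat)
    error = v_train - v_hat(w, x)
    w[0] += lr * error                      # bias weight (its feature is always 1)
    for i, xi in enumerate(x, start=1):
        w[i] += lr * error * xi
    return w

w = [0.0] * 4                               # bias + 3 hypothetical board features
for _ in range(1000):
    w = lms_update(w, [2.0, 1.0, 0.0], 10.0)    # a board whose training value is 10
print(w)                                    # v_hat(w, [2, 1, 0]) is now close to 10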
Step 5) Final Design: The final design is created at the end, once the system
has gone through a great number of examples, failures and successes, and
correct and incorrect decisions, and has learned what the next step should be.
Example: Deep Blue, an intelligent chess-playing computer, won a match
against the chess expert Garry Kasparov and became the first computer to
defeat a human chess champion.
Data plays a significant role in the machine learning process. One of the
significant issues that machine learning professionals face is the absence of
good quality data. Unclean and noisy data can make the whole process
extremely exhausting. We don’t want our algorithm to make inaccurate or faulty
predictions. Hence the quality of data is essential to enhance the output.
Therefore, we need to ensure that data preprocessing, which includes
removing outliers, filtering missing values, and removing unwanted features,
is done with the utmost care.
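As an illustration, here is a minimal pandas sketch of those preprocessing steps; the file name, the column names, and the three-standard-deviation outlier rule are all assumptions, not part of the original text.

import pandas as pd

df = pd.read_csv("data.csv")                    # hypothetical dataset
df = df.drop(columns=["unwanted_feature"])      # remove unwanted features
df = df.dropna()                                # filter out rows with missing values
values = df["value"]                            # hypothetical numeric column
df = df[(values - values.mean()).abs() <= 3 * values.std()]  # remove outliers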
The machine learning industry is young and continuously changing, and rapid
trial-and-error experiments are constantly being carried out. The process
keeps transforming, so there is a high chance of error, which makes the
learning complex. It includes analyzing the data, removing data bias, training
the model, applying complex mathematical calculations, and much more. It is
therefore a really complicated process, which is another big challenge for
machine learning professionals.
The most important task in the machine learning process is to train the model
on enough data to achieve accurate output. Too little training data will produce
inaccurate or overly biased predictions. Let us understand this with the help of
an example. Training a machine learning algorithm is similar to teaching a
child. Suppose one day you decide to explain to a child how to distinguish
between an apple and a watermelon: you take an apple and a watermelon and
show him the difference between the two based on their color, shape, and
taste. In this way, he will soon attain perfection in differentiating between them.
A machine learning algorithm, on the other hand, needs a lot of data to make
the distinction; for complex problems it may even require millions of examples
for training. Therefore, we need to ensure that machine learning algorithms are
trained with sufficient amounts of data.
Slow Implementation
This is one of the common issues faced by machine learning professionals.
Machine learning models are highly effective at providing accurate results, but
producing those results takes a tremendous amount of time. Slow programs,
data overload, and excessive requirements mean it usually takes a long time
to obtain accurate results. Further, constant monitoring and maintenance are
required to deliver the best output.
So you have found quality data, trained your model well, and the predictions
are precise and accurate. You have learned how to create a machine learning
algorithm! But wait, there is a twist: the model may become useless in the
future as the data grows. The best model of the present may become
inaccurate in the future and require further adjustment. So you need regular
monitoring and maintenance to keep the algorithm working. This is one of the
most exhausting issues faced by machine learning professionals.
Hypotheses can be ordered from the most specific to the most general. This
ordering allows a machine learning algorithm to search the hypothesis space
thoroughly without having to enumerate each and every hypothesis in it, which
is impossible when the hypothesis space is infinitely large.
Task T: Determine the value of EnjoySport for any given day based on the
values of the day's attributes.
Each hypothesis can be considered a conjunction of six constraints, specifying
the values of the six attributes Sky, AirTemp, Humidity, Wind, Water, and
Forecast.
h2 = <Rainy, ?, Strong>
The question is how many, and which, examples are classified as positive by
each of these hypotheses (i.e., satisfy them). Only example 4 satisfies h1;
however, both examples 3 and 4 satisfy h2 and are classified as positive.
What is the reason behind this? What makes these two hypotheses so
different? The answer lies in how strictly each of them imposes constraints. As
you can see, h1 places more restrictions than h2! Naturally, h2 can classify
more positive examples than h1. In this case, we can assert the following:
“If an example satisfies h1, it will certainly satisfy h2, but not the other way
around.”
This is because h2 is more general than h1: h2 admits a wider range of
instances than h1. For example, if an instance has the values
<Rainy, Freezing, Strong>, h2 will classify it as positive, but h1 will not be
satisfied.
We state that x fulfils h if and only if h(x) = 1 for each instance x in X and
hypothesis h in H.
Definition:
Let hj and hk be boolean-valued functions defined over X. Then hj is more
general than or equal to hk (written hj ≥g hk) if and only if
(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
The letter g stands for “general.” One hypothesis may also be strictly more
general than another: hj >g hk if and only if hj ≥g hk and hk is not ≥g hj.
Because every instance that satisfies h1 also satisfies h2, hypothesis h2 is
more general than h1.
It is worth noting that neither h1 nor h3 is more general than the other; while
the sets of instances satisfied by the two hypotheses overlap, neither set
subsumes the other.
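The ≥g relation is easy to express in code. Below is a small Python sketch for conjunctive hypotheses like those above; because the text never spells out h1, the concrete value used for it here is an assumption chosen to be consistent with the discussion.

def satisfies(x, h):
    # x satisfies h when every constraint is '?' or equals the attribute value
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    # hj >=g hk: every instance that satisfies hk also satisfies hj
    return all(a == '?' or a == b for a, b in zip(hj, hk))

h1 = ('Rainy', 'Warm', 'Strong')         # assumed; the text only defines h2
h2 = ('Rainy', '?', 'Strong')
x = ('Rainy', 'Freezing', 'Strong')

print(satisfies(x, h2))                  # True:  h2 classifies x as positive
print(satisfies(x, h1))                  # False: h1 is not fulfilled by x
print(more_general_or_equal(h2, h1))     # True:  h2 >=g h1
print(more_general_or_equal(h1, h2))     # False: the reverse does not hold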
In fact, we can observe that all the instances that satisfy h1 and h3 also satisfy
h2, so we can conclude that h2 ≥g h1 and h2 ≥g h3.
There are a handful of key algorithms that explore the hypothesis space, H, by
making use of the ≥g ordering. Find-S is one such method, with S standing for
“specific” and implying that the purpose is to identify the most specific
hypothesis.
FIND-S Algorithm
Introduction:
Find-S starts with the most specific hypothesis in H and generalizes it only as
far as needed to cover each positive training example; negative examples are
ignored. In the worked example below, each example has four attributes, and
the initial hypothesis is the most specific one, h = { ϕ, ϕ, ϕ, ϕ }.
Consider example 1 :
The data in example 1 is { GREEN, HARD, NO, WRINKLED }, a positive
example. We see that our initial hypothesis is more specific than this, so we
have to generalize it for this example. Hence, the hypothesis becomes:
h = { GREEN, HARD, NO, WRINKLED }
Consider example 2 :
Here we see that this example has a negative outcome. Hence we neglect this
example and our hypothesis remains the same.
h = { GREEN, HARD, NO, WRINKLED }
Consider example 3 :
Here we see that this example has a negative outcome. Hence we neglect this
example and our hypothesis remains the same.
h = { GREEN, HARD, NO, WRINKLED }
Consider example 4 :
The data present in example 4 is { ORANGE, HARD, NO, WRINKLED }, a
positive example. We compare every single attribute with the current
hypothesis, and wherever a mismatch is found we replace that particular
attribute with the general case (“?”). After doing this, the hypothesis becomes:
h = { ?, HARD, NO, WRINKLED }
Consider example 5 :
The data present in example 5 is { GREEN, SOFT, YES, SMOOTH }, a
positive example. We compare every single attribute with the current
hypothesis, and wherever a mismatch is found we replace that particular
attribute with the general case (“?”). After doing this, the hypothesis becomes:
h = { ?, ?, ?, ? }
Since we have reached a point where all the attributes in our hypothesis are
fully general, examples 6 and 7 would result in the same hypothesis with all
general attributes.
h = { ?, ?, ?, ? }
Hence, for the given data the final hypothesis would be :
Final Hypothesis: h = { ?, ?, ?, ? }
Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
For each attribute constraint ai in h
If the constraint ai is satisfied by x
Then do nothing
Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
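The following is a runnable Python sketch of this algorithm applied to the worked example above. Since the text does not give attribute values for the negative examples (2 and 3) or for examples 6 and 7, the values used for them here are assumptions; only the positive examples influence the result anyway.

def find_s(examples):
    h = None                              # the most specific (null) hypothesis
    for x, positive in examples:
        if not positive:
            continue                      # Find-S ignores negative examples
        if h is None:
            h = list(x)                   # first positive example: copy it
        else:                             # generalize every mismatched constraint
            h = [hc if hc == xc else '?' for hc, xc in zip(h, x)]
    return h

examples = [
    (('GREEN', 'HARD', 'NO', 'WRINKLED'), True),    # example 1
    (('ORANGE', 'SOFT', 'YES', 'SMOOTH'), False),   # example 2 (values assumed)
    (('GREEN', 'SOFT', 'NO', 'SMOOTH'), False),     # example 3 (values assumed)
    (('ORANGE', 'HARD', 'NO', 'WRINKLED'), True),   # example 4
    (('GREEN', 'SOFT', 'YES', 'SMOOTH'), True),     # example 5
    (('ORANGE', 'SOFT', 'NO', 'SMOOTH'), True),     # example 6 (values assumed)
    (('GREEN', 'HARD', 'YES', 'WRINKLED'), True),   # example 7 (values assumed)
]
print(find_s(examples))                   # ['?', '?', '?', '?']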
Version Spaces
A version space is a hierarchical representation of knowledge that enables you
to keep track of all the useful information supplied by a sequence of learning
examples without remembering any of the examples.
Fundamental Assumptions
In the diagram below, the specialization tree is colored red, and the
generalization tree is colored green.
That is, each time a negative example is used to specialize the general
models, those specific models that match the negative example are
eliminated, and each time a positive example is used to generalize the specific
models, those general models that fail to match the positive example are
eliminated.
Eventually, the positive and negative examples may be such that only one
general model and one identical specific model survive.
Given:
A representation language.
A set of positive and negative examples expressed in that language.
Compute: a concept description that is consistent with all the positive examples
and none of the negative examples.
Method:
For a positive example:
1. Generalize all the specific models to match the positive example, but ensure
that the changes are minimal.
2. Prune away all the general models that fail to match the positive example.
For a negative example:
1. Specialize all general models to prevent a match with the negative example,
but ensure the following:
The new general models involve minimal changes.
Each new general model is a generalization of some specific model.
No new general model is a specialization of some other general model.
2. Prune away all the specific models that match the negative example.
If S and G are both singleton sets, then:
If they are identical, the algorithm has converged on the target concept.
If they are different, the training examples were inconsistent.
This approach has several attractive properties:
It can describe all the possible hypotheses in the language that are consistent
with the data.
It is fast (close to linear).
It outputs not just one hypothesis but the set of all hypotheses consistent with
the training data set.
Algorithm:
Algorithmic steps (for the EnjoySport training data):
Initially:
G = [[?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?],
[?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?], [?, ?, ?, ?, ?, ?]]
S = [Null, Null, Null, Null, Null, Null]
After processing all of the training examples, the final specific boundary is:
S = ['sunny', 'warm', ?, 'strong', ?, ?]
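Below is a minimal Python sketch of candidate elimination that reproduces this trace. The four EnjoySport training examples are the standard ones from Mitchell's textbook, assumed here because the trace matches them, and some boundary-pruning edge cases of the full algorithm are omitted for brevity.

def matches(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def candidate_elimination(examples, n_attrs):
    S = None                               # specific boundary (None = null hypothesis)
    G = [('?',) * n_attrs]                 # general boundary
    for x, positive in examples:
        if positive:
            # generalize S minimally to cover x; drop members of G inconsistent with x
            S = x if S is None else tuple(s if s == v else '?' for s, v in zip(S, x))
            G = [g for g in G if matches(g, x)]
        else:
            # replace each matching member of G by its minimal specializations
            newG = []
            for g in G:
                if not matches(g, x):
                    newG.append(g)
                    continue
                for i in range(n_attrs):
                    if g[i] == '?' and S[i] != '?' and S[i] != x[i]:
                        newG.append(g[:i] + (S[i],) + g[i + 1:])
            G = newG
    return S, G

examples = [   # Sky, AirTemp, Humidity, Wind, Water, Forecast (assumed data)
    (('sunny', 'warm', 'normal', 'strong', 'warm', 'same'), True),
    (('sunny', 'warm', 'high', 'strong', 'warm', 'same'), True),
    (('rainy', 'cold', 'high', 'strong', 'warm', 'change'), False),
    (('sunny', 'warm', 'high', 'strong', 'cool', 'change'), True),
]
S, G = candidate_elimination(examples, 6)
print(S)    # ('sunny', 'warm', '?', 'strong', '?', '?')
print(G)    # [('sunny', '?', '?', '?', '?', '?'), ('?', 'warm', '?', '?', '?', '?')]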
INDUCTIVE BIAS
Decision tree algorithms fall under the category of supervised learning. They can be used to
solve both regression and classification problems.
A decision tree uses a tree representation to solve the problem, in which each leaf node
corresponds to a class label and attributes are represented by the internal nodes of the tree.
We can represent any Boolean function on discrete attributes using a decision tree.
Below are some assumptions that we make while using decision trees:
At the beginning, we consider the whole training set as the root.
Feature values are preferred to be categorical. If the values are continuous, they are
discretized prior to building the model.
Records are distributed recursively on the basis of attribute values.
We use statistical methods for ordering attributes as the root or as internal nodes.
A decision tree works on the Sum of Product (SOP) form, which is also known as Disjunctive
Normal Form: each root-to-leaf path is a conjunction of attribute tests, and the class is the
disjunction of its paths. In the example this section refers to, the tree predicts a person's use
of a computer in daily life.
In a decision tree, the major challenge is the identification of the attribute for the root node at
each level. This process is known as attribute selection. We have two popular attribute
selection measures:
1. Information Gain
2. Gini Index
1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets,
the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v,
and Values(A) is the set of all possible values of A. Then
Gain(S, A) = Entropy(S) − Σ v ∈ Values(A) (|Sv| / |S|) · Entropy(Sv)
Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an
arbitrary collection of examples. The higher the entropy, the greater the information content.
For a collection S whose classes occur with proportions p1, ..., pc:
Entropy(S) = − Σ i pi log2(pi)
Example:
For the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of b: 5
Instances of a: 3
Entropy(X) = − (3/8) log2(3/8) − (5/8) log2(5/8)
= − (−0.53 − 0.424)
= 0.954
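A short Python check of this arithmetic (assuming, as is standard, logarithms to base 2):

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

print(entropy([3, 5]))    # 0.954434... for X = {a, a, a, b, b, b, b, b}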
X | Y | Z | C
1 | 1 | 1 | I
1 | 1 | 0 | I
0 | 0 | 1 | II
1 | 0 | 0 | II
[Figures: the dataset split on feature X, on feature Y, and on feature Z, with the information
gain computed for each split.]
From these splits we can see that the information gain is maximum when we split on feature
Y, so feature Y is best suited for the root node. We can also see that when splitting the
dataset by feature Y, each child contains a pure subset of the target variable, so we don't
need to split the dataset any further. The final tree for the above dataset is therefore a single
split on Y, with leaf class I for Y = 1 and leaf class II for Y = 0.
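A short Python sketch that recomputes these information gains from the table above and confirms that the split on Y is best:

from collections import Counter
from math import log2

rows = [   # (X, Y, Z, class) from the table above
    (1, 1, 1, 'I'), (1, 1, 0, 'I'), (0, 0, 1, 'II'), (1, 0, 0, 'II'),
]

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, col):
    gain = entropy([r[-1] for r in rows])           # entropy before the split
    for v in {r[col] for r in rows}:                # subtract weighted child entropies
        subset = [r[-1] for r in rows if r[col] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

for name, col in (('X', 0), ('Y', 1), ('Z', 2)):
    print(name, round(info_gain(rows, col), 3))     # Y gives the maximum gain, 1.0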
2. Gini Index
The Gini Index is a metric that measures how often a randomly chosen element would be
incorrectly identified.
This means an attribute with a lower Gini index should be preferred.
Scikit-learn supports the “gini” criterion for the Gini Index, and it takes the “gini” value by
default.
The formula for the calculation of the Gini Index is:
Gini = 1 − Σ i pi², where pi is the proportion of class i.
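As a sketch, the formula and the scikit-learn usage mentioned above look like this in Python (the class counts reuse the {a, a, a, b, b, b, b, b} set from the entropy example):

from sklearn.tree import DecisionTreeClassifier

def gini(counts):
    # Gini = 1 - sum of squared class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([3, 5]))                               # 0.46875

clf = DecisionTreeClassifier(criterion="gini")    # explicit, though "gini" is the default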
Although a variety of decision tree learning methods have been developed with
somewhat differing capabilities and requirements, decision tree learning is
generally best suited to problems with the following characteristics:
Decision tree learning algorithms have been successfully used in expert
systems for capturing knowledge. The main task performed in these systems
is applying inductive methods to the given attribute values of an unknown
object to determine an appropriate classification according to decision tree
rules.
Decision trees classify instances by traversing from the root node to a leaf
node. We start at the root node of the decision tree, test the attribute specified
by this node, and then move down the tree branch according to the attribute's
value in the given instance. This process is then repeated at the subtree level.
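As a small illustration, the sketch below classifies an instance by walking such a tree. The nested-dict representation is an assumption made for this example; the toy tree is the single split on Y found in the information gain section above.

def classify(tree, instance):
    # walk from the root, testing the attribute named at each internal node
    while isinstance(tree, dict):
        value = instance[tree["attr"]]
        tree = tree["branches"][value]     # follow the branch for that value
    return tree                            # a leaf is simply the class label

tree = {"attr": "Y", "branches": {1: "I", 0: "II"}}
print(classify(tree, {"X": 1, "Y": 0, "Z": 1}))    # II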
1. Instances are represented by attribute-value pairs.
2. The target function has discrete output values. It can easily deal with
instances that are assigned a boolean decision, such as 'true' and 'false', or
'p (positive)' and 'n (negative)'. Although it is possible to extend the target to
real-valued outputs, we will cover that issue in a later part of this report.
3. The training data may contain errors. This can be dealt with using pruning
techniques, which we will not cover here.
The three widely used decision tree learning algorithms are ID3, ASSISTANT,
and C4.5.
In most supervised machine learning algorithms, our main goal is to find, from
the hypothesis space, a hypothesis that could map the inputs to the proper
outputs.
The following figure shows the common method for finding a possible
hypothesis from the hypothesis space:
Suppose we have test data for which we have to determine the outputs or
results, for instance points on a coordinate plane to be classified.
Note that the coordinate plane could be divided in more than one legal way;
how it is divided depends on the data, the algorithm, and the constraints.
All of these legal possible ways in which we can divide the coordinate plane to
predict the outcome of the test data together compose the hypothesis space,
and each individual possible way is known as a hypothesis.
Inductive Learning Algorithm:
1. List the examples in the form of a table 'T' where each row corresponds to
an example and each column contains an attribute value.
2. Create a set of m training examples, each example composed of k attributes
and a class attribute with n possible decisions.
3. Create a rule set, R, having the initial value false.
4. Initially, all rows in the table are unmarked.
Steps in the algorithm:
Step 1: Divide the table 'T' containing m examples into n sub-tables (t1, t2, ..., tn),
one table for each possible value of the class attribute. (Repeat steps 2-8 for
each sub-table.)
Step 2: Initialize the attribute combination count 'j' = 1.
Step 3: For the sub-table being worked on, divide the attribute list into distinct
combinations, each combination having 'j' distinct attributes.
Step 4: For each combination of attributes, count the number of occurrences of
attribute values that appear under the same combination of attributes in
unmarked rows of the sub-table under consideration and, at the same time, do
not appear under the same combination of attributes in the other sub-tables.
Call the first combination with the maximum number of occurrences the
max-combination 'MAX'.
Step 5: If 'MAX' == null, increase 'j' by 1 and go to Step 3.
Step 6: Mark all rows of the sub-table being worked on, in which the values of
'MAX' appear, as classified.
Step 7: Add a rule (IF attribute = "XYZ" THEN decision is YES/NO) to R, whose
left-hand side contains the attribute names of 'MAX' with their values separated
by AND, and whose right-hand side contains the decision attribute value
associated with the sub-table.
Step 8: If all rows are marked as classified, move on to process another
sub-table and go to Step 2; otherwise go to Step 4. If no sub-tables are
available, exit with the set of rules obtained so far.
An example showing the use of ILA: suppose an example set has the attributes
place type, weather, and location, plus a decision attribute, and seven
examples; our task is to generate the set of rules that determine the decision
under each condition, as in the sketch after the table below.
Example no. | Place type | Weather | Location | Decision
II | mountain | windy | Mumbai | No
IV | beach | windy | Mumbai | No
5 | mountain | windy | Mumbai | No
6 | beach | windy | Mumbai | No
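Below is a compact Python sketch of Steps 1 through 8 applied to this example. Since the example table survives only in fragments, the seven rows used in the code are assumed reconstructions consistent with those fragments; the rules the code prints follow the IF-THEN format of Step 7.

from collections import Counter
from itertools import combinations

attrs = ['place type', 'weather', 'location']
examples = [   # (place type, weather, location) -> decision; values partly assumed
    (('hilly', 'winter', 'Kullu'), 'Yes'),
    (('mountain', 'windy', 'Mumbai'), 'No'),
    (('mountain', 'windy', 'Shimla'), 'Yes'),
    (('beach', 'windy', 'Mumbai'), 'No'),
    (('beach', 'warm', 'Goa'), 'Yes'),
    (('beach', 'windy', 'Goa'), 'No'),
    (('beach', 'warm', 'Shimla'), 'Yes'),
]

def ila(examples, attrs):
    rules = []
    for cls in dict.fromkeys(d for _, d in examples):   # Step 1: a sub-table per class
        sub = [x for x, d in examples if d == cls]
        others = [x for x, d in examples if d != cls]
        marked, j = set(), 1                            # Step 2
        while len(marked) < len(sub) and j <= len(attrs):
            best, best_count = None, 0
            for combo in combinations(range(len(attrs)), j):    # Step 3
                counts = Counter(tuple(row[i] for i in combo)
                                 for k, row in enumerate(sub) if k not in marked)
                for vals, c in counts.items():          # Step 4: unique to this sub-table?
                    in_others = any(all(row[i] == v for i, v in zip(combo, vals))
                                    for row in others)
                    if not in_others and c > best_count:
                        best, best_count = (combo, vals), c
            if best is None:                            # Step 5: widen the combinations
                j += 1
                continue
            combo, vals = best
            for k, row in enumerate(sub):               # Step 6: mark covered rows
                if all(row[i] == v for i, v in zip(combo, vals)):
                    marked.add(k)
            cond = ' AND '.join(f'{attrs[i]} = {v}' for i, v in zip(combo, vals))
            rules.append(f'IF {cond} THEN decision is {cls}')   # Step 7
    return rules                                        # Step 8: loop over sub-tables

for rule in ila(examples, attrs):
    print(rule)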
Each branch of the tree is grown just deep enough by the algorithm to perfectly
classify the training instances.
In reality, when there is noise in the data, or when the number of training
instances is too small to provide a representative sample of the true underlying
target function, this can cause problems: in either case, this simple technique
may yield trees that overfit the training examples.
As the tree is built, the horizontal axis of this graphic shows the total number of
nodes in the decision tree, while the vertical axis indicates the accuracy of the
tree's predictions.
The solid line depicts the decision tree’s accuracy over the training instances,
whereas the broken line depicts accuracy over a separate set of test cases not
included in the training set.
The tree's accuracy over the training instances grows steadily as the tree
matures. The accuracy measured over the independent test cases, on the
other hand, increases at first, then falls.
As can be observed, once the tree size reaches about 25 nodes, additional
elaboration reduces the tree’s accuracy on the test cases while boosting it on
the training examples.
What is Underfitting?
When a machine learning model fails to capture the underlying trend of the
data, it is said to be underfitting. Underfitting ruins our machine learning
model's accuracy. Its occurrence simply means that our model or algorithm
does not fit the data well enough. Underfitting may be prevented by collecting
additional data and by employing feature selection to reduce the number of
features.
Both of these errors usually occur when the training examples contain errors
or noise.
What is Noise?
Even when the training data is noise-free, overfitting can occur, especially
when small numbers of samples are associated with leaf nodes.
There are several approaches to avoiding overfitting in decision tree learning:
Stop growing the tree earlier, before it perfectly fits all of the training examples.
Alternatively, fit all of the instances and then post-prune the resulting tree.
To assess the usefulness of post-pruning nodes from the tree, use a set of
examples separate from the training examples.
Use all available data for training, but apply a statistical test to estimate
whether expanding (or pruning) a particular node is likely to produce an
improvement beyond the training set.
For example, a chi-square test can be used to estimate whether expanding a
node would improve performance over the full instance distribution or only on
the current sample of training data.
When encoding the training examples and the decision tree, use an explicit
measure of complexity, halting the tree's growth when this encoding size is
minimized. This method is based on the Minimum Description Length principle,
which is a heuristic.
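As a concrete, hedged illustration of the first two ideas, the scikit-learn sketch below contrasts pre-pruning (stopping growth early via max_depth and min_samples_leaf) with post-pruning (minimal cost-complexity pruning via ccp_alpha, a criterion related in spirit to, but distinct from, MDL). The dataset and parameter values are arbitrary choices for the example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# pre-pruning: stop growing the tree before it fits every training example
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# post-pruning: grow the full tree, then prune it back with cost-complexity pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))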