ML Unit 1
IV B. Tech (CSE)
SUBJECT: MACHINE LEARNING (PROFESSIONAL ELECTIVE – III) (R16)
B.Tech. IV Year I Sem. L T P C
Course Code: CS733PE 3 0 0 3
UNIT - I
Introduction - Well-posed learning problems, designing a learning system, Perspectives and issues in machine learning
Concept learning and the general to specific ordering – introduction, a concept learning task,
concept learning as search, find-S: finding a maximally specific hypothesis, version spaces and the candidate
elimination algorithm, remarks on version spaces and candidate elimination, inductive bias.
Decision Tree Learning – Introduction, decision tree representation, appropriate problems for decision tree learning,
the basic decision tree learning algorithm, hypothesis space search in decision tree learning, inductive bias in decision tree
learning, issues in decision tree learning.
UNIT - II
Artificial Neural Networks-1 – Introduction, neural network representation, appropriate problems for neural network
learning, perceptrons, multilayer networks and the back-propagation algorithm.
Artificial Neural Networks-2- Remarks on the Back-Propagation algorithm, An illustrative example: face recognition,
advanced topics in artificial neural networks.
Evaluating Hypotheses – Motivation, estimating hypothesis accuracy, basics of sampling theory, a general approach for
deriving confidence intervals, difference in error of two hypotheses, comparing learning algorithms.
UNIT - III
Bayesian learning – Introduction, Bayes theorem, Bayes theorem and concept learning, Maximum Likelihood and least
squared error hypotheses, maximum likelihood hypotheses for predicting probabilities, minimum description length
principle, Bayes optimal classifier, Gibbs algorithm, Naïve Bayes classifier, an example: learning to classify text, Bayesian
belief networks, the EM algorithm.
Computational learning theory – Introduction, probably learning an approximately correct hypothesis, sample complexity
for finite hypothesis space, sample complexity for infinite hypothesis spaces, the mistake bound model of learning.
Instance-Based Learning- Introduction, k-nearest neighbour algorithm, locally weighted regression, radial basis
functions, case-based reasoning, remarks on lazy and eager learning.
UNIT- IV
Genetic Algorithms – Motivation, Genetic algorithms, an illustrative example, hypothesis space search, genetic
programming, models of evolution and learning, parallelizing genetic algorithms.
Learning Sets of Rules – Introduction, sequential covering algorithms, learning rule sets: summary, learning First-Order
rules, learning sets of First-Order rules: FOIL, Induction as inverted deduction, inverting resolution.
Reinforcement Learning – Introduction, the learning task, Q-learning, non-deterministic rewards and actions, temporal
difference learning, generalizing from examples, relationship to dynamic programming.
UNIT - V
Analytical Learning-1- Introduction, learning with perfect domain theories: PROLOG-EBG, remarks on explanation-based
learning, explanation-based learning of search control knowledge.
Analytical Learning-2-Using prior knowledge to alter the search objective, using prior knowledge to augment search
operators.
Combining Inductive and Analytical Learning – Motivation, inductive-analytical approaches to learning, using prior
knowledge to initialize the hypothesis.
TEXT BOOK:
1. Machine Learning – Tom M. Mitchell, - MGH
REFERENCE:
1. Machine Learning: An Algorithmic Perspective, Stephen Marsland, Taylor & Francis
UNIT - I
Concept learning and the general to specific ordering
Decision Tree Learning
UNIT - II
Artificial Neural Networks
Evaluating Hypotheses
UNIT - III
Bayesian learning
Computational learning theory
Instance-Based Learning
UNIT- IV
Genetic Algorithms
Learning Sets of Rules
Reinforcement Learning
UNIT - V
Analytical Learning
Combining Inductive and Analytical Learning
Prerequisites
• Data Structures
• Knowledge of statistical methods
Definition: Machine learning is a core sub-area of artificial intelligence that enables computers to
learn and improve from experience without being explicitly programmed.
Machine learning focuses on the development of computer programs that can access data and
use it to learn for themselves. When exposed to new data, these programs are able
to learn, grow, change, and develop on their own.
The process of learning begins with observations or data, such as examples, direct experience,
or instruction, in order to look for patterns in data and make better decisions in the future
based on the examples that we provide.
What does it do? It enables computers or machines to make data-driven decisions rather
than being explicitly programmed to carry out a certain task. These programs or algorithms are
designed in a way that they learn and improve over time as they are exposed to new data.
The primary aim is to allow computers to learn automatically, without human intervention or
assistance, and to adjust their actions accordingly.
As a form of artificial intelligence, machine learning automatically produces models that can analyze bigger, more
complex data and deliver faster, more accurate results – even on a very large scale.
By building precise models, an organization has a better chance of identifying profitable
opportunities – or avoiding unknown risks.
Unsupervised Learning:
Unsupervised learning means that the learning algorithm does not have any labels attached to
supervise the learning. We simply provide the algorithm with a large amount of data and the characteristics of
each observation. Imagine a collection of images of cats and dogs with no labels attached. In
such a case the algorithm itself cannot decide what a face is, but it can divide the data into groups. You
can employ unsupervised learning (e.g. clustering) to separate the images into two groups based on
some inherent features of the pictures like color, size, shape etc.
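To make the grouping idea concrete, here is a minimal sketch of unsupervised grouping: 2-means clustering on one numeric feature. The feature values are invented for illustration; no real cat/dog images are involved.

```python
# 2-means clustering (Lloyd's algorithm, k=2) on a single numeric
# feature, e.g. image size. Values below are made up.

def two_means(values, iterations=10):
    """Partition numbers into two groups by refining two cluster centers."""
    c1, c2 = min(values), max(values)          # initial centers
    for _ in range(iterations):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1 = sum(g1) / len(g1)                 # recompute the centers
        c2 = sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

# Two obvious size groups; the algorithm finds them without any labels.
small, large = two_means([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
print(small)  # -> [0.9, 1.0, 1.2]
print(large)  # -> [4.8, 5.0, 5.3]
```

No labels are used anywhere: the groups emerge purely from the structure of the feature values.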
Reinforcement Learning:
Another well-known class of ML problems is called reinforcement learning. This class of problems focuses
on the end outcome in order to learn. Let us illustrate with the example of learning to play chess. As input to this
problem the ML algorithm receives information about whether a game played was won or lost. So the algorithm does
not have every move in the game labelled as successful or not, but only has the result of the whole
game. The more games the algorithm plays, the more it learns about the winning moves.
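A tiny sketch of this outcome-only learning: each "move" below has an unknown chance of winning the game, the learner sees only win/lose results, and it keeps a running average value per move. The win probabilities are invented for illustration.

```python
# Learning move values from game outcomes only (no per-move labels).
import random

random.seed(0)
win_prob = {"a": 0.2, "b": 0.8}      # hidden from the learner
value = {"a": 0.0, "b": 0.0}         # learned value estimates
count = {"a": 0, "b": 0}

for _ in range(2000):
    move = random.choice(["a", "b"])              # try both moves
    reward = 1 if random.random() < win_prob[move] else 0
    count[move] += 1
    # incremental average: V <- V + (r - V) / n
    value[move] += (reward - value[move]) / count[move]

print(value["b"] > value["a"])  # -> True: "b" wins more often
```

The more games played, the closer each estimate gets to the move's true winning rate, mirroring the chess description above.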
With the constant evolution of the field, there has been a subsequent increase in machine learning
models and their development, driven by demand and by the influence of related fields such as analytics and Big
Data. Machine learning has also changed the way data extraction and interpretation are done, by
involving automatic sets of generic methods that have replaced traditional statistical techniques.
There are mainly two phases in ML: 1. Learning and 2. Prediction.
The Learning phase in turn has three stages:
1. Pre-Processing
2. Learning
3. Error Analysis.
The pre-processing stage works on training data. Training data is the data used to
train an algorithm. Generally, training data is a certain percentage of an overall dataset, with the
rest held out as a testing set. As a rule, the better the training data, the better the algorithm or classifier performs.
Once a model is trained on a training set, it is usually evaluated on a test set. Oftentimes, these sets
are taken from the same overall dataset; the training set should be labeled or enriched to
increase the algorithm's confidence and accuracy.
Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.
Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other
words, whenever the data is gathered from different sources it is collected in raw format which is not
feasible for the analysis.
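Two of these steps, splitting a raw dataset into training and test portions and rescaling a feature to [0, 1], can be sketched in plain Python. The dataset and split fraction below are illustrative assumptions, not a prescribed recipe.

```python
# A simple train/test split plus min-max scaling (two common
# preprocessing steps). The numbers are invented for illustration.
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    rows = rows[:]                        # copy: do not mutate the input
    random.Random(seed).shuffle(rows)     # shuffle before splitting
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

data = [[10], [20], [30], [40], [50], [60], [70], [80]]
train, test = train_test_split(data)
print(len(train), len(test))        # -> 6 2
print(min_max_scale([10, 20, 40]))  # [0.0, 0.333..., 1.0]
```

Scaling parameters (here, min and max) should be computed on the training portion only and then applied to the test portion, so that no information leaks from test to train.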
Learning Task can be accomplished in different ways like Supervised Learning, Unsupervised Learning
and Inferential Learning.
Generally speaking, model performance on training data tends to be optimistic, so error
rates there will be lower than on test data. There are tradeoffs between the types of errors that a
machine learning practitioner must consider and often choose between.
For binary classification problems, there are two primary types of errors: Type 1 errors (false positives)
and Type 2 errors (false negatives). It is often possible, through model selection and tuning, to increase
one while decreasing the other, and often one must choose which error type is more acceptable. This
can be a major tradeoff consideration depending on the situation.
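Counting the two error types is straightforward once actual and predicted labels are side by side. The labels below are invented for illustration.

```python
# Counting Type 1 (false positive) and Type 2 (false negative) errors
# for a binary classifier.

def error_counts(actual, predicted):
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return fp, fn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 1, 0]
fp, fn = error_counts(actual, predicted)
print(fp, fn)  # -> 2 1
```

Raising the classification threshold typically trades false positives for false negatives, which is exactly the tradeoff described above.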
Let us begin our study of machine learning by considering a few learning tasks. Put more precisely,
Definition: A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E.
For example, a computer program that learns to play checkers might improve its performance as
measured by its ability to win at the class of tasks involving playing checkers games, through
experience obtained by playing games against itself.
Let us understand with a checkers-playing program that can generate the legal moves from any board
state. The program needs only to learn how to choose the best move from among these legal moves.
In order to complete the design of the learning system, we must now choose
1. The exact type of knowledge to be learned
2. A representation for this target knowledge
3. A learning mechanism
– Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
– if b is a final board state that is won, then V(b) = 100
– if b is a final board state that is lost, then V(b) = -100
– if b is a final board state that is drawn, then V(b) = 0
– if b is not a final state in the game, then V(b) = V(b'), where b' is the best final
board state that can be achieved starting from b and playing optimally until the
end of the game (assuming the opponent plays optimally, as well).
– This gives a recursive definition for V(b).
– It is not usable in practice because it is not efficient to compute except in the first three trivial
cases: it is a nonoperational definition.
– The goal of learning is to discover an operational description of V. In practice the learner acquires
only an approximation to V, which from now on we refer to as V̂ ("V hat").
– The training error E of V̂ is the sum of squared differences between the training values and the
values predicted by V̂ over the training examples:

E ≡ Σ⟨b, V_train(b)⟩ ∈ training examples (V_train(b) − V̂(b))²
• We require an algorithm that
– will incrementally refine weights as new training examples become available
– will be robust to errors in these estimated training values
• Least Mean Squares (LMS) is one such algorithm
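The LMS rule itself is one line: w_i ← w_i + η (V_train(b) − V̂(b)) x_i. The sketch below applies it to a linear evaluator V̂(b) = w0 + w1·x1 + … + w6·x6; the board features and training value are invented, only the update rule follows the text.

```python
# LMS weight updates for a linear checkers evaluator. Features and
# the training value below are illustrative assumptions.

def v_hat(weights, features):
    # features[0] is a constant 1, so weights[0] plays the role of w0
    return sum(w * x for w, x in zip(weights, features))

def lms_update(weights, features, v_train, eta=0.01):
    # w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i
    error = v_train - v_hat(weights, features)
    return [w + eta * error * x for w, x in zip(weights, features)]

weights = [0.0] * 7
example = ([1, 3, 0, 1, 0, 0, 0], 100)   # (board features, V_train)
for _ in range(200):                     # repeated incremental refinement
    weights = lms_update(weights, *example)

print(round(v_hat(weights, example[0])))  # -> 100
```

Because each update moves the prediction only a fraction of the way toward the training value, occasional noisy training values perturb the weights only slightly, which is the robustness property asked for above.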
Final Design:
• The Critic takes as input the history or trace of the game and produces as output a set of
training examples of the target function.
• The Generalizer takes as input the training examples and produces an output hypothesis that
is its estimate of the target function. In our example, the Generalizer corresponds to the LMS
algorithm, and the output hypothesis is the function V̂ described by the learned weights
w0, . . . , w6.
• The Experiment Generator takes as input the current hypothesis (the currently learned function)
and outputs a new problem (i.e., an initial board state) for the Performance System to explore. Its
role is to pick new practice problems that will maximize the learning rate of the overall system.
• Together, the design choices we made for our checkers program produce specific instantiations
for the Performance System, Critic, Generalizer, and Experiment Generator.
Consider, for example, the instances X and hypotheses H in the EnjoySport learning task. Given
that the attribute Sky has three possible values, and that AirTemp, Humidity, Wind, Water, and
Forecast each have two possible values, the instance space X contains exactly 3 x 2 x 2 x 2 x 2 x 2
= 96 distinct instances.
Now, to classify each of these 96 distinct instances as an exemplar or a non-exemplar of the target
concept, we need a way to represent hypotheses:
•A hypothesis is a vector of constraints for each attribute
– indicate by a "?" that any value is acceptable for this attribute
– specify a single required value for the attribute
– indication by a "Ø" that no value is acceptable
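The counting above can be checked in a few lines. The 96-instance figure is from the text; the hypothesis-space figures (5·4⁵ = 5120 syntactically distinct hypotheses, and 1 + 4·3⁵ = 973 semantically distinct ones) follow Mitchell's standard counting for this representation and are stated here as a cross-check.

```python
# Counting the EnjoySport instance and hypothesis spaces.
attribute_values = [3, 2, 2, 2, 2, 2]   # Sky has 3 values, the rest 2

instances = 1
for k in attribute_values:
    instances *= k                      # 3 * 2^5 = 96

# syntactically, each attribute may also be "?" or "Ø"
syntactic = 1
for k in attribute_values:
    syntactic *= k + 2                  # 5 * 4^5 = 5120

# semantically, any hypothesis containing "Ø" classifies every instance
# as negative (one distinct concept); otherwise each attribute is "?"
# or one of its values
semantic = 1
for k in attribute_values:
    semantic *= k + 1
semantic += 1                           # 4 * 3^5 + 1 = 973

print(instances, syntactic, semantic)   # -> 96 5120 973
```

The gap between 5120 syntactic forms and 973 distinct concepts comes entirely from the "Ø" symbol: every hypothesis containing it denotes the same empty concept.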
If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive example
(h(x) = 1)
• Example hypothesis for ‘EnjoySport’: (also treat this as ‘Good Day for Water Sports’)
– Target concept c: EnjoySport : X → {0,1}
– Training Examples D: Positive or negative examples of the target function
• Determine
– A hypothesis h in H such that h(x) = c(x) for all x in X
Example of a Concept Learning task
Day Sky AirTemp Humidity Wind Water Forecast WaterSport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes
The goal of concept-learning is to find a hypothesis h:X --> {0,1} such that h(x)=c(x) for
all x in X.
• Most General Hypothesis: Every day is a good day for water sports <?,?,?,?,?,?>
• Most Specific Hypothesis: No day is a good day for water sports hs = <Ø, Ø, Ø, Ø, Ø, Ø>
The same training data is shown in the table below.
Example Sky      AirTemp Humidity Wind   Water Forecast EnjoySport
x1      Sunny    Warm    Normal   Strong Warm  Same     Yes
x2      Sunny    Warm    High     Strong Warm  Same     Yes
x3      Rainy    Cold    High     Strong Warm  Change   No
x4      Sunny    Warm    High     Strong Cool  Change   Yes
Table: Set of Training Examples (D)
Find-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   For each attribute constraint ai in h
      IF the constraint ai in h is satisfied by x THEN
         do nothing
      ELSE
         replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
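The algorithm is short enough to implement directly on the EnjoySport training set from the text; here "0" stands for the Ø (no value acceptable) constraint.

```python
# Find-S on the EnjoySport training data.
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]

def find_s(examples):
    h = ["0"] * 6                       # most specific hypothesis
    for x, label in examples:
        if label != "Yes":
            continue                    # Find-S ignores negative examples
        for i, (ai, xi) in enumerate(zip(h, x)):
            if ai == "0":
                h[i] = xi               # first positive: copy the value
            elif ai != xi:
                h[i] = "?"              # generalize a mismatched constraint
    return tuple(h)

print(find_s(examples))
# -> ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```

After T1 the hypothesis is T1 itself; T2 generalizes Humidity to "?"; the negative example T3 is ignored; T4 generalizes Water and Forecast, giving the final hypothesis.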
The FIND-S algorithm illustrates one way in which the more-general than partial ordering can be
used to organize the search for an acceptable hypothesis.
The search moves from hypothesis to hypothesis, searching from the most specific to
progressively more general hypotheses along one chain of the partial ordering. Figure
illustrates this search in terms of the instance and hypothesis spaces.
At each step, the hypothesis is generalized only as far as necessary to cover the new positive
example. Therefore, at each stage the hypothesis is the most specific hypothesis consistent
with the training examples observed up to this point (hence the name FIND-S).
• One limitation of the FIND-S algorithm is that it outputs just one hypothesis
consistent with the training data – there might be many.
• To overcome this, introduce notion of version space and algorithms to compute it.
Definition: Consistent:
A hypothesis h is consistent with a set of training examples D of target concept c if and only if h(x) = c(x)
for each training example ⟨x, c(x)⟩ in D.
• We first make an assumption that restricts the hypotheses to the kind we would like to target; this
assumption, which reduces the large concept space to a smaller target concept space, is called the inductive bias.
General boundary G
Definition: The general boundary G, with respect to hypothesis space H and
training data D, is the set of maximally general members of H consistent with D.
Specific boundary S
Definition: The specific boundary S, with respect to hypothesis space H and
training data D, is the set of minimally general (i.e., maximally specific) members of H consistent with D.
• Represents version space by its most general and most specific members
1: Procedure CandidateEliminationLearner(X, Y, D, H)
2: Inputs
3: X: set of input features, X={X1,...,Xn}
4: Y: target feature
5: D: set of examples from which to learn
6: H: hypothesis space
7: Output
8: general boundary G⊆H consistent with D
9: specific boundary S⊆H consistent with D
10: Local
11: G: set of hypotheses in H
12: S: set of hypotheses in H
13: Let G contain the most general hypothesis in H, and S the most specific hypothesis in H;
14: for each x∈D do
15: if (x is a positive example) then
16: Elements of G that classify x as negative are removed from G;
17: Each element s of S that classifies x as negative is removed and
replaced by the minimal generalizations of s that classify x as positive and are less general
than some member of G;
18: else
19: Elements of S that classify x as positive are removed from S;
20: Each element g of G that classifies x as positive is removed and
replaced by the minimal specializations of g that classify x as negative and are more general
than some member of S;
21: return S and G
Training Examples:
T1: (Sunny,Warm, Normal, Strong,Warm, Same),Yes
T2: (Sunny,Warm, High, Strong,Warm, Same),Yes
T3: (Rainy,Cold, High, Strong,Warm,Change), No
T4: (Sunny,Warm, High, Strong,Cool,Change),Yes
The CANDIDATE-ELIMINATION algorithm computes the version space containing all hypotheses
from H that are consistent with an observed sequence of training examples.
It begins by initializing the version space to the set of all hypotheses in H; that is, by initializing the G
boundary set to contain the most general hypothesis in H:
G0 ← {⟨?, ?, ?, ?, ?, ?⟩}
and initializing the S boundary set to contain the most specific (least general) hypothesis:
S0 ← {⟨Ø, Ø, Ø, Ø, Ø, Ø⟩}
These two boundary sets delimit the entire hypothesis space, because every other hypothesis in H is
both more general than S0 and more specific than G0. As each training example is considered, the S
and G boundary sets are generalized and specialized, respectively, to eliminate from the version space
any hypotheses found inconsistent with the new training example. After all examples have been
processed, the computed version space contains all the hypotheses consistent with these examples
and only these hypotheses.
Notice that the algorithm is specified in terms of operations such as computing minimal
generalizations and specializations of given hypotheses, and identifying nonminimal and nonmaximal
hypotheses. The detailed implementation of these operations will depend, of course, on the specific
representations for instances and hypotheses. However, the algorithm itself can be applied to any
concept learning task and hypothesis space for which these operations are well-defined. In the
following example trace of this algorithm, we see how such operations can be implemented for the
representations used in the EnjoySport example problem.
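For the conjunctive-hypothesis representation used here, those operations are small enough to spell out in full. The sketch below is a working simplification: when specializing G against a negative example it draws candidate attribute values from S, which suffices for this representation but is not the fully general algorithm.

```python
# Candidate Elimination on the EnjoySport data. "?" matches anything;
# "0" (standing for Ø) matches nothing.
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
N = 6

def matches(h, x):
    return all(a in ("?", v) for a, v in zip(h, x))

def more_general(h1, h2):                  # is h1 >=g h2 ?
    return all(a == "?" or a == b for a, b in zip(h1, h2))

S = [("0",) * N]                           # most specific boundary
G = [("?",) * N]                           # most general boundary

for x, positive in examples:
    if positive:
        G = [g for g in G if matches(g, x)]
        new_S = []
        for s in S:
            if matches(s, x):
                new_S.append(s)
                continue
            # minimal generalization of s that covers x
            h = tuple(v if a in ("0", v) else "?" for a, v in zip(s, x))
            if any(more_general(g, h) for g in G):
                new_S.append(h)
        S = new_S
    else:
        S = [s for s in S if not matches(s, x)]
        new_G = []
        for g in G:
            if not matches(g, x):
                new_G.append(g)
                continue
            # minimal specializations of g that exclude x
            for i in range(N):
                if g[i] != "?":
                    continue
                for s in S:
                    if s[i] not in ("?", "0", x[i]):
                        h = g[:i] + (s[i],) + g[i + 1:]
                        if any(more_general(h, s2) for s2 in S):
                            new_G.append(h)
        G = new_G

print(sorted(S))  # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(sorted(G))  # [('?', 'Warm', '?', '?', '?', '?'),
                  #  ('Sunny', '?', '?', '?', '?', '?')]
```

After T3 (the negative example) G temporarily holds three specializations, and T4 then prunes ⟨?, ?, ?, ?, ?, Same⟩, leaving the familiar final boundaries.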
Three learners with increasingly strong inductive bias:
– Rote-Learner: this system simply memorizes the training data and their
classifications --- no generalization is involved.
– Candidate-Elimination: new instances are classified only if all the hypotheses in the
version space agree on the classification.
– Find-S: new instances are classified using the most specific hypothesis consistent with
the training data.
Decision Tree is a supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
Decision Trees usually mimic the human thinking process while making a decision, so they are easy to
understand.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits into
subtrees.
It is a graphical representation for getting all the possible solutions to a problem/decision based on given
conditions.
It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes
are the outputs of those decisions and do not contain any further branches.
Parent/Child node: a node that splits into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are its children.
Decision Tree Representation:
Each non-leaf node is connected to a test that splits its set of possible answers into subsets
corresponding to different test results.
Each branch carries a particular test result's subset to another node.
Each node is connected to a set of possible answers.
Below diagram explains the general structure of a decision tree:
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The
creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can say
that the purity of the node increases with respect to the target variable. The decision tree splits the
nodes on all available variables and then selects the split which results in most homogeneous sub-
nodes.
The algorithm selection is also based on the type of target variables. Let us look at some algorithms
used in Decision Trees:
The ID3 algorithm builds decision trees using a top-down greedy search approach through the space of
possible branches with no backtracking. A greedy algorithm, as the name suggests, always makes the
choice that seems to be the best at that moment.
In a decision tree, to predict the class of a given record, the algorithm starts from the root node
of the tree. It compares the value of the root attribute with the corresponding attribute of the record
and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete process
can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the nodes;
such final nodes are called leaf nodes.
Entropy:
Entropy is a measure of the randomness in the information being
processed. The higher the entropy, the harder it is to draw any
conclusions from that information. Flipping a coin is an example of an
action whose outcome is random. For a set S with class proportions p_i,
Entropy(S) = −Σ p_i log2(p_i).
For a binary variable, the entropy H(X) is zero when the
probability is either 0 or 1. The entropy is maximum when the probability
is 0.5, because that represents perfect randomness in the data and there is no
chance of perfectly determining the outcome.
Information Gain
Information gain measures the reduction in entropy achieved by splitting a set S on an attribute A:
Gain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) Entropy(S_v), where S_v is the subset of S for which A has value v.
ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with
entropy more than zero needs further splitting.
Illustrative Example:
Concept: “Play Tennis”:
Data set:
Day Outlook  Temperature Humidity Wind   PlayTennis
D1  Sunny    Hot         High     Weak   No
D2  Sunny    Hot         High     Strong No
D3  Overcast Hot         High     Weak   Yes
D4  Rain     Mild        High     Weak   Yes
D5  Rain     Cool        Normal   Weak   Yes
D6  Rain     Cool        Normal   Strong No
D7  Overcast Cool        Normal   Strong Yes
D8  Sunny    Mild        High     Weak   No
D9  Sunny    Cool        Normal   Weak   Yes
D10 Rain     Mild        Normal   Weak   Yes
D11 Sunny    Mild        Normal   Strong Yes
D12 Overcast Mild        High     Strong Yes
D13 Overcast Hot         Normal   Weak   Yes
D14 Rain     Mild        High     Strong No
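Using this data set (9 Yes and 5 No labels), the entropy and information-gain formulas can be evaluated directly; the labels and Outlook column below are transcribed from rows D1–D14.

```python
# Entropy and information gain for the PlayTennis data.
from math import log2

labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]       # D1..D14
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain",
           "Overcast", "Sunny", "Sunny", "Rain", "Sunny",
           "Overcast", "Overcast", "Rain"]

def entropy(labels):
    total, result = len(labels), 0.0
    for c in set(labels):
        p = labels.count(c) / total
        result -= p * log2(p)            # -sum p_i log2 p_i
    return result

def information_gain(labels, attribute):
    total, remainder = len(labels), 0.0
    for v in set(attribute):
        subset = [l for l, a in zip(labels, attribute) if a == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

print(round(entropy(labels), 3))                    # -> 0.94
print(round(information_gain(labels, outlook), 3))  # -> 0.247
```

Outlook has the highest gain of the four attributes, which is why ID3 places it at the root of the tree for this data; the Overcast branch is pure (all Yes), so it becomes a leaf immediately.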
"The world is inherently simple. Therefore the smallest decision tree that is consistent with the samples is the one
that is most likely to identify unknown objects correctly."
Rather than building all the possible trees, measuring the size of each, and choosing the smallest tree that
best fits the data, we use Quinlan's ID3 algorithm for constructing a decision tree.