
MACHINE LEARNING
SUBJECT: MACHINE LEARNING (PROFESSIONAL ELECTIVE – III) (R16)
B.Tech. IV Year I Sem. (CSE)    L T P C
Course Code: CS733PE            3 0 0 3
UNIT - I
Introduction - Well-posed learning problems, designing a learning system, Perspectives and issues in machine learning
Concept learning and the general to specific ordering – introduction, a concept learning task,
concept learning as search, find-S: finding a maximally specific hypothesis, version spaces and the candidate
elimination algorithm, remarks on version spaces and candidate elimination, inductive bias.
Decision Tree Learning – Introduction, decision tree representation, appropriate problems for decision tree learning,
the basic decision tree learning algorithm, hypothesis space search in decision tree learning, inductive bias in decision tree
learning, issues in decision tree learning.
UNIT - II
Artificial Neural Networks-1– Introduction, neural network representation, appropriate problems for neural network
learning, perceptrons, multilayer networks and the back-propagation algorithm.
Artificial Neural Networks-2- Remarks on the Back-Propagation algorithm, An illustrative example: face recognition,
advanced topics in artificial neural networks.
Evaluating Hypotheses – Motivation, estimating hypothesis accuracy, basics of sampling theory, a general approach for
deriving confidence intervals, difference in error of two hypotheses, comparing learning algorithms.
UNIT - III
Bayesian learning – Introduction, Bayes theorem, Bayes theorem and concept learning, Maximum Likelihood and least
squared error hypotheses, maximum likelihood hypotheses for predicting probabilities, minimum description length
principle, Bayes optimal classifier, Gibbs algorithm, Naïve Bayes classifier, an example: learning to classify text, Bayesian
belief networks, the EM algorithm.
Computational learning theory – Introduction, probably learning an approximately correct hypothesis, sample complexity
for finite hypothesis space, sample complexity for infinite hypothesis spaces, the mistake bound model of learning.
Instance-Based Learning- Introduction, k-nearest neighbour algorithm, locally weighted regression, radial basis
functions, case-based reasoning, remarks on lazy and eager learning.
UNIT- IV
Genetic Algorithms – Motivation, Genetic algorithms, an illustrative example, hypothesis space search, genetic
programming, models of evolution and learning, parallelizing genetic algorithms.
Learning Sets of Rules – Introduction, sequential covering algorithms, learning rule sets: summary, learning First-Order
rules, learning sets of First-Order rules: FOIL, Induction as inverted deduction, inverting resolution.
Reinforcement Learning – Introduction, the learning task, Q-learning, non-deterministic rewards and actions, temporal
difference learning, generalizing from examples, relationship to dynamic programming.
UNIT - V
Analytical Learning-1- Introduction, learning with perfect domain theories: PROLOG-EBG, remarks on explanation-based
learning, explanation-based learning of search control knowledge.
Analytical Learning-2-Using prior knowledge to alter the search objective, using prior knowledge to augment search
operators.
Combining Inductive and Analytical Learning – Motivation, inductive-analytical approaches to learning, using prior
knowledge to initialize the hypothesis.

TEXT BOOK:
1. Machine Learning – Tom M. Mitchell, MGH

REFERENCE:
1. Machine Learning: An Algorithmic Perspective, Stephen Marsland, Taylor & Francis

COURSE OUTCOMES:

1. Understand the concepts of concept learning in computational intelligence.
2. Understand neural networks and their usage.
3. Acquire the skill to apply machine learning techniques to computational and instance-based learning problems.
4. Become familiar with the genetic algorithmic approach and its complexity in machine learning.
5. Acquire the skill to analyze the learning of search control knowledge.

UNIT - I
Concept learning and the general to specific ordering
Decision Tree Learning
UNIT - II
Artificial Neural Networks
Evaluating Hypotheses
UNIT - III
Bayesian learning
Computational learning theory
Instance-Based Learning
UNIT- IV
Genetic Algorithms
Learning Sets of Rules
Reinforcement Learning

UNIT - V
Analytical Learning
Combining Inductive and Analytical Learning

TEXT BOOK:
1. Machine Learning – Tom M. Mitchell, MGH

REFERENCE:
1. Machine Learning: An Algorithmic Perspective, Stephen Marsland, Taylor & Francis

Prerequisites

• Data Structures
• Knowledge of statistical methods


What is Machine Learning?

Definition: Machine learning is a core sub-area of artificial intelligence that enables computers to learn from experience and improve their performance without being explicitly programmed.
 Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves; when exposed to new data, these programs are able to learn, grow, change, and develop on their own.
 The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide.

What does it do? It enables computers or machines to make data-driven decisions rather than being explicitly programmed to carry out a certain task. These programs or algorithms are designed in such a way that they learn and improve over time when they are exposed to new data.


Why Machine Learning is so important?

 The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.
 As a form of artificial intelligence, machine learning automatically produces models that can analyze bigger, more complex data and deliver faster, more accurate results – even on a very large scale.
 By building precise models, an organization has a better chance of identifying profitable opportunities – or avoiding unknown risks.

 To better understand the uses of machine learning, consider some of the instances where machine learning is applied: the self-driving Google car, cyber fraud detection, and online recommendation engines such as friend suggestions on Facebook and Netflix recommendations.
 By making it possible to quickly, cheaply and automatically process and analyze huge volumes
of complex data, machine learning is critical to countless new and future applications.

Types of Machine Learning:


Supervised Learning:
Supervised learning relies on data where the true class of each example is known. For example, suppose we want to teach the computer to distinguish between pictures of cats and dogs. We run the algorithm on lots of pictures of cats and dogs and, in order to supervise the algorithm in learning the right way to classify images, we label the pictures as cats or dogs. Once our algorithm learns how to classify images, we can use it on new data and predict labels (cat or dog in our case) for unseen images.

Unsupervised Learning:
Unsupervised learning means that the learning algorithm does not have any labels attached to supervise the learning. We just provide the algorithm with a large amount of data and the characteristics of each observation. Imagine there are no labels on the images of cats and dogs in the above example. In such a case the algorithm itself cannot decide what a cat or a dog is, but it can divide the data into groups. You can employ unsupervised learning (e.g., clustering) to separate the images into two groups based on some inherent features of the pictures like color, size, shape, etc.
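To make the difference concrete, here is a minimal sketch (assuming scikit-learn is installed; the tiny two-feature dataset and its values are made up purely for illustration) that trains a supervised classifier on labelled cat/dog examples and, separately, clusters the same points without any labels:

# Minimal sketch contrasting supervised and unsupervised learning.
# Assumes scikit-learn is installed; the toy dataset is illustrative only.
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Each row is an image summarized by two made-up features.
X = [[1.0, 0.5], [1.2, 0.4], [0.9, 0.6],   # cat-like examples
     [3.0, 2.5], [3.2, 2.8], [2.9, 2.4]]   # dog-like examples

# Supervised: labels are provided, and the model learns to predict them.
y = ["cat", "cat", "cat", "dog", "dog", "dog"]
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1.1, 0.5]]))           # -> ['cat'] for an unseen image

# Unsupervised: no labels; the algorithm only groups similar points together.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(clusters)                            # e.g. [0 0 0 1 1 1] (group ids, not class names)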

Reinforcement Learning:
Another well-known class of ML problems is called reinforcement learning. This class of problems focuses on learning from the end outcome. Let us illustrate with the example of learning to play chess. As input, the ML algorithm receives information about whether a game played was won or lost. So the algorithm does not have every move in the game labelled as successful or not; it only has the result of the whole game. The more games the algorithm plays, the more it learns about the winning moves.

Machine Learning Process Flow:

With the constant evolution of the field of machine learning, there has been a subsequent increase in models and development, driven by demand and by the influence of other fields like analytics and Big Data. Machine learning has also changed the way data extraction and interpretation are done, by involving automatic sets of generic methods that have replaced traditional statistical techniques.

The process flow depicted here represents how machine learning works: whatever data has to be processed for machine learning, it is important to build a model.

There are mainly two phases in ML: 1. Learning and 2. Prediction.
The Learning phase has three stages:
1. Pre-processing
2. Learning
3. Error analysis
Pre-processing operates on the training data. Training data is used to train an algorithm; generally, the training data is a certain percentage of an overall dataset, with the remainder used as the testing set. As a rule, the better the training data, the better the algorithm or classifier performs.

Once a model is trained on a training set, it is usually evaluated on a test set. Oftentimes these sets are taken from the same overall dataset, though the training set should be labeled or enriched to increase the algorithm's confidence and accuracy.

Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.
Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other
words, whenever the data is gathered from different sources it is collected in raw format which is not
feasible for the analysis.
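As a small illustration (assuming scikit-learn and NumPy are available; the data here is random placeholder data), the split into training and testing sets and a simple pre-processing step might look like this:

# Minimal sketch of a train/test split plus a basic pre-processing step.
# Assumes scikit-learn and NumPy; the data values are arbitrary placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 4)          # 100 raw instances with 4 features each
y = np.random.randint(0, 2, 100)    # binary class labels

# Hold out 25% of the overall dataset as the testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Pre-processing: fit the scaler on the training data only,
# then apply the same transformation to the test data.
scaler = StandardScaler().fit(X_train)
X_train_clean = scaler.transform(X_train)
X_test_clean = scaler.transform(X_test)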

Learning: Choosing a model


The next step in our workflow is choosing a
model. There are many models that researchers
and data scientists have created over the years.
Some are very well suited for image data, others
for sequences (like text, or music), some for
numerical data and others for text-based data.

Learning tasks can be accomplished in different ways, such as supervised learning, unsupervised learning, and reinforcement learning.

Error Analysis and Tradeoffs


There are multiple types of errors associated with machine learning and predictive analytics. The primary types are in-sample and out-of-sample errors. In-sample errors (also known as re-substitution errors) are the error rate found on the training data, i.e., the data used to build predictive models; out-of-sample errors are the error rate found on new, unseen data.

One very important point to note is that prediction performance and error analysis should only be done on test data, when evaluating a model for use on non-training or new data (out-of-sample).

Generally speaking, model performance on training data tends to be optimistic, and therefore the measured errors will be smaller than those found on test data. There are tradeoffs between the types of errors that a machine learning practitioner must consider and often choose to accept.

For binary classification problems, there are two primary types of errors: Type 1 errors (false positives) and Type 2 errors (false negatives). It is often possible, through model selection and tuning, to decrease one at the cost of increasing the other, and often one must choose which error type is more acceptable. This can be a major tradeoff consideration depending on the situation.
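As a small illustration (plain Python; the label vectors are made up), the two error types can be counted from a classifier's predictions on test data:

# Minimal sketch: counting Type 1 (false positive) and Type 2 (false negative)
# errors for a binary classifier. The labels below are made up for illustration.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # true classes (1 = positive, 0 = negative)
predicted = [1, 1, 0, 1, 0, 0, 1, 1]   # classifier output on the test data

false_positives = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # Type 1
false_negatives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # Type 2

print("Type 1 errors (false positives):", false_positives)  # 2
print("Type 2 errors (false negatives):", false_negatives)  # 1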

In Phase 2, the trained model is ready for testing on data and interpreting the computations. The model takes the test data and processes it for prediction; depending on the model, it produces the prediction data and results.

Some successful applications of Machine Learning:

Learning to recognize spoken words:


All of the most successful speech recognition systems employ machine learning in some form.
 For example, the SPHINX system (e.g., Lee 1989) learns speaker-specific strategies for
recognizing the primitive sounds (phonemes) and words from the observed speech signal.
 Neural network learning methods (e.g., Waibel et al. 1989) and methods for learning hidden
Markov models (e.g., Lee 1989) are effective for automatically customizing to individual
speakers, vocabularies, microphone characteristics, background noise, etc. Similar techniques
have potential applications in many signal-interpretation problems.

Learning to drive an autonomous vehicle:


Machine learning methods have been used to train computer-controlled vehicles to steer correctly
when driving on a variety of road types.
 For example, the ALVINN system (Pomerleau 1989) has used its learned strategies to drive
unassisted at 70 miles per hour for 90 miles on public highways among other cars.
 Similar techniques have possible applications in many sensor-based control problems.

Learning to classify new astronomical structures:


Machine learning methods have been applied to a variety of large databases to learn general
regularities implicit in the data.
 For example, decision tree learning algorithms have been used by NASA to learn how to classify
celestial objects from the second Palomar Observatory Sky Survey (Fayyad et al. 1995).
 This system is now used to automatically classify all objects in the Sky Survey, which consists of three terabytes of image data.

Learning to play world-class backgammon:


The most successful computer programs for playing games such as backgammon are based on machine learning algorithms.
 For example, the world's top computer program for backgammon, TD-GAMMON (Tesauro 1992, 1995), learned its strategy by playing over one million practice games against itself.
 It now plays at a level competitive with the human world champion. Similar techniques have
applications in many practical problems where very large search spaces must be examined
efficiently.

WELL-POSED LEARNING PROBLEMS

Let us begin our study of machine learning by considering a few learning tasks. Put more precisely,

 Definition: A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E.

For example, a computer program that learns to play checkers might improve its performance as
measured by its ability to win at the class of tasks involving playing checkers games, through
experience obtained by playing games against itself.

A checkers learning problem:


1. Task T: playing checkers
2. Performance measure P: percent of games won against opponents
3. Training experience E: playing practice games against itself
We can specify many learning problems in this fashion, such as learning to recognize handwritten
words, or learning to drive a robotic automobile autonomously.

A handwriting recognition learning problem:


1. Task T: recognizing and classifying handwritten words within images
2. Performance measure P: percent of words correctly classified
3. Training experience E: a database of handwritten words with given classifications

A robot driving learning problem:


1. Task T: Driving on public four-lane highways using vision sensors
2. Performance measure P: Average distance traveled before an error (as judged by human
overseer)
3. Training experience E: A sequence of images and steering commands recorded while
observing a human driver

Choosing the Target Function:


In order to complete the design of the learning system, we must now choose
1. the exact type of knowledge to be learned
2. a representation for this target knowledge
3. a learning mechanism

Will the training experience provide direct or indirect feedback?


• Direct Feedback: the system learns from examples of individual checkers board states and the correct move for each
• Indirect Feedback: move sequences and final outcomes of various games played
• Credit assignment problem: the value of early states must be inferred from the outcome

Degree to which the learner controls the sequence of training examples


• Teacher selects informative boards and gives correct move
• Learner proposes board states that it finds particularly confusing. Teacher provides correct
moves
• Learner controls board states and (indirect) training classifications

Let us understand with a checkers-playing program that can generate the legal moves from any board
state. The program needs only to learn how to choose the best move from among these legal moves.
In order to complete the design of the learning system, we must now choose
1. The exact type of knowledge to be learned
2. A representation for this target knowledge
3. A learning mechanism

A CHECKERS LEARNING PROBLEM:

Assume that you can determine the legal moves, and the program needs to learn the best move from among the legal moves. Then we have a large search space known a priori, and the target function can be defined as:
 ChooseMove : B → M (mapping any board state in B to a legal move in M)
 ChooseMove is difficult to learn given indirect training

We use credit assignment for the board sequences


Alternative target function
– An evaluation function that assigns a numerical score to any given board state
– V : B → R ( where R is the set of real numbers)

– Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
– if b is a final board state that is won, then V(b) = 100
– if b is a final board state that is lost, then V(b) = -100
– if b is a final board state that is drawn, then V(b) = 0
– if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game (assuming the opponent plays optimally, as well)
– V(b) gives a recursive definition for board state b
– This is not usable because it is not efficient to compute except in the first three trivial cases; it is a nonoperational definition
– The goal of learning is to discover an operational description of V
– From now on, let us refer to this learned (operational) approximation of V as V^ (V-hat)

• Use linear combination of the following board features:


– x1: the number of black pieces on the board
– x2: the number of red pieces on the board
– x3: the number of black kings on the board
– x4: the number of red kings on the board
– x5: the number of black pieces threatened by red
(i.e. which can be captured on red's next turn)
– x6: the number of red pieces threatened by black

Estimating Training Values


• Need to assign specific scores to intermediate board states
• Approximate the training value of an intermediate board state b using the learner's current approximation applied to the next board state following b:
  V_train(b) ← V^(Successor(b))
Adjusting the Weights:
• Choose the weights wi to best fit the set of training examples
• Minimize the squared error E between the training values and the values predicted by the hypothesis:
  E ≡ Σ over ⟨b, V_train(b)⟩ ∈ training examples of ( V_train(b) − V^(b) )²
• Require an algorithm that
  – will incrementally refine the weights as new training examples become available
  – will be robust to errors in these estimated training values
• Least Mean Squares (LMS) is one such algorithm (see the sketch below)
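As a small illustration, here is a minimal sketch (plain Python; the learning rate eta, the initial weights, and the example feature values are arbitrary choices) of the linear evaluation function V^ and one LMS weight update:

# Minimal sketch of the checkers evaluation function and the LMS weight update.
# A board is summarized by the six features x1..x6 defined above; eta and the
# initial weights are arbitrary illustrative choices.

def v_hat(weights, features):
    """V^(b) = w0 + w1*x1 + ... + w6*x6 (linear combination of board features)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, eta=0.1):
    """One LMS step: wi <- wi + eta * (V_train(b) - V^(b)) * xi."""
    error = v_train - v_hat(weights, features)
    new_weights = [weights[0] + eta * error]                  # bias term (x0 = 1)
    new_weights += [w + eta * error * x for w, x in zip(weights[1:], features)]
    return new_weights

# Example: one training pair <board features, estimated training value>.
weights = [0.0] * 7                       # w0..w6 start at zero
board = [6, 5, 1, 0, 2, 1]                # x1..x6 for some intermediate board
weights = lms_update(weights, board, v_train=45.0)
print(weights)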

Final Design:


• The Critic takes as input the history or trace of the game and produces as output a set of
training examples of the target function.
• The Generalizer takes as input the training examples and produces an output hypothesis that is its estimate of the target function. In our example, the Generalizer corresponds to the LMS algorithm, and the output hypothesis is the function V^ described by the learned weights w0, . . . , w6.
• The Experiment Generator takes as input the current hypothesis (currently learned function)
and outputs a new problem (i.e., initial board state) for the Performance System to explore. Its
role is to pick new practice problems that will maximize the learning rate of the overall system.

Summary of Design Choices:

• Together, the design choices we made for our checkers program produce specific instantiations for the Performance System, Critic, Generalizer, and Experiment Generator. Many machine learning systems can be usefully characterized in terms of these four generic modules.

Difference between Large Concept Space and Target Space.


Answer:
Concept learning can be viewed as the task of searching through a large space of hypotheses. Let X be an arbitrary collection of well-defined objects or instances. The set of all possible concepts that can be defined over X is called the large concept space. For any given concept and any set of training examples, there exists a specific set of hypotheses consistent with those training examples; this we call the target space. To illustrate, consider the following example.
Suppose we have 4 attributes, each Boolean-valued, and the concept is "Malignant Tumor". For a given object, consider the following features:
Shape: Circular / Oval        Size: Small / Large
Color: Dark / Light           Surface: Smooth / Irregular
Then the number of possible instances (objects) that can be made from the above Boolean-valued attributes is 2 x 2 x 2 x 2 = 2^4 = 16. Hence there are 16 possible distinct featured objects.
The number of ways all of these objects can be divided into two subsets gives the number of possible concepts: 2^16 = 65,536. This is the size of the large concept space.
That is, if we have d binary features, the total number of possible concepts is 2^(2^d).
For a conjunctive concept (hypothesis), we choose one constraint for each attribute, of the form __ ∧ __ ∧ __ ∧ __.
How many conjunctive concepts can there be with 4 binary features? For each attribute we can require its first value, require its second value, or leave it unconstrained ('?'), giving 3 choices per attribute, plus one hypothesis that rejects everything.
So the number of conjunctive hypotheses is 3^d + 1 = 3^4 + 1 = 82 (compare with the 2^(2^d) = 65,536 possible concepts): this is the size of the target space. We have made our problem much simpler.
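The counting above can be checked directly; the short sketch below (plain Python) just evaluates the same expressions, including the corresponding numbers for the EnjoySport task used later in this unit:

# Minimal sketch verifying the counting argument above.
d = 4                                    # number of Boolean-valued attributes

instances = 2 ** d                       # distinct objects: 2^4 = 16
concept_space = 2 ** instances           # all ways to label them: 2^16 = 65536
target_space = 3 ** d + 1                # conjunctive hypotheses: 3^4 + 1 = 82
print(instances, concept_space, target_space)   # 16 65536 82

# The same counting for the EnjoySport task: Sky has 3 values,
# the other five attributes have 2 values each.
enjoy_sport_instances = 3 * 2 * 2 * 2 * 2 * 2    # 96 distinct instances
enjoy_sport_concepts = 2 ** enjoy_sport_instances  # 2^96 possible concepts
print(enjoy_sport_instances)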

PERSPECTIVES AND ISSUES IN MACHINE LEARNING


PERSPECTIVES: (VIEWS)
• One useful perspective on machine learning is that it involves searching a very large space of
possible hypotheses to determine one that best fits the observed data and any prior knowledge
held by the learner.
• For example, in the checkers learning problem, this hypothesis space consists of all evaluation functions that can be represented by some choice of values for the weights w0 through w6.
• The LMS algorithm for fitting weights achieves this goal by iteratively tuning the weights,
adding a correction to each weight each time the hypothesized evaluation function predicts a
value that differs from the training value.

ISSUES IN MACHINE LEARNING:


Our checkers example raises a number of generic questions about machine learning. The field of
machine learning, and much of this book, is concerned with answering questions such as the following:
• What algorithms exist for learning general target functions from specific training examples?
• In what settings will particular algorithms converge to the desired function, given sufficient
training data?
• Which algorithms perform best for which types of problems and representations?
• How much training data is sufficient?
• What general bounds can be found to relate the confidence in learned hypotheses to the
amount of training experience in the learner's hypothesis space?
• Can prior knowledge be helpful even when it is only approximately correct?
• What is the best strategy for choosing a useful next training experience, and how does the
choice of this strategy alter the complexity of the learning problem?
• What is the best way to reduce the learning task to one or more function approximation
problems?
• How can the learner automatically alter its representation to improve its ability to represent
and learn the target function?

CONCEPT LEARNING AND THE GENERAL-TO-SPECIFIC 0RDERING
What is a Concept and Concept Learning?
• Concept learning, also known as category
learning, concept attainment, and concept formation,
can be defined as
"The search for and listing of attributes that can be
used to distinguish exemplars from non-exemplars of various
categories".
• A Concept is a subset of objects or events defined over a larger set
– Example: The concept of a bird is the subset of all objects (i.e., the set of all things or all
animals) that belong to the category of bird.
• Alternatively, a concept is a Boolean-valued function defined over this larger set, i.e., the learner acquires the ability to categorize whether a well-defined object belongs to a concept (category) or not.
• Example: a function defined over all animals whose value is true for birds and false for every
other animal.

Concept Learning - Representation: Given a set of examples labeled as members or non-members of a concept, concept learning consists of approximating a Boolean-valued function from training examples of its input and output.

Example of a Concept Learning task:


 Concept: Good Days for Water Sports (values: Yes, No)
 Attributes/Features:
 Sky (values: Sunny, Cloudy, Rainy)
 AirTemp (values: Warm, Cold)
 Humidity (values: Normal, High)
 Wind (values: Strong, Weak)
 Water (Warm, Cool)
 Forecast (values: Same, Change)
 Example of a Training Point: (instance)
<Sunny, Warm, High, Strong, Warm, Same, Yes> # here ‘Yes’ is a class label

Concept Learning as Search:


 Concept Learning can be viewed as the task of searching through a large space of hypotheses
implicitly defined by the hypothesis representation.
 Selecting a Hypothesis Representation is an important step since it restricts (or biases) the space
that can be searched.
 [For example, the hypothesis “If the air temperature is cold or the humidity high then it is a good
day for water sports” cannot be expressed in our chosen representation.]
 The goal of this search is to find the hypothesis that best fits the training examples. It is important
to note that by selecting a hypothesis representation, the designer of the learning algorithm
implicitly defines the space of all hypotheses that the program can ever represent and therefore
can ever learn.

 Consider, for example, the instances X and hypotheses H in the EnjoySport learning task. Given
that the attribute Sky has three possible values, and that AirTemp, Humidity, Wind, Water, and
Forecast each have two possible values, the instance space X contains exactly 3 x 2 x 2 x 2 x 2 x 2
= 96 distinct instances.
 Now, to label each of these 96 instances as an exemplar or a non-exemplar of a concept, the number of possible ways these 96 distinct instances can be divided into two groups (Yes & No) is 2^96. (Think of it as the number of subsets that can be generated from a set of d elements, which is 2^d.)
 Hence the possible size of the concept space for 'Good Days for Water Sports' is 2^96. This makes obvious the size and complexity of the problem.
 One important point to understand here is that only one of these 2^96 possible concepts is the target concept consistent with D (the set of training examples), and finding a hypothesis h for it is in fact our goal.

Hypotheses:
• A hypothesis is a vector of constraints, one for each attribute:
– indicate by a "?" that any value is acceptable for this attribute
– specify a single required value for the attribute
– indicate by a "Ø" that no value is acceptable
If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive example
(h(x) = 1)
• Example hypothesis for ‘EnjoySport’: (also treat this as ‘Good Day for Water Sports’)
– Target concept c: EnjoySport : X → {0,1}
– Training Examples D: Positive or negative examples of the target function
• Determine
– A hypothesis h in H such that h(x) = c(x) for all x in X
Example of a Concept Learning task
Day Sky AirTemp Humidity Wind Water Forecast WaterSport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes

Chosen Hypothesis Representation:


Conjunction of constraints on each attribute where:
• “?” means “any value is acceptable”
• “0” means “no value is acceptable”

Example of a hypothesis: <?,Cold,High,?,?,?>


(If the air temperature is cold and the humidity high then it is a good day for water sports)

Terminology and Notation:


• The set of items over which the concept is defined is called the set of instances (denoted by X)
• The concept to be learned is called the Target Concept (denoted by c: X--> {0,1})
• The set of Training Examples is a set of instances, x, along with their target concept value
c(x).
• Members of the concept (instances for which c(x)=1) are called positive examples.
• Nonmembers of the concept (instances for which c(x)=0) are called negative examples.
• H represents the set of all possible hypotheses. H is determined by the human designer’s
choice of a hypothesis representation.

Example of a Concept Learning task:


Goal: To infer the “best” concept-description from the set of all possible hypotheses (“best” means
“which best generalizes to all (known or unknown) elements of the instance space”. Concept-learning
is an ill-defined task.

The goal of concept-learning is to find a hypothesis h:X --> {0,1} such that h(x)=c(x) for
all x in X.
• Most General Hypothesis: Every day is a good day for water sports <?,?,?,?,?,?>
• Most Specific Hypothesis: No day is a good day for water sports hs = <Ø, Ø, Ø, Ø, Ø, Ø>

General to Specific Ordering of Hypotheses (PARTIAL ORDERING):
“more-general-than-or-equal-to”:
 Many algorithms for concept learning organize the search through the hypothesis space by relying
on a very useful structure that exists for any concept learning problem: a general-to-specific
ordering of hypotheses.
 By taking advantage of this naturally occurring structure over the hypothesis space, we can design
learning algorithms that exhaustively search even infinite hypothesis spaces without explicitly
enumerating every hypothesis.

Definition: Let hj and hk be Boolean-valued functions defined over X. Then hj is more-general-than-or-equal-to hk iff for all x in X, (hk(x) = 1) → (hj(x) = 1).
• Example:
– Every instance that is classified as positive by h1 will also be classified as positive by h2 in our example data set. Therefore h2 is more general than h1.
ALGORITHM:
1. Start with h = Ø
2. Use the next input {x, c(x)}
3. If c(x) = 0, go to step 2
4. h ← h ∧ x (pairwise-and **)
5. If more samples remain: go to step 2
6. Stop
** The pairwise-and rules have to be specified.

• “Given hypothesis hj and hk, hj is more_general_than_or_equal_to hk if and only if any instance


that satisfies hk also satisfies hj
• One learning method is to determine the most specific hypothesis that matches all the training
data.
• We also use the ideas of “strictly”-more-general-than, and more-specific-than (illustration)

FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS

To be more precise about how the partial ordering is used, consider the FIND-S algorithm defined below.

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
x1       Sunny  Warm     Normal    Strong  Warm   Same      Yes
x2       Sunny  Warm     High      Strong  Warm   Same      Yes
x3       Rainy  Cold     High      Strong  Warm   Change    No
x4       Sunny  Warm     High      Strong  Cool   Change    Yes
Table: Set of Training Examples (D)

Find-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
     For each attribute constraint ai in h:
       IF the constraint ai in h is satisfied by x THEN do nothing
       ELSE replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h

 The first step of FIND-S is to initialize h to the most specific hypothesis in H.


hs = <Ø, Ø, Ø, Ø, Ø, Ø>
 Upon observing the first training example from the table, which happens to be a positive example, it becomes clear that our hypothesis is too specific.
h1 <- <Sunny, Warm, Normal, Strong, Warm, Same>
 This h is still very specific; it asserts that all instances are negative except for the single
positive training example we have observed. Next, the second training example (also positive
in this case) forces the algorithm to further generalize h, this time substituting a "?' in place of
any attribute value in h that is not satisfied by the new example.
The refined hypothesis in this case is h <- <Sunny, Warm, ?, Strong, Warm, Same>

VIGNAN – VITS – CSE Page 15


IV B. Tech (CSE)
MACHINE LEARNING
 Upon encountering the third training example-in this case a negative example-the algorithm
makes no change to h. In fact, the FIND-S algorithm simply ignores every negative example!
While this may at first seem strange, notice that in the current case our hypothesis h is already
consistent with the new negative example (i.e., h correctly classifies this example as negative),
and hence no revision is needed.
 To see why, recall that the current hypothesis h is the most specific hypothesis in H consistent
with the observed positive examples. But the target concept c will never cover a negative
example, thus neither will h (by the definition of more-general-than). Therefore, no revision to h
will be required in response to any negative example.
 To complete our trace of FIND-S, the fourth (positive) example leads to a further generalization
of h
h <- <Sunny, Warm, ?, Strong, ?, ?>

 The FIND-S algorithm illustrates one way in which the more-general than partial ordering can be
used to organize the search for an acceptable hypothesis.
 The search moves from hypothesis to hypothesis, searching from the most specific to
progressively more general hypotheses along one chain of the partial ordering. Figure
illustrates this search in terms of the instance and hypothesis spaces.
 At each step, the hypothesis is generalized only as far as necessary to cover the new positive example. Therefore, at each stage the hypothesis is the most specific hypothesis consistent with the training examples observed up to this point (hence the name FIND-S).
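The trace above can be reproduced with a short program. Here is a minimal Python sketch of FIND-S on the same training set (a hypothesis is a list of attribute constraints, with '?' for any value and 'Ø' for no value, as in the notes):

# Minimal sketch of the FIND-S algorithm on the EnjoySport training data above.
# A hypothesis is a list of attribute constraints: '?' = any value, 'Ø' = no value.

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]

def find_s(examples, n_attributes=6):
    h = ["Ø"] * n_attributes                  # most specific hypothesis
    for x, label in examples:
        if label != "Yes":                    # FIND-S ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] == "Ø":                   # first positive example: copy the instance
                h[i] = value
            elif h[i] != value:               # conflicting value: generalize to '?'
                h[i] = "?"
    return h

print(find_s(examples))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']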

Version Spaces and The Candidate-Elimination Algorithm:

• One limitation of the FIND-S algorithm is that it outputs just one hypothesis
consistent with the training data – there might be many.

• To overcome this, introduce notion of version space and algorithms to compute it.

Definition: Consistent:
A hypothesis h is consistent with a set of training examples D of target concept c if and only if h(x) = c(x) for each training example ⟨x, c(x)⟩ in D.

Consistent(h, D) ≡ (∀⟨x, c(x)⟩ ∈ D) h(x) = c(x)

Definition: Version Space:
The version space, VSH,D, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with all training examples in D.

VSH,D ≡ {h ∈ H | Consistent(h, D)}

Note difference between definitions of consistent and satisfies:

– an example x satisfies hypothesis h when h(x) = 1, regardless of whether x is +ve or


−ve example of target concept

– an example x is consistent with hypothesis h iff h(x) = c(x)

Definition: Inductive bias:

• Inductive bias is the set of assumptions the learner makes in order to reduce the large concept space to a small target concept space.

• We need to make such assumptions because experience alone doesn't allow us to draw conclusions about unseen data instances.

Two types of Bias:

• - Restriction: Limit the hypothesis space

• - Preference: Impose ordering on hypothesis space

A Compact Representation for Version Spaces

Instead of enumerating all the hypotheses consistent with a training set, we


can represent its most specific and most general boundaries. The hypotheses
included in-between these two boundaries can be generated as needed.

More compact representation for version spaces:

General boundary G
Definition: The general boundary G, with respect to hypothesis space H and training data D, is the set of maximally general members of H consistent with D:

G ≡ {g ∈ H | Consistent(g, D) ∧ (¬∃ g' ∈ H) [(g' >g g) ∧ Consistent(g', D)]}

Specific boundary S
Definition: The specific boundary S, with respect to hypothesis space H and training data D, is the set of minimally general (i.e., maximally specific) members of H consistent with D:

S ≡ {s ∈ H | Consistent(s, D) ∧ (¬∃ s' ∈ H) [(s >g s') ∧ Consistent(s', D)]}

Version space redefined with S and G (version space representation theorem):

VSH,D = {h ∈ H | (∃s ∈ S)(∃g ∈ G)(g ≥g h ≥g s)}

The CANDIDATE-ELIMINATION Algorithm:

“ The candidate-Elimination algorithm computes the version space containing


all (and only those) hypotheses from H that are consistent with an observed
sequence of training examples.”

• The Candidate-Elimination algorithm is similar to List-Then-Eliminate algorithm but


uses a more compact representation of version space.

• Represents version space by its most general and most specific members

1: Procedure CandidateEliminationLearner(X, Y, D, H)
2: Inputs
3: X: set of input features, X = {X1, ..., Xn}
4: Y: target feature
5: D: set of examples from which to learn
6: H: hypothesis space
7: Output
8: general boundary G ⊆ H
9: specific boundary S ⊆ H consistent with D
10: Local
11: G: set of hypotheses in H
12: S: set of hypotheses in H
13: Let G contain the most general hypothesis and S the most specific hypothesis;
14: for each example e ∈ D do
15: if (e is a positive example) then
16: Elements of G that classify e as negative are removed from G;
17: Each element s of S that classifies e as negative is removed and replaced by the minimal generalizations of s that classify e as positive and are less general than some member of G;
18: Hypotheses in S that are more general than another member of S are removed;
19: else
20: Elements of S that classify e as positive are removed from S;
21: Each element g of G that classifies e as positive is removed and replaced by the minimal specializations of g that classify e as negative and are more general than some member of S;
22: Hypotheses in G that are more specific than another member of G are removed.

Training Examples:
T1: (Sunny,Warm, Normal, Strong,Warm, Same),Yes
T2: (Sunny,Warm, High, Strong,Warm, Same),Yes
T3: (Rainy,Cold, High, Strong,Warm,Change), No
T4: (Sunny,Warm, High, Strong,Cool,Change),Yes

The CANDIDATE-ELIMINATION algorithm computes the version space containing all hypotheses
from H that are consistent with an observed sequence of training examples.
It begins by initializing the version space to the set of all hypotheses in H; that is, by initializing the G
boundary set to contain the most general hypothesis in H
G0 ← {<?, ?, ?, ?, ?, ?>}
and by initializing the S boundary set to contain the most specific (least general) hypothesis
S0 ← {<Ø, Ø, Ø, Ø, Ø, Ø>}
These two boundary sets delimit the entire hypothesis space, because every other hypothesis in H is
both more general than S0 and more specific than G0. As each training example is considered, the S
and G boundary sets are generalized and specialized, respectively, to eliminate from the version space
any hypotheses found inconsistent with the new training example. After all examples have been
processed, the computed version space contains all the hypotheses consistent with these examples
and only these hypotheses.

Notice that the algorithm is specified in terms of operations such as computing minimal generalizations and specializations of given hypotheses, and identifying non-minimal and non-maximal
hypotheses. The detailed implementation of these operations will depend, of course, on the specific
representations for instances and hypotheses. However, the algorithm itself can be applied to any
concept learning task and hypothesis space for which these operations are well-defined. In the
following example trace of this algorithm, we see how such operations can be implemented for the
representations used in the Enjoy-Sport example problem.
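As an illustration, here is a minimal Python sketch of the algorithm for the conjunctive hypothesis representation used in this unit ('?' means any value; the attribute domains follow the EnjoySport task). Running it on the four training examples above produces the final specific and general boundaries for the EnjoySport data.

# Minimal sketch of CANDIDATE-ELIMINATION for conjunctive hypotheses.
# Hypotheses are tuples of constraints ('?' = any value); S holds a single
# maximally specific hypothesis, G the set of maximally general hypotheses.

DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]

def covers(h, x):
    """h classifies instance x as positive iff every constraint is satisfied."""
    return all(hv in ("?", xv) for hv, xv in zip(h, x))

def more_general(h1, h2):
    """h1 >=g h2: every instance covered by h2 is also covered by h1."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples):
    S = None                                   # specific boundary (single hypothesis)
    G = [tuple("?" for _ in DOMAINS)]          # general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]                       # prune G
            if S is None:                                            # first positive example
                S = tuple(x)
            else:                                                    # minimally generalize S
                S = tuple(sv if sv == xv else "?" for sv, xv in zip(S, x))
        else:
            new_G = [g for g in G if not covers(g, x)]
            for g in (g for g in G if covers(g, x)):                 # specialize offenders
                for i, gv in enumerate(g):
                    if gv != "?":
                        continue
                    for value in DOMAINS[i]:
                        if value != x[i]:
                            spec = g[:i] + (value,) + g[i + 1:]
                            if more_general(spec, S):                # keep only if above S
                                new_G.append(spec)
            # keep only maximally general members of G
            G = [g for g in new_G
                 if not any(h != g and more_general(h, g) for h in new_G)]
    return S, G

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(examples)
print("S:", S)   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print("G:", G)   # [('Sunny','?','?','?','?','?'), ('?','Warm','?','?','?','?')]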


Ranking Inductive Learners according to their Biases:

–Rote-Learner: This system simply memorizes the training data and their
classification--- No generalization is involved.

–Candidate-Elimination: New instances are classified only if all the hypotheses in the
version space agree on the classification

–Find-S: New instances are classified using the most specific hypothesis consistent with
the training data


Decision Tree Classification Algorithm

 Decision Tree is a supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
 Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
 A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
 It is a graphical representation for getting all the possible solutions to a problem/decision based on given
conditions.
 It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
 In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes
are the output of those decisions and do not contain any further branches.

There are two main types of Decision Trees:


1. Classification trees (Yes/No types)
A classification tree is one where the outcome is a categorical variable, such as 'fit' or 'unfit'. Here the decision variable is Categorical.
2. Regression trees (Continuous data types)
Here the decision or the outcome variable is Continuous, e.g. a number like 123.

Decision Tree Terminologies


Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which further gets
divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a leaf
node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given
conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes.
Decision Tree Representation:
 Each non-leaf node is connected to a test that splits its set of possible answers into subsets
corresponding to different test results.
 Each branch carries a particular test result's subset to another node.
 Each node is connected to a set of possible answers.
 Below diagram explains the general structure of a decision tree:

 A decision tree is an arrangement of tests that provides an appropriate classification at every
step in an analysis.
 "In general, decision trees represent a disjunction of conjunctions of constraints on the
attribute-values of instances. Each path from the tree root to a leaf corresponds to a
conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions"
(Mitchell, 1997, p.53).
 More specifically, decision trees classify instances by sorting them down the tree from the
root node to some leaf node, which provides the classification of the instance. Each node in
the tree specifies a test of some attribute of the instance, and each branch descending from
that node corresponds to one of the possible values for this attribute.
 An instance is classified by starting at the root node of the decision tree, testing the attribute
specified by this node, then moving down the tree branch corresponding to the value of the
attribute. This process is then repeated at the node on this branch and so on until a leaf node is
reached.
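A minimal sketch of this classification procedure, with the tree represented as nested dictionaries (the example tree and instance below are only illustrative; they use a weather-style problem like the one covered later in this unit):

# Minimal sketch of classifying an instance by walking a decision tree.
# The tree is a nested dict: {attribute: {value: subtree-or-leaf}}.
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    """Start at the root, test the node's attribute, follow the matching branch
    and repeat until a leaf (a class label) is reached."""
    node = tree
    while isinstance(node, dict):
        attribute = next(iter(node))              # attribute tested at this node
        node = node[attribute][instance[attribute]]
    return node

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes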

Appropriate Problems for Decision Tree Learning


Decision tree learning is generally best suited to problems with the following characteristics:
 Instances are represented by attribute-value pairs.
o There is a finite list of attributes (e.g. hair colour) and each instance stores a value for
that attribute (e.g. blonde).
o When each attribute has a small number of distinct values (e.g. blonde, brown, red) it is
easier for the decision tree to reach a useful solution.
o The algorithm can be extended to handle real-valued attributes (e.g. a floating point
temperature)
 The target function has discrete output values.
o A decision tree classifies each example as one of the output values.
 Simplest case exists when there are only two possible classes (Boolean
classification).
 However, it is easy to extend the decision tree to produce a target function with
more than two possible output values.
o Although it is less common, the algorithm can also be extended to produce a target
function with real-valued outputs.
 Disjunctive descriptions may be required.
o Decision trees naturally represent disjunctive expressions.
 The training data may contain errors.
o Errors in the classification of examples, or in the attribute values describing those
examples are handled well by decision trees, making them a robust learning method.
 The training data may contain missing attribute values.
o Decision tree methods can be used even when some training examples have unknown
values (e.g., humidity is known for only a fraction of the examples).
After a decision tree learns classification rules, it can also be re-represented as a set of if-then rules in
order to improve readability.

How does the Decision Tree algorithm Work?


The decision of making strategic splits heavily affects a tree’s accuracy. The decision criteria are
different for classification and regression trees.

Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The
creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can say
that the purity of the node increases with respect to the target variable. The decision tree splits the
nodes on all available variables and then selects the split which results in most homogeneous sub-
nodes.

The algorithm selection is also based on the type of target variables. Let us look at some algorithms
used in Decision Trees:

ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Trees)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
MARS → (Multivariate Adaptive Regression Splines)

The ID3 algorithm builds decision trees using a top-down greedy search approach through the space of
possible branches with no backtracking. A greedy algorithm, as the name suggests, always makes the
choice that seems to be the best at that moment.

In a decision tree, for predicting the class of a given dataset, the algorithm starts from the root node of the tree. The algorithm compares the value of the root attribute with the corresponding attribute of the record (real dataset) and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the following algorithm:
 Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
 Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
 Step-3: Divide S into subsets that contain the possible values of the best attribute.
 Step-4: Generate the decision tree node, which contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final nodes are called leaf nodes.

Basic Decision Tree Learning Algorithm:


Now that we know what a Decision Tree is, we'll see how it works internally. There are many algorithms that construct decision trees, but one of the best known is the ID3 algorithm. ID3 stands for Iterative Dichotomiser 3.

Entropy:
Entropy is a measure of the randomness in the information being
processed. The higher the entropy, the harder it is to draw any
conclusions from that information. Flipping a coin is an example of an
action that provides information that is random.
From the graph, it is quite evident that the entropy H(X) is zero when the probability is either 0 or 1. The entropy is maximum when the probability is 0.5, because that represents perfect randomness in the data and there is no chance of perfectly determining the outcome.

Information Gain

Information gain (IG) is a statistical property that measures how well a given attribute separates the training examples according to their target classification. Constructing a decision tree is all about finding an attribute that returns the highest information gain and the smallest entropy.

ID3 follows the rule — A branch with an entropy of zero is a leaf node and A branch with
entropy more than zero needs further splitting.
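For instance, a minimal sketch (plain Python, with illustrative class counts) of how entropy behaves for mixed and pure samples:

# Minimal sketch: entropy of a Boolean-labelled sample from its class counts.
from math import log2

def entropy(counts):
    """Entropy(S) = sum_i p_i * log2(1/p_i) = -sum_i p_i * log2(p_i)."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

print(entropy([9, 5]))    # about 0.940 bits (mostly mixed sample)
print(entropy([7, 7]))    # 1.0 (maximum: perfectly mixed)
print(entropy([14, 0]))   # 0.0 (pure sample: all one class, i.e. a leaf node)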

Hypothesis space search in decision tree learning:
In order to derive the hypothesis (the tree), we compute the entropy of the class and the information gain of each attribute, using the following formulae:

Entropy of the class:
Entropy(S) = − Σi pi log2(pi), where pi is the proportion of examples in S belonging to class i.

For any attribute A, the entropy of the attribute (the expected entropy after splitting on A) is:
Entropy(S, A) = Σ over v ∈ Values(A) of (|Sv| / |S|) · Entropy(Sv)

and the information gain is:
Gain(S, A) = Entropy(S) − Entropy(S, A)
Illustrative Example:
Concept: “Play Tennis”:
Data set:
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

By using the above formulae, the decision tree is derived by repeatedly selecting the attribute with the highest information gain:
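The derivation can be reproduced programmatically. Below is a minimal sketch of ID3 on this dataset (plain Python; attribute names follow the table above): it computes the information gain of every attribute, selects the best one as the root, and recurses on each branch.

# Minimal sketch of ID3 on the Play Tennis dataset above: at each node, compute
# the information gain of every remaining attribute and split on the best one.
from math import log2
from collections import Counter

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
DATA = [  # (Outlook, Temperature, Humidity, Wind) -> PlayTennis
    (("Sunny", "Hot", "High", "Weak"), "No"),      (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),  (("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"),   (("Rain", "Cool", "Normal", "Strong"), "No"),
    (("Overcast", "Cool", "Normal", "Strong"), "Yes"), (("Sunny", "Mild", "High", "Weak"), "No"),
    (("Sunny", "Cool", "Normal", "Weak"), "Yes"),  (("Rain", "Mild", "Normal", "Weak"), "Yes"),
    (("Sunny", "Mild", "Normal", "Strong"), "Yes"), (("Overcast", "Mild", "High", "Strong"), "Yes"),
    (("Overcast", "Hot", "Normal", "Weak"), "Yes"), (("Rain", "Mild", "High", "Strong"), "No"),
]

def entropy(examples):
    """Entropy(S) = -sum_i p_i log2 p_i over the class proportions."""
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return sum((c / total) * log2(total / c) for c in counts.values())

def gain(examples, i):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = len(examples)
    remainder = 0.0
    for v in set(x[i] for x, _ in examples):
        subset = [(x, y) for x, y in examples if x[i] == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    labels = set(label for _, label in examples)
    if len(labels) == 1:                    # pure node: make a leaf
        return labels.pop()
    if not attributes:                      # no attributes left: majority label
        return Counter(label for _, label in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, ATTRIBUTES.index(a)))
    i = ATTRIBUTES.index(best)
    branches = {}
    for v in set(x[i] for x, _ in examples):
        subset = [(x, y) for x, y in examples if x[i] == v]
        branches[v] = id3(subset, [a for a in attributes if a != best])
    return {best: branches}

for a in ATTRIBUTES:                        # gains at the root
    print(a, round(gain(DATA, ATTRIBUTES.index(a)), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048 -> Outlook becomes the root
print(id3(DATA, ATTRIBUTES))
# root Outlook: Sunny -> split on Humidity, Overcast -> Yes, Rain -> split on Wind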

Inductive Bias in Decision Tree Learning:

Occam's Razor (Specialized to Decision Trees)

"The world is inherently simple. Therefore the smallest decision tree that is consistent with the samples is the one
that is most likely to identify unknown objects correctly."

Given m attributes, a decision tree may have a maximum height of m.

Rather than building all the possible trees, measuring the size of each, and choosing the smallest tree that
best fits the data, we use Quinlan's ID3 algorithm for constructing a decision tree.
