
TEXT BOOK: MACHINE LEARNING BY TOM M. MITCHELL, MGH
REF BOOK: MACHINE LEARNING: AN ALGORITHMIC PERSPECTIVE BY STEPHEN MARSLAND, TAYLOR
& FRANCIS
===================================================================================
===================
UNIT - 1
INTRODUCTION - WELL POSED LEARNING PROBLEMS, DESIGNING A LEARNING SYSTEM,
PERSPECTIVES AND ISSUES IN
MACHINE LEARNING.
CONCEPT LEARNING AND THE GENERAL TO SPECIFIC ORDERING - INTRODUCTION, A CONCEPT
LEARNING TASK, CONCEPT
LEARNING AS SEARCH, FIND - S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS, VERSION
SPACES AND THE CANDIDATE
ELIMINATION ALGORITHM, REMARKS ON VERSION SPACES AND CANDIDATE ELIMINATION,
INDUCTIVE BIAS.

DECISION TREE LEARNING: INTRODUCTION, DECISION TREE REPRESENTATION, APPROPRIATE


PROBLEMS FOR DECISION
TREE LEARNING, THE BASIC DECISION TREE LEARNING ALGORITHM, HYPOTHESIS SPACE SEARCH
IN DECISION TREE
LEARNING, INDUCTIVE BIAS IN DECISION TREE LEARNING, ISSUES IN DECISION TREE
LEARNING.
===================================================================================
===================
INTRODUCTION:
WHAT IS MACHINE LEARNING?
FIELD OF STUDY THAT GIVES COMPUTERS A CAPABILITY TO LEARN WITHOUT BEING EXPLICITLY
PROGRAMMED.

The field of machine learning is concerned with the question of how to construct
computer programs that automatically improve with experience.

Some Applications of ML:


1) to detect fraudulent credit card transactions,
2) information-filtering systems that learn users' reading preferences,
3) autonomous vehicles that learn to drive on public highways,
4) to detect user shopping preferences, once you enter shopping data or
search for products of interest online.

AI - ML - DL
DL is the subset of ML
ML is the subset of AI

The goal of this subject is to present the key algorithms and theory that
form the core of machine learning.

Machine learning draws on concepts and results from many fields,


including statistics, artificial intelligence, philosophy,
information theory, biology, cognitive science (relating to the processes of thinking
and reasoning), computational complexity, and control theory.

My belief is that the best way to learn about machine learning is


to view it from all of these perspectives and to understand the problem settings,
algorithms, and assumptions that underlie each.
"How does learning performance vary with the number of training examples
presented?"
and "Which learning algorithms are most appropriate for various types of learning
tasks?

The practice of machine learning is covered by presenting the major algorithms in


the field,
along with illustrative traces of their operation.

Online data sets and implementations of several algorithms are available via the Web
site at http://www.cs.cmu.edu/~tom/mlbook.html.

These include neural network code and data for face recognition,
decision tree learning, code and data for financial loan analysis, and Bayes
classifier code
and data for analyzing text documents.

====================================================
WELL POSED LEARNING PROBLEM IN MACHINE LEARNING:

Well Posed Learning Problem – A computer program (or agent) is said to learn from
experience E with respect to some task T and some performance measure P, if its
performance on T, as measured by P, improves with experience E.

A problem qualifies as a well-posed learning problem if it has three traits

Task
Performance Measure
Experience

EXAMPLES: 1. CHECKERS PLAYING PROBLEM


2. HANDWRITING RECOGNITION PROBLEM (images & text)
3. ROBOT DRIVING LEARNING PROBLEM

Some examples that clearly illustrate well-posed learning problems –

1. To better filter emails as spam or not

Task – Classifying emails as spam or not


Performance Measure – The fraction of emails accurately classified as spam or not
spam
Experience – Observing you label emails as spam or not spam

2. A checkers learning problem

Task – Playing checkers game


Performance Measure – percent of games won against opponents
Experience – playing practice games against itself

3. Handwriting Recognition Problem

Task – Recognizing handwritten words within images


Performance Measure – percent of words accurately classified
Experience – a database of handwritten words with given classifications

4. A Robot Driving Problem

Task – driving on public four-lane highways using vision sensors


Performance Measure – average distance travelled before an error
Experience – a sequence of images and steering commands recorded while observing a
human driver

5. Fruit Prediction Problem

Task – forecasting different fruits for recognition


Performance Measure – able to predict maximum variety of fruits
Experience – training machine with the largest datasets of fruits images

6. Face Recognition Problem

Task – predicting different types of faces


Performance Measure – able to predict maximum types of faces
Experience – training machine with maximum amount of datasets of different face
images

7. Automatic Translation of documents

Task – translating one type of language used in a document to other language


Performance Measure – able to convert one language to other efficiently
Experience – training machine with a large dataset of different types of languages

===================================================================================
========

DESIGNING A LEARNING SYSTEM:


To get a successful learning system, it should be properly designed.

Design steps to follow


1) Choosing the training experience
2) Choosing the target function
3) Choosing a representation for the target function
4) Choosing a learning algorithm for approximating the target function
5) Final Design

Step 1: Choosing a training experience:


In choosing a training experience, 3 attributes are to be considered.
i) Type of feedback
ii) degree
iii) distribution of examples

i) Type of feedback: Whether the training experience provides direct or indirect


feedback regarding
the choices made by performance system
Direct feedback
Indirect feedback
Direct information (or training examples) consists of individual checkerboard
states and their
correct moves.
Indirect information consists of move sequences and the final outcomes (win or
lose).
Examples: Checkers game & learning to drive

ii) Degree: to which learner will control the sequence of training.


Example : Learning driving
a) with trainer's help
b) With trainer's partial help
c) Completely on your own.

iii) Distribution of examples: How well the training experience represents the


distribution of examples
over which the performance of the final system is measured.
The more possible combinations or situations, the more examples needed:
learning over a distributed range of examples.
Ex: learning to drive: facing all kinds of situations.

Step 2: Choosing the target function


What type of knowledge is learnt, and how it will be used by the performance
system

Ex: Checkers game


The set of all possible moves is called the set of legal moves.
Selecting one of the (diagonal) moves gives the target move.
i) Travel only in forward direction
ii) Only one move per chance
iii) Only in diagonal direction
iv) Jump over the opponent (capturing the opponent's coin)

Target function = v(b)


Board state = b
Legal moves set = B

1. If b is a final board state that is won, then v(b) = 100


2. If b is a final board state that is lost, then v(b) = -100
3. If b is a final board state that is a draw, then v(b) = 0
4. If b is not a final state, then v(b) = v(b'),
where b' is the best final board state reachable from b.

Step 3: Choosing a representation for the target function

For any board state b, we calculate a function c(b) as a linear combination of


the following board features.

Features:
x1 - No.of black pieces on a board
x2 - No.of red pieces on a board
x3 - No.of black kings on a board
x4 - No.of red kings on a board
x5 - No.of black pieces threatened by red
( blacks which can be beaten by red)
x6 - No.of red pieces threatened by black

V(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6


(representation of the target function :c(b))
w0 to w6 = numerical coefficients or weight of each feature
w0 is additive constant
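The linear representation above can be sketched directly in code. This is an
illustrative sketch only: the weight values below are hypothetical, not taken from
the text.

```python
def v_hat(features, weights):
    """V(b) = w0 + w1*x1 + ... + w6*x6 for the six board features x1..x6."""
    w0, *ws = weights
    return w0 + sum(w * x for w, x in zip(ws, features))

# board with 3 black pieces and 1 black king; all other features are 0
b = [3, 0, 1, 0, 0, 0]
weights = [0, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]  # hypothetical weights w0..w6
print(v_hat(b, weights))  # 0 + 3*1.0 + 1*2.0 = 5.0
```

Learning then amounts to finding good values for w0..w6, which is the subject of
Step 4.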

Step 4: Choosing a learning algorithm for approximating the target function


To learn the target function f, we need a set of training examples, each
describing a particular board state b and a training value Vtrain(b).

Training example representation


Ordered pair = (b, Vtrain(b))
Example:
Black won the game
(i.e., x2 = 0, which means no red pieces remain)
Vtrain(b) = +100
b = (x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0)
ordered pair = <(x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100>
We need to do 2 steps in this phase
i) Estimating the training values:
At every step, we consider the successor state
(depending on the next move of the opponent)
Vtrain(b) <-- Vcap(successor(b))
successor(b) represents the next board state;
we estimate whether this move will help or hurt against the opponent.

Vcap ---> represents the approximation to V

ii) Adjusting the weights:


There are algorithms to find the weights of linear functions.
Here, we use the LMS (Least Mean Squares) algorithm to minimise the error.
Error E = (Vtrain(b) - Vcap(b)) ^ 2

If the error is zero, there is no need to change the weights


If the error is +ve, each weight is increased in proportion
If the error is -ve, each weight is decreased in proportion
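The three cases above fall out of a single LMS update rule. A minimal sketch,
assuming a small learning rate eta (the value 0.1 is an illustrative choice, not
specified in the text):

```python
def lms_update(weights, features, v_train, eta=0.1):
    """One LMS step: wi <- wi + eta * (Vtrain(b) - Vcap(b)) * xi, with x0 = 1."""
    x = [1.0] + list(features)
    v_cap = sum(w * xi for w, xi in zip(weights, x))
    error = v_train - v_cap          # zero error leaves the weights unchanged
    return [w + eta * error * xi for w, xi in zip(weights, x)]

w = [0.0] * 7                        # w0..w6 start at zero
w = lms_update(w, [3, 0, 1, 0, 0, 0], 100)   # black won: Vtrain(b) = +100
print(w[:4])                         # only weights of non-zero features move
```

Note how the sign of the error decides whether each weight grows or shrinks, and
features with value 0 leave their weights untouched.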

Step 5: Final Design of ML:


It has four different modules
1. Performance system
2. Critic
3. Generalizer
4. Experiment Generator

The performance system: The performance system solves the given performance task.

Critic: The critic takes the history of the game and generates training examples.

Generalizer: It outputs the hypothesis that is its estimate of the target function.

Experiment Generator: It creates a new problem after taking in the hypothesis for
the
performance system to explore.
===================================================================================
===========

PERSPECTIVES AND ISSUES IN MACHINE LEARNING :

Machine learning involves searching a very large space of


possible hypotheses (a hypothesis is a provisional idea, which may be true or false)
to
determine the one that best fits the observed data and any prior knowledge held by
the learner.

Based on the type of data, we choose the particular algorithm that gives the best
solution. From my point of view (perspective), one algorithm may suit my problem,
while in your case a different
algorithm suits you.

Issues in ML:
1. What algorithms should be used?
2. Which algorithms perform best for which types of problems?
3. How much training and testing data is sufficient?
4. What kinds of methods should be used?
5. What methods should be used to reduce learning overheads?
6. Which methods should be used for which type of data?

===================================================================================
====================
Concept Learning - Introduction, Concept Learning Task:
Different hypotheses capture different concepts.
Universal set: gadgets - tablets, smart phones, ear-phones, PCs, desktops, etc.

features(Binary valued attributes)


Size: large, small - x1
colour: Black, Blue - x2
Screentype: Flat, Folded - x3
Shape: Square, Rectangle - x4

Now define the concept = < x1, x2, x3, x4>

It is the general representation of the concept


Individual representation of the concept for Tablet

Tablet = < large, black, flat, square>


Smartphone = < small, blue, folded, rectangle>
No. of possible instances = 2^d = 2^4 = 16
(d is the no. of features)
Total possible concepts = 2^(2^d) = 2^16
From all these concepts, which concepts are chosen?
We do not worry about all the concepts; only the concepts that are consistent are
considered.

circle: the entire space of concepts, but we want only some part of it (the target
concept / hypothesis space)

We have 4 features of the gadget tablet < L, B, F, S>


Depending on the situation, a hypothesis may consider all of these features, some of
the features, or not even one.

If you reject all features < null, null, null, null>


then it is called the Most specific hypothesis.

If you accept all the features < ?, ?, ?, ?> then


it is called the Most general hypothesis.

Main Goal: find all concepts/hypotheses that are consistent

Example:

Reality Vs Hypothesis:

In the year 2000 the features of the mobile phone: x1, x2, x3
All these features are assumptions/imaginations/hypotheses. People were expecting
small size, dual sim, wireless charging, and more facilities.
In the year 2020: Some features are implemented; those are the consistent ones.
===================================================================================
====
Concept learning as search:
Main goal of this search is to find the hypothesis that best fits the training
examples.

Example: The EnjoySport learning task consists of 6 attributes.


(Sky, Air temperature, humidity, wind, water and forecast)

Sky has three values: Rainy, cloudy and sunny


Remaining Attributes have only 2 values

Number of distinct instances possible = 3 x 2 x 2 x 2 x 2 x 2 = 96


Syntactically distinct hypotheses = 5 x 4 x 4 x 4 x 4 x 4 = 5120
(each attribute additionally allows 2 more values - ? and null)
Ex: < null, Rainy, ?, Strong, Cool, Same>

Semantically distinct hypotheses = 1 + (4 x 3 x 3 x 3 x 3 x 3) = 973


(all hypotheses containing a null are equivalent, so they are taken as one)
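The three counts follow from per-attribute arithmetic, which a short sketch makes
explicit:

```python
values = [3, 2, 2, 2, 2, 2]   # legal values per attribute; Sky has 3, the rest 2

instances = 1
for v in values:
    instances *= v             # 3 x 2 x 2 x 2 x 2 x 2 = 96

syntactic = 1
for v in values:
    syntactic *= v + 2         # each attribute also allows '?' and null

semantic = 1
for v in values:
    semantic *= v + 1          # only '?' is extra per attribute; every
semantic += 1                  # null-containing hypothesis collapses into one

print(instances, syntactic, semantic)  # 96 5120 973
```

The `+ 1` at the end counts the single empty concept that classifies every
instance negative.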

After finding all the syntactically and semantically distinct hypotheses,


we search for the best match among them, i.e., the one closest to our learning
problem.
===================================================================================
==================

FIND - S Algorithm: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS

This algorithm considers only the +ve sample records.


Most specific hypothesis = null
Most general hypothesis = ?

Algorithm:
Step 1: Initialize with the most specific hypothesis - null (phi)
h0 = < null, null, null, null, null > (5 attributes)

Step 2: For each +ve sample, for each attribute


if (value = hypothesis value) keep it
else
replace it with the most general value ?

Example:
-----------------------------------------------------------------------
Origin Manufacturer Colour Year Type Class
---------------------------------------------------------------------
JP HO BLUE 1980 ECO +VE (YES)
JP TO GREEN 1970 SPO -VE (NO)
JP TO BLUE 1990 ECO +VE
USA AU RED 1980 ECO -VE
JP HO WHITE 1980 ECO +VE
JP TO GREEN 1980 ECO +VE
JP HO RED 1980 ECO -VE
--------------------------------------------------------------------------

h0 = < null, null, null, null, null >


h1 = <JP, HO, BLUE, 1980, ECO >
h2 = h1
h3 = < JP, ?, Blue, ?, ECO >
h4 = h3
h5 = < JP, ?, ?, ?, ECO>
h6 = < JP, ?, ?, ?, ECO>

h6 is the last hypothesis.

Drawbacks/Disadvantages:
1. Considers only +ve samples
2. h6 may not be the sole hypothesis that fits the complete data.
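The h0..h6 trace above can be reproduced with a short sketch of Find-S on the same
car data (attribute order: Origin, Manufacturer, Colour, Year, Type):

```python
def find_s(examples):
    """examples: list of (attribute_tuple, is_positive) pairs."""
    h = None                            # start at the all-null hypothesis
    for x, positive in examples:
        if not positive:
            continue                    # Find-S ignores negative samples
        if h is None:
            h = list(x)                 # the first +ve sample replaces the nulls
        else:                           # keep matching values, generalise the rest
            h = [hv if hv == xv else '?' for hv, xv in zip(h, x)]
    return h

cars = [
    (('JP',  'HO', 'Blue',  1980, 'Eco'), True),
    (('JP',  'TO', 'Green', 1970, 'Spo'), False),
    (('JP',  'TO', 'Blue',  1990, 'Eco'), True),
    (('USA', 'AU', 'Red',   1980, 'Eco'), False),
    (('JP',  'HO', 'White', 1980, 'Eco'), True),
    (('JP',  'TO', 'Green', 1980, 'Eco'), True),
    (('JP',  'HO', 'Red',   1980, 'Eco'), False),
]
print(find_s(cars))   # ['JP', '?', '?', '?', 'Eco'], matching h6
```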
===================================================================================
===================
Version Spaces: Algorithm to find the Version Space, with example

The version space is the subset of hypotheses from H consistent with the training
examples

VS(H,D) = { h belongs to H | consistent(h, D) }

H is the hypothesis space
D is the set of training examples

Check:
h is consistent with a training example if h(x) = c(x)

Algorithm to obtain a version space( List then eliminate Algorithm):

1. Version Space <- list containing every hypothesis in H


2. From this step, we keep on removing inconsistent hypotheses from the version
space: for each training example <x, c(x)>, remove any hypothesis h for which h(x)
is not equal to c(x)
3. Output the list of hypotheses in the version space after checking all training
examples.
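The three steps above can be sketched directly. The two-attribute domain below is a
hypothetical toy example (not from the text), small enough to enumerate every
hypothesis:

```python
from itertools import product

def consistent(h, x):
    """h labels x positive iff every non-'?' constraint matches."""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def list_then_eliminate(hypotheses, examples):
    version_space = list(hypotheses)          # step 1: every hypothesis in H
    for x, cx in examples:                    # step 2: drop any h with h(x) != c(x)
        version_space = [h for h in version_space if consistent(h, x) == cx]
    return version_space                      # step 3: output what remains

# toy domain: two attributes with two values each, plus '?'
H = list(product(('a', 'b', '?'), ('c', 'd', '?')))
examples = [(('a', 'c'), True), (('b', 'c'), False)]
print(list_then_eliminate(H, examples))
```

Enumerating H works only for tiny spaces; the candidate elimination algorithm below
avoids listing every hypothesis.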

======================================
CANDIDATE ELIMINATION ALGORITHM:
Uses the concept of version space
Consider both +ve and -ve values (yes and no)
Consider both specific and general hypothesis.
For positive samples, move from specific to general
For negative samples, move from general to specific

S = {null,null,null,null,null} + down arrow


G = {?, ?, ?, ?, ?} - up arrow

Algorithm:

1. Initialize both general and specific hypothesis( S and G)


S = {null,null,null,null,null}
G = {?, ?, ?, ?, ?} depends on the no.of attributes
2. For each sample
if sample is +ve
make specific to general
else sample is -ve
make general to specific

Example: EnjoySport
Sky Temperature Humidity Wind Water Forecast Enjoy
-------------------------------------------------------------------------
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
-------------------------------------------------------------------------
S0 = {null,null,null,null,null,null} and G0 = {?,?,?,?,?,?}

1. For +ve sample


S1 = {'sunny', 'warm', 'normal', 'strong', 'warm', 'same'}
G1 = {?,?,?,?,?,?}

2. For +ve sample

Specific to general
S2 = {'sunny', 'warm', ? , 'strong', 'warm', 'same'}
G2 = { ?,?,?,?,?,?} same until you encounter -ve sample

3. -ve sample: Move from general to specific


Keep S3 as it is same as S2
S3 = { 'sunny', 'warm', ? , 'strong', 'warm', 'same'}

G3 = {<'sunny',?,?,?,?,?>, <?,'warm',?,?,?,?>, <?,?,?,?,?,'same'>}

4. for +ve sample


Specific to general
S4 = {<'sunny', 'warm', ?, 'strong', ?, ?>}
G4 = {<'sunny',?,?,?,?,?>, <?,'warm',?,?,?,?>}
Note: <?,?,?,?,?,'same'> is removed because it is not consistent with this
positive sample.

S4 and G4 are the final hypotheses.
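The S0..S4 / G0..G4 trace above can be reproduced by a simplified sketch of
candidate elimination for conjunctive hypotheses (it assumes the first example is
positive, as in the EnjoySport data):

```python
def consistent(h, x):
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def candidate_elimination(examples):
    n = len(examples[0][0])
    S = None                                  # the all-null specific boundary
    G = [tuple('?' for _ in range(n))]        # the maximally general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if consistent(g, x)]
            if S is None:
                S = tuple(x)                  # first +ve lifts null to the example
            else:                             # minimal generalisation of S
                S = tuple(sv if sv == xv else '?' for sv, xv in zip(S, x))
        else:                                 # minimal specialisations of G
            newG = []
            for g in G:
                if not consistent(g, x):      # already excludes the -ve sample
                    newG.append(g)
                    continue
                for i in range(n):
                    if g[i] == '?' and S[i] != '?' and S[i] != x[i]:
                        newG.append(g[:i] + (S[i],) + g[i + 1:])
            G = newG
    return S, G

data = [
    (('sunny', 'warm', 'normal', 'strong', 'warm', 'same'),   True),
    (('sunny', 'warm', 'high',   'strong', 'warm', 'same'),   True),
    (('rainy', 'cold', 'high',   'strong', 'warm', 'change'), False),
    (('sunny', 'warm', 'high',   'strong', 'cool', 'change'), True),
]
S, G = candidate_elimination(data)
print(S)   # S4
print(G)   # G4
```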


============================================

REMARKS ON VERSION SPACES AND CANDIDATE ELIMINATION, INDUCTIVE BIAS:

==========================================================================

Inductive Bias

Candidate-Elimination Algorithm

i) Biased Hypothesis Space


ii) Unbiased Hypothesis Space
iii) Inductive System
iv) Equivalent Deductive System

Remarks on the CEA
The Candidate-Elimination Algorithm will converge toward the true target concept
provided
i) it is given accurate training examples without errors, and
ii) its initial hypothesis space contains the target concept.

What if the target concept is not contained in the hypothesis space?

Can we avoid this difficulty by using a hypothesis space that includes every
possible hypothesis?
To guarantee that the hypothesis space contains the unknown target concept,
the solution is to extend the hypothesis space to include every possible hypothesis.
-----------------------------------------------------------------------------------
-----
To illustrate, consider the EnjoySport example in which we
restricted the hypothesis space to include only conjunctions of attribute values,
e.g.
< Sunny, High, Normal, Strong, Cool, Same>
If we consider only conjunctions of the attributes, we definitely get a hypothesis
space, but
in some cases we miss the target concept; such a space is called a Biased
Hypothesis Space.

Example Sky AirTemp Humidity Wind Water Forecast EnjoySport


------------------------------------------------------------------
1 Sunny Warm Normal Strong Cool Change Yes
2. Cloudy Warm Normal Strong Cool Change Yes
3. Rainy Warm Normal Strong Cool Change No
-----------------------------------------------------------

S0 = < null, null, null, null, null, null>


S1 = < Sunny, Warm, Normal, Strong, Cool, Change>
S2 = < ?, Warm, Normal, Strong, Cool, Change>
S3 = < ?, Warm, Normal, Strong, Cool, Change>

S3 is inconsistent: it covers the third (negative) example. To overcome this
problem, allow disjunctive target concepts such as
< "Sky = Sunny or Sky = Cloudy", Warm, Normal, Strong, Cool, Change>
The resulting space is called the Unbiased (Extended) Hypothesis Space: we consider
simple disjunctive target concepts
such as "Sky = Sunny or Sky = Cloudy."

Unbiased Learner: In general, the set of all subsets of a set X is called the
power set of X.
Example: EnjoySport
Total no. of instances for the 6 attributes = 96
|H'| = 2^|X| = 2^96 (vs |H| = 973, a strong bias)

<"Sky = Sunny or Sky = Cloudy", ?, ?, ?, ?, ?>


is equivalent to < Sunny, ?,?,?,?,?> V < Cloudy, ?,?,?,?,?>
This guarantees that the target concept exists in the space.

===================================================================================
===

APPROPRIATE PROBLEMS FOR DECISION TREE LEARNING:

Decision tree learning is generally best suited to problems with the following
characteristics:

i)Instances are represented by attribute-value pairs. Instances are described by


a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot). The
easiest situation for decision tree learning is when each attribute takes on a
small number of disjoint possible values (e.g., Hot, Mild, Cold).
ii) The target function has discrete output values. The decision tree
assigns a boolean classification (e.g., yes or no) to each example.

iii) Disjunctive descriptions may be required. As noted above, decision trees


naturally represent disjunctive expressions.

iv) The training data may contain missing attribute values. Decision tree methods
can be used
even when some training examples have unknown values.

Many practical problems have been found to fit these characteristics.


Decision tree learning has therefore been applied to problems such as

learning to classify medical patients by their disease,

equipment malfunctions by their cause, and

loan applicants by their likelihood of defaulting on payments.

Such problems, in which the task is to classify examples into one of a discrete set
of possible
categories, are often referred to as classification problems.
===================================================================================
============
HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING

ID3 can be characterized as searching a space of hypotheses for one that fits the
training examples.
The hypothesis space searched by ID3 is the set of possible decision trees.
ID3 performs a simple-to-complex, hill-climbing search through this hypothesis
space,
beginning with the empty tree, then considering progressively more elaborate
hypotheses in search of
a decision tree that correctly classifies the training data.

The evaluation function that guides this hill-climbing search is the information
gain measure.
This search is depicted in Figure.

By viewing ID3 in terms of its search space and search strategy, we can get
some insight into its capabilities and limitations
-----------------------------------------------------------

i) ID3's hypothesis space of all decision trees is a complete space of finite


discrete-valued functions, relative to the available attributes.

ii) ID3 maintains only a single current hypothesis as it searches through the
space of decision trees. ID3 loses the capabilities that follow from explicitly
representing
all consistent hypotheses.(Limitation)

iii) ID3 in its pure form performs no backtracking in its search.


Once it selects an attribute to test at a particular level in the tree, it never
backtracks
to reconsider this choice, so it risks converging to locally optimal solutions that
are not globally optimal.
This locally optimal solution may be less desirable than trees that would have been
encountered along a different branch of the search. Below we discuss an extension
that adds a form
of backtracking (post-pruning the decision tree). (Limitation)

iv) ID3 uses all training examples at each step in the search to make statistically
based decisions regarding how to refine its current hypothesis. One advantage of
using statistical properties of all the examples (e.g., information gain) is that
the resulting search is much less sensitive to errors in individual training
examples. ID3 can be easily extended to handle noisy training data by modifying its
termination
criterion to accept hypotheses that imperfectly fit the training data.
===================================================================================
=================

Consider the following set of training examples:

Instance Classification a1 a2

(a) What is the entropy of this collection of training examples with respect to the
target function classification?
(b) What is the information gain of a2 relative to these training examples?
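The quantities asked for in (a) and (b) come from the standard entropy and
information gain formulas. Since the instance table is not reproduced above, the
data below is a hypothetical stand-in to show the computation:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum p_i * log2(p_i) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, labels):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    n = len(labels)
    gain = entropy(labels)
    partitions = {}
    for x, y in zip(rows, labels):
        partitions.setdefault(x[attr], []).append(y)   # split S by value of A
    for subset in partitions.values():
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# hypothetical stand-in data with attributes (a1, a2)
rows   = [('T', 'T'), ('T', 'F'), ('F', 'T'), ('F', 'F')]
labels = ['+', '+', '-', '-']
print(entropy(labels))                     # 1.0 for a 50/50 split
print(information_gain(rows, 0, labels))   # a1 alone separates the classes: 1.0
```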
===================================================================================
==============
INDUCTIVE BIAS IN DECISION TREE LEARNING:

Given a collection of training examples, there are typically many decision trees
consistent with
these examples.
If there is only one tree, there is no issue: we classify the new (test) examples
with it. But if
there is more than one tree, which tree should be used to classify the new examples?
The answer to this is the inductive bias.

Describing the inductive bias of ID3 therefore consists of describing the basis by
which it chooses
one of these consistent hypotheses over the others.

Which of these decision trees does ID3 choose?

It chooses the first acceptable tree it encounters in its simple-to-complex,
hill-climbing search through the space of possible trees.
Two types of Inductive bias of ID3:


i) Approximate inductive bias of ID3
Shorter trees are preferred over larger trees.

ii) A closer approximation to the inductive bias of ID3


Shorter trees are preferred over longer trees.
Trees that place high information gain attributes close to the root are
preferred over those that do not.

Why should one prefer shorter hypotheses over longer ones?

Arguments in favor:
a) There are fewer short hypotheses than long ones
b) If a short hypothesis fits the data, it is unlikely to be a coincidence

Arguments against:
a) Not every short hypothesis is a reasonable one
If the performance of the short tree on testing examples is poor, then we go for
a longer one.

Occam's Razor: "The simplest explanation is usually the best one."

- The term razor refers to the act of shaving away unnecessary assumptions to get
to the simplest
explanation.
===================================================================================
==========================

ISSUES IN DECISION TREE LEARNING:

1. Overfitting the data: if the tree depends on the training data too much, it works
well on the training examples but not on the testing examples. This problem can be
addressed with two techniques

i) The reduced error pruning technique


ii) The Rule post-pruning technique.

2. Incorporating continuous valued attributes


First we have to convert continuous values into discrete ones and then apply
decision tree learning.

3. Handling training examples with missing attribute values


Fill those missing values with proper values.

4. Handling attributes with different costs


In the core ID3 algorithm all attributes are given equal weightage, but based on the
information gain (and cost) of an attribute,
that attribute may be given more weightage than the others.

5. Alternative measures for selecting attributes


Information Gain/ Gini Index while drawing DTs
===================================================================================
============
OVERFITTING IN DECISION TREES

Why does overfitting occur in DTs?


Building trees that adapt too much to the training examples leads to overfitting.

Definition of Overfitting: Given a hypothesis space H, a hypothesis h belonging to H
is said to overfit


the training data
if there exists some alternative hypothesis h' belonging to H, such that h has a
smaller error than h'
over the training examples, but h' has a smaller error than h over the entire
distribution of instances
(training and validation)

Avoid Overfitting in DTs:

1) Stop growing the tree earlier, before it reaches the point where it perfectly
classifies the training
data.
2) Allow the tree to overfit the data, and then post-prune the tree.
Split the data into two parts (training and validation) and use the validation set
to assess the utility
of post-pruning.

Reduced error pruning


Rule-post pruning

1) Reduced Error Pruning Technique:


i) Each node is a candidate for pruning
ii) Pruning consists of removing the subtree rooted at a node: the node becomes a
leaf and is assigned
the most common classification.
iii) Nodes are removed only if the resulting tree performs better on the
validation set.
iv) Nodes are pruned iteratively: at each iteration, the removal that most
increases accuracy
on the validation set is performed.
v) Pruning stops when no pruning increases accuracy.

Check for Humidity/ Wind

2) Rule Post-Pruning technique (does not remove nodes)


i) Create the decision tree from the training set
ii) Convert the tree into an equivalent set of rules
a) Each path corresponds to a rule
b) Each node along a path corresponds to a pre-condition
c) Each leaf classification corresponds to the post-condition
iii) Prune each rule by removing those pre-conditions whose removal improves
accuracy over the validation
set.
iv) Sort the rules in estimated order of accuracy and consider them in sequence
when classifying
new instances.

How many paths are there in this DT?


5 paths = 5 Rules

1. Outlook = Sunny ^ Humidity = High -> No


2. Outlook = Sunny ^ Humidity = Normal -> Yes
3. Outlook = Overcast -> Yes
4. Outlook = Rain ^ Wind = Strong -> No
5. Outlook = Rain ^ Wind = Weak -> Yes
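Step ii) of rule post-pruning (path-to-rule conversion) can be sketched on this
tree. The nested-dict encoding below is a hypothetical representation of the
PlayTennis tree whose five rules are listed above:

```python
# hypothetical nested-dict encoding of the tree: {attribute: {value: subtree}}
tree = {'Outlook': {
    'Sunny':    {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
    'Overcast': 'Yes',
    'Rain':     {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
}}

def tree_to_rules(node, preconditions=()):
    """Each root-to-leaf path becomes one (preconditions, classification) rule."""
    if not isinstance(node, dict):
        return [(preconditions, node)]        # leaf: the post-condition
    (attr, branches), = node.items()
    rules = []
    for value, child in branches.items():     # each branch adds a pre-condition
        rules.extend(tree_to_rules(child, preconditions + ((attr, value),)))
    return rules

rules = tree_to_rules(tree)
print(len(rules))   # 5 paths = 5 rules
```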

Find the performance of the tree:


Now take the first rule and divide it into sub-rules
Outlook = Sunny -> No
Humidity = High -> No
Use the validation set to check the accuracy of these rules one by one.
Fix the best one for accuracy and remove the worst one.

Converting to rules improves readability for humans.


Node pruning affects all the rules - globally.
Pre-condition pruning affects only one rule - locally.
