Each internal node of a decision tree tests one property of a collection of data, and then delegates further decision making to child nodes based on
the value of that particular property (Luger 2009, Section 10.3). The leaf
nodes of the decision tree are terminal states that return a class for the
given data collection. We can illustrate decision trees through the example
of a simple credit history evaluator used in Luger (2009) in its
discussion of the ID3 learning algorithm. We refer the reader to this book
for a more detailed discussion, but will review the basic concepts of
decision trees and decision tree induction in this section.
Assume we wish to assign a credit risk of high, moderate, or low to people
based on the following properties of their credit rating:
Collateral, with possible values {adequate, none}
Income, with possible values {“$0 to $15K”, “$15K to $35K”,
“over $35K”}
Debt, with possible values {high, low}
Credit History, with possible values {good, bad, unknown}
We could represent risk criteria as a set of rules, such as “If debt is low,
and credit history is good, then risk is moderate.” Alternatively, we can
summarize a set of rules as a decision tree, as in figure 27.1. We can
perform a credit evaluation by walking the tree, using the values of the
person’s credit history properties to select a branch. For example, using the
decision tree of figure 27.1, an individual with credit history = unknown,
debt = low, collateral = adequate, and income = $15K to $35K would be
categorized as having low risk. Also note that this particular categorization
does not use the income property. This is a form of generalization, where
people with these values for credit history, debt, and collateral qualify as
having low risk, regardless of income.
Figure 27.1 A Decision Tree for the Credit Risk Problem (Luger 2009)
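To make the tree walk concrete, the single path just described can be hand-coded as a nested conditional. The following is an illustrative sketch only: it covers just the one path through figure 27.1 discussed above, and the class, method, and string values are our own choices, not part of the chapter's framework.

public class CreditTreeWalk
{
    // Sketch of one path through figure 27.1: credit history = unknown,
    // debt = low, collateral = adequate. All other paths are omitted.
    static String assessRisk(String history, String debt,
            String collateral)
    {
        if("unknown".equals(history)
                && "low".equals(debt)
                && "adequate".equals(collateral))
            return "low";   // income is never consulted on this path
        return "(other paths omitted in this sketch)";
    }

    public static void main(String[] args)
    {
        // The individual from the text is categorized as low risk,
        // regardless of income.
        System.out.println(assessRisk("unknown", "low", "adequate"));
    }
}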
Now, assume the following set of 14 training examples. Although this does
not cover all possible instances, it is large enough to define a number of
meaningful decision trees, including the tree of figure 27.1 (the reader may
want to construct several such trees; see exercise 1). The challenge facing
any inductive learning algorithm is to produce a tree that both covers all
the training examples correctly and has the highest probability of being
correct on new instances.
    }

    public String toString()
    {
        // to be defined by reader
    }

    public abstract Set<String> getPropertyNames();
}
This implementation of AbstractExample as an immutable object is
incomplete in that it does not include the techniques demonstrated in
AbstractProperty to enforce the immutability pattern. We leave this
as an exercise.
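The idiom itself is straightforward. Below is a generic sketch of the immutable object pattern the text refers to; the class and field names are illustrative only and are not taken from AbstractProperty's listing. Fields are declared final, set exactly once in the constructor, and exposed only through getters, so an instance cannot change after construction.

import java.lang.IllegalArgumentException;

public final class ImmutablePair
{
    // final fields: assigned once, never reassigned
    private final String name;
    private final String value;

    public ImmutablePair(String name, String value)
    {
        if(name == null || value == null)
            throw new IllegalArgumentException(
                "name and value may not be null");
        this.name = name;
        this.value = value;
    }

    public String getName()  { return name; }
    public String getValue() { return value; }

    // No setters: state cannot change after construction.
}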
Implementing ExampleSet
ExampleSet, along with AbstractDecisionTreeNode, is one of
the most interesting classes in the implementation. This is because the
decision tree induction algorithm requires a number of fairly complex
operations for partitioning the example set on property values. The
implementation presented here is simple and somewhat inefficient, storing
examples as a simple vector. This requires examination of all examples to
form partitions, retrieve examples with a specific value for a property, etc.
We leave a more efficient implementation as an exercise.
In providing integrity checks on the data, we require that every example be
categorized, and that all examples in a set be instances of the same class.
The basic member variables and accessors are defined as:
import java.util.*;

public class ExampleSet
{
    // Examples stored as a simple vector; see the exercises
    // for a more efficient map-based representation.
    private Vector<AbstractExample> examples =
        new Vector<AbstractExample>();
    // Categories seen so far.
    private HashSet<String> categories =
        new HashSet<String>();
    // Property names, taken from the first example added.
    private Set<String> propertyNames = null;

    public void addExample(AbstractExample e)
        throws IllegalArgumentException
    {
        if(e.getCategory() == null)
            throw new IllegalArgumentException(
                "Example missing categorization.");
        // Check that new example is of same class
        // as existing examples
        if((examples.isEmpty()) ||
           e.getClass() ==
               examples.firstElement().getClass())
        {
            examples.add(e);
            categories.add(e.getCategory());
            if(propertyNames == null)
                propertyNames =
                    new HashSet<String>(
                        e.getPropertyNames());
        }
        else
            throw new IllegalArgumentException(
                "All examples must be same type.");
    }

    public int getSize()
    {
        return examples.size();
    }

    public boolean isEmpty()
    {
        return examples.isEmpty();
    }

    public AbstractExample getExample(int i)
    {
        return examples.get(i);
    }

    public Set<String> getCategories()
    {
        return new HashSet<String>(categories);
    }

    public Set<String> getPropertyNames()
    {
        return new HashSet<String>(propertyNames);
    }

    // More complex methods to be defined.
    public int getExampleCountByCategory(String cat)
        throws IllegalArgumentException
    {
        // to be defined below.
    }

    public HashMap<String, ExampleSet> partition(
            String propertyName)
        throws IllegalArgumentException
    {
        // to be defined below.
    }
}
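The two remaining methods operate over the vector representation. As a rough sketch of how they might be implemented, consistent with the accessors above but our own illustration rather than the book's listing:

    public int getExampleCountByCategory(String cat)
        throws IllegalArgumentException
    {
        // Count the examples whose category matches cat.
        int count = 0;
        for(AbstractExample e : examples)
            if(e.getCategory().equals(cat))
                count++;
        return count;
    }

    public HashMap<String, ExampleSet> partition(
            String propertyName)
        throws IllegalArgumentException
    {
        // Group the examples by their value for the named property.
        HashMap<String, ExampleSet> partition =
            new HashMap<String, ExampleSet>();
        for(AbstractExample e : examples)
        {
            String value =
                e.getProperty(propertyName).getValue();
            if(!partition.containsKey(value))
                partition.put(value, new ExampleSet());
            partition.get(value).addExample(e);
        }
        return partition;
    }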
AbstractDecisionTreeNode implements the nodes of the decision tree itself. Its opening declarations and constructor can be inferred from the methods that follow:

import java.util.*;

public abstract class AbstractDecisionTreeNode
{
    // Member declarations inferred from the accessors below.
    private String category = null;
    private String decisionPropertyName = null;
    private HashMap<String, AbstractDecisionTreeNode> children =
        new HashMap<String, AbstractDecisionTreeNode>();

    public AbstractDecisionTreeNode(ExampleSet examples,
            Set<String> selectionProperties)
        throws IllegalArgumentException
    {
        induceTree(examples, selectionProperties);
    }
    public boolean isLeaf()
    {
        return children.isEmpty();
    }

    public String getCategory()
    {
        return category;
    }

    public String getDecisionProperty()
    {
        return decisionPropertyName;
    }

    public AbstractDecisionTreeNode getChild(String propertyValue)
    {
        return children.get(propertyValue);
    }

    public void addChild(String propertyValue,
            AbstractDecisionTreeNode child)
    {
        children.put(propertyValue, child);
    }

    public String categorize(AbstractExample ex)
    {
        // defined below
    }

    public void induceTree(ExampleSet examples,
            Set<String> selectionProperties)
        throws IllegalArgumentException
    {
        // defined below
    }

    public void printTree(int level)
    {
        // implementation left as an exercise
    }
    protected abstract double evaluatePartitionQuality(
            HashMap<String, ExampleSet> part,
            ExampleSet examples)
        throws IllegalArgumentException;

    protected abstract AbstractDecisionTreeNode createChildNode(
            ExampleSet examples,
            Set<String> selectionProperties)
        throws IllegalArgumentException;
}
Note the two abstract methods for evaluating a candidate partition and for
creating a new child node; these will be implemented in Section 27.4.
The categorize method classifies a new example by performing a recursive
tree walk; if it reaches a node with no branch matching the example's value
for the decision property, it returns null.
public String categorize(AbstractExample ex)
{
    // A leaf returns its category (null if the
    // node could not be classified).
    if(children.isEmpty())
        return category;
    if(decisionPropertyName == null)
        return category;
    // Otherwise, follow the branch matching the
    // example's value for the decision property.
    AbstractProperty prop =
        ex.getProperty(decisionPropertyName);
    AbstractDecisionTreeNode child =
        children.get(prop.getValue());
    if(child == null)
        return null;
    return child.categorize(ex);
}
induceTree performs the induction of decision trees. It deals with four
cases. The first is a normal termination: all examples belong to the same
category, so it creates a leaf node of that category. Cases two and three
occur when there is insufficient information to complete a categorization
(an empty example set, or no remaining properties to test); in these cases,
the algorithm creates a leaf node with a null category.
Case four performs the recursive step. It iterates through all properties that
have not yet been used in the decision tree (these are passed in the parameter
selectionProperties), using each property to partition the example
set. It evaluates each candidate partition using the abstract method
evaluatePartitionQuality. Once it finds the best evaluated
partition, it constructs child nodes for each branch.
public void induceTree(ExampleSet examples,
        Set<String> selectionProperties)
    throws IllegalArgumentException
{
    // Case 1: All instances are the same
    // category; the node is a leaf.
    if(examples.getCategories().size() == 1)
    {
        category = examples.getCategories().
            iterator().next();
        return;
    }
    // Case 2: Empty example set. Create
    // leaf with no classification.
    if(examples.isEmpty())
        return;
    // Case 3: Empty property set; could not classify.
    if(selectionProperties.isEmpty())
        return;
    // Case 4: Choose test and build subtrees.
    // Initialize by partitioning on first
    // untried property.
    Iterator<String> iter =
        selectionProperties.iterator();
    String bestPropertyName = iter.next();
    HashMap<String, ExampleSet> bestPartition =
        examples.partition(bestPropertyName);
    double bestPartitionEvaluation =
        evaluatePartitionQuality(bestPartition,
            examples);
    // Iterate through remaining properties.
    while(iter.hasNext())
    {
        String nextProp = iter.next();
        HashMap<String, ExampleSet> nextPart =
            examples.partition(nextProp);
        double nextPartitionEvaluation =
            evaluatePartitionQuality(nextPart,
                examples);
        // Better partition found. Save.
        if(nextPartitionEvaluation >
                bestPartitionEvaluation)
        {
            bestPartitionEvaluation =
                nextPartitionEvaluation;
            bestPartition = nextPart;
            bestPropertyName = nextProp;
        }
    }
    // Create children; recursively build tree.
    this.decisionPropertyName = bestPropertyName;
    Set<String> newSelectionPropSet =
        new HashSet<String>(selectionProperties);
    newSelectionPropSet.remove(decisionPropertyName);
    iter = bestPartition.keySet().iterator();
    while(iter.hasNext())
    {
        String value = iter.next();
        ExampleSet child = bestPartition.get(value);
        children.put(value,
            createChildNode(child, newSelectionPropSet));
    }
}
27.4 ID3: An Information Theoretic Tree Induction Algorithm
The heart of the ID3 algorithm is its use of information theory to evaluate
the quality of candidate partitions of the example set, choosing the
property that gains the most information about an example's
categorization. Luger (2009) discusses this approach in detail, but we will
review it briefly here.
Shannon (1948) developed a mathematical theory of information that
allows us to measure the information content of a message. Widely used in
telecommunications to determine such things as the capacity of a channel,
the optimality of encoding schemes, etc., it is a general theory that we will
use to measure the quality of a decision property.
Shannon's insight was that the information content of a message depends
on two factors: the size of the set of all possible messages, and the
probability of each message occurring. Given a set of possible
messages, M = {m1, m2, ..., mn}, the expected information content of a
message, measured in bits, is the sum across all messages in M of the
probability of each message times the log to the base 2 of that
probability:
I(M) = Σ -p(mi) log2 p(mi)
Applying this to the problem of decision tree induction, we can regard a set
of examples as a set of possible messages about the categorization of an
example. The probability of a message (a given category) is the number of
examples with that category divided by the size of the example set. For
example, in the table in section 27.1, there are 14 examples. Six of the
examples have high risk, so p(risk = high) = 6/14. Similarly, p(risk =
moderate) = 3/14, and p(risk = low) = 5/14. So, the information content
of the example set is:
I(example set) = -6/14 log2(6/14) - 3/14 log2(3/14) - 5/14 log2(5/14)
= -6/14 * (-1.222) - 3/14 * (-2.222) - 5/14 * (-1.485)
= 1.531 bits
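This computation maps directly onto the ExampleSet interface defined earlier. The following is a rough sketch of our own, not the book's listing, assuming only the getSize, getCategories, and getExampleCountByCategory methods shown above:

    static double information(ExampleSet examples)
    {
        // I(M) = sum over categories of -p log2(p), where p is the
        // fraction of the examples falling in each category.
        if(examples.isEmpty())
            return 0.0;
        double info = 0.0;
        double total = examples.getSize();
        for(String category : examples.getCategories())
        {
            double p =
                examples.getExampleCountByCategory(category) / total;
            if(p > 0.0)
                info -= p * (Math.log(p) / Math.log(2.0));
        }
        return info;
    }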
We can think of the recursive tree induction algorithm as gaining
information about the example set at each iteration. If we assume a set of
training instances, C, then a property P with n values partitions C into the
subsets {C1, C2, ..., Cn}. The expected information still needed to classify
an example after testing P is the weighted average of the information in
these subsets:
E(P) = Σ (|Ci| / |C|) I(Ci)
The information gained by testing P is then gain(P) = I(C) - E(P), and ID3
selects, at each step, the property whose partition yields the greatest gain.
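An ID3-specific subclass of AbstractDecisionTreeNode could realize evaluatePartitionQuality as exactly this gain. This is a hedged sketch of one plausible implementation, ours rather than the book's own listing, reusing the information method sketched above:

    protected double evaluatePartitionQuality(
            HashMap<String, ExampleSet> part,
            ExampleSet examples)
        throws IllegalArgumentException
    {
        // gain(P) = I(C) - E(P), where E(P) is the size-weighted
        // average information of the partition's subsets.
        double expected = 0.0;
        double total = examples.getSize();
        for(ExampleSet subset : part.values())
            expected +=
                (subset.getSize() / total) * information(subset);
        return information(examples) - expected;
    }

Because induceTree keeps the partition with the largest evaluation, returning the gain directly causes the most informative property to be selected.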
Exercises
1. Construct two or three different trees that correctly classify the training
examples in the table of section 27.1. Compare their complexity using
average path length from root to leaf as a simple metric. What informal
heuristics would you use in constructing the simplest trees to match the data?
Manually build a tree using the information-theoretic test selection
algorithm of ID3. How does this tree compare with those suggested by your
informal heuristics?
2. Extend the definition of AbstractExample to enforce the
immutable object pattern using AbstractProperty as an example.
3. The classes AbstractExample and AbstractProperty throw
exceptions defined in Java, such as IllegalArgumentException
or UnsupportedOperationException, when passed illegal values
or when implementers try to violate the immutable object pattern. An
alternative approach would use user-defined exceptions, defined as
subclasses of java.lang.RuntimeException. Implement this approach, and
discuss its advantages and disadvantages.
4. The implementation of ExampleSet in section 27.2.3 stores
component examples as a simple vector. This requires iteration over all
examples to partition the example set on a property, count categories, etc.
Redo the implementation using a set of maps to allow constant time
retrieval of examples having a certain property value, category, etc.
Compare the performance of your implementation with that of the one
given in the chapter.
5. Complete the implementation for the credit risk example. This will
involve creating subclasses of AbstractProperty for each property,
and an appropriate subclass of AbstractExample. Also, write a class
and methods to test your code.