DM Witten 03

This chapter surveys the ways knowledge can be represented in machine learning outputs: tables, where the output takes the same form as the input; linear models, where the output is a weighted sum of attribute values; decision trees, which use a divide-and-conquer approach to recursively partition the instance space into regions assigned class labels; classification and association rules, including rules with exceptions; instance-based representations; and clusters.


Output: Knowledge Representation

• There are many different ways for representing the patterns that can be
discovered by machine learning, and…
• each one dictates the kind of technique that can be used to infer that
output structure from data.

• Once you understand how the output is represented, you have come a
long way toward understanding how it can be generated.
Tables
• The simplest, most rudimentary way of representing the output from
machine learning is to make it just the same as the input—a table.
Tables
• Decision table for the weather problem:

  Outlook   Humidity  Play
  Sunny     high      no
  Sunny     normal    yes
  Overcast  high      yes
  Overcast  normal    yes
  Rainy     high      no
  Rainy     normal    yes

• Main problem: selecting the right attributes
Linear Model
• Output is the sum of the attribute values, with weights applied to each
attribute before adding them together.

• The trick is to come up with good values for the weights—ones that
make the model’s output match the desired output.
• The output and the inputs—attribute values—are all numeric.
• Statisticians use the word regression for the process of predicting a
numeric quantity, and regression model is another term for this kind of
linear model.
Linear Model
• Easiest to visualize in two dimensions, where they are tantamount to drawing a straight line through a set of data points.
• Figure shows a line fitted to the CPU performance data described earlier, where only the cache attribute (CACH) is used as input.
• The class attribute, performance (PRP), is shown on the vertical axis, with cache (CACH) on the horizontal axis; both are numeric.
• The straight line represents the “best fit” prediction equation:
  PRP = 37.06 + 2.47 CACH
[Figure: best-fit regression line through the CPU performance data, PRP on the vertical axis vs. CACH on the horizontal axis]
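As a rough illustration of finding good weights, the sketch below fits PRP = w0 + w1 · CACH by ordinary least squares; the (CACH, PRP) pairs are hypothetical stand-ins for the real CPU performance data, not values from the text.

import numpy as np

# Hypothetical (CACH, PRP) pairs standing in for the CPU performance data.
cach = np.array([0.0, 8.0, 32.0, 64.0, 128.0])
prp = np.array([25.0, 40.0, 110.0, 198.0, 340.0])

# Design matrix with a column of ones so the model is PRP = w0 + w1 * CACH.
X = np.column_stack([np.ones_like(cach), cach])

# Ordinary least squares: weights that minimize squared prediction error.
w0, w1 = np.linalg.lstsq(X, prp, rcond=None)[0]
print(f"PRP = {w0:.2f} + {w1:.2f} CACH")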
Linear Model
• Can also be applied to binary classification problems.
• In this case, the line produced by the model separates the two classes: it defines where the decision changes from one class value to the other.
• Such a line is often referred to as the decision boundary.
• Figure shows a decision boundary for the iris data that separates the Iris setosas from the Iris versicolors.
• In this case, the data is plotted using two of the input attributes, petal length and petal width, and the straight line defining the decision boundary is a function of these two attributes.
• Points lying on the line are given by the equation:
  2.0 − 0.5 PETAL-LENGTH − 0.8 PETAL-WIDTH = 0
[Figure: decision boundary for the iris data, petal width on the vertical axis vs. petal length on the horizontal axis]
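A minimal sketch of using the boundary above as a classifier: the sign of 2.0 − 0.5 · PETAL-LENGTH − 0.8 · PETAL-WIDTH decides the class. Which side of the line corresponds to which species is an assumption made here for illustration.

def classify_iris(petal_length: float, petal_width: float) -> str:
    # Score relative to the decision boundary
    # 2.0 - 0.5*petal_length - 0.8*petal_width = 0.
    score = 2.0 - 0.5 * petal_length - 0.8 * petal_width
    # Assumption: positive side = Iris setosa (small petals), negative = Iris versicolor.
    return "Iris-setosa" if score > 0 else "Iris-versicolor"

print(classify_iris(1.4, 0.2))   # small petals, positive score -> Iris-setosa
print(classify_iris(4.5, 1.5))   # larger petals, negative score -> Iris-versicolor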
Decision trees
• “Divide-and-conquer” approach produces tree

• Nodes involve testing a particular attribute


• Usually, attribute value is compared to constant
• Other possibilities:
▫ Comparing values of two attributes
▫ Using a function of one or more attributes

• Leaves assign classification, set of classifications, or probability distribution to instances


• To classify an unknown instance, it is routed down the tree according to the values of
the attributes tested in successive nodes.
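A minimal sketch of routing an instance down a decision tree stored as nested dictionaries; the particular tree and attribute names are illustrative (based on the weather problem), not taken from a figure in the text.

# Each internal node tests one attribute; leaves hold a class label.
weather_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"attribute": "windy",
                  "branches": {True: "no", False: "yes"}},
    },
}

def classify(node, instance):
    # Follow the branch matching the instance's attribute value until a leaf is reached.
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node

print(classify(weather_tree, {"outlook": "sunny", "humidity": "normal", "windy": False}))  # yes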
Decision trees – Nominal Attributes
• The number of children is usually the number of possible values of the
attribute.
• In this case, because there is one branch for each possible value, the
same attribute will not be retested further down the tree.
Decision trees – Numeric Attributes
• The test at a node usually determines whether its value is greater or less
than a predetermined constant, giving a two-way split.
• Alternatively, a three-way split may be used, in which case there are
several different possibilities.
Decision trees – Missing Values
• If a missing value is treated as an attribute value in its own right, that will create a third branch (in addition to the two-way split on a numeric attribute).
▫ An alternative for an integer-valued attribute would be a three-way split
into less than, equal to, and greater than.
▫ An alternative for a real-valued attribute, for which equal to is not such a
meaningful option, would be to test against an interval rather than a single
constant, again giving a three-way split: below, within, and above.
▫ A numeric attribute is often tested several times in any given path down
the tree from root to leaf, each test involving a different constant
Rules
• Classification Rules
• Association Rules
• Rules with Exceptions
Classification Rules
• Popular alternative to decision trees
• Antecedent (pre-condition) of a rule: a series of tests (just like the tests
at the nodes of a decision tree)
▫ Tests are usually logically ANDed together (but may also be general logical
expressions)
• Consequent (conclusion): classes, set of classes, or probability
distribution assigned by rule
• Individual rules are often logically ORed together
• Conflicts arise if different conclusions apply
Classification Rules – From Trees to Rules
• Easy: converting a tree into a set of rules
▫ One rule for each leaf:
 Antecedent contains a condition for every node on the path from the root to
the leaf
 Consequent is class assigned by the leaf
• Produces rules that are unambiguous
▫ Doesn’t matter in which order they are executed
• But: resulting rules are unnecessarily complex
▫ Pruning to remove redundant tests/rules
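A sketch of the leaf-to-rule conversion described above, using the same nested-dictionary tree format as the earlier routing sketch: one rule is produced per leaf, with the tests on the path from root to leaf ANDed into the antecedent.

def tree_to_rules(node, conditions=()):
    # Yield one (antecedent, class) pair per leaf of the tree.
    if not isinstance(node, dict):
        yield list(conditions), node
        return
    for value, child in node["branches"].items():
        yield from tree_to_rules(child, conditions + ((node["attribute"], value),))

# Small illustrative tree (weather-style attributes).
tree = {"attribute": "outlook",
        "branches": {"sunny": {"attribute": "humidity",
                               "branches": {"high": "no", "normal": "yes"}},
                     "overcast": "yes"}}

for antecedent, label in tree_to_rules(tree):
    tests = " and ".join(f"{attr} = {val}" for attr, val in antecedent)
    print(f"If {tests} then play = {label}")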
Classification Rules – From Trees to Rules
• More difficult: transforming a rule set into a tree
▫ Tree cannot easily express disjunction between rules
• Example: rules which test different attributes

If a and b then x
If c and d then x
• Symmetry needs to be broken
• Corresponding tree contains identical subtrees
(known as “replicated subtree problem”)
Classification Rules – From Trees to Rules
If a and b then x
If c and d then x

The exclusive-or problem:
If x = 1 and y = 0 then class = a
If x = 0 and y = 1 then class = a
If x = 0 and y = 0 then class = b
If x = 1 and y = 1 then class = b

Corresponding tree:
x = 1?
├─ NO → y = 1?
│        ├─ NO → b
│        └─ YES → a
└─ YES → y = 1?
         ├─ NO → a
         └─ YES → b
Write Rules for this tree
[Figure: a tree whose root tests x with branches 1, 2, 3; further nodes test y and w (each with branches 1, 2, 3), and the leaves are labelled a and b]
Classification Rules – From Trees to Rules
If x = 1 and y = 1 then class = a
If z = 1 and w = 1 then class = a
Otherwise class = b
Executing a Rule Set
• Two ways of executing a rule set:
▫ Ordered set of rules (“decision list”)
 Order is important for interpretation
▫ Unordered set of rules
 Rules may overlap and lead to different conclusions for the
same instance
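A minimal sketch of executing an ordered rule set (a decision list): rules are tried top to bottom and the first one whose tests all hold determines the class, which is why the order matters for interpretation. The rules themselves are made up for illustration.

# Each rule is (list of (attribute, value) tests, conclusion); order matters.
rules = [
    ([("outlook", "overcast")], "yes"),
    ([("humidity", "high")], "no"),
    ([], "yes"),                          # empty antecedent acts as a default rule
]

def classify(instance, rules):
    for tests, conclusion in rules:
        if all(instance.get(attr) == val for attr, val in tests):
            return conclusion             # first matching rule decides
    return None                           # no rule applied

print(classify({"outlook": "sunny", "humidity": "high"}, rules))   # -> no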
Executing a Rule Set - Problems
• What if two or more rules conflict?
▫ Give no conclusion at all?
▫ Go with rule that is most popular on training data?
▫…
• What if no rule applies to a test instance?
▫ Give no conclusion at all?
▫ Go with class that is most frequent in training data?
▫…
Executing a Rule Set – Boolean Class
• Assumption: if instance does not belong to class “yes”, it belongs to class “no”
• Trick: only learn rules for class “yes” and use default rule for “no”

If x = 1 and y = 1 then class = a


If z = 1 and w = 1 then class = a
Otherwise class = b

• Order of rules is not important. No conflicts!


• Rule can be written in disjunctive normal form
Rules
• Machine learning algorithms that generate rules invariably produce ordered rule sets in multiclass situations, and this sacrifices any possibility of modularity because the order of execution is critical
Association Rules
• Association rules…
▫ can predict any attribute and combinations of attributes
▫ are not intended to be used together as a set

• Problem: immense number of possible associations


▫ Output needs to be restricted to show only the most predictive
associations => only those with high support and high confidence
Support and Confidence of a rule
• Support: number of instances predicted correctly
• Confidence: number of correct predictions, as proportion of all
instances that rule applies to
• Example: the rule “If temperature = cool then humidity = normal” covers 4 cool days, all with normal humidity:
  Support = 4, confidence = 100%

• Normally: minimum support and confidence are pre-specified (e.g., 58 rules with support ≥ 2 and confidence ≥ 95% for the weather data)
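A rough sketch of computing support and confidence as defined above: support counts the instances the rule predicts correctly, and confidence divides that count by the number of instances the antecedent covers. The tiny data set is made up for illustration.

def support_and_confidence(instances, antecedent, consequent):
    # antecedent and consequent are dicts mapping attribute -> required value.
    covered = [x for x in instances
               if all(x.get(a) == v for a, v in antecedent.items())]
    correct = [x for x in covered
               if all(x.get(a) == v for a, v in consequent.items())]
    support = len(correct)
    confidence = len(correct) / len(covered) if covered else 0.0
    return support, confidence

data = [{"temperature": "cool", "humidity": "normal"},
        {"temperature": "cool", "humidity": "normal"},
        {"temperature": "hot", "humidity": "high"}]
print(support_and_confidence(data, {"temperature": "cool"}, {"humidity": "normal"}))
# -> (2, 1.0)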
Interpreting Association Rules
• Interpretation is not obvious:
If windy = false and play = no then outlook = sunny and
humidity = high

is not the same as


If windy = false and play = no then outlook = sunny
If windy = false and play = no then humidity = high

• It means that the following also holds:


If humidity = high and windy = false and play = no
then outlook = sunny
Interpreting Association Rules
• To reduce the number of rules that are produced, in cases where several
rules are related it makes sense to present only the strongest one to the
user.

• In the previous example, only the first rule should be printed


Rules with Exception (for Classification Rules)
• For classification rules, a natural extension is to allow them to have
exceptions
• Then incremental modifications can be made to a rule set by expressing
exceptions to existing rules rather than reengineering the entire set.
Rules with Exception (for Classification Rules)
• For example, for the iris problem: suppose a new flower was found with the dimensions given in the table below, and an expert declared it to be an instance of Iris setosa.

  Sepal Length  Sepal Width  Petal Length  Petal Width  Type
  5.1           3.5          2.6           0.2          ?

• If this flower was classified by the rules given earlier for this problem, it would be misclassified by two of them.

Irises: Rule Set

  Sepal Length  Sepal Width  Petal Length  Petal Width  Type
  5.1           3.5          2.6           0.2          Iris setosa
• If petal-length < 2.45 then Iris-setosa
• If sepal-width < 2.10 then Iris-versicolor
• If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor
• If sepal-width < 2.95 and petal-width < 1.35 then Iris-versicolor
• If petal-length ≥ 2.45 and petal-length < 4.45 then Iris-versicolor
• If sepal-length ≥ 5.85 and petal-length < 4.75 then Iris-versicolor
• If sepal-width < 2.55 and petal-length < 4.95 and petal-width < 1.55 then Iris-versicolor
• If petal-length ≥ 2.45 and petal-length < 4.95 and petal-width < 1.55 then Iris-versicolor
• If sepal-length ≥ 6.55 and petal-length < 5.05 then Iris-versicolor
• If sepal-width < 2.75 and petal-width < 1.65 and sepal-length < 6.05 then Iris-versicolor
• If sepal-length ≥ 5.85 and sepal-length < 5.95 and petal-length < 4.85 then Iris-versicolor
• If petal-length ≥ 5.15 then Iris-virginica
• If petal-width ≥ 1.85 then Iris-virginica
• If petal-width ≥ 1.75 and sepal-width < 3.05 then Iris-virginica
• If petal-length ≥ 4.95 and petal-width < 1.55 then Iris-virginica
Rules with Exception (for Classification Rules)
• These rules require modification so that the new instance can be treated correctly.

  Sepal Length  Sepal Width  Petal Length  Petal Width  Type
  5.1           3.5          2.6           0.2          Iris setosa

• However, simply changing the bounds for the attribute–value tests in these rules may not suffice because the instances used to create the rule set may then be misclassified.
• Fixing up a rule set is not as simple as it sounds.
Rules with Exception (for Classification Rules)
• Instead of changing the tests in the existing rules, an expert might be consulted to explain why the new flower violates them, giving explanations that could be used to extend the relevant rules only.

  Sepal Length  Sepal Width  Petal Length  Petal Width  Type
  5.1           3.5          2.6           0.2          Iris setosa

• For example, the first of these two rules misclassifies the new Iris setosa as an instance of the genus Iris versicolor.
• Instead of altering the bounds on any of the inequalities in the rule, an exception can be made based on some other attribute:

  If petal-length ≥ 2.45 and petal-length < 4.45
  then Iris-versicolor
  EXCEPT if petal-width < 1.0 then Iris-setosa
Rules with Exception (Iris Example)
Default: Iris-setosa
except if petal-length >= 2.45 and petal-length < 5.355
and petal-width < 1.75
then Iris-versicolor
except if petal-length >= 4.95 and petal-width < 1.55
then Iris-virginica
else if sepal-length < 4.95 and sepal-width >= 2.45
then Iris-virginica
else if petal-length >= 3.35
then Iris-virginica
except if petal-length < 4.85 and sepal-length < 5.95
then Iris-versicolor
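A minimal sketch of how the first two levels of the exception structure above might be evaluated: an outer rule's conclusion stands unless one of its exceptions fires, and exceptions can themselves have exceptions. Only part of the structure is translated; the deeper "else if" branches are omitted.

def classify_iris(petal_length, petal_width):
    label = "Iris-setosa"                        # default conclusion
    if 2.45 <= petal_length < 5.355 and petal_width < 1.75:
        label = "Iris-versicolor"                # exception to the default
        if petal_length >= 4.95 and petal_width < 1.55:
            label = "Iris-virginica"             # exception to the exception
    return label

print(classify_iris(1.4, 0.2))   # no exception fires -> Iris-setosa
print(classify_iris(5.0, 1.4))   # inner exception fires -> Iris-virginica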
Advantages of using exceptions
• Rules can be updated incrementally
▫ Easy to incorporate new data
▫ Easy to incorporate domain knowledge
• People often think in terms of exceptions
• Each conclusion can be considered just in the context of rules and
exceptions that lead to it
▫ Locality property is important for understanding large rule sets
▫ “Normal” rule sets don’t offer this advantage
Instance Based Representation
• Instead of trying to create rules, work directly from the examples
themselves. This is known as instance-based learning.

• The instance-based knowledge representation uses the instances


themselves to represent what is learned, rather than inferring a rule set
or decision tree and storing it instead
Instance Based Representation
• In instance-based learning, all the real work is done when the time
comes to classify a new instance rather than when the training set is
processed.
• In a sense, then, the difference between this method and the others
that we have seen is the time at which the “learning” takes place
• Instance-based learning is lazy, deferring the real work as long as
possible, whereas other methods are eager, producing a generalization
as soon as the data has been seen.
Instance Based Representation
• In instance-based classification, each new instance is compared with
existing ones using a distance metric, and the closest existing instance is
used to assign the class to the new one. This is called the nearest-
neighbor classification method.
• Sometimes more than one nearest neighbor is used, and the majority
class of the closest k neighbors (or the distance-weighted average if the
class is numeric) is assigned to the new instance. This is termed the k-
nearest-neighbor method.
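A minimal sketch of the k-nearest-neighbor method: Euclidean distance to the stored instances, then a majority vote among the k closest. The tiny training set of (petal length, petal width) pairs is made up for illustration.

from collections import Counter
import math

def knn_classify(query, training, k=3):
    # training is a list of (feature_vector, label) pairs.
    neighbours = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    # Majority class among the k closest stored instances.
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

training = [((1.4, 0.2), "setosa"), ((1.3, 0.2), "setosa"),
            ((4.5, 1.5), "versicolor"), ((4.1, 1.3), "versicolor")]
print(knn_classify((1.5, 0.3), training, k=3))   # closest neighbours -> setosa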
Instance Based Representation – Key Problems
• Identifying attributes that are important
• Deciding what instances to save and which to discard
Instance Based Representation
• They do not make explicit the structures that are learned.
• The instances combine with the distance metric to carve out boundaries in instance space that distinguish one class from another, and this is a kind of explicit representation of knowledge.
• For example, given a single instance of each of two classes, the nearest-neighbor rule effectively splits the instance space along the perpendicular bisector of the line joining the instances.
[Figure: two instances of different classes; the perpendicular bisector of the line joining them divides the instance space]
Instance Based Representation
• Given several instances of each class,
the space is divided by a set of lines that
represent the perpendicular bisectors of
selected lines joining an instance of one class
to one of another class.

• Figure illustrates a nine-sided polygon that


separates the filled-circle class from the open-
circle class.
• This polygon is implicit in the operation of the
nearest-neighbor rule.
Instance Based Representation
• When training instances are discarded, the
result is to save just a few critical examples of
each class.
• Figure shows only the examples that actually
get used in nearest-neighbor decisions: The
others (the light-gray ones) can be discarded
without affecting the result.
Instance Based Representation – slight danger
• It is worth pointing out a slight danger to the technique of visualizing instance-based learning in terms of boundaries in example space:
• It makes the implicit assumption that attributes are numeric rather than nominal.
• If the various values that a nominal attribute can take on were laid out along a line,
generalizations involving a segment of that line would make no sense:
• Each test involves either one value for the attribute or all values for it (or perhaps an
arbitrary subset of values).
• Although you can more or less easily imagine extending the examples in Figures above to
several dimensions, it is much harder to imagine how rules involving nominal attributes will
look in multidimensional instance space.
• Many machine learning situations involve numerous attributes, and our intuitions tend to
lead us astray when extended to high-dimensional spaces.
Clusters
• When a cluster rather than a classifier is
learned, the output takes the form of a
diagram that shows how the instances fall into
clusters.

• In the simplest case this involves associating a


cluster number with each instance, which
might be depicted by laying the instances out
in two dimensions and partitioning the space
to show each cluster, as illustrated in Figure
Clusters
• Some clustering algorithms allow one instance
to belong to more than one cluster, so the
diagram might lay the instances out in two
dimensions and draw overlapping subsets
representing each cluster—a Venn diagram, as
in Figure
Clusters
• Some algorithms associate instances with
clusters probabilistically rather than
categorically.
• In this case, for every instance there is a
probability or degree of membership with
which it belongs to each of the clusters. This is
shown in Table.
• This particular association is meant to be a
probabilistic one, so the numbers for each
example sum to 1—although that is not always
the case.
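A rough sketch of producing the kind of probabilistic membership table described above: each instance receives a degree of membership in every cluster, normalized so the numbers sum to 1. The inverse-squared-distance weighting to assumed cluster centres is just one illustrative way to obtain such numbers.

import math

def soft_memberships(instance, centres):
    # One membership value per cluster centre, normalized to sum to 1.
    weights = [1.0 / (math.dist(instance, c) ** 2 + 1e-9) for c in centres]
    total = sum(weights)
    return [w / total for w in weights]

centres = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]   # hypothetical cluster centres
memberships = soft_memberships((2.0, 2.0), centres)
print([round(m, 3) for m in memberships], "sum =", round(sum(memberships), 3))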
Clusters
• Other algorithms produce a hierarchical
structure of clusters so that at the top level
the instance space divides into just a few
clusters,
• each of which divides into its own subcluster
at the next level down, and so on.
• In this case a diagram such as the one in
Figure is used, in which elements joined
together at lower levels are more tightly
clustered than ones joined together at higher
levels.
• Such diagrams are called dendrograms
Clusters
• Clustering is often followed by a stage in which
a decision tree or rule set is inferred that
allocates each instance to the cluster in which
it belongs.
• Then, the clustering operation is just one step
on the way to a structural description.
