AI Book3
Prentice-Hall of India Private Limited
New Delhi-110 001
2002
This Thirteenth Indian Reprint: Rs. ,50.00
(Original U.S. Edition: Rs. 3017.00)
© 1990 by Prentice-Hall, Inc., Englewood Cliffs, N.J., U.S.A. All rights reserved. No part of this book
may be reproduced in any form, by mimeograph or any other means, without permission in writing
from the publisher.
The author and publisher of this book have used their best efforts in preparing this book. These efforts include
the development, research, and testing of the theories and programs to determine their effectiveness. The author
and publisher make no warranty of any kind, expressed or implied, with regard to these programs or the
documentation contained in this book. The author and publisher shall not be liable in any event for incidental or
consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.
Personal Consultant and Personal Consultant Plus are Registered Trademarks of Texas Instruments.
Rulemaster is a Registered Trademark.
Kee is a Registered Trademark.
ISBN-81-203-0777-1
The export rights of this book are vested solely with the publisher.
This Eastern Economy Edition is the authorized, complete and unabridged photo-offset reproduction
of the latest American edition specially published and priced for sale only in Bangladesh, Burma,
Cambodia, China, Fiji, Hong Kong, India, Indonesia, Laos, Malaysia, Nepal, Pakistan, Philippines,
Singapore, South Korea, Sri Lanka, Taiwan, Thailand, and Vietnam.
Reprinted in India by special arrangement with Prentice-Hall, Inc., Englewood Cliffs, N.J., U.S.A.
Published by Asoke K. Ghosh, Prentice-Hall of India Private Limited, M-97, Connaught Circus,
New Delhi-110001 and Printed by Mohan Makhijani at Rekha Printers Private Limited,
New Delhi-110020.
To Neslie
for her generous love and encouragement
Contents
PREFACE
4.10 Summary
Exercises
5 DEALING WITH INCONSISTENCIES AND UNCERTAINTIES
5.1 Introduction
5.2 Truth Maintenance Systems
5.3 Default Reasoning and the Closed World Assumption
5.4 Predicate Completion and Circumscription
5.5 Modal and Temporal Logics
5.6 Fuzzy Logic and Natural Language Computations
5.7 Summary
Exercises
6 PROBABILISTIC REASONING
7.5 Summary
Exercises
10.8 Summary
Exercises
18 LEARNING BY INDUCTION
18.1 Introduction
18.2 Basic Concepts
18.3 Some Definitions
18.4 Generalization and Specialization
18.5 Inductive Bias
18.6 Example of an Inductive Learner
18.7 Summary
Exercises
19 EXAMPLES OF OTHER INDUCTIVE LEARNERS
19.1 Introduction
19.2 The ID3 System
19.3 The LEX System
19.4 The INDUCE System
19.5 Learning Structure Concepts
19.6 Summary
Exercises
REFERENCES
INDEX
Preface
A major turning point occurred in the field of artificial intelligence with the realization
that "in knowledge lies the power." This realization led to the development of a
new class of systems: knowledge-based systems. Knowledge-based systems use
specialized sets of coded knowledge to "reason" and perform limited intelligent tasks.
This is in contrast with more conventional programs, which rely on data and
general algorithms (weak methods) to solve less intelligent tasks. Knowledge-based
systems proved to be much more successful than the earlier, more general problem
solving systems. They proved to be more effective in most areas of AI including
computer vision, natural language understanding, planning, and problem solving
using the newly developed rule-based expert systems.
In concert with the knowledge-base theme, this book is mainly about knowledge
and the role it plays in creating effective AI programs. It focuses on all aspects
of knowledge: knowledge representation methods, knowledge acquisition
techniques, knowledge organization, and knowledge manipulation. It illustrates the basic
knowledge-system approach and emphasizes the important use of knowledge in
such systems.
This book was written as a text for my classes in artificial intelligence at the
University of Texas at El Paso. These classes are for upper division undergraduate
and first year graduate students. The courses assume prerequisites of basic computer
science courses (like programming languages) and a general maturity in mathematics.
with the notion of associative networks, conceptual graphs, and frames. Chapter 8
completes Part II with an introduction to systems which are based on object oriented
representation structures.
Part III covers topics related to the organization and manipulation of knowledge.
This part contains three chapters. Chapter 9 discusses the important problems
associated with search and control. Chapter 10 presents a comprehensive treatment of
matching techniques, an essential function of most AI programs. This part concludes
with Chapter 11, which covers memory organization and management techniques.
Part IV contains three chapters related to perception and communication. The
first chapter, Chapter 12, covers the subfield of natural language processing. Although
only a single chapter has been devoted to this subject, the treatment is thorough.
Chapter 13 presents a condensation of important topics from pattern recognition.
Chapter 14 presents a comprehensive treatment of the important topic of computer
vision. Finally, Chapter 15 has an introduction to expert system architectures and
related topics.
Part V, the final section, presents an up-to-date, comprehensive view of knowledge
acquisition/machine learning. All of the important learning paradigms are
covered in this part. Chapter 16 begins with general concepts related to knowledge
acquisition. This is followed in Chapter 17 with a summary of early work in machine
learning. Chapter 18 introduces inductive learning concepts and presents a detailed
example of an inductive learning system. Chapter 19 continues inductive learning
with examples of recent systems. Chapter 20, the final chapter, covers analogical
and explanation-based learning paradigms.
We hope the reader will experience many enjoyable and rewarding sessions
reading from the exciting material to be found in the text.
ACKNOWLEDGMENTS
In writing this text, a number of individuals have been helpful with their suggestions
and comments. They include the following students: Teow Kiang Chew, Teck Huat
Goh, Julie Lemen, Sergio Felix, Ricardo Martinez, Vincente Fresquez, Hun-Ming
Hsu, Rudy Velasquez, and Jose Najera-Mora. Special thanks are given to E. Louise
(Neslie) Patterson for proofreading most of the manuscript and offering many useful
suggestions. Thanks are also given to the following reviewers for their valuable
suggestions: Christopher K. Carlson, George Mason University; Daniel Chester,
University of Delaware; Karen L. McGraw, Cognitive Technologies; and Gordon
Novak, University of Texas, Austin. Finally, I wish to thank the Electrical Engineer-
ing and Computer Science Department of the University of Texas at El Paso for
the generous use of their facilities.
PART 1
Introduction to Artificial Intelligence
Overview of Artificial Intelligence
AI is a branch of computer science concerned with the study and creation of computer
systems that exhibit some form of intelligence: systems that learn new concepts and
tasks, systems that can reason and draw useful conclusions about the world around
us, systems that can understand a natural language or perceive and comprehend a
visual scene, and systems that perform other types of feats that require human types
of intelligence.
Like other definitions of complex topics, an understanding of AI requires an
understanding of related terms, such as intelligence, knowledge, reasoning, thought,
cognition, learning, and a number of computer-related terms. While we lack precise
scientific definitions for many of these terms, we can give general definitions of
them. And, of course, one of the objectives of this text is to impart special meaning
to all of the terms related to AI, including their operational meanings.
Dictionaries define intelligence as the ability to acquire, understand and apply
knowledge, or the ability to exercise thought and reason. Of course, intelligence is
more than this. It embodies all of the knowledge and feats, both conscious and
unconscious, which we have acquired through study and experience: highly refined
sight and sound perception; thought; imagination; the ability to converse, read,
write, drive a car, memorize and recall facts, express and feel emotions; and much
more.
Intelligence is the integrated sum of those feats which gives us the ability to
remember a face not seen for thirty or more years, or to build and send rockets to
the moon. It is those capabilities which set Homo sapiens apart from other forms
of living things. And, as we shall see, the food for this intelligence is knowledge.
Can we ever expect to build systems which exhibit these characteristics? The
answer to this question is yes! Systems have already been developed to perform
many types of intelligent tasks, and expectations are high for near term development
of even more impressive systems. We now have systems which can learn from
examples, from being told, from past related experiences, and through reasoning.
We have systems which can solve complex problems in mathematics, in scheduling
many diverse tasks, in finding optimal system configurations, in planning complex
strategies for the military and for business, in diagnosing medical diseases and
other complex systems, to name a few. We have systems which can "understand"
large parts of natural languages. We have systems which can see well enough to
"recognize" objects from photographs, video cameras, and other sensors. We have
systems which can reason with incomplete and uncertain facts. Clearly, with these
developments, much has been accomplished since the advent of the digital computer.
In spite of these impressive achievements, we still have not been able to
produce coordinated, autonomous systems which possess some of the basic abilities
of a three-year-old child. These include the ability to recognize and remember
numerous diverse objects in a scene, to learn new sounds and associate them with objects
and concepts, and to adapt readily to many diverse new situations. These are the
challenges now facing researchers in AI. And they are not easy ones. They will
require important breakthroughs before we can expect to equal the performance of
our three-year-old.
To gain a better understanding of AI, it is also useful to know what AI is
not. AI is not the study and creation of conventional computer systems. Even though
one can argue that all programs exhibit some degree of intelligence, an AI program
will go beyond this in demonstrating a high level of intelligence to a degree that
equals or exceeds the intelligence required of a human in performing some task.
AI is not the study of the mind, nor of the body, nor of languages, as customarily
found in the fields of psychology, physiology, cognitive science, or linguistics. To
be sure, there is some overlap between these fields and AI. All seek a better
understanding of human intelligence and sensing processes. But in AI the goal is to
develop working computer systems that are truly capable of performing tasks that
require high levels of intelligence. The programs are not necessarily meant to imitate
human senses and thought processes. Indeed, in performing some tasks differently,
they may actually exceed human abilities. The important point is that the systems
all be capable of performing intelligent tasks effectively and efficiently.
Finally, a better understanding of AI is gained by looking at the component
areas of study that make up the whole. These include such topics as robotics,
memory organization, knowledge representation, storage and recall, learning models,
inference techniques, commonsense reasoning, dealing with uncertainty in reasoning
and decision making, understanding natural language, pattern recognition and machine
vision methods, search and matching, speech recognition and synthesis, and a variety
of AI tools.
How much success have we realized in AI to date? What are the next big
challenges? The answers to these questions form a large part of the material covered
in this text. We shall be studying many topics which bear directly or indirectly on
these questions in the following chapters. We only mention here that AI is coming
of age: practical commercial products are now available, including a variety
of robotic devices, vision systems that recognize shapes and objects, expert systems
that perform many difficult tasks as well as or better than their human expert
counterparts, intelligent instruction systems that help pace a student's learning and monitor
the student's progress, "intelligent" editors that assist users in building special
knowledge bases, and systems which can learn to improve their performance.
1.2 THE IMPORTANCE OF AI
during the late 1970s. Leaders in those countries who recognized the potential for
AI were willing to seek approval for long-term commitments for the resources needed
to fund intensive research programs in AI. The Japanese were the first to demonstrate
their commitment. They launched a very ambitious program in AI research and
development. Known as the Fifth Generation, this plan was officially announced
in October 1981. It calls for the implementation of a ten-year plan to develop
intelligent supercomputers. It is a cooperative effort between government and private
companies having an interest in the manufacture of computer products, robotics,
and related fields. With a combined budget of about one billion dollars, the Japanese
are determined they will realize many of their goals, namely, to produce systems
that can converse in a natural language, understand speech and visual scenes, learn
and refine their knowledge, make decisions, and exhibit other human traits. If they
succeed, and many experts feel they will, their success as a leading economic power
is assured.
Following the Japanese, other leading countries of the world have announced
plans for some form of AI program. The British initiated a plan called the Alvey
Project with a respectable budget. Their goals are not as ambitious as the Japanese
but are set to help the British keep abreast and remain in the race. The European
Common Market countries have jointly initiated a separate cooperative plan named
the ESPRIT program. The French too have their own plan. Other countries including
Canada, the Soviet Union, Italy, Austria, and even the Irish Republic and Singapore
have made some commitments in funded research and development.
The United States, although well aware of the possible consequences, has
made no formal plan. However, steps have been taken by some organizations to
push forward in AI research. First, there was the formation of a consortium of
private companies in 1983 to develop advanced technologies that apply AI techniques
(like VLSI). The consortium is known as the Microelectronics and Computer
Technology Corporation (MCC) and is headquartered in Austin, Texas. Second, the
Department of Defense Advanced Research Projects Agency (DARPA) has increased its
funding for research in AI, including development support in three significant
programs: (1) development of an autonomous land vehicle (ALV) (a driverless military
vehicle), (2) the development of a pilot's associate (an expert system which provides
assistance to fighter pilots), and (3) the Strategic Computing Program (an AI-based
military supercomputer project). In addition, most of the larger high-tech companies
such as IBM, DEC, AT&T, Hewlett Packard, Texas Instruments, and Xerox have
their own research programs. A number of smaller companies also have reputable
research programs.
Who will emerge as the principal leaders in this race for superiority in the
production and sale of that commodity known as knowledge? If forward vision
and commitment to purpose are to be the determining factors, then surely the Japanese
will be among the leaders if not the leader.
Just how the United States and other leading countries of the world will fare
remains to be seen. One thing is clear. The future of a country is closely tied to
the commitment it is willing to make in funding research programs in AI.
1.3 EARLY WORK IN AI
As noted above, AI began to emerge as a separate field of study during the 1940s
and 1950s when the computer became a commercial reality. Prior to this time, a
number of important areas of research that would later help to shape early AI work
were beginning to mature. These developments all began to converge during this
period. First, there was the work of logicians such as Alonzo Church, Kurt Gödel,
Emil Post, and Alan Turing. They were carrying on earlier work in logic initiated
by Whitehead and Russell, Tarski, and Kleene. This work began in earnest during
the 1920s and 1930s. It helped to produce formalized methods for reasoning, the
form of logic known as propositional and predicate calculus. It demonstrated that
facts and ideas from a language such as English could be formally described and
manipulated mechanically in meaningful ways. Turing, sometimes regarded as the
father of AI, also demonstrated, as early as 1936, that a simple computer processor
(later named the Turing machine) could manipulate symbols as well as numbers.
Second, the new field of cybernetics, a name coined by Norbert Wiener,
brought together many parallels between human and machine. Cybernetics, the study
of communication in human and machine, became an active area of research during
the 1940s and 1950s. It combined concepts from information theory, feedback control
systems (both biological and machine), and electronic computers.
Third came the new developments being made in formal grammars. This work
was an outgrowth of logic during the early 1900s. It helped to provide new approaches
to language theories in the general field of linguistics.
Finally, during the 1950s, the electronic stored program digital computer became
a commercial reality. This followed several years of prototype systems including
the Mark I Harvard relay computer (1944), the University of Pennsylvania Moore
School of Electrical Engineering's ENIAC electronic computer (1947), and
subsequent development of the Aberdeen Proving Ground's EDVAC and Sperry-Rand's
UNIVAC.
Other important developments during this early period which helped to launch
AI include the introduction of information theory due largely to the work of Claude
Shannon, neurological theories and models of the brain which were originated by
psychologists, as well as the introduction of Boolean algebra, switching theory,
and even statistical decision theory.
Of course, AI is not just the product of this century. Much groundwork had
been laid by earlier researchers dating back several hundred years. Names like
Aristotle, Leibnitz, Babbage, Hollerith, and many others also played important roles
in building a foundation that eventually led to what we now know as AI.
During the 1950s several events occurred which marked the real beginning of AI.
This was a period noted for the chess playing programs which were developed by
researchers like Claude Shannon at MIT (Shannon, 1952, 1955) and Allen Newell
at the RAND Corporation (Newell and Simon, 1972). Other types of game playing
and simulation programs were also being developed during this time. Much effort
was being expended on machine translation programs, and there was much optimism
for successful language translation using computers (Weaver, 1955). It was felt
that the storage of large dictionaries in a computer was basically all that was needed
to produce accurate translations from one language to another. Although this approach
proved to be too simplistic, it took several years before such efforts were aborted.
The mid-1950s are generally recognized as the official birth date of AI when
a summer workshop sponsored by IBM was held at Dartmouth College. Attendees
at this June 1956 seminar included several of the early pioneers in AI
including Herbert Gelernter, Trenchard More, John McCarthy, Marvin Minsky, Allen
Newell, Nat Rochester, Oliver Selfridge, Claude Shannon, Herbert Simon, and Ray
Solomonoff (Newell and Simon, 1972). Much of their discussion focused on
the work they were involved in during this period, namely automatic theorem
proving and new programming languages.
Between 1956 and 1957 the Logic Theorist, one of the first programs for
automatic theorem proving, was completed by Newell, Shaw, and Simon (Newell
and Simon, 1972). As part of this development, the first list-processing language
called IPL (Information Processing Language) was also completed. Other important
events of this period include the development of FORTRAN (begun in 1954) and
Noam Chomsky's work between 1955 and 1957 on the theory of generative grammars.
Chomsky's work had a strong influence on AI in the area of computational linguistics
or natural language processing.
Important events of the late 1950s were centered around pattern recognition
and self-adapting systems. During this period Rosenblatt's perceptrons (Rosenblatt,
1958) were receiving much attention. Perceptrons are types of pattern recognition
devices that have a simple learning ability based on linear threshold logic (described
in detail in Chapter 17). This same period (1958) marked the beginning of the
development of LISP by John McCarthy, one of the recognized programming
languages of AI. It also marked the formation of the Massachusetts Institute of
Technology's AI laboratory. Several important programming projects were also begun during
the late 1950s, including the General Problem Solver (GPS) developed by Newell,
Shaw, and Simon (Ernst and Newell, 1969) written in IPL, Gelernter's geometry
theorem-proving machine written in FORTRAN at the IBM Research Center, and
the Elementary Perceiver and Memorizer (EPAM) developed by Edward Feigenbaum
and Herbert Simon and written in IPL.
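To make the phrase "a simple learning ability based on linear threshold logic" concrete, the following is a minimal sketch in Python, not the formulation given later in the text. It assumes a single threshold unit trained with the standard error-correction rule on the logical AND function. The names step, train_perceptron, and AND_SAMPLES, as well as the learning rate and epoch limit, are illustrative choices only.

```python
def step(weights, bias, inputs):
    """Linear threshold unit: output 1 when the weighted sum exceeds zero."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, rate=1.0):
    """Error-correction learning: adjust weights only on misclassified samples."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        errors = 0
        for inputs, target in samples:
            delta = target - step(weights, bias, inputs)
            if delta != 0:
                errors += 1
                # Move the decision boundary toward the misclassified sample.
                weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
                bias += rate * delta
        if errors == 0:  # every sample classified correctly; stop early
            break
    return weights, bias


# Logical AND is linearly separable, so a single unit can learn it.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
predictions = [step(weights, bias, x) for x, _ in AND_SAMPLES]
```

After training, predictions agrees with the target column of AND_SAMPLES. A single unit of this kind can only separate classes with a straight line (or hyperplane), which is why the learning ability is described as simple.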
GPS was developed to solve a variety of problems ranging from symbolic
integration to word puzzles (such as the missionary-cannibal problem). GPS used a
problem-solving technique known as means-end analysis, discussed later in Chapter
9. The geometry theorem-proving machine of Gelernter was developed to solve
high-school level plane geometry problems. From basic axioms in geometry, the
system developed a proof as a sequence of simple subgoals. EPAM was written to
study rote learning by machine. The system had a learning and performance component
where pairs of nonsense words, a stimulus-response pair, were first learned through
repetitive presentations (in different orders). The performance component was then
used to demonstrate how well responses to the stimuli were learned.
Some significant AI events of the 1960s include the following.
1.4 AI AND RELATED FIELDS
Fields which are closely related to AI and overlap somewhat include engineering,
particularly electrical and mechanical engineering, linguistics, psychology, cognitive
science, and philosophy. Robotics is also regarded by some researchers as a branch
of AI, but this view is not common. Many researchers consider robotics as a separate
interdisciplinary field which combines concepts and techniques from AI, electrical,
mechanical, and optical engineering.
Psychologists are concerned with the workings of the mind, the mental and
emotional processes that drive human behavior. As such, we should not be surprised
to learn that researchers in AI have much in common with psychologists. During
the past 20 years AI has adopted models of thinking and learning from psychology,
while psychologists in turn have patterned many of their experiments on questions
first raised by AI researchers. AI has given psychologists fresh ideas and enhanced
their ability to model human cognitive functions on the computer. In their book
The Cognitive Computer, Schank and Childers (1984) estimate that "... AI has
contributed more to psychology than any other discipline for some time."
Because they share so many common interests, it has been claimed that AI
researchers think less like computer scientists than they do psychologists and philoso-
1.5 SUMMARY
In this introductory chapter, we have defined AI and terms closely related to the
field. We have shown how important AI will become in the future as it will form
the foundation for a number of new consumer commodities, all based on knowledge.
It was noted that countries willing to commit appropriate resources to research in
this field will emerge as the world's economic leaders in the not too distant future.
We briefly reviewed early work in AI, considering first developments prior
to 1950, the period during which the first commercial computers were introduced.
We then looked at post-1950 developments during which AI was officially launched
as a separate field of computer science. Fields which overlap with and are closely
related to AI were also considered, and the areas of commonality between the two
presented.
Knowledge: General Concepts
2.1 INTRODUCTION
Early researchers in AI believed that the best approach to solutions was through
the development of general purpose problem solvers, that is, systems powerful enough
to prove a theorem in geometry, to perform a complex robotics task, or to develop
a plan to complete a sequence of intricate operations. To demonstrate their theories,
several systems were developed including several logic theorem provers and a general
problem solver system (described in Chapter 9).
All of the systems developed during this period proved to be impotent as
general problem solvers. They required much hand tailoring of problem descriptions
and ad hoc guidance in their solution steps. The approaches they used proved to
be too general to be effective. The systems became effective only when the solution
methods incorporated domain specific rules and facts. In other words, they became
effective as problem solvers only when specific knowledge was brought to bear on
the problems. The realization that specific knowledge was needed to solve difficult
problems gradually brought about the use of domain specific knowledge as an integral
part of a system. It eventually led to what we now know as knowledge-based systems.
Since the acceptance of this important fact, successful problem solvers in many
domains have been developed.
2.2 DEFINITION AND IMPORTANCE OF KNOWLEDGE
Knowledge can be defined as the body of facts and principles accumulated by
humankind or the act, fact, or state of knowing. While this definition may be true, it is
far from complete. We know that knowledge is much more than this. It is having
a familiarity with language, concepts, procedures, rules, ideas, abstractions, places,
customs, facts, and associations, coupled with an ability to use these notions effectively
in modeling different aspects of the world. Without this ability, the facts and concepts
are meaningless and, therefore, worthless. The meaning of knowledge is closely
related to the meaning of intelligence. Intelligence requires the possession of and
access to knowledge. And a characteristic of intelligent people is that they possess
much knowledge.
In biological organisms, knowledge is likely stored as complex structures of
interconnected neurons. The structures correspond to symbolic representations of
the knowledge possessed by the organism, the facts, rules, and so on. The average
human brain weighs about 3.3 pounds and contains an estimated 10^12
neurons. The neurons and their interconnection capabilities provide about 10^14 bits
of potential storage capacity (Sagan, 1977).
In computers, knowledge is also stored as symbolic structures, but in the
form of collections of magnetic spots and voltage states. State-of-the-art storage in
computers is in the range of 10^12 bits with capacities doubling about every three to
four years. The gap between human and computer storage capacities is narrowing
rapidly. Unfortunately, there is still a wide gap between representation schemes
and efficiencies.
A common way to represent knowledge external to a computer or a human is
in the form of written language. For example, some facts and relations represented
in printed English are
Joe is tall.
Bill loves Sue.
Sam has learned to use recursion to manipulate linked lists in several
programming languages.
The first item of knowledge above expresses a simple fact, an attribute possessed
by a person. The second item expresses a complex binary relation between two
persons. The third item is the most complex, expressing relations between a person
and more abstract programming concepts. To truly understand and make use of
this knowledge, a person needs other world knowledge and the ability to reason
with it.
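As a rough illustration of how such facts and relations might be held inside a program, the following Python sketch stores each English sentence above as a tuple of a relation name followed by its arguments. The tuple layout, the relation names, and the query helper are assumptions made for this example and are not a representation scheme prescribed by the text.

```python
# Each fact is a tuple: (relation, argument, ...). Names are illustrative.
facts = {
    ("tall", "joe"),                                       # attribute of one person
    ("loves", "bill", "sue"),                              # binary relation
    ("has_learned", "sam", "recursion_on_linked_lists"),   # person and abstract concept
}


def query(pattern):
    """Return the stored facts that match pattern; None acts as a wildcard."""
    return sorted(
        fact
        for fact in facts
        if len(fact) == len(pattern)
        and all(p is None or p == f for p, f in zip(pattern, fact))
    )


who_loves_sue = query(("loves", None, "sue"))
does_sue_love_bill = query(("loves", "sue", "bill"))
```

Here who_loves_sue contains the single stored fact about Bill, while does_sue_love_bill is empty: the store holds only what it was told. Deciding what else follows from these facts requires the additional world knowledge and reasoning ability mentioned above.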
Knowledge may be declarative or procedural. Procedural knowledge is compiled
knowledge related to the performance of some task. For example, the steps used
to solve an algebraic equation are expressed as procedural knowledge. Declarative
knowledge, on the other hand, is passive knowledge expressed as statements of
facts about the world. Personnel data in a database is typical of declarative knowledge.
Such data are explicit pieces of independent knowledge.
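The distinction can be sketched in a few lines of Python. The personnel records below are hypothetical declarative facts, and solve_linear is a hypothetical procedure that encodes the steps for solving a linear equation of the form a*x + b = c. Both are illustrations chosen for this example rather than material from the text.

```python
# Declarative knowledge: passive statements of fact (hypothetical records).
personnel = [
    {"name": "Joe", "department": "Sales", "years": 4},
    {"name": "Sue", "department": "Research", "years": 9},
]


# Procedural knowledge: the steps used to solve a*x + b = c for x.
def solve_linear(a, b, c):
    if a == 0:
        raise ValueError("no unique solution when a is 0")
    rhs = c - b      # step 1: move the constant term to the right-hand side
    return rhs / a   # step 2: divide both sides by the coefficient of x


# The records do nothing by themselves; some procedure must interpret them.
senior_staff = [p["name"] for p in personnel if p["years"] > 5]
x = solve_linear(2, 3, 11)
```

Each record can be added, removed, or changed without affecting the others, which is the sense in which declarative facts are independent. The procedure, by contrast, is only meaningful as an ordered whole.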
Frequently, we will be interested in the use of heuristic knowledge, a special
type of knowledge used by humans to solve complex problems. Heuristics are the
knowledge used to make good judgments, or the strategies, tricks, or "rules of
thumb" used to simplify the solution of problems. Heuristics are usually acquired
with much experience. For example, in locating a fault in a TV set, an experienced
technician will not start by making numerous voltage checks when it is clear that
the sound is present but the picture is not, but instead will immediately reason that
the high voltage flyback transformer or related component is the culprit. This type
of reasoning may not always be correct, but it frequently is, and then it leads to a
quick solution.
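The technician's rule of thumb can be written as a small Python sketch. The function name first_checks, the symptom keys, and the fallback action are assumptions made for illustration. The point is only that a heuristic reorders the search for a fault rather than guaranteeing the answer.

```python
def first_checks(symptoms):
    """Return the components to inspect first, most likely culprit first."""
    sound = symptoms.get("sound", False)
    picture = symptoms.get("picture", False)
    if sound and not picture:
        # Rule of thumb from experience: usually right, not always.
        return ["high voltage flyback transformer", "related components"]
    # No heuristic applies, so fall back on the slow exhaustive approach.
    return ["systematic voltage checks"]


suspects = first_checks({"sound": True, "picture": False})
fallback = first_checks({"sound": False, "picture": False})
```

When the heuristic fires, the exhaustive voltage checks are skipped entirely. If the suspected component turns out to be sound, the technician has lost a little time and must fall back on the systematic method.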
Knowledge should not be confused with data. Feigenbaum and McCorduck
(1983) emphasize this difference with the following example. A physician treating
a patient uses both knowledge and data. The data is the patient's record, including
patient history, measurements of vital signs, drugs given, response to drugs, and
so on, whereas the knowledge is what the physician has learned in medical school
and in the years of internship, residency, specialization, and practice. Knowledge
is what the physician now learns in journals. It consists of facts, prejudices, beliefs,
and most importantly, heuristic knowledge.
Thus, we can say that knowledge includes and requires the use of data and
information. But it is more. It combines relationships, correlations, dependencies,
and the notion of gestalt with data and information.
Even with the above distinction, we have been using knowledge in its broader
sense up to this point. At times, however, it will be useful or even necessary to
distinguish between knowledge and other concepts such as belief and hypotheses.
For such cases we make the following distinctions. We define belief as essentially
any meaningful and coherent expression that can be represented. Thus, a belief
may be true or false. We define a hypothesis as a justified belief that is not known
to be true. Thus, a hypothesis is a belief which is backed up with some supporting
evidence, but it may still be false. Finally, we define knowledge as true justified
belief. Since these distinctions will be made more formal in later chapters, we
need not attempt to give any further definitions of truth or justification at this time.
Two other knowledge terms which we shall occasionally use are epistemology
12 -: Knowledge: General Concepts Chap. 2
AI has given new meaning and importance to knowledge. Now, for the first time,
it is possible to "package" specialized knowledge and sell it with a system that
can use it to reason and draw conclusions. The potential of this important development
is only now beginning to be realized. Imagine being able to purchase an untiring,
reliable advisor that gives high level professional advice in specialized areas, such
as manufacturing techniques, sound financial strategies, ways to improve one's health,
top marketing sectors and strategies, optimal farming plans, and many other important
matters. We are not far from the practical realization of this, and those who create
Sec. 2.3 Knowledge-Based Systems 13
and market such systems will have more than just an economic advantage over the
rest of the world.
As noted in Chapter 1, the Japanese recognized the potential offered with
these knowledge systems. They were the first to formally proceed with a plan to
commit substantial resources toward an accelerated program of development for
super-computers and knowledge-based systems. In their excellent book on the Fifth
Generation, Feigenbaum and McCorduck (1983) present convincing arguments for
the importance that should be ascribed to such programs. They argue that the time
is right for the exploitation of AI and that the leaders in this field will become the
leaders in world trade. By forging ahead in research and the development of powerful
knowledge-based systems, the Japanese are assuring themselves of a leading role
in the control and dissemination of packaged knowledge. Feigenbaum and McCorduck
laud the Japanese for their boldness and farsightedness in moving ahead with this
ambitious program.
One of the important lessons learned in AI during the 1960s was that general purpose
problem solvers which used a limited number of laws or axioms were too weak to
be effective in solving problems of any complexity. This realization eventually led
to the design of what is now known as knowledge-based systems, systems that
depend on a rich base of knowledge to perform difficult tasks.
Edward Feigenbaum summarized this new thinking in a paper at the International
Joint Conference on Artificial Intelligence (IJCAI) in 1977. He emphasized the
fact that the real power of an expert system comes from the knowledge it possesses
rather than the particular inference schemes and other formalisms it employs. This
new view of AI systems marked the turning point in the development of more
powerful problem solvers. It formed the basis for some of the new emerging expert
systems being developed during the 1970s including MYCIN, an expert system
developed to diagnose infectious blood diseases.
Since this realization, much of the work done in AI has been related to so-called
knowledge-based systems, including work in vision, learning, general problem
solving, and natural language understanding. This in turn has led to more emphasis
being placed on research related to knowledge representation, memory organization,
and the use and manipulation of knowledge.
Knowledge-based systems get their power from expert knowledge that
has been coded into facts, rules, heuristics, and procedures. The knowledge is stored
in a knowledge base separate from control and inferencing components (Figure
2.1). This makes it possible to add new knowledge or refine existing knowledge
without recompiling the control and inferencing programs. This greatly simplifies
the construction and maintenance of knowledge-based systems.
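The separation of a knowledge base from the inferencing component can be sketched in a few lines of Common LISP. This is only a minimal illustration under stated assumptions: the rule format (a list of conditions paired with one conclusion), the names *rules* and fire-rules, and the example facts are all hypothetical and are not taken from any particular expert system.

```lisp
;; Knowledge base: pure data. Each rule is (conditions . conclusion).
;; The rule contents are illustrative assumptions.
(defparameter *rules*
  '(((sound-present picture-absent) . suspect-flyback-transformer)
    ((suspect-flyback-transformer) . check-high-voltage-section)))

;; Inference component: knows nothing about TV sets.
;; It adds the conclusion of every rule whose conditions are all
;; present in FACTS, and repeats until no rule adds anything new.
(defun fire-rules (facts rules)
  (let ((changed t))
    (loop while changed
          do (setq changed nil)
             (dolist (rule rules)
               (when (and (subsetp (car rule) facts)
                          (not (member (cdr rule) facts)))
                 (push (cdr rule) facts)
                 (setq changed t))))
    facts))

(fire-rules '(sound-present picture-absent) *rules*)
```

New rules can be added to *rules* without touching fire-rules, which is the maintenance advantage the paragraph above describes.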
In the knowledge lies the power! This was the message learned by a few
farsighted researchers at Stanford University during the late 1960s and early 1970s.
The proof of their message was provided in the first knowledge-based expert systems
which were shown to be more than toy problem solvers. These first systems were
real world problem solvers, tackling such tasks as determining complex chemical
structures given only the atomic constituents and mass spectra data from samples
of the compounds, and later performing medical diagnoses of infectious blood diseases.
[Figure 2.1 Components of a knowledge-based system: knowledge base, inference-control unit.]
Given the fact that knowledge is important and in fact essential for intelligent behavior,
the representation of knowledge has become one of AI's top research priorities.
What exactly is meant by knowledge representation? As defined above, knowledge
consists of facts, concepts, rules, and so forth. It can be represented in different
forms, as mental images in one's thoughts, as spoken or written words in some
language, as graphical or other pictures, and as character strings or collections of
magnetic spots stored in a computer (Figure 2.2). The representations we shall be
concerned with in our study of AI are the written ones (character strings, graphs,
pictures) and the corresponding data structures used for their internal storage.
Any choice of representation will depend on the type of problem to be solved
and the inference methods available. For example, suppose we wish to write a
program to play a simple card game using the standard deck of 52 playing cards.
We will need some way to represent the cards dealt to each player and a way to
express the rules. We can represent cards in different ways. The most straightforward
way is to record the suit (clubs, diamonds, hearts, spades) and face values (ace, 2,
3, ..., 10, jack, queen, king) as a symbolic pair. So the queen of hearts might
[Figure 2.2 Forms of knowledge representation: mental images, written text, character strings, binary numbers.]
011011011011011011.
Clearly a representation in the proper base greatly simplifies finding the pattern
solution.
Sometimes, a state diagram representation will simplify solutions. For example,
the Towers of Hanoi problem requires that n discs (say n = 3), each a different
size, be moved from one of three pegs to a third peg without violating the rule that a
disc may only be stacked on top of a larger disc. Here, the states are all the possible
disc-peg configurations, and a valid solution path can easily be traced from the
initial state through other connected states to the goal state.
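One way to see such a solution path is to generate the sequence of moves that leads from the initial state to the goal state. The following is a minimal sketch, not a state-diagram search: it uses the standard recursive decomposition, and the function name hanoi and the peg names a, b, and c are assumptions chosen for illustration.

```lisp
;; Return the list of moves (from-peg to-peg) that transfers N discs
;; from peg FROM to peg TO, using peg VIA as the spare.
(defun hanoi (n from to via)
  (if (= n 0)
      nil
      (append (hanoi (- n 1) from via to)     ; clear the n-1 smaller discs
              (list (list from to))           ; move the largest disc
              (hanoi (- n 1) via to from))))  ; restack the smaller discs

(hanoi 3 'a 'c 'b)   ; a path of 7 moves for n = 3
```

Each move in the returned list corresponds to one link between connected states in the state diagram.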
Later we will study several representation schemes that have become popular
among AI practitioners. Perhaps the most important of these is first order predicate
logic. It has become important because it is one of the few methods that has a
well-developed theory, has reasonable expressive power, and uses valid forms of
inferring. Its greatest weakness is its limitation as a model for commonsense reasoning.
A typical statement in this logic might express the family relationship of fatherhood
as FATHER(john, jim) where the predicate father is used to express the fact that
John is the father of Jim.
Other representation schemes include frames and associative networks (also
called semantic and conceptual networks), fuzzy logic, modal logics, and object-
oriented methods. Frames are flexible structures that permit the grouping of closely
related knowledge. For example, an object such as a ball and its properties (size,
color, function) and its relationship to other objects (to the left of, on top of, and
so on) are grouped together into a single structure for easy access. Networks also
permit easy access to groups of related items. They associate objects with their
attributes, and linkages show their relationship to other objects.
Fuzzy logic is a generalization of predicate logic, developed to permit varying
degrees of some property such as tall. In classical two-valued logic, TALL(john)
is either true or false, but in fuzzy logic this statement may be partially true. Modal
logic is an extension of classical logic. It was also developed to better represent
commonsense reasoning by permitting conditions such as likely or possible. Object-oriented
representations package an object together with its attributes and functions,
thereby hiding these facts. Operations are performed by sending messages between
the objects.
Another representation topic covered more fully later is uncertainty. Not all
2.8 SUMMARY
EXERCISES
2.1 Define and describe the difference between knowledge, belief, hypotheses, and data.
2.2 What is the difference between declarative and procedural knowledge?
2.3 Look up the meaning of epistemology in a good encyclopedia and prepare a definition.
2.4 The Turing test has often been incorrectly interpreted as being a test of whether or not
a person could distinguish between responses from a computer and responses from a
person. How does this differ from the real Turing test? Are the two tests equivalent? If
not, explain why they are not.
2.5 What important knowledge products are currently being marketed like other commodities?
What are some new knowledge products likely to be sold within the next ten years?
2.6 Briefly describe the meaning of knowledge representation and knowledge acquisition.
2.7 Give four different ways to represent the fact that John is Bill's father.
LISP and Other AI
Programming Languages
The basic building blocks of LISP are the atom, list, and the string. An atom is a
number or string of contiguous characters, including numbers and special characters.
A list is a sequence of atoms and/or other lists enclosed within parentheses. A
LISP and Other Al Programming Languages Chap. 3
Since a list may contain atoms as well as other lists, we will call the basic
unit members top elements. Thus, the top elements of the list (a b (c d) e (f)) are
a, b, (c d), e, and (f). The elements c and d are top elements of the sublist (c d).
Atoms, lists, and strings are the only valid objects in LISP. They are called
symbolic-expressions or s-expressions. Any s-expression is potentially a valid program.
And those, believe it or not, are essentially the basic syntax rules for LISP.
Of course, to be meaningful, a program must obey certain rules of semantics, that
is, a program must have meaning.
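The notion of top elements can be checked directly in an interpreter. The sketch below assumes Common LISP; the variable name *sample* is hypothetical, and the single quotation mark (which stops the list from being evaluated as a function call) is explained later in this chapter.

```lisp
;; The list from the text: five top elements a, b, (c d), e, and (f).
(defparameter *sample* '(a b (c d) e (f)))

(length *sample*)          ; counts top elements only: 5
(third *sample*)           ; the sublist (C D), itself one top element
(atom (first *sample*))    ; T, since A is an atom
(listp (third *sample*))   ; T, since (C D) is a list
```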
LISP programs run either on an interpreter or as compiled code. The interpreter
examines source programs in a repeated loop, called the read-evaluate-print loop.
This loop reads the program code, evaluates it, and prints the values returned by
the program. The interpreter signals its readiness to accept code for execution by
printing a prompt such as the -> symbol. For example, to find the sum of the
three numbers 5, 6, and 9 we type after the prompt the following function call:
->(+ 5 6 9)
20
Some dialects require that symbolic atoms begin with a letter and do not include parentheses or
single quotes.
Sec. 3.1 Introduction to LISP: Syntax and Numeric Functions 21
Note that LISP uses prefix notation, and the + symbol is the function name
for the sum of the arguments that follow. The function name and its arguments are
enclosed in parentheses to signify that it is to be evaluated as a function. The
read-evaluate-print loop reads this expression, evaluates it, and prints the value
returned (20). The interpreter then prints the prompt to signal its readiness to accept
the next input. More complicated computations can be written as a single embedded
expression. For example, to compute the Fahrenheit equivalent of the centigrade
temperature 50, for the mathematical expression (50 * 9 / 5) + 32 we would write
the corresponding LISP function
->(+ (* (/ 9 5) 50) 32)
122
Each function call is performed in the order in which it occurs within the
parentheses. But, in order to compute the sum, the argument (* (/ 9 5) 50) must
first be evaluated. This requires that the product of 50 and 9/5 be computed, which
in turn requires that the quotient 9/5 be evaluated. The embedded function (/ 9 5)
returns the quotient 1.8 to the multiply function to give (* 1.8 50). This is then
evaluated and the value 90 is returned to the top (sum) function to give (+ 90 32).
The final result is the sum 122 returned to the read-evaluate-print loop for printing.
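The inner-to-outer order of evaluation can be followed by entering each embedded expression separately. The sketch below assumes a Common LISP interpreter, where dividing two integers yields an exact ratio rather than the decimal 1.8 described above; other dialects may print the intermediate values differently.

```lisp
(/ 9 5)                  ; innermost call: the exact ratio 9/5 in Common LISP
(* (/ 9 5) 50)           ; the product: 90
(+ (* (/ 9 5) 50) 32)    ; the complete expression: 122
(/ 9.0 5)                ; a floating-point argument gives 1.8
```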
The basic numeric operations are +, -, *, and /. Arguments may be integers
or real values (floating point), and the number of arguments a function takes will,
of course, differ. For example, + and * normally take zero or more arguments,
while - and / take two. These and a number of other basic functions are predefined
in LISP. Examples of function calls and the results returned are given in Table
3.1. In addition to these basic calls, some LISP implementations include mnemonic
names for arithmetic operations such as plus and times.
LISP tries to evaluate everything, including the arguments of a function. But,
three types of elements are special in that they are constant and always evaluate to
themselves, returning their own value: numbers, the letter t (for logical true), and
nil (for logical false). Nil is also the same as the empty list (). It is the only object
in LISP that is both an atom and a list. Since these elements return their own
value, the following are valid expressions.
->6
6
->t
T
->nil
NIL
Sometimes we wish to take atoms or lists literally and not have them evaluated or
treated as function calls as, for example, when the list represents data. To accomplish
this, we precede the atom or the list with a single quotation mark, as in 'man or
as in '(a b c d). The quotation mark informs the interpreter that the atom or list
should not be evaluated, but should be taken literally as an atom or list.
Variables in LISP are symbolic (nonnumeric) atoms. They may be assigned
values, that is, bound to values with the function setq. Setq takes two arguments,
the first of which must be a variable. It is never evaluated and should not be in
quotation marks. The second argument is evaluated (unless in quotation marks)
and the result is bound to the first argument. The variable retains this value until a
new assignment is made. When variables are evaluated, they return the last value
bound to them. Trying to evaluate an undefined variable (one not previously bound
to a value) results in an error. Some examples of the use of setq are as follows
(note that comments in LISP code may be placed anywhere after a semicolon).
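A few uses of setq can be sketched as follows. These are illustrative examples under the assumption of a Common LISP system, not the original listing; the variable names x, y, and z are arbitrary, and the defvar lines are added only so that stricter compilers do not warn about undeclared variables.

```lisp
(defvar x)             ; declarations are an assumption for modern compilers
(defvar y)
(defvar z)

(setq x 10)            ; the number 10 evaluates to itself and is bound to x
(setq y (+ x 5))       ; the second argument is evaluated first, so y is 15
(setq z '(a b c))      ; the quoted list is taken literally, not evaluated
(setq x y)             ; x now holds the value of y, which is 15
```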
Some basic symbol processing functions are car, cdr, cons, and list. Examples
of these functions are given in Table 3.2. Car takes one argument, which must be
a list. It returns the first top element of the list. Cdr also takes a list as its argument,
and it returns a list consisting of all elements except the first. Cons takes two
Sec. 3.2 Basic List Manipulation Functions in LISP 23
arguments, an element and a list. It constructs a new list by making the element
the first member of the list. List takes any number of arguments, and makes them
into a list, with each argument a top member.
Note the quotation marks preceding the arguments in the function calls of
Table 3.2. As pointed out above, an error will result if they are not there because
the interpreter will try to evaluate each argument before evaluating the function.
Notice the difference in results with and without the quotation marks.
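The four functions can be tried on small quoted lists. The calls below are a sketch in the spirit of the table referred to above, not a copy of it; the results in the comments are those a Common LISP interpreter returns.

```lisp
(car '(a b c))          ; first top element: A
(cdr '(a b c))          ; everything but the first: (B C)
(cons 'a '(b c))        ; element placed at the front: (A B C)
(cons '(a) '(b c))      ; the element may itself be a list: ((A) B C)
(list 'a '(b c) 'd)     ; each argument becomes a top member: (A (B C) D)
```

Without the quotation marks, (car (a b c)) would try to call a function named a and signal an error.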
where any number of arguments may be used. When a function is called, the arguments
are first evaluated from left to right (unless within quotation marks) and then the
function is executed using the evaluated argument values. Complete the list manipulation
examples below.
Other useful list manipulation functions are append, last, member, and reverse.
Append merges arguments of one or more lists into a single list. Last takes one
argument, a list, and returns a list containing the last element. Member takes two
arguments, the second of which must be a list. If the first argument is a member
of the second one, the remainder of the second list is returned beginning with the
member element. Reverse takes a list as its argument and returns a list with the
top elements in reverse order from the input list. Table 3.3 summarizes these operations.
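The behavior just described can be sketched with a few calls. These are illustrative examples, assuming Common LISP, and are not taken from the table itself.

```lisp
(append '(a) '(b c))       ; lists merged into one: (A B C)
(last '(a b c d))          ; a list holding the last element: (D)
(member 'b '(a b c))       ; remainder starting at the match: (B C)
(member 'e '(a b c))       ; no match: NIL
(reverse '(a b (c d)))     ; top elements reversed: ((C D) B A)
```

Note that reverse leaves the sublist (c d) intact, since only top elements are reordered.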
Sec. 3.3 Defining Functions, Predicates, and Conditionals 25
Defining Functions
Now that we know how to call functions, we should learn how to define our own.
The function named defun is used to define functions. It requires three arguments:
(1) the new function name, (2) the parameters for the function, and (3) the function
body or LISP code which performs the desired function operations. The format is
(defun name (parm1 parm2 ...) body)
Defun does not evaluate its arguments. It simply builds a function which
may be called like any other function we have seen. As an example, we define a
function named averagethree to compute the average of three numbers.
Note that defun returned the name of the function. To call averagethree, we
give the function name followed by the actual arguments
When a function is called, the arguments supplied in the call are evaluated
unless they are in quotation marks and bound to (assigned to) the function parameters.
The argument values are bound to the parameters in the same order they were
given in the definition. The parameters are actually dummy variables (n1, n2, n3 in
averagethree) used to make it possible to give a function a general definition.
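A definition of averagethree consistent with the description above might look as follows. This is a sketch rather than the original listing; the parameter names n1, n2, and n3 follow the text, and the sample arguments are arbitrary.

```lisp
;; Name, parameter list, then the body that does the work.
(defun averagethree (n1 n2 n3)
  (/ (+ n1 n2 n3) 3))

(averagethree 10 20 30)   ; arguments are bound to n1, n2, n3 in order: 20
```

In Common LISP an inexact integer division such as (averagethree 1 2 2) returns the ratio 5/3; passing a floating-point argument gives a decimal result instead.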
Predicate Functions
Predicates are functions that test their arguments for some specific condition. Except
for the predicate "member" (defined above), predicates return true (t) or false
(nil), depending on the arguments. The most common predicates are
atom              listp
equal             null
evenp             numberp
greaterp (or >)   oddp
lessp (or <)      zerop
>=
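A few predicate calls illustrate the true or false results. The sketch assumes Common LISP, which provides > and < but not the names greaterp and lessp found in some older dialects.

```lisp
(atom 'a)                  ; T
(atom '(a b))              ; NIL, a non-empty list is not an atom
(listp '(a b))             ; T
(null nil)                 ; T, nil is the empty list
(numberp 10)               ; T
(evenp 4)                  ; T
(oddp 4)                   ; NIL
(zerop 0)                  ; T
(equal '(a b) '(a b))      ; T, the two lists have the same structure
(> 3 2)                    ; T
```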
Predicates are one way to make tests in programs and take different actions based
on the outcome of the test. However, to make use of the predicates, we need
some construct to permit branching. Cond (for conditional) is like the if-then-else
construct.
The syntax for cond is
(cond (<test1> <action1>)
      (<test2> <action2>)
      ...
      (<testk> <actionk>))
Each (<testi> <actioni>), i = 1, ..., k, is called a clause. Each clause
consists of a test portion and an action or result portion. The first clause following
the cond is executed by evaluating <test1>. If this evaluates to non-nil, the <action1>
portion is evaluated, its value is returned, and the remaining clauses are skipped
over. If <test1> evaluates to nil, control passes to the second clause without evaluating
<action1> and the procedure is repeated. If all tests evaluate to nil, cond returns
nil.
We illustrate the use of cond in the following function maximum2 which
returns the maximum of two numbers.
->(defun maximum2 (a b)
    (cond ((> a b) a)
          (t b)))
MAXIMUM2
When maximum2 is executed, it starts with the first clause following the
cond. The test sequence is as follows: if (the argument bound to) a is greater than
(that bound to) b, return a, else return b. Note the t in the second clause preceding
b. This forces the last clause to be evaluated when the first clause is not.
->(maximum2 234 320)
320
A slightly more challenging use of cond finds the maximum of three numbers
in the function maximum3.
->(defun maximum3 (a b c)
    (cond ((> a b) (cond ((> a c) a)
                         (t c)))
          ((> b c) b)
          (t c)))
MAXIMUM3
Common LISP also provides a form of the more conventional if, then, else
conditional. It has the form
(if test <then-action> <else-action>)
For this form, test is first evaluated. If it evaluates to non-nil, the <then-action>
is evaluated and the result returned; otherwise the <else-action> is evaluated
and its value returned. The <else-action> is optional. If omitted, then when test
evaluates to nil, the if function returns nil.
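The same two-way test used in maximum2 can be written with if. The function name maximum2-if is hypothetical, chosen here only to avoid redefining the cond version.

```lisp
;; Returns the larger of two numbers using if instead of cond.
(defun maximum2-if (a b)
  (if (> a b) a b))

(maximum2-if 234 320)     ; test fails, so the else-action b is returned: 320
(if (> 1 2) 'bigger)      ; no else-action and the test is nil: NIL
```

For a single test, if is more compact; cond remains the clearer choice when several tests are chained.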
Logical Functions
Like predicates, logical functions may also be used for flow of control. The basic
logical operations are and, or, and not. Not is the simplest. It takes one argument
and returns t if the argument evaluates to nil; it returns nil if its argument evaluates
to non-nil. The functions and and or both take any number of arguments. For
both, the arguments are evaluated from left to right. In the case of and, if all
arguments evaluate to non-nil, the value of the last argument is returned; otherwise
nil is returned. The arguments of or are evaluated until one evaluates to non-nil, in
which case it returns the argument value; otherwise it returns nil.
Some examples of the operators and, or, and not in expressions are
->(setq x '(a b c))
(A B C)
->(not (atom x))
T
->(not (listp x))
NIL
->(or (member 'e x) (member 'b x))
(B C)
->(or (equal 'c (car x)) (equal 'b (car x)))
NIL
->(and (listp x) (equal 'c (caddr x)))
T
->(or (and (atom x) (equal 'a x))
      (and (not (atom x)) (atom (car x))))
T
Sec. 3,4 Input, Output, and Local Variables 29
Without knowing how to instruct our programs to call for inputs and print messages
or text on the monitor or a printer, our programs will be severely limited. The
operations we need for this are performed with the input-output (I/O) functions.
The most commonly used I/O functions are read, print, prin1, princ, terpri, and
format.
Read takes no arguments. When read appears in a procedure, processing halts
until a single s-expression is entered from the keyboard. The s-expression is then
returned as the value of read and processing continues. For example, if we include
a read in an arithmetic expression, an appropriate value should be entered when
the interpreter halts.
->(+ 5 (read))
6
11
When the interpreter looked for the second argument for +, it found the read
statement which caused it to halt and wait for an input from the keyboard. If we
enter 6 as indicated, read returns this value, processing continues, and the sum 11
is then returned.
Print takes one argument. It prints the argument as it is received, and then
returns the argument. This makes it possible to print something and also pass the
same thing on to another function as an argument. When print is used to print an
expression, its argument is preceded by the carriage-return and line feed characters
(to start a new line) and is followed by a space. In the following example, note
the double printing. This occurs because print first prints its argument and then
returns it, causing it to be printed by the read-evaluate-print loop.
->(print '(a b c))
(A B C)
(A B C)
->(print "hello there")
"hello there"
"hello there"
Notice that print even prints the double quotation marks defining the string.
Prin1 is the same as print except that the new-line characters and space are
not provided (this is not true for all implementations of Common LISP).
We can avoid the double quotation marks in the output by using the printing
function princ. It is the same as prin1 except it does not print the unwanted quotation
marks. For example, we use princ to print the following without the marks.
->(princ "hello there")
hello there "hello there"
Princ eliminated the quotes, but the echo still remains. Again, that is because
princ returned its argument (in a form that LISP could read), and, since that was
the last thing returned, it was printed by the read-evaluate-print loop. In a typical
program, the returned value would not be printed on the screen as it would be
absorbed (used) by another function.
The primitive function terpri takes no arguments. It introduces a new-line
(carriage return and line feed) wherever it appears and then returns nil. Below is a
program to compute the area of a circle which uses several I/O functions, including
user prompts.
->(defun circle-area ()
    (terpri)
    (princ "Please enter the radius: ")
    (setq radius (read))
    (princ "The area of the circle is: ")
    (princ (* 3.1416 radius radius))
    (terpri))
CIRCLE-AREA
->(circle-area)
Please enter the radius: 4
The area of the circle is: 50.2656
Notice that princ permits us to print multiple items on the same line and to introduce
a new-line sequence we use terpri.
The format function permits us to create cleaner output than is possible with
just the basic printing functions. It has the form (format <destination> <string>
arg1 arg2 ...). Destination specifies where the output is to be directed, like to the
monitor or some other external file. For our purposes, destination will always be t
to signify the default output, the monitor. String is the desired output string, but
intermixed with format directives which specify how each argument is to be represented.
Directives appear in the string in the same order the arguments are to be
printed. Each directive is preceded with a tilde character (~) to identify it as a
directive. We list only the most common directives below.
The field widths for appropriate argument values are specified with an integer
immediately following the tilde symbol; for example, ~5D specifies an integer field
of width 5. As an example of format, suppose x and y have been bound to floating-point
numbers 3.0 and 9.42 respectively. Using format, these numbers can be embedded
within a string of text.
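The use of directives can be sketched as follows. This is not the original example: the wording of the output string is an assumption, and the destination nil is used instead of t so that format returns the text as a string rather than printing it, which makes the result easy to inspect. The directives shown (~A for any object, ~D for integers, ~F for floating-point numbers, ~% for a new line) are standard Common LISP.

```lisp
;; Width 5 integer field: the number is right-justified with blanks.
(format nil "~5D" 42)                 ; returns "   42"

;; ~A prints any object without quotation marks, as princ does.
(format nil "~A has ~D pets" 'danny 3)

;; x and y embedded in text; ~,1F and ~,2F fix the digits after the point.
(let ((x 3.0) (y 9.42))
  (format t "~%Circumference with diameter ~,1F equals ~,2F" x y))
```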
(DEE)
(ABC)
The variable x in the defun is local in scope. It reverts back to its previous value
after the function local-var is exited. The variable y, on the other hand, is global.
It is accessible from any procedure and retains its value unless reset with setq.
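The difference between the two kinds of variable can be sketched as follows. This is a hypothetical reconstruction written for illustration and should not be read as the original local-var listing; the body of the function and the initial value of y are assumptions.

```lisp
(defvar y '(a b c))          ; y is global

(defun local-var (x)         ; x is a parameter, local to local-var
  (setq y (cons x y)))       ; setq changes the global y

(local-var 'd)               ; returns (D A B C)
y                            ; y retains the new value: (D A B C)
```

After the call, x is no longer bound to d, while the change made to y remains visible everywhere.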
The let and prog constructs also permit the creation of local variables. The
syntax for the let function is
(let ((var1 val1) (var2 val2) ...) <s-expressions>)
where each vari is a different variable name and vali is an initial value assigned to
each vari, respectively. When let is executed, each vali is evaluated and assigned
to the corresponding vari, and the s-expressions which follow are then evaluated in
order. The value of the last expression evaluated is then returned. If an initial
value is not included with a vari, it is assigned nil, and the parentheses enclosing
it may be omitted.
->(let ((x 'a)
        (y 'b)
        (z 'c))
    (cons x (cons y (list z))))
(A B C)
The prog function is similar to let in that the first arguments following it are
a list of local variables where each element is either a variable name or a list
containing a variable name and its initial value. This is followed by the body of
the prog, and any number of s-expressions.
Prog executes list s-expressions in sequence and returns nil unless it encounters
a function call named return. In this case the single argument of return is evaluated
and returned. Prog also permits the use of unconditional go statements and labels
(atom labels) to identify the go-to transfer locations. With the go and label statements,
prog permits the writing of unstructured programs and, therefore, is not recommended
for general use. An example of a function like member (memb) which uses iteration
will illustrate this use. The main function memb requires two arguments, an element
and a list.
Note that prog used here requires no local variables. Also note the label start,
the transfer (loop back) point for the go statement. The second clause of the cond
executes when the first clause is skipped because setq is non-nil.
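A function of this kind can be sketched as follows. The listing is a reconstruction from the description above rather than the original: the name memb and the parameter names el and lst are assumptions, while the empty variable list, the label start, and the setq used as the test of the second cond clause follow the text.

```lisp
(defun memb (el lst)
  (prog ()                                   ; no local variables needed
   start                                     ; label: loop-back point
     (cond ((equal el (car lst)) (return lst))
           ;; setq returns the shortened list; while that is non-nil
           ;; the clause succeeds and control jumps back to start.
           ((setq lst (cdr lst)) (go start)))))

(memb 'c '(a b c d))    ; returns (C D)
(memb 'z '(a b))        ; list exhausted, prog returns NIL
```

When the list runs out, setq returns nil, the cond falls through, and prog returns nil as described.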
Sec. 3.5 Iteration and Recursion 33
Iteration Constructs
We saw one way to perform iteration using the prog construct in the previous
section. In this section, we introduce a structured form of iteration with the do
construct, which is somewhat like the while loop in Pascal.
The do statement has the form
(do ((<var1> <val1> <var-update1>)
     (<var2> <val2> <var-update2>)
     ...)
    (<test> <return-value>)
    <s-expressions>)
The vali are initial values which are all evaluated and then bound to the corresponding
variables vari in parallel. Following each such statement are optional update
statements which define how the vari are to be updated with each iteration. After
the variables are updated during an iteration, the test is evaluated, and if it returns
non-nil (true), the return-value is evaluated and returned. The s-expressions forming
the body of the construct are optional. If present, they are executed each iteration
until an exit test condition is encountered. An example of the factorial function
will illustrate the do construct.
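An iterative factorial using do might be written as follows. This is a sketch, not the original listing; the function name factorial-do and the variable names count and product are assumptions.

```lisp
(defun factorial-do (n)
  (do ((count n (- count 1))              ; counts down from n
       (product 1 (* product count)))     ; updated in parallel, so this
                                          ; uses the old value of count
      ((<= count 0) product)))            ; exit test and return value

(factorial-do 5)    ; 120
(factorial-do 0)    ; 1
```

The body is empty here because all of the work is done in the update expressions.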
(loop <s-expressions>)
where the s-expressions are evaluated repeatedly until a call to a return is encountered.
Of course, let and other functions can be embedded within the loop construct if
local variables are needed.
Recursion
For many problems, recursion is the natural method of solution. Such problems
occur frequently in mathematical logic, and the use of recursion will often result
in programs which are both elegant and simple. A recursive function is one which
calls itself successively to reduce a problem to a sequence of simpler steps. Recursion
requires a stopping condition and a recursive step.
We illustrate with a recursive version of factorial. The recursive step in factorial
is the product of n and factorial(n-1). The stopping condition is reached when n =
0.
Note the stopping condition on the second line of the function definition, and
the recursive step on the last line.
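A recursive definition matching this description can be sketched as follows. It is a reconstruction rather than the original listing, laid out so that the stopping condition falls on the second line and the recursive step on the last.

```lisp
(defun factorial (n)
  (cond ((zerop n) 1)                       ; stopping condition: n = 0
        (t (* n (factorial (- n 1))))))     ; recursive step

(factorial 5)    ; 5 * 4 * 3 * 2 * 1 * 1 = 120
```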
We present another example of recursion which defines the member function
called newmember.
If the atom c and list (a b c d) are given as the arguments in the call to newmember,
c gets bound to el and (a b c d) is bound to lst. With these bindings, the first cond
test fails, since lst is not null. Consequently, the second test is executed. This also
fails since el, bound to c, does not equal the car of lst which is a. The last test of
the cond construct is forced to succeed because of the t test. This initiates a recursive
call to newmember with the new arguments el (still bound to c) and the cdr of lst
which is (b c d). Again, a match fails during the cond tests; so another recursive
call is made, this time with arguments el (still bound to c) and lst now bound to
(c d). When this call is executed, a match is found in the second cond test so the
value of lst (c d) is returned.
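The three cond tests traced above correspond to a definition along the following lines. This is a reconstruction from the trace, with the parameter names el and lst taken from the text.

```lisp
(defun newmember (el lst)
  (cond ((null lst) nil)                      ; first test: list exhausted
        ((equal el (car lst)) lst)            ; second test: match found
        (t (newmember el (cdr lst)))))        ; otherwise recurse on the cdr

(newmember 'c '(a b c d))    ; returns (C D) after two recursive calls
```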
3.6 PROPERTY LISTS AND ARRAYS
Property Lists
One of the unique and most useful features of LISP as an AI language is the
ability to assign properties to atoms. For example, any object, say an atom which
represents a person, can be given a number of properties which in some way character-
ize the person, such as height, weight, sex, color of eyes and hair, address, profession,
family members, and so on. Property list functions permit one to assign such properties
to an atom, and to retrieve, replace, or remove them as required.
The function putprop assigns properties to an atom. It takes three arguments:
an object name (an atom), a property or attribute name, and property or attribute
value. For example, to assign properties to a car, we can assign properties such as
make, year, color, and style with the following statements:
->(putprop 'car 'ford 'make)
FORD
->(putprop 'car 1988 'year)
1988
->(putprop 'car 'red 'color)
RED
->(putprop 'car 'four-door 'style)
FOUR-DOOR
where value is returned. The object, car, will retain these properties until they are
replaced with new ones or until removed with the remprop function which takes
two arguments, the object and its attribute. In other words, properties are global
assignments. To retrieve a property value, such as the color of car, we use the
function get, which also takes the two arguments object and attribute.
->(get 'car 'color)
RED
->(get 'car 'make)
FORD
->(putprop 'car 'blue 'color)
BLUE
->(get 'car 'color)
BLUE
->(remprop 'car 'color)
BLUE
->(get 'car 'color)
NIL
The property value may be an atom or a list. For example, if Danny has pets
named Schultz, Penny, and Etoile, they can be assigned as
->(putprop 'danny '(schultz penny etoile) 'pets)
(SCHULTZ PENNY ETOILE)
To add a new pet named Heidi without knowing the existing pets one can do
the following:
->(putprop 'danny (cons 'heidi (get 'danny 'pets)) 'pets)
(HEIDI SCHULTZ PENNY ETOILE)
The new function setf used in the above definition is like setq except it is
more general. It is an assignment function which also takes two arguments, the
first of which may be either an atom or an access function (like car, cdr, and get)
and the second, the value to be assigned. When the first argument is an atom, setf
behaves the same as setq. It simply binds the evaluated second argument to the
first. When the first argument is an access function, setf places the second argument,
the value, at the location accessed by the access function. For example, if (a b c)
has been bound to x, the expression (setf (car x) 'd) will replace the a in (a b c)
with d. Likewise, setf can be used directly to assign or replace a property value.
In Figure 3.1 some facts about Tweety, the famous AI bird, have been
represented as a network using property lists.
Arrays
Note that the function returns the pound sign (#) followed by an A and the array
representation with its cells initially set to nil.
To access the contents of cells, we use the function aref which takes two
arguments, the name of the array and the index value. Since the cells are indexed
starting at zero, an index value of 9 must be used to retrieve the contents of the
tenth cell.
->(aref myarray 9)
NIL
To store items in the array, we use the function setf as we did above to store
properties on a property list. So, to store the items 25, red, and (sam sue linda) in
the first, second, and third cells of myarray, we write
We complete our presentation of LISP in this section with a few additional topics,
including the functions mapcar, eval, lambda, trace and untrace, and a brief description
of the internal representation of atoms and lists.
Mapping Functions
Mapcar is one of several mapping functions provided in LISP to apply some function
successively to one or more lists of elements. The first argument of mapcar is a
function, and the remaining argument(s) are lists of elements to which the named
function is applied. The results of applying the function to successive members of
the lists are placed in a new list which is returned. For example, suppose we wish
to add 1 to each element of the list (5 10 15 20 25). We can do this quite simply
with mapcar and the function 1+.
->(mapcar '1+ '(5 10 15 20 25))
(6 11 16 21 26)
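For readers more familiar with Python, the same idea can be sketched as follows. The helper names mapcar and one_plus are illustrative stand-ins for the LISP functions; stopping at the shortest list is an assumption of this sketch.

```python
def mapcar(function, *lists):
    """Apply function to successive elements of the lists and collect the
    results in a new list; stops at the end of the shortest list."""
    return [function(*args) for args in zip(*lists)]

def one_plus(n):
    """Stand-in for the LISP function 1+."""
    return n + 1

print(mapcar(one_plus, [5, 10, 15, 20, 25]))                  # [6, 11, 16, 21, 26]
print(mapcar(lambda x, y: x + y, [1, 2, 3], [10, 20, 30]))    # [11, 22, 33]
```

The second call passes an unnamed function and two lists, showing that the function receives one element from each list on every step.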
Lambda Functions
When a function is defined with defun, its name and address must be stored in a
symbol table for retrieval whenever it is called in a program. Sometimes, however,
it is desirable to use a function only once in a program. This will be the case
when it is used in a mapping operation such as with mapcar, which must take a
procedure as its first argument. LISP provides a method of writing unnamed or
anonymous functions that are evaluated only when they are encountered in a program.
Such functions are called lambda functions. They have the following form
Internal Storage
As we have seen, lists are flexible data structures that can shrink or grow almost
without limit. This is made possible through the use of linked cell structures in
memory to represent lists. These can be visualized as storage boxes having two
components which correspond to the car and cdr of a list. The cells are called
cons-cells, because they are constructed with the cons function, where the left
component points to the first element of a list (the car of the list) and the right component
points to the remainder of the list (the cdr of the list). An example of the representation
of the list (a (b c (d)) e f) is given in Figure 3.2.
The boxes with the slash in the figure represent nil. When cons is used to
construct a list, the cons-cells are created with pointers to the appropriate elements
as depicted in Figure 3.2. The use of such structures permits lists to be easily
extended or modified.
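The linked-cell idea can be made concrete with a small Python sketch. The class Cons and the helpers make_list and to_text are hypothetical names introduced here; None plays the role of nil, and only proper (nil-terminated) lists are handled.

```python
class Cons:
    """A cons-cell: car points to the first element, cdr to the rest."""
    def __init__(self, car, cdr=None):   # None stands in for nil
        self.car = car
        self.cdr = cdr

def make_list(*items):
    """Build a chain of cons-cells from right to left, as repeated cons would."""
    cell = None
    for item in reversed(items):
        cell = Cons(item, cell)
    return cell

def to_text(x):
    """Render a cons structure in LISP list notation."""
    if x is None:
        return "nil"
    if not isinstance(x, Cons):
        return str(x)
    parts = []
    while x is not None:
        parts.append(to_text(x.car))
        x = x.cdr
    return "(" + " ".join(parts) + ")"

# The list (a (b c (d)) e f)
example = make_list("a", make_list("b", "c", make_list("d")), "e", "f")
print(to_text(example))          # (a (b c (d)) e f)
extended = Cons("z", example)    # extending the list costs one new cell
print(to_text(extended))         # (z a (b c (d)) e f)
```

Note that extended shares all of the cells of example rather than copying them, which is why such structures are cheap to extend.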
sister(sue,bill)
parent(ann,sam)
parent(joe,ann)
male(joe)
female(ann)
The first fact is the predicate sister with arguments sue and bill. This predicate
has the intended meaning that Sue is the sister of Bill. Likewise, the next predicate
has the meaning that Ann is the parent of Sam, and so on.
Rules in PROLOG are composed of a condition or "if" part, and a conclusion
or "then" part separated by the symbol :- which is read as "if" (conclusion if
conditions). Rules are used to represent general relations which hold when all of
the conditions in the if part are satisfied. Rules may contain variables, which must
begin with uppercase letters. For example, to represent the general rule for grandfather,
we write
grandfather(X,Z) :- parent(X,Y), parent(Y,Z), male(X).
Note that separate conditions in the rule are separated by commas. The commas
act as conjunctions, that is, like and statements where all conditions in the
right-hand side must be satisfied for the rule to be true.
Given a data base of facts and rules such as that above, we may make queries
by typing after the query symbol ?- statements such as
?- parent(X,sam)
X = ann
?- male(joe)
yes
?- grandfather(X,Y)
X = joe, Y = sam
?- female(joe)
no
Note that responses to the queries are given by returning the value a variable can
take to satisfy the query or simply with yes (true) or no (false).
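The facts, the grandfather rule, and the four queries can be mimicked in Python to show what each answer means. This is a sketch under an assumed encoding (facts as sets of tuples); it computes the same answers but does not reproduce PROLOG's own search strategy.

```python
# Facts from the text, encoded as sets (an assumption of this sketch).
parent = {("ann", "sam"), ("joe", "ann")}
male = {"joe"}
female = {"ann"}

def grandfather():
    """All (X, Z) such that parent(X, Y), parent(Y, Z) and male(X) hold."""
    return sorted(
        (x, z)
        for (x, y) in parent
        for (y2, z) in parent
        if y == y2 and x in male
    )

def parents_of(child):
    """All X such that parent(X, child) holds."""
    return sorted(x for (x, c) in parent if c == child)

print(parents_of("sam"))   # ['ann']
print("joe" in male)       # True
print(grandfather())       # [('joe', 'sam')]
print("joe" in female)     # False
```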
Queries such as these set up a sequence of one or more goals that are to be
satisfied. A goal is satisfied if it can be shown to logically follow from the facts
and rules in the data base. This means that a proper match must be found between
predicates in the query and the database and that all subgoals must be satisfied
through consistent substitutions of constants and variables for variable arguments.
To determine if a consistent match is possible, PROLOG searches the data base
and tries to make substitutions until a permissible match is found or failure occurs.
For example, when the query
?- grandfather(X,sam)
is given, a search is made until the grandfather predicate is found in the data base.
In this case, it is the head of the above grandfather rule. The constant Sam is
substituted for Z, and an attempt is then made to satisfy the body of the rule. This
requires that the three conditions (subgoals) in the body are satisfied. Attempts to
satisfy these conditions are made from left to right in the body by searching and
finding matching predicates and making consistent variable substitutions by (1)
substituting Joe for X in the first subgoal parent(X,Y) and later in the third subgoal male(X),
(2) substituting Sam for Z in the second subgoal parent(Y,Z), and (3) substituting
Ann for Y in the first two subgoals parent(X,Y) and parent(Y,Z). Since consistent
variable substitutions can be made in this case, PROLOG returns X = Joe in
response to the unknown X in the query. PROLOG will continue searching through
the database until a consistent set of substitutions is found. If no consistent set of
substitutions can be found, failure is reported with the printout no.
Lists in PROLOG are similar to list data structures in LISP. A PROLOG list
is written as a sequence of items separated by commas, and enclosed in square
brackets. For example, a list of the students Tom, Sue, Joe, Mary, and Bill is
written as
[tom,sue,joe,mary,bill]
PROLOG provides a notation to separate the head and tail, the vertical bar
as in [Head|Tail]. This permits one to define the Head of a list as any number of
items followed by a | and the list of remaining items. Thus, the list [a,b,c,d] may
be written as [a,b,c,d] = [a|[b,c,d]] = [a,b|[c,d]] = [a,b,c|[d]] = [a,b,c,d|[]].
Matching with lists is accomplished as follows:
?- [Head|Tail] = [tom,sue,joe,mary].
Head = tom
Tail = [sue,joe,mary]
member(X,[X|Tail]).
member(X,[Head|Tail]) :- member(X,Tail).
The first condition states that X is a member of the list L if X is the head of L.
Otherwise, the rule states that X is a member of L if X is a member of the tail of
L. Thus,
?- member(c,[a,b,c,d])
yes
?- member(b,[a,[b,c],d])
no
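The two member clauses translate almost line for line into a recursive Python function. This is a sketch for comparison only; Python lists stand in for PROLOG lists, and the empty-list test plays the role of "no clause matches".

```python
def member(x, lst):
    """True if x is the head of lst, or a member of its tail."""
    if not lst:                  # empty list: neither clause applies
        return False
    head, tail = lst[0], lst[1:]
    if x == head:                # member(X, [X|Tail]).
        return True
    return member(x, tail)       # member(X, [Head|Tail]) :- member(X, Tail).

print(member("c", ["a", "b", "c", "d"]))      # True
print(member("b", ["a", ["b", "c"], "d"]))    # False
```

The second call fails for the same reason as the PROLOG query: b appears only inside the sublist [b,c], which is a single element of the outer list.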
PROLOG has numeric functions and relations, as well as list handling
capabilities, which give it some similarity to LISP. In subsequent chapters, we will see
examples of some PROLOG programs. For more details on the syntax, predicates,
and other features of PROLOG the reader is referred to the two texts, Bratko (1986),
and Clocksin and Mellish (1981).
Other programming languages used in AI include C, object oriented extensions
to LISP such as Flavors, and languages like Smalltalk. The language C has been
used by some practitioners in AI because of its popularity and its portability.
Object oriented languages, although they have been introduced only recently, have
been gaining much popularity. We discuss these languages in Chapter 8.
3.9 SUMMARY
EXERCISES
3.5 Define a function newlist that takes one argument and returns it as a list. If the argument
is already a list, including the empty list, newlist returns it without change. If the
argument is an atom, it returns it as a list.
3.6 Define a function named addlist that takes two arguments, an item and a list. If the
item is in the list, the function returns the list unaltered. If the item is not in the list,
the function returns the list with the item entered as the first element. For example,
3.7 Define a function construct-sentence which takes two lists as arguments. The lists are
simple sentences such as
The function should check to see if the subject of the sentence is the same in both
sentences. If so, the function should return the compound sentence
If the two sentences do not have the same subject, the function should return nil.
3.8 Write a function called word-member which takes one list as argument. The function
should print a prompt to the user
If the word typed is a member of the list the function should return t, otherwise nil.
For example,
3.9 Write a function talk that takes no arguments. It prints the prompt (without quotation
marks or parentheses)
What is your name?
The function should then read an input from the same line two spaces following the
question mark (e.g., Susan), issue a line feed and carriage return, and then print
As before, the user should type a name (e.g., Joe). The program should then respond
with
Very interesting Joe is my best friend too!
3.10 Write an iterative function named nth-item that takes two arguments, a positive integer
and a list. The function returns the item in the nth position in the list. If the integer
exceeds the number of elements in the list, it returns nil. For example,
3.11 Write a recursive function named power that takes two numeric arguments, n and m.
The function computes the nth power of m (m^n). Be sure to account for the case
where n = 0, that is m^0 = 1. For example,
(power 4 3) returns 4^3 = 64.
3.12 Define a function called intersection which takes two lists as arguments. The function
should return a list containing single occurrences of all elements which appear in both
input lists. For example,
3.14 Write an iterative function named sum-all using do that takes an integer n as argument
and returns the sum of the integers from 1 to n. For example,
3.15 Write a function called sum-squares which uses mapcar to find the sum of the squares
of a list of integers. The function takes a single list as its argument. Write a lambda
function which mapcar uses to find the square of the integers in the list. For example,
3.16 Write a PROLOG program that answers questions about family members and
relationships. Include predicates and rules which define sister, brother, father, mother,
grandchild, grandfather, and uncle. The program should be able to answer queries such as
the following:
?- father(X, bob).
?- grandson(X, Y).
?- uncle(bill, sue).
?- mother(mary, X).
3.17 Trace the search sequence PROLOG follows in satisfying the following goal:
?- member(c,[a,b,c,d]).
3.18 Write a function called match that takes two arguments, a pattern and a clause. If the
first argument, the pattern, is a variable identified by a question mark followed by a
lowercase letter (like ?x or ?y), the function should return a list giving the variable
and the corresponding clause, since a variable matches anything. If the pattern is not
a variable, it should return t only if the pattern and the clause are identical. Otherwise,
the function should return nil.
PART
Knowledge Representation
Formalized Symbolic Logics
Starting with this chapter, we begin a study of some basic tools and methodologies
used in AI and the design of knowledge-based systems. The first representation
scheme we examine is one of the oldest and most important, First Order Predicate
Logic (FOPL). It was developed by logicians as a means for formal reasoning,
primarily in the areas of mathematics. Following our study of FOPL, we then
investigate five additional representation methods which have become popular over the
past twenty years. Such methods were developed by researchers in AI or related
fields for use in representing different kinds of knowledge.
After completing these chapters, we should be in a position to best choose
which representation methods to use for a given application, to see how automated
reasoning can be programmed, and to appreciate how the essential parts of a system
fit together.
4.1 INTRODUCTION
The use of symbolic logic to represent knowledge is not new in that it predates the
modern computer by a number of decades. Even so, the application of logic as a
practical means of representing and manipulating knowledge in a computer was
not demonstrated until the early 1960s (Gilmore, 1960). Since that time, numerous
systems have been implemented with varying degrees of success. Today, First Order
Predicate Logic (FOPL), or Predicate Calculus as it is sometimes called, has assumed
one of the most important roles in AI for the representation of knowledge.
A familiarity with FOPL is important to the student of AI for several reasons.
First, logic offers the only formal approach to reasoning that has a sound theoretical
foundation. This is especially important in our attempts to mechanize or automate
the reasoning process in that inferences should be correct and logically sound. Second,
the structure of FOPL is flexible enough to permit a reasonably accurate representation
of natural language. This too is important in AI systems since most
knowledge must originate with and be consumed by humans. To be effective,
transformations between natural language and any representation scheme must be natural
and easy. Finally, FOPL is widely accepted by workers in the AI field as one of
the most useful representation methods. It is commonly used in program designs
and widely discussed in the literature. To understand many of the AI articles and
research papers requires a comprehensive knowledge of FOPL as well as some
related logics.
Logic is a formal method for reasoning. Many concepts which can be verbalized
can be translated into symbolic representations which closely approximate the meaning
of these concepts. These symbolic structures can then be manipulated in programs
to deduce various facts, to carry out a form of automated reasoning.
In FOPL, statements from a natural language like English are translated into
symbolic structures comprised of predicates, functions, variables, constants,
quantifiers, and logical connectives. The symbols form the basic building blocks for the
knowledge, and their combination into valid structures is accomplished using the
syntax (rules of combination) for FOPL. Once structures have been created to represent
basic facts or procedures or other types of knowledge, inference rules may then be
applied to compare, combine and transform these "assumed" structures into new
"deduced" structures. This is how automated reasoning or inferencing is performed.
As a simple example of the use of logic, the statement "All employees of
the AI-Software Company are programmers" might be written in FOPL as
(∀x) (AI-SOFTWARE-CO-EMPLOYEE(x) → PROGRAMMER(x))
Here, ∀x is read as "for all x" and → is read as "implies" or "then." The
predicates AI-SOFTWARE-CO-EMPLOYEE(x) and PROGRAMMER(x) are read
as "if x is an AI Software Company employee," and "x is a programmer"
respectively. The symbol x is a variable which can assume a person's name.
If it is also known that Jim is an employee of AI Software Company,
AI-SOFTWARE-CO-EMPLOYEE(jim)
one can draw the conclusion that Jim is a programmer.
PROGRAMMER(jim)
The above suggests how knowledge in the form of English sentences can be
translated into FOPL statements. Once translated, such statements can be typed
into a knowledge base and subsequently used in a program to perform inferencing.
We begin the chapter with an introduction to Propositional Logic, a special
case of FOPL. This will be constructive since many of the concepts which apply
to this case apply equally well to FOPL. We then proceed in Section 4.3 with a
more detailed study of the use of FOPL as a representation scheme. In Section 4.4
we define the syntax and semantics of FOPL and examine equivalent expressions,
inference rules, and different methods for mechanized reasoning. The chapter
concludes with an example of automated reasoning using a small knowledge base.
It is raining.
My car is painted silver.
John and Sue have five children.
Snow is white.
People live on the moon.
Compound propositions are formed from atomic formulas using the logical
connectives not, and, or, if . . . then, and if and only if. For example, the following
are compound formulas.
We will use capital letters, sometimes followed by digits, to stand for
propositions; T and F are special symbols having the values true and false, respectively.
The following symbols will also be used for logical connectives
~ for not or negation
& for and or conjunction
V for or or disjunction
→ for if . . . then or implication
↔ for if and only if or double implication
In addition, left and right parentheses, left and right braces, and the period
will be used as delimiters for punctuation. So, for example, to represent the compound
sentence "It is raining and the wind is blowing" we could write (R & B) where R
and B stand for the propositions "It is raining" and "the wind is blowing,"
respectively. If we write (R V B) we mean "it is raining or the wind is blowing or
both," that is, V indicates inclusive disjunction.
Syntax
(~P)
(P & Q)
(P V Q)
(P → Q)
(P ↔ Q)
All formulas are generated from a finite number of the above operations.
((P & (Q V R)) → (Q → S))
When there is no chance for ambiguity, we will omit parentheses for brevity:
(~(P & ~Q)) can be written as ~(P & ~Q). When omitting parentheses, the precedence
given to the connectives from highest to lowest is ~, &, V, →, and ↔. So, for
example, to add parentheses correctly to the sentence
P & ~Q V R → S ↔ U V W
we write
Semantics
The semantics or meaning of a sentence is just the value true or false; that is, it is
an assignment of a truth value to the sentence. The values true and false should
not be confused with the symbols T and F which can appear within a sentence.
Note however, that we are not concerned here with philosophical issues related to
meaning but only in determining the truthfulness or falsehood of formulas when a
particular interpretation is given to its propositions. An interpretation for a sentence
or group of sentences is an assignment of a truth value to each propositional symbol.
As an example, consider the statement P & ~Q. One interpretation (I1) assigns
true to P and false to Q. A different interpretation (I2) assigns true to P and true
to Q. Clearly, there are four distinct interpretations for this sentence.
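The count of four follows from two symbols with two possible values each. A short Python sketch makes the enumeration explicit; the helper name interpretations is hypothetical, and Python's True and False stand in for the truth values.

```python
from itertools import product

def interpretations(symbols):
    """Yield every assignment of True/False to the given symbols."""
    for values in product([True, False], repeat=len(symbols)):
        yield dict(zip(symbols, values))

# Pair each interpretation with the resulting truth value of P & ~Q.
rows = [(i, i["P"] and not i["Q"]) for i in interpretations(["P", "Q"])]
for interpretation, value in rows:
    print(interpretation, "->", value)
print(len(rows))   # 4
```

In general a sentence with n distinct propositional symbols has 2^n interpretations, which is why truth-table methods become expensive as n grows.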
Once an interpretation has been given to a statement, its truth value can be
determined. This is done by repeated application of semantic rules to larger and
larger parts of the statement until a single truth value is determined. The semantic
rules are summarized in Table 4.1 where t and t' denote any true statements, f
and f' denote any false statements, and a is any statement.
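Table 4.1
Rule   True statements   False statements
1.     T                 F
2.     ~f                ~t
3.     t & t'            f & a
4.     t V a             a & f
5.     a V t             f V f'
6.     a → t             t → f
7.     f → a             t ↔ f
8.     t ↔ t'            f ↔ t
9.     f ↔ f'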
We can now find the meaning of any statement given an interpretation I for
the statement. For example, let I assign true to P, false to Q and false to R in the
statement
((P & ~Q) → R) V Q
Application of rule 2 then gives ~Q as true, rule 3 gives (P & ~Q) as true, rule 6
gives (P & ~Q) → R as false, and rule 5 gives the statement value as false.
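The repeated application of semantic rules to larger and larger parts of a statement is naturally written as a recursive evaluator. The sketch below is one possible encoding, assumed for illustration: formulas are nested tuples, symbols are strings, and the connective names ("not", "and", "or", "implies", "iff") are chosen here rather than taken from the text.

```python
def evaluate(formula, interpretation):
    """Truth value of a formula given as a symbol name or a nested tuple."""
    if isinstance(formula, str):
        return interpretation[formula]          # look up a propositional symbol
    op = formula[0]
    if op == "not":
        return not evaluate(formula[1], interpretation)
    left = evaluate(formula[1], interpretation)
    right = evaluate(formula[2], interpretation)
    if op == "and":
        return left and right
    if op == "or":
        return left or right
    if op == "implies":
        return (not left) or right
    if op == "iff":
        return left == right
    raise ValueError("unknown connective: " + op)

# ((P & ~Q) -> R) V Q with P true, Q false, R false
statement = ("or", ("implies", ("and", "P", ("not", "Q")), "R"), "Q")
I = {"P": True, "Q": False, "R": False}
print(evaluate(statement, I))   # False
```

The evaluator visits subformulas in the same inside-out order as the worked example: first ~Q, then the conjunction, then the implication, and finally the disjunction.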
Properties of Statements
Equivalence. Two sentences are equivalent if they have the same truth value
under every interpretation.
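Since the number of interpretations is finite, equivalence can be checked mechanically by comparing truth values under every one of them. The following Python sketch does this for two sentences supplied as functions of an interpretation; the function names are illustrative assumptions.

```python
from itertools import product

def equivalent(f, g, symbols):
    """True if f and g have the same truth value under every interpretation."""
    for values in product([True, False], repeat=len(symbols)):
        interpretation = dict(zip(symbols, values))
        if f(interpretation) != g(interpretation):
            return False
    return True

def p_implies_q(i):
    return not (i["P"] and not i["Q"])   # false only when P is true and Q is false

def not_p_or_q(i):
    return (not i["P"]) or i["Q"]        # ~P V Q

def q_implies_p(i):
    return not (i["Q"] and not i["P"])   # the converse, Q -> P

print(equivalent(p_implies_q, not_p_or_q, ["P", "Q"]))    # True
print(equivalent(p_implies_q, q_implies_p, ["P", "Q"]))   # False
```

The first check confirms the familiar equivalence of P → Q and ~P V Q; the second shows that an implication is not equivalent to its converse.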
Inference Rules
The inference rules of PL provide the means to perform logical proofs or deductions.
The problem is, given a set of sentences S = {s1, . . . , sn} (the premises), prove
the truth of s (the conclusion); that is, show that S ⊢ s. The use of truth tables to do
this is a form of semantic proof. Other syntactic methods of inference or deduction
are also possible. Such methods do not depend on truth assignments but on syntactic
relationships only; that is, it is possible to derive new sentences which are logical
consequences of s1, . . . , sn using only syntactic operations. We present a few
such rules now which will be referred to often throughout the text.
For example,
given: (Joe is a father)
and: (Joe is a father) → (Joe has a child)
conclude: (Joe has a child)
For example,
given: (programmer likes LISP) → (programmer hates COBOL)
and: (programmer hates COBOL) → (programmer likes recursion)
conclude: (programmer likes LISP) → (programmer likes recursion)
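Both examples are purely syntactic: the conclusion is obtained by matching the form of the premises, not by consulting truth values. A minimal Python sketch of the two patterns follows, with implications encoded as (antecedent, consequent) pairs. The first pattern is modus ponens; the second, chaining two implications, is commonly called the chain rule or hypothetical syllogism. The function names and the tuple encoding are assumptions of this sketch.

```python
def modus_ponens(fact, implication):
    """From P and P -> Q conclude Q; return None if the rule does not apply."""
    antecedent, consequent = implication
    return consequent if fact == antecedent else None

def chain_rule(first, second):
    """From P -> Q and Q -> R conclude P -> R; None if they do not link up."""
    p, q = first
    q2, r = second
    return (p, r) if q == q2 else None

print(modus_ponens("Joe is a father", ("Joe is a father", "Joe has a child")))
# Joe has a child
print(chain_rule(("programmer likes LISP", "programmer hates COBOL"),
                 ("programmer hates COBOL", "programmer likes recursion")))
# ('programmer likes LISP', 'programmer likes recursion')
```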
Soundness. Let <S, L> be a formal system. We say the inference procedures
L are sound if and only if any statement s that can be derived from <S, L> is a
logical consequence of <S, L>.
Completeness. Let <S, L> be a formal system. Then the inference
procedure L is complete if and only if any sentence s logically implied by <S, L> can
be derived using that procedure.
As an example of the above definitions, suppose S = {P, P → Q} and L is
the modus ponens rule. Then <S, L> is a formal system, since Q can be derived
from the system. Furthermore, this system is both sound and complete for the
reasons given above.
We will see later that a formal system like resolution permits us to perform
computational reasoning. Clearly, soundness and completeness are desirable
properties of such systems. Soundness is important to ensure that all derived sentences
are true when the assumed set of sentences are true. Completeness is important to
guarantee that inconsistencies can be found whenever they exist in a set of sentences.
introduction of predicates in place of propositions, the use of functions and the use
of variables together with variable quantifiers. These concepts are formalized below.
The syntax for FOPL, like PL, is determined by the allowable symbols and
rules of combination. The semantics of FOPL are determined by interpretations
assigned to predicates, rather than propositions. This means that an interpretation
must also assign values to other terms including constants, variables and functions,
since predicates may have arguments consisting of any of these terms. Therefore,
the arguments of a predicate must be assigned before an interpretation can be made.
Syntax of FOPL
The symbols and rules of combination permitted in FOPL are defined as follows.
Connectives. There are five connective symbols: ~ (not or negation), & (and
or conjunction), V (or or inclusive disjunction, that is, A or B or both A and B),
→ (implication), ↔ (equivalence or if and only if).
Variables. Variables are terms that can assume different values over a given
domain. They are denoted by words and small letters near the end of the alphabet,
such as aircraft-type, individuals, x, y, and z.
E1: All employees earning $1400 or more per year pay taxes.
E2: Some employees are sick today.
E3: No employee earns more than the president.
To represent such expressions in FOPL, we must define abbreviations for the predicates
and functions. We might, for example, define the following.
In the above, we read E1' as "for all x if x is an employee and the income
of x is greater than or equal to $1400 then x pays taxes." More naturally, we read
E1' as E1, that is as "all employees earning $1400 or more per year pay taxes."
E2' is read as "there is an employee and the employee is sick today" or
"some employee is sick today." E3' reads "for all x and for all y if x is an
employee and y is president, the income of x is not greater than or equal to the
income of y." Again, more naturally, we read E3' as "no employee earns more
than the president."
The expressions E1', E2', and E3' are known as well-formed formulas or
wffs (pronounced woofs) for short. Clearly, all of the wffs used above would be
more meaningful if full names were used rather than abbreviations.
Wffs are defined recursively as follows:
The above rules state that all wffs are formed from atomic formulas and the
proper application of quantifiers and logical connectives.
Some examples of valid wffs are
MAN(john)
PILOT(father-of(bill))
∃x∃y∃z ((FATHER(x,y) & FATHER(y,z)) → GRANDFATHER(x,z))
∀x NUMBER(x) → (∃y GREATER-THAN(y,x))
∀x ∃y (P(x) & Q(y)) → (R(a) V Q(b))
Some examples of statements that are not wffs are
∀P P(x) → Q(x)
MAN(~john)
father-of(Q(x))
MARRIED(MAN,WOMAN)
The first group of examples above are all wffs since they are properly formed
expressions composed of atoms, logical connectives, and valid quantifications.
Examples in the second group fail for different reasons. In the first expression, universal
quantification is applied to the predicate P(x). This is invalid in FOPL. The second
expression is invalid since the term john, a constant, is negated. Recall that predicates,
and not terms, are negated. The third expression is invalid since it is a function
with a predicate argument. The last expression fails since it is a predicate with two
predicate arguments.
When considering specific wffs, we always have in mind some domain D. If not
stated explicitly, D will be understood from the context. D is the set of all elements
or objects from which fixed assignments are made to constants and from which the
domain and range of functions are defined. The arguments of predicates must be
terms (constants, variables, or functions). Therefore, the domain of each n-place
predicate is also defined over D.
For example, our domain might be all entities that make up the Computer
Science Department at the University of Texas. In this case, constants would be
professors (Bell, Cooke, Gelfond, and so on), staff (Martha, Pat, Linda, and so
on), books, labs, offices, and so forth. The functions we may choose might be
Predicates may be quantified in second order predicate logic as indicated in the example, but
never in first order logic.
Since the whole expression E is quantified with the universal quantifier ∀x,
it will evaluate to true only if it evaluates to true for all x in the domain D. Thus,
to complete our example, suppose E is interpreted as follows: Define the domain
D = {1,2} and from D let the interpretation I assign the following values:
=2
f(1) = 2, f(2) = 1
A(2,1) = true, A(2,2) = false
B(1) = true, B(2) = false
C(1) = true, C(2) = false
D(1) = false, D(2) = true
As in the case of PL, the evaluation of complex formulas in FOPL can often be
facilitated through the substitution of equivalent formulas. Table 4.3 lists a number
of equivalent expressions. In the table F, G and H denote wffs not containing
variables and F(x) denotes the wff F which contains the variable x. The equivalences
can easily be verified with truth tables such as Table 4.3 and simple arguments for
the expressions containing quantifiers. Although Tables 4.4 and 4.3 are similar,
there are some notable differences, particularly in the wffs containing quantifiers.
For example, attention is called to the last four expressions which govern substitutions
involving negated quantifiers and the movement of quantifiers across conjunctive
and disjunctive connectives.
We summarize here some definitions which are similar to those of the previous
section. A wff is said to be valid if it is true under every interpretation. A wff that
is false under every interpretation is said to be inconsistent (or unsatisfiable). A
wff that is not valid (one that is false for some interpretation) is invalid. Likewise,
a wff that is not inconsistent (one that is true for some interpretation) is satisfiable.
Again, this means that a valid wff is satisfiable and an inconsistent wff is invalid,
but the respective converse statements do not hold. Finally, we say that a wff Q is
a logical consequence of the wffs P1, P2, . . . , Pn if and only if whenever P1 &
P2 & . . . & Pn is true under an interpretation, Q is also true.
To illustrate some of these concepts, consider the following examples:
CLEVER(bill) and
∀x CLEVER(x) → SUCCEED(x)
CLEVER(bill) and
∀x CLEVER(x) → SUCCEED(x)
CLEVER(bill) → SUCCEED(bill)
is certainly true since the wff was assumed to be true for all x, including
x = bill. But,
CLEVER(bill) → SUCCEED(bill)
~CLEVER(bill) V SUCCEED(bill)
These steps are described in more detail below. But first, we describe the
process of eliminating the existential quantifiers through a substitution process. This
process requires that all such variables be replaced by something called Skolem
functions, arbitrary functions which can always assume a correct value required of
an existentially quantified variable.
For simplicity in what follows, assume that all quantifiers have been properly
moved to the left side of the expression, and each quantifies a different variable.
Skolemization, the replacement of existentially quantified variables with Skolem
functions and deletion of the respective quantifiers, is then accomplished as follows:
1. If the first (leftmost) quantifier in an expression is an existential quantifier,
replace all occurrences of the variable it quantifies with an arbitrary constant not
appearing elsewhere in the expression and delete the quantifier. This same procedure
should be followed for all other existential quantifiers not preceded by a universal
quantifier, in each case using different constant symbols in the substitution.
2. For each existential quantifier that is preceded by one or more universal
quantifiers (is within the scope of one or more universal quantifiers), replace all
occurrences of the existentially quantified variable by a function symbol not appearing
elsewhere in the expression. The arguments assigned to the function should match
all the variables appearing in each universal quantifier which precedes the existential
quantifier. This existential quantifier should then be deleted. The same process should
be repeated for each remaining existential quantifier using a different function symbol
and choosing function arguments that correspond to all universally quantified variables
that precede the existentially quantified variable being replaced.
An example will help to clarify this process. Given the expression
∃u ∀v ∀x ∃y P(f(u),v,x,y) → Q(u,v,y)
the Skolem form is determined as
∀v ∀x P(f(a),v,x,g(v,x)) → Q(a,v,g(v,x)).
In making the substitutions, it should be noted that the variable u appearing after
the first existential quantifier has been replaced in the second expression by the
arbitrary constant a. This constant did not appear elsewhere in the first expression.
The variable y has been replaced by the function symbol g having the variables v
and x as arguments, since both of these variables are universally quantified to the
left of the existential quantifier for y. Replacement of y by an arbitrary function
with arguments v and x is justified on the basis that y, following v and x, may be
functionally dependent on them and, if so, the arbitrary function g can account for
this dependency. The complete procedure can now be given to convert any FOPL
sentence into clausal form.
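The choice of Skolem constants and functions depends only on the order of the quantifiers, so it can be sketched as a single left-to-right pass over the quantifier prefix. In the Python sketch below the prefix encoding and the generated names sk_c1, sk_f1, and so on are hypothetical; they play the roles of a and g in the example above. Applying the resulting substitution to the body of the expression is left out.

```python
from itertools import count

def skolemize(prefix):
    """Map each existential variable in a prenex prefix to a Skolem term.

    prefix is a list of (quantifier, variable) pairs read left to right,
    with quantifier "exists" or "forall". Returns (remaining universal
    variables, substitution), where substitution maps variable -> term.
    """
    constants = (f"sk_c{i}" for i in count(1))   # fresh constant names (assumed)
    functions = (f"sk_f{i}" for i in count(1))   # fresh function names (assumed)
    universals, substitution = [], {}
    for quantifier, variable in prefix:
        if quantifier == "forall":
            universals.append(variable)
        elif universals:
            # Preceded by universal quantifiers: a function of all of them.
            substitution[variable] = f"{next(functions)}({','.join(universals)})"
        else:
            # No universal quantifier to the left: a fresh constant.
            substitution[variable] = next(constants)
    return universals, substitution

prefix = [("exists", "u"), ("forall", "v"), ("forall", "x"), ("exists", "y")]
print(skolemize(prefix))
# (['v', 'x'], {'u': 'sk_c1', 'y': 'sk_f1(v,x)'})
```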
Step 6. Eliminate all universal quantifiers and conjunctions since they are
retained implicitly. The resulting expressions (the expressions previously connected
by the conjunctions) are clauses and the set of such expressions is said to be in
clausal form.
P(f(a),y,g(y)) V Q(a,h(y))
P(f(a),y,g(y)) V R(y,l(y))
The last two clauses of our final form are understood to be universally quantified
in the variable y and to have the conjunction symbol connecting them.
It should be noted that the set of clauses produced by the above process is
not equivalent to the original expression, but satisfiability is retained. That is, the
set of clauses is satisfiable if and only if the original sentence is satisfiable.
Having now labored through the tedious steps above, we point out that it is
often possible to write down statements directly in clausal form without working
through the above process step-by-step. We illustrate how this may be done in
Section 4.7 when we create a sample knowledge base.
Sec. 4.6 Inference Rules 65
Like PL, a key inference rule in FOPL is modus ponens. From the assertion "Leo
is a lion" and the implication "all lions are ferocious" we can conclude that Leo
is ferocious. Written in symbolic form we have
assertion: LION(leo)
implication: ∀x LION(x) → FEROCIOUS(x)
conclusion: FEROCIOUS(leo)
In general, if a has property P and all objects that have property P also have
property Q, we conclude that a has property Q.
P(a)
∀x P(x) → Q(x)
Q(a)
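As a small illustration, modus ponens over one-place predicates can be applied mechanically. The sketch below assumes facts are stored as `(predicate, constant)` pairs and each rule `(P, Q)` stands for ∀x P(x) → Q(x); the function name and data layout are choices made for this example only.

```python
def modus_ponens(facts, rules):
    """facts: set of (predicate, constant) pairs.
    rules: list of (P, Q) pairs, each read as 'for all x, P(x) -> Q(x)'.
    Returns the facts closed under repeated application of modus ponens."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for p, q in rules:
            for pred, const in list(derived):
                if pred == p and (q, const) not in derived:
                    derived.add((q, const))   # from P(a) and P(x) -> Q(x), infer Q(a)
                    changed = True
    return derived

facts = {("LION", "leo")}
rules = [("LION", "FEROCIOUS")]
conclusions = modus_ponens(facts, rules)
```

After the call, `conclusions` contains `("FEROCIOUS", "leo")` alongside the original assertion.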
Unification
Any substitution that makes two or more expressions equal is called a unifier for
the expressions. Applying a substitution to an expression E produces an instance
E' of E where E' = Eβ. Given two expressions that are unifiable, such as expressions
C1 and C2 with a unifier β with C1β = C2β, we say that β is a most general unifier
(mgu) if any other unifier α is an instance of β. For example, two unifiers for the
literals P(u,b,v) and P(a,x,y) are α = {a/u, b/x, v/y} and β = {a/u, b/x, c/v, c/y}. The
former is an mgu whereas the latter is not since it is an instance of the former.
Unification can sometimes be applied to literals within the same single clause.
When an mgu exists such that two or more literals within a clause are unified, the
clause remaining after deletion of all but one of the unified literals is called a
66 Formalized Symbolic Logics Chap. 4
factor of the original clause. Thus, given the clause C = P(x) ∨ Q(x,y) ∨ P(f(z)),
the factor C' = Cβ = P(f(z)) ∨ Q(f(z),y) is obtained where β = {f(z)/x}.
Let S be a set of expressions. We define the disagreement set of S as the set
obtained by comparing each s y mbol of all expressions in S from left to right and
extracting from S the subexpressions whose first symbols do not agree. For example,
let S = {P(f(x),g(y),a), P(f(x),z,a), P(f(x),b,h(u))}. For the set S, the disagreement
set is {g(y),a,b,z,h(u)}. We can now state a unification algorithm which returns the
mgu for a given set of expressions S.
Unification algorithm:
not unifiable.
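A common way to compute an mgu for two expressions is recursive unification with an occurs check. The sketch below is that standard textbook technique, not a transcription of the algorithm stated above. It assumes variables are strings beginning with `?`, constants are other strings, and compound terms and literals are tuples `(functor, arg, ...)`.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def walk(t, s):
    """Follow variable bindings in substitution s until an unbound term is reached."""
    while is_var(t) and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    """True if variable v occurs inside term t under substitution s."""
    t = walk(t, s)
    if t == v:
        return True
    if isinstance(t, tuple):
        return any(occurs(v, a, s) for a in t[1:])
    return False

def unify(a, b, s=None):
    """Return a most general unifier (as a dict) or None if not unifiable."""
    s = {} if s is None else s
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return None if occurs(a, b, s) else {**s, a: b}
    if is_var(b):
        return None if occurs(b, a, s) else {**s, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0] and len(a) == len(b):
        for x, y in zip(a[1:], b[1:]):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None

def substitute(t, s):
    """Apply substitution s to term t."""
    t = walk(t, s)
    if isinstance(t, tuple):
        return (t[0],) + tuple(substitute(a, s) for a in t[1:])
    return t

lit1 = ("P", "?u", "b", "?v")
lit2 = ("P", "a", "?x", "?y")
mgu = unify(lit1, lit2)
```

For the literals P(u,b,v) and P(a,x,y) the result binds u to a, x to b, and v to y. An empty dictionary is a valid (trivial) unifier, so failure should be tested with `is None`.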
We are now ready to consider the resolution principle, a syntactic inference procedure
which, when applied to a set of clauses, determines if the set is unsatisfiable. This
procedure is similar to the process of obtaining a proof by contradiction. For example,
suppose we have the set of clauses (axioms) C1, C2, . . . , Cn and we wish to
deduce or prove the clause D, that is, to show that D is a logical consequence of
C1 & C2 & . . . & Cn. First, we negate D and add ¬D to the set of clauses C1,
C2, . . . , Cn. Then, using resolution together with factoring, we can show that
the set is unsatisfiable by deducing a contradiction. Such a proof is called a proof
by refutation which, if successful, yields the empty clause denoted by □.² Resolution
with factoring is complete in the sense that it will always generate the empty clause
from a set of unsatisfiable clauses.
Resolution is very simple. Given two clauses C1 and C2 with no variables in
common, if there is a literal l1 in C1 which is a complement of a literal l2 in C2,
both l1 and l2 are deleted and a disjunct C is formed from the remaining reduced
clauses. The new clause C is called the resolvent of C1 and C2. Resolution is the
process of generating these resolvents from a set of clauses. For example, to resolve
the two clauses
(¬P ∨ Q) and (¬Q ∨ R)
²The empty clause □ is always false since no interpretation can satisfy it. It is derived from
combining contradictory clauses such as P and ¬P.
Sec. 4.7 The Resolution Principle 67
we write
¬P ∨ Q, ¬Q ∨ R
--------------
¬P ∨ R
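For the propositional case, resolution and proof by refutation can be sketched in a few lines. The representation is an assumption of this example: a clause is a set of literal strings, and a leading `~` marks negation. Because clauses are sets, duplicate literals merge automatically, which plays the role of factoring here. The loop saturates the clause set, so it is suitable only for small examples.

```python
from itertools import combinations

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolvents(c1, c2):
    """All clauses obtained by resolving c1 and c2 on one complementary pair."""
    out = set()
    for lit in c1:
        if negate(lit) in c2:
            out.add(frozenset((c1 - {lit}) | (c2 - {negate(lit)})))
    return out

def refute(clauses):
    """Return True if the empty clause can be derived (the set is unsatisfiable)."""
    clauses = {frozenset(c) for c in clauses}
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolvents(c1, c2):
                if not r:            # empty clause: contradiction found
                    return True
                new.add(r)
        if new <= clauses:           # nothing new can be generated
            return False
        clauses |= new

parents = (frozenset({"~P", "Q"}), frozenset({"~Q", "R"}))
resolvent = resolvents(*parents)

# To prove R from P, ~P v Q, ~Q v R: add the negated goal ~R and refute.
proved = refute([{"P"}, {"~P", "Q"}, {"~Q", "R"}, {"~R"}])
```

The first call reproduces the resolvent ¬P ∨ R of the two parent clauses; the second shows the refutation pattern, where the goal R is negated and added before searching for the empty clause.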
Several types of resolution are possible depending on the number and types
of parents. We define a few of these types below.
The substitution {b/x} was made in the two parent clauses to produce the
complementary literals Q(b) and ¬Q(b) which were then deleted from the disjunction
of the two parent clauses.
where the substitution θ = {sue/x, joe/y, bill/z} is used, results in the unit clause
MOTHER(sue,bill).
can be intolerably inefficient. Randomly resolving clauses in a large set can result
in inefficient or even impossible proofs. Typically, the curse of combinatorial explosion
occurs. So methods which constrain the search in some way must be used.
When attempting a proof by resolution, one ideally would like a minimally
unsatisfiable set of clauses which includes the conjectured clause. A minimally unsatisfiable
set is one which is satisfiable when any member of the set is omitted. The
reason for this choice is that irrelevant clauses, which are not needed in the proof
but which participate in it, lead to unnecessary resolutions. They contribute nothing toward
the proof. Indeed, they can sidetrack the search direction, resulting in a dead end
and loss of resources. Of course, the set must be unsatisfiable; otherwise a proof is
impossible.
A minimally unsatisfiable set is ideal in the sense that all clauses are essential
and no others are needed. Thus, if we wish to prove B, we would like to do so
with a set of clauses S = {A1, A2, . . . , Ak} which becomes minimally unsatisfiable
with the addition of ¬B.
Choosing the order in which clauses are resolved is known as a search strategy.
While there are many such strategies now available, we define only one of the
more important ones, the set-of-support strategy. This strategy separates a set which
is unsatisfiable into subsets, one of which is satisfiable.
Example of Resolution
The example we present here is one to which all AI students should be exposed at
some point in their studies. It is the famous "monkey and bananas problem,"
another one of those complex real life problems solvable with AI techniques. We
envision a room containing a monkey, a chair, and some bananas that have been
hung from the center of the ceiling, out of reach from the monkey. If the monkey
is clever enough, he can reach the bananas by placing the chair directly below
them and climbing on top of the chair. The problem is to use FOPL to represent
this monkey-banana world and, using resolution, prove the monkey can reach the
bananas.
In creating a knowledge base, it is essential first to identify all relevant objects
which will play some role in the anticipated inferences. Where possible, irrelevant
objects should be omitted, but never at the risk of incompleteness. For example,
in the current problem, the monkey, bananas, and chair are essential. Also needed
is some reference object such as the floor or ceiling to establish the height relationship
between monkey and bananas. Other objects such as windows, walls or doors are
not relevant.
The next step is to establish important properties of objects, relations between
them, and any assertions likely to be needed. These include such facts as the chair
is tall enough to raise the monkey within reach of the bananas, the monkey is
dexterous, the chair can be moved under the bananas, and so on. Again, all important
properties, relations, and assertions should be included and irrelevant ones omitted.
Otherwise, unnecessary inference steps may be taken.
The important factors for our problem are described below, and all items
needed for the actual knowledge base are listed as axioms. These are the essential
facts and rules. Although not explicitly indicated, all variables are universally quanti-
fied.
Variables: {x, y, z}
AXIOMS
in_room(bananas)
in_room(chair)
in_room(monkey)
dexterous(monkey)
tall(chair)
¬close(bananas,floor)
can_move(monkey,chair,bananas)
can_climb(monkey,chair)
(dexterous(x) & close(x,y) → can_reach(x,y))
((get_on(x,y) & under(y,bananas) & tall(y)) → close(x,bananas))
((in_room(x) & in_room(y) & in_room(z) & can_move(x,y,z)) →
close(z,floor) ∨ under(y,z))
(can_climb(x,y) → get_on(x,y))
Using the above axioms, a knowledge base can be written down directly in
the required clausal form. All that is needed to make the necessary substitutions
are the equivalences
P → Q = ¬P ∨ Q
and De Morgan's laws. To relate the clauses to a LISP program, one may prefer
to think of each clause as being a list of items. For example, number 9, below,
would be written as
1. in_room(monkey)
2. in_room(bananas)
3. in_room(chair)
4. tall(chair)
5. dexterous(monkey)
6. can_move(monkey,chair,bananas)
7. can_climb(monkey,chair)
8. ¬close(bananas,floor)
9. ¬can_climb(x,y) ∨ get_on(x,y)
10. ¬dexterous(x) ∨ ¬close(x,y) ∨ can_reach(x,y)
11. ¬get_on(x,y) ∨ ¬under(y,bananas) ∨ ¬tall(y) ∨ close(x,bananas)
12. ¬in_room(x) ∨ ¬in_room(y) ∨ ¬in_room(z) ∨ ¬can_move(x,y,z) ∨ close(z,floor) ∨ under(y,z)
13. ¬can_reach(monkey,bananas)
Resolution proof. A proof that the monkey can reach the bananas is summarized
below. As can be seen, this is a refutation proof where the statement to be
proved (can_reach(monkey,bananas)) has been negated and added to the knowledge
base (number 13). The proof then follows when a contradiction is found (see number
23, below).
% Constants:
%   {floor, chair, bananas, monkey}
% Variables:
%   {X, Y, Z}
% Predicates:
%   can_reach(X,Y)   X can reach Y
%   dexterous(X)     X is a dexterous animal
%   close(X,Y)       X is close to Y
% Axioms:
in_room(bananas).
in_room(chair).
in_room(monkey).
dexterous(monkey).
tall(chair).
can_move(monkey, chair, bananas).
can_climb(monkey, chair).
can_reach(X,Y) :-
    dexterous(X), close(X,Y).
close(X,Z) :-
    get_on(X,Y),
    under(Y,Z),
    tall(Y).
get_on(X,Y) :-
    can_climb(X,Y).
under(Y,Z) :-
    in_room(X),
    in_room(Y),
    in_room(Z),
    can_move(X,Y,Z).
Sec. 4.8 Nondeductive Inference Methods
This completes the data base of facts and rules required. Now we can pose various
queries to our theorem proving system.
| ?- can_reach(X,Y).
X = monkey,
Y = bananas
| ?- can_reach(X,bananas).
X = monkey
| ?- can_reach(monkey,Y).
Y = bananas
| ?- can_reach(monkey,bananas).
yes
| ?- can_reach(lion,bananas).
no
| ?- can_reach(monkey,apple).
no
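The way such queries are answered can be imitated with a small backward chainer. This is a simplified Python sketch of goal-directed search over the same facts and rules, not a description of any particular PROLOG system. It assumes variables are strings starting with an uppercase letter, all arguments are atoms or variables (so no occurs check is needed), and the fact can_move(monkey, chair, bananas) is present as in the axiom list.

```python
from itertools import count

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def walk(t, s):
    while is_var(t) and t in s:
        t = s[t]
    return t

def unify(a, b, s):
    """Unify two flat terms under substitution s; return a new dict or None."""
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return {**s, a: b}
    if is_var(b):
        return {**s, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None

# (head, body) pairs; facts have an empty body.
RULES = [
    (("in_room", "bananas"), []), (("in_room", "chair"), []), (("in_room", "monkey"), []),
    (("dexterous", "monkey"), []), (("tall", "chair"), []),
    (("can_move", "monkey", "chair", "bananas"), []),
    (("can_climb", "monkey", "chair"), []),
    (("can_reach", "X", "Y"), [("dexterous", "X"), ("close", "X", "Y")]),
    (("close", "X", "Z"), [("get_on", "X", "Y"), ("under", "Y", "Z"), ("tall", "Y")]),
    (("get_on", "X", "Y"), [("can_climb", "X", "Y")]),
    (("under", "Y", "Z"), [("in_room", "X"), ("in_room", "Y"), ("in_room", "Z"),
                           ("can_move", "X", "Y", "Z")]),
]

fresh = count()

def rename(term, n):
    """Give the variables of a rule fresh names so separate uses do not clash."""
    if isinstance(term, tuple):
        return tuple(rename(t, n) for t in term)
    return f"{term}_{n}" if is_var(term) else term

def solve(goals, s):
    """Depth-first, left-to-right search; yields every substitution proving the goals."""
    if not goals:
        yield s
        return
    first, rest = goals[0], goals[1:]
    for head, body in RULES:
        n = next(fresh)
        s2 = unify(first, rename(head, n), s)
        if s2 is not None:
            yield from solve([rename(b, n) for b in body] + rest, s2)

def ask(goal):
    """Return the distinct answers, each as the goal with its variables filled in."""
    answers = []
    for s in solve([goal], {}):
        answer = tuple(walk(t, s) for t in goal)
        if answer not in answers:
            answers.append(answer)
    return answers
```

An empty answer list corresponds to the reply "no"; for instance, `ask(("can_reach", "lion", "bananas"))` finds nothing because dexterous(lion) cannot be established.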
In this section we consider three nondeductive forms of inferencing. These are not
valid forms of inferencing, but they are nevertheless very important. We use all
three methods often in everyday activities where we draw conclusions and make
decisions. The three methods we consider here are abduction, induction, and analogical
inference.
Abductive Inference
conclusion. People may stagger when they walk for other reasons, including dizziness
from twirling in circles or from some physical problem.
We may represent abductive inference with the following, where the c over
the implication arrow is meant to imply a possible causal relationship.
assertion: Q
implication: P →(c) Q
conclusion: P
Abductive inference is useful when known causal relations are likely and deductive
inferencing is not possible for lack of facts.
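Abduction runs a causal rule backwards: given an observed Q, it proposes every P known to cause Q. The following sketch makes that explicit. The rule set is hypothetical and chosen only to echo the staggering example; the result is a list of candidate explanations, none of which is a valid deduction.

```python
def abduce(observation, causal_rules):
    """Return every cause P for which a rule 'P causes observation' is known.
    The hypotheses are plausible explanations only, not guaranteed conclusions."""
    return [cause for cause, effect in causal_rules if effect == observation]

# Hypothetical causal rules (cause, effect) for illustration.
causal_rules = [
    ("drunk", "staggers"),
    ("dizzy", "staggers"),
    ("injured", "staggers"),
]
hypotheses = abduce("staggers", causal_rules)
```

The observation "staggers" yields three competing hypotheses, which is exactly why abductive conclusions must be treated as tentative.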
Inductive Inference
Analogical Inference
-i Q
Analogical inference, like abductive and inductive inference, is a useful but invalid form
of commonsense inference.
Rules can be considered a subset of predicate logic. They have become a popular
representation scheme for expert systems (also called rule-based systems). They
were first used in the General Problem Solver system in the early 1970s (Newell
and Simon, 1972).
Rules have two component parts: a left-hand side (LHS) referred to as the
antecedent, premise, condition, or situation, and a right-hand side (RHS) known
as the consequent, conclusion, action, or response. The LHS is also known as the
if part and the RHS as the then part of the rule. Some rules also include an else
part. Examples of rules which might be used in expert systems are given below.
• IF: A & B & C
THEN: D
A & B & (C ∨ D) → D
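A rule of the form IF A & B & C THEN D can be executed by a simple forward-chaining loop that fires any rule whose conditions are all present among the known facts. The sketch below assumes purely conjunctive conditions and adds a second, hypothetical rule (D implies E) to show chaining; it is not how any specific expert system shell works.

```python
def forward_chain(facts, rules):
    """facts: iterable of known symbols.
    rules: list of (conditions, conclusion), conditions being a tuple of symbols.
    Fires rules repeatedly until no new conclusion can be added."""
    facts = set(facts)
    fired = True
    while fired:
        fired = False
        for conditions, conclusion in rules:
            if conclusion not in facts and all(c in facts for c in conditions):
                facts.add(conclusion)   # LHS satisfied: assert the RHS
                fired = True
    return facts

rules = [
    (("A", "B", "C"), "D"),   # IF A & B & C THEN D
    (("D",), "E"),            # hypothetical second rule for illustration
]
result = forward_chain({"A", "B", "C"}, rules)
```

With A, B, and C known, the first rule adds D and the second then adds E; with only A and B known, no rule fires.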
INTERNAL FORM
RULE047
Premise: ($and (same cntxt site blood)
               (notdefinite cntxt ident)
               (same cntxt stain gramneg)
               (same cntxt morph rod)
               (same cntxt burn t))
Action: (conclude cntxt ident pseudomonas 0.4)
ENGLISH TRANSLATION
IF: 1) The site of the culture is blood, and
    2) The identity of the organism is not known with certainty, and
    3) The stain of the organism is gramneg, and
    4) The morphology of the organism is rod, and
    5) The patient has been seriously burned
THEN: There is weakly suggestive evidence (0.4) that the identity of the organism
      is pseudomonas.
Figure 4.1 A rule from the MYCIN system.
4.10 SUMMARY
We have considered propositional and first order predicate logics in this chapter as
knowledge representation schemes. We learned that while PL has a sound theoretical
foundation, it is not expressive enough for many practical problems. FOPL, on the
other hand, provides a theoretically sound basis and permits a great latitude of
expressiveness. In FOPL one can easily code object descriptions and relations among
objects as well as general assertions about classes of similar objects. The increased
generality comes from the joining of predicates, functions, variables, and quantifiers.
Perhaps the most difficult aspect in using FOPL is choosing appropriate functions
and predicates for a given class of problems.
MYCIN was one of the earliest expert systems. It was developed at Stanford University in the
mid-1970s to demonstrate that a system could successfully perform diagnoses of patients having infectious
blood diseases.
Both the syntax and semantics of PL and FOPL were defined and examples
given. Equivalent expressions were presented and the use of truth tables was illustrated
to determine the meaning of complex formulas. Rules of inference were also presented,
providing the means to derive conclusions from a basic set of facts or axioms.
Three important syntactic inference methods were defined: modus ponens, chain
rule and resolution. These rules may be summarized as follows.
EXERCISES
4.1 Construct a truth table for the expression (A & (A V B)). What single term is this
expression equivalent to?
4.2 Prove the following rules from Section 4.2.
(a) Simplification: From P & Q, infer P
(b) Conjunction: From P and Q, infer P & Q
(c) Transposition: From P → Q, infer ¬Q → ¬P
4.3 Given the following PL expressions, place parentheses in the appropriate places to
form fully abbreviated wffs.
(a)7'VQ&R—'S--U&Q
(b) P & Q V P -. (I -. R
(c)QVPVR&S—U&PR
4.4 Translate the following axioms into predicate calculus wffs. For example, A1 below
could be given as
∀x,y,z CONNECTED(x,y,z) & BIKESOK(z) → GETTO(x,y)
S1:(P&Q)V(P&Q) S:(PVQ)-.(P&Q)
S:(P&Q)—.RVQ . S4:(PVQ)&(PVQ)Vp
Sç:P . Q—. P S5:PVQ&PVQ&P
4.11 Given the following wffs P → Q, ¬Q, and ¬P, show that ¬P is a logical consequence
of the two preceding wffs
(a) Using a truth table
(b) Using Theorem 4.1
4.12 Given formulas S1 and S2 below, show that Q(a) is a logical consequence of the two.
S1: (∀x)(P(x) → Q(x))    S2: P(a)
4.13 Transform the following formula to prenex normal form
In other words, strict classical logic formalisms do not provide realistic representations
of the world in which we live. On the contrary, intelligent beings are continuously
required to make decisions under a veil of uncertainty.
Uncertainty can arise from a variety of sources. For one thing, the information
we have available may be incomplete or highly volatile. Important facts and details
which have a bearing on the problems at hand may be missing or may change
rapidly. In addition, many of the "facts" available may be imprecise, vague, or
fuzzy. Indeed, some of the available information may be contradictory or even
unbelievable. However, despite these shortcomings, we humans miraculously deal
with uncertainties on a daily basis and usually arrive at reasonable solutions. If it
were otherwise, we would not be able to cope with the continually changing situations
of our world.
In this and the following chapter, we shall discuss methods with which to
accurately represent and deal with different forms of inconsistency, uncertainty,
possibility, and beliefs. In other words, we shall be interested in representations
and inference methods related to what is known as commonsense reasoning.
5.1 INTRODUCTION
Consider the following real-life situation. Timothy enjoys shopping in the mall
only when the stores are not crowded. He has agreed to accompany Sue there on
the following Friday evening since this is normally a time when few people shop.
Before the given date, several of the larger stores announce a one-time, special
sale starting on that Friday evening. Timothy, fearing large crowds, now retracts
the offer to accompany Sue, promising to go on some future date. On the Thursday
before the sale was to commence, weather forecasts predicted heavy snow. Now,
believing the weather would discourage most shoppers, Timothy once again agreed
to join Sue. But, unexpectedly, on the given Friday, the forecasts proved to be
false, so Timothy once again declined to go.
This anecdote illustrates how one's beliefs can change in a dynamic environment.
And, while one's beliefs may not fluctuate as much as Timothy's in most situations,
this form of belief revision is not uncommon. Indeed, it is common enough that
we label it as a form of commonsense reasoning, that is, reasoning with uncertain
knowledge.
Nonmonotonic Reasoning
The logics we studied in the previous chapter are known as monotonic logics. The
conclusions derived using such logics are valid deductions, and they remain so.
Adding new axioms increases the amount of knowledge contained in the knowledge
base. Therefore, the set of facts and inferences in such systems can only grow
larger; they cannot be reduced; that is, they increase monotonically. The form of
reasoning performed above by Timothy, on the other hand, is nonmonotonic. New
facts became known which contradicted and invalidated old knowledge. The old
knowledge was retracted, causing other dependent knowledge to become invalid,
thereby requiring further retractions. The retractions led to a shrinkage or
nonmonotonic growth in the knowledge at times.
82 Dealing with Inconsistencies and Uncertainties Chap. 5.
More formally, let KBL be a formal first order system consisting of a knowledge
base and some logic L. Then, if KB1 and KB2 are knowledge bases where
KB1 = KBL
KB2 = KBL ∪ F, for some wff F, then
KB1 ⊆ KB2
In other words, a first order KB system can only grow monotonically with
added knowledge.
When building knowledge-based systems, it is not reasonable to expect that
all the knowledge needed for a set of tasks could be acquired, validated, and loaded
into the system at the outset. More typically, the initial knowledge will be incomplete,
contain redundancies, inconsistencies, and other sources of uncertainty. Even if it
were possible to assemble complete, valid knowledge initially, it probably would
not remain valid forever, not in a continually changing environment.
In an attempt to model real-world, commonsense reasoning, researchers have
proposed extensions and alternatives to traditional logics such as PL and FOPL.
The extensions accommodate different forms of uncertainty and nonmonotony. In
some cases, the proposed methods have been implemented. In other cases they are
still topics of research. In this and the following chapter, we will examine some of
the more important of these methods.
We begin in the next section with a description of truth maintenance systems
(TMS), systems which have been implemented to permit a form of nonmonotonic
reasoning by permitting the addition of changing (even contradictory) statements to
a knowledge base. This is followed in Section 5.3 by a description of other methods
which accommodate nonmonotonic reasoning through default assumptions for incomplete
knowledge bases. The assumptions are plausible most of the time, but may
have to be retracted if other conflicting facts are learned. Methods to constrain the
knowledge that must be considered for a given problem are considered next. These
methods also relate to nonmonotonic reasoning. Section 5.4 gives a brief treatment
of modal and temporal logics which extend the expressive power of classical logics
to permit representations and reasoning about necessary and possible situations,
temporal, and other related situations. Section 5.5 concludes the chapter with a
presentation of a relatively new method for dealing with vague and imprecise
information, namely fuzzy logic and language computation.
Truth maintenance systems (also known as belief revision and revision maintenance
systems) are companion components to inference systems. The main job of the
TMS is to maintain consistency of the knowledge being used by the problem solver
and not to perform any inference functions. As such, it frees the problem solver
from any concerns of consistency and allows it to concentrate on the problem solution
Sec. 5.2 Truth Maintenance Systems 83
aspects. The TMS also gives the inference component the latitude to perform nonmono-
tonic inferences. When new discoveries are made, this more recent information
can displace previous conclusions that are no longer valid. In this way, the set of
beliefs available to the problem solver will continue to be current and consistent.
Figure 5.1 illustrates the role played by the TMS as part of the problem
solver. The inference engine (IE) solves domain problems based on its current belief
set, while the TMS maintains the currently active belief set. The updating process
is incremental. After each inference, information is exchanged between the two
components. The IE tells the TMS what deductions it has made. The TMS, in
turn, asks questions about current beliefs and reasons for failures. It maintains a
consistent set of beliefs for the IE to work with even if new knowledge is added
and removed.
For example, suppose the knowledge base (KB) contained only the propositions
P, P → Q, and modus ponens. From this, the IE would rightfully conclude Q and
add this conclusion to the KB. Later, if it was learned that ¬P was appropriate, it
would be added to the KB resulting in a contradiction. Consequently, it would be
necessary to remove P to eliminate the inconsistency. But, with P now removed,
Q is no longer a justified belief. It too should be removed. This type of belief
revision is the job of the TMS.
Actually, the TMS does not discard conclusions like Q as suggested. That
could be wasteful, since P may again become valid, which would require that Q
and facts justified by Q be rederived. Instead, the TMS maintains dependency records
for all such conclusions. These records determine which set of beliefs is current
(which are to be used by the IE). Thus, Q would be removed from the current
belief set by making appropriate updates to the records and not by erasing Q. Since
Q would not be lost, its rederivation would not be necessary if P became valid
once again.
The TMS maintains complete records of reasons or justifications for beliefs.
Each proposition or statement having at least one valid justification is made a part
of the current belief set. Statements lacking acceptable justifications are excluded
from this set. When a contradiction is discovered, the statements responsible for
the contradiction are identified and an appropriate one is retracted. This in turn
may result in other retractions and additions. The procedure used to perform this
process is called dependency-directed backtracking. This process is described later.
[Figure 5.1: the problem solver, comprising the inference engine and the TMS]
The TMS maintains records to reflect retractions and additions so that the IE
will always know its current belief set. The records are maintained in the form of
a dependency network. The nodes in the network represent KB entries such as
premises, conclusions, inference rules, and the like. Attached to the nodes are justifica-
tions which represent the inference steps from which the node was derived. Nodes
in the belief set must have valid justifications. A premise is a fundamental belief
which is assumed to be always true. Premises need no justifications. They form a
base from which all other currently active nodes can be explained in terms of valid
justifications.
There are two types of justification records maintained for nodes: support
lists (SL) and conceptual dependencies (CP). SLs are the most common type. They
provide the supporting justifications for nodes. The data structure used for the SL
contains two lists of other dependent node names, an in-list and an out-list. It has
the form
In order for a node to be active and, hence, labeled as IN the belief set, its
SL must have at least one valid node in its in-list, and all nodes named in its out-
list, if any, must be marked OUT of the belief set. For example, a current belief
set that represents Cybil as a nonflying bird (an ostrich) might have the nodes and
justifications listed in Table 5.1.
Each IN-node given in Table 5.1 is part of the current belief set. Nodes n1
and n5 are premises. They have empty support lists since they do not require justifications.
Node n2, the belief that Cybil can fly, is OUT because n3, a valid node, is in
the out-list of n2.
Suppose it is discovered that Cybil is not an ostrich, thereby causing n5 to
be retracted (marking its status as OUT). Then n3, which depends on n5, must
also be retracted. This, in turn, changes the status of n2 to be a justified node.
The resultant belief set is now that the bird Cybil can fly.
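The IN/OUT labeling described above can be sketched for a small acyclic network. The node set below is reconstructed from the prose only (n1 and n5 as premises, n3 in the out-list of n2, n3 depending on n5), so it is an assumption rather than a copy of Table 5.1. The sketch also follows the usual formulation in which every in-list node must be IN; for the single-entry in-lists used here that coincides with the wording in the text.

```python
def label(nodes, premises):
    """nodes: {name: (in_list, out_list)} giving each node's SL justification.
    premises: set of node names assumed true without justification.
    Returns {name: "IN" | "OUT"}; assumes the dependency network has no cycles."""
    status = {}

    def is_in(name):
        if name not in status:
            if name in premises:
                status[name] = "IN"
            else:
                in_list, out_list = nodes[name]
                valid = (bool(in_list)
                         and all(is_in(n) for n in in_list)
                         and not any(is_in(n) for n in out_list))
                status[name] = "IN" if valid else "OUT"
        return status[name] == "IN"

    for name in nodes:
        is_in(name)
    return status

# Assumed network: n1 Cybil is a bird, n2 Cybil can fly,
# n3 Cybil cannot fly, n5 Cybil is an ostrich.
nodes = {
    "n1": ([], []),
    "n2": (["n1"], ["n3"]),
    "n3": (["n5"], []),
    "n5": ([], []),
}
before = label(nodes, premises={"n1", "n5"})
after = label(nodes, premises={"n1"})   # n5 retracted: Cybil is not an ostrich
```

Before the retraction n3 is IN and therefore n2 is OUT; once n5 is no longer a premise, n3 falls OUT and n2 becomes IN, matching the revised belief that Cybil can fly. Nothing is erased: only the labels change.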
To represent a belief network, the symbol conventions shown in Figure 5.2
are sometimes used. The meanings of the nodes shown in the figure are (1) a
premise is a true proposition requiring no justification, (2) an assumption is a current
belief that could change, (3) a datum is either a currently assumed or IE-derived
belief, and (4) justifications are the belief (node) supports, consisting of supporting
antecedent node links and a consequent node link.
[Figure 5.2: Belief network node meanings (premises, assumptions, datum, justifications)]
the source of a contradiction (the dashed line from E to A), extra search time is
avoided.
CP justifications are used less frequently than the SLs. They justify a node
as a type of valid hypothetical argument. The internal form of a CP justification is
as follows:
where the justifications for n is n2 (with OUT status) and for n3 is nodes nS,
n22 which relate to the availability of a captain, copilot, and other crew
members qualified as class A. Now suppose it is learned that a full class A crew is
not available. To complete a schedule for the flight, the system must choose an
Default Reasoning and the Closed World Assumption 87
Sec. 5.3
alternative aircraft, say an L400. To do this the IE changes the status of n2 to IN.
But this results in a contradiction and, hence, creation of the node
The contradiction now initiates the DDB procedure in the TMS to locate the offending
assumptions. Since there is only one such node, n1, its retraction is straightforward.
For this, the TMS creates a "nogood" node with a CP justification as
This in turn causes n1 to become OUT (as an assumption, n1 has a nonempty out-list).
Also, since n4 was justified by n1, it too must become OUT. This gives the
following set of nodes
Note that a CP justification was needed for the "nogood" node to prevent a
circular retraction of n2 from occurring. Had an SL been used for n5 with an n4
node in-list justification, n5 would have become OUT after n4, again causing n2
to become OUT.
The procedures for manipulating CPs are quite complicated, and, since they
are usually converted into SLs anyway, we only mention their main functions here.
For a more detailed account see Doyle (1979).
We have briefly described the JTMS here since it is the simplest and most
widely used truth maintenance system. This type of TMS is also known as a nonmonotonic
TMS (NMTMS). Several other types have been developed to correct some of
the deficiencies of the JTMS and to meet other requirements. They include the
logic-based TMS (the LTMS), and the assumption-based TMS (the ATMS), as
well as others (de Kleer, 1986a and 1986b).
Default Reasoning
• Transitivity can also be a problem in KBs with many default rules. Rule
interactions can make representations very complex. Therefore caution is needed
in implementing such systems.
P(b)
P(a) → Q(a)
and modus ponens is not complete, since neither Q(b) nor ¬Q(b) is inferable from
the KB. This KB can be completed, however, by adding either Q(b) or ¬Q(b).
In general, a KB augmented with CWA is not consistent. This is easily seen
by considering the KB consisting of the clause
P(a) ∨ Q(b)
Now, since none of the ground literals in this clause are derivable, the augmented
KB becomes
P(a) ∨ Q(b)
¬P(a), ¬Q(b)
which is inconsistent.
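The inconsistency can be checked mechanically for a small ground KB. The sketch below is a brute-force illustration under stated assumptions: a literal is a pair `(atom, sign)`, a clause is a list of literals, and entailment is tested by enumerating all truth assignments, which is feasible only for a handful of atoms.

```python
from itertools import product

def satisfiable(clauses, atoms):
    """Brute force: is there a truth assignment satisfying every clause?"""
    for values in product([True, False], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(any(world[a] == sign for a, sign in clause) for clause in clauses):
            return True
    return False

def entails(clauses, atoms, atom):
    """The KB entails atom iff the KB plus the negated atom is unsatisfiable."""
    return not satisfiable(clauses + [[(atom, False)]], atoms)

def cwa(clauses, atoms):
    """Closed world assumption: add the negation of every atom not entailed."""
    return clauses + [[(a, False)] for a in atoms if not entails(clauses, atoms, a)]

atoms = ["P(a)", "Q(b)"]
kb = [[("P(a)", True), ("Q(b)", True)]]     # the single clause P(a) v Q(b)
augmented = cwa(kb, atoms)
consistent_before = satisfiable(kb, atoms)
consistent_after = satisfiable(augmented, atoms)
```

Neither P(a) nor Q(b) is entailed by the disjunction alone, so both negations are added, and the augmented set of three clauses has no satisfying assignment.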
It can be shown that consistency can be maintained for a special type of
Completion Formulas
Completion formulas are axioms which are added to a KB to restrict the applicability
of specific predicates. If it is known that only certain objects should satisfy given
predicates, formulas which make this knowledge explicit (complete) are added to
the KB. This technique also requires the addition of the unique-names assumption
(UNA); that is, formulas which state that distinguished named entities in the KB
are unique (different).
As an example of predicate completion, suppose we have the following KB:
OWNS(joe,ford)
STUDENT(joe)
OWNS(jill,chevy)
STUDENT(jill)
OWNS(sam,bike)
PROGRAMMER(sam)
STUDENT(mary)
If it is known that Joe is the only person who owns a Ford, this fact can be made
explicit with the following completion formula:
∀x OWNS(x,ford) → EQUAL(x,joe) (5.2)
In addition, we add the inequality formula
¬EQUAL(a,joe) (5.3)
which has the meaning that this is true for all constants a which are different from
joe.
Sec. 5.4 Predicate Completion and Circumscription 91
Likewise, if it is known that Mary also has a Ford, and only Mary and Joe
have Fords, the completion and corresponding inequality formulas in this case would
be
∀x OWNS(x,ford) → EQUAL(x,joe) ∨ EQUAL(x,mary)
¬EQUAL(a,joe)
¬EQUAL(a,mary)
Once completion formulas have been added to a KB, ordinary first order
proof methods can be used to prove statements such as ¬OWNS(jill,ford). For example,
to obtain a refutation proof using resolution, we put the completion and inequality
formulas 5.2 and 5.3 in clausal form respectively, negate the query ¬OWNS(jill,ford)
and add it to the KB. Thus, resolving with the following clauses
1. ¬OWNS(x,ford) ∨ EQUAL(x,joe)
2. ¬EQUAL(a,joe)
3. OWNS(jill,ford)
from 1 and 3 we obtain
4. EQUAL(jill,joe)
and from 2 and 4 (recall that a is any constant not equal to joe) we obtain the empty
clause □, proving the query.
The need for the inequality formulas should be clearer now. They are needed
to complete the proof by restricting the objects which satisfy the completed predicates.
Predicate completion performs the same function as CWA but with respect to
the completed predicates only. However, with predicate completion, it is possible
to default both negative as well as positive statements.
Circumscription
CSSTUDENT(a)
CSSTUDENT(b)
Modal logics were invented to extend the expressive power of traditional logics.
The original intent was to add the ability to express the necessity and possibility of
propositions P in PL and FOPL. Later, other modal logics were introduced to help
capture and represent additional subjective mood concepts (supposition, desire) in
addition to the standard indicative mood (factual) concept representations given by
PL and FOPL.
With modal logics we can also represent possible worlds in addition to the
actual world in which we live. Thus, unlike traditional logics, where an interpretation
of a wff results in an assignment of true or false (in one world only), an interpretation
in modal logics would be given for each possible world. Consequently, the wff
may be true in some worlds and false in others.
Modal logics are derived as extensions of PL and FOPL by adding modal
operators and axioms to express the meanings and relations for concepts such as
consistency, possibility, necessity, obligation, belief, known truths, and temporal
situations, like past, present, and future. The operators take predicates as arguments
and are denoted by symbols or letters such as L (it is necessary that), M (it is
possible that), and so on. For example, M COLDER-THAN(denver,portland) would
be used to represent the statement "it is possible that Denver is colder than Portland."
Sec. 5.5 Modal and Temporal Logics
Modal logics are classified by the type of modality they express. For example,
alethic logics are concerned with necessity and possibility, deontic logics with what
is obligatory or permissible, epistemic logics with belief and (known) knowledge,
and temporal logics with tense modifiers like sometimes, always, what has been,
what will be, or what is. In what follows, we shall be primarily concerned with
alethic and temporal logics.
It is convenient to refer to an agent as the conceptualization of our knowledge-
based system (a robot or some other KB system). We may adopt different views
regarding the world or environment in which the agent functions. For example,
one view may regard an agent as having a set of basic beliefs which consists of all
statements that are derivable from the knowledge base. This was essentially the view
taken in PL and FOPL. Note, however, that the statements we now call beliefs are
not the same as known factual knowledge of the previous chapter. They may, in
fact, be erroneous.
In another view, we may treat the agent as having a belief set that is determined
by possible worlds which are accessible to the agent. In what follows, we will
adopt this latter view. But before describing the modal language of our agent, we
should explain further the notion of possible worlds.
Possible Worlds
If R is symmetric, then add P → LMP. This states that if P is true in w_i,
then MP is true in all w_j accessible from w_i.
If R is an equivalence relation, add LP → P and MP → LMP.
D. Assertions:
1. (sam is a man)
2. M(sam is a child)
3. L[(sam is a child) → L(sam is a child)]
4. L[(sam is a man) → (sam is a child)]
A simple proof that (sam is a child) would proceed as follows. From C3 and D4
infer that
(sam is a man) → (sam is a child)
From D1 and E1 using modus ponens conclude
(sam is a child)
Temporal Logics
Temporal logics use modal operators in relation to concepts of time, such as past,
present, future, sometimes, always, precedes, succeeds, and so on. An example of
two operators which correspond to necessity and possibility are always (A) and
96 Dealing with Inconsistencies and Uncertainties Chap. 5
sometimes (S). A propositional temporal logic system using these operators would
include the propositional logic assumptions of Chapter 4, the operators A and S,
and appropriate axioms. Typical formulas using the A and S operators with the
predicate Q would be written as
It is also possible to define other modalities from the ones given above. For
example, to express the concepts it has always been and it will always be the
operators H and G may be defined respectively as
The four operators P, F, H, and G were used by the logician Lemmon (1965)
to define a propositional tense logic he called KT. The language used consisted of
propositional logic assumptions, the four tense operators P, F, H, and G, and the
following axioms:
To complete the formal system, inference rules of propositional logic such as modus
ponens and the two rules
  Q        Q
 ----     ----
  HQ       GQ
Sec. 5.6 Fuzzy Logic and Natural Language Computations 97
were added. These rules state that if Q is true, infer that Q has always been the
case and Q will always be the case, respectively.
The semantics of a temporal logic is closely related to the frame problem.
The frame problem is the problem of managing changes which occur from one
situation to another or from frame to frame as in a moving picture sequence.
For example, a KB which describes a robot's world must know which facts
change as a result of some action taken by the robot. If the robot throws out the
cat, turns off the light, leaves the room, and locks the door, some, but not all,
facts which describe the room in the new situation must be changed.
Various schemes have been proposed for the management of temporally changing
worlds including other modal operators and time and date tokens. This problem
has taken on increased importance and a number of solutions have been offered.
Still, much work remains to find a comprehensive solution.
We have already noted several limitations of traditional logics in dealing with uncertain
and incomplete knowledge. We have now considered methods which extend the
expressive power of the traditional logics and permit different forms of nonmonotonic
reasoning. All of the extensions considered thus far have been based on the truth
functional features of traditional logics. They admit interpretations which are either
true or false only. The use of two valued logics is considered by some practitioners
as too limiting. They fail to effectively represent vague or fuzzy concepts.
For example, you no doubt would be willing to agree that the predicate "TALL"
is true for Pole, the seven foot basketball player, and false for Smidge the midget.
But what value would you assign for Tom, who is 5 foot 10 inches? What about
Bill who is 6 foot 2, or Joe who is 5 foot 5? If you agree 7 foot is tall, then is 6
foot 11 inches also tall? What about 6 foot 10 inches? If we continued this process
of incrementally decreasing the height through a sequence of applications of modus
ponens, we would eventually conclude that a three foot person is tall. Intuitively,
we expect the inferences should have failed at some point, but at what point? In
FOPL there is no direct way to represent this type of concept. Furthermore, it is
not easy to represent vague statements such as "slightly more beautiful," "not
quite as young as . . . ," "not very expensive, but of questionable reliability,"
"a little bit to the left," and so forth.
The single quotation mark denotes the complement fuzzy set, A'. Note that the
intersection of two fuzzy sets A and B is the largest fuzzy subset that is a subset of
both. Likewise, the union of two fuzzy sets A and B is the smallest fuzzy subset
having both A and B as subsets.
With the above definitions, it is possible to derive a number of properties
which hold for fuzzy sets much the same as they do for standard sets. For example,
we have
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)     distributivity
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
(A ∪ B) ∪ C = A ∪ (B ∪ C)           associativity
(A ∩ B) ∩ C = A ∩ (B ∩ C)
A ∩ B = B ∩ A, A ∪ B = B ∪ A         commutativity
A ∩ A = A, A ∪ A = A                 idempotency
u_A∪A'(x) = max[a, 1 - a] ≠ 1
u_A∩A'(x) = min[a, 1 - a] ≠ 0
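The properties listed above can be checked numerically on small examples. The sketch below assumes the usual max/min/one-minus definitions of fuzzy union, intersection, and complement, and represents a fuzzy set as a Python dictionary from elements to membership grades; the sets A, B, and C and the function names are hypothetical illustrations.

```python
def f_union(a, b):
    """Fuzzy union: pointwise maximum of the membership values."""
    return {x: max(a[x], b[x]) for x in a}

def f_intersection(a, b):
    """Fuzzy intersection: pointwise minimum of the membership values."""
    return {x: min(a[x], b[x]) for x in a}

def f_complement(a):
    """Fuzzy complement A': membership 1 - u_A(x)."""
    return {x: 1 - a[x] for x in a}

# Three hypothetical fuzzy sets over the same small universe {1, 2, 3}.
A = {1: 0.25, 2: 0.5, 3: 1.0}
B = {1: 0.75, 2: 0.5, 3: 0.0}
C = {1: 0.5, 2: 0.0, 3: 0.25}

# Distributivity: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
lhs = f_intersection(A, f_union(B, C))
rhs = f_union(f_intersection(A, B), f_intersection(A, C))
print(lhs == rhs)  # True

# Unlike ordinary sets, A ∪ A' need not be the whole universe
# and A ∩ A' need not be empty.
print(f_union(A, f_complement(A)))         # {1: 0.75, 2: 0.5, 3: 1.0}
print(f_intersection(A, f_complement(A)))  # {1: 0.25, 2: 0.5, 3: 0.0}
```

A check on one example is of course not a proof; it only illustrates how the identities behave when membership grades lie strictly between 0 and 1.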
The universe from which a fuzzy set is constructed may also be uncountable.
For example, we can define values of u for the fuzzy set A = { young} as
u_A(x) = 1.0                           for 0 ≤ x ≤ 20
u_A(x) = [1 + ((x - 20)/10)^2]^(-1)    for x > 20
[Figure: fuzzy set A with its CON(A), DIL(A), and NORM transformations]
developed, and a fuzzy VLSI chip has been produced by Bell Telephone Laboratories,
Inc., of New Jersey (Togai and Watanabe, 1986).
The characteristic function for fuzzy sets provides a direct linkage to fuzzy logic.
The degree of membership of x in A corresponds to the truth value of the statement
x is a member of A where A defines some propositional or predicate class. When
u_A(x) = 1, the proposition A is completely true, and when u_A(x) = 0 it is completely
false. Values between 0 and 1 assume corresponding values of truth or falsehood.
In Chapter 4 we found that truth tables were useful in determining the truth
value of a statement or wff. In general this is not possible for fuzzy logic since
there may be an infinite number of truth values. One could tabulate a limited number
of truth values, say those corresponding to the terms false, not very false, not
true, true, very true, and so on. More importantly, it would be useful to have an
inference rule equivalent to a fuzzy modus ponens.
Generalized modus ponens for fuzzy sets have been proposed by a number
of researchers. They differ from the standard modus ponens in that statements which
are characterized by fuzzy sets are permitted and the conclusion need not be identical
to the implicand in the implication. For example, let A, A1, B, and B1 be statements
characterized by fuzzy sets. Then one form of the generalized modus ponens reads
Premise: x is A1
Implication: If x is A then y is B
Conclusion: y is B1
An example of this form of modus ponens is given as
Premise: This banana is very yellow
Implication: If a banana is yellow then the banana is ripe
Conclusion: This banana is very ripe
u_R(a,b) = 0                        for a ≤ b
u_R(a,b) = 1/(1 + (a - b)^(-2))     for a > b
Now let X and Y be two universes and let A and B be fuzzy sets in X and X
× Y respectively. Define fuzzy relations R_A(x), R_B(x,y), and R_C(y) in X, X × Y
and Y, respectively. Then the compositional rule of inference is the solution of the
relational equation
R_C(y) = R_A(x) o R_B(x,y) = max_x min{u_A(x), u_B(x,y)}
where the symbol o signifies the composition of A and B. As an example, let
X = Y = {1,2,3,4}
A = {little} = {(1/1),(2/.6),(3/.2),(4/0)}
R = approximately equal, a fuzzy relation defined by
           y
       1    2    3    4
   1   1   .5    0    0
x  2  .5    1   .5    0
   3   0   .5    1   .5
   4   0    0   .5    1
Then applying the max-min composition rule
R_C(y) = max_x min{u_A(x), u_R(x,y)}
= max_x {min[(1,1),(.6,.5),(.2,0),(0,0)],
min[(1,.5),(.6,1),(.2,.5),(0,0)],
min[(1,0),(.6,.5),(.2,1),(0,.5)],
min[(1,0),(.6,0),(.2,.5),(0,1)]}
= max_x {[1,.5,0,0],[.5,.6,.2,0],[0,.5,.2,0],[0,0,.2,0]}
= {[1],[.6],[.5],[.2]}
Therefore the solution is
R_C(y) = {(1/1),(2/.6),(3/.5),(4/.2)}.
Stated in terms of a fuzzy modus ponens, we might interpret this as the inference
Premise: x is little
Implication: x and y are approximately equal
Conclusion: y is more or less little
The above notions can be generalized to any number of universes by taking the
Cartesian product and defining relations on various subsets.
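The max-min composition in the example above is easy to reproduce mechanically. The following is a minimal sketch, assuming the fuzzy set "little" and the relation "approximately equal" take the values given in the example; the function name max_min_composition and the dictionary representation are illustrative choices, not part of the original formulation.

```python
def max_min_composition(a, r):
    """Return the fuzzy set C on Y with u_C(y) = max over x of min(u_A(x), u_R(x, y))."""
    ys = sorted({y for (_, y) in r})
    return {y: max(min(a[x], r[(x, y)]) for x in a) for y in ys}

# Fuzzy set A = little on X = {1, 2, 3, 4}.
LITTLE = {1: 1.0, 2: 0.6, 3: 0.2, 4: 0.0}

# Fuzzy relation R = approximately equal on X x Y, one row per x.
ROWS = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
APPROX_EQUAL = {(x, y): ROWS[x - 1][y - 1]
                for x in range(1, 5) for y in range(1, 5)}

print(max_min_composition(LITTLE, APPROX_EQUAL))
# {1: 1.0, 2: 0.6, 3: 0.5, 4: 0.2}
```

The result agrees with the solution {(1/1),(2/.6),(3/.5),(4/.2)}, the fuzzy set read as "more or less little."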
formally define a linguistic variable and show how they are related to fuzzy logic.
Informally, a linguistic variable is a variable that assumes a value consisting
of words or sentences rather than numbers. For example, the variable AGE might
assume values of very young, young, not young, or not old. These values, which
are themselves linguistic variables, may in turn each be given meaning through a
base universe of elapsed time (years). This is accomplished through appropriately
defined characteristic functions.
As an example, let the linguistic variable AGE have possible values {very
young, young, not young, middle-aged, not old, old, very old}. To each of these
variables, we associate a fuzzy set consisting of ordered tuples {(x/u_A(x))} where x
∈ U = [0,110], the universe (years). The variable AGE, its values, and their corresponding
fuzzy sets are illustrated in Figure 5.8.
A formal, more elegant definition of a linguistic variable is one which is
based on language theory concepts. For this, we define a linguistic variable as the
quintuple
(x,T(x),U,G,M)
where
The grammar G is further defined as the tuple (V_N, V_T, P, S) where V_N is the set of
nonterminal symbols, V_T the set of terminal symbols from the alphabet of G, P is
a set of rewrite (production) rules, and S is the start symbol. The language L(G)
generated by the grammar G is the set of all strings w derived from S consisting
of symbols in V_T. Thus, for example, using V_N = {A,B,C,D,E,S} and the following
rules in P, we generate the terminal string "not young and not very old."
P: S → A         A → B          B → C        C → D        D → young
   S → S or A    A → A and B    B → not C    C → E        E → old
                                             C → very C
[Figure 5.8: the linguistic variable AGE over a base universe of 0 to 110 years]
This string is generated through application of the following rules:
S → A → A and B → A and not C → A and not very C → A and not very E → A
and not very old → B and not very old → not C and not very old → not D and
not very old → not young and not very old
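To see which strings such a grammar generates, one can enumerate derivations mechanically. The sketch below is an illustration under stated assumptions: the productions A → B and C → E are taken from the steps of the derivation above, the names PRODUCTIONS and language are hypothetical, and the search expands only the leftmost nonterminal (which is sufficient for a context-free grammar).

```python
from collections import deque

# Productions of G. A -> B and C -> E are inferred from the derivation steps.
PRODUCTIONS = {
    "S": [["A"], ["S", "or", "A"]],
    "A": [["B"], ["A", "and", "B"]],
    "B": [["C"], ["not", "C"]],
    "C": [["D"], ["E"], ["very", "C"]],
    "D": [["young"]],
    "E": [["old"]],
}

def language(max_len):
    """Return all terminal strings of at most max_len words derivable from S."""
    seen = {("S",)}
    queue = deque(seen)
    strings = set()
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, if any remains.
        idx = next((i for i, sym in enumerate(form) if sym in PRODUCTIONS), None)
        if idx is None:
            strings.add(" ".join(form))
            continue
        for rhs in PRODUCTIONS[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            # No production shortens a form, so over-long forms can be pruned.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return strings

print("not young and not very old" in language(6))  # True
print(sorted(language(1)))                          # ['old', 'young']
```

Because every production keeps or increases the length of a sentential form, bounding the length makes the search finite even though the grammar is recursive.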
The semantic rule M gives meaning to the values of AGE. For example, we might
have M(old) = {(x/u_old(x)) | x ∈ [0,110]} where u_old(x) is defined as
u_old(x) = 0                               for 0 ≤ x ≤ 50
u_old(x) = (1 + ((x - 50)/5)^(-2))^(-1)    for x > 50
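A characteristic function of this kind translates directly into code. The following sketch assumes the form u_old(x) = (1 + ((x - 50)/5)^(-2))^(-1) for x > 50; the scale constant 5 in particular is an assumption made here, and the function name u_old is only illustrative.

```python
def u_old(x):
    """Membership grade of age x (in years) in the fuzzy set 'old'.

    Assumed form: 0 up to age 50, then (1 + ((x - 50) / 5) ** -2) ** -1.
    The divisor 5 is an assumption of this sketch.
    """
    if x <= 50:
        return 0.0
    return 1.0 / (1.0 + ((x - 50) / 5.0) ** -2)

for age in (40, 55, 60, 80):
    print(age, round(u_old(age), 3))
```

Under these assumptions the grade is 0.5 at age 55 and rises toward 1 as age increases, which matches the intuitive reading of "old."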
In this section we have been able to give only a brief overview of some of
the concepts and issues related to the expanding fields based on fuzzy set theory.
The interested reader will find numerous papers and texts available in the literature.
We offer a few representative references here which are recent and have extensive
bibliographies: (Zadeh, 1977, 1981, 1983), (Kandel, 1982), (Gupta et al., 1985),
and (Zimmerman, 1985).
5.7 SUMMARY
is true. Default reasoning is based on the use of typical assumptions about an object
or class of objects which are plausible. The assumptions are regarded as valid unless
new contrary information is learned.
Predicate completion and circumscription are methods which restrict the values
a predicate or group of predicates may assume. They allow the predicates to take
on only those values the KB says they must assume. Both methods are based on
the use of completion formulas. Like CWA, they are also a form of nonmonotonic
reasoning.
Modal logics extend the expressiveness of classical logics by permitting the
notions of possibility, necessity, obligation, belief, and the like. A number of different
modal logics have been formalized, and inference rules comparable to propositional
and predicate logic are available to permit different forms of nonmonotonic reasoning.
Like modal logics, fuzzy logic was introduced to generalize and extend the
expressiveness of traditional logics. Fuzzy logic is based on fuzzy set theory which
permits partial set membership. This, together with the ability to use linguistic
variables, makes it possible to represent such notions as Sue is not very tall, but
she is quite pretty.
EXERCISES
5.1 Give an example of nonmonotonic reasoning you have experienced at some time.
5.2 Draw the TMS belief network for the following knowledge base of facts. The question
marks under output status mean that the status must be determined for these datum
nodes.
INPUTS
Premises      Status    Assumptions   Status
Q → S         IN        Q             OUT
Q,R → U       IN        P             IN
P,R → T       IN
P             IN
OUTPUTS
Datum   Status   Conditions
S       ?        If Q; Q → S then S
U       ?        If Q; Q,R → U; R then U
T       ?        If R; P,R → T; P then T
5.3 Draw a TMS belief network for the aircraft example described in Section 5.2 and
show how the network changes with the selection of an alternate choice of aircraft.
5.4 Write schemata for default reasoning using the following statements:
(a) If someone is an adult and it is consistent to assume that adults can vote, infer
that that person can vote.
(b) If one is at least 18 years old and it is consistent to assume that one who is
physically fit and who passes a test may obtain a pilot's license, infer that such a
person can obtain a pilot's license.
5.5 Show that a Horn clause data base that is consistent is consistent under the CWA
assumption. Give an example of a simple Horn clause CWA data base to illustrate
consistency.
5.6 For the following database facts, write a completion formula that states that Bill is
the only person that lives in Dallas.
LIVESIN(bill,dallas)
LIVESIN(joe,denver)
LIVESIN(sue,phoenix)
OWNS(bill,computer)
STUDENT(sue)
5.7 Determine whether the following modal statements have accessibility relations that
are reflexive, transitive, or symmetric:
(a) Bill Brown is alive in the current world.
(b) In the current world and in all worlds in the future of the current world, if Jim
Jones is dead in that world, then he will be dead in all worlds in the future of that
world.
(c) In the current world or in some world in the future of the current world, John
Jones is dead.
5.8 Write modal propositional statements for the following using the operators L and M
as described in Section 5.4.
(a) It is necessarily true that the moon is made of green cheese or it is not made of
green cheese.
(b) It is possible that if Kennedy were born in Spain Kennedy would speak Spanish.
(c) It is necessarily true that if n is divisible by 4 then n is divisible by 2.
5.9 Show that the dilation of the fuzzy set A = CON(B) is the fuzzy set B.
5.10 Give three examples of inferencing with English statements using fuzzy modus ponens
(see the example in Section 5.6 under Reasoning with Fuzzy Logic).
5.11 Draw a pictorial definition for the linguistic variable TALL (similar to the variable
AGE of Figure 5.8) giving your own subjective values for TALL variables and their
values.
5.12 Define a reasonable, real valued fuzzy function for the linguistic variable SHORT
(see the function for u_old(x)).
The previous chapter considered methods of representation which extend the expressiveness
of classical logic and permit certain types of nonmonotonic reasoning.
Representations for vague and imprecise concepts were also introduced with fuzzy
set theory and logic. There are other types of uncertainty induced by random phenomena
which we have not yet considered. To round out the approaches which are
available for commonsense reasoning in AI, we continue in this chapter with theory
and methods used to represent probabilistic uncertainties.
6.1 INTRODUCTION
We saw in the previous chapter that a TMS deals with uncertainty by permitting
new knowledge to replace old knowledge which is believed to be outdated or errone-
ous. This is not the same as inferring directly with knowledge that can be given a
probability rating based on the amount of uncertainty present. In this chapter, we
want to examine methods which use probabilistic representations for all knowledge
and which reason by propagating the uncertainties from evidence and assertions to
conclusions. As before, the uncertainties can arise from an inability to predict outcomes
due to unreliable, vague, incomplete, or inconsistent knowledge.
The probability of an uncertain event A is a measure of the degree of likelihood
Probabilistic Reasoning Chap. 6
of occurrence of that event. The set of all possible events is called the sample
space, S. A probability measure is a function P(·) which maps event outcomes
E1, E2, . . . , from S into real numbers and which satisfies the following axioms of
probability:
From these three axioms and the rules of set theory, the basic laws of probability
can be derived. Of course, the axioms are not sufficient to compute the probability
of an outcome. That requires an understanding of the underlying distributions which
must be established through one of the following approaches:
In all of the above cases, the level of confidence placed in the hypothesized
conclusions is dependent on the availability of reliable knowledge and the experience
of the human prognosticator. Our objective in this chapter is to describe some
approaches taken in AI systems to deal with reasoning under similar types of uncertain
conditions.
The form of probabilistic reasoning described in this section is based on the Bayesian
method introduced by the clergyman Thomas Bayes in the eighteenth century. This
form of reasoning depends on the use of conditional probabilities of specified events
when it is known that other events have occurred. For two events H and E with
the probability P(E) > 0, the conditional probability of event H, given that event
E has occurred, is defined as
P(H|E) = P(H & E) / P(E) (6.1)
This expression can be given a frequency interpretation by considering a random
experiment which is repeated a large number of times, n. The number of occurrences
of the event E, say No.(E), and of the joint event H and E, No.(H & E), are
recorded and their relative frequencies rf computed as
rf(E) = No.(E)/n        rf(H & E) = No.(H & E)/n (6.2)
When n is large, the two expressions (6.2) approach the corresponding probabilities
respectively, and the ratio
rf(H & E) / rf(E) ≈ P(H & E) / P(E)
then represents the proportion of times event H occurs relative to the occurrence of
E, that is, the approximate conditional occurrence of H with E.
The conditional probability of event E given that event H occurred can likewise
be written as
P(E|H) = P(H & E) / P(H) (6.3)
Solving 6.3 for P(H & E) and substituting this in equation 6.1 we obtain one form
of Bayes' Rule
P(H|E) = P(E|H)P(H) / P(E) (6.4)
Suppose now it is known from previous experience that the prior (unconditional)
probabilities P(D1) and P(E) for randomly chosen patients are P(D1) = 0.05 and
P(E) = 0.15, respectively. Also, we assume that the conditional probability of the
observed symptom given that a patient has disease D1 is known from experience
to be P(E|D1) = 0.95. Then, we easily determine the value of P(D1|E) as
P(D1|E) = P(E|D1)P(D1)/P(E) = (0.95 × 0.05)/0.15
= 0.32
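The calculation above can be reproduced with a one-line function. This is a minimal sketch of Bayes' Rule in the form P(H|E) = P(E|H)P(H)/P(E), using the three probabilities given in the example; the function name posterior is a hypothetical choice.

```python
def posterior(p_e_given_h, p_h, p_e):
    """Bayes' Rule: P(H|E) = P(E|H) * P(H) / P(E), defined only for P(E) > 0."""
    if p_e <= 0:
        raise ValueError("P(E) must be positive")
    return p_e_given_h * p_h / p_e

# Values from the disease example: P(E|D1) = 0.95, P(D1) = 0.05, P(E) = 0.15.
p_d1_given_e = posterior(0.95, 0.05, 0.15)
print(round(p_d1_given_e, 2))  # 0.32
```

Note how strongly the small prior P(D1) limits the result: even with P(E|D1) = 0.95, the posterior is only about one in three.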
It may be the case that the probability P(E) is difficult to obtain. If that is
the case, a different form of Bayes' Rule may be used. To see this, we write
equation 6.4 with ¬H substituted in place of H to obtain
P(¬H|E) = P(E|¬H)P(¬H) / P(E)
Next, we divide equation 6.4 by this result to eliminate P(E) and get
P(H|E) / P(¬H|E) = P(E|H)P(H) / (P(E|¬H)P(¬H)) (6.5)
Note that equation 6.5 has two terms that are ratios of a probability of an
event to the probability of its negation, P(H|E) / P(¬H|E) and P(H) / P(¬H). The
ratio of the probability of an event E divided by the probability of its negation is
called the odds of the event and is denoted as O(E). The remaining ratio P(E|H) /
P(E|¬H) in equation 6.5 is known as the likelihood ratio of E with respect to H.
We denote this quantity by L(E|H). Using these two new terms, the odds-likelihood
form of Bayes' Rule for equation 6.5 may be written as
O(H|E) = L(E|H)O(H)
This form of Bayes' Rule suggests how to compute the posterior odds O(H|E)
from the prior odds on H, O(H). That value is proportional to the likelihood L(E|H).
When L(E|H) is equal to one, the knowledge that E is true has no effect on the
odds of H. Values of L(E|H) less than or greater than one decrease or increase the
odds correspondingly. When L(E|H) cannot be computed, estimates may still be
made by an expert having some knowledge of H and E. Estimating the ratio rather
than the individual probabilities appears to be easier for individuals in such cases.
This is sometimes done when developing expert systems where more reliable probabilities
are not available.
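The odds-likelihood form can be checked against the direct form of Bayes' Rule on the disease example. In the sketch below, the value of P(E|¬H) is not given in the text; it is derived here from the law of total probability, P(E) = P(E|H)P(H) + P(E|¬H)P(¬H), which is an assumption of this illustration. The function names are hypothetical.

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)

def prob(o):
    """Probability corresponding to odds o: o / (1 + o)."""
    return o / (1.0 + o)

def posterior_odds(p_e_given_h, p_e_given_not_h, p_h):
    """Odds-likelihood form: O(H|E) = L(E|H) * O(H)."""
    likelihood_ratio = p_e_given_h / p_e_given_not_h
    return likelihood_ratio * odds(p_h)

# Values from the disease example.
p_h, p_e, p_e_given_h = 0.05, 0.15, 0.95
# P(E|not H) derived from total probability (an assumption of this sketch).
p_e_given_not_h = (p_e - p_e_given_h * p_h) / (1.0 - p_h)

o = posterior_odds(p_e_given_h, p_e_given_not_h, p_h)
print(round(prob(o), 2))  # 0.32
```

Converting the posterior odds back to a probability gives the same value as the direct computation, as it must, since the two forms are algebraically equivalent.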
In the example cited above, D1 is either true or false, and P(D1|E) is the
interpretation which assigns a measure of confidence that D1 is true when it is
known that E is true. There is a similarity between E, P(D1|E) and modus ponens
discussed in Chapter 4. For example, when E is known to be true and D1 and E
are known to be related, one concludes the truth of D1 with a confidence level
P(D1|E).
One might wonder if it would not be simpler to assign probabilities to as
Sec. 6.2 Baye.ian Probabilistic Inference 111
and hence,
P(H_i|E) = P(E|H_i)P(H_i) / Σ_{j=1..k} P(E|H_j)P(H_j) (6.7)
Bayesian Networks
In Section 5.4 the notion of possible worlds was introduced as a formalism through
which an agent could view a set of propositions as being true in some worlds and
false in others. We use the possible world concepts in this section to describe a
method proposed by Nilsson (1986) which generalizes first order logic in the modelling
of uncertain beliefs. The method assigns truth values ranging from 0 to 1 to possible
[Table: consistent and inconsistent truth assignments for P, Q, and P → Q]
q = Vp (6.9)
[q1]   [1 1 0 0] [p1]
[q2] = [1 0 1 1] [p2]
[q3]   [1 0 1 0] [p3]
                 [p4]
where p1, p2, p3, and p4 are the probabilities for the corresponding W_i. Thus, the
sentence probabilities are computed as
q1 = p(S1) = p1 + p2
q2 = p(S2) = p1 + p3 + p4
q3 = p(S3) = p1 + p3
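The relation q = Vp is an ordinary matrix-vector product, which the following sketch makes explicit. It assumes the worlds are ordered so that q1 = p1 + p2, q2 = p1 + p3 + p4, and q3 = p1 + p3; the world probabilities p used here are hypothetical values chosen only for illustration.

```python
# Rows correspond to sentences S1, S2, S3; columns to possible worlds W1..W4.
# A 1 means the sentence is true in that world (assumed ordering of worlds).
V = [
    [1, 1, 0, 0],  # S1
    [1, 0, 1, 1],  # S2
    [1, 0, 1, 0],  # S3
]

def sentence_probabilities(v, p):
    """Compute q = Vp, where p is a probability distribution over worlds."""
    if abs(sum(p) - 1.0) > 1e-9 or any(x < 0 for x in p):
        raise ValueError("p must be a probability distribution over the worlds")
    return [sum(v_ij * p_j for v_ij, p_j in zip(row, p)) for row in v]

p = [0.5, 0.25, 0.125, 0.125]  # hypothetical world probabilities
print(sentence_probabilities(V, p))  # [0.75, 0.75, 0.625]
```

Each sentence probability is simply the total probability of the worlds in which that sentence is true.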
Given a KB of sentences with known probabilities (obtained from an expert
or other source), we wish to determine the probability of any new sentence S deduced
from KB. Alternatively, we may wish to recompute some sentence probabilities in
KB if new information has been gained which changes one or more of the original
sentences in KB. To compute the probability of S requires that consistent truth
values first be determined for S for all sets of possible worlds. A new augmented
matrix V can then be formed by adding a bottom row of ones and zeros to the
original V where the ones and zeros correspond to the truth assignments.
No methods have been developed for the computation of exact solutions for
the KB sentence probabilities, although methods for determining approximations
were presented for both small and large matrices V. We do not consider those
methods here. They may be found in Nilsson (1986). They are based on the use of
the probability constraints
0 ≤ p_i ≤ 1 and Σ_i p_i = 1
and the fact that consistent probability assignments are bounded by the hyperplanes
of a certain convex hull. Suggestions have also been made for the partitioning of
larger matrices into smaller ones to simplify the computations.
between ignorance and uncertainty. These are distinctly different concepts and should
be treated as such. For example, suppose we are informed that one of three terrorist
groups, A, B, or C, has planted a bomb in a certain government building. We may
have some evidence to believe that group C is the guilty one and be willing to
assign a measure of this belief equal to P(C) = 0.8. On the other hand, without
more knowledge of the other two groups, we would not want to say that the probability
is 0.1 that each one of them is guilty. Yet, traditional theory would have us distribute
an equal amount of the remaining probability to each of the other groups. In fact,
we may have no knowledge to justify either the amount of uncertainty or the
equal division of it.
Finally, with classical probability theory, we are forced to regard belief and
disbelief as functional opposites. That is, if some proposition A is assigned the
probability P(A) = 0.3, then we must assign ¬A the probability P(¬A) = 0.7 since
we must have P(A) + P(¬A) = 1. This forces us to make an assignment that may
be conflicting since it is possible to both believe and disbelieve some propositions
by the same amount, making this requirement awkward.
In an attempt to remedy the above problems, a generalized theory has been
proposed by Arthur Dempster (1968) and extended by his student Glenn Shafer
(1976). It has come to be known as the Dempster-Shafer theory of evidence. The
theory is based on the notion that separate probability masses may be assigned to
all subsets of a universe of discourse rather than just to indivisible single members
as required in traditional probability theory. As such, it permits the inequality P(A)
+ P(¬A) ≤ 1.
In the Dempster-Shafer theory, we assume a universe of discourse U and a
set corresponding to n propositions, exactly one of which is true. The propositions
are assumed to be exhaustive and mutually exclusive. Let 2^U denote all subsets of
U including the empty set and U itself (there are 2^n such subsets). Let the set
function m (sometimes called a basic probability assignment) defined on 2^U be a
mapping to [0,1],
m: 2^U → [0,1], such that for all subsets A ⊆ U
m(∅) = 0
Σ_{A ⊆ U} m(A) = 1
The function m defines a probability distribution on 2^U (not just on the singletons
of U as in classical theory). It represents the measure of belief committed exactly
to A. In other words, it is possible to assign belief to each subset A of U without
assigning any to anything smaller.
A belief function, Bel, corresponding to a specific m for the set A, is defined
as the sum of beliefs committed to every subset of A by m. That is. Bel(A) is a
measure of the total support or belief committed to the set A and sets a minimum
value for its likelihood. It is defined in terms of all belief assigned to A as well as
to all proper subsets of A. Thus,
Sec. 6.4 Dempster-Shafer Theory ui
Bel(A) = Σ_{B ⊆ A} m(B)
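The belief function can be computed directly from a basic probability assignment. The sketch below models the terrorist-group example with the hypothetical assignment m({C}) = 0.8 and m(U) = 0.2, so that the remaining mass is left uncommitted rather than divided between A and B. The plausibility function pl uses the standard definition (the total mass of all sets that intersect A), which is an assumption of this illustration; the function names are hypothetical.

```python
U = frozenset({"A", "B", "C"})

# Hypothetical basic probability assignment: 0.8 committed exactly to {C},
# and the remaining 0.2 left on U as uncommitted belief (ignorance).
m = {frozenset({"C"}): 0.8, U: 0.2}

def bel(a, m):
    """Bel(A): sum of m(B) over all focal sets B that are subsets of A."""
    return sum((mass for b, mass in m.items() if b <= a), 0.0)

def pl(a, m):
    """Pl(A): sum of m(B) over all focal sets B that intersect A."""
    return sum((mass for b, mass in m.items() if b & a), 0.0)

c = frozenset({"C"})
a = frozenset({"A"})
print(bel(c, m), pl(c, m))  # 0.8 1.0
print(bel(a, m), pl(a, m))  # 0.0 0.2
```

With this assignment, nothing is believed about group A specifically (Bel = 0), yet A remains plausible to degree 0.2, which is the distinction between ignorance and uncertainty that classical probability cannot express.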
For example, if U contains the mutually exclusive subsets A, B, C, and D then
Pl(∅) = 0, Pl(U) = 1
For all A,
Pl(A) ≥ Bel(A),
Bel(A) + Bel(¬A) ≤ 1,
Pl(A) + Pl(¬A) ≥ 1, and
for A ⊆ B,
Bel(A) ≤ Bel(B), Pl(A) ≤ Pl(B).
m1 o m2(C) = Σ_{Ai ∩ Bj = C} m1(Ai)m2(Bj) (6.9)

m1 o m2(C) = Σ_{Ai ∩ Bj = C} m1(Ai)m2(Bj) / Σ_{Ai ∩ Bj ≠ ∅} m1(Ai)m2(Bj) (6.10)
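Combining two bodies of evidence amounts to multiplying masses over all pairs of focal sets and collecting the products on the intersections. The following is a minimal sketch of the normalized combination, assuming the standard form of Dempster's rule in which products falling on the empty set are discarded and the rest are rescaled. The assignments m1 and m2 are hypothetical, and the function name combine is an illustrative choice.

```python
def combine(m1, m2):
    """Combine two basic probability assignments by Dempster's rule.

    Products m1(A) * m2(B) are accumulated on C = A ∩ B. Mass that falls on
    the empty set is conflict; the remaining masses are renormalized.
    """
    raw = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            c = a & b
            if c:
                raw[c] = raw.get(c, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("the two sources are totally conflicting")
    return {c: v / (1.0 - conflict) for c, v in raw.items()}

U = frozenset("ABC")
m1 = {frozenset("C"): 0.8, U: 0.2}  # first source favors group C
m2 = {frozenset("A"): 0.5, U: 0.5}  # hypothetical second source favors group A

combined = combine(m1, m2)
for focal_set, mass in combined.items():
    print(sorted(focal_set), round(mass, 3))
```

Here 0.4 of the product mass falls on the empty set ({C} ∩ {A}), so the surviving masses are divided by 0.6 and again sum to one.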
[Figure: lattice of subsets of the universe U = {A, B, C, D}]
The so-called ad hoc methods of dealing with uncertainty are methods which have
no formal theoretical basis (although they are usually patterned after probabilistic
concepts). These methods typically have an intuitive, if not a theoretical, appeal.
They are chosen over formal methods as a pragmatic solution to a particular problem
when the formal methods impose difficult or impossible conditions.
Different ad hoc procedures have been employed successfully in a number of
AI systems, particularly in expert systems. We illustrate the basic ideas with the
belief measures used in the MYCIN system, one of the earliest expert systems
developed to diagnose meningitis and infectious blood diseases (Buchanan and Shortliffe,
1984).
MYCIN's knowledge base is composed of if . . . then rules which are used
to assess various forms of patient evidence with the ultimate goal being the formulation
of a correct diagnosis and recommendation for a suitable therapy. A typical rule
has the form
This is a rule that would be used by the inference mechanism to help identify
the offending organism. The three conditions given in the IF part of the rule refer
to attributes that help to characterize and identify organisms (the stain, morphology,
and growth conformation). When such an identification is relatively certain, an
appropriate therapy may then be recommended.
The numeric value (0.7) given in the THEN part of the above rule corresponds
to an expert's estimate of degree of belief one can place in the rule conclusion
when the three conditions in the IF part have been satisfied. Thus, the belief associated
with the rule may be thought of as a (subjective) conditional probability P(H|E1,E2,E3)
= 0.7, where H is the hypothesis that the organism is streptococcus, and E1, E2,
and E3 correspond to the three pieces of joint evidence given in the IF
part, respectively.
MYCIN uses measures of both belief and disbelief to represent degrees of
confirmation and disconfirmation respectively in a given hypothesis. The basic measure
of belief, denoted by MB(H,E), is actually a measure of the increased belief
in hypothesis H due to the evidence E. This is roughly equivalent to the estimated
increase in probability of P(H|E) over P(H) given by an expert as a result of the
knowledge gained by E. A value of 0 corresponds to no increase in belief and 1
corresponds to maximum increase or absolute belief. Likewise, MD(H,E) is a measure
of the increased disbelief in hypothesis H due to evidence E. MD ranges from 0 to
+1 also, with +1 representing maximum increase in disbelief (total disbelief) and
0 representing no increase. In both measures, the evidence E may be absent or
may be replaced with another hypothesis, MB(H1,H2). This represents the increased
belief in H1 given H2 is true.
In an attempt to formalize the uncertainty measure in MYCIN, definitions of
MB and MD have been given in terms of prior and conditional probabilities. It
Sec. 6.5 Ad Hoc Methods 121
should be remembered, however, the actual values are often subjective probability
estimates provided by a physician. We have for the definitions
MB(H,E) = 1                                               if P(H) = 1
MB(H,E) = (max[P(H|E),P(H)] - P(H)) / (max[1,0] - P(H))   otherwise (6.11)
MD(H,E) = 1                                               if P(H) = 0
MD(H,E) = (min[P(H|E),P(H)] - P(H)) / (min[1,0] - P(H))   otherwise (6.12)
Note that when 0 < P(H) < 1, and E and H are independent (so P(H|E) =
P(H)), then MB = MD = 0. This would be the case if E provided no useful
information.
The two measures MB and MD are combined into a single measure called
the certainty factor (CF), defined by
CF(H.E)=MB(H,E)—MD(H,) (6.13)
Note that the value of CF ranges from -1 (certain disbelief) to +1 (certain
belief). Furthermore, a value of CF = 0 will result if E neither confirms nor disconfirms
H (E and H are independent).
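The three definitions (6.11) to (6.13) can be checked numerically with a short sketch. The function names and the sample probabilities are assumptions chosen for illustration; in MYCIN itself the values are subjective estimates, not computed quantities.

```python
def measure_of_belief(p_h, p_h_given_e):
    """MB(H,E) from Eq. (6.11); max[1,0] is simply 1."""
    if p_h == 1.0:
        return 1.0
    return (max(p_h_given_e, p_h) - p_h) / (1.0 - p_h)


def measure_of_disbelief(p_h, p_h_given_e):
    """MD(H,E) from Eq. (6.12); min[1,0] is simply 0.

    (min - P(H)) / (0 - P(H)) is written as (P(H) - min) / P(H),
    which is the same quantity with both signs flipped.
    """
    if p_h == 0.0:
        return 1.0
    return (p_h - min(p_h_given_e, p_h)) / p_h


def certainty_factor(p_h, p_h_given_e):
    """CF(H,E) = MB(H,E) - MD(H,E), Eq. (6.13)."""
    return measure_of_belief(p_h, p_h_given_e) - measure_of_disbelief(p_h, p_h_given_e)


# Confirming evidence raises P(H) from 0.2 to 0.6, so CF is positive.
confirming = certainty_factor(0.2, 0.6)
# Disconfirming evidence lowers P(H) from 0.5 to 0.2, so CF is negative.
disconfirming = certainty_factor(0.5, 0.2)
# Independent evidence leaves P(H) unchanged, so MB = MD = CF = 0.
independent = certainty_factor(0.3, 0.3)
```

Only one of MB and MD can be nonzero for a single piece of evidence, since P(H|E) is either above or below P(H).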
In MYCIN, each rule hypothesis Hi has an associated MB and MD initially
set to zero. As evidence is accumulated, they are updated using intermediate combining
functions, and, when all applicable rules have been executed, a final CF is calculated
for each Hi. These are then compared and the largest cumulative confirmations or
disconfirmations are used to determine the appropriate therapy. A threshold value
of |CF| > 0.2 is used to prevent the acceptance of a weakly supported hypothesis.
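The accumulation and threshold test can be sketched as follows. The text does not give the combining functions, so this sketch assumes the commonly cited incremental form, in which each new piece of evidence closes a fraction of the remaining gap to 1. The function names and evidence strengths are hypothetical.

```python
def combine(current, increment):
    """Assumed incremental combining function: new = old + inc * (1 - old)."""
    return current + increment * (1.0 - current)


def final_cf(evidence, threshold=0.2):
    """Accumulate MB and MD from zero, then return (CF, accepted).

    `evidence` is a list of (kind, strength) pairs where kind is
    "confirm" or "disconfirm" and strength lies in [0, 1].
    """
    mb = 0.0
    md = 0.0
    for kind, strength in evidence:
        if kind == "confirm":
            mb = combine(mb, strength)
        else:
            md = combine(md, strength)
    cf = mb - md
    return cf, abs(cf) > threshold


# Two confirming rules and one disconfirming rule fire for a hypothesis.
cf_strong, accepted_strong = final_cf(
    [("confirm", 0.4), ("confirm", 0.5), ("disconfirm", 0.3)]
)
# A single weak rule stays below the 0.2 threshold.
cf_weak, accepted_weak = final_cf([("confirm", 0.1)])
```

Under this assumption the accumulated MB and MD never exceed 1, so the final CF stays within the range -1 to +1 required by (6.13).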
In the initial assignment of belief values an expert will consider all available
confirming and disconfirming evidence, E1, . . . , Ek, and assign appropriate, consis-
tent values to both. For example, in the assignment process, a value of 1 should
be assigned if and only if a piece of evidence logically implies H (or ~H) with certainty.
Additional rules related to the assignment process must also be carefully followed
when using such methods.
Ad hoc methods have been used in a large number of knowledge-based systems,
more so than have the more formal methods. This is largely because of the difficulties
encountered in acquiring large numbers of reliable probabilities related to the given
domain and the complexities of the ensuing calculations. But, in bypassing the
formal approaches, one should question what end results can be expected. Are they
poorer than would be obtained using formal methods? The answer to this question
seems to be not likely. Sensitivity analyses (Buchanan et al., 1984) seem to indicate
that the outcomes are not too sensitive to either the method or the actual values
used for many systems. However, much work remains to be done in this area
before a useful theory can be formulated.
IF: Client income need is high and net worth is medium to high.
THEN: Risk-tolerance level is medium.
IF: Client tax bracket is high and risk-tolerance level is low.
THEN: Tax-exempt mutual-funds are indicated.
IF: Client age is high and income needs are high and retirement income
is medium,
Endorsements are used to control the reasoning process in at least two different
ways. First, preference is given to rules which are strongly supported. Second,
endorsements permit the condition or left-hand side of a rule to be satisfied (or
rejected) without finding an exact match in the KB. The SOLOMON system is
goal driven and uses a form of backward rule chaining. A goal is achieved when
all of the inference conditions in the left-hand side of the goal rule have been
proved. This requires proving subgoals and sub-subgoals until the chain of inferences
is completed.
The control structure in SOLOMON is based on the use of an agenda where
tasks derived from rules are ordered for completion on the strength of their endorse-
ments. Strongly endorsed tasks are scheduled ahead of weakly endorsed ones. When
a task is removed for execution, its endorsements are checked to see if it is
still worth completing.
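An agenda of this kind can be sketched with a priority queue. SOLOMON's endorsements are symbolic, so the numeric strengths below are a simplifying assumption, and the class name, task names, and the `still_worthwhile` check are hypothetical.

```python
import heapq
import itertools


class Agenda:
    """Tasks ordered by endorsement strength, strongest first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def add(self, task, strength):
        # heapq is a min-heap, so the strength is negated.
        heapq.heappush(self._heap, (-strength, next(self._counter), task))

    def run(self, still_worthwhile):
        """Pop tasks in order, re-checking each one before it is executed."""
        executed = []
        while self._heap:
            _, _, task = heapq.heappop(self._heap)
            if still_worthwhile(task):
                executed.append(task)
        return executed


agenda = Agenda()
agenda.add("establish tax bracket", 0.4)
agenda.add("establish risk tolerance", 0.9)
agenda.add("establish client age", 0.7)

# Suppose the tax-bracket task has lost its support by the time it is popped.
order = agenda.run(lambda task: task != "establish tax bracket")
```

The re-check at removal time mirrors the point made above: a task that was worth scheduling may no longer be worth completing once other inferences have run.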
Endorsements are propagated over inferences P → Q by combining, replacing,
or dropping endorsements Ep associated with antecedents P, endorsements of the
implication itself, and other evidential relationships between Q and conclusions in
the KB.
The SOLOMON system borrowed several design features from another heuristic
reasoning system developed by Douglas Lenat called AM (Davis and Lenat, 1982).
AM discovers basic concepts in mathematics by investigating examples of a newly
generated conjecture and looking for regularities and extreme or boundary values
in the examples. With an exponential number of available tasks, the system is
always uncertain about what to work on next. For this, AM also uses an agenda to
schedule its tasks for further investigation. Here again, heuristics are used to control
the reasoning process. The system does this by developing a numerical "interest"
factor rating for tasks which is used to determine the task's position on the agenda.
Like the SOLOMON system, AM gives more strength to rules which have supportive
evidence.
Although both AM and SOLOMON take into account the importance of the
evidence, SOLOMON differs in one respect. SOLOMON also accounts for the
accuracy of the evidence, just as the testimony of an eyewitness is more convincing
than circumstantial evidence. AM is unable to assess accuracy as such.
6.7 SUMMARY
EXERCISES
6.1 Find the probability of the event A when it is known that some event B occurred.
From experiments it has been determined that P(B|A) = 0.84, P(A) = 0.2, and P(B)
= 0.34.
6.2 Prove that if A and B are independent, P(A|B) = P(A). (Note that A and B are independent
if and only if P(A & B) = P(A)P(B).)
6.3 From basic set theory prove that P(~A) = 1 - P(A), and that P(~B|A) = 1 - P(B|A).
6.4 Is it possible to compute P(A|B) when you are only given P(A), P(B|A), and P(B)?
Explain your answer.
6.5 Write the joint distribution of the variables in the following causal network
as a product of the chain conditional probabilities:
[Figure: causal network for Exercise 6.5]
Structured Knowledge: Graphs, Frames, and Related Structures
7.1 INTRODUCTION
The representations studied in Chapter 4 are suitable for the expression of fairly
simple facts. They can be written in clausal form as independent units and placed
in a KB in any order. Inference is straightforward with a procedure such as chaining
or resolution. For example, facts about Bob, a university professor, might be entered
as clauses in a KB as depicted in Figure 7.1.

PROFESSION(bob,professor)
FACULTY(bob,engineering)
MARRIED(bob,sandy)
FATHER-OF(bob,sue,joe)
DRIVES(bob,buick)
OWNS(bob,house)
~MARRIED(x,y) V MARRIED(y,x)

Figure 7.1
The entries in the KB of Figure 7.1 have no particular order or grouping
associated with them. Furthermore, in representing various facts about Bob, it was
necessary to repeat Bob's name for each association given. All facts appear indepen-
dently, without any linkage to other facts, even though they may be closely related
conceptually (Bob is married, owns a house, has children, drives a Buick, and so
forth).
For small KBs, the representation used in Figure 7.1 presents no problem.
Adding or otherwise changing facts in the KB is easy enough, and a search of all
clauses in the KB can be made if necessary when performing inferences. When
the quantity of information becomes large and more complex, however, the acquisi-
tion, comprehension, use, and maintenance of the knowledge can become difficult
or even intractable. In such cases, some form of knowledge structuring and organization
becomes a necessity.
Real-world problem domains typically involve a number and variety of different
objects interacting with each other in different ways. The objects themselves may
require extensive characterizations, and their interaction relationships with other
objects may be very complex.
[Figure 7.2: fragment of an associative network, with arcs such as CAN and COLOR]
networks (also known as semantic networks) and conceptual graphs and give some
of their properties.
Associative networks are directed graphs with labeled nodes and arcs or arrows.
The language used in constructing a network is based on selected domain primitives
for objects and relations as well as some general primitives. A fragment of a simple
network is illustrated in Figure 7.2. In the figure, a class of objects known as Bird
is depicted. The class has some properties and a specific member of the class named
Tweety is shown. The color of Tweety is seen to be yellow.
Associative networks were introduced by Quillian (1968) to model the semantics
of English sentences and words. He called his structures semantic networks to signify
their intended use. He developed a system which "found" the meanings between
words by the paths connecting them. The connections were determined through a
kind of "spreading activation" between the two words.
Quillian's model of semantic networks has a certain intuitive appeal in that
related information is clustered and bound together through relational links. The
knowledge required for the performance of some task is typically contained within
a narrow domain or "semantic vicinity" of the task. This type of organization, in
some ways, resembles the way knowledge is stored and retrieved in humans.
The graphical portrayal of knowledge can also be somewhat more expressive
than other representation schemes. This probably accounts for the popularity of
networks and the diversity of representation models for which they have been employed. They
have, for example, been used in a variety of systems such as natural language
understanding, information retrieval, deductive data bases, learning systems, com-
puter vision, and speech generation systems.
Unlike FOPL, there is no generally accepted syntax or semantics for associative
networks. Such rules tend to be designer dependent and vary greatly from one
implementation to another. Most network systems are based on PL or FOPL with
extensions, however. The syntax for any given system is determined by the object
and relation primitives chosen and by any special rules used to connect nodes.
Some efforts have been made toward the establishment of acceptable standards by
Schubert, Goebel, and Cercone (1979), Shapiro (1979), Hendrix (1979), and Brach-
man (1979). Later in this section we will review one formal approach to graphical
representations which was recently proposed by John Sowa (1984).
Basically, the language of associative networks is formed from letters of the
alphabet, both upper- and lowercase, relational symbols, set membership and subset
symbols, decimal digits, square and oval nodes, and directed arcs of arbitrary length.
The word symbols used are those which represent object constants and n-ary relation
constants. Nodes are commonly used for objects or nouns, and arcs (or arc nodes)
for relations. The direction of an arc is usually taken from the first to subsequent
arguments as they appear in a relational statement. Thus, OWNS(bob,house) would
be written as

bob ---OWNS---> house
Figure 7.3 depicts graphically some additional concepts not expressed in Figure
7.1.
A number of arc relations have become common among users. They include
such predicates as ISA, MEMBER-OF, SUBSET-OF, AKO (a-kind-of), HAS-
[Figure 7.3: associative network for a state university system, showing the university system (with BUDGET and state policies, and an IS A link to institute of higher learning), U.T. Austin and U.T. El Paso, the College of Engineering, and Bob with MARRIED-TO, OWNS, and DRIVES arcs]
GENERIC-GENERIC RELATIONSHIPS
GENERIC-INDIVIDUAL RELATIONSHIPS
Figure 7.3 illustrates some important features associative networks are good
at representing. First, it should be apparent that networks clearly show an entity's
attributes and its relationships to other entities. This makes it easy to retrieve the
properties an entity shares with other entities. For this, it is only necessary to check
direct links tied to that entity. Second, networks can be constructed to exhibit any
hierarchical or taxonomic structure inherent in a group of entities or concepts. For
example, at the top of the structure in Figure 7.3 is the Texas State University
System. One level down from this node are specific state universities within the
system. One of these universities, the University of Texas at El Paso, is shown
with some of its subparts, colleges, which in turn have subparts, the different depart-
ments. One member of the Computer Science Department is Bob, a professor who
owns a house and is married to Sandy. Finally, we see that networks depict the
way in which knowledge is distributed or clustered about entities in a KB.
Associative network structures permit the implementation of property inheri-
tance, a form of inference. Nodes which are members or subsets of other nodes
may inherit properties from higher level ancestor nodes. For example, from
the network of Figure 7.4, it is possible to infer that a mouse has hair and drinks
milk.
Property inheritance of this type is recognized as a form of default reasoning.
The assumption is made that unless there is information to the contrary, it is reasonable
for an entity to inherit characteristics from its ancestor nodes. As the name suggests,
this type of inheritance is called default inheritance. When an object does not or
cannot inherit certain properties, it would be assigned values of its own which
override any inherited ones.
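Default inheritance with local overrides can be sketched as a walk up the ISA links. The node names, links, and property names below are assumptions for illustration (Figure 7.4 is not reproduced here); only the conclusions that a mouse has hair and drinks milk, and that Tweety is a yellow bird, come from the text.

```python
# Assumed ISA hierarchy: each node points to its parent node.
ISA = {
    "mouse": "rodent",
    "rodent": "mammal",
    "tweety": "bird",
    "penguin": "bird",
}

# Properties are stored once, at the highest node to which they apply.
PROPERTIES = {
    "mammal": {"covering": "hair", "drinks": "milk"},
    "bird": {"can_fly": True},
    "penguin": {"can_fly": False},  # a local value overrides the inherited default
    "tweety": {"color": "yellow"},
}


def inherited_value(node, prop):
    """Return the nearest value of `prop`, searching from `node` upward."""
    while node is not None:
        local = PROPERTIES.get(node, {})
        if prop in local:
            return local[prop]
        node = ISA.get(node)  # move to the ancestor, or None at the top
    return None


mouse_covering = inherited_value("mouse", "covering")
tweety_flies = inherited_value("tweety", "can_fly")
penguin_flies = inherited_value("penguin", "can_fly")
```

Because the search stops at the first node that holds a value, a value stored on a node always takes precedence over anything stored on its ancestors.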
Data structures patterned after associative networks also permit the efficient
storage of information since it is only necessary to explicitly store objects and
shared properties once. Shared properties are attached only to the highest node in
a structure to which they apply. For example, in LISP, Bob's associations (Figure
7.3) can be implemented with property lists where all of Bob's properties are linked
to one atom.
The semantics of associative networks are sometimes defined along the same
lines as those of traditional logics. In fact, some network system definitions provide
a means of mapping to and from PL or FOPL expressions. For these systems, the
semantics are based on interpretations. Thus, an interpretation satisfies a portion of
a network if and only if all arc relations hold in the given portion.
Inference procedures for networks can also parallel those of PL and FOPL.
If a class A of objects has some property P, and a is a member of A, we infer that
a has property P. Syntactic inference in networks can also be defined using parallels
to traditional logics such as unification, chaining, modus ponens, and even resolution.
These procedures are implemented through node and arc matching processes and
operators which insert, erase, copy, simplify, and join networks. We examine some
typical inferencing procedures in more detail below.
Conceptual Graphs
Although there are no commonly accepted standards for a syntax and semantics
for associative networks, we present an approach in this section which we feel
may at least become a de facto standard in the future. It is based on the use of
the conceptual graph as a primitive building block for associative networks. The
formalism of these graphs has been adopted as a basic representation by a number
of AI researchers and a variety of implementations using conceptual graphs are
currently under development. Much of the popularity of these graphs has been due
to recent work by John Sowa (1984) and his colleagues.
A conceptual graph is a graphical portrayal of a mental perception which
consists of basic or primitive concepts and the relationships that exist between the
concepts. A single conceptual graph is roughly equivalent to a graphical diagram
of a natural language sentence where the words are depicted as concepts and relation-
ships. Conceptual graphs may be regarded as formal building blocks for associative
networks which, when linked together in a coherent way, form a more complex
knowledge structure. An example of such a graph which represents the sentence
"Joe is eating soup with a spoon" is depicted in Figure 7.5.
In Figure 7.5, concepts are enclosed in boxes and relations between the concepts
are enclosed in ovals. The direction of the arrow corresponds to the order of the
arguments in the relation they connect. The last or nth arc (argument) points away
from the circle relation and all other arcs point toward the relation.
Concept symbols refer to entities, actions, properties, or events in the world.
A concept may be individual or generic. Individual concepts have a type field followed
by a referent field. The concept [PERSON:joe] has type PERSON and referent
joe. Concepts with referents like joe and soup in Figure 7.5 are called individual concepts since
they refer to specific entities. EAT and SPOON have no referent fields since they
are generic concepts which refer to unspecified entities. Concepts like AGENT,
OBJECT, INSTRUMENT, and PART are obtained from a collection of standard
concepts. New concepts and relations can also be defined from these basic ones.
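[Figure 7.5: conceptual graph for "Joe is eating soup with a spoon". In linear form:]

[PERSON:joe] ← (AGENT) ← [EAT] -
    (OBJECT) → [FOOD:soup]
    (INSTRUMENT) → [SPOON]
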
where square brackets have replaced concept boxes and parentheses have replaced
relation circles.
The language of conceptual graphs is formed from upper- and lowercase letters
of the alphabet, boxes and circles, directed arcs, and a number of special characters
including -, ?, !, *, #, @, ∀, ~, ", :, [, ], (, ), →, ←, {, and }. Some symbols
are used to exhibit the structure of the graph, while others are used to determine
the referents.
The dash signifies continuation of the linear graph on the next line. The question
mark is used to signify a query about a concept when placed in the referent field:
[HOUSE:?] means which house? The exclamation mark is used for emphasis to
draw attention to a concept. The asterisk signifies a variable or unspecified object:
[HOUSE:*x] means a house or some house. The pound sign signifies a definite
article known to the speaker. For example, [HOUSE:#432] refers to a specific house,
house number 432. The @ symbol relates to quantification: [HOUSE:@n] means
n houses. ∀ signifies every or all, the same as in FOPL. The tilde is negation.
Double quotation marks delimit literal strings. And the colon, brackets, parentheses,
and directed arcs are used to construct graph structures as illustrated above.
Since conceptual graphs and FOPL are both a form of logical system, one
might expect that it is possible to map from one representation to the other. Indeed
this is the case, although some mappings will, in general, result in second order
FOPL statements.
To transform a conceptual graph to a predicate logic statement requires that
unique variable names be assigned to every generic concept of the graph. Thus,
the concepts EAT and SPOON of Figure 7.5 would be assigned the variable names
x and y, respectively. Next, all type labels such as PERSON and FOOD are converted
to unary predicates with the same name. Conceptual relations such as AGENT,
OBJECT, and INSTRUMENT are converted to predicates with as many arguments
as there are arcs connected to the relation. Concept referents such as joe and soup
become FOPL constants. Concepts with extended referents such as ∀ map to the
universal quantifier ∀. Generic concepts with no quantifier in the referent field
have an existential quantifier, ∃, placed before the formula for each variable, and
conjunction symbols, &, are placed between the predicates.
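The translation rules above can be sketched for graphs that contain only individual and generic concepts (no ∀ referents). The data layout, the function name, and the variable names x1, x2 are assumptions for illustration; the graph encoded is the one described for Figure 7.5.

```python
from itertools import count


def graph_to_fopl(concepts, relations):
    """Translate a simple conceptual graph into an FOPL string.

    concepts:  dict mapping a node id to (TYPE, referent or None)
    relations: list of (RELATION, [node ids in argument order])
    Generic concepts (referent None) get fresh existential variables;
    individual concepts contribute their referent as a constant.
    """
    fresh = (f"x{i}" for i in count(1))
    term = {}
    variables = []
    predicates = []
    for node, (ctype, referent) in concepts.items():
        if referent is None:
            term[node] = next(fresh)
            variables.append(term[node])
        else:
            term[node] = referent
        predicates.append(f"{ctype}({term[node]})")  # type label -> unary predicate
    for relation, args in relations:
        predicates.append(f"{relation}({','.join(term[a] for a in args)})")
    prefix = "".join(f"∃{v} " for v in variables)
    return prefix + "(" + " & ".join(predicates) + ")"


# "Joe is eating soup with a spoon"
concepts = {
    "c1": ("PERSON", "joe"),
    "c2": ("EAT", None),
    "c3": ("FOOD", "soup"),
    "c4": ("SPOON", None),
}
relations = [
    ("AGENT", ["c2", "c1"]),
    ("OBJECT", ["c2", "c3"]),
    ("INSTRUMENT", ["c2", "c4"]),
]
formula = graph_to_fopl(concepts, relations)
```

The argument order of each relation follows the arc convention described earlier: arcs pointing into the relation come first, and the arc pointing away from it supplies the last argument.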
As an example, one could convert the sentence "Every car has an engine"
from its conceptual graph representation to its equivalent FOPL representation
using the rules outlined above.
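As a hedged worked illustration of these rules, assume the sentence is drawn with the universal referent on CAR and the standard PART relation linking the two concepts; both the graph and the choice of PART are assumptions, not taken from the figure. The generic concept ENGINE then receives an existential variable and the ∀ referent maps to a universal quantifier:

```latex
[\mathrm{CAR}: \forall] \rightarrow (\mathrm{PART}) \rightarrow [\mathrm{ENGINE}]
\quad\Longrightarrow\quad
\forall x\, \exists y\, \bigl(\mathrm{CAR}(x) \rightarrow (\mathrm{ENGINE}(y) \;\&\; \mathrm{PART}(x, y))\bigr)
```

The universal quantifier introduces an implication rather than a conjunction, since the statement constrains only those x that are cars.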
Mapping the other way, that is, from FOPL statements to conceptual graphs,
begins by putting the FOPL formula into prenex normal form, and converting all
logical connectives to negation and conjunction. Next, every occurrence of universal
quantification ∀x is replaced with the equivalent form ~∃x~ (in graph notation this
is ~[ ~[ with the subsequent addition of balancing brackets ] ] to close off the
expression). Every variable x and every occurrence of ∃x is then replaced with the
most general type concept denoted as [T:*x]. And finally, every n-ary predicate
symbol is replaced with an n-ary concept relation whose ith arc is attached to the
concept in the ith argument place in the predicate.
Implication in a conceptual graph can be represented with negation and conjunc-
tion. For example, the graph equivalent of the FOPL statement P → Q can be written as ~[P ~[Q]]
(recall that P → Q = (~P V Q) = ~(P & ~Q)). In this expression, ~[ is read as if
and the nested ~[ is read as then. More generally, we write the implication as
~[*p ~[*q]] where *p and *q are themselves any conceptual graph.
Inference can be accomplished by modifying and combining graphs through
the use of operators and basic graph inference rules. Four useful graph formation
operators are copy, restrict, join, and simplify. These operators are defined as follows.
Tweety: [BIRD:tweety]
ate: [ANIMAL] ← (AGENT) ← [ATE] → (PATIENT) → [ENTITY]
a: [T:*]
fat: [FAT] ← (ATTRIBUTE) ← [PHYSICAL-OBJECT]
worm: [WORM]
The [T:*] signifies that something of an unspecified type exists (T is the most
general type of all concepts).
From these basic graphs a single conceptual graph can be constructed using
the formation operators. First, the subgraph for "a fat worm" is constructed by
restricting PHYSICAL-OBJECT in the fat graph to WORM and then joining it to
the graph for worm to get [FAT] ← (ATTRIBUTE) ← [WORM]. Next, ENTITY
in the graph for ate is restricted to WORM and joined to the graph just completed.
This gives
The final conceptual graph is obtained by restricting ANIMAL to BIRD with referent
Tweety, joining the graphs and labeling the whole graph with PAST (for past tense).
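The restrict and join steps for this example can be sketched with graphs stored as sets of triples. This is a simplified sketch under stated assumptions: the type hierarchy is invented for the example, a triple (source, relation, target) is only one possible encoding, and concepts are identified by their (type, referent) pair, so every generic WORM node is treated as the same node.

```python
# Assumed type hierarchy: each type points to its immediate supertype.
SUPERTYPE = {
    "WORM": "ANIMAL",
    "BIRD": "ANIMAL",
    "ANIMAL": "PHYSICAL-OBJECT",
    "PHYSICAL-OBJECT": "ENTITY",
}


def is_subtype(ctype, ancestor):
    """True if `ctype` equals `ancestor` or lies below it in the hierarchy."""
    while ctype is not None:
        if ctype == ancestor:
            return True
        ctype = SUPERTYPE.get(ctype)
    return False


def restrict(graph, old, new):
    """Replace concept `old` with the more specific concept `new`."""
    (old_type, old_ref), (new_type, new_ref) = old, new
    if not is_subtype(new_type, old_type):
        raise ValueError(f"{new_type} is not a subtype of {old_type}")
    if old_ref is not None and old_ref != new_ref:
        raise ValueError("an individual referent cannot be changed")
    if not any(old in (a, b) for a, _, b in graph):
        raise ValueError("concept to restrict is not in the graph")
    swap = lambda c: new if c == old else c
    return {(swap(a), rel, swap(b)) for a, rel, b in graph}


def join(g1, g2):
    """Merge two graphs that share at least one identical concept."""
    nodes1 = {c for a, _, b in g1 for c in (a, b)}
    nodes2 = {c for a, _, b in g2 for c in (a, b)}
    if not nodes1 & nodes2:
        raise ValueError("graphs have no concept in common")
    return g1 | g2


ATE, FAT, WORM = ("ATE", None), ("FAT", None), ("WORM", None)
fat = {(("PHYSICAL-OBJECT", None), "ATTRIBUTE", FAT)}
ate = {(ATE, "AGENT", ("ANIMAL", None)), (ATE, "PATIENT", ("ENTITY", None))}

fat_worm = restrict(fat, ("PHYSICAL-OBJECT", None), WORM)
ate_worm = join(restrict(ate, ("ENTITY", None), WORM), fat_worm)
tweety_ate = restrict(ate_worm, ("ANIMAL", None), ("BIRD", "tweety"))
```

Each restriction moves a concept down the type hierarchy, which is the concept specialization discussed in the following paragraph; the PAST label on the whole graph is left out of this sketch.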
In forming the above graph, concept specialization occurred (e.g., when restric-
tion took place as in PHYSICAL-OBJECT to WORM). Thus, the formation rules
and their inverses provide one method of inference. When rules for handling negation
and other basic inference rules are combined with the formation rules, a complete
inference system is obtained. This system is truth preserving. The inference rules
needed which are the equivalent of those in a PL system are defined as follows.
Insertion. Any conceptual graph may be inserted into another graph context
which is enclosed by an odd number of negations.
Other inference methods including inheritance (if all A have property P and
all B are A, all B have property P) and default reasoning are also possible with
conceptual graphs. The implementation of modal logic formalisms with these graphs
is possible by using concepts such as possible and necessary. Heuristic reasoning
can be accomplished within the theory of conceptual graphs.
In summary, conceptual graphs offer the means to represent natural language
statements accurately and to perform many forms of inference found in common
sense reasoning.
Frames were first introduced by Marvin Minsky (1975) as a data structure to represent
a mental model of a stereotypical situation such as driving a car, attending a meeting,
or eating in a restaurant. Knowledge about an object or event is stored together in
memory as a unit. Then, when a new situation is encountered, an appropriate frame
is selected from memory for use in reasoning about the situation.
Frames are general record-like structures which consist of a collection of slots
and slot values. The slots may be of any size and type. Slots typically have names
and values or subfields called facets. Facets may also have names and any number
of values. An example of a simple frame for Bob is depicted in Figure 7.6 and a
general frame template structure is illustrated in Figure 7.7.
From Figure 7.7 it will be seen that a frame may have any number of slots,
and a slot may have any number of facets, each with any number of values. This
gives a very general framework from which to build a variety of knowledge structures.
The slots in a frame specify general or specific characteristics of the entity
which the frame represents, and sometimes they include instructions on how to
apply or use the slot values. Typically, a slot contains information such as attribute-
value pairs, default values, conditions for filling a slot, pointers to other related
frames, and procedures that are activated when needed for different purposes. For
example, the Ford frame illustrated in Figure 7.8 has attribute-value slots (COLOR:
silver, MODEL: 4-door, and the like), a slot which takes default values for GAS-
MILEAGE, and a slot with an attached if-needed procedure.
(bob
  (PROFESSION (VALUE professor))
  (AGE (VALUE 42))
  (WIFE (VALUE sandy))
  (CHILDREN (VALUE sue joe))
  (ADDRESS (STREET (VALUE 100 elm))
           (CITY (VALUE dallas))
           (STATE (VALUE tx))
           (ZIP (VALUE 75000))))

(<frame name>
  (<slot1> (<facet1> <value1> .... <valuek1>)
           (<facet2> <value1> .... <valuek2>)
The value fget in the GAS-MILEAGE slot is a function call to fetch a default
value from another frame such as the general car frame of which Ford is a-kind-of
(AKO). When the value of this slot is evaluated, the fget function is activated.
When fget finds no value for gas mileage it recursively looks for a value from
ancestor frames until a value is found.
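The recursive lookup can be sketched in a few lines. The frames are stored as nested dictionaries (frame, then slot, then facet); the `car` frame and all slot values other than COLOR and MODEL are assumptions for illustration, and the AKO links are assumed to contain no cycles.

```python
# frame name -> slot -> facet -> list of values
FRAMES = {
    "car": {
        "GAS-MILEAGE": {"VALUE": [22]},  # assumed default for all cars
        "WHEELS": {"VALUE": [4]},
    },
    "ford": {
        "AKO": {"VALUE": ["car"]},
        "COLOR": {"VALUE": ["silver"]},
        "MODEL": {"VALUE": ["4-door"]},
    },
}


def fget(frame, slot, facet="VALUE"):
    """Return the values for a slot, searching AKO ancestors if necessary."""
    slots = FRAMES.get(frame, {})
    values = slots.get(slot, {}).get(facet)
    if values:
        return values  # a locally stored value takes precedence
    for parent in slots.get("AKO", {}).get("VALUE", []):
        inherited = fget(parent, slot, facet)
        if inherited:
            return inherited
    return None


local_color = fget("ford", "COLOR")
default_mileage = fget("ford", "GAS-MILEAGE")
```

Here the Ford frame has no gas mileage of its own, so the value stored in its ancestor is returned, which is the default behavior described above.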
The if-needed value in the Range slot is a procedure name that, when called,
computes the driving range of the Ford as a function of gas mileage and fuel capacity.
Slots with attached procedures such as fget and if-needed are called procedural
attachments or demons. They are executed automatically when a value is needed but
not provided for in a slot. Other types of demons include if-added and if-removed
procedures. They would be triggered, for example, when a value is added or removed
from a slot and other actions are needed such as updating slots in other dependent
frames.
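If-needed and if-added demons can be sketched as procedures stored alongside the slots. The class layout, slot names, and numeric values below are assumptions for illustration; only the idea that range is computed from gas mileage and fuel capacity comes from the text.

```python
class Frame:
    """A frame with plain slot values plus if-needed and if-added demons."""

    def __init__(self, name):
        self.name = name
        self.slots = {}      # slot name -> stored value
        self.if_needed = {}  # slot name -> procedure(frame) computing a value
        self.if_added = {}   # slot name -> procedure(frame, value) run on update

    def get(self, slot):
        if slot in self.slots:
            return self.slots[slot]
        demon = self.if_needed.get(slot)
        return demon(self) if demon else None  # computed only when requested

    def put(self, slot, value):
        self.slots[slot] = value
        demon = self.if_added.get(slot)
        if demon:
            demon(self, value)


log = []
ford = Frame("ford")
ford.put("GAS-MILEAGE", 22)    # assumed value, miles per gallon
ford.put("FUEL-CAPACITY", 18)  # assumed value, gallons

# If-needed demon: the range is never stored, only computed on demand.
ford.if_needed["RANGE"] = lambda f: f.get("GAS-MILEAGE") * f.get("FUEL-CAPACITY")
# If-added demon: record every change to COLOR, e.g. to update dependent frames.
ford.if_added["COLOR"] = lambda f, value: log.append((f.name, "COLOR", value))

ford.put("COLOR", "silver")
driving_range = ford.get("RANGE")
```

Because the range is recomputed on every request, it always reflects the current mileage and capacity, at the cost of repeating the calculation.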
Like associative networks, frames are usually linked together in a network
through the use of special pointers such as the AKO pointer in Figure 7.8. Hierarchies
of frames are typical for many systems where entities have supertype-subtype or
generic-instance relationships. Such networks make it possible to easily implement
property inheritance and default reasoning as depicted in Figure 7.8. This is illustrated
in the network of frames which represents various forms of transportation for people
(Figure 7.9).
[Figure 7.9: network of frames for transportation. A Transport frame (slots Origin, Destination) has subframes Public conveyance (Reservation, Pack) and Private conveyance (Plan route, Pack); Bus, Train, and Limo appear at the lowest level.]
Frame representations have become popular enough that special high level frame-
based representation languages have been developed. Most of these languages use
LISP as the host language. They typically have functions to create, access, modify,
update, and display frames. For example, a function which defines a frame might
be called with
(fdefine f-name <parents><slots>)
where fdefine is a frame definition function, f-name is the name assigned to the
new frame, <parents> is a list of all parent frames to which the new frame is
linked, and <slots> is a list of slot names and initial values. Using the function
fdefine to create a train frame we might provide the following details.
Several frame languages have now been developed to aid in building frame-
based systems. They include the Frame Representation Language (FRL) (Bobrow
et al., 1977), Knowledge Representation Language (KRL), which served as a base
language for a scheduling system called NUDGE (Goldstein et al., 1977), and KLONE
(Brachman, 1978).
One way to implement frames is with property lists. An atom is used as the frame
name and slots are given as properties. Facets and values within slots become lists
of lists for the slot property. For example, to represent a train frame we define the
following putprop.
Another way to implement frames is with an association list (an a-list), that
is, a list of sublists where each sublist contains a key and one or more corresponding
values. The same train frame would be represented using an a-list as
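The two storage schemes can be contrasted in a short Python analogue of the LISP structures. The helper names follow the LISP functions putprop, get, and assoc, but the implementations, the slot names, and the slot values for the train frame are all assumptions for illustration.

```python
# Property-list style: every property hangs off one atom (the frame name).
PROPERTY_LISTS = {}


def putprop(atom, value, prop):
    """Store `value` under property `prop` of `atom`."""
    PROPERTY_LISTS.setdefault(atom, {})[prop] = value
    return value


def get(atom, prop):
    """Fetch a property of `atom`, or None if it is absent."""
    return PROPERTY_LISTS.get(atom, {}).get(prop)


putprop("train", [("VALUE", "public-conveyance")], "AKO")
putprop("train", [("VALUE", "passenger")], "TYPE")


# A-list style: the whole frame is one list of (key, values) sublists.
def assoc(key, alist):
    """Return the first sublist whose key matches, or None."""
    for entry in alist:
        if entry[0] == key:
            return entry
    return None


train_alist = [
    ("AKO", [("VALUE", "public-conveyance")]),
    ("TYPE", [("VALUE", "passenger")]),
]

from_plist = get("train", "TYPE")
from_alist = assoc("TYPE", train_alist)
```

The property-list form attaches the frame to a global name, while the a-list is an ordinary value that can be passed around and copied, which is the main practical difference between the two.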
Conceptual Dependencies
ENTITIES
Picture producers (PP) are actors or physical objects (including human memory)
that perform different acts.
Picture aiders (PA) are supporting properties or attributes of producers.
ACTIONS
Objective Case
Directive Case
Instrumental Case
Recipient Case
CONCEPTUAL DEPENDENCIES
Semantic rules for the formation of dependency structures such as the relationship
between an actor and an event or between a primitive action and an instrument
(see Figure 7.11).
Conditional (c)
Continuing (k)
Finished Transition (tf)
Future (f)
Interrogative (?)
Negative (/)
Past (p)
Present (nil)
Start Transition (ts)
Timeless (delta)
Transition (t)
Conceptual structures in the form of a graph are used to represent the meanings
of different English (or other language) sentences. The graphs are constructed from
elementary structures in accordance with basic syntax rules. Some of the basic
concept rules are as follows.
Using these syntactic elements, structures which represent any sentence can
be constructed. Some examples of simple graphs and their corresponding sentences
are illustrated in Figure 7.11.
More complex sentence representations are constructed from the basic building
blocks given above. Note the similarities between CD theory and the conceptual
graphs of the previous section. Both have primitive concepts and relations defined,
and both have a syntax for graphical representation. Conceptual graphs differ from
CDs mainly in that the conceptual graph is logic based, whereas CD theory is
mainly concerned with the semantics of events. We now turn to the event representation
structure which uses CDs, the script.
of actors, roles, props, and scenes. Slots in a script which correspond to parts of
the event are filled with CD primitives as defined above. An example of a supermarket
script is illustrated in Figure 7.12. This script has four scenes which correspond to
the main events which commonly occur in a supermarket shopping experience.
Since scripts contain knowledge that people use for common every day activities,
they can be used to provide an expected scenario for a given situation.
Reasoning in a script begins with the creation of a partially filled script named
to meet the current situation. Next, a known script which matches the current situation
is recalled from memory. The script name, preconditions, or other key words provide
index values with which to search for the appropriate script. Inference is accomplished
by filling in slots with inherited and default values that satisfy certain conditions.
For example, if it is known that Joe-PTRANS-Joe into a supermarket and Joe-
ATRANS-cashier money, it can be inferred that Joe needed groceries, shopped for
items, paid the cashier, checked out, and left the market with groceries but with
less money than when he entered.
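Script-based inference of this kind can be sketched as matching observed events against a stored event sequence and returning the unobserved ones as default inferences. The event strings and the single stored script are assumptions for illustration and are not the contents of Figure 7.12.

```python
# Each script is an ordered list of expected events (assumed contents).
SCRIPTS = {
    "supermarket": [
        "customer PTRANS customer into supermarket",
        "customer shops for items",
        "customer ATRANS money to cashier",
        "customer checks out",
        "customer PTRANS customer out of supermarket with groceries",
    ],
}


def apply_script(observed):
    """Find a script covering all observed events and infer the rest.

    Returns (script name, list of inferred events in script order),
    or (None, []) when no stored script matches.
    """
    for name, events in SCRIPTS.items():
        if all(event in events for event in observed):
            inferred = [event for event in events if event not in observed]
            return name, inferred
    return None, []


observed = [
    "customer PTRANS customer into supermarket",
    "customer ATRANS money to cashier",
]
script_name, inferred_events = apply_script(observed)
```

Two observed events are enough to select the script, and the remaining three events are supplied as defaults, which corresponds to inferring that Joe shopped, checked out, and left with groceries.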
Scripts have now been used in a number of language understanding systems
(English as well as other languages) at Yale University by Schank and his colleagues.
One such system is SAM (Script Applier Mechanism) which reads and reasons
with text to demonstrate an "understanding" of stories (such as car accident stories
from newspapers) that were script based. Other programs developed at Yale include
PAM, POLITICS, FRUMP, IPP, BORIS, BABEL, and CYRUS. All of these pro-
grams deal with reading, planning, explaining, or in some way understanding stories.
They all used some form of script representation scheme.
7.5 SUMMARY
EXERCISES
7.1 Express the following concepts as an associative network structure with interconnected
nodes and labeled arcs.
Company ABC is a software development company. Three departments within
the company are Sales, Administration, and Programming. Joe is the manager
of Programming. Bill and Sue are programmers. Sue is married to Sam. Sam is
an editor for Prentice Hall. They have three children, and they live on Elm
street. Sue wears glasses and is five feet four inches tall.
7.2 Write LISP expressions which represent the associative network of Problem 7.1.
a. using property lists, and
b. using a-lists.
7.3 Write PROLOG expressions which represent the associative network of Problem 7.1.
7.4 Transform the FOPL statements given below into equivalent conceptual graphs.
a. ∀x NORMAL(x) & GROWN(x) → WALK(x).
b. ∀x,y MARRIED(x,y) → MARRIED(y,x).
c. ∀x HASWINGS(x) & LAYSEGGS(x) → ISBIRD(x).
7.5 Transform the following conceptual graphs into equivalent FOPL statements.
a. [PERSON:sue] ← (AGENT) ← [DRINK] -
(OBJECT) → [FOOD:milk]
(INSTRUMENT) → [GLASS]
b. (PAST) → [[CAMEL:clyde] ← (AGENT) ← [DRINK] → (OBJECT) →
[WATER] → (ATTRIBUTE) → [50-GALLONS]]
7.6 The original primitives of conceptual dependency theory developed by Schank fail to
represent some important concepts directly. What additional primitives can you discover
that would be useful?
7.7 Create a movie script similar to the supermarket script of Figure 7.12.
7.8 What are the main differences between scripts and frame structures?
7.9 Express the following sentences as conceptual dependency structures.
a. Bill is a programmer.
b. Sam gave Mary a box of candy.
c. Charlie drove the pickup fast.
7.10 Create a frame network for terrestrial motor vehicles (cars, trucks, motorcycles) and
give one complete frame in detail for cars which includes the slots for the main component
parts, their attributes, and relations between parts. Include an if-needed slot for the
gas mileage of each type.
7.11 Write a LISP program to create a frame data structure which represents the car frame
of Problem 7.10.
7.12 Compare the inference process using frames to that of inference in FOPL. Give examples
of both.
Object-Oriented
Representations
The previous chapter marked a departure from the approaches to knowledge
representation of earlier chapters in that the methods there focused on adding structure to the
knowledge. Structure was added by linking concepts through relations and clustering
together all related knowledge about an object. In some cases, as with frames,
procedures related to the knowledge were also attached to the knowledge cluster.
The approach of grouping knowledge and related procedures together into a cohesive
unit is carried even further with object-oriented systems which we examine in some
detail in this chapter.
8.1 INTRODUCTION
Grouping related knowledge together in AI systems gains some of the same cognitive
advantages realized in the human brain. The knowledge required for a given cognitive
task is usually quite limited in domain and scope. Therefore, access and processing
can be made more efficient by grouping or partitioning related knowledge together
as a unit. We saw how this notion was implemented with linked frames in the
previous chapter. We shall see in this chapter that object-oriented systems share a
number of similarities with the frame implementations.
In procedural programming languages such as Pascal or FORTRAN, a program
consists of a procedural part and a data part. The procedural part consists of the
set of program instructions, and the data part, the numbers and character strings
that are manipulated by the instructions. Programs typically contain several modules
of instructions that perform computations on the same data set. When some change
is made to the format of the data, every module that uses it must then be modified
to accommodate the newly revised format. This places a heavy burden on the software
maintenance process and makes these types of programs more prone to errors.
In an object-oriented system (OOS) the emphasis between data and procedures
is reversed. Data becomes the primary object and procedures are secondary. For
example, everything in the universe of an OOS is an object, and objects are inaccessible
to outside procedures. This form of structuring is sometimes called encapsulation
or data hiding. It is a well known system design principle used to make systems
more modular and robust. With encapsulation, objects are associated with their
own procedures and, as such, are responsible for their own actions. Thus, when
some change is required in the data or procedure, only the changed object need be
modified. Other objects are not affected and therefore require no modifications.
In object-oriented systems there is a simplicity in structure because almost
everything is an object. For example, a car can be regarded as an object consisting
of many interacting components or subobjects: an engine, electrical system, fuel
system, drive train, controls, and so on. To model such a system using an object-
oriented approach requires that all parts be declared as objects, each one characterized
by its own attributes and its own operational behavior. Even a simple windshield
wiper would be described as an object with given attributes and operations. As
such, it might be described as the structure presented in Figure 8.1.
This object has a name, a class characterization, several distinguishing attributes,
and a set of operations. Since all characteristics of the wiper object, including its
operations, are contained within a single entity, only this entity needs changing
when some design change is made to this part of the car. Other objects that interact
with it are not affected, provided the communication procedures between the objects
were not changed. To initiate the task of cleaning moisture from the windshield
requires only that a message be sent to the object. This message can remain the
same even though the structure of the wiper or its mode of operation may have
changed.
Because there is more than one wiper, each with similar attributes and operations,
some savings in memory and procedures can be realized by creating a generic
class which has all the characteristics which are common to the left, right, and
rear wipers. The three instances retain some characteristics unique to themselves,
but they inherit common attributes and operations from the more general wiper
class.
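The wiper example can be sketched in a present-day object-oriented language. The Python fragment below is only an illustration and is not the notation used in this chapter; the class name WindshieldWiper, its attributes, and the clean message are assumed names chosen for the sketch.

```python
class WindshieldWiper:
    """Generic wiper class: attributes and operations shared by all wipers."""

    def __init__(self, position, blade_length_cm):
        # Instance-specific attributes (hypothetical names and units).
        self.position = position
        self.blade_length_cm = blade_length_cm
        self._running = False  # internal state, hidden from other objects

    def clean(self):
        """Respond to the 'clean' message; callers never touch _running."""
        self._running = True
        return f"{self.position} wiper cleaning"


# Three instances share the one operation defined in the class.
wipers = [
    WindshieldWiper("left", 50),
    WindshieldWiper("right", 50),
    WindshieldWiper("rear", 30),
]
reports = [w.clean() for w in wipers]
print(reports)
```

If the internal mechanism of clean were redesigned, the three calls at the bottom would remain unchanged, which is the point made above about messages staying the same.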
The object paradigm described above seems to model real-world systems more
closely than the procedural programming models where objects (data) and procedures
are separated. In object-oriented systems, objects become individual, self-contained
units that exhibit a certain behavior of their own and interact with other objects
only through the passing of messages. Tasks get performed when a message is
sent to an object that can perform the task. All the details of the task are rightfully
hidden from other objects. For example, when your car needs repairing, you send
a message to the repair shop. The repair shop, in turn, may need parts from one
or more manufacturers for which they must send messages. When the car has been
repaired, the repair shop sends you a message informing you of that fact.
In having the shop repair your car, you probably are not interested in all the
details related to the repair, the fact that messages were sent to other organizations
for parts, that they were obtained by Federal Express, and that certain detailed
procedures were taken to complete the repair process. Your primary concern is
that the repair operation was properly completed and your car returned in working
order. The need to model operational behavior such as this has prompted the develop-
ment of object-oriented systems.
The basic idea behind an OOS is the notion of classes of objects interacting with
each other to accomplish some set of tasks. The objects have well-defined behaviors.
They interact with each other through the use of messages. When a task needs to
be performed, an object is passed a message which specifies the task requirements.
The receiving object then takes appropriate action in response to the message and
responds by returning a message to the sender. In performing the required task,
the receiver may need assistance from other objects, thereby prompting further
messages to be sent.
These ideas are illustrated in Figure 8.2 which depicts the simulation of a
seaport facility. Ocean ships arrive for docking, unloading, loading, and departing.
When the facilities (tugboats, berths, and loading and unloading equipment and
crews) are busy, arriving ships must queue and wait at sea until space and other
facilities are available. The harbor master coordinates the arrivals and departures
by assigning tugs and other resources to the arriving ships. The objects in this
example are, of course, the individual ships, the tugs, the docks, the harbor master,
and the cargo handling facilities. Actions are initiated by message passing between
these objects. The dashed lines connecting members of the class of sea vessels
depict the way common characteristics and operational behaviors are shared by
members of the same class (they all have a coordinate position, a maximum cruising
speed, cargo capacity, and so on).
Figure 8.2 Objects communicating to complete a task
Tasks are performed when a message is sent from one object to another. For
example, the harbor master may send a message to a tug to provide assistance to
ship 87 in deberthing from dock 2. This would then result in a sequence of actions
from the tug having received the message.
In general, a task may consist of any definable operation, such as changing
an object's position, loading cargo, manipulating a character string, or popping up
a prompt window. A complete program would then be a sequence of the basic
tasks such as the simulated movement of ships into and out of the seaport after
discharging and taking on cargo.
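The harbor master and tug interaction can be illustrated with a short Python sketch. The classes Tug, Ship, and HarborMaster and the methods assist, depart, and deberth are hypothetical names introduced for the illustration; only the scenario itself (a tug helping ship 87 leave dock 2) comes from the text.

```python
class Ship:
    def __init__(self, name):
        self.name = name
        self.at_sea = False

    def depart(self):
        self.at_sea = True


class Tug:
    def __init__(self, name):
        self.name = name
        self.log = []

    def assist(self, ship, dock):
        # The tug decides for itself which actions the task requires.
        self.log.append(f"{self.name} moves to dock {dock}")
        self.log.append(f"{self.name} tows {ship.name} out of dock {dock}")
        ship.depart()  # a further message, sent to another object
        return "done"  # reply returned to the sender


class HarborMaster:
    def deberth(self, tug, ship, dock):
        # Send a message to the tug; the details of the task stay hidden.
        return tug.assist(ship, dock)


tug, ship87 = Tug("tug1"), Ship("ship 87")
reply = HarborMaster().deberth(tug, ship87, dock=2)
print(reply, ship87.at_sea, tug.log)
```

The harbor master knows only the message and its reply; the sequence of actions belongs to the tug.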
In this section we present definitions for the basic concepts that make up an OOS:
the object, message, class, methods, and class hierarchies. There are probably as
many as fifty different OOS languages, and the examples presented in this section
may not comply exactly with any one in particular. The examples are representative
of all OOS, however, and are based mainly on material from the Smalltalk family,
including Smalltalk 80 (Goldberg and Robson, 1983, and Kaehler and Patterson,
1986), Smalltalk/V (Digitalk, Inc., 1986), and Little Smalltalk (Budd, 1987).
Specialized OOS languages are considered in Section 8.5.
Sec. 8.3 Objects, Classes. Messages, and Methods 151
Objects
Objects are the basic building blocks in object-oriented systems. All entities except
parts of a message, comments, and certain punctuation symbols are objects. An
object consists of a limited amount of memory which contains other objects (data
and procedures). They are encapsulated together as a unit and are accessible to
that object only. Examples of objects are numbers such as 5, 31, 6.213, strings
like 'this is a string', arrays such as #(23 'a string' 311 (3 4 5)), the Turtle (a
global graphics object originally used in LOGO), a windshield wiper as described
above, a ship, and so on. Objects are characterized by attributes and by the way
they behave when messages are sent to them. All objects belong to some class.
They are created by declaring them as instances of an existing class and instantiating
instance variables. The class to which an object belongs can be determined by
sending it the message "class."
Messages
Actions are performed in an OOS by sending messages to an object. This corresponds
to a function or procedure call in other languages. The messages are formatted
strings composed of three parts: a receiver object, a message selector, and a sequence
of zero or more arguments. The format of a message is given as
<object> <selector> <arg1 arg2 . . .>
The object identifies the receiver of the message. This field may contain an
object item or another message which evaluates to an object. The selector is a
procedure name. It specifies what action is required from the object. The arguments
are objects used by the receiver object to complete some desired task. Messages
may also be given in place of an argument since a message always elicits an object
as a response.
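The receiver, selector, and argument parts of a message can be mimicked in Python with a small dispatch helper. The function send below is a hypothetical helper written for this illustration; it is not part of Smalltalk or of any library. It uses Python's built-in getattr to find the procedure named by the selector.

```python
def send(receiver, selector, *args):
    """Deliver a message: look up `selector` on the receiver and apply it."""
    method = getattr(receiver, selector)
    return method(*args)


# The message 9 - 5: receiver 9, selector __sub__ (Python's name for -), argument 5.
result = send(9, "__sub__", 5)

# A message may stand in place of an argument, since it always yields an object.
nested = send(9, "__sub__", send(2, "__add__", 3))
print(result, nested)
```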
When an object receives a valid message, it responds by taking appropriate
actions (such as executing a procedure or sending messages to other objects) and
then returning a result. For example, the message 9 - 5 causes the receiver object
9 to respond to the selector - by subtracting 5 from 9 and returning the object 4.
There are three types of messages: unary, binary, and keyword (n-ary). All
three types parse from left to right, but parentheses may be used to determine the
order of interpretation. A unary message requires no arguments. For example, each
of the following is a unary message:
5 sign
10 factorial
'once upon a time' size
#(a b c d) reversed
68 asCharacter
In each of these examples, the first item in the message is the receiver object, and
the second item the selector. The first example returns the integer +1 to signify a
positive algebraic sign for the number 5. The second example returns 3628800, the
factorial value of the integer 10. The third example returns 16, the length of the
string. The fourth returns the array #(d c b a), and the fifth returns D, the ASCII
character equivalent of 68.
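For readers more familiar with function-call notation, the five unary messages have close Python counterparts. The sketch below is an analogy only; the helper sign is defined here because Python has no built-in of that name.

```python
import math


def sign(n):
    """Return +1, 0, or -1 according to the algebraic sign of n."""
    return (n > 0) - (n < 0)


results = [
    sign(5),                               # 5 sign
    math.factorial(10),                    # 10 factorial
    len("once upon a time"),               # 'once upon a time' size
    list(reversed(["a", "b", "c", "d"])),  # #(a b c d) reversed
    chr(68),                               # 68 asCharacter
]
print(results)
```

In each line the receiver of the Smalltalk message appears as the argument of the Python function.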
Binary messages take one argument. Arithmetic operations are typical of binary
messages, where the first operand is the receiver, the selector is the arithmetic
operation to be performed, and the second operand is the argument. Examples of
binary messages are
Comments may be placed anywhere within an OOS program using the double
quotation marks as seen in the above examples. Note that the last three examples
are nonarithmetic binary messages. They result in the combining of two arrays
into one, a boolean relational test, and the creation of a graphics coordinate point
at column 7, row 12, respectively.
The third and most general type of message is the keyword message. These
messages have selectors which consist of one or more keyword identifiers, where
each is followed by a colon and an argument. The argument can be an object or
any message, but if it is another keyword message, it must be enclosed in parentheses
to avoid ambiguity. Examples of keyword messages are
The last two examples above contain messages within messages, while the
last example has a message delimited with parentheses. In executing a message
without parentheses, the execution proceeds left to right with unary messages taking
precedence followed by binary, and then keyword. Therefore, the messages 'texas'
size and 4 factorial are completed before the keyword part between:and: in the last
example above.
Methods
Procedures are called methods. They determine the behavior of an object when a
message is sent to the object. Methods are the algorithms or sequence of instructions
executed by an object. For example, in order to respond to the message 5 + 7,
the object 5 must initiate a method to find the sum of the integer numbers 5 and 7.
On completion of the operation, the method returns the object 12 to the sending
object.
Methods are defined much like procedures in other programming languages
using the constructs and syntax of the given OOS. The constructs used to build
higher level methods are defined in terms of a number of primitive operations and
basic methods provided as part of the OOS. The primitives of an OOS are coded
in some host language such as an assembler language or C. For example, the operation
for integer addition used in some versions of Smalltalk would be written as
+ aNumber
    <SameTypeOfObject self aNumber>
        ifTrue: [↑ <IntegerAddition self aNumber>]
        ifFalse: [↑ super + aNumber]
The name of this method is + and the argument is an object of type aNumber.
The primitive operation SameTypeOfObject tests whether the two object arguments
are of the same type (instances of the same class). The variable self is a temporary
variable of an instance of the class it belongs to, Integer. If the two objects are of
the same type, the primitive IntegerAddition in the ifTrue block of code is executed
and the sum returned. Otherwise, a search for an appropriate method is made by
checking the superclass of this class (the class Number). The up-arrow signifies
the quantity to be returned by the method.
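The same pattern, a fast path for operands of the same class with a fallback to the superclass, can be sketched in Python. The classes Number and Integer below are stand-ins written for this illustration and are unrelated to Python's own numeric types; the method name add is likewise an assumption.

```python
class Number:
    """Stand-in superclass offering a generic addition."""

    def __init__(self, value):
        self.value = value

    def add(self, other):
        # Generic fallback: coerce both operands to float before adding.
        return Number(float(self.value) + float(other.value))


class Integer(Number):
    def add(self, other):
        # Same class: use the fast "primitive" (here, Python int addition).
        if type(other) is type(self):
            return Integer(self.value + other.value)
        # Otherwise defer to the superclass method, as super + aNumber does.
        return super().add(other)


same = Integer(5).add(Integer(7))
mixed = Integer(5).add(Number(2.5))
print(type(same).__name__, same.value, type(mixed).__name__, mixed.value)
```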
A typical OOS may have as many as a few hundred predefined primitives
and basic methods combined. We will see examples of some typical methods in
the next section.
A class is a general object that defines a set of individual (instance) objects which
share common characteristics. For example, the class of rabbits contains many
individual rabbit objects, each with four legs, long ears, whiskers, and short, bushy tails.
The class of natural numbers contains many instance objects such as 43, 91, 2, . . . .
All objects are instances of some class and classes are subclasses of some higher
class, except for a most general root class. The root class for an OOS is the class
named Object.
Classes can often be divided into subclasses or merged into superclasses. The
class of fruit can be divided into citrus and noncitrus, both of which can be further
divided. Fruit is part of the superclass of plant-grown foods which in turn is part
of the class of all foods. Classes permit the formation of hierarchies of objects
which can be depicted as a tree or taxonomic structure as illustrated in Figure 8.3.
Objects belonging to the same class have the same variables and the same
methods. They also respond to the same set of messages called the protocol of the
class. Each class in a hierarchy inherits the variables and methods of all of its
parents or superclasses.
When a message is sent to an object, a check is first made to see if the
methods for the object itself or its immediate class can perform the required task.
If not, the methods of the nearest superclass are checked. If they are not adequate,
the search process continues up the hierarchy recursively until methods have been
found or the end of a chain has been reached. If the required methods are not
found, an error message is printed.
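The upward search for a method can be written out explicitly. The following Python sketch assumes a single-inheritance hierarchy; the function lookup and the classes Food, Fruit, and Citrus are hypothetical names used only for the illustration (Python performs an equivalent search automatically).

```python
def lookup(cls, selector):
    """Walk up a single-inheritance chain until a class defines `selector`."""
    while cls is not None:
        if selector in vars(cls):
            return cls, vars(cls)[selector]
        # Move to the nearest superclass (None once the root has been passed).
        cls = cls.__base__
    raise AttributeError(f"message not understood: {selector}")


class Food:
    def describe(self):
        return "edible"


class Fruit(Food):
    pass


class Citrus(Fruit):
    def peel(self):
        return "peeled"


owner, method = lookup(Citrus, "describe")
print(owner.__name__, method(Citrus()))
```

The search starts at Citrus, fails there and at Fruit, and succeeds at Food. A selector defined nowhere in the chain ends in the error case.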
Some OOSs permit classes to have two or more direct superclasses (Stefik
and Bobrow, 1986). For example, StereoSystem may have superclasses of Appliances,
LuxuryGoods, and FragileCommodity. As such, a stereo object may inherit
characteristics and methods from all three superclasses. When this is the case, an inheritance
precedence must be defined among the superclasses. One approach would be to try
the leftmost superclass path in the hierarchy first. If applicable methods are not
found up this path, the next leftmost path is taken. This process continues progressively
shifting to the right until a method is found or failure occurs.
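Python is one language that permits several direct superclasses, so the stereo example can be tried directly. The method names handling, tax, and shipping are invented for this sketch. Python's precedence rule (C3 linearization) is more elaborate than a plain leftmost-first search, but for a simple hierarchy such as this one the two give the same order.

```python
class Appliances:
    def handling(self):
        return "plug in"


class LuxuryGoods:
    def handling(self):
        return "insure"

    def tax(self):
        return "luxury tax"


class FragileCommodity:
    def handling(self):
        return "pad well"

    def shipping(self):
        return "this side up"


# Superclasses are listed left to right in precedence order.
class StereoSystem(Appliances, LuxuryGoods, FragileCommodity):
    pass


stereo = StereoSystem()
order = [c.__name__ for c in StereoSystem.__mro__]
print(order, stereo.handling(), stereo.tax(), stereo.shipping())
```

The handling method is found on the leftmost path, while tax and shipping are found only after the search shifts to the right.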
An OOS will have many predefined classes. For example, a few of the classes
for the Smalltalk family and their hierarchical structure are depicted in Figure 8.4.
Each of the classes depicted in Figure 8.4 has a number of methods that
respond to the protocol for the class. A class may also inherit methods from a
superclass. For example, all classes inherit the method "== anObject" which
answers true if the receiver and anObject are the same, and answers false otherwise.
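Figure 8.3 [class hierarchy of foods; legible node labels: Plant, Grains, Fruit, Vegetable, Dairy, Beef, Chicken, Fish, Citrus, Noncitrus, Lemon]
Figure 8.4 [part of the Smalltalk class hierarchy; legible node labels: Object, Char, Number, KeyedCollection, SequenceableCollection, Dictionary, ArrayedCollection, List, File]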
paths := 6.
paths timesRepeat: [ship move: 100; turn: 360 // paths].
10. The variable self in a method refers to the receiver object of the message
that invokes the method. The variable super is used in a method to invoke a search
for a method in an object's superclass.
In addition to the above examples, an OOS will have many special methods
for the definition of classes and objects and for the definition of class behaviors
and class related tasks.
This system would have as a minimum the three events: (1) cargo ship arrivals,
(2) ship berthing operations, and (3) cargo transfer and ship departures. These events
are symbolized by the following expressions which will be used in our program.
shipArrival
shipDocking
shipDeparture
In the interest of clarity, several simplifying assumptions are made for the
simulation problem. First, we limit our objects of interest to three classes, namely
the class of ships (three types of cargo ships), the group of entities which make up
the harbor operations (tugs, docks, cranes, crews, and the like) treated collectively
as one object class called HarborOperations, and the class called Simulator. The
Simulator class is defined to permit separate simulation runs, that is, separate instances
of Simulator.
Second, we assume that ships arriving to find all berths full depart immediately
from the system. Ships arriving when at least one of the eight berths is available
are scheduled for docking and cargo transfer. Once a ship has been docked, its
departure is then scheduled.
To add some realistic randomness to the operation, the time between ship
arrivals is assumed to be exponentially distributed with a mean value of 3 time
units. The time to dock is assumed to be uniformly distributed with a range of 0.5
to 2.0 time units, and the time to transfer cargo is assumed to be exponentially
distributed with a mean of 14 time units. Finally, to simulate three different types
of ships, a newly arriving ship is randomly assigned a cargo weight of 10, 20, or
30 thousand tons from an empirical distribution with probabilities 0.2, 0.5, and
0.3, respectively.
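Samples from these four distributions can be drawn with the Python standard library, as in the sketch below. The function names and the fixed seed are assumptions made for the illustration; the means, range, weights, and probabilities are the ones stated above. Note that random.expovariate takes a rate, which is the reciprocal of the mean.

```python
import random

rng = random.Random(1)  # fixed seed so that a run can be repeated


def next_interarrival():
    # Exponential with mean 3 time units (rate = 1 / mean).
    return rng.expovariate(1 / 3)


def next_docking_time():
    # Uniform between 0.5 and 2.0 time units.
    return rng.uniform(0.5, 2.0)


def next_service_time():
    # Exponential with mean 14 time units.
    return rng.expovariate(1 / 14)


def next_cargo_weight():
    # Thousand tons, drawn with probabilities 0.2, 0.5, and 0.3.
    return rng.choices([10, 20, 30], weights=[0.2, 0.5, 0.3])[0]


samples = [next_interarrival() for _ in range(10000)]
mean_interarrival = sum(samples) / len(samples)
print(mean_interarrival)  # should be close to 3
```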
The three types of simulated events may occur at any discrete time point,
and they may even occur concurrently. To manage these events, we require a system
clock to assign scheduled event times and a data structure in which to record all
pending events. At any time during the run, the pending event list could include a
scheduled ship arrival at current time $t_c$ plus some time increment $t_1$, the berthing
of a ship at time $t_c + t_2$, and the departures of one or more ships at $t_c + t_3$,
$t_c + t_4$, and so on. Scheduled events are removed from the list of pending events
in the order of smallest time value first.
Pending events are held in a dictionary data structure which contains index-value
pairs of objects. And, since multiple events may occur at the same time
points, we use a set to hold all events indexed by the same time value. Thus,
pending events will be stored in a dictionary of indexed sets with each set containing
one or more of the basic events.
Messages required to access sets and dictionary objects (collectively referred
to as collections) are needed in the program. The messages and the corresponding
actions they elicit are as follows.
For output from the simulation, we print the arrival time of each ship, indicating
whether it docks or not, each ship departure, and the total cargo transferred at the
end of the run.
With the above preliminaries, we now define the three classes and their corre-
sponding methods. We begin with the class Simulator which is the most complicated.
To define a class, the word Class is given followed by the class name and (optionally)
by the class's immediate superclass; if no superclass is given, the default class
Object is assumed. This is followed by a list of local variables within vertical bar
delimiters. The method protocol for the class is defined next, with the methods
separated by vertical bars. The message template for each method is given as the
first item following the vertical bar. When local variables for a method are needed,
they follow the message template, also given within vertical bars.
Class Simulator
| currentTime eventsPending |
    new
        eventsPending := Dictionary new.
        currentTime := 0
    time
        ↑ currentTime
The method responding to the message addEvent checks to see if a time value
(key) exists in the dictionary. If so, it adds the new event to the set under the key.
If the time does not already exist, a new set is created, and the event is added to
the set and the set put in eventsPending. The proceed method finds the smallest
event time (key) and retrieves and removes the first element of the set located
there. If the resultant set is empty, the key for the empty set is removed. A message
is then sent to the processEvent object in the class HarborOperations which is defined
next.
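The behavior described for addEvent and proceed can be sketched in Python as follows. This is a reading of the prose description, not a transcription of the Smalltalk listing. A list is used in place of a set for the events held at each time value so that "the first element" is well defined, and the names add_event, proceed, and process_event are Python-style renderings chosen for the sketch.

```python
class Simulator:
    def __init__(self):
        self.current_time = 0
        self.events_pending = {}  # time value -> list of event callables

    def time(self):
        return self.current_time

    def add_event(self, event, at):
        # Create the collection for this time on first use, then add the event.
        self.events_pending.setdefault(at, []).append(event)

    def proceed(self):
        # Take one event from the smallest pending time value.
        at = min(self.events_pending)
        event = self.events_pending[at].pop(0)
        if not self.events_pending[at]:
            del self.events_pending[at]  # drop the key of an empty collection
        self.current_time = at
        self.process_event(event)

    def process_event(self, event):
        event()  # a subclass may add its own bookkeeping around this call


sim, trace = Simulator(), []
sim.add_event(lambda: trace.append("b"), at=2.0)
sim.add_event(lambda: trace.append("a"), at=1.5)
sim.add_event(lambda: trace.append("c"), at=2.0)
while sim.events_pending:
    sim.proceed()
print(trace, sim.time())
```

Events are processed in order of time value, and two events scheduled for the same time are both retained under one key.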
new
    totalCargo := 0.
    remainingBerths := 8.
    arrivalDistribution := Exponential new: 3.
    shipDistribution := DiscreteProb new: #(0.2 0.5 0.3).
    dockingDistribution := Uniform new: #(0.5 2.0).
    serviceDistribution := Exponential new: 14.
    self scheduleArrival
processEvent: event
    event value.
    ('ship arrived at ', self time) print.
    totalCargo := totalCargo + (shipSize * 10).
    self scheduleArrival
reportCargo
    ('total cargo transferred ', totalCargo) print
The method new initializes some variables, including the arrival, docking,
and service distributions. A sample from a distribution is obtained by sending the
distribution the message next. (The programming details for the generation of random
samples from all distributions have been omitted in the interest of presenting a
more readable example.) The scheduleArrival method sends a message to the Ship
class to create a new ship and then adds the event block
to the pending event list at the arrival time value. The processEvent method is
activated from the proceed method in the Simulator class. It initiates evaluation of
the event block stored in the pending list, prints a ship arrival message, and computes
the new total for the cargo discharged.
Next, we define the class Ship.
Class Ship
| shipSize |
    new
        shipSize := shipDistribution next
    shipSize
        ↑ shipSize
With the object classes defined, we can now write the statements for the
three object events. The arrival event is initiated from the HarborOperations class.
This event then schedules the next operation (docking), which in turn schedules
the ship departure event.
shipArrival: ship
    (remainingBerths > 0)
        ifTrue: [remainingBerths := remainingBerths - 1.
                 self addEvent: [self shipDocking: ship]
                      at: (self time + dockingDistribution next)]
        ifFalse: ['all berths occupied, ship departs' print]
shipDocking: ship
    totalCargo := totalCargo + shipSize.
    self addEvent: [self shipDepart: ship]
         next: (serviceDistribution next)
shipDepart: ship
    'ship departs after cargo transfer' print.
    remainingBerths := remainingBerths + 1
Note that the message "port proceed" in the whileTrue block is sent to
HarborOperations. Since there is no method proceed in this class, it must be inherited
from the Simulator class.
8.5 OBJECT-ORIENTED LANGUAGES AND SYSTEMS
An environment for an OOS will usually include all of the basic primitives,
class and method definitions, an editor, a browser, window and mouse facilities,
and a graphics output.
In FLAVORS, classes are created with the defflavor form, and methods of the
flavor are created with defmethod. An instance of a flavor is created with a
make-instance type of function. For example, to create a new flavor (class) of ships with
instance variables x-position, y-position, x-velocity, y-velocity, and cargo-capacity,
the following expression is evaluated:
Methods for the ship flavor are written in a similar manner with a defmethod, say
for the ship's speed, as
(defmethod (ship :speed) ()
  (sqrt (+ (* x-velocity x-velocity)
           (* y-velocity y-velocity))))
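For comparison, the same ship class and speed method can be written in Python. This is an analogy to the FLAVORS forms rather than a translation of them; the instance variable names follow the text, with underscores in place of hyphens, and the velocity values used at the end are illustrative.

```python
import math


class Ship:
    """Counterpart of the ship flavor; unset variables default to None."""

    def __init__(self, x_position=None, y_position=None,
                 x_velocity=None, y_velocity=None, cargo_capacity=None):
        self.x_position = x_position
        self.y_position = y_position
        self.x_velocity = x_velocity
        self.y_velocity = y_velocity
        self.cargo_capacity = cargo_capacity

    def speed(self):
        # Same computation as the :speed method: sqrt(vx*vx + vy*vy).
        return math.sqrt(self.x_velocity ** 2 + self.y_velocity ** 2)


ship = Ship(x_velocity=3.0, y_velocity=4.0)  # illustrative values
print(ship.speed())
```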
Values for the ship instance variables can now be assigned either with a message
or when an instance of the ship is created.
Variable assignments can be examined with the describe method, one of the base
methods provided with the system.
(describe ship42)
#<SHIP 1234567>, an object of flavor SHIP,
has instance variable values:
X-POSITION 5.0
Y-POSITION 8.0
X-VELOCITY unbound
Y-VELOCITY unbound
CARGO-CAPACITY 22.5
Default values can also be assigned to instance variables with the defvar statement.
"mixed." Inheritance of methods is then achieved much the same as in the Smalltalk
case. For example, to create a ship flavor which has two superclass flavors named
moving-object and pleasure-craft, the following form would be used:
If the flavor moving-object has a method for speed, it will be inherited unless a
method for speed has been defined explicitly for the ship flavor.
The base flavor for FLAVOR extensions is the vanilla-flavor. This flavor will
typically have a number of base methods including a :print-self and describe method
as used above. Generic instance variables are also defined for the vanilla-flavor.
Through method inheritance and other forms, methods may be combined to give a
variety of capabilities for flavors including the execution of some methods just
prior to or just after a main method.
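Python has no built-in counterpart to this kind of method combination, but the effect of running code just before and just after an inherited main method can be imitated with an overriding method that calls its superclass. The class and method names below are invented for the sketch.

```python
class VanillaShip:
    def move(self, log):
        log.append("main: move")


class LoggedShip(VanillaShip):
    def move(self, log):
        log.append("before: check fuel")   # runs just prior to the main method
        super().move(log)                  # the inherited main method
        log.append("after: update chart")  # runs just after the main method


log = []
LoggedShip().move(log)
print(log)
```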
A typical special purpose OOS language is ROSS (for Rand OOS) developed by
the Rand Corporation for military battle simulations (Klahr et al. 1980, 1982).
This system has been implemented in several dialects of LISP as an interactive
simulation system which includes a movie generator and graphics facility. Visual
representations can be generated as the simulation is running, and on-the-fly changes
can easily be made to a program. It has been used to simulate both air and ground
battles.
Messages are sent to objects in ROSS with an "ask" form having the following
structure.
Figure 8.5 [class hierarchy for a battle simulation; legible node labels: Simulator, Red forces, Blue forces]
already predefined. A typical class hierarchy for a battle simulation might be defined
as the structure illustrated in Figure 8.5.
8.6 SUMMARY
EXERCISES
8.1. Show the order of evaluation for the subexpressions given in the following expression:
9/2 between: 8 + 19 sqrt and: 45
8.2. What values will be printed after the following sequences?
a. i := 17.
j := [i + 1].
i print
b. j value print (after the sequence in a above)
c. i value print (after the sequence in b above)
8.3. What is the class of Class? What is the superclass of Class?
8.4. What is the result from typing the following expression?
3- (4 print) * 6
8.5. A bag is like a set except the same item may occur more than once. One way to
implement the class Bag is with a dictionary, where the value contained in the dictionary
is the number of times the item occurs in the bag. A partial implementation for a bag
is given below. Complete the implementation, keeping in mind that instances of Dictionary
respond to first and next with values and not keys. The current key is accessible,
however, if currentKey is used.
new
    dict := Dictionary new
some methods go here.
first
    (count := dict first) isNil ifTrue: [↑ nil].
    count := count - 1.
    ↑ dict currentKey
next
    [count notNil] whileTrue:
        [(count > 0)
            ifTrue: [count := count - 1. ↑ dict currentKey]
            ifFalse: [count := dict next]].
    ↑ nil
8.6. One method of defining a discrete probability distribution is to provide the actual sample
space elements in a collection. A random sample can then be obtained from the collection
entries. Produce a class description for a class called SampleSpace which will be used
to randomly select points using the following:
8.7. Modify the simulation program given in Section 8.4 to collect use statistics on the
tugs: at the end of the run a printout of average tug usage should be made.
PART 3
Knowledge Organization and Manipulation
In the next three chapters we examine the organization and manipulation of knowledge.
This chapter is concerned with search, an operation required in almost all AI programs.
Chapter 10 covers the comparison or matching of data structures and in particular
pattern matching, while Chapter 11 is concerned with the organization of knowledge
in memory.
Search is one of the operational tasks that characterize AI programs best.
Almost every AI program depends on a search procedure to perform its prescribed
functions. Problems are typically defined in terms of states, and solutions correspond
to goal states. Solving a problem then amounts to searching through the different
states until one or more of the goal states are found. In this chapter we investigate
search techniques that will be referred to often in subsequent chapters.
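The idea of searching through states until a goal state is found can be made concrete with a small example. The Python sketch below uses breadth-first search, which is only one of many possible strategies and is chosen here for brevity; the function search and the toy problem (reaching 10 from 1 by adding 1 or doubling) are assumptions made for the illustration.

```python
from collections import deque


def search(start, is_goal, successors):
    """Breadth-first search; returns a path of states to a goal, or None."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        state = path[-1]
        if is_goal(state):
            return path
        for nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None


# Toy problem (hypothetical): reach 10 from 1 using "add 1" or "double".
path = search(
    1,
    lambda s: s == 10,
    lambda s: [n for n in (s + 1, s * 2) if n <= 10],
)
print(path)
```

A problem is specified here by a start state, a goal test, and a function that generates successor states; the search procedure itself knows nothing else about the problem.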
9.1 INTRODUCTION
Consider the process of playing a game such as chess. Each board configuration
can be thought of as representing a different state of the game. A change of state
occurs when one of the players moves a piece. A goal state is any of the possible
board configurations corresponding to a checkmate.
It has been estimated that the game of chess has more than $10^{120}$ possible
states. (To see this, just note that there are about 20 alternative moves for each
board configuration and more than 100 different configurations. Thus, there are
more than $20^{100} = 10^{100} \times 2^{100} > 10^{120}$.) This is another example of the combinatorial
explosion problem. The number of states grows exponentially with the number of
basic elements. Winning a game amounts to finding a sequence of states through
this maze of possible states that leads to one of the goal states.
An "intelligent" chess playing program certainly would not play the game
by exploring all possible moves (it would never finish in our lifetime nor in your
distant descendants' lifetimes). Like a human, the program must eliminate many
questionable states when playing. But, even with the elimination of numerous states,
there is still much searching to be done since finding good moves at each state of
the game often requires looking ahead a few moves and evaluating the consequences.
This type of problem is not limited to games. Search is ubiquitous in AI. For
every interesting problem there are numerous alternatives to consider. When attempting
to understand a natural language, a program must search to find matching words
that are known (a dictionary), sentence constructions, and matching contexts. In
vision perception, program searches must be performed to find model patterns that
match input scenes. In theorem proving, clauses must be found by searching axioms
and assertions which resolve together to give the empty clause. This requires a
search of literals which unify and then a search to find resolvable clauses. In planning
problems, a number of potential alternatives must be examined before a good workable
plan can be formulated. In learning, many potential hypotheses must be considered
before a good one is chosen.
9.2 PRELIMINARY CONCEPTS
Time and space complexities of algorithms may be defined in terms of their best,
their average, or their worst-case performance in completing some task. In evaluating
different search strategies, we follow the usual convention of considering worst-case
performances and look for ways to improve on them. For this, we need the O
(for order) notation.
Let f and g be functions of n, where n is the size of the problem given to
algorithm A. The size can be the number of problem states, the number of input
characters which specify the problem, or some similar number. Let f(n) denote the
time (or space) required to solve a given problem using algorithm A. We say "f is
big O of g," written f = O(g), if and only if there exist a constant c > 0 and an
integer $n_0$ such that $f(n) \le c\,g(n)$ for all $n \ge n_0$. Stated more simply,
algorithm A solves a problem in at most $c\,g(n)$ units or steps for all but a finite
number of problem sizes. Based on this definition, we say an algorithm is of linear
time if it is $O(n)$. It is of quadratic time if it is $O(n^2)$, and of exponential
time if it is $O(2^{kn})$ for some constant k (or if it is $O(b^n)$ for any real
number b > 1).
For example, if a knowledge base has ten assertions (clauses), with an average
of five literals per clause, and a resolution proof is being performed with no particular
strategy, a worst-case proof may require as many as 1125 comparisons ($5^2 \times (10 \times 9)/2$)
for a single resolution and several times this number for a complete proof.
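The figure of 1125 can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes, as the count implies, that every unordered pair of clauses is a candidate for resolution and that each literal of one clause is compared with each literal of the other. The function name is illustrative and does not come from the text.

```python
from math import comb

def worst_case_comparisons(num_clauses: int, literals_per_clause: int) -> int:
    """Literal comparisons needed to try every pair of clauses once."""
    clause_pairs = comb(num_clauses, 2)        # 10 * 9 / 2 = 45 unordered pairs
    literal_pairs = literals_per_clause ** 2   # 5 * 5 = 25 comparisons per pair
    return clause_pairs * literal_pairs

print(worst_case_comparisons(10, 5))  # 1125
```

Doubling the number of clauses to 20 raises the count to 4750, which shows how quickly the cost of an unguided proof grows.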
representation unless noted otherwise. And-Or graph searches are covered in Section
9.6.
9.3 EXAMPLES OF SEARCH PROBLEMS
In this section we describe three typical problems which illustrate the concepts defined
above and which are used in subsequent sections to portray different search techniques.
The problems considered are the often-used examples, the eight puzzle and the
traveling salesman problem.
The eight puzzle consists of a 3-by-3 square frame which holds eight movable
square tiles which are numbered from 1 to 8. One square is empty, permitting tiles
to be shifted (Figure 9.2). The objective of the puzzle is to find a sequence of tile
movements that leads from a starting configuration to a goal configuration such as
that shown in Figure 9.2.
[Figure 9.2 The eight puzzle game: a start configuration and a goal configuration.]
The states of the eight puzzle are the different permutations of the tiles within
the frame. The operations are the permissible moves (one may consider the empty
space as being movable rather than the tiles): up, down, left, and right. An optimal
or good solution is one that maps an initial arrangement of tiles to the goal configuration
with the smallest number of moves.
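The state and operator description above can be sketched in Python as follows. The representation is an assumption made for illustration: a state is a tuple of nine numbers read row by row, with 0 standing for the empty square, and the four operators move the empty square. The sample configuration is hypothetical and is not taken from Figure 9.2.

```python
from typing import Dict, List, Tuple

State = Tuple[int, ...]  # nine entries read row by row; 0 marks the empty square

# Moves of the empty square, expressed as (row change, column change).
MOVES: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1),
}

def successors(state: State) -> List[Tuple[str, State]]:
    """Return (move, new_state) pairs reachable by moving the empty square once."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    result = []
    for name, (dr, dc) in MOVES.items():
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:          # stay inside the 3-by-3 frame
            target = r * 3 + c
            cells = list(state)
            cells[blank], cells[target] = cells[target], cells[blank]
            result.append((name, tuple(cells)))
    return result

# Hypothetical configuration with the empty square in the center.
sample: State = (1, 2, 3,
                 8, 0, 4,
                 7, 6, 5)
for move, nxt in successors(sample):
    print(move, nxt)
```

With the empty square in the center all four moves apply; in a corner only two do, so the branching factor of the search tree varies between two and four.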
The search space for the eight puzzle problem may be depicted as the tree
shown in Figure 9.3.
In the figure, the nodes are depicted as puzzle configurations. The root node
represents a randomly chosen starting configuration, and its successor nodes correspond
to the three single tile movements that are possible from the root. A path is
a sequence of nodes starting from the root and progressing downward to the goal
node.
The traveling salesman problem involves n cities with paths connecting the cities.
A tour is any path which begins with some starting city, visits each of the other
cities exactly once, and returns to the starting city. A typical tour is depicted in
Figure 9.4.
The objective of a traveling salesman problem is to find a minimal distance
tour. To explore all such tours requires an exponential amount of time. For example,
a minimal solution with only 10 cities is tractable (3,628,800 tours). One with 20
or more cities is not, since a worst-case search requires on the order of 20! (about
$2.4 \times 10^{18}$) tours. The state space for the problem can also be represented as a
graph as depicted in Figure 9.5.
Without knowing in advance the length of a minimum tour, it would be necessary
to traverse each of the distinct paths shown in Figure 9.5 and compare their lengths.
This requires some O(n!) traversals of the graph, a number that grows at least
exponentially with n.
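An exhaustive search of this kind can be sketched in a few lines of Python. The distance table is hypothetical, and the sketch fixes the starting city at 0, so it enumerates (n - 1)! orderings of the remaining cities instead of n!; it is meant only to show why the method stops being practical as n grows.

```python
from itertools import permutations
from math import factorial, inf

def shortest_tour(dist):
    """Exhaustively search all tours that start and end at city 0."""
    n = len(dist)
    best_len, best_tour = inf, None
    for order in permutations(range(1, n)):      # (n - 1)! orderings
        tour = (0,) + order + (0,)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if length < best_len:                     # keep the first shortest tour seen
            best_len, best_tour = length, tour
    return best_len, best_tour

# Hypothetical symmetric distances between four cities.
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
print(shortest_tour(dist))               # (18, (0, 1, 3, 2, 0))
print(factorial(10), factorial(20))      # 3628800 2432902008176640000
```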
The General Problem Solver was developed by Newell, Simon, and Shaw (Ernst
and Newell, 1969) in the late 1950s. It was important as a research tool for several
reasons and notable as the first AI system which cleanly separated the task knowledge
from the problem solving part.
General Problem Solver was designed to solve a variety of problems that
could be formulated as a set of objects and operators, where the operators were
applied to the objects to transform them into a goal object through a sequence of
applications.
Given an initial object (state) and a goal object (state), the system attempted
to transform the initial object to the goal object through a series of operator application
transformations. It used a set of methods similar to those discussed in Chapter 8
for each goal type, to achieve that goal by recursively creating and solving subgoals.
The basic method is known as means-end analysis, which we now describe.
[Figure 9.5 State space representation for the TSP.]
1. Comparing the current state $S_i$ to a goal state $S_g$ and computing the difference
In carrying out these methods, the General Problem Solver may transform
some $S_i$ into an intermediate state $S_k$ to reduce the difference $D_{ig}$ between states $S_i$
and $S_g$, then apply another operator $O_k$ to the state $S_k$, and so on until the state $S_g$ is
obtained. Differences that may occur between objects will, of course, depend on
the task domain.
As an example, in proving theorems in propositional logic, some common
differences that occur are: a variable may appear in one object and not in the other,
a variable may occur a different number of times in the two objects, objects may
have different signs or different connectives, associative groupings may differ, and
so on.
To illustrate the search process, we assume the General Problem Solver operators
are rewrite rules of the following form:
R1: (A V B) --> (B V A)
R2: (A & B) --> (B & A)
R3: (A --> B) --> (~B --> ~A)
R4: (A --> B) --> (~A V B)
9.4 UNINFORMED OR BLIND SEARCH
In a worst case situation the only information available will be the ability to
distinguish goal from nongoal nodes. When no further information is known a priori,
a search program must perform a blind or uninformed search. A blind or uninformed
search algorithm is one that uses no information other than the initial state, the
search operators, and a test for a solution. A blind search should proceed in a
systematic way by exploring nodes in some predetermined order or simply by selecting
nodes at random. We consider only systematic search procedures in this section.
Search programs may be required to return only a solution value when a goal
is found or to record and return the solution path as well. To simplify the descriptions
that follow, we assume that only the goal value is returned. Returning the path
as well requires making a list of nodes on the path or setting back-pointers to ancestor
nodes along the path.
Breadth-First Search
Breadth-first searches are performed by exploring all nodes at a given depth before
proceeding to the next level. This means that all immediate children of nodes are
explored before any of the children's children are considered. Breadth-first tree
search is illustrated in Figure 9.7. It has the obvious advantage of always finding a
minimal path length solution when one exists. However, a great many nodes may
need to be explored before a solution is found, especially if the tree is very full.
An algorithm for the breadth-first search is quite simple. It uses a queue structure
to hold all generated but still unexplored nodes. The order in which nodes are
placed on the queue for removal and exploration determines the type of search.
The breadth-first algorithm proceeds as follows.
BREADTH-FIRST SEARCH
1. Place the starting node s on the queue.
2. If the queue is empty, return failure and stop.
3. If the first element on the queue is a goal node g, return success and stop.
Otherwise,
4. Remove and expand the first element from the queue and place all the children
at the end of the queue in any order.
5. Return to step 2.
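The queue-based procedure can be sketched in Python as follows. The sketch assumes the search space is a tree, so no check for repeated states is made, and it returns only the goal node, in line with the simplification adopted above. The tree and the goal test are hypothetical.

```python
from collections import deque

def breadth_first_search(start, is_goal, children):
    """Return the first goal node found, or None if the queue empties."""
    queue = deque([start])                 # generated but unexplored nodes
    while queue:                           # an empty queue means failure
        node = queue[0]
        if is_goal(node):                  # test the first element on the queue
            return node
        queue.popleft()                    # remove and expand it
        queue.extend(children(node))       # children go to the END of the queue
    return None

# Hypothetical tree given as a dictionary from node to list of children.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
kids = lambda n: tree.get(n, [])
print(breadth_first_search("A", lambda n: n == "F", kids))  # F
```

Because children are appended at the end, every node at depth 1 (B, C) is examined before any node at depth 2 (D, E, F, G).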
The time complexity of the breadth-first search is $O(b^d)$, where b is the branching
factor of the tree. This can be seen by
noting that all nodes up to the goal depth d are generated. Therefore, the number
generated is $b + b^2 + \cdots + b^d$, which is $O(b^d)$. The space complexity is also
$O(b^d)$ since all nodes at a given depth must be stored in order to generate the
nodes at the next depth, that is, $b^{d-1}$ nodes must be stored at depth d - 1 to
generate nodes at depth d, which gives space complexity of $O(b^d)$. The use of
both exponential time and space is one of the main drawbacks of the breadth-first
search.
Depth-First Search
[Figure 9.8 Depth-first search of a tree.]
DEPTH-FIRST SEARCH
1. Place the starting node s on the queue.
2. If the queue is empty, return failure and stop.
3. If the first element on the queue is a goal node g, return success and stop.
Otherwise,
4. Remove and expand the first element, and place the children at the front of
the queue (in any order).
5. Return to step 2.
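A Python sketch of the depth-first procedure follows. Placing children at the front of the queue is the same as pushing them on a stack, so a plain list is used. The depth cutoff is an added assumption here: the numbered steps above do not mention it, and the parameter name depth_cutoff is illustrative. The tree is hypothetical.

```python
def depth_first_search(start, is_goal, children, depth_cutoff):
    """Return the first goal found within depth_cutoff levels, else None."""
    stack = [(start, 0)]                   # (node, depth); top of stack = front of queue
    while stack:
        node, depth = stack.pop()
        if is_goal(node):
            return node
        if depth < depth_cutoff:           # nodes at the cutoff are not expanded
            # Reversed so that the leftmost child is explored first.
            for child in reversed(children(node)):
                stack.append((child, depth + 1))
    return None

# Hypothetical tree given as a dictionary from node to list of children.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
kids = lambda n: tree.get(n, [])
print(depth_first_search("A", lambda n: n == "F", kids, depth_cutoff=2))  # F
print(depth_first_search("A", lambda n: n == "F", kids, depth_cutoff=1))  # None
```

The second call shows the effect of a cutoff that is too shallow: the goal at depth 2 is never generated.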
The depth-first search is preferred over the breadth-first when the search tree
is known to have a plentiful number of goals. Otherwise, depth-first may never
find a solution. The depth cutoff also introduces some problems. If it is set too
shallow, goals may be missed; if set too deep, extra computation may be performed.
The time complexity of the depth-first tree search is the same as that for
breadth-first, $O(b^d)$. It is less demanding in space requirements, however, since
only the path from the starting node to the current node needs to be stored. Therefore,
if the depth cutoff is d, the space complexity is just $O(d)$.
Bidirectional Search
When a problem has a single goal state that is given explicitly, and all node generation
operators have inverses, bidirectional search can be used. (This is the case with
the eight puzzle described above, for example.) Bidirectional search is performed
by searching forward from the initial node and backward from the goal node simultaneously.
To do so, the program must store the nodes generated on both search
frontiers until a common node is found. With some modifications, all three of the
blind search methods described above may be used to perform bidirectional search.
For example, to perform bidirectional depth-first iterative deepening search
to a depth of k, the search is made from one direction and the nodes at depth k are
stored. At the same time, a search to a depth of k and k + 1 is made from the
other direction and all nodes generated are matched against the nodes stored from
the other side. These nodes need not be stored, but a search of the two depths is
needed to account for odd-length paths. This process is repeated for lengths k = 0
to d/2 from both directions.
The time and space complexities for bidirectional depth-first iterative deepening
search are both $O(b^{d/2})$ when the node matching is done in constant time per node.
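The idea of two frontiers meeting in the middle can be sketched in Python. Note that this sketch uses the simpler breadth-first variant, not the iterative deepening variant described above, and it assumes an undirected graph so that every operator is its own inverse. Dictionaries give the constant-time node matching mentioned in the text. The graph is hypothetical.

```python
def bidirectional_search(start, goal, neighbors):
    """Return the length of a shortest path from start to goal, or None."""
    if start == goal:
        return 0
    # Each side maps a reached node to its distance from that side's origin.
    forward, backward = {start: 0}, {goal: 0}
    frontier_f, frontier_b = [start], [goal]
    while frontier_f and frontier_b:
        # Grow the smaller frontier by one full level.
        if len(frontier_f) <= len(frontier_b):
            frontier, seen, other = frontier_f, forward, backward
        else:
            frontier, seen, other = frontier_b, backward, forward
        next_level, best = [], None
        for node in frontier:
            for nxt in neighbors(node):
                if nxt in other:                   # the two frontiers meet
                    total = seen[node] + 1 + other[nxt]
                    best = total if best is None else min(best, total)
                elif nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    next_level.append(nxt)
        if best is not None:
            return best
        if frontier is frontier_f:
            frontier_f = next_level
        else:
            frontier_b = next_level
    return None

# Hypothetical undirected graph: a chain a-b-c-d-e plus an isolated node z.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"], "z": []}
print(bidirectional_search("a", "e", lambda n: graph[n]))  # 4
```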
Since the number of nodes to be searched using the blind search methods
described above increases as $b^d$ with depth d, such problems become intractable for
large depths. It, therefore, behooves us to consider alternative methods. Such methods
depend on some knowledge to limit the number of problem states visited. We turn
to these methods now in the next section.
9.5 INFORMED SEARCH
When more information than the initial state, the operators, and the goal test is
available, the size of the search space can usually be constrained. When this is the
case, the better the information available, the more efficient the search process
will be. Such methods are known as informed search methods. They often depend
on the use of heuristic information. In this section, we examine search strategies
based on the use of some problem domain information, and in particular, on the
use of heuristic search functions.
Heuristic Information
Information about the problem (the nature of the states, the cost of transforming
from one state to another, the promise of taking a certain path, and the characteristics
of the goals) can sometimes be used to help guide the search more efficiently.
This information can often be expressed in the form of a heuristic evaluation function
f(n, g), a function of the nodes n and/or the goals g.
Recall that a heuristic is a rule of thumb or judgmental technique that leads
to a solution some of the time but provides no guarantee of success. It may in fact
end in failure. Heuristics play an important role in search strategies because of the
exponential nature of most problems. They help to reduce the number of alternatives
from an exponential number to a polynomial number and, thereby, obtain a solution
Search methods based on hill climbing get their names from the way the nodes are
selected for expansion. At each point in the search path, a successor node that
appears to lead most quickly to the top of the hill (the goal) is selected for exploration.
This method requires that some information be available with which to evaluate
and order the most promising choices.
Hill climbing is like depth-first searching where the most promising child is
selected for expansion. When the children have been generated, alternative choices
are evaluated using some type of heuristic function. The path that appears most
promising is then chosen and no further reference to the parent or other children is
retained. This process continues from node-to-node with previously expanded nodes
being discarded. A typical path is illustrated in Figure 9.9 where the numbers by a
node correspond to the computed estimates of the goal distance for alternative paths.
Hill climbing can produce substantial savings over blind searches when an
informative, reliable function is available to guide the search to a global goal. It
suffers from some serious drawbacks when this is not the case. Potential problem
types named after certain terrestrial anomalies are the foothill, ridge, and plateau
traps.
The foothill trap results when local maxima or peaks are found. In this case
the children all have less promising goal distances than the parent node. The search
is essentially trapped at the local node with no indication of goal direction. The
only way to remedy this problem is to try moving in some arbitrary direction a
few generations in the hope that the real goal direction will become evident, backtracking
to an ancestor node and trying a secondary path choice, or altering the computation
procedure to expand ahead a few generations each time before choosing a path.
A second potential problem occurs when several adjoining nodes have higher
values than surrounding nodes. This is the equivalent of a ridge. It too is a form
of local trap and the only remedy is to try to escape as in the foothill case above.
Finally, the search may encounter a plateau type of structure, that is, an area
in which all neighboring nodes have the same values. Once again, one of the methods
noted above must be tried to escape the trap.
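A minimal Python sketch of hill climbing, and of the foothill trap, is given below. It assumes that smaller estimates are more promising (estimated goal distance) and that the search simply stops when no child looks better than the current node; none of the escape remedies described above are implemented. The tree and the estimates are hypothetical.

```python
def hill_climb(start, children, estimate):
    """Follow the most promising child until no child improves on the current node."""
    current = start
    path = [current]
    while True:
        options = children(current)
        if not options:
            return path                        # leaf reached
        best = min(options, key=estimate)      # smallest estimated goal distance
        if estimate(best) >= estimate(current):
            return path                        # foothill or plateau: no child looks better
        current = best                         # parent and siblings are discarded
        path.append(current)

# Hypothetical tree and goal-distance estimates; G is the goal (estimate 0).
tree = {"S": ["A", "B"], "A": ["C"], "B": ["G"]}
h = {"S": 5, "A": 2, "B": 3, "C": 4, "G": 0}
print(hill_climb("S", lambda n: tree.get(n, []), h.get))  # ['S', 'A']
```

The search commits to A because it looks better than B, then stalls because A's only child looks worse than A. The goal G, reachable through B, is never found, since B was discarded when A was chosen.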
The problems encountered with hill climbing can be avoided using a best-
first search approach.
Best-First Search
Best-first search also depends on the use of a heuristic to select most promising
paths to the goal node. Unlike hill climbing, however, this algorithm retains all
estimates computed for previously generated nodes and makes its selection based
on the best among them all. Thus, at any point in the search process, best-first
moves forward from the most promising of all the nodes generated so far. In so
doing, it avoids the potential traps encountered in hill climbing. The best-first process
is illustrated in Figure 9.10 where numbers by the nodes may be regarded as estimates
of the distance or cost to reach the goal node.
The algorithm we give for best-first search differs from the previous blind
search algorithms only in the way the nodes are saved and ordered on the queue.
The algorithm reads as follows.
BEST-FIRST SEARCH
1. Place the starting node s on the queue.
2. If the queue is empty, return failure and stop.
3. If the first element on the queue is a goal node g, return success and stop.
Otherwise,
4. Remove the first element from the queue, expand it and compute the estimated
goal distances for each child. Place the children on the queue (at either end)
and arrange all queue elements in ascending order corresponding to goal distance
from the front of the queue.
5. Return to step 2.
Best-first searches will always find good paths to a goal, even when local
anomalies are encountered. All that is required is that a good measure of goal
distance be used.
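The best-first procedure can be sketched in Python with a priority queue taking the place of the repeatedly sorted queue; this is an implementation choice made here, not one prescribed by the text. The tree and estimates are hypothetical, and the order of expansion is returned alongside the goal so the behavior can be inspected.

```python
import heapq
from itertools import count

def best_first_search(start, is_goal, children, estimate):
    """Always expand the generated node with the smallest estimated goal distance."""
    tie = count()                                   # breaks ties between equal estimates
    queue = [(estimate(start), next(tie), start)]   # priority queue ordered by estimate
    expanded = []
    while queue:
        _, _, node = heapq.heappop(queue)           # front of the ordered queue
        expanded.append(node)
        if is_goal(node):
            return node, expanded
        for child in children(node):
            heapq.heappush(queue, (estimate(child), next(tie), child))
    return None, expanded

# Hypothetical tree and goal-distance estimates; G is the goal (estimate 0).
tree = {"S": ["A", "B"], "A": ["C"], "B": ["G"]}
h = {"S": 5, "A": 2, "B": 3, "C": 4, "G": 0}
print(best_first_search("S", lambda n: n == "G", lambda n: tree.get(n, []), h.get))
# ('G', ['S', 'A', 'B', 'G'])
```

After A is expanded, its child C (estimate 4) looks worse than the earlier alternative B (estimate 3), so the search returns to B and reaches the goal. Retaining the estimates of all generated nodes is what makes this return possible.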
Branch-and-Bound Search
BRANCH-AND-BOUND SEARCH
1. Place the start node of zero path length on the queue.
2. Until the queue is empty or a goal node has been found:
(a) determine if the first path in the queue contains a goal node,
(b) if the first path contains a goal node, exit with success,
(c) if the first path does not contain a goal node, remove the path from the
queue and form new paths by extending the removed path by one step,
(d) compute the cost of the new paths and add them to the queue,
(e) sort the paths on the queue with lowest-cost paths in front.
3. Otherwise, exit with failure.
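These steps can be sketched in Python with a heap keeping the lowest-cost path in front, in place of an explicit sort. The sketch adds one assumption the steps do not state: a path is never extended to a node it already contains, which prevents cycling in graphs. The graph and its arc costs are hypothetical.

```python
import heapq

def branch_and_bound(start, is_goal, edges):
    """Return (cost, path) for a lowest-cost path to a goal, or None."""
    queue = [(0, [start])]                     # paths ordered by accumulated cost
    while queue:
        cost, path = heapq.heappop(queue)      # lowest-cost path is at the front
        node = path[-1]
        if is_goal(node):
            return cost, path
        for nxt, step in edges.get(node, []):
            if nxt not in path:                # avoid cycling along this path
                heapq.heappush(queue, (cost + step, path + [nxt]))
    return None

# Hypothetical graph: node -> list of (successor, arc cost).
edges = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 1)]}
print(branch_and_bound("S", lambda n: n == "G", edges))
# (4, ['S', 'A', 'B', 'G'])
```

The direct route S, A, G costs 7 and is generated early, but it is never removed from the queue because the cheaper path through B reaches the goal first.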
The previous heuristic methods offer good strategies but fail to describe how the
shortest distance to a goal should be estimated. The A* algorithm is a specialization
of best-first search. It provides general guidelines with which to estimate goal distances
for general search graphs.
At each node along a path to the goal, the A* algorithm generates all successor
nodes and computes an estimate of the distance (cost) from the start node to a goal
node through each of the successors. It then chooses the successor with the shortest
estimated distance for expansion. The successors for this node are then generated,
their distances estimated, and the process continues until a goal is found or the
search ends in failure.
The form of the heuristic estimation function for A* is
f*(n) = g*(n) + h*(n)
where the two components g*(n) and h*(n) are estimates of the cost (or distance)
from the start node to node n and the cost from node n to a goal node, respectively.
The asterisks are used to designate estimates of the corresponding true values f(n)
= g(n) + h(n). For state space tree problems g*(n) = g(n) since there is only one
path and the distance g*(n) will be known to be the true minimum from the start
to the current node n. This is not true in general for graphs, since alternate paths
from the start node to n may exist.
For this type of problem, it is convenient to maintain two lists of node types
designated as open and closed. Nodes on the open list are nodes that have been
generated but not yet expanded while nodes on the closed list are nodes that have
been expanded and whose children are, therefore, available to the search program.
The A* algorithm proceeds as follows.
A* SEARCH
1. Place the starting node s on open.
2. If open is empty, stop and return failure.
3. Remove from open the node n that has the smallest value of f*(n). If the
node is a goal node, return success and stop. Otherwise,
4. Expand n, generating all of its successors n' and place n on closed. For every
successor n', if n' is not already on open or closed attach a back-pointer to
n, compute f*(n') and place it on open.
5. Each n' that is already on open or closed should be attached to back-pointers
which reflect the lowest g*(n') path. If n' was on closed and its pointer was
changed, remove it and place it on open.
6. Return to step 2.
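The open and closed bookkeeping can be sketched in Python as follows. Several details are assumptions of this sketch: open is held in a heap, outdated heap entries are skipped when popped, the dictionary g plays the role of g*, and the graph and the estimates h are hypothetical (chosen so that h never overestimates the remaining cost).

```python
import heapq
from itertools import count

def a_star(start, is_goal, edges, h):
    """Return (cost, path) of a cheapest path found using f*(n) = g*(n) + h*(n)."""
    tie = count()
    g = {start: 0}                       # best known cost from start to each node
    parent = {start: None}               # back-pointers for path recovery
    open_heap = [(h(start), next(tie), start)]
    closed = set()
    while open_heap:
        _, _, n = heapq.heappop(open_heap)      # smallest f* on open
        if n in closed:
            continue                            # stale entry for an already expanded node
        if is_goal(n):
            path = []
            while n is not None:                # follow back-pointers to the start
                path.append(n)
                n = parent[n]
            return g[path[0]], path[::-1]
        closed.add(n)
        for succ, step in edges.get(n, []):
            new_g = g[n] + step
            if succ not in g or new_g < g[succ]:
                g[succ] = new_g                 # cheaper path: redirect the back-pointer
                parent[succ] = n
                closed.discard(succ)            # reopen if it had been expanded
                heapq.heappush(open_heap, (new_g + h(succ), next(tie), succ))
    return None

# Hypothetical graph and estimates.
edges = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 1)]}
heur = {"S": 3, "A": 3, "B": 1, "G": 0}
print(a_star("S", lambda n: n == "G", edges, lambda n: heur[n]))
# (4, ['S', 'A', 'B', 'G'])
```

Node B is first generated from S with cost 4 and later reached through A with cost 3; the back-pointer of B is then redirected to A, which is the adjustment described in step 5.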
Iterative Deepening A*
The depth-first and breadth-first strategies given earlier for Or trees and graphs can
easily be adapted for And-Or trees. The main difference lies in the way termination
conditions are determined, since all goals following an And node must be realized,
whereas a single goal node following an Or node will do. Consequently, we describe
a more general optimal strategy that subsumes these types, the AO* (O for ordered)
algorithm.
As in the case of the A* algorithm, we use the open list to hold nodes that
have been generated but not expanded and the closed list to hold nodes that have
been expanded (successor nodes that are available). The algorithm is a variation of
the original given by Nilsson (1971). It requires that nodes traversed in the tree be
labeled as solved or unsolved in the solution process to account for And node
solutions which require solutions to all successor nodes. A solution is found when
the start node is labeled as solved.
THE AO* ALGORITHM
1. Place the start node s on open.
2. Using the search tree constructed thus far, compute the most promising solution
tree T0.
3. Select a node n that is both on open and a part of T0. Remove n from open
and place it on closed.
4. If n is a terminal goal node, label n as solved. If the solution of n results in
any of n's ancestors being solved, label all the ancestors as solved. If the
start node s is solved, exit with success where T0 is the solution tree. Remove
from open all nodes with a solved ancestor.
5. If n is not a solvable node (operators cannot be applied), label n as unsolvable.
If the start node is labeled as unsolvable, exit with failure. If any of n's
ancestors become unsolvable because n is, label them unsolvable as well.
Remove from open all nodes with unsolvable ancestors.
6. Otherwise, expand node n generating all of its successors. For each such
successor node that contains more than one subproblem, generate their successors
to give individual subproblems. Attach to each newly generated node a back
pointer to its predecessor. Compute the cost estimate h* for each newly generated
node and place all such nodes that do not yet have descendants on open.
Next, recompute the values of h* at n and each ancestor of n.
7. Return to step 2.
It can be shown that AO* will always find a minimum-cost solution tree if
one exists, provided only that h*(n) ≤ h(n), and all arc costs are positive. Like
A*, the efficiency depends on how closely h* approximates h.
9.7 SUMMARY
search. Heuristic evaluation functions are used in best-first search strategies to find
good solution paths. A solution is not always guaranteed with this type of search,
but in most practical cases, good or acceptable solutions are often found.
We saw several examples of informed searches, including general best-first,
hill climbing, branch-and-bound, A*, and finally, the optimal And-Or heuristic search
known as the AO* algorithm. Desirable properties of heuristic search methods were
also defined.
EXERCISES
9.1. Games and puzzles are often used to describe search problems because they are easy
to describe. One such puzzle is the farmer-fox-goose-grain puzzle. In this puzzle, a
farmer wishes to cross a river taking his fox, goose, and grain with him. He can use
a boat which will accommodate only the farmer and one possession. If the fox is left
alone with the goose, the goose will be eaten. If the goose is left alone with the
grain, the grain will be eaten. Draw a state space search tree for this puzzle using leftbank
and rightbank to denote left and right river banks respectively.
9.2. For the search tree given below, use breadth-first searching and list the elements of
the queue just before selecting and expanding each next state until a goal node is
reached. (Goal states designated with .)
[Search tree for Problem 9.2]
9.9. Give three different heuristics for an h(n) to be used in solving the eight puzzle.
9.10. Using the search tree given below, list the elements of the queue just before the next
node is expanded. Use best-first search where the numbers correspond to estimated
cost-to-goal for each corresponding node.
[Search tree for Problem 9.10. Recoverable node labels with estimated costs: A 30, C 25,
D 22, E 19, F 16, G 10, H 12, I 9, J 7, K 6, L 3, M 0, N 4.]
9.11. Repeat Problem 9.10 when the cost of node B is changed to 18.
9.12. Give the time and space complexities for the search methods of Problems 9.2 and
9.3.
9.13. Discuss some of the potential problems when using hill climbing search. Give examples
of the problems cited.
9.14. Discuss and compare hill climbing and best-first search techniques.
9.15. Give an example of an admissible heuristic for the eight puzzle.
9.16. Give two examples of problems in which solutions requiring the minimum search are
more appropriate than optimal solutions. Give reasons for your choices.
9.17. Write a LISP program to perform a breadth-first search on a solution space tree
constructed using property lists. For example, children nodes e, f, and g of node d of
the tree would be constructed with the LISP function
9.18. Write a LISP program to perform a depth-first search on the tree constructed in Problem
9.17.
Matching Techniques
10.1 INTRODUCTION
Matching is the process of comparing two or more structures to discover their
likenesses or differences. The structures may represent a wide range of objects including
physical entities, words or phrases in some language, complete classes of things,
general concepts, relations between complex entities, and the like. The representations
will be given in one or more of the formalisms like FOPL, networks, or some
other scheme, and matching will involve comparing the component parts of such
structures.
Matching is used in a variety of programs for different reasons. It may serve
to control the sequence of operations, to identify or classify objects, to determine
would not match since ?x could not be bound to two different constants.
In some extreme cases, a complete change of representational form may be
required in either one or both structures before a match can be attempted. This
will be the case, for example, when one visual object is represented as a vector of
pixel gray levels and objects to be matched are represented as descriptions in predicate
logic or some other high level statements. A direct comparison is impossible unless
one form has been transformed into the other.
In subsequent chapters we will see examples of many problems where exact
matches are inappropriate, and some form of partial matching is more meaningful.
Typically in such cases, one is interested in finding a best match between pairs of
structures. This will be the case in object classification problems, for example,
when object descriptions are subject to corruption by noise or distortion. In such
cases, a measure of the degree of match may also be required.
Other types of partial matching may require finding a match between certain
key elements while ignoring all other elements in the pattern. For example, a human
language input unit should be flexible enough to recognize any of the following
three statements as expressing a choice of preference for the low-calorie food item.
Finally, some problems may call for a form of fuzzy matching
where an entity's degree of membership in one or more classes is appropriate.
Some classification problems will apply here if the boundaries between the classes
are not distinct, and an object may belong to more than one class.
Figure 10.1 illustrates the general match process where an input description
is being compared with other descriptions. As stressed earlier, the term object is
used here in a general sense. It does not necessarily imply physical objects. All
objects will be represented in some formalism such as a vector of attribute values,
propositional logic or FOPL statements, rules, frame-like structures, or other scheme.
Transformations, if required, may involve simple instantiations or unifications among
clauses or more complex operations such as transforming a two-dimensional scene
to a description in some formal language. Once the descriptions have been transformed
into the same schema, the matching process is performed element-by-element using
a relational or other test (like equality or ranking). The test results may then be
combined in some way to provide an overall measure of similarity. The choice of
measure will depend on the match criteria and representation scheme employed.
The output of the matcher is a description of the match. It may be a simple
yes or no response or a list of variable bindings, or as complicated as a detailed
annotation of the similarities and differences between the matched objects.
To summarize then, matching may be exact, used with or without pattern
variables, partial, or fuzzy, and any matching algorithm will be based on such
factors as
[Figure 10.1 The general match process. Recoverable labels: objects to match, representations, transformations, comparator, result.]
We are already familiar with many of the representation structures used in matching
programs. Typically, they will be some type of list structures that represent clauses
in propositional or predicate logic such as
or rules, such as
name: data-structures
ako: university-course
department: computer-science
credits: 3-hours
prerequisites: (if-needed check catalog)
(a)
Variables
All of the structures we shall consider here are constructed from basic atomic elements,
numbers, and characters. Character string elements may represent either constants
or variables. If variables, they may be classified by either the type of match permitted
or by their value domains.
We can classify match variables by the number of items that can replace
them (one or more than one). An open variable can be replaced by a single item,
while a segment variable can be replaced by zero or more items. Open variables
are labeled with a preceding question mark (?x, ?y, ?class). They may match or
assume the value of any single string element or word, but they are sometimes
subject to consistency constraints. For example, to be consistent, the variable ?x
can be bound only to the same top level element in any single structure. Thus (a
?x d ?x e) may match (a b d b e), but not (a b d a e). Segment variable types will
be preceded with an asterisk (*x, *z, *words). This type of variable can match an
arbitrary number or segment of contiguous atomic elements (any sublist including
the empty list). For example, (*x d (e f) *y) will match the patterns
(a (b c) d (e f) g h), (d (e f) (g))
or other similar patterns. Segment variables may also be subject to consistency
constraints similar to open variables.
Variables may also be classified by their value domains. This distinction will
be useful when we consider similarity measures below. The variables may be either
quantitative, having a meaningful origin or zero point and a meaningful interval
difference between two values, or they may be qualitative, in which there is no
origin nor meaningful interval value difference. These two types may be further
subdivided as follows.
such objects. Of course each state can be given a numerical code. For example,
"marital status" has states of married, single, divorced, or widowed. These states
have no numerical significance, and no particular order nor rank. The states could
be assigned numerical codes however, such as married = 1, single = 2, divorced
= 3, and widowed = 4.
Binary variable. Qualitative discrete variables which may assume only one
of two values, such as 0 or 1, good or bad, yes or no, high or low.
Two other structures we shall consider in this section are graphs and trees. One
type of graph we are already familiar with is the associative network (Chapter 6).
Such structures provide a rich variety of representation schemes. More generally, a
graph G = (V, E) is an ordered pair of sets V and E. The elements of V are nodes
or vertices and the elements of E are a subset of V x V called edges (or arcs or
links). An edge joins two distinct vertices in V.
Directed graphs, or digraphs, have directed edges or arcs with arrows. If an
arc is directed from node n_i to n_j, node n_i is said to be a parent or predecessor of n_j,
and n_j is the child or successor of n_i. Undirected graphs have simple edges without
arrows connecting the nodes. A path is a sequence of edges connecting two nodes
where the endpoint of one edge is the start of its successor. A cycle is a path in
which the two end points coincide. A connected graph is a graph for which every
pair of vertices is joined by a path. A graph is complete if every element of V x
V is an edge.
A tree is a connected graph in which there are no cycles, and each node has,
at most, one parent. A node with no parent is called the root node, and nodes with
no children are called leaf nodes. The depth of the root node is defined as zero.
The depth of any other node is defined to be the depth of its parent plus 1. Pictorial
representations of some graphs and a tree are given in Figure 10.4.
Recall that graph representations typically use labeled nodes and arcs where
the nodes correspond to entities and the arcs to relations. Labels for the nodes and
arcs are attribute values.
194 Matching Techniques Chap. 10
Figure 10.4 Examples of (a) general connected graph, (b) digraph, (c) disconnected graph,
and (d) tree of depth 3.
Next, we turn to the problem of comparing structures without the use of pattern
matching variables. This requires consideration of measures used to determine the
likeness or similarity between two or more structures. The similarity between two
structures is a measure of the degree of association or likeness between the objects'
attributes and other characteristic parts. If the describing variables are quantitative,
a distance metric is often used to measure the proximity.
Distance Metrics
For all elements x, y, z of the set E, the function d is a metric if and only if
a. $d(x, x) = 0$
b. $d(x, y) \ge 0$
c. $d(x, y) = d(y, x)$
d. $d(x, y) \le d(x, z) + d(z, y)$
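A general metric of this kind, with parameter p (the Minkowski metric), is
$$d_p(x, y) = \left[ \sum_{i=1}^{n} |x_i - y_i|^p \right]^{1/p}$$
For the case p = 2, this metric is the familiar Euclidean distance. When p = 1, d_p
is the so-called absolute or city block distance.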
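As a minimal sketch of the distance just described, the following Python function computes the p-parameter distance between two numeric attribute vectors. The function name and the sample vectors are illustrative assumptions, not part of the original text.

```python
def minkowski_distance(x, y, p):
    """Distance between two equal-length numeric vectors for a parameter p >= 1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Sum the p-th powers of the coordinate differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# p = 2 gives the Euclidean distance, p = 1 the city block distance.
euclidean = minkowski_distance([0, 0], [3, 4], 2)
city_block = minkowski_distance([0, 0], [3, 4], 1)
```

Attributes measured in very different units would normally be rescaled before such a distance is computed.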
Probabilistic Measures
where the prime (') denotes transpose (row vector) and $C^{-1}$ is the inverse of C.
The X and Y vectors may be adjusted for zero means by first subtracting the vector
means $u_x$ and $u_y$.
Another popular probability measure is the product moment correlation r,
given by
$$r = \frac{\mathrm{Cov}(X, Y)}{[\mathrm{Var}(X) \cdot \mathrm{Var}(Y)]^{1/2}}$$
where Cov and Var denote covariance and variance respectively. The correlation
r, which ranges between -1 and +1, is a measure of similarity frequently used in
vision applications.
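The correlation can be computed directly from its definition. The sketch below assumes population (divide by n) covariance and variance and nonconstant inputs; the function name is hypothetical.

```python
from math import sqrt


def correlation(xs, ys):
    """Product moment correlation r = Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    # Assumes neither variable is constant, so the denominator is nonzero.
    return cov / sqrt(var_x * var_y)


r_same = correlation([1, 2, 3], [2, 4, 6])      # perfectly aligned values
r_opposite = correlation([1, 2, 3], [3, 2, 1])  # perfectly opposed values
```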
Other probabilistic measures often used in AI applications are based on the
scatter of attribute values. These measures are related to the degree of clustering
among the objects. In addition, conditional probabilities are sometimes used. For
example, they may be used to measure the likelihood that a given X is a member
of class C, P(C | X), the conditional probability of C given an observed X. These
measures can establish the proximity of two or more objects. These and related
measures are discussed further in Chapter 12.
Qualitative Measures
X might be horned and Y might be long tailed. In this case, the entry a is the
number of animals having both horns and long tails. Note that n = a + b + c +
d, the total number of objects.
Various measures of association for such binary variables have been defined.
For example
$$\frac{a}{a + b + c + d} = \frac{a}{n}, \qquad \frac{a + d}{n}, \qquad \frac{a}{a + b + c}, \qquad \frac{a}{b + c}$$
Contingency tables are also useful for describing other qualitative variables,
both ordinal and nominal. Since the methods are similar to those for binary variables,
we omit the details here.
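A small sketch of how the four contingency counts for two binary variables might be tallied and turned into association measures. It assumes a counts objects with both traits, b and c count objects with exactly one trait, and d counts objects with neither. The two coefficients shown (simple matching and Jaccard) are commonly used choices and are not necessarily the exact set intended in the text. The sample data are invented for illustration.

```python
def contingency_counts(xs, ys):
    """Return (a, b, c, d) for two equal-length sequences of 0/1 values."""
    a = sum(1 for x, y in zip(xs, ys) if x and y)          # both present
    b = sum(1 for x, y in zip(xs, ys) if x and not y)      # only X present
    c = sum(1 for x, y in zip(xs, ys) if not x and y)      # only Y present
    d = sum(1 for x, y in zip(xs, ys) if not x and not y)  # neither present
    return a, b, c, d


def simple_matching(a, b, c, d):
    """Proportion of objects on which the two variables agree."""
    return (a + d) / (a + b + c + d)


def jaccard(a, b, c, d):
    """Agreement on presence only; joint absences (d) are ignored."""
    return a / (a + b + c)


# Hypothetical data: five animals, 1 = trait present.
horned = [1, 1, 0, 0, 1]
long_tailed = [1, 0, 0, 1, 1]
counts = contingency_counts(horned, long_tailed)
```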
Whatever the variable types used in a measure, they should all be properly
scaled or normalized to prevent variables having large values from negating the
effects of smaller valued variables. This could happen when one variable is scaled
in millimeters and another variable in meters.
Similarity Measures
For many problems, distance metrics are not appropriate. Instead, a measure of
similarity satisfying conditions different from those of Table 10.1 may be more
appropriate. Of course, measures of dissimilarity (or similarity), like distance, should
decrease (or increase) as objects become more alike. There is strong evidence,
however, to suggest that similarities are not in general symmetric (Tversky, 1977)
and hence, any similarity measure between a subject description A and its referent
B, denoted by s(A,B), is not necessarily equal: that is, in general, s(A,B) ≠ s(B,A)
or "A is like B" may not be the same as "B is like A."
Sec. 10.3 Measures for Matching 197
Tests on subjects have shown that in similarity comparisons, the focus of
attention is on the subject and, therefore, subject features are given higher weights
than the referent. For example, in tests comparing countries, statements like "North
Korea is similar to Red China" and "Red China is similar to North Korea" or "the
USA is like Mexico" and "Mexico is like the USA" were not rated as symmetrical
or equal. The likenesses and differences in these cases are directional. Moreover,
like many interpretations in AI, similarities may depend strongly on the context in
which the comparisons are made. They may also depend on the purpose of the
comparison.
An interesting family of similarity measures which takes into account such
factors as asymmetry and has some intuitive appeal has recently been proposed
(Tversky, 1977). Such measures may be adapted to give more realistic results for
similarity measures in AI applications where context and purpose should influence
the similarity comparisons.
Let O = {o_1, o_2, ...} be the universe of objects of interest and let A_i be the
set of attributes or features used to represent o_i. A similarity measure s which is a
function of three disjoint sets of attributes common to any two objects A_i and A_j is
given as
$$s(A_i, A_j) = F(A_i \,\&\, A_j,\; A_i - A_j,\; A_j - A_i) \qquad (10.2)$$
where A_i & A_j is the set of features common to both o_i and o_j, A_i - A_j the set of
features belonging to o_i and not o_j, and A_j - A_i is the set of features belonging to
o_j and not o_i. The function F is a real valued nonnegative function. Under fairly
general assumptions equation 10.2 can be written as
$$s(A_i, A_j) = a f(A_i \,\&\, A_j) - b f(A_i - A_j) - c f(A_j - A_i) \qquad (10.3)$$
for some a, b, c ≥ 0 and where f is an additive interval metric function. The function
f(A) may be chosen as any nonnegative function of the set A, like the number of
attributes in A or the average distance between points in A. Equation 10.3 may be
normalized to give values of similarity ranging between 0 and 1 by writing
$$S(A_i, A_j) = \frac{f(A_i \,\&\, A_j)}{f(A_i \,\&\, A_j) + a f(A_i - A_j) + b f(A_j - A_i)} \qquad (10.4)$$
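The normalized feature-based similarity can be sketched with Python sets. This example assumes f is simply the number of features in a set, and the weights and feature sets are invented for illustration. Unequal weights on the two difference sets make the measure asymmetric, as the text describes.

```python
def ratio_similarity(subject, referent, a=1.0, b=1.0, f=len):
    """Normalized similarity in [0, 1] between two feature sets.

    a weights features found only in the subject, b weights features
    found only in the referent, and f scores a set of features.
    """
    common = f(subject & referent)
    denominator = common + a * f(subject - referent) + b * f(referent - subject)
    # Assumption: two empty feature sets are treated as identical.
    return common / denominator if denominator else 1.0


# Hypothetical feature sets; subject-only features are weighted more heavily.
rich = {"wings", "beak", "feathers", "sings"}
sparse = {"wings", "beak"}
s_rich_to_sparse = ratio_similarity(rich, sparse, a=0.8, b=0.2)
s_sparse_to_rich = ratio_similarity(sparse, rich, a=0.8, b=0.2)
```

With these weights the object with fewer distinctive features is judged more similar to the richer one than the reverse, which mirrors the directional effect reported in the similarity tests.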
Fuzzy Measures
Finally, we can define a distance between the two fuzzy sets A and B as
d(A.B) = - ] -
= (I - i,()]) 10.61
which gives the mean trait membership difference between two objects o_i and o_j.
Of course s(o_i, o_j) = 0 corresponds to equal likeness or maximal similarity, and
1 for i ≠ j corresponds to maximum dissimilarity.
Matching Substrings
Since many of the representation structures are just character strings, a basic function
required in many match algorithms is to determine if a substring S2 consisting of
m characters occurs somewhere in a string S1 of n characters, m ≤ n. A direct
approach to this problem is to compare the two strings character-by-character, starting
with the first characters of both S1 and S2. If any two characters disagree, the
process is repeated, starting with the second character of S1 and matching again
against S2 character-by-character until a match is found or disagreement occurs
again. This process continues until a match occurs or S1 has no more characters.
Let i and j be position indices for string S1 and k a position index for S2. We
can perform the substring match with the following algorithm.
Sec. 10.4 Matching Like Patterns 199
i := 0;
while i < (n - m + 1) do
begin
   i := i + 1; j := i; k := 1;
   { compare S2 against S1 starting at position i }
   while (k <= m) and (S1[j] = S2[k]) do
   begin
      if k = m then
      begin
         writeln('success'); halt
      end;
      j := j + 1; k := k + 1
   end
end;
writeln('fail')
This algorithm requires about m(n - m + 1) comparisons in the worst case. A more
efficient algorithm will not repeat the same comparisons over and over again. One
such algorithm uses two indices, i and j, where i indexes (counts) the character
positions in S1 and j is set to a "match state" value ranging from 0 to m (like the
states in a finite automaton). The state 0 corresponds to no matched characters
between the strings, while the state 1 corresponds to the first letter in S2 matching
character i in S1. State 2 corresponds to the first two consecutive letters in S2
matching letters i and i + 1 in S1 respectively, and so on, with state m corresponding
to a successful match. Whenever consecutive letters fail to match, the state index
is reduced accordingly. We leave the actual details as an exercise.
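One well-known way to realize the match-state idea is the Knuth-Morris-Pratt method, sketched below in Python. This is offered as one possible solution, not as the specific algorithm the text has in mind. The function names are assumptions, and positions are 0-based rather than 1-based.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it. Used to reduce the state on a mismatch."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def find_substring(text, pattern):
    """Return the 0-based index of the first occurrence of pattern, or -1."""
    if not pattern:
        return 0
    fail = failure_table(pattern)
    state = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while state > 0 and ch != pattern[state]:
            state = fail[state - 1]  # fall back instead of rescanning text
        if ch == pattern[state]:
            state += 1
        if state == len(pattern):
            return i - len(pattern) + 1
    return -1
```

Each character of the text is examined once, so the scan takes time proportional to n + m rather than their product.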
Matching Graphs
Two graphs G1 and G2 match if they have the same labeled nodes and same labeled
arcs and all node-to-node arcs are the same. More generally, we wish to determine
if G2 with m nodes is a subgraph of G1 with n nodes, where n ≥ m. In a worst
case match, this will require n!/(n - m)! node comparisons and O(m²) arc comparisons.
Consequently, we will see that most graph matching applications deal with small
manageable graphs only or use some form of heuristics to limit the number of
comparisons.
Finding subgraph isomorphisms is also an important matching problem. An
isomorphism between the graphs G1 and G2 with vertices (nodes) V1, V2 and edges
E1, E2, that is, (V1,E1) and (V2,E2), respectively, is a one-to-one mapping f
between V1 and V2, such that for all v1 ∈ V1, f(v1) = v2, and for each arc e1 ∈
E1 connecting v1 and v1', there is a corresponding arc e2 ∈ E2 connecting f(v1)
and f(v1'). An example of an application in which graph isomorphisms are used to
determine the similarity between two graphs is given in the next section.
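The worst-case count of n!/(n - m)! node assignments can be seen in a brute-force search. The sketch below assumes directed, unlabeled arcs given as (from, to) pairs, and checks only that every arc of the smaller graph is preserved. The function name is hypothetical, and a practical matcher would prune with labels or heuristics.

```python
from itertools import permutations


def find_subgraph_mapping(nodes1, edges1, nodes2, edges2):
    """Search for a one-to-one mapping f from the m nodes of the smaller graph
    (nodes2) into the n nodes of the larger graph (nodes1) such that every
    arc (u, v) of the smaller graph has a matching arc (f(u), f(v))."""
    edge_set = set(edges1)
    # permutations(nodes1, m) enumerates all n!/(n - m)! candidate assignments.
    for image in permutations(nodes1, len(nodes2)):
        f = dict(zip(nodes2, image))
        if all((f[u], f[v]) in edge_set for u, v in edges2):
            return f
    return None


# Hypothetical graphs: a two-node chain embedded in a three-node chain.
mapping = find_subgraph_mapping([1, 2, 3], [(1, 2), (2, 3)],
                                ["x", "y"], [("x", "y")])
no_mapping = find_subgraph_mapping([1, 2, 3], [(1, 2), (2, 3)],
                                   ["x", "y"], [("x", "y"), ("y", "x")])
```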
An exact match of two sets having the same number of elements requires that their
intersection also have that number of elements. Partial matches of two sets can
also be determined by taking their intersection. If the two sets have the same number
of elements and all elements are of equal importance, the degree of match can be
the proportion of the total members which match. If the number of elements differ
between the sets, the proportion of matched elements to the minimum of the total
number of members can be used as a measure of likeness. When the elements are
not of equal importance, weighting factors can be used to score the matched elements.
For example, a measure such as
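The set measures described above can be sketched in Python as follows. The unweighted score is the number of shared elements divided by the size of the smaller set. The weighted variant is only one plausible scoring scheme and is an assumption of this example, as are the function names.

```python
def exact_match(s1, s2):
    """Two sets match exactly when their intersection is as large as both."""
    return len(s1 & s2) == len(s1) == len(s2)


def match_degree(s1, s2, weights=None):
    """Degree of partial match in [0, 1].

    Without weights: shared elements / size of the smaller set.
    With weights (a dict of element -> importance, default 1.0): weight of the
    shared elements / smaller of the two total set weights.
    """
    def score(elements):
        if weights is None:
            return float(len(elements))
        return sum(weights.get(e, 1.0) for e in elements)

    smaller = min(score(s1), score(s2))
    if smaller == 0:
        # Assumption: an empty set matches only another empty set.
        return 1.0 if s1 == s2 else 0.0
    return score(s1 & s2) / smaller
```

For example, `match_degree({1, 2, 3}, {2, 3, 4})` scores two shared elements out of three.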
One of the best examples of nontrivial pattern matching is in the unification of two
FOPL literals. Recall the procedure for unifying two literals, both of which may
contain variables (see Chapter 4). For example, to unify P(f(a,x),y,y) and P(x,b,z)
we first rename variables so that the two predicates have no variables in common.
This can be done by replacing the x in the second predicate with u to give P(u,b,z).
Next, we compare the two symbol-by-symbol from left to right until a disagreement
is found. Disagreements can be between two different variables, a nonvariable term
and a variable, or two nonvariable terms. If no disagreement is found, the two are
identical and we have succeeded.
If a disagreement is found and both are nonvariable terms, unification is impossible;
so we have failed. If both are variables, one is replaced throughout by the
other. (After any substitution is made, it should be recorded in a substitution worklist
for later use.) Finally, if the disagreement is a variable and a nonvariable term, the
variable is replaced by the entire term. Of course, in this last step, replacement is
Sec. 10.5 Partial Matching 201
possible only if the term does not contain the variable that is being replaced. This
matching process is repeated until the two are unified or until a failure occurs.
For the two predicates P, above, a disagreement is first found between the
term f(a,x) and variable u. Since f(a,x) does not contain the variable u, we replace
u with f(a,x) everywhere it occurs in the literal. This gives a substitution set of
{f(a,x)/u} and the partially matched predicates P(f(a,x),y,y) and P(f(a,x),b,z).
Proceeding with the match, we find the next disagreement pair, y and b, a
variable and term, respectively. Again, we replace the variable y with the term b
and update the substitution list to get {f(a,x)/u, b/y}. The final disagreement pair is
two variables. Replacing the variable in the second literal with the first we get the
substitution set {f(a,x)/u, b/y, y/z} or, equivalently, {f(a,x)/u, b/y, b/z}. Note that this
procedure can always give the most general unifier.
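The unification steps above can be sketched compactly in Python. The representation is an assumption of this example: variables are strings beginning with "?", constants are other strings, and a compound term or literal is a tuple whose first element is the functor or predicate name. Variables are assumed to have been renamed apart already.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def substitute(term, subs):
    """Apply the substitution list to a term, following chains of bindings."""
    if is_var(term):
        return substitute(subs[term], subs) if term in subs else term
    if isinstance(term, tuple):
        return tuple(substitute(arg, subs) for arg in term)
    return term


def occurs(var, term, subs):
    """True if var appears inside term (the check that blocks x = f(x))."""
    term = substitute(term, subs)
    if term == var:
        return True
    return isinstance(term, tuple) and any(occurs(var, arg, subs) for arg in term)


def unify(x, y, subs=None):
    """Return a dict of variable bindings unifying x and y, or None on failure."""
    subs = {} if subs is None else subs
    x, y = substitute(x, subs), substitute(y, subs)
    if x == y:
        return subs
    if is_var(x):
        return None if occurs(x, y, subs) else {**subs, x: y}
    if is_var(y):
        return None if occurs(y, x, subs) else {**subs, y: x}
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for a, b in zip(x, y):  # compare symbol by symbol, left to right
            subs = unify(a, b, subs)
            if subs is None:
                return None
        return subs
    return None  # two different nonvariable terms


# The example from the text: P(f(a,x),y,y) and P(u,b,z).
result = unify(("P", ("f", "a", "?x"), "?y", "?y"), ("P", "?u", "b", "?z"))
```

Here `result` binds u to f(a,x), y to b, and z to b, which corresponds to the substitution set derived in the text.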
We conclude this section with an example of a LISP program which uses
both the open and the segment pattern matching variables to find a match between
a pattern and a clause.
Notice that when a segment variable is encountered (the *y), match is recursively
executed on the cdrs of both pattern and clause or on the cdr of clause and pattern
as *y matches one or more than one item respectively.
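As an illustration only, and not the author's LISP program, the following Python sketch matches a pattern containing open (?x) and segment (*x) variables against a clause. Patterns and clauses are assumed to be nested Python lists of strings. A segment variable here may match zero or more items, and repeated variables must bind consistently.

```python
def match(pattern, clause, bindings=None):
    """Return a dict of variable bindings if pattern matches clause, else None."""
    bindings = {} if bindings is None else bindings
    if not pattern:
        return bindings if not clause else None
    head, rest = pattern[0], pattern[1:]

    if isinstance(head, str) and head.startswith("*"):
        if head in bindings:  # consistency: reuse the earlier segment
            segment = bindings[head]
            if clause[:len(segment)] == segment:
                return match(rest, clause[len(segment):], bindings)
            return None
        # Try segments of length 0, 1, 2, ... until the rest of the pattern fits.
        for k in range(len(clause) + 1):
            result = match(rest, clause[k:], {**bindings, head: clause[:k]})
            if result is not None:
                return result
        return None

    if not clause:
        return None
    if isinstance(head, str) and head.startswith("?"):
        if head in bindings:  # consistency: same element as before
            if bindings[head] != clause[0]:
                return None
            return match(rest, clause[1:], bindings)
        return match(rest, clause[1:], {**bindings, head: clause[0]})
    if isinstance(head, list) and isinstance(clause[0], list):
        inner = match(head, clause[0], bindings)  # match the sublist first
        return None if inner is None else match(rest, clause[1:], inner)
    return match(rest, clause[1:], bindings) if head == clause[0] else None
```

One limitation of this sketch is that it keeps only the first successful match of a nested sublist, so it does not backtrack into a sublist if a later element fails.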
Figure 10.5 Discrete version of stretchable overlay image.
displacements and infinite cost for displacements of more than two increments.
Other pieces would be assigned higher costs for unit and larger position displacements
when stronger constraints were applicable.
The matching problem here is to find a least cost location and distortion pattern
for the reference sheet with regard to the sensed picture. Attempting to compare
each component of some reference to each primitive part of a sensed picture is a
combinatorially explosive problem. However, in using the template-spring reference
image and heuristic methods (based on dynamic programming techniques) to compare
against different segments of the sensed picture, the search and match process can
be made tractable.
Any matching metric used in the least cost comparison would need to take
into account the sum of the distortion costs C_d, the sum of the costs for reference
and sensed component dissimilarities C_c, and the sum of penalty costs for missing
components C_m. Thus, the total cost is given by
$$C_t = C_d + C_c + C_m \qquad (10.8)$$
Distortions occurring in representations are not the only reasons for partial matches.
For example, in problem solving or analogical inference, differences are expected.
In such cases the two structures are matched to isolate the differences in order that
they may be reduced or transformed. Once again, partial matching techniques are
appropriate. The problem is best illustrated with another example.
In a vision application (Eshera and Fu, 1984), an industrial part may be described
using a graph structure where the set of nodes correspond to rectangular or cylindrical
block subparts. The arcs in the graph correspond to positional relations between
the subparts. Labels for rectangular block nodes contain length, width, and height,
while labels for cylindrical block nodes give radius and height. The arc labels give
location and distances between block nodes, where location can be above, to the
right of, behind, inside, and so on.
Figure 10.6 illustrates a segment of such a graph. In the figure the following
abbreviations are used:
[Figure 10.6: segment of a graph describing an industrial part, with labeled block nodes and positional arcs.]
Graphs such as this are called attributed relational graphs (ARGs). Such a
graph G is defined formally as a sextuple
G = (N,B,A,G,.(;5)
as a fuzzy set, and a metric similar to equation 10.6 may then be used to match
or compare the two objects based on their attribute memberships.
If the attributes represent linguistic variables such as height, weight, facial-
appearance, color-of-eyes, and type-of-hair, each variable may be assigned a limited
number of values. For example, a reasonable assignment for height would be the
integers 10 to 96 corresponding to height in inches. Eye colors could be assigned
brown, black, blue, hazel, and so on. An object description of tall, slim, pretty,
blue eyed, blonde will have characteristic function values for the five attributes of
u_Ak(o_1) and u_Ak(o_2) for objects o_1 and o_2 respectively. A measure of fuzzy similarity
between the two objects can then be defined as
$$s(o_1, o_2) = \frac{1}{1 + d}$$
where
$$d = \left[ \sum_{k} \left( u_{A_k}(o_1) - u_{A_k}(o_2) \right)^2 \right]^{1/2} \qquad (10.9)$$
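A fuzzy similarity of this kind can be sketched as follows, under the assumption that similarity is taken as 1/(1 + d) with d the Euclidean distance between the two objects' membership values. The attribute names and membership values are invented for illustration.

```python
from math import sqrt


def fuzzy_similarity(memberships1, memberships2):
    """Similarity in (0, 1] from two dicts mapping attribute -> membership value.

    Assumes both dicts use the same attribute names and that the similarity
    is 1 / (1 + d), where d is the Euclidean distance between memberships.
    """
    d = sqrt(sum((memberships1[k] - memberships2[k]) ** 2 for k in memberships1))
    return 1.0 / (1.0 + d)


# Hypothetical membership values for two object descriptions.
person1 = {"tall": 0.9, "slim": 0.7, "blue-eyed": 1.0}
person2 = {"tall": 0.9, "slim": 0.7, "blue-eyed": 0.0}
similarity = fuzzy_similarity(person1, person2)
```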
Production (or rule-based) systems are described in Chapter 15. They are popular
architectures for expert systems. A typical system will contain a Knowledge Base
which contains structures representing the domain expert's knowledge in the form
of rules or productions, a working memory which holds parameters for the current
problem, and an inference engine with rule interpreter which determines which
rules are applicable for the current problem (Figure 10.7).
The basic inference cycle of a production system is match, select, and execute
as indicated in Figure 10.7. These operations are performed as follows.
Match. During the match portion of the cycle, the conditions in the left
hand side (LHS) of the rules in the knowledge base are matched against the contents
of working memory to determine which rules have their LHS conditions satisfied
with consistent bindings to working memory terms. Rules which are found to be
applicable (that match) are put in a conflict set.
Figure 10.7 Production system components and basic cycle.
Select. From the conflict set, one of the rules is selected to execute. The
selection strategy may depend on recency of usage, specificity of the rule, or
other criteria.
Execute. The rule selected from the conflict set is executed by carrying
out the action or conclusion part of the rule, the right hand side (RHS) of the rule.
This may involve an I/O operation, adding, removing or changing clauses in Working
Memory, or simply causing a halt.
The above cycle is repeated until no rules are put in the conflict set or until a
stopping condition is reached.
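The match, select, execute cycle can be sketched in a few lines of Python. Everything here is a simplifying assumption: clauses are tuples of strings, variables begin with "?", each rule only adds one new clause, and selection simply takes the first entry of the conflict set. The grandfather rule is a hypothetical example.

```python
def match_condition(condition, fact, bindings):
    """Extend bindings so that condition equals fact, or return None."""
    if len(condition) != len(fact):
        return None
    new_bindings = dict(bindings)
    for c, f in zip(condition, fact):
        if c.startswith("?"):
            if new_bindings.setdefault(c, f) != f:  # inconsistent binding
                return None
        elif c != f:
            return None
    return new_bindings


def match_rule(conditions, memory, bindings=None):
    """Yield every consistent set of bindings satisfying all LHS conditions."""
    bindings = {} if bindings is None else bindings
    if not conditions:
        yield bindings
        return
    for fact in memory:
        extended = match_condition(conditions[0], fact, bindings)
        if extended is not None:
            yield from match_rule(conditions[1:], memory, extended)


def run(rules, memory, max_cycles=100):
    """Repeat match, select, execute until the conflict set is empty."""
    memory = set(memory)
    for _ in range(max_cycles):
        conflict_set = []                                  # Match
        for name, conditions, conclusion in rules:
            for b in match_rule(conditions, memory):
                new_fact = tuple(b.get(t, t) for t in conclusion)
                if new_fact not in memory:
                    conflict_set.append((name, new_fact))
        if not conflict_set:
            break
        name, new_fact = conflict_set[0]                   # Select
        memory.add(new_fact)                               # Execute
    return memory


rules = [("R6", [("father", "?y", "?x"), ("father", "?x", "?z")],
          ("grandfather", "?y", "?z"))]
facts = {("father", "bob", "sam"), ("father", "mike", "bob")}
final_memory = run(rules, facts)
```

Note that this loop rematches every rule against all of working memory on every cycle, which is the exhaustive behavior that more efficient match algorithms are designed to avoid.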
A typical knowledge base will contain hundreds or even thousands of rules
and each rule will contain several (perhaps as many as ten or more) conditions.
Working memories typically contain hundreds of clauses as well. Consequently,
exhaustive matching of all rules and their LHS conditions against working-memory
clauses may require tens of thousands of comparisons. This accounts for the claim
made in the introductory paragraph that as much as 90% of the computing time for
such systems can be related to matching operations.
To eliminate the need to perform thousands of matches per cycle, an efficient
match algorithm called RETE has been developed (Forgy, 1982). It was initially
developed as part of the OPS family of programming languages (Brownston, et al.,
1985). This algorithm uses several novel features, including methods to avoid repetitive
matching on successive cycles. The main time-saving features of RETE are as
follows.
1. In most expert systems, the contents of working memory change very little
from cycle to cycle. There is a persistence in the data known as temporal redundancy.
Sec. 10.7 The RETE Matching Algorithm 207
Figure 10.8 Changes to working memory are mapped to the conflict set.
sets up a link between rules and their LHS conditions, whereas statements like
link specific LHS terms to all rules which contain the term in the same LHS positions.
When a change is made to working memory, such as the addition of the clause
(father bill joe), all rules which contain father as an LHS condition are easily identified
and retrieved.
[Figure: network indexing rules R6, R12, R13, and R23 by their LHS conditions (father, male, brother).]
In RETE, the retrieval and subsequent testing of rule conditions is initiated
with the creation of a token which is passed to the network constructed by the rule
compiler. The network provides paths for all applicable tests which can lead to
consistent bindings and hence to complete LHS satisfaction of rules. The matcher
traverses the network finding all rules which newly match or no longer match Working
Memory elements. The output from the matcher are data structures which consist
of pairs of elements, a rule name and list of working-memory elements that match
its LHS, like (R6 ((father bob sam) (father mike bob))).
The reader will notice that the indexing methods described above are similar
to those presented in the following chapter. Other time-saving tricks are also employed
in RETE; however, the ones noted above are the most important. They provide a
substantial saving over exhaustive matching of hundreds or even tens of thousands
of conditions.
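The indexing idea, though not the full RETE network, can be illustrated with a small Python sketch. It assumes each LHS condition is a tuple whose first element is the predicate name, and it indexes rules by those predicates so that a change to working memory retrieves only the rules that could be affected. The rule contents are hypothetical.

```python
from collections import defaultdict


def build_condition_index(rules):
    """Map each LHS predicate name to the set of rules that test it."""
    index = defaultdict(set)
    for name, conditions in rules.items():
        for condition in conditions:
            index[condition[0]].add(name)
    return index


def affected_rules(index, changed_clause):
    """Rules whose LHS mentions the predicate of the changed clause."""
    return sorted(index.get(changed_clause[0], ()))


# Hypothetical rules, loosely modeled on family-relation examples.
rules = {
    "R6": [("father", "?y", "?x"), ("father", "?x", "?z")],
    "R13": [("father", "?y", "?x"), ("male", "?x")],
    "R23": [("father", "?z", "?y"), ("brother", "?z", "?x")],
}
index = build_condition_index(rules)
candidates = affected_rules(index, ("father", "bill", "joe"))
```

Only the candidate rules then need their conditions tested against the new clause, which is where the saving over exhaustive matching comes from. A real RETE network also stores partial matches between cycles, which this sketch does not attempt.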
Chap. 10 Exercises 209
10.8 SUMMARY
EXERCISES
10.1. Indicate whether or not consistent substitutions can be made which result in matches
for the following pairs of clauses. If substitutions can be made, give examples of
valid ones.
a. P(a.f(x,b).gtt(a.y)Lz). P(a.f,yf.g(f(x.yflc)
b. P(a,x) V Q(b,y,f(y)) V R(x,y),
P(x,a) V Q(f(y),y,b) V R(y,x)
C. R(a,b,c) V Q(.v,z) V P(f(a,x,bI,
P(z) V O(x.y,b) V R(x.y,z)
10.2. State what variable bindings, if any, will make the following lists match
10.3. Write a LISP function called "match" that takes two arguments and returns T if
the two are identical, returns the two arguments if one is a variable and the other a
term and returns nil, otherwise.
10.4. Identify the following variables as nominal, ordinal, binary or interval:
temperature sex
wavelength university class
population intelligence
quality of restaurant
10.5. What is the difference between a bag and a set? Give examples of both. How could
a program determine whether a data structure was either a bag or a set?
10.6. Compute the Mahalanobis distance between two normal distributions having zero
means, variances of 4 and 9, and a covariance of 5.
10.7. Give three different examples of functions f that can be used in the similarity equations
10.3 and 10.4.
10.8. Choose two simple objects O1 and O2 that are somewhat similar in their features
A1 and A2, respectively, and compute the similarity of the two using a form of
equation 10.4.
10.9. Define two fuzzy sets "tall" and "short" and compute the distance between them
using equation 10.5.
10.10. For the two sets defined in Problem 10.9, compute the similarity of the two using
equation 10.6.
10.11. Write a LISP function to find the intersection of two sets using the marking method
described in the subsection entitled Matching Sets and Bags.
10.12. Write a LISP function that determines if two sets match exactly.
10.13. Write pseudocode to unify two FOPL literals.
10.14. Write a LISP program based on the pseudocode developed in Problem 10.13.
10.15. Write pseudocode to find the similarity between two attributed relational graphs
(ARGs).
10.16. Suppose an expert system working memory has n clauses each with an average of
four if .. then conditions per clause and a knowledge base with 200 rules. Each
rule has an average of five conditions. What is the time complexity of a matching
algorithm which performs exhaustive matching?
10.17. Estimate the average time savings if the RETE algorithm was used in the previous
problem.
10.18. Write a PROLOG program that determines if two sets match exactly.
10.19. Write a PROLOG program that determines if two sets match except possibly for the
first elements of each set.
11
Knowledge Organization
and Management
212 Knowledge Organization and Management Chap. 11
if the knowledge is poorly organized. Such problems can easily become intractable
or at best intolerable.
In this chapter, we investigate various approaches to the effective organization
of knowledge within memory. We recognize that while the representation of knowledge
is still an important factor, we are more concerned here with the broader
problem, that of organization and maintenance for efficient storage and recall as
well as for its manipulation.
11.1 INTRODUCTION
with this change, our memories exhibit some rather remarkable properties. We are
able to adapt to varied changes in the environment and still improve our performance.
This is because our memory system is continuously adapting through a reorganization
process. New knowledge is continually being added to our memories, existing knowledge
is continually being revised, and less important knowledge is gradually being
forgotten. Our memories are continually being reorganized to expand our recall
and reasoning abilities. This process leads to improved memory performance throughout
most of our lives.
When developing computer memories for intelligent systems, we may gain
some useful insight by learning what we can from human memory systems. We
would expect computer memory systems to possess some of the same features. For
example, human memories tend to be limitless in capacity, and they provide a
uniform grade of recall service, independent of the amount of information stored.
For later use, we have summarized these and other desirable characteristics that
we feel an effective computer memory organization system should possess.
These characteristics suggest that memory be organized around conceptual
clusters of knowledge. Related clusters should be grouped and stored in close proximity
to each other and be linked to similar concepts through associative relations.
Access to any given cluster should be possible through either direct or indirect
links such as concept pointers indexed by meaning. Index keys with synonymous
meanings should provide links to the same knowledge clusters. These notions are
illustrated graphically in Figure 11.1 where the clusters represent arbitrary groups
of closely related knowledge such as objects and their properties or basic conceptual
categories. The links connecting the clusters are two-way pointers which provide
relational associations between the clusters they connect.
One tricky aspect of systems that must function in dynamic environments is due to
the so-called frame problem. This is the problem of knowing what changes have
and have not taken place following some action. Some changes will be the direct
result of the action. Other changes will be the result of secondary or side effects
rather than the result of the action. For example, if a robot is cleaning the floors in
a house, the location of the floor sweeper changes with the robot even though this
is not explicitly stated. Other objects not attached to the robot remain in their original
places. The actual changes must somehow be reflected in memory, a feat that requires
some ability to infer. Effective memory organization and management methods must
take into account effects caused by the frame problem.
Figure 11.2 Memory organization functions.
Sec. 11.2 Indexing and Retrieval Techniques 215
In the remainder of this chapter we consider three basic problems related to
knowledge organization: (1) classifying and computing indices for input information
presented to a system, (2) access and retrieval of knowledge from memory through
the use of the computed indices, and (3) the reorganization of memory structures
when necessary to accommodate additions, revisions, and forgetting. These functions
are depicted in Figure 11.2.
When a knowledge base is too large to be held in main memory, it must be stored
as a file in secondary storage (disk, drum or tape). Storage and retrieval of information
in secondary memory is then performed through the transfer of equal-size physical
blocks consisting of between 2^8 (256) and 2^12 (4096) bytes. When an item of information
is retrieved or stored, at least one complete block must be transferred between
main and secondary memory. The time required to transfer a block typically ranges
between 10 ms and 100 ms, about the same amount of time required to sequentially
search the whole block for an item. Clearly, then, grouping related knowledge
together as a unit can help to reduce the number of block transfers, and hence the
total access time.
An example of effective grouping alluded to above can be found in some
expert system KB organizations. Grouping together rules which share some of the
same conditions (propositions) and conclusions can reduce block transfer times since
such rules are likely to be needed during the same problem solving session. Consequently,
collecting rules together by similar conditions or content can help to reduce
the number of block transfers required. As noted before, the RETE algorithm,
described in the previous chapter, is an example of this type of organization.
Indexed Organization
are pairs of record key values and block addresses. The key value is the key of the
first record stored in the corresponding block. To retrieve an item of knowledge
from the main file, the index file is searched to find the desired record key and
obtain the corresponding block address. The block is then accessed using this address.
Items within the block are then searched sequentially for the desired record.
An indexed file contains a list of the entry pairs (k,b) where the values k are
the keys of the first record in each block whose starting address is b. Figure 11.3
illustrates the process used to locate a record using the key value of 378. The
largest key value not exceeding 378 (375) gives the block address (800) where the item
will be found. Once the 800 block has been retrieved, it can be searched linearly
to locate the record with key value 378. This key could be any alphanumeric string
that uniquely identifies a block, since such strings usually have a collation order
defined by their code set.
If the index file is large, a binary search can be used to speed up the index
file search. A binary search will significantly reduce the search time over linear
search when the number of items is not too small. When a file contains n records,
the average time for a linear search is proportional to n/2 compared to a binary
search time on the order of log_2(n).
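The indexed lookup with a binary search over the index can be sketched in Python using the standard bisect module. The index is assumed to be a list of (first key, block address) pairs sorted by key, and the blocks are held in a dictionary standing in for secondary storage. The record contents are invented.

```python
from bisect import bisect_right


def find_block(index, key):
    """Binary search the index for the block whose first key is the largest
    key not exceeding the search key. Returns the block address or None."""
    first_keys = [k for k, _ in index]
    position = bisect_right(first_keys, key) - 1
    return index[position][1] if position >= 0 else None


def find_record(index, blocks, key):
    """Fetch one block via the index, then search it sequentially."""
    address = find_block(index, key)
    if address is None:
        return None
    for record in blocks[address]:  # one block access, then linear scan
        if record[0] == key:
            return record
    return None


# Hypothetical index and block contents, following the key 378 example.
index = [(9, 100), (138, 200), (375, 800), (410, 900)]
blocks = {800: [(375, "r375"), (377, "r377"), (378, "r378"), (382, "r382")]}
block_address = find_block(index, 378)
record = find_record(index, blocks, 378)
```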
Further reductions in search time can be realized using secondary or higher
order (hierarchically) arranged index files. In this case the secondary index file
would contain key and block-address pairs for the primary index file. Similar indexing
would apply for higher order hierarchies where a separate file is used for each
level. Both binary search and hierarchical index file organization may be needed
when the KB is a very large file.
[Figure 11.3: Indexed file organization. The index file holds (k, b) pairs of key
value and block address, such as (009, 100), (138, 200), (375, 800), and (410, 900).
The KB file lists records by block address and key; block 800, for example, holds
keys 375, 377, 378, 382, 391, and 405, and block 900 holds keys 410 and 412.]
When the total number of records in a KB file is n with r records stored per
block giving a total of b blocks (n = r * b), the average search time for a nonindexed,
sequential search is b/2 block access times plus n/2 record tests. This compares
with an index search time of b/2 index tests, one block access, and r/2 record
tests. A binary index search on the other hand would require only log2(b) index
tests, one block access, and r/2 record tests. Therefore, we see that for large n
and moderately large r (30 to 80), the time savings possible using binary indexed
access can be substantial.
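To make the comparison concrete, the three cost estimates can be evaluated for a sample file. The sketch below is only a worked example of the formulas above; the function name search_costs and the sample sizes (n = 51,200 records, r = 50 records per block) are assumptions chosen for illustration.

```python
from math import log2


def search_costs(n, r):
    """Average costs for n records stored r per block (b = n / r blocks)."""
    b = n / r
    return {
        # Nonindexed sequential search: b/2 block accesses plus n/2 record tests.
        "sequential": {"index_tests": 0, "block_accesses": b / 2, "record_tests": n / 2},
        # Linear search of the index: b/2 index tests, one block access, r/2 record tests.
        "linear_index": {"index_tests": b / 2, "block_accesses": 1, "record_tests": r / 2},
        # Binary search of the index: log2(b) index tests, one block access, r/2 record tests.
        "binary_index": {"index_tests": log2(b), "block_accesses": 1, "record_tests": r / 2},
    }


costs = search_costs(n=51200, r=50)
```

With these sample values b = 1,024, so the sequential search averages 512 block accesses and 25,600 record tests, the linear index search 512 index tests, one block access, and 25 record tests, and the binary index search 10 index tests, one block access, and 25 record tests.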
Indexing in LISP can be implemented with property lists, A-lists, and/or
tables. For example, a KB can be partitioned into segments by storing each segment
as a list under the property value for that segment. Each list indexed in this way
can be found with the get property function and then searched sequentially or sorted
and searched with binary search methods. A hash-table is a special data structure
in LISP which provides a means of rapid access through key hashing. We review
the hashing process next.
Hashed Files
Indexed organizations that permit efficient access are based on the use of a hash
function. A hash function, h, transforms key values k into integer storage location
indices through a simple computation. When a maximum number of items or categories
C are to be stored, the hashed values h(k) will range from 0 to C - 1. Therefore,
given any key value k, h(k) should map into one of 0, ..., C - 1.
An effective, but simple hash function can be computed by choosing the largest
prime number p less than or equal to C, converting the key value k into an integer
k' if necessary, and then using the value k' mod p as the index value h. For example,
if C is 1000, the largest prime less than C is p = 997. Thus, if the record key
value is 123456789 (a social security number), the hashed value is h = (k mod
997) = 273.
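The computation above can be sketched as follows. This is a minimal illustration, not a production hash: the function names are hypothetical, and the way a non-integer key is converted to the integer k' (folding character codes in base 256) is an assumption, since the text does not prescribe a conversion.

```python
def largest_prime_at_most(c):
    """Return the largest prime p with p <= c (simple trial division)."""
    def is_prime(m):
        if m < 2:
            return False
        d = 2
        while d * d <= m:
            if m % d == 0:
                return False
            d += 1
        return True

    for candidate in range(c, 1, -1):
        if is_prime(candidate):
            return candidate
    raise ValueError("no prime less than or equal to c")


def to_integer(key):
    """Convert a key to an integer k' (assumed scheme: base-256 character folding)."""
    if isinstance(key, int):
        return key
    value = 0
    for ch in str(key):
        value = value * 256 + ord(ch)
    return value


def hash_key(key, c):
    """h(k) = k' mod p, where p is the largest prime <= c."""
    return to_integer(key) % largest_prime_at_most(c)
```

For the example in the text, hash_key(123456789, 1000) uses p = 997 and yields 273.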
When using hashed access, the value of C should be chosen large enough to
accommodate the maximum number of categories needed. The use of the prime
number p in the algorithm helps to insure that the resultant indices are somewhat
uniformly distributed or hashed throughout the range 0 to C - 1.
This type of organization is well suited for groups of items corresponding to
C different categories. When two or more items belong to the same category, they
will have the same hashed values. These values are called synonyms. One way to
accommodate collisions (simultaneous attempts to access synonyms) is with data
structures known as buckets. A bucket is a linked list of one or more items, where
each item is a record, block, list or other data structure. The first item in each
bucket has an address corresponding to the hashed address. Figure 11.4 illustrates
a form of hashed memory organization which uses buckets to hold all items with
the same hashed key value. The address of each bucket in this case is the indexed
location in an array.
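A bucket organization of this kind might look like the sketch below. It is an illustration only: the class name BucketTable is hypothetical, integer keys are assumed, and Python lists stand in for the linked lists mentioned in the text.

```python
class BucketTable:
    """Array of buckets; each bucket holds every item whose key hashes to its index."""

    def __init__(self, capacity, prime):
        self.prime = prime  # prime p used in h(k) = k mod p, with p <= capacity
        self.buckets = [[] for _ in range(capacity)]

    def _address(self, key):
        return key % self.prime

    def insert(self, key, item):
        # Synonyms (keys with the same hashed value) share one bucket.
        self.buckets[self._address(key)].append((key, item))

    def find(self, key):
        # Only the one bucket at the hashed address has to be searched.
        return [item for k, item in self.buckets[self._address(key)] if k == key]


table = BucketTable(capacity=1000, prime=997)
table.insert(273, "first item")
table.insert(1270, "second item")  # 1270 mod 997 = 273, a synonym of key 273
```

Both keys land in the bucket at array index 273, and find(1270) scans only that bucket.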
218 Knowledge Organization and Management Chap. 11
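[Figure 11.4: Hashed memory organization. Each hashed address indexes an array
location that holds a bucket of items with the same hashed key value.]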
Conceptual Indexing
The indexing schemes described above are based on lexical ordering, where the
collation order of a key value determines the relative location of the record. Keys
for these items are typically chosen as a coded field (employee number, name,
part number, and so on) which uniquely identifies the item. A better approach to
indexed retrieval is one which makes use of the content or meaning associated
with the stored entities rather than some nonmeaningful key value. This suggests
the use of indices which name and define or otherwise describe the entity being
retrieved. Thus, if the entity is an object, its name and characteristic attributes
would make meaningful indices. If the entity is an abstract object such as a concept,
the name and other defining traits would be meaningful as indices.
How are structures indexed by meaning, and how are they organized in memory
for retrieval? One straightforward and popular approach uses associative networks
(see Chapter 7) similar to the structures illustrated in Figure 11.1. Nodes within
the network correspond to different knowledge entities, whereas the links are indices
or pointers to the entities. Links connecting two entities name the association or
relationship between them. The relationship between entities may be defined as a
hierarchical one or just through associative links.
As an example of an indexed network, the concept of computer science (CS)
should be accessible directly through the CS name or indirectly through associative
links like a university major, a career field, or a type of classroom course. These
notions are illustrated in Figure 11.5.
Object attributes can also serve as indices to locate items or categories based
on the attribute values. In this case, the best attribute keys are those which provide
the greatest discrimination among objects within the same category. For example,
suppose we wish to organize knowledge by object types. In this case, the choice
of attributes should depend on the use intended for the knowledge. Since objects
may be classified with an unlimited number of attributes (color, size, shape, markings,
and so on), those attributes which are most discriminable with respect to the concept
meaning should be chosen. Alternatively, object features with the most predictive
power make the best indices. A good index for bird types is one based on individual
differences like feet, size, beak shape, sounds emitted, special markings, and so
forth. Attribute values possessed by all objects are useful for forming categories
but poor for identifying an object within the category.
Truly intelligent methods of indexing will be content associative and usually
require some inferring. Like humans, a system may fail to locate an item when it
has been modified in memory. In such cases, cues related to the item may be
needed. For example, you may fail to remember whether or not you have ever
discussed American politics with a foreigner until you have considered under what
circumstances you may have talked with foreigners (at a university, while traveling
or living abroad, or just a chance meeting). An example of this type of indexing
strategy is discussed in Section 11.4.
Hypertext
One of the earliest computer models of memory was the Human Associative Memory
(HAM) system developed by John Anderson and Gordon Bower (1973). This memory
is organized as a network of propositional binary trees. An example of a simple
tree which represents the statement "In a park a hippie touched a debutante" is
illustrated in Figure 11.6. When an informant asserts this statement to HAM, the
system parses the sentence and builds a binary tree representation. Nodes in the
tree are assigned unique numbers, while links are labeled with the following functions:
[Figure 11.6: HAM binary tree representation of "In a park a hippie touched a debutante."]
As HAM is informed of new sentences, they are parsed and formed into new
tree-like memory structures or integrated with existing ones. For example, to add
the fact that the hippie was tall, the following subtree is attached to the tree structure
of Figure 11.6 by merging the common node hippie (node 3) into a single node.
[Figure: subtree attaching the attribute tall to the node hippie (node 3).]
When HAM is posed with a query, it is formed into a tree structure called a
probe. This structure is then matched against existing memory structures for the
best match. The structure with the closest match is used to formulate an answer to
the query.
Matching is accomplished by first locating the leaf nodes in memory that
match leaf nodes in the probe. The corresponding links are then checked to see if
they have the same labels and in the same order. The search process is constrained
by searching only node groups that have the same relation links, based on recency
of usage. The search is not exhaustive and nodes accessed infrequently may be
forgotten. Access to nodes in HAM is accomplished through word indexing in
LISP (node words in tree structures are accessed directly through property lists or
A-lists).
Roger Schank and his students at Yale University have developed several computer
systems which perform different functions related to the use of natural language
[Figure 11.7: The $MEET E-MOP frame, with a Content part and indices pointing to
the events EV1 and EV2.]
to the same MOP category are entered, common event features are used to generalize
the E-MOP. This information is collected in the frame contents. Specialization may
also be required when over-generalization has occurred. Thus, memory is continually
being reorganized as new facts are entered. This process prevents the addition of
excessive memory entries and much redundancy which would result if every event
entered resulted in the addition of a separate event. Reorganization can also cause
forgetting, since originally assigned indices may be changed when new structures
are formed. When this occurs, an item cannot be located so the system attempts
to derive new indices from the context and through other indices by reconstructing
related events.
To see how CYRUS builds and maintains a memory organization, we briefly
examine how a basic E-MOP grows and undergoes revision with time. Initially,
the $MEET E-MOP of Figure 11.7 would consist of the Content part of the frame
only. Then, after a first meeting occurred, indices relevant and unique to that meeting
are established and recorded, and pointers are set to the corresponding event.
Subsequent meetings also result in the determination of new event indices, or, if two or
more of the new meetings have some features in common, a new sub-E-MOP would
be formed with indices established and pointers set to the new E-MOP. This process
continues with new indices to events added or new E-MOPs formed and indexed
as new meetings occur. Furthermore, the content portion of all E-MOPs is continually
monitored and modified to better describe the common events it indexes. Thus,
when a number of meeting events exhibit some new property, the frame content is
generalized to include this property and new indices are determined. When
over-generalization occurs, subsequent events will result in a correction through some
specialization and recomputation of indices.
After the two diplomatic meetings described above had been entered, indices
are developed by the system to index the events (EV1 and EV2) using features
which discriminate between the two meetings (Figure 11.7). If a third meeting is
now entered, say one between Vance and Sadat of Egypt, which is also about
Arab-Israeli peace, new E-MOPs will be formed since this meeting has some features
in common with the Begin (EV1) meeting. One of the new E-MOPs that is formed
is indexed under the previous topic index. It has the following structure:
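[Structure of the new E-MOP under the Topic index. The Topic index points to the
new E-MOP and to SALT (EV2). The E-MOP content is
Topic: Arab-Israeli peace
Underlying topic: peace
Involves: Israel and the Arabs
Participants: heads of state
and the E-MOP is indexed by participants' nationalities (Israel, Egypt).]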
The key issues in this type of organization are the same as those noted earlier.
They are (1) the selection and computation of good indices for new events so that
similar events can be located in memory for new event integration, (2) monitoring
and reorganization of memory to accommodate new events as they occur, and (3)
access of the correct event information when provided clues for retrieval.
11.5 SUMMARY
EXERCISES
11.1. What important characteristics should a computer memory organization system possess?
11.2. Explain why each of the characteristics named in Problem 11.1 is important.
11.3. What basic operations must a program perform in order to access specific chunks of
knowledge?
11.4. Suppose 64-byte records are stored in . . ize 2 bytes. Describe a suitable
index file to access the records using the following keys (start with block address
16-
time when a block can be located and read on the average within 60 ms and the
time to search each record is one ms per block? Compare this time to the time
required to search a single block for the same information.
11.6. Referring to Problem 11.4, describe how a hashing method could be applied to
search for the indicated records.
11.7. Draw a conceptual indexing tree structure using the same keys as those given in
Problem 11.4, but with the addition of a generalized node named farm-animals.
11.8. Using the same label links as those used in HAM, develop propositional trees for
the following sentences.
The birds were singing in the park.
John and Mary went dancing at the prom.
Do not drink the water.
11.9. For the previous problem, add the sentence "There are lots of birds and they are
small and yellow."
11.10. Develop an E-MOP for a general episode to fill up a car with gasoline using the
elements Actor, Participant, Objects, Actions, and Goals.
11.11. Show how the E-MOP of Problem 11.10 would be indexed and accessed for the
two events of filling the car at a self-service and at a full-service location.
11.12. Are the events of Problem 11.11 good candidates for specialized E-MOPs? Explain
your answer.
11.13. Give an example of a hashing function that does not distribute key values uniformly
over the key space.
11.14. Draw a small hypertext network that you might want to browse where the general
network subject of artificial intelligence is used. Make up your own subtopics and
show all linkages which you feel are useful, including link directions between subtopics.
11.15. Show how the E-MOP of Figure 11.7 would be generalized when peace was one of
the topics discussed at every meeting.
11.16. Modify the E-MOP of Figure 11.7 to accommodate a new meeting between Vance
and King Hussain of Jordan. The topic of their meeting is Palestinian refugees.
PART
Perception, Communication, and Expert Systems
Natural Language Processing
228 Natural Language Processing Chap. 12
12.1 INTRODUCTION
a noun and some other part of the sentence. Conjunctions join words or groups of
words together, and interjections are used to express strong feelings apart from the
rest of the sentence.
Phrases are made up of words but act as a single unit within a sentence.
These form the building blocks for the syntactic structures we consider later.
Syntactic. This knowledge relates to how words are put together or structured
to form grammatically correct sentences in the language.
World. World knowledge relates to the language a user must have in order
to understand and carry on a conversation. It must include an understanding of the
other person's beliefs and goals.
The approaches taken in developing language understanding programs generally
follow the above levels or stages. When a string of words has been detected, the
sentences are parsed or analyzed to determine their structure (syntax) and grammatical
correctness. The meanings (semantics) of the sentences are then determined and
appropriate representation structures created for the inferencing programs. The whole
process is a series of transformations from the basic speech sounds to a complete
set of internal representation structures.
Understanding written language or text is easier than understanding speech.
To understand speech, a program must have all the capabilities of a text understanding
program plus the facilities needed to map spoken sounds (often corrupted by noise)
into textual form. In this chapter, we focus on the easier problem, that of natural
language understanding from textual input and information processing. The process
of translating speech into written text is considered in Chapter 13 under Pattern
Recognition and the process of generating text is considered later in this chapter.
Essentially, there have been three different approaches taken in the development of
natural language understanding programs: (1) the use of keyword and pattern matching,
(2) combined syntactic (structural) and semantic directed analysis, and (3) comparing
and matching the input to real world situations (scenario representations).
The keyword and pattern matching approach is the simplest. This approach
was first used in programs such as ELIZA described in Chapter 10. It is based on
the use of sentence templates which contain key words or phrases such as
"___ my mother ___," "I am ___," and "I don't like ___," that
are matched against input sentences. Each input template has associated with it
one or more output templates, one of which is used to produce a response to the
given input. Appropriate word substitutions are also made from the input to the
output to produce the correct person and tense in the response (I and me into you
to give replies like "Why are you ___?"). The advantage of this approach is
that ungrammatical, but meaningful sentences are still accepted. The disadvantage
is that no actual knowledge structures are created; so the program does not really
understand.
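A small sketch may help show how little machinery the keyword approach needs. The templates, the pronoun table, and the function names below are hypothetical and are written in Python only for illustration; they are not taken from ELIZA itself.

```python
import re

# Input templates paired with output templates (illustrative, not ELIZA's own).
TEMPLATES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "Why are you {0}?"),
    (re.compile(r"\bi don't like (.+)", re.IGNORECASE), "Why don't you like {0}?"),
    (re.compile(r"\bmy mother\b", re.IGNORECASE), "Tell me more about your family."),
]

# Word substitutions that switch the person from the input to the reply.
PRONOUN_SWAPS = {"i": "you", "me": "you", "my": "your", "am": "are"}


def swap_person(fragment):
    """Replace first-person words with second-person words."""
    words = fragment.rstrip(".!?").split()
    return " ".join(PRONOUN_SWAPS.get(w.lower(), w) for w in words)


def respond(sentence):
    """Return the output template of the first input template that matches."""
    for pattern, output in TEMPLATES:
        match = pattern.search(sentence)
        if match:
            return output.format(*[swap_person(g) for g in match.groups()])
    return "Please go on."
```

Nothing in this sketch builds a knowledge structure; the reply is produced purely by string matching and substitution, which is exactly the limitation noted above.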
The third approach is based on the use of structures such as the frames or
scripts described in Chapter 7. This approach relies more on a mapping of the
input to prescribed primitives which are used to build larger knowledge structures.
It depends on the use of constraints imposed by context and world knowledge to
develop an understanding of the language inputs. Prestored descriptions and details
for commonly occurring situations or events are recalled for use in understanding a
new situation. The stored events are then used to fill in missing details about the
current scenario. We will be returning to this approach later in this chapter. Its
advantage is that much of the computation required for syntactical analysis is bypassed.
The disadvantage is that a substantial amount of specific, as well as general world
knowledge must be prestored.
The second approach is one of the most popular approaches currently being
used and is the main topic of the first part of this chapter. With this approach,
knowledge structures are constructed during a syntactical and semantical analysis
of the input sentences. Parsers are used to analyze individual sentences and to
build structures that can be used directly or transformed into the required knowledge
formats. The advantage of this approach is in the power and versatility it provides.
The disadvantage is the large amount of computation required and the need for
still further processing to understand the contextual meanings of more than one
sentence.
12.3 GRAMMARS AND LANGUAGES
G = (vn, vt, s, p)
xyz -> xwz
where x, y, z, and w are strings from v. This rule states that y should be rewritten
as w in the context of x to z, where x and z can be any string including the empty
string e.
As an example of a simple grammar G, we choose one which has component
parts or constituents from English with vocabulary Q given by
P: S -> NP VP
NP -> ART N
VP -> V NP
N -> boy | popsicle | frog
V -> ate | kissed | flew
ART -> the | a
where the vertical bar indicates alternative choices.
S is the initial symbol (for sentence here), NP stands for noun phrase, VP
stands for verb phrase, N stands for noun, V is an abbreviation for verb, and ART
stands for article.
The grammar G defined above generates only a small fraction of English,
but it illustrates the general concepts of generative grammars. With this G, sentences
such as the following can be generated.
S -> NP VP
-> ART N VP
-> the N VP
-> the boy VP
-> the boy V NP
-> the boy ate NP
-> the boy ate ART N
-> the boy ate a N
-> the boy ate a popsicle
It should be clear that a grammar does not guarantee the generation of meaningful
sentences, only that they are structurally correct. For example, a grammatically correct,
but meaningless sentence like "The popsicle flew a frog" can be generated with
this grammar.
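Because G has no recursive rules, every sentence it generates can be enumerated mechanically. The Python sketch below does so by always rewriting the leftmost symbol, as in the derivation above; the dictionary layout and the names GRAMMAR, expand, and SENTENCES are assumptions made for illustration.

```python
# Rewrite rules of the example grammar G; symbols not listed as keys are terminals.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["ART", "N"]],
    "VP": [["V", "NP"]],
    "N": [["boy"], ["popsicle"], ["frog"]],
    "V": [["ate"], ["kissed"], ["flew"]],
    "ART": [["the"], ["a"]],
}


def expand(symbols):
    """Yield every terminal word list derivable from the given list of symbols."""
    if not symbols:
        yield []
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Nonterminal: try each alternative right-hand side in turn.
        for production in GRAMMAR[first]:
            yield from expand(production + rest)
    else:
        # Terminal: keep the word and expand what follows it.
        for tail in expand(rest):
            yield [first] + tail


SENTENCES = [" ".join(words) for words in expand(["S"])]
```

The grammar yields 6 noun phrases and 3 verbs, hence 6 x 3 x 6 = 108 sentences, among them both "the boy ate a popsicle" and the meaningless "the popsicle flew a frog".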
We learn a language by learning its structure and not by memorizing all of
the sentences we have ever heard, and we are able to use the language in a variety
of ways because of this familiarity. Therefore, a useful model of language is one
which characterizes the permissible structures through the generating grammars.
Unfortunately, it has not been possible to formally characterize natural languages
with a simple grammar. In other words, it has not been possible to classify natural
languages in a mathematical sense as we did in the example above. More constrained
languages (formal programming languages) have been classified and studied through
the use of similar grammars, including the Chomsky classes of languages (1965).
Structural Representations
[Figure: syntactic tree with NP and VP branches, where NP expands to ART N.]
A more extensive English grammar than the one given above can be obtained
with the addition of other constituents such as prepositional phrases PP, adjectives
ADJ, determiners DET, adverbs ADV, auxiliary verbs AUX, and so on. Additional
rewrite rules permitting the use of these constituents could include some of the
following:
PP -> PREP NP
VP -> V ADV
VP -> V PP
VP -> V NP PP
VP -> AUX V NP
DET -> ART ADJ
DET -> ART
These extensions broaden the types of sentences that can be generated by permitting
the added constituents in sentence forms such as
Transformational Grammars
[Figure 12.2 Structures for (a) active and (b) passive voice.]
grammatical constituent parts. This reveals the surface structure of the sentence,
the way the sentence is used in speech or in writing. This structure can be transformed
into another one where the deeper semantic structure of the sentence is determined.
Application of the transformation rules can produce a change from passive
voice to active voice, change a question to declarative form, and handle negations,
subject-verb agreement, and so on. For example, the structure in 12.2(b) could be
transformed to give the same basic structure as that of 12.2(a) as is illustrated in
Figure 12.3.
Transformational grammars were never widely adopted as computational models
of natural language. Instead, other grammars, including case grammars, have had
more influence on such models.
Case Grammars
A case relates to the semantic role that a noun phrase plays with respect to verbs
and adjectives. Case grammars use the functional relationships between noun phrases
and verbs to reveal the deeper case of a sentence. These grammars use the fact
that verbal elements provide the main source of structure in a sentence since they
describe the subject and objects.
[Figure 12.3 Passive voice transformed to active voice.]
In inflected languages like Latin, nouns generally have different ending forms
for different cases. In English these distinctions are less pronounced and the forms
remain more constant for different cases. Even so, they provide some constraints.
English cases are the nominative (subject of the verb), possessive (showing possession
or ownership), and objective (direct and indirect objects). Fillmore (1968, 1977)
revived the notion of using case to extract the meanings of sentences. He extended
the transformational grammars of Chomsky by focusing more on the semantic aspects
of a sentence.
In case grammars, a sentence is defined as being composed of a proposition
P, a tenseless set of relationships among verbs and noun phrases, and a modality
constituent M, composed of mood, tense, aspect, negation, and so on. Thus, a
sentence can be represented as
S -> M + P
P -> C1 + C2 + ... + Ck.
The number of cases suggested by Fillmore were relatively few. For example,
the original list contained only some six cases. They relate to the actions performed
by agents, the location and direction of actions, and so on. For example, the case
of an instigator of an action is the agentive (or agent), the case of an instrument or
object used in an action is the instrumental, and the case of the object receiving
the action or change is the objective. Thus, in sentences like "The soldier struck
the suspect with the rifle butt" the soldier is the agentive case, the suspect the
objective case, and the rifle butt the instrumental case. Other basic cases include
dative (an animate entity affected by an action), factitive (the case of the object or
of being that which results from an event), and locative (the case of location of
the event). Additional cases or substitutes for those given above have since been
introduced, including beneficiary, source, destination, to or from, goal, and time.
Case frames are provided for verbs to identify allowable cases. They give
the relationships which are required and those which are optional. For the above
sentence, a case frame for the verb struck might be
This may be interpreted as stating that the verb struck must occur in sentences
with a noun phrase in the objective case and optionally (parentheses indicate optional
use) with noun phrases in the agentive and instrumental cases.
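The role of a case frame as a constraint can be illustrated with a short Python sketch. The dictionary layout and the function satisfies_frame are hypothetical; only the required and optional cases for struck are taken from the interpretation given above.

```python
# Case frame for "struck": the objective case is required, while the
# agentive and instrumental cases are optional.
CASE_FRAMES = {
    "struck": {"required": {"objective"}, "optional": {"agentive", "instrumental"}},
}


def satisfies_frame(verb, cases):
    """Check that the filled cases are allowed by the verb's case frame."""
    frame = CASE_FRAMES[verb]
    present = set(cases)
    missing = frame["required"] - present
    not_allowed = present - frame["required"] - frame["optional"]
    return not missing and not not_allowed


full_sentence = {
    "agentive": "the soldier",
    "objective": "the suspect",
    "instrumental": "the rifle butt",
}
```

Here satisfies_frame("struck", full_sentence) succeeds, and so does a frame containing only the objective case, whereas a frame lacking the objective case is rejected.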
A tree representation for a case grammar will identify the words by their
modality and case. For example, a case grammar tree for the sentence "Sue did
not take the car" is illustrated in Figure 12.4.
[Figure 12.4 Case grammar tree representation. S branches to a modality
(declarative, negation, past) and a proposition with V = take, C1 = Sue, and
C2 = the car.]
To build a tree structure like this requires that a word lexicon with sufficient
information be available in which to determine the case of sentence elements.
Systemic Grammars
Classification of units. Units are classified by the role they play at the
next higher level. For example, the verbal serves as the predicate, the nominal
serves as the subject or complement, and so on.
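[Figure: clause classification network. A clause is independent or dependent; an
independent clause is declarative, imperative, or interrogative; an interrogative
clause is yes-no or wh-.]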
Semantic grammars encode semantic information into a syntactic grammar. They
use context-free rewrite rules with nonterminal semantic constituents. The constituents
are categories, or metasymbols such as attribute, object, present (as in display),
and ship, rather than NP, VP, N, V, and so on. This approach greatly restricts the
range of sentences which can be generated and requires a large number of rewrite
rules.
Semantic grammars have proven to be successful in limited applications including
LIFER, a data base query system distributed by the Navy which is accessible
through ARPANET (Hendrix et al., 1978), and a tutorial system named SOPHIE
which is used to teach the debugging of circuit faults. Rewrite rules in these systems
essentially take the forms
In the LIFER system, there are rules to handle numerous forms of wh-queries
such as
What is the name and location of the carrier nearest to New York?
Who commands the Kennedy?
where print matches <PRESENT>, length matches <ATTRIBUTE>, and the Enterprise
matches <SHIP>. Other typical lexicon entries that can match <ATTRIBUTE>
include CLASS, COMMANDER, FUEL, TYPE, BEAM, LENGTH, and so on.
LIFER can also accommodate elliptical (incomplete) inputs. Given the query
"What is the length of the Kennedy?" a subsequent query consisting of the abbreviated
form "of the Enterprise?" will elicit a proper response (see also the third and
fourth example queries above).
Semantic grammars are suitable for use in systems with restricted grammars
since computation is limited. They become unwieldy when used with general purpose
language understanding systems, however.
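The flavor of a semantic grammar can be conveyed with a small Python sketch. The single rule used here, <PRESENT> the <ATTRIBUTE> of <SHIP>, is an assumption inferred from the matches described above, and the lexicon entries and function names are likewise illustrative; none of this reproduces LIFER's actual rule set.

```python
# Semantic categories and the phrases that may fill them (illustrative entries).
LEXICON = {
    "<PRESENT>": ["print", "display", "what is"],
    "<ATTRIBUTE>": ["length", "class", "commander", "beam"],
    "<SHIP>": ["the enterprise", "the kennedy"],
}

# One assumed rewrite rule: <PRESENT> the <ATTRIBUTE> of <SHIP>
RULE = ["<PRESENT>", "the", "<ATTRIBUTE>", "of", "<SHIP>"]


def match_rule(rule, words, bindings=None):
    """Match words against the rule; return category bindings or None."""
    bindings = dict(bindings or {})
    if not rule:
        return bindings if not words else None
    head, rest = rule[0], rule[1:]
    if head in LEXICON:
        # Semantic category: try each phrase that can fill it.
        for phrase in LEXICON[head]:
            tokens = phrase.split()
            if words[:len(tokens)] == tokens:
                result = match_rule(rest, words[len(tokens):], {**bindings, head: phrase})
                if result is not None:
                    return result
        return None
    # Literal word in the rule: it must match the next input word exactly.
    if words and words[0] == head:
        return match_rule(rest, words[1:], bindings)
    return None


def parse_query(text):
    words = text.lower().rstrip("?.").split()
    return match_rule(RULE, words)
```

A successful match returns the semantic bindings directly, so no separate semantic interpretation step is needed; the price is that every new sentence form requires another rule.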
Before the meaning of a sentence can be determined, the meanings of its constituent
parts must be established. This requires a knowledge of the structure of the sentence,
the meanings of individual words and how the words modify each other. The process
of determining the syntactical structure of a sentence is known as parsing.
Parsing is the process of analyzing a sentence by taking it apart word-by-word
and determining its structure from its constituent parts and subparts. The
structure of a sentence can be represented with a syntactic tree or a list as described
in the previous section. The parsing process is basically the inverse of the sentence
generation process since it involves finding a grammatical sentence structure from
an input string. When given an input string, the lexical parts or terms (root words)
must first be identified by type, and then the role they play in a sentence must be
determined. These parts can then be combined successively into larger units until a
complete tree structure has been completed.
To determine the meaning of a word, a parser must have access to a lexicon.
When the parser selects a word from the input stream it locates the word in the
lexicon and obtains the word's possible function and other features, including semantic
information. This information is then used in building a tree or other representation
structure. The general parsing process is illustrated in Figure 12.5.
Sec. 12.4 Basic Parsing Techniques
241
[Figure 12.5 Parsing an input to create an output structure. The input string
passes through the parser, which consults the lexicon, to produce a representation
structure.]
The Lexicon
Word      Type          Features
orange    Adjective
          Noun          {3s}
the       Determiner    {3s, 3p}
to        Preposition
we        Pronoun       Case: subjective
yellow    Adjective
Figure 12.6 Typical entries in a lexicon.
and so on), and all words contained within the lexicon listed within the categories
to which they belong.
The organization and entries of a lexicon will vary from one implementation
to another, but they are usually made up of variable length data structures such as
lists or records arranged in alphabetical order. The word order may also be given
in terms of usage frequency so that frequently used words like a, the, and an will
appear at the beginning of the list facilitating the search.
Access to the words may be facilitated by indexing, with binary searches,
hashing, or combinations of these methods. A lexicon may also be partitioned to
contain a base lexicon set of general, frequently used words and domain specific
components of words.
Transition Networks
Transition networks are another popular method used to represent formal and natural
language structures. They are based on the application of directed graphs (digraphs)
and finite state automata. A transition network consists of a number of nodes and
labeled arcs. The nodes represent different states in traversing a sentence, and the
arcs represent rules or test conditions required to make the transition from one
state to the next. A path through a transition network corresponds to a permissible
sequence of word types for a given grammar. Thus, if a transition network can be
successfully traversed, it will have recognized a permissible sentence structure. For
example, a network used to recognize a sentence consisting of a determiner, a
noun and a verb ("The child runs") would be represented by the three-node graph
as follows.
[Figure: transition network with arcs labeled determiner, noun, and verb.]
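A recognizer for such a network amounts to a table of arcs and a loop over the input words. The sketch below is illustrative only: the state names N1 to N4, the small lexicon, and the function recognize are assumptions, and the network is the simple determiner, noun, verb sequence of the example.

```python
# Word categories for the tiny example vocabulary (hypothetical entries).
LEXICON = {
    "the": "determiner", "a": "determiner",
    "child": "noun", "dog": "noun",
    "runs": "verb", "barked": "verb",
}

# Arcs of the network: (current state, word category) -> next state.
ARCS = {
    ("N1", "determiner"): "N2",
    ("N2", "noun"): "N3",
    ("N3", "verb"): "N4",
}
FINAL_STATES = {"N4"}


def recognize(sentence):
    """Traverse the network one word at a time; succeed only in a final state."""
    state = "N1"
    for word in sentence.lower().split():
        category = LEXICON.get(word)
        state = ARCS.get((state, category))
        if state is None:
            return False  # no arc allows this word type from the current state
    return state in FINAL_STATES
```

The sentence "The child runs" follows the arcs determiner, noun, verb and ends in the final state, so it is accepted; any other ordering of word types fails at the first missing arc.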
[Figure 12.7 A noun phrase segment of a transition network, with arcs labeled
determiner, adjective, noun, pronoun, and proper noun.]
Words in the input sentence are replaced with their syntactic categories and those
in turn are replaced by constituents of the same or smaller size until S has been
rewritten or until failure occurs.
The reader may have noticed the close similarity between rewrite rules and Horn
clauses, especially when the Horn clauses are written in the form of PROLOG
[Figure: sentence transition network with arcs labeled article, adjective, noun,
aux verb, and verb.]
The variables A. B. and C in this statement represent lists of words. The argument
A is the whole list of words to be tested as a sentence, and C is the list of remaining
words, if any. Similar assumptions hold for A, B, and C in the noun and verb
phrase conditions respectively.
Rule definitions which rewrite the noun phrases and verb phrases must also
be defined. Thus, an NP may be defined with statements such as the following:
Like the above rule, these rules state that (1) a noun phrase can be either an article
which consists of a list A and remaining list B (if any) and a noun which is a list
B and remaining list C, or (2) a noun consisting of the list A with remaining list B
(if any). Similarly, a verb phrase may be defined with rules like the following:
verbPhrase(A,B) :- verb(A,B).
verbPhrase(A,C) :- verb(A,B), nounPhrase(B,C).
verbPhrase(A,C) :- verb(A,B), prepositionPhrase(B,C).
Definitions for the prepositional phrase as well as lexical terminals must also
be given. These can include the following:
preposition([at|X],X).
article([a|X],X).
article([the|X],X).
noun([dog|X],X).
noun([cow|X],X).
noun([moon|X],X).
verb([barked|X],X).
verb([winked|X],X).
With this simple parser we can determine if strings of the following type are
grammatically correct.
To do so, we must enter sentence queries as lists such as the following for the
PROLOG interpreter:
?
X = []
?- sentence([barked,a,moon,dog,the],X).
no
Since the remainder of the sentence bound to X is the empty list, it is recognized
as correct. The second sentence failed since it could not instantiate with the correct
constituent parts.
Of course, for a parser to be of much practical use, other constituents and a
great many more words should be defined. The example illustrates the utility of
using PROLOG as a basic parser.
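The same list-passing idea can be imitated outside PROLOG. The following Python sketch is an illustration under stated assumptions, not a translation of any particular program: each function takes a list of words and yields every possible list of remaining words, so that generators play the role of PROLOG backtracking. The sentence rule (a noun phrase followed by a verb phrase) and the prepositional phrase rule (a preposition followed by a noun phrase) are assumed here, and the function names are hypothetical.

```python
LEXICON = {
    "article": {"a", "the"},
    "noun": {"dog", "cow", "moon"},
    "verb": {"barked", "winked"},
    "preposition": {"at"},
}


def terminal(category):
    """Build a parser that consumes one word of the given category."""
    def parse(words):
        if words and words[0] in LEXICON[category]:
            yield words[1:]  # the remaining list, like X in noun([dog|X],X)
    return parse


article, noun, verb, preposition = (
    terminal(c) for c in ("article", "noun", "verb", "preposition"))


def noun_phrase(a):
    for b in article(a):        # (1) an article followed by a noun
        yield from noun(b)
    yield from noun(a)          # (2) a noun alone


def preposition_phrase(a):      # assumed rule: preposition, then noun phrase
    for b in preposition(a):
        yield from noun_phrase(b)


def verb_phrase(a):
    for b in verb(a):
        yield b                             # verb alone
        yield from noun_phrase(b)           # verb, then noun phrase
        yield from preposition_phrase(b)    # verb, then prepositional phrase


def sentence(a):                # assumed rule: noun phrase, then verb phrase
    for b in noun_phrase(a):
        yield from verb_phrase(b)


def is_sentence(words):
    """A word list is accepted if some parse leaves an empty remainder."""
    return any(rest == [] for rest in sentence(words))
```

As in the PROLOG queries, a word list is accepted when a parse consumes all of it, and rejected when no rule combination fits.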
The simple networks described above are not powerful enough to recognize the
variety of sentences a human language system could be expected to cope with. In
fact, they fail to recognize all languages that can be generated by a context-free
grammar. Other extensions are needed to accept a wider range of sentences but
still avoid the necessity for large complex networks. We can achieve such extensions
by labeling some arcs as a separate network state (such as an NP) and then constructing
a subnetwork which recognizes the different noun phrases required. In this way, a
single subnetwork for an NP can be called from several places in a sentence. Similar
arcs can be labeled for other sentence constituents including VP, PP (prepositional
phrases), and others. With these additions, complex sentences having embedded
phrases can be parsed with relatively simple networks. This leads directly to the
notion of using recursion in a network.
A recursive transition network (RTN) is a transition network which permits
arc labels to refer to other networks (including the network's own name), and they
in turn may refer back to the referring network rather than just permitting word
categories used previously. For example, an RTN described by William Woods
(1970) is illustrated in Figure 12.9 where the main network calls two subnetworks,
an NP and a PP network, as illustrated in Figure 12.9(b) and (c).
The top network in the figure is the top level (sentence) network, and the
lower level networks are for NP and PP arc states. The arcs corresponding to these
states will be traversed only if the corresponding subnetworks (b) or (c) are successfully
traversed.
[Figure 12.9: a recursive transition network consisting of the sentence network S,
the NP network, and (c) the prepositional phrase network PP]
[Table: Type of arc | Purpose of arc | Example]
Starting with CND set to S1, POS set to 1, and RLIST set to nil, the first arc test
(NP) would be completed. Since this test is for a state, the parser would PUSH
the return node S2 onto RLIST, set CND to N1, and call the NP network. Trying
the first test DET (a CAT test) in the NP network, a match would be found with
word position 1. This would result in CND being updated to N2 and POS to position
2. The next word (big) satisfies the ADJ test causing CND to be updated to N2
again, and POS to be updated to position 3. The ADJ test is then repeated for the
word tree, but it fails. Hence, the arc test for N is made next with no change
made to POS and CND. This time the test succeeds resulting in updates of N4 to
CND and position 4 to POS. The next test is the POP which signals a successful
completion of the NP network and causes the return node (S2) to be retrieved
from the RLIST stack and CND to be updated with S2. POP does not cause an
advance in the word position POS.
The only possible test from S2 is for category V which succeeds on the word
"shades" with resultant updates of S5 to CND and 5 to POS. At S5, the only
possible test is the NP. This again invokes a call to the lower level NP network
which is traversed successfully with the noun phrase "the old house." After a
return to the main network, CND is set to S6 and POS is set to position 8. At this
point, the lower PP network is called with CND being set to P1 and S6 pushed
onto RLIST. From P1, the CAT test for PREP passes with CND being set to P2
and POS being set to 9. NP is then called with CND being set to N1 and P3 being
pushed onto RLIST. As before, the NP network is traversed with the noun phrase
"the stream" resulting in a POS value of 11, P3 being popped from RLIST and a
return to that node. The test at P3 (POP) results in S6 being popped from RLIST
and a return to the S6 node. Finally, the POP test at S6, together with the period
at position 11, results in a successful traversal and acceptance of the sentence.
During a network traversal, a parse can fail if (1) the end of the input sentence
(a period) has been reached when the test from the CND node value is not a terminal
(POP) value or (2) a word in the input sentence fails to satisfy any of the available
arc tests from some node in the network.
The number of sentences accepted by an RTN can be extended if backtracking
is permitted when a failure occurs. This requires that states having alternative
transitions be remembered until the parse progresses past possible failure points. In this
way, if a failure occurs at some point, the interpreter can backtrack and try alternative
paths. The disadvantage with this approach is that parts of a sentence may be parsed
more than one time resulting in excessive computations.
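The traversal just described, with its current node (CND), word position (POS), and return stack (RLIST), can be sketched as a small recursive recognizer with backtracking. The network table below is a simplified assumption loosely modeled on the trace (node names S1, S2, S5, S6, N1, N2, N4, P1 to P3), the lexicon is hypothetical, and positions are counted from 0 rather than 1. It is a sketch of the technique, not a reconstruction of Figure 12.9.

```python
# Hypothetical word categories; "shades" may be a noun or a verb.
CATEGORIES = {
    "the": {"DET"}, "big": {"ADJ"}, "old": {"ADJ"},
    "tree": {"N"}, "house": {"N"}, "stream": {"N"},
    "shades": {"N", "V"}, "by": {"PREP"},
}

# node -> list of arcs (kind, label, next node); arcs are tried in order.
NETWORK = {
    "S1": [("PUSH", "N1", "S2")],
    "S2": [("CAT", "V", "S5")],
    "S5": [("PUSH", "N1", "S6"), ("POP", None, None)],
    "S6": [("PUSH", "P1", "S6"), ("POP", None, None)],
    "N1": [("CAT", "DET", "N2")],
    "N2": [("CAT", "ADJ", "N2"), ("CAT", "N", "N4")],
    "N4": [("POP", None, None)],
    "P1": [("CAT", "PREP", "P2")],
    "P2": [("PUSH", "N1", "P3")],
    "P3": [("POP", None, None)],
}


def traverse(cnd, pos, rlist, words):
    """Depth-first traversal; a failed arc falls through to the next one."""
    for kind, label, nxt in NETWORK[cnd]:
        if kind == "CAT":
            # Category test: consume one word and advance POS.
            if pos < len(words) and label in CATEGORIES.get(words[pos], set()):
                if traverse(nxt, pos + 1, rlist, words):
                    return True
        elif kind == "PUSH":
            # Save the return node on RLIST and enter the subnetwork.
            if traverse(label, pos, rlist + [nxt], words):
                return True
        elif rlist:
            # POP inside a subnetwork: return to the saved node, POS unchanged.
            if traverse(rlist[-1], pos, rlist[:-1], words):
                return True
        elif pos == len(words):
            # POP at the top level succeeds only at the end of the input.
            return True
    return False


def accepts(sentence):
    return traverse("S1", 0, [], sentence.lower().split())
```

In accepts("the tree shades"), the PUSH arc at S5 fails because no words remain, and the interpreter backs up and tries the POP arc instead. This is the kind of remembered alternative that backtracking requires, and it also shows how work can be repeated when an early choice turns out to be wrong.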
The networks considered so far are not very useful for language understanding.
They have only been capable of accepting or rejecting a sentence based on the
grammar and syntax of the sentence. To be more useful, an interpreter must be
able to build structures which will ultimately be used to create the required knowledge
entities for an AI system. Furthermore, the resulting data structures should contain
more information than just the syntactic information dictated by the grammar alone.
Semantic information should also be included. For example, a number of sentence
features can also be established and recorded, such as the subject NP, the object
NP, the subject-verb number agreement, the mood (declarative or interrogative),
tense, and so on. This means that additional tests must be performed to determine
the possible semantics a sentence may have. Without these additional tests, much
ambiguity will still be present and incorrect or meaningless sentences accepted.
We can achieve the additional capabilities required by augmenting an RTN
with the ability to perform additional tests and store intermediate results as a sentence
is being parsed. When an RTN is given these additional features, it is called an
augmented transition network or ATN.
When building a representation structure, an ATN uses a number of different
registers as temporary storage to hold the different sentence constituents. Thus,
one set of registers would be used for an NP network, one for a PP network, one
for a V. and so on. Using the register contents, an ATN builds a partial structural
description of the sentence as it moves from state to state in the network. These
registers provide temporary storage which is easily modified, switched, or discarded
until the final sentence structure is constructed. The registers also hold flags and
other indicators used in conjunction with some arcs. When a partial structure has
been stored in registers and a failure occurs, the interpreter can clear the registers.
backtrack, and start a new set of tests. At the end of a successful parse, the contents
of the registers are combined to form the final sentence data structure required for
output.
A specification language developed by Woods (1970, 1986) for ATNs takes the
form of an extended context-free grammar. This language is given in Figure 12.10
where the vertical bar indicates alternative choices for a construction and the *
(Kleene star) signifies repeatable (zero or more) elements. All nonterminals are
enclosed in angle brackets. Some of the capitalized words appearing in the language
were defined earlier as arc tests and actions. The other words in uppercase correspond
to functions which perform many of the tasks related to the construction of the
structure using the registers.
The specification language is read the same as rewrite rules. Thus, it specifies
that a transition network is composed of a list of arc sets, where each arc set is in
turn a list with first element being a state name and the remaining elements being
arcs which emanate from that state. An arc can be any of the forms CAT, JUMP,
PUSH, TEST, WORD, or POP. For example, as noted earlier, the TEST arc
corresponds to an arbitrary test which determines whether the arc is to be traversed or
not. Note that a sequence of actions is associated with the arc tests. These actions
are executed during the arc traversals. They are used to build pieces of structures
such as a tree or a list. The terminal action of any arc specifies the state to which
control is passed to complete the transition.
Among other things, an action can be any of the three function forms SETR,
SENDR, and LIFTR which cause the indicated register values to be set to the
value of form. Terminal actions can be either TO or JUMP where TO requires that
the input sentence pointer should be advanced, and JUMP requires that the pointer
remain fixed and the input word continue to be scanned. Finally, a construction
form can be any of the seven alternatives in the bottom group of Figure 12.10,
including the symbol @ which is a terminal symbol placeholder for form.
The function SETR causes the contents of the indicated registers to be set
equal to the value of the corresponding form. This is done at the current level in
the network, while SENDR causes it to be done by sending it to the next lower
level of computation. LIFTR returns information to the next higher level of
computation. The function GETR returns the value of the indicated register, and GETF
returns the value of a specified feature for the current input word. As noted before,
the value of @ is usually an input word. The function BUILDQ takes lists from
the indicated registers (which represent fragments of a parse tree with marked nodes)
and builds the sentence structures.
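The register functions can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the Level class, the push helper, the use of tuples for list structures, and the convention that each "+" in a template is replaced by the next named register. It is meant only to show how SETR, GETR, SENDR, LIFTR, and BUILDQ might relate to levels of computation, not how Woods' interpreter is actually written.

```python
class Level:
    """One level of computation (one active network) with its registers."""

    def __init__(self, parent=None):
        self.registers = {}
        self.parent = parent
        self.pending = {}  # values waiting to be sent to the next lower level


def setr(level, name, value):
    level.registers[name] = value          # set at the current level


def getr(level, name):
    return level.registers.get(name)


def sendr(level, name, value):
    level.pending[name] = value            # delivered when a subnetwork starts


def push(level):
    """Enter a lower level, handing over any registers sent with SENDR."""
    child = Level(parent=level)
    child.registers.update(level.pending)
    level.pending = {}
    return child


def liftr(level, name, value):
    level.parent.registers[name] = value   # set at the next higher level


def buildq(level, template, *names):
    """Replace each "+" in the template with the next named register value."""
    values = iter([getr(level, n) for n in names])

    def fill(node):
        if node == "+":
            return next(values)
        if isinstance(node, tuple):
            return tuple(fill(child) for child in node)
        return node

    return fill(template)
```

For instance, with TYPE, SUBJ, AUX, and V registers set at one level, buildq(level, ("S", "+", "+", "+", ("VP", "+")), "TYPE", "SUBJ", "AUX", "V") assembles a nested sentence structure from the register contents.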
An ATN network similar to the RTN illustrated in Figure 12.9 is presented
in Figure 12.11. Note that the arcs in this network have some of the tests described
above. These tests will have the basic forms given in Figure 12.10, together with
the indicated actions. The actions include building the final sentence structure which
may contain more features than those considered thus far, as well as certain semantic
features.
Using the specification language, we can represent this particular network
with the constituent abbreviations and functions described above in the form of a
LISP program. For example, a partial description of the network is depicted in
[Figure 12.11: an ATN network with S, NP, and PP subnetworks and arcs including
PUSH(NP), CAT(ADJ), PUSH(PP), CAT(DET), CAT(PREP), and POP]
(N4) is tested and the PP test subsequently fails. POP is executed and a return of
control is made to statement 2.
3. The register SUBJ is set to the value of @ which is the list structure
(NP (dog (big) DEF)) returned from the NP registers. DEF signifies that the determiner
is definite.
4. In line 3, register TYPE is set to DCL (for declarative).
5. Control is transferred to S1 with the statement TO in line 4 and the input
pointer is moved past the noun phrase to the verb "likes."
6. If an auxiliary verb had been found at the beginning of the sentence
instead of an NP, control would have been passed to line 5 where statements 5, 7,
and 8 would have been executed. This would have resulted in registers AUX and
TYPE being set to the values @ and Q respectively.
7. At S1, a category test is made for a V. Since this succeeds (is T), statements
11, 12, and 13 are executed. This results in register AUX being set to nil, and
register V being set to the contents of @ to give (V likes). Control is then passed
to S4 and the input pointer is moved to the word "the."
8. If the test for V had failed, and an auxiliary verb had been found, statements
14 and 15 would have been executed.
9. Since S4 is a terminal node, a sentence structure can be built there. This
will be the case if the end of the sentence has been reached. If so, the BUILDQ
function creates a list structure with first element S, followed by the values of the
three registers TYPE, SUBJ, AUX, corresponding to the three plus (+) signs.
These are then followed with VP and the contents of the V register. For example,
with an input sentence of.
the structure (S DCL (NP (boy) DEF) (AUX can) (VP whistle)) would be constructed
from the four registers TYPE, SUBJ, AUX, and V.
10. Because more input words remain, the BUILDQ in line 22 is not executed,
and control drops to the next line where a push is made to the lower NP network.
As before, the NP succeeds with the structure (NP (boy (small) DEF)) being returned
as the value of @. Register VP is then set to the list returned by BUILDQ (line
24) which consists of VP followed by the verb phrase, and control is passed to S5.
11. Since S5 is a terminal node and the end of the input sentence has been
reached, BUILDQ will build the final sentence structure from the TYPE, SUBJ,
AUX, and VP register contents. The final structure constructed is
The use of recursion, arc tests, and a variety of arc and node combinations
give the ATNs the power of a Turing Machine. This means that an ATN can
recognize any language that a general purpose computer can recognize. This versatility
also makes it possible to build deep sentence structures rather than just structures
with surface features only. (Recall that surface features relate to the form of words,
phrases, and sentences, whereas deep features relate to the content or meaning of
these elements.) The ability to build deep structures requires that other appropriate
tests be included to check pronoun references, tense, number agreement, and other
features.
Because of their power and versatility. ATNs have become popular as a model
for general purpose parsers. They have been used successfully in a number of natural
language systems as well as front ends for databases and expert systems.
Semantic Analysis and Representation Structures 255
Sec. 12.5
It turned into a black day. In his haste to catch the flight, he backed over Tom's
bicycle. He should never have left it there. It was damaged beyond repair. That caused
the tailpipe to break. It would be impossible to make it now. . . . It was all because
of that late movie. He would be heartbroken when he found out about it.
Although a car was never explicitly mentioned, it must be assumed that a car
was the object which was backed over Tom's bicycle. A program must be able to
infer this. The "black day" metaphor also requires some inference. Days are not
usually referred to by color. And sorting out the pronoun references can also be an
onerous task for a program. Of the seven uses of it, two refer to the bicycle, two
to the flight, two refer to the situation in general, and one to the status of the day.
There are also four uses of he referring to two different people and a that which
refers to the accident in general. The placement of the pronouns is almost at random,
making it difficult to give any rule of association. Words that point back or refer
to people, places, objects, events, times, and so on that occurred before are called
anaphors. Their interpretation may require the use of heuristics, syntactic and semantic
constraints, inference, and other forms of object analysis within the discourse content.
This example should demonstrate again that language cannot be separated
from intelligence and reasoning. To fully understand the above situation requires
that a program be able to reason about people's goals, beliefs, motives, and facts
about the world in general.
The semantic structures constructed from utterances such as the above must
account for all aspects of meaning in what is known as the domain, context, and
the task. The domain refers to the knowledge that is part of the world model the
system knows about. This includes object descriptions, relationships, and other
relevant concepts. The context relates to previous expressions, the setting and time of
the utterances, and the beliefs, desires, and intentions of the speakers. A task is
part of the service the system offers, such as retrieving information from a data
base, providing expert advice, or performing a language translation. The domain,
context, and task are what we have loosely referred to before as semantics, pragmatics,
and world knowledge.
The semantic grammars described in Section 12.2 are one form of approach based
on the use of lexical semantics. With this approach, input sentences are transformed
through the use of domain dependent semantic rewrite rules which create the target
knowledge structures. A second example of an informal lexical-semantic approach
is one which uses conceptual dependency theory. Conceptual dependency structures
provide a form of linked knowledge that can be used in larger structures such as
scenes and scripts.
The construction of conceptual dependency structures is accomplished without
performing any direct syntactic analysis. Making the jump between utterance and
these structures requires that more information be contained in the lexicon. The
lexicon entries must include word sense and other information which relate the
words to a number of primitive semantic categories as well as some syntactic
information.
Recall from Chapter 7 that conceptualizations are either events or object states.
Event structures include objects and their attributes, picture producers (PPs) or actors.
actions, direction of action (to or from) and sometimes instruments that participate
in the actions, and the location and time of the event. These items are collected
together in a slot-filler structure as depicted in Figure 12.13.
Verbs in the input string are a dominant factor in building conceptual dependency
structures because they denote the event action or state. Consequently, lexicon entries
for verbs will be more extensive than other entry types. They will contain all possible
senses, tense, and other information. Each verb maps to one of the primitive actions:
ATRANS, ATTEND, CONC, EXPEL, GRASP, INGEST, MBUILD, MOVE,
MTRANS, PROPEL, PTRANS, and SPEAK. Each primitive action will also have
an associated tense: past, present, future, conditional, continuous, interrogative,
end, negation, start, and timeless.
The basic process followed in building conceptual dependency structures is
simply the three steps listed below.
would be initiated to look for associated words which complete the phrase
beginning with to or for.
For the above tests, there are four types of actions taken.
These actions build up the conceptual dependency structure as the input string
is parsed. For example, the action taken for a verb like drank would be to build a
substructure for the primitive action INGEST with unfilled slots for ACTOR,
OBJECT, and TENSE.
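The verb-driven step can be pictured as creating a slot-filler record whose slots are filled in by later actions. In the following Python sketch the lexicon entry for drank, the slot values, and the function names are all hypothetical. It shows only the general shape of building an INGEST substructure with empty ACTOR, OBJECT, and TENSE slots and then filling them.

```python
# Hypothetical verb lexicon: each verb maps to a primitive action and a tense.
VERB_LEXICON = {"drank": {"primitive": "INGEST", "tense": "past"}}


def start_structure(verb):
    """Build a substructure for the verb's primitive with unfilled slots."""
    primitive = VERB_LEXICON[verb]["primitive"]
    return {"ACTION": primitive, "ACTOR": None, "OBJECT": None, "TENSE": None}


def fill_slot(structure, slot, value):
    """Return a copy of the structure with one empty slot filled."""
    if structure[slot] is not None:
        raise ValueError(f"slot {slot} is already filled")
    filled = dict(structure)
    filled[slot] = value
    return filled


# Later words and lexicon information would trigger actions such as these.
cd = start_structure("drank")
cd = fill_slot(cd, "TENSE", VERB_LEXICON["drank"]["tense"])
cd = fill_slot(cd, "ACTOR", "Joe")      # hypothetical actor
cd = fill_slot(cd, "OBJECT", "milk")    # hypothetical object
```

Returning a new dictionary from fill_slot keeps earlier partial structures intact, which is convenient if an interpreter needs to abandon a path and try another.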
Subsequent words in the input string would initiate actions to add to this structure
and fill in the empty ACTOR and OBJECT slots. Thus, a simple sentence like
would be transformed through a series of test and action steps to produce a structure
such as the following.
This would be parsed, and the following tree structure would be output from the
ATN:
(S DCL
   (NP (N (Sample24)))
   (AUX (TENSE (PRESENT)))
   (VP (V (contain))
       (NP (N (silicon)))))
Using this structure, the semantic interpreter would produce the predicate clause
produce expressions that are natural and close to humans requires more than rules
of syntax, semantics, and discourse. In general, it requires that a coherent plan be
developed to carry out multiple goals. A great deal of sophistication goes into the
simplest types of utterances when they are intended to convey different shades of
meanings and emotions. A participant in a dialog must reason about a hearer's
understanding and his or her knowledge and goals. During the dialog, the system
must maintain proper focus and formulate expressions that either query, explain,
direct, lead or just follow the conversation as appropriate.
The study of language generation falls naturally into three areas: (1) the
determination of content, (2) formulating and developing a text utterance plan, and (3)
achieving a realization of the desired utterances.
Content determination is concerned with what details to include in an
explanation, a request, a question, or an argument in order to convey the meanings set forth
by the goals of the speaker. This means the speaker must know what the hearer
already knows, what the hearer needs to know, and what the hearer wants to know.
These topics are related to the domain, task, and discourse context described above.
Text planning is the process of organizing the content to be communicated so as to
best achieve the goals of the speaker. Realization is the process of mapping the
organized content to actual text. This requires that specific words and phrases be
chosen and formulated into a syntactic structure.
Until about 1980, not much work had been done beyond single sentence
generation. Understanding and generation were performed with a single piece of isolated
text without much regard given to context and consideration of the hearer. Following
this early work, a few comprehensive systems were developed. To complete this
section, we describe the basic ideas behind two of these systems. They take different
approaches to those taken by the lexical and compositional semantics understanding
described in the previous section.
KAMP is a knowledge and modalities planner developed for the generation of natural
language text. Developed by Douglas Appelt (1985), KAMP simulates the behavior
of an expert robot named Rob (a terminal) assisting John (a person) in the disassembly
and repair of air compressors.
KAMP uses a planner and a data base of knowledge in (modal) logical form.
The knowledge includes domain knowledge, world knowledge, linguistic knowledge,
and knowledge about the hearer. A description of actions and action summaries
are available to the planner. Given a goal, the planner uses heuristics to build and
refine a plan in the form of a procedural network. Other procedures act as critics
of the plans and help to refine them. If a plan is completed, a deduction system is
used to prove that the sequence of actions does, in fact, achieve the goal. If the plan
fails, the planner must do further searching for a sequence of actions that will
work. A completed plan states the knowledge and intentions of the agent, the robot
Sec. 12.6 Natural Language Generation 261
Rob. This is the first step in producing the output text. The process can be summarized
as follows.
Suppose KAMP has determined the immediate goal to be the removal of the
compressor pump from the platform.
Truel'Attached)pump platform))
KAMP first formulates and refines a plan that John adopt Rob's plan to remove
the pump from the platform. The first part of Rob's plan suggests a request for
John to remove the pump leading to the expression
After axioms are used to prove that actions in the initial summary plan are
successful, the request is expanded to include details for the pump removal. Rob
decides that John will know he is near the platform and that he knows where the
toolbox is located, but that he does not know what tool to use. Rob, therefore,
determines that John will not need to be told about the platform, but that he must
be informed, with an imperative statement, to remove the pump with a wrench in
the toolbox.
The next step is for Rob to plan speech acts to realize the request. This
requires linguistic knowledge of the structure to use for an imperative request, in
this case, that the sentence should have the form V NP (PP)* (recall that * stands
for optional repetition). Words to complete the output string are then selected and
ordered accordingly.
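This leads to the generation of a sentence with the following tree structure.
[Tree diagram for the sentence: the verb "remove", determiner-noun phrases including
"the wrench", and a prepositional phrase "in the toolbox" (P, then NP with DET and N)]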
The overall process of planning and formulating the final sentence "Remove
the pump with the wrench in the toolbox" is very involved and detailed. It requires
planning and plan verification for content, selecting the proper structures, selecting
senses, mood, tense, the actual words, and a final ordering. All of the steps must
be constrained toward the realization of the (possibly multiple) goals set forth. It is
truly amazing we accomplish such acts with so little effort.
Neil Goldman (Schank et al., 1973) developed a generation component called BABEL
which was used as part of several language understanding systems built by Schank
and his students (SAM, MARGIE, QUALM, and so on). This component worked
in conjunction with an inference component to determine responses to questions
about short news and other stories.
Given the general content or primitive event for the response, BABEL selects
and builds an appropriate conceptual dependency structure which includes the intended
word senses. A modified ATN is then used to generate the actual word string for
output.
To determine the proper word sense, BABEL uses a discrimination net. For
example, suppose the system is told a story about Joe going into a fast-food restaurant,
ordering a sandwich and a soft drink in a can, paying, eating, and then leaving.
After the understanding part of the system builds the conceptual dependency and
script structures for the story, questions about the events could be posed. If asked
what Joe had in the restaurant, BABEL would first need to determine the conceptual
[Figure 12.14: a discrimination net for INGEST, with yes/no tests (such as "through
mouth?") leading to word senses such as "drink" and "smoke"]
category of the question in order to select the proper conceptual dependency pattern
to build. The verb in the query determines the appropriate primitive categories of
eat and drink as being INGEST. To determine the correct sense of INGEST as eat
and drink, a discrimination net like that depicted in Figure 12.14 would be used. A
traversal of the discrimination net leads to eat and drink, using the relation from
have and sandwich as being taken through the mouth and soft drink as fluid.
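A discrimination net is essentially a tree of tests whose leaves are word senses. The Python sketch below is an assumed, much reduced net for INGEST. The property names, the order of the tests, and the fallback sense "ingest" are illustrative guesses rather than a copy of Figure 12.14.

```python
# Each internal node is (property to test, branch if yes, branch if no);
# each leaf is the word sense to use. The structure is hypothetical.
INGEST_NET = (
    "fluid", "drink",
    ("smoke", "smoke",
     ("through-mouth", "eat", "ingest")),
)

# Hypothetical properties of the objects mentioned in the story.
OBJECT_PROPERTIES = {
    "sandwich": {"solid", "through-mouth"},
    "soft drink": {"fluid", "through-mouth"},
}


def choose_sense(net, properties):
    """Walk the net from the root until a leaf (a word sense) is reached."""
    while not isinstance(net, str):
        test, yes_branch, no_branch = net
        net = yes_branch if test in properties else no_branch
    return net
```

With these assumptions, choose_sense(INGEST_NET, OBJECT_PROPERTIES["sandwich"]) selects eat and the soft drink selects drink, mirroring the way a traversal separates the two senses of INGEST.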
Once a conceptual dependency framework has been selected, the appropriate
words must be chosen and the slots filled. Functions are used to operate on the net
to complete it syntactically to obtain the correct tense, mood, form, and voice.
When completed, a modified ATN is then used to transform the conceptual dependency
structure into a surface sentence structure for output.
The final conceptual dependency structure passed to the ATN would appear
as follows.
[Conceptual dependency diagram relating joe, INGEST, soft-drink, can (Contain),
MOVE, and mouth]
An ATN used for text generation differs from one used for analysis. In particular,
the registers and arcs must be different. The value of the register contents (denoted
as @ in the previous section) corresponds to a node or arc in the conceptual dependency
(or other type) network rather than the next word in the input sentence. Registers
will be present to hold tense, voice, and the like. For example, a register named
FORM might be set to past and a register VOICE set to active when generating an
active sentence like "Joe bought candy." Following an arc such as a CAT/V arc
means there must be a word in the lexicon corresponding to the node in the conceptual
dependency. The tense of the word then follows from the FORM register contents.
In this section, we briefly describe a few of the more successful natural language
understanding systems. They include LUNAR, LIFER, and SHRDLU.
The LUNAR system was designed as a language interface to give geologists direct
access to a data base containing information on lunar rock and soil compositions
obtained during the NASA Apollo-11 moon landing mission. The design objective
was to build a system that could respond to natural queries received from geologists
such as
The system has a dictionary of some 3500 words, an English grammar and
two data bases. One data base contains a table of chemical analyses of about 13,000
entries, and the other contains 10,000 indexed document topics. LUNAR uses a
meaning representation language which is an extended form of FOPL. The language
uses (1) designators which name objects or classes of objects like nouns, variables,
and classes with range quantifiers, (2) propositions that can be true or false, that
are connected with logical operators and, or, not, and quantification identifiers,
Sec. 12.7 Natural Language Systems 265
and (3) commands which carry out specific actions (like TEST which tests the
truth value of propositions against given arguments, as in (TEST (CONTAIN sample24
silicon))).
Although never fully implemented, the LUNAR project was considered an
operational success since it related to a real world problem in need of a solution.
It failed to parse or find the correct semantic interpretation on only about 10% of
the questions presented to it.
LIFER (Language Interface Facility with Ellipsis and Recursion) was described
briefly in Section 12.2 under semantic grammars. It was developed by Gary Hendrix
(1978) and his associates to be used as a development aid and run-time language
interface to other systems such as a data base management system. Among its
special features are spelling corrections, processing of elliptical inputs, and the
ability of the run-time user to extend the language through the use of paraphrase.
LIFER consists of two major components, a set of interactive functions for
language specifications and a parser. The specification functions are used to define
an application language as a subset of English that is capable of interacting with
existing software. Given the language specification, the parser interprets the language
inputs and translates them into appropriate structures that interact with the application
software.
In using a semantic grammar, LIFER systems incorporate much semantic information
within the syntax. Rather than using categories like NP, VP, N, and V,
LIFER uses semantic categories like <SHIP-NAME> and <ATTRIBUTE> which
match ship names or attributes. In place of syntactic patterns like NP VP, semantic
patterns like What is the <ATTRIBUTE> of <SHIP>? are used. For each such
pattern, the language definer supplies an expression with which to compute the
interpretations of instances of the pattern. For example, if LIFER were used as the
front end for a database query system, the interpretation would be a database
retrieval command.
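The idea of pairing a semantic pattern with an expression that computes its interpretation can be sketched in a few lines of Python. This is only an illustration of the technique, not LIFER itself: the category contents, the function names, and the tuple used as a "retrieval command" are all assumptions made for the example.

```python
import re

# Hypothetical semantic categories: each name maps to the phrases it matches.
CATEGORIES = {
    "ATTRIBUTE": {"length", "speed", "home port"},
    "SHIP": {"kennedy", "nimitz"},
}

def compile_pattern(pattern):
    """Turn 'What is the <ATTRIBUTE> of <SHIP>?' into a regular expression."""
    regex = ""
    for part in re.split(r"(<[A-Z-]+>)", pattern):
        if part.startswith("<") and part.endswith(">"):
            name = part[1:-1]
            # Try longer phrases first so 'home port' is not cut short.
            words = sorted(CATEGORIES[name], key=len, reverse=True)
            alternatives = "|".join(re.escape(w) for w in words)
            regex += "(?P<%s>%s)" % (name.replace("-", "_"), alternatives)
        else:
            regex += re.escape(part)
    return re.compile("^" + regex + "$", re.IGNORECASE)

def interpret(sentence, pattern, build):
    """Match the sentence against the pattern; on success, call the
    definer-supplied expression `build` with the matched category values."""
    match = compile_pattern(pattern).match(sentence.strip())
    if match is None:
        return None
    return build(**{k: v.lower() for k, v in match.groupdict().items()})

def retrieval(ATTRIBUTE, SHIP):
    """Hypothetical interpretation: a database retrieval command as a tuple."""
    return ("RETRIEVE", ATTRIBUTE, "FROM", "ships", "WHERE", "name", SHIP)

query = interpret("What is the length of Kennedy?",
                  "What is the <ATTRIBUTE> of <SHIP>?", retrieval)
```

Each new kind of question needs its own pattern and interpretation expression, which is the source of the pattern growth noted for this approach.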
LIFER has proven to be effective as a front end for a number of systems.
The main disadvantage, as noted earlier, is the potentially large number of patterns
that may be required for a system which must accept many diverse patterns.
syntactic and semantic analysis, as well as the reasoning process are more closely
integrated.
The system can be roughly divided into four component domains: (1) a syntactic
parser which is governed by a large English (systemic type) grammar, (2) a semantic
component of programs that interpret the meanings of words and structures, (3) a
cognitive deduction component used to examine consequences of facts, carry out
commands, and find answers, and (4) an English response generation component.
In addition, there is a knowledge base containing blocks world knowledge, and a
model of its own reasoning process, used to explain its actions.
Knowledge is represented with FOPL-like statements which give the state of
the world at any particular time and procedures for changing and reasoning about
the state. For example, the expressions
(IS b1 block)
(IS b2 pyramid)
(AT b1 (LOCATION 120 120 0))
(SUPPORT b1 b2)
(CLEARTOP b2)
(MANIPULATE b1)
(IS blue color)
12.8 SUMMARY
Grammars were formally introduced, and the Chomsky hierarchy was presented.
This was followed with a description of structural representations for sentences,
the phrase marker. Four additional extended grammars were briefly described. One
was the transformational grammars, an extension of generative grammars. Transfor-
mational grammars include tree manipulation rules that permit the construction of
deeper semantic structures than the generative grammars. Case, semantic, and sys-
temic grammars were given as examples of grammars that are also more semantic
oriented than the generative grammars.
Lexicons were described, and the role they play in NL systems given. Basic
parsing techniques were examined. We looked at simple transition networks, recursive
transition networks, and the versatile ATN. The ATN includes tests and actions as
part of the arc components and special registers to help in building syntactic structures.
With an ATN, extensive semantic analysis is even possible. We defined top-down,
bottom-up, deterministic, and nondeterministic parsing methods, and an example
of a simple PROLOG parser was also discussed.
We next looked at the semantic interpretation process and discussed two broad
approaches, namely the lexical and compositional semantic approaches. These
approaches are also identified with the type of target knowledge structures generated.
In the compositional semantics approach, logical forms were generated, whereas in
the lexical semantics approach, conceptual dependency or similar network structures
are created.
Language generation is approximately the opposite of the understanding analysis
process, although more difficult. Not only must a system decide what to say but
how to say it. Generation falls naturally into three areas: content determination,
text planning, and text realization. Two general approaches were presented. They
are like the inverses of the lexical and compositional semantic analysis processes.
The KAMP system uses an elaborate planning process to determine what, when,
and how to state some concepts. The system simulates a robot giving advice to a
human helper in the repair of air compressors. At the other extreme, the BABEL
system generates output text from conceptual dependency and script structures.
We concluded the chapter with a look at three systems of somewhat disparate
architectures: the LUNAR, LIFER, and SHRDLU systems. These systems typify
the state of the art in natural language processing systems.
EXERCISES
12.1. Derive a parse tree for the sentence "Bill loves the frog." where the following
rewrite rules are used.
S → NP VP
NP → N
NP → DET N
VP → V NP
DET → the
V → loves
N → bill | frog
12.2. Develop a parse tree for the sentence "Jack slept on the table" using the following
rules.
S → NP VP
NP → N
NP → DET N
VP → V PP
PP → PREP NP
N → jack | table
V → slept
DET → the
PREP → on
12.3. Give an example of each of the four types 0, 1, 2, and 3 for Chomsky's hierarchy
of grammars.
12.4. Modify the grammar of Problem 12.1 to allow the NP (noun phrase) to have zero
to many adjectives.
12.5. Explain the main differences between the following three grammars and describe
the principal features that could be used to develop specifications for a syntactical
recognition program. Consult additional references for more details regarding each
grammar.
Chomsky's Transformational Grammar
Fillmore's Case Grammar
Systemic Grammars
12.6. Draw an ATN to implement the grammar of Problem 12.1.
12.7. Given the following parse tree, write down the corresponding context free grammar.
[Parse tree figure; only the fragment NP with children DET ADJ N is recoverable.]
12.8. Create a LISP data structure to model a simple lexicon similar to the one depicted
in Figure 12.6.
12.9. Write a LISP match program which checks an input sentence for matching words in
the lexicon of the previous problem.
12.10. Derive an ATN for the parse tree of Problem 12.7.
12.11. Devise an ATN graph to implement the parse tree of Problem 12
12.12. Determine if the following sentences will be accepted by the grammar of Problem
12.6.
(a) The green green grass of the home
(b) The red car drove in the last lane.
12.13. Write PROLOG rules to implement the grammar used to derive the parse tree of
Problem 12.7. Omit rules for the individual word categories (like noun ([ball . . .
Generate a syntax tree using one output parameter.
12.14. Write a PROLOG program that will take grammar rules in the following format:
NT → (NT | T)*
where NT is any nonterminal, T is any terminal, and Kleene star signifies any
number of repetitions, and generate the corresponding top-down parser; that is,
12.15. Modify the program in Problem 12.12 to accept extra arguments used to return
meaningful knowledge structures.
12.16. Write a LISP program which uses property lists to create the recursive transition
network depicted in Figure 12.9. Each node should be given a name such as S1,
N1, and P1 and associated with a list of arc and node pairs emanating from the
node.
12.17. Write a recursive program in LISP which tests input sentences for the RTN developed
in the previous problem. The program should return t if the sentence is acceptable,
and nil if not.
12.18. Modify the program of Problem 12.15 to accept sentences of the type depicted in
Figure 12.12.
12.19. Write an ATN type of program as depicted in Figure 12.12 which builds structures
like those of Figure 12.13.
12.20. Describe in detail the differences between language understanding and language
generation. Explain the problems in developing a program which is capable of carrying on
a dialog with a group of people.
12.21. Give the processing steps required and corresponding data structures needed for a
robot named Rob to formulate instructions for a helper named John to complete a
university course add-drop request form.
12.22. Give the conceptual dependency graph for the sentence "Mary drove her car to
school" and describe the steps required for a program to transform the sentence to
an internal conceptual dependency structure.
Pattern Recognition
One of the most basic and essential characteristics of living things is the ability to
recognize and identify objects. Certainly all higher animals depend on this ability
for their very survival. Without it they would be unable to function even in a
static, unchanging environment.
In this chapter we consider the process of computer pattern recognition, a
process whereby computer programs are used to recognize various forms of input
stimuli such as visual or acoustic (speech) patterns. This material will help to round
out the topic of natural language understanding when speech, rather than text, is
the language source. It will also serve as an introduction to the following chapter
where we take up the general problem of computer vision.
Although some researchers feel that pattern recognition should no longer be
considered a part of AI, we believe many topics from pattern recognition are essential
to an understanding and appreciation of important concepts related to natural language
understanding, computer vision, and machine learning. Consequently, we have
included in this chapter a selected number of those topics believed to be important.
13.1 INTRODUCTION
Recognition is the process of establishing a close match between some new stimulus
and previously stored stimulus patterns. This process is being performed continually
throughout the lives of all living things. In higher animals this ability is manifested
in many forms at both the conscious and unconscious levels, for both abstract as
well as physical objects. Through visual sensing and recognition, we identify many
special objects, such as home, office, school, restaurants, faces of people, handwriting,
and printed words. Through aural sensing and recognition, we identify familiar
voices, songs and pieces of music, and bird and other animal sounds. Through
touch, we identify physical objects such as pens, cups, automobile controls, and
food items. And through our other senses we identify foods, fresh air, toxic substances,
and much else.
At more abstract levels of cognition, we recognize or identify such things as
ideas (electromagnetic radiation phenomena, model of the atom, world peace),
concepts (beauty, generosity, complexity), procedures (game playing, making a bank
deposit), plans, old arguments, metaphors, and so on.
Our pervasive use of and dependence on our ability to recognize patterns has
motivated much research toward the discovery of mechanical or artificial methods
comparable to those used by intelligent beings. The results of these efforts to date
have been impressive, and numerous applications have resulted. Systems have now
been developed to reliably perform character and speech recognition; fingerprint
and photograph identification; electroencephalogram (EEG), electrocardiogram
(ECG), oil log well, and other graphical pattern analyses; various types of medical
and system diagnoses; resource identification and evaluation (geological, forestry,
hydrological, crop disease); and detection of explosive and hostile threats (submarine,
aircraft, missile), to name a few.
Object classification is closely related to recognition. The ability to classify
or group objects according to some commonly shared features is a form of class
recognition. Classification is essential for decision making, learning, and many other
cognitive acts. Like recognition, classification depends on the ability to discover
common patterns among objects. This ability, in turn, must be acquired through
some learning process. Prominent feature patterns which characterize classes of
objects must be discovered, generalized, and stored for subsequent recall and
comparison.
We do not know exactly how humans learn to identify or classify objects;
however, it appears the following processes take place:
New objects are introduced to a human through activation of sensor stimuli. The
sensors, depending on their physical properties, are sensitive in varying degrees to
certain attributes which serve to characterize the objects, and the sensor output tends
to be proportional to the more prominent attributes. Having perceived a new object,
a cognitive model is formed from the stimuli patterns and stored in memory. Recurrent
experiences in perceiving the same or similar objects strengthen and refine the similarity
13.2 THE RECOGNITION AND CLASSIFICATION PROCESS
There are two basic approaches to the recognition problem: (1) the decision-theoretic
approach and (2) the syntactic approach.
[Figure 13.1 The pattern recognition process.]
The decision theoretic approach is based on the use of decision functions to classify
objects. A decision function maps pattern vectors X into decision regions of D.
More formally, this problem can be stated as follows.
1. Given a universe of objects $O = \{o_1, o_2, \ldots, o_n\}$, let each $o_i$ have $k$ observable
attributes and relations expressible as a vector $V = (v_1, v_2, \ldots, v_k)$.
2. Determine (a) a subset of $m \leq k$ of the $v_i$, say $X = (x_1, x_2, \ldots, x_m)$,
whose values uniquely characterize the $o_i$, and (b) $c \geq 2$ groupings or classifications
of the $o_i$ which exhibit high intraclass and low interclass similarities
such that a decision function $d(X)$ can be found which partitions $D$ into $c$
disjoint regions. The regions are used to classify each $o_i$ as belonging to at
most one of the $c$ classes.
$M \rightarrow F \rightarrow D$
When there are only two classes, say $C_1$ and $C_2$, the values of the objects'
pattern vectors may tend to cluster into two disjoint groups. In this case, a linear
decision function $d(X)$ can often be used to determine an object's class. For example,
when the classes are clustered as depicted in Figure 13.2, a linear decision function
$d$ is adequate to classify unknown objects as belonging to either $C_1$ or $C_2$, where
an object is classified as belonging to class $C_1$ when $d(X) < 0$ and as
belonging to class $C_2$ when $d(X) > 0$. When $d(X) = 0$ the classification is
indeterminate, so either (or neither) class may be selected.
[Figure 13.2 A linear decision function.]
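A two-class linear decision rule of this kind can be sketched as follows. The weighted-sum form of d(X), the particular weights, and the function names are assumptions chosen for illustration; the only part taken from the text is the sign test on d(X).

```python
def linear_decision(weights, bias):
    """Build d(X) = w1*x1 + ... + wm*xm + bias (an assumed linear form)."""
    def d(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return d

def classify(d, x):
    """Apply the sign test on d(X); None marks the indeterminate case."""
    value = d(x)
    if value < 0:
        return "C1"
    if value > 0:
        return "C2"
    return None  # d(X) = 0: the point lies on the decision line

# Hypothetical boundary: the line x1 + x2 = 5.
d = linear_decision([1.0, 1.0], -5.0)
labels = [classify(d, p) for p in [(1, 1), (4, 4), (2, 3)]]
```

Here the first point falls on the negative side of the line, the second on the positive side, and the third exactly on it.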
When class reference vectors, or prototypes, $R_j$, $j = 1, \ldots, c$, are available,
decision functions can be defined in terms of the distance of the $X$ from the reference
vectors. For example, the distance
$d_j(X) = (X - R_j)^T (X - R_j)$
could be computed for each class $C_j$, and class $C_k$ would then be chosen when
$d_k = \min_j \{d_j\}$.
For the general case of $c \geq 2$ classes, $C_1, C_2, \ldots, C_c$, a decision function
may be defined for each class, $d_1, d_2, \ldots, d_c$. A class decision rule in this case
would be defined to select class $C_j$ when
$d_j(X) < d_i(X)$ for $i, j = 1, 2, \ldots, c$, and $i \neq j$.
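The distance-based rule can be illustrated with a minimum-distance (nearest prototype) classifier. This is a minimal sketch: the prototype vectors and class labels are made up for the example, and the squared Euclidean distance is used as the decision function for every class.

```python
def squared_distance(x, r):
    """d_j(X) = (X - R_j)^T (X - R_j), the squared Euclidean distance."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, r))

def nearest_prototype(x, prototypes):
    """Select the class whose decision function value is smallest."""
    return min(prototypes, key=lambda label: squared_distance(x, prototypes[label]))

# Hypothetical reference vectors for three classes.
prototypes = {"C1": (0, 0), "C2": (10, 10), "C3": (0, 10)}
chosen = [nearest_prototype(p, prototypes) for p in [(1, 2), (9, 8), (1, 9)]]
```

When two prototypes are equally close, `min` simply returns the first one encountered; a fuller treatment would report such points as indeterminate.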
When a line $d$ (or more generally a hyperplane in $n$-space) can be found that
separates classes into two or more groups as in the case of Figure 13.2, we say
the classes are linearly separable. Classes that overlap each other or surround one
another, as in Figure 13.3, cannot generally be classified with the use of simple
linear decision functions. For such cases, more general nonlinear (or piecewise linear)
functions may be required. Alternatively, some other selection technique (like
heuristics) may be needed.
The decision function approach described above is an example of deterministic
recognition since the $x_i$ are deterministic variables. In cases where the attribute
values are affected by noise or other random fluctuations, it may be more appropriate
to define probabilistic decision functions. In such cases, the attribute vectors $X$ are
treated as random variables, and the decision functions are defined as measures of
likelihood of class inclusion. For example, using Bayes' rule, one can compute the
conditional probability $P(C_i \mid X)$ that the class of an object $o_j$ is $C_i$, given the observed
value of $X$ for $o_j$. This approach requires a knowledge of the prior probability
$P(C_i)$, the probability of the occurrence of samples from $C_i$, as well as $P(X \mid C_i)$.
[Figure 13.3 Examples of nonlinearly separable classes.]
(Note that the $C_i$ are treated like random variables here. This is equivalent to the
assumption made in Bayesian classification where the distribution parameter $\theta$ is
assumed to be a random variable, since $C_i$ may be regarded as a function of $\theta$.) A
decision rule for this case is to choose class $C_j$ if
$P(C_j \mid X) > P(C_i \mid X)$ for all $i \neq j$.
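The posterior comparison can be sketched directly from Bayes' rule. The priors and likelihood values below are invented numbers for a single observed X, and the function names are assumptions for the example.

```python
def posteriors(priors, likelihoods):
    """Bayes' rule: P(C|X) = P(X|C) P(C) / sum over classes of P(X|C) P(C).

    priors      -- {class: P(C)}
    likelihoods -- {class: P(X|C)} evaluated at the observed X
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

def bayes_classify(priors, likelihoods):
    """Choose the class with the largest posterior probability."""
    post = posteriors(priors, likelihoods)
    return max(post, key=post.get)

# Hypothetical values: C1 is more common, but X is more typical of C2.
priors = {"C1": 0.7, "C2": 0.3}
likelihoods = {"C1": 0.2, "C2": 0.6}
post = posteriors(priors, likelihoods)
decision = bayes_classify(priors, likelihoods)
```

With these numbers the joint terms are 0.14 for C1 and 0.18 for C2, so the observation outweighs the prior and C2 is selected.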
A more comprehensive probabilistic approach is one which is based on the
use of a Bayesian loss or risk function where the class is chosen on the basis of
minimum loss or risk. Let the loss function $L_{ij}$ denote the loss incurred by incorrectly
classifying an object actually belonging to class $C_i$ as belonging to $C_j$. When $L_{ij}$ is
a constant for all $i, j$, $i \neq j$, a decision rule can be formulated using the likelihood
ratio defined as (see Chapter 6)
$\frac{P(X \mid C_k)}{P(X \mid C_j)}$
The rule is to choose class $C_k$ whenever the relation
$\frac{P(X \mid C_k)}{P(X \mid C_j)} > \frac{P(C_j)}{P(C_k)}$
holds for all $j \neq k$.
Probabilistic decision rules may be constructed as either parametric or nonparametric
depending on whether the distribution forms are known or unknown, respectively. For a
comprehensive treatment of these methods see (Duda and Hart, 1973) or (Tou and Gonzalez,
1974).
Syntactic Classification
[Figure 13.4 Syntactic characterization of objects.]
Using syntactic analysis, that is, parsing and analyzing the string structures,
classification is accomplished by assigning an object to class $C_i$ when the string
describing it has been generated by the grammar $G_i$. This requires that the string
be recognized as a member of the language $L(G_i)$. If there are only two classes, it
is sufficient to have a single grammar $G$ (two grammars are needed when strings
of neither class can occur).
When classification for $c \geq 2$ classes is required, $c - 1$ (or $c$) different grammars
are needed for class recognition. The decision functions in this case are based on
grammar recognition functions which choose class $C_i$ if the pattern string is found
to be generated by grammar $G_i$, that is, if it is a member of $L(G_i)$. Patterns not
recognized as a member of a defined language are indeterminate.
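Grammar-based classification can be sketched with regular expressions standing in for the class grammars, since a regular expression recognizes the language of a regular (type 3) grammar. The two grammars and the primitive symbols below are hypothetical, and treating a string accepted by more than one grammar as indeterminate is an assumption of this sketch.

```python
import re

# Hypothetical class grammars over primitive symbols, written as regular
# expressions (each recognizes the language L(G_i) of one class).
GRAMMARS = {
    "C1": re.compile(r"a+b+"),   # one or more a's followed by one or more b's
    "C2": re.compile(r"(ab)+"),  # one or more repetitions of the pair ab
}

def classify_string(pattern_string):
    """Return the class whose grammar generates the string, or None when
    the string is unrecognized or accepted by more than one grammar."""
    accepted = [label for label, grammar in GRAMMARS.items()
                if grammar.fullmatch(pattern_string)]
    return accepted[0] if len(accepted) == 1 else None

results = {s: classify_string(s) for s in ["aabb", "abab", "ba", "ab"]}
```

The string "ab" belongs to both languages here, which shows why overlapping class grammars leave some patterns undecided.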
When patterns are noisy or subject to random fluctuations, ambiguities may
occur since patterns belonging to different classes may appear to be the same. In
such cases, stochastic or fuzzy grammars may be used. Classification for these
cases may be made on the basis of least cost to transform an input string into a
valid recognizable string, by the degree of class set inclusion, or with a similarity
measure using one of the methods described in Chapter 10.
13.3 LEARNING CLASSIFICATION PATTERNS
Before a system can recognize objects, it must possess knowledge of the characteristic
features for those objects. This means that the system designer must either build
the necessary discriminating rules into the system or the system must learn them.
In the case of a linear decision function, the weights that define class boundaries
must be predefined or learned. In the case of syntactic recognition, the class grammars
must be predefined or learned.
Learning decision functions, grammars, or other rules can be performed in
either of two ways, through supervised learning or unsupervised learning. Supervised
learning is accomplished by presenting training examples to a learning unit. The
examples are labeled beforehand with their correct identities or class. The attribute
values and object labels are used by the learning component to inductively extract
and determine pattern criteria for each class. This knowledge is used to adjust
parameters in decision functions or grammar rewrite rules. Supervised learning
concepts are discussed in some detail in Part V. Therefore, we concentrate here on
some of the more important notions related to unsupervised learning.
In unsupervised learning, labeled training examples are not available and little
is known beforehand regarding the object population. In such cases, the system
must be able to perceive and extract relevant properties from the otherwise unknown
objects, find common patterns among them, and formulate descriptions or discrimination
criteria consistent with the goals of the recognition process.
This form of learning is known as clustering. It is the first step in any recognition
process where discriminating features of objects are not known in advance.
(T) - t )"
When m is unknown, the number of arrangements increases as the sum of the S, that is, as ΣS. For
example, when n = 25, the number of arrangements is more than
clusters respectively. During the clustering process, the thresholds are used to
determine if a cluster should be split into two clusters, merged with other clusters, or
discarded (when too small). The algorithm is given with the following steps.
1. Select $m$ samples as seed points for initial cluster centers. This can be done
by taking the first $m$ points, selecting random points, or by taking the first $m$
points which exceed some mutual minimum separation distance $d$.
2. Group each sample with its nearest cluster center.
3. After all samples have been grouped, compute new cluster centers for each
group. The center can be defined as the centroid (mean value of the attribute
vectors) or some similar central measure.
4. If the split threshold $t_1$ is exceeded for any cluster, split it into two parts and
recompute new cluster centers.
5. If the distance between two cluster centers is less than $t_2$, combine the clusters
and recompute new cluster centers.
6. If a cluster has fewer than $t_3$ members, discard the cluster. It is ignored for
the remainder of the process.
7. Repeat steps 3 through 6 until no change occurs among cluster groupings or
until some iteration limit has been exceeded.
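The seven steps can be sketched in Python as shown below. Several details are assumptions made to obtain something runnable, not part of the stated algorithm: the split test compares the distance of a cluster's farthest member from its center with t1, a split seeds two new centers from the cluster's extreme points, merged centers are replaced by their midpoint, and samples are regrouped on every pass. All function names are hypothetical.

```python
import math

def centroid(points):
    """Step 3: mean of the attribute vectors in one group."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def assign(samples, centers):
    """Step 2: group each sample with its nearest cluster center."""
    groups = [[] for _ in centers]
    for s in samples:
        nearest = min(range(len(centers)), key=lambda i: math.dist(s, centers[i]))
        groups[nearest].append(s)
    return groups

def split_wide(groups, centers, t1):
    """Step 4 (assumed test): split a cluster whose farthest member lies more
    than t1 from its center, seeding two centers from its extreme points."""
    out = []
    for g, c in zip(groups, centers):
        far = max(g, key=lambda p: math.dist(p, c))
        if math.dist(far, c) > t1:
            other = max(g, key=lambda p: math.dist(p, far))
            out.extend([far, other])
        else:
            out.append(c)
    return out

def merge_close(centers, t2):
    """Step 5: combine centers closer than t2 (here, into their midpoint)."""
    merged = []
    for c in centers:
        for i, k in enumerate(merged):
            if math.dist(c, k) < t2:
                merged[i] = tuple((a + b) / 2 for a, b in zip(c, k))
                break
        else:
            merged.append(c)
    return merged

def cluster(samples, m, t1, t2, t3, max_iter=20):
    active = [tuple(s) for s in samples]
    centers = active[:m]                        # step 1: first m points as seeds
    previous, kept = None, []
    for _ in range(max_iter):                   # step 7: iteration limit
        groups = [g for g in assign(active, centers) if g]
        kept = [g for g in groups if len(g) >= t3]   # step 6: drop small clusters
        active = [p for g in kept for p in g]        # discarded samples are ignored
        centers = [centroid(g) for g in kept]
        if kept == previous:                    # step 7: groupings unchanged
            break
        previous = kept
        centers = merge_close(split_wide(kept, centers, t1), t2)
    return kept, [centroid(g) for g in kept]

samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups, centers = cluster(samples, m=1, t1=3.0, t2=2.0, t3=1)
```

Starting from a single seed, the first pass produces one wide cluster, the split step divides it, and the following passes settle on the two compact groups.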
Measures for determining distances and the center location need not be based
on ordered variates. They may be one of the measures described in Chapter 10
(including probabilistic or fuzzy measures) or some measure of similarity between
graphs, strings, and even FOPL descriptions. In any case, it is assumed each object
$o_i$ is described by a unique point or event in the feature space $F$.
Up to this point we have ignored the problem of attribute scaling. It is possible
that a few large valued variables may completely dominate the other variables in a
similarity measure. This could happen, for example, if one variable is measured in
units of meters and another variable in millimeters or if the range and scale of
variation for two variables are widely different. This problem is closely related to
the feature selection problem, that is, in the assignment of weights to feature variables
on the basis of their importance or relevance. One simple method for adjusting the
scales of such variables is to use a diagonal weight matrix W to transform the
representation vector X to X' = WX. Thus, for all of the measures described
above, one should assume the representation vectors X have been appropriately
normalized to account for scale variations.
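The effect of the transformation X' = WX with a diagonal W can be sketched as follows. Choosing each weight as the reciprocal of the variable's observed range is only one possible normalization and is an assumption of this example, as are the function names and sample values.

```python
def range_weights(samples):
    """Diagonal of W: one weight per variable, here 1 / (max - min)."""
    weights = []
    for i in range(len(samples[0])):
        values = [s[i] for s in samples]
        spread = max(values) - min(values)
        weights.append(1.0 / spread if spread else 1.0)
    return weights

def scale(x, weights):
    """X' = WX for diagonal W reduces to element-wise multiplication."""
    return tuple(w * xi for w, xi in zip(weights, x))

# Hypothetical data: the first variable is in meters, the second in millimeters.
samples = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
w = range_weights(samples)
scaled = [scale(s, w) for s in samples]
```

After scaling, both variables vary over a range of one unit, so neither dominates a distance computed from them.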
To summarize the above process, a subset of characteristic features which
represent the $o_i$ is first selected. The features chosen should be good discriminators
in separating objects from different classes, relevant, and measurable (observable)
at reasonable cost. Feature variables should be scaled as noted above to prevent
any swamping effect when combined due to large valued variables. Next, a suitable
metric which measures the degree of association or similarity between objects should
be chosen, and an appropriate clustering algorithm selected. Finally, during the
clustering process, the feature variables may need to be weighted to reflect the
relative importance of the feature in affecting the clustering.
13.4 RECOGNIZING AND UNDERSTANDING SPEECH
Developing systems that understand speech has been a continuing goal of AI
researchers. Speech is one of our most expedient and natural forms of communication, and
so understandably, it is a capability we would like AI systems to possess. The
ability to communicate directly with programs offers several advantages. It eliminates
the need for keyboard entries and speeds up the interchange of information between
user and system. With speech as the communication medium, users are also free
to perform other tasks concurrently with the computer interchange. And finally,
more untrained personnel would be able to use computers in a variety of applications.
The recognition of continuous waveform patterns such as speech begins with
sampling and digitizing the waveforms. In this case the feature values are the sampled
points $x_i = f(t_i)$, as illustrated in Figure 13.5.
It is known from information theory that a sampling rate of twice the highest
speech frequency is needed to capture the information content of the speech wave-
forms. Thus, sampling requirements will normally be equivalent to 20K to 30K
bytes per second. While this rate of information in itself is not too difficult to
handle, this, added to the subsequent processing, does place some heavy requirements
on real time understanding of speech.
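The sampling step can be sketched as follows. The 10 kHz upper frequency, the test tone, and the use of 256 quantization levels (one byte per sample) are assumptions chosen so that the resulting data rate falls in the range quoted above; the function name is hypothetical.

```python
import math

def sample_waveform(f, duration, highest_freq, levels=256):
    """Sample f(t) at twice the highest frequency and quantize each sample.

    f is assumed to return values in [-1, 1]; each sample is mapped to an
    integer in 0..levels-1 (one byte when levels is 256).
    """
    rate = 2 * highest_freq                     # samples per second
    count = int(duration * rate)
    samples = []
    for i in range(count):
        value = f(i / rate)                     # x_i = f(t_i)
        samples.append(round((value + 1) / 2 * (levels - 1)))
    return rate, samples

# Hypothetical input: half a second of a 440 Hz tone, band limit 10 kHz.
rate, samples = sample_waveform(lambda t: math.sin(2 * math.pi * 440 * t),
                                duration=0.5, highest_freq=10000)
```

Under these assumptions the sampler produces 20,000 one-byte samples per second, before any of the later processing stages have done their work.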
Following sample digitization, the signals are processed at different levels of
abstraction. The lowest level deals with phonemes (the smallest unit of sound), allophones
(variations of the phoneme as they actually occur in words), and syllables. Higher
level processing deals with words, phrases, and sentences.
The processing approach may be from the bottom, top, or a combination of
both. When bottom processing is used, the input signal is segmented into basic
speech units and a search is made to match prestored patterns against these units.
Knowledge about the phonetic composition of words is stored in a lexicon for
comparisons. For the top approach, syntax, semantics (the domain), and pragmatics
(context) are used to anticipate which words the speaker is likely to have said and
direct the search for recognizable patterns. A combined approach which uses both
methods has also been applied successfully.
Early research in speech recognition concentrated on the recognition of isolated
words. Patterns of individual words were prestored and then compared to the digitized
input patterns. These early systems met with limited success. They were unable to
tolerate variations in speaker voices and were highly susceptible to noise. Although
important, this early work helped little with the general problem of continuous
speech understanding since words appearing as part of a continuous stream differ
significantly from isolated words. In continuous speech, words are run together,
modified, and truncated to produce a great variation of sounds. Thus, speech analysis
must be able to detect different sounds as being part of the same word, but in
different contexts. Because of the noise and variability, recognition is best
accomplished with some type of fuzzy comparison.
In 1971 the Defense Advanced Research Projects Agency (DARPA) funded
a five-year program for continuous speech understanding research (SUR). The
objective of this research was to design and implement systems that were capable of
accepting continuous speech from several cooperative speakers using a limited
vocabulary of some 1000 words. The systems were expected to run at slower than real
time speeds. This research produced several systems, including HEARSAY
I and II, HARPY, and HWIM. While the systems were only moderately successful
in achieving their goals, the research produced other important byproducts as well,
particularly in systems architectures, and in the knowledge gained regarding control.
The HEARSAY system was important for its introduction of the blackboard
architecture (Chapter 15). This architecture is based on the cooperative efforts of
several specialist knowledge components communicating by way of a blackboard
in the solution of a class of problems. The specialists are each expert in a different
area. For example, speech analysis experts might each deal with a different level
of the speech problem. The solution process is opportunistic, with each expert making
a contribution when it can. The solution to a given problem is developed as a data
structure on the blackboard. As the solution is developed, this data structure is
modified by the contributing expert. A description of the systems developed under
SUR is given in Barr and Feigenbaum (1981).
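The blackboard idea can be sketched with a shared data structure and a few specialist functions that each contribute when their input is available. This is a toy illustration of the control pattern only, not the HEARSAY design: the specialists, the hyphen-delimited "signal", and all names are hypothetical.

```python
class Blackboard:
    """Shared data structure on which the solution is developed."""
    def __init__(self, signal):
        self.data = {"signal": signal}

def segmenter(bb):
    """Specialist: break the signal into syllable-like units."""
    if "signal" in bb.data and "syllables" not in bb.data:
        bb.data["syllables"] = bb.data["signal"].split("-")
        return True
    return False

def word_builder(bb):
    """Specialist: assemble a word hypothesis once syllables are posted."""
    if "syllables" in bb.data and "word" not in bb.data:
        bb.data["word"] = "".join(bb.data["syllables"])
        return True
    return False

def run(bb, specialists):
    """Let each specialist contribute whenever it can, until none can."""
    progress = True
    while progress:
        progress = False
        for specialist in specialists:
            if specialist(bb):
                progress = True
    return bb.data

# The order of the specialists does not matter: each acts opportunistically.
result = run(Blackboard("com-pu-ter"), [word_builder, segmenter])
```

Because each specialist checks the blackboard for what it needs, the word builder simply waits until the segmenter has posted its result.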
13.5 SUMMARY
Pattern recognition systems are used to identify or classify objects on the basis of
their attribute and attribute-relation values. Recognition may be accomplished with
decision functions or structural grammars. The decision functions as well as the
grammars may be deterministic, probabilistic, or fuzzy.
Before recognition can be accomplished, a system must learn the criteria for
object recognition. Learning may be accomplished by direct designer encoding,
supervised learning, or unsupervised learning. When unsupervised learning is
required, some form of clustering may be performed to learn the object class
characteristics.
Speech understanding first requires recognition of basic speech patterns. These
patterns are matched against lexicon patterns for recognition. Basic speech units
such as phonemes are the building blocks for longer units such as syllables and
words.
EXERCISES
13.1. Choose three common objects and determine five of their most discriminating visual
attributes.
13.2. For the previous problem, determine three additional nonvisual attributes for the
objects which are most discriminating.
13.3. Find a linear decision function which separates the following x-y points into two
distinct classes.
Visual Image
Understanding
Vision is perhaps the most remarkable of all of our intelligent sensing capabilities.
Through our visual system, we are able to acquire information about our environment
without direct contact. Vision permits us to acquire information at a phenomenal
rate and at resolutions that are most impressive. For example, one only needs to
compare the resolution of a TV camera system to that of a human to see the difference.
Roughly speaking, a TV camera has a resolution on the order of 500 parts per
square cm, while the human eye has a limiting resolution on the order of some
$25 \times 10^6$ parts per square cm. Thus, humans have a visual resolution several orders
of magnitude better (more than 10,000 times finer) than that of a TV camera.
What is even more remarkable is the ease with which we humans sense and perceive
a variety of visual images. It is so effortless, we are seldom conscious of the act.
In this chapter, we examine the processes and the problems involved in building
computer vision systems. We look at some of the approaches taken thus far and at
some of the more successful vision systems constructed to date.
14.1 INTRODUCTION
Because of its wide ranging potential, computer vision has become one of the most
intensely studied areas of AI and engineering during the past few decades. Some
typical areas of application include the following.
MANUFACTURING
MEDICAL
DEFENSE
BUSINESS
ROBOTICS
SPACE EXPLORATION
[Figure 14.1 The basic vision process in humans (labels: illumination, transparent lens, retina).]
together with some form of inference. The basic vision process as it occurs in
humans is depicted in Figure 14.1.
Light from illuminated objects is collected by the transparent lens of the eye,
focused, and projected onto the retina where some 250 million light sensitive sensors
(cones and rods) are excited. When excited, the sensors send impulses through the
optic nerve to the visual cortex of the occipital lobes of the brain where the images
are interpreted and recognized.
Computer vision systems share some similarities with human visual systems,
at least as we now understand them. They also have a number of important differences.
Although artificial vision systems vary widely with the specific application, we
adopt a general approach here, one in which the ultimate objective is to determine
a high-level description of a three-dimensional scene with a competency level
comparable to that of human vision systems. Before proceeding further we should distinguish
between a scene and an image of a scene. A scene is the set of physical objects in
a picture area, whereas an image is the projection of the scene onto a two-dimensional
plane.
With the above objectives in mind, a typical computer vision system should
be able to perform the following operations:
[Figure: stages of a vision system: image sensor, low level, intermediate level, high level, semantic description]
The input to a vision system is a two dimensional image collected on some form
of light sensitive surface. This surface is scanned by some means to produce a
continuous voltage output that is proportional to the light intensity of the image on
the surface. The output voltage f(x, y) is sampled at a discrete number of x and y
points or pixel (picture element) positions and converted to numbers. The numbers
correspond to the gray level intensity for black and white images. For color images,
the intensity value is comprised of three separate arrays of numbers, one for the
intensity value of each of the basic colors (red, green, and blue).
Thus, through the digitization process, the image is transformed from a continuous
light source into an array of numbers which correspond to the local image
intensities at the corresponding x-y pixel positions on the light sensitive surface.
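The sampling and conversion step can be illustrated with a minimal sketch. It assumes the sensor output is available as a function f(x, y) returning a normalized intensity between 0 and 1; the names digitize and ramp, and the choice of 256 gray levels, are illustrative assumptions rather than part of any particular vision system.

```python
def digitize(f, width, height, levels=256):
    """Sample f(x, y) at integer pixel positions and quantize to gray levels.

    f is assumed to return a normalized intensity between 0.0 and 1.0.
    """
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            value = min(max(f(x, y), 0.0), 1.0)               # clamp to the valid range
            row.append(min(int(value * levels), levels - 1))  # integer in 0..levels-1
        image.append(row)
    return image


def ramp(x, y):
    """Hypothetical sensor output: brightness rises from left to right."""
    return x / 3.0


pixels = digitize(ramp, width=4, height=2)
print(pixels)
```

For a color image the same routine would be run three times, once for each of the red, green, and blue intensity functions.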
Using the array of numbers, certain low level operations are performed, such
as smoothing of neighboring points to reduce noise, finding outlines of objects or
edge elements, thresholding (recording maximum and minimum values only, depending
on some fixed intensity threshold level), and determining texture, color, and
other object features. These initial processing steps are ones which are used to
locate and accentuate object boundaries and other structure within the image.
The next stage of processing, the intermediate level, involves connecting,
filling in, and combining boundaries, determining regions, and assigning descriptive
labels to objects that have been accentuated in the first stage. This stage builds
higher level structures from the lower level elements of the first stage. When complete,
it passes on labeled surfaces such as geometrical objects that may be capable of
identification.
High-level image processing consists of identifying the important objects in
the image and their relationships for subsequent description as well-defined knowledge
structures and hence, for use by a reasoning component.
Special types of vision systems may also require three dimensional processing
and analysis as well as motion detection and analysis.
The ultimate goal of computer image understanding is to build systems that equal
or exceed the capabilities of human vision systems. Ideally, a computer vision
system would be capable of interpreting and describing any complex scene in complete
detail. This means that the system must not only be able to identify a myriad of
complex objects, but must also be able to reason about the objects, to describe
their function and purpose, what has taken place in the scene, why any visible or
implied events occurred, what is likely to happen, and what the objects in the
scene are capable of doing.
Figure 14.3 presents an example of a complex scene that humans can interpret
well with little effort. It is the objective of many researchers in computer vision to
build systems capable of interpreting, describing, and reasoning about scenes of
this type in real time. Unfortunately, we are far from achieving this level of competency.
To be sure, some interesting vision systems have been developed, but they
are quite crude compared to the elegant vision systems of humans.
Like natural language understanding, computer vision interpretation is a difficult
problem. The amount of processing and storage required to interpret and describe
a complex scene can be enormous. For example, a single image for a high resolution
aerial photograph may result in some four to nine million pixels (bytes) of information
and require on the average some 10 to 20 computations per pixel. Thus, when
several frames must be stored during processing, as many as 100 megabytes of
storage may be needed, and more than 100 million computations performed.
In this section, we examine the first stages of processing. This includes the process
of forming an image and transforming it to an array of numbers which can then be
operated on by a computer. In this first stage, only local processing is performed
on the numbers to reduce noise and other unwanted picture elements, and to accentuate
object boundaries.
The first step in image processing requires a transformation of light energy to numbers,
the language of computers. To accomplish this, some form of light sensitive transducer
is used such as a vidicon tube or charge-coupled device (CCD).
A vidicon tube is the type of sensor typically found in home or industrial
video systems. A lens is used to project the image onto a flat surface of the vidicon.
The tube surface is coated with a photoconductive material whose resistance is
inversely proportional to the light intensity falling on it. An electron gun is used to
produce a flying-spot scanner with which to rapidly scan the surface left to right
and top to bottom. The scan results in a time varying voltage which is proportional
to the scan spot image intensity. The continuously varying output voltage is then
fed to an analog-to-digital converter (ADC) where the voltage amplitude is periodically
sampled and converted to numbers. A typical ADC unit will produce 30 complete
digitized frames consisting of 256 x 256, or 512 x 512 (or more) samples of an
image per second. Each sample is a number (or triple of numbers in the case of
color systems) ranging from 0 to 63 (six bits) or 0 to 255 (eight bits). The image
conversion process is depicted in Figure 14.4.
A CCD is typical of the class of solid state sensor devices known as charge
transfer devices that are now being used in many vision systems. A CCD is a
rectangular chip consisting of an array of capacitive photodetectors, each capable
of storing an electrostatic charge. The charges are scanned like a clock-driven shift
register and converted into a time varying voltage which is proportional to the
incident light intensity on the detectors. This voltage is sampled and converted to
integers using an ADC unit as in the case of the vidicon tube. The density of the
The array of numbers produced from the image sensing device may be thought of
as the lowest, most primitive level of abstraction in the vision understanding process.
The next step in the processing hierarchy is to find some structure among the
pixels, such as pixel clusters which define object boundaries or regions within
the image.
Thus, it is necessary to transform the array of raw pixel data into regions of
discontinuities and homogeneity, to find edges and other delimiters of these object regions.
A raw digitized image will contain some noise and distortion; therefore, computations
to reduce these effects may be necessary before locating edges and regions.
Depending on the particular application, low level processing will often require
local smoothing of the array to eliminate this noise. Other low level operations
include threshold processing to help define homogeneous regions, and different forms
of edge detection to define boundaries. We examine some of these low level methods
next.
Thresholding is the process of transforming a gray level representation to a
binary representation of the image. All digitized array values above some threshold
level T are set equal to the maximum gray-level value (black), and values less
than or equal to T are set equal to zero (white). For simplicity, assume gray-level
values have been normalized to range between zero and one, and suppose a threshold
level of T = 0.7 has been chosen. Then all array values g(x,y) > 0.7 are set equal
to 1 and values g(x,y) <= 0.7 are set equal to 0. The result is an array of binary 0
and 1 values. An example of an image that has been thresholded at 0.7 to produce
a binary image is illustrated in Figure 14.5.
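The threshold rule can be written in a few lines. The sketch below assumes the image is held as a list of rows of normalized gray levels; the array values are made up for illustration.

```python
def threshold(image, t):
    """Return a binary image: 1 where g(x, y) > t, otherwise 0."""
    return [[1 if g > t else 0 for g in row] for row in image]


# Hypothetical normalized gray levels (0.0 to 1.0).
gray = [
    [0.10, 0.65, 0.90],
    [0.70, 0.71, 0.20],
]
binary = threshold(gray, 0.7)
print(binary)
```

Note that a value exactly equal to the threshold (0.70 here) falls on the zero side, in agreement with the "less than or equal to T" rule.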
Thresholding is one way to segment the image and sharpen object regions by
enhancing some portions and reducing others like noise and other unwanted features.
Thresholding can also help to simplify subsequent processing steps. And in many
cases, the use of several different threshold levels may be necessary since low
intensity object surfaces will be lost to high threshold levels, and unwanted background
will be picked up and enhanced by low threshold levels. Thresholding at several
levels may be the best way to determine different regions in the image when it is
necessary to compensate for variations in illumination or poor contrast.
Selecting one or more appropriate threshold level settings T will require additional
computations, such as first producing a histogram of the image gray-level
intensities. A histogram gives the frequencies of occurrence of different intensity
(or some other feature) levels within the image. An analysis of a histogram can
reveal where concentrations of different intensity levels occur, where peaks and
broad flat levels occur and where abrupt differences in level occur. From this information
the best choices of T values are often made apparent. For example, a histogram
with two or more clear separations between intensity levels that have a relatively
high frequency of occurrence will usually suggest the best threshold levels for object
identification and separation. This is seen in Figure 14.6.
Figure 14.5 Threshold transformation of an image; panel (b) is the binary image.
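One simple way to turn a histogram into a threshold is to look for the valley between the two dominant peaks. The sketch below is only one possible heuristic, not a method prescribed by the text: it assumes integer gray levels, a roughly bimodal histogram, and a hypothetical min_separation parameter that keeps the second peak from being chosen right next to the first.

```python
def histogram(image, levels):
    """Count how often each integer gray level 0..levels-1 occurs."""
    counts = [0] * levels
    for row in image:
        for g in row:
            counts[g] += 1
    return counts


def valley_threshold(counts, min_separation=2):
    """Return the least frequent gray level lying between the two dominant peaks."""
    first = max(range(len(counts)), key=lambda g: counts[g])
    candidates = [g for g in range(len(counts)) if abs(g - first) >= min_separation]
    second = max(candidates, key=lambda g: counts[g])
    lo, hi = sorted((first, second))
    return min(range(lo + 1, hi), key=lambda g: counts[g])


# Hypothetical image with 8 gray levels: a dark cluster near 1 and a bright one near 6.
image = [
    [1, 1, 2, 6],
    [1, 2, 6, 6],
    [1, 4, 6, 7],
]
counts = histogram(image, 8)
t = valley_threshold(counts)
print(counts, t)
```

Real histograms are noisier than this example, so some smoothing of the counts would normally precede the peak search.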
Next, we turn to the question of image smoothing. Smoothing is a form of
digital filtering. It is used to reduce noise and other unwanted features and to enhance
certain image features. Smoothing is a form of image transformation that tends to
eliminate spikes and flatten widely fluctuating intensity values. Various forms of
smoothing techniques have been employed, including local averaging, the use of
models, and parametric form fitting.
One common method of smoothing is to replace each pixel in an array with
a weighted average of the pixel and its neighboring values. This can be accomplished
with the use of filter masks which use some configuration of neighboring pixel
values to compute a smoothed replacement value. Two typical masks consist of
either four or eight neighboring pixels whose intensity values are used in the weighting
[Figure 14.6: histogram of frequency against gray-level intensity, with a possible threshold marked]
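Local averaging with four- or eight-neighbor masks might look like the following sketch. Equal weights are assumed here because the weights used in the text are not recoverable; border pixels simply average over the neighbors that exist. The names FOUR, EIGHT, and smooth are illustrative.

```python
FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]
EIGHT = FOUR + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def smooth(image, offsets=FOUR):
    """Replace each pixel by the equal-weight average of itself and its neighbors."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            total, count = image[r][c], 1
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:   # skip neighbors off the image
                    total += image[rr][cc]
                    count += 1
            out_row.append(total / count)
        out.append(out_row)
    return out


# Hypothetical image with a single noise spike in the center.
noisy = [
    [10, 10, 10],
    [10, 100, 10],
    [10, 10, 10],
]
print(smooth(noisy)[1][1], smooth(noisy, EIGHT)[1][1])
```

The spike of 100 is pulled down to 28.0 with the four-neighbor mask and to 20.0 with the eight-neighbor mask, at the price of raising the neighboring values.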
Regions belonging to the same object are usually distinguishable by one or more
features which are relatively homogeneous throughout, such as color, texture, three-
dimensional flow effects, or intensity.
Boundaries which separate adjoining regions represent a discontinuity in one
or more of these features, a fact that can be exploited by measuring the rate of
change of a feature value over the image surface. For example, the rate of change
or gradient in intensity in the horizontal and vertical directions can be measured
with difference functions D_x and D_y defined as
    D_x = f(x, y) - f(x - n, y)
    D_y = f(x, y) - f(x, y - n)
with the gradient direction given by
    \theta = \tan^{-1}(D_y / D_x)
For n = 1, D_x and D_y are most easily computed by application of the equivalent
weighting masks; the two element masks are (-1 1) and the same two elements
arranged as a column, respectively.
An example of the application of these two masks to an image array is illustrated
in Figure 14.8 where a vertical edge is seen to be quite pronounced. Masks such
as these have been generalized to measure gradients over wider regions covering
several pixels. This has the effect of reducing spurious noise and other sharp spikes.
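The difference functions can be sketched directly. The code assumes the image is stored as image[y][x] and sets the difference to zero wherever the pixel n steps back does not exist; both choices are assumptions made for illustration.

```python
def gradient(image, n=1):
    """Return (dx, dy): first differences in the horizontal and vertical directions."""
    rows, cols = len(image), len(image[0])
    dx = [[image[y][x] - image[y][x - n] if x >= n else 0 for x in range(cols)]
          for y in range(rows)]
    dy = [[image[y][x] - image[y - n][x] if y >= n else 0 for x in range(cols)]
          for y in range(rows)]
    return dx, dy


# Hypothetical array with a vertical edge between the second and third columns.
edge = [
    [1, 1, 9, 9],
    [1, 1, 9, 9],
    [1, 1, 9, 9],
]
dx, dy = gradient(edge)
print(dx[0], dy[1])
```

The horizontal differences respond strongly (8) at the column where the intensity jumps, while the vertical differences stay at zero, which is what marks the edge as vertical.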
Two masks deserving particular attention are the Prewitt (1970) and Sobel
(1970) masks as depicted in Figure 14.9. These masks are used to compute a broader,
normalized gradient than the simple masks given above. We leave the details of
the computations as one of the exercises at the end of this chapter.
We return now to the methods of edge detection which employ smoothing
followed by an application of the gradient. For this, the continuous case is considered
first.
[Figure 14.8: (a) original array I; (b) result of the Dx and Dy masks applied to I]
Convolving the two functions f and g is similar to computing the cross correlation,
a process that reduces random noise and enhances coherent or structural changes.
One particular form of weighting function g has a symmetric bell shape or
normal form, that is the Gaussian distribution. The two dimensional form of this
function is given by
    g(u, v) = c e^{-(u^2 + v^2) / 2\sigma^2}
where c is a normalizing constant and \sigma sets the width of the bell.
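A discrete version of the Gaussian weighting function can be built as a small mask. In this sketch the spread parameter sigma and the mask radius are assumptions, and c is chosen so that the weights sum to one; pixels that fall off the image are treated as zero.

```python
import math


def gaussian_mask(radius, sigma):
    """Build a (2*radius + 1) square mask of Gaussian weights that sum to 1."""
    weights = [[math.exp(-(u * u + v * v) / (2.0 * sigma * sigma))
                for u in range(-radius, radius + 1)]
               for v in range(-radius, radius + 1)]
    c = 1.0 / sum(sum(row) for row in weights)      # normalizing constant
    return [[c * w for w in row] for row in weights]


def convolve_at(image, mask, x, y):
    """Weighted sum of the neighborhood of (x, y); pixels off the image count as 0."""
    radius = len(mask) // 2
    total = 0.0
    for v in range(-radius, radius + 1):
        for u in range(-radius, radius + 1):
            yy, xx = y - v, x - u
            if 0 <= yy < len(image) and 0 <= xx < len(image[0]):
                total += mask[v + radius][u + radius] * image[yy][xx]
    return total


mask = gaussian_mask(radius=1, sigma=1.0)
flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
print(convolve_at(flat, mask, 1, 1))
```

Because the weights sum to one, a region of constant intensity is left unchanged by the smoothing, while isolated spikes are spread out and reduced.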
Because of their rotational symmetry, Gaussian filters produce desirable effects
Prewitt masks              Sobel masks
      -1  0  1                   -1  0  1
P_x = -1  0  1             S_x = -2  0  2
      -1  0  1                   -1  0  1

       1  1  1                    1  2  1
P_y =  0  0  0             S_y =  0  0  0
      -1 -1 -1                   -1 -2 -1
Figure 14.9 Generalized edge detection masks.
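Applying a 3 x 3 mask such as the Prewitt or Sobel masks amounts to a weighted sum over each pixel's neighborhood. The sketch below uses the mask values of Figure 14.9; the function name apply_mask and the decision to leave border pixels at zero are assumptions.

```python
PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]


def apply_mask(image, mask):
    """Lay a 3 x 3 mask over each interior pixel and sum the products."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(mask[i][j] * image[r + i - 1][c + j - 1]
                            for i in range(3) for j in range(3))
    return out


# Hypothetical array with a vertical edge between the second and third columns.
edge = [[1, 1, 9, 9] for _ in range(4)]
print(apply_mask(edge, SOBEL_X)[1], apply_mask(edge, SOBEL_Y)[1])
```

The horizontal Sobel mask gives 32 at the interior pixels on either side of the edge (24 for Prewitt), and the vertical mask gives zero, so the response again identifies a vertical edge.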
[Figure: image intensity, its gradient, the second order intensity gradient, and the gradient applied to the convolution]
Applying this transform to an array of intensity values produces an array of
complex numbers that correspond to the spatial frequency components of the image
(sums of sine and cosine terms). The transformed array will contain all of the
information in the original intensity image, but in a form that is more easily used
to identify regions that contain different frequency components. Filtering with the
Fourier transform is accomplished by setting the high (or low) values of u and v
equal to zero. For example, the value F(u,v) = F(0,0) corresponds to the zero
frequency or the DC component, and higher values of u and v correspond to the
high frequency components. As with intensity image arrays, thresholding of trans-
formed arrays can be used to separate different frequency components.
The original intensity image, with any modifications, is recovered with the
inverse transform given by
    f(x, y) = \frac{1}{n} \sum_{u=0}^{n-1} \sum_{v=0}^{n-1} F(u, v) \exp\left[ \frac{2\pi j}{n} (xu + yv) \right]
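Filtering in the frequency domain can be tried out on a tiny array with a direct (slow) discrete transform. The sketch assumes a square n x n image, a 1/n factor on both the forward and inverse transforms, and a hypothetical low_pass helper. In a discrete transform, indices near n represent the same low frequencies as indices near 0, so the helper measures frequency as min(k, n - k); that detail is a property of the discrete transform rather than something stated in the text.

```python
import cmath


def dft2(f):
    """Forward transform of a square array: F[u][v] from f[x][y]."""
    n = len(f)
    return [[sum(f[x][y] * cmath.exp(-2j * cmath.pi * (x * u + y * v) / n)
                 for x in range(n) for y in range(n)) / n
             for v in range(n)] for u in range(n)]


def idft2(F):
    """Inverse transform: recover f[x][y] from F[u][v]."""
    n = len(F)
    return [[sum(F[u][v] * cmath.exp(2j * cmath.pi * (x * u + y * v) / n)
                 for u in range(n) for v in range(n)) / n
             for y in range(n)] for x in range(n)]


def low_pass(F, cutoff):
    """Zero every component whose frequency in u or v exceeds cutoff."""
    n = len(F)

    def keep(k):
        return min(k, n - k) <= cutoff

    return [[F[u][v] if keep(u) and keep(v) else 0 for v in range(n)]
            for u in range(n)]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
F = dft2(image)
restored = idft2(F)
dc_only = idft2(low_pass(F, 0))
print(round(restored[2][1].real, 6), round(dc_only[2][1].real, 6))
```

Keeping only the DC component replaces every pixel by the image average (8.5 here), the most extreme form of smoothing. Practical systems use a fast Fourier transform rather than these direct sums.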
Image Transformation and Low-Level Processing 297
Sec. 14.2
As suggested earlier, texture and color are also used to identify regions and boundaries.
Texture is a repeated pattern of elementary shapes occurring on an object's surface.
Texture may appear to be regular and periodic, random, or partially periodic. Figure
14.11 illustrates some examples of textured surfaces.
The structure in texture is usually too fine to be resolved, yet still coarse
enough to cause noticeable variation in the gray levels. Even so, methods of analysis
for texture have been developed. They are commonly based on statistical analyses
of small groups of pixels, the application of pattern matching, the use of Fourier
transforms, or modeling with special functions known as fractals. These methods
are beyond the scope of our goals in this chapter.
The use of color to identify and interpret regions requires more than three
times as much processing as gray-level processing. First, the image must be separated
into its three primary colors with red, green, and blue filters (Figure 14.12).
The separate color images must then be processed by sampling the intensities
and producing three arrays or a single array of tristimulus values. The arrays are
then processed separately (in some cases jointly) to determine common color regions
and corresponding boundaries. The processes used to find boundaries and regions,
and to interpret color images, are similar to those of gray-level systems.
Although the additional computation required in color analysis can be significant,
the added information gained from separate color intensity arrays may be warranted,
depending on the application. In complex scene analysis, color may be the most
effective method of segmentation and object identification. In Section 14.6 we describe
an interesting color scene analyzer which is based on a rule based inferencing system
(Ohta, 1985).
Figure 14.12 Color separation and processing (red, green, and blue filters).
A stereoscopic vision system requires two displaced sensors to obtain two views of
objects from different perspectives. The differences between the views make it
possible to estimate distances and derive a three-dimensional model of a scene.
The displacement of a pixel from one image to a different location in another image
is known as the disparity. It is the disparity between the two views that permits the
estimation of the distance to objects in the scene. The human vision system is
somehow able to relate the two different images and form a correspondence that
translates to a three-dimensional interpretation. Figure 14.13 illustrates the geometric
relationships used to estimate distances to objects in stereoscopic systems.
The distance k from the lens to the object can be estimated from the relationships
that hold between the sides of the similar triangles. Using the relations i_1/e_1 = f/k,
i_2/e_2 = f/k, and d = e_1 + e_2 we can write
    k = fd / (i_1 + i_2)
Since f and d are relatively constant, the distance k is a function of the disparity,
or sum of the distances i_1 and i_2.
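The distance relation is simple enough to check numerically. In the sketch below the function name and the sample values (a 50 mm focal length and sensors 10 cm apart, all expressed in meters) are hypothetical.

```python
def stereo_distance(f, d, i1, i2):
    """Distance k = f*d / (i1 + i2), with all lengths in the same unit."""
    disparity = i1 + i2
    if disparity <= 0:
        raise ValueError("disparity must be positive")
    return f * d / disparity


# Hypothetical values in meters: focal length, sensor separation, image offsets.
k = stereo_distance(f=0.05, d=0.10, i1=0.001, i2=0.0015)
print(round(k, 6))
```

With these numbers the object lies about 2 m away. Halving the disparity doubles the estimated distance, which is why small matching errors matter most for far objects.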
In computer vision systems, determining the required correspondence between
the two displaced images is perhaps the most difficult part in determining the disparity.
Figure 14.13 Disparity in stereoscopic systems. Labels: focal length of lens; distance to object P; object P; i_1 and i_2, the two images.
Intermediate-Level Image processing
Sec. 14.3
Corresponding pixel groupings in the two images must be located to determine the
disparity from which the distance can be estimated. In practice, methods based on
correlation, gray-level matching, template matching, and edge contour comparisons
have been used to estimate the disparity between stereo images.
Optic flow is an alternative approach to three-dimensional scene analysis which
is based on the relative motion of a sensor and objects in the scene. If a sensor is
moving (or objects are moving past a sensor), the apparent continuous flow of the
objects relative to the sensor is known as optical flow. Distances can be estimated
from the change in flow or relative velocity of the sensor and the objects. For
example, in Figure 14.14 if the velocity of the sensor is constant, the change in
distance dx between points x1 and x2 is proportional to the change in size of the
power lines h, through the relation
    dx/dt = k (dh/dt)
The next major level of analysis builds on the low-level or early processing steps
described above. It concentrates on segmenting the image space into larger global
structures using homogeneous features in pixel regions and boundaries formed from
pieces of edges discovered during the low-level processing. This level requires that
pieces of edges be combined into contiguous contours which form the outline of
objects, partitioning the image into coherent regions, developing models of the
segmented objects, and then assigning labels which characterize the object regions.
One way to begin defining a set of objects is to draw a silhouette or sketch
of their outlines. Such a sketch has been called the raw primal sketch by Marr
(1982). It requires connecting up pieces of edges which have a high likelihood of
forming a continuous boundary. For example, the problem is to decide whether
two edge pieces such as
    (edge (location 21 103)
          (intensity 0.8)
          (direction 46))
    (edge (location 18 98)
          (intensity 0.6)
          (direction 41))
should be connected. This general process of forming contours from pieces of edges
is called segmentation.
Graphical methods can be used to link up pieces of edges. One approach is to use
a minimum spanning tree (MST). Starting at any cluster of pixels known to be
part of an edge, this method performs a search in the neighborhood of the cluster
for groupings with similar feature values. Each such grouping corresponds to a
node in an edge tree. When a number of such nodes have been found they are
connected using the MST algorithm.
An MST is found by connecting an arc between the first node selected and
its closest neighbor node and labeling the two nodes accordingly. Neighborhoods
of both connected nodes are then searched. Any node found closest to either of the
two connected nodes (below some threshold distance) is then used to form the
next branch in the tree. A second arc is constructed between the newly found node
and the closest connected node, again labeling the new node. This process is repeated
until all nodes having arc distances less than some value (such as a function of the
average arc distances) have been connected. An example of an MST is given in
Figure 14.15.
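The tree-growing procedure just described can be sketched as follows. The sketch assumes each edge grouping is reduced to an (x, y) location and that a fixed max_distance plays the role of the threshold; it grows the tree from the first node in the list, always adding the unconnected node closest to any connected one. The function name and the sample locations are illustrative.

```python
import math


def link_edges(nodes, max_distance):
    """Grow a tree from nodes[0]; return arcs as (connected_index, new_index) pairs."""
    connected = [0]
    remaining = set(range(1, len(nodes)))
    arcs = []
    while remaining:
        # Shortest arc from any connected node to any unconnected node.
        dist, a, b = min((math.dist(nodes[a], nodes[b]), a, b)
                         for a in connected for b in remaining)
        if dist > max_distance:
            break            # every remaining node is too far away to join this edge
        arcs.append((a, b))
        connected.append(b)
        remaining.remove(b)
    return arcs


# Hypothetical edge-grouping locations; the fourth one is an outlier.
nodes = [(0, 0), (1, 0), (2, 1), (10, 10), (1, 1)]
arcs = link_edges(nodes, max_distance=2.0)
print(arcs)
```

The outlier at (10, 10) is never connected because its shortest arc exceeds the threshold, so it would be treated as belonging to a different edge.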
Another graphical approach is based on the assignment of a cost or other
measure of merit to pixel groupings. The cost assignment can be based on a simple
function of features such as intensity, orientation, or color. A best-first
(branch-and-bound) or other form of graph search is then performed using some heuristic
function to determine a least-cost path which represents the edge contour.
Other edge finding approaches are based on fitting a low degree polynomial
to a number of edge pieces which have been found through local searches. The
resultant polynomial curve segment is then taken as the edge boundary. This approach
is similar to one which compares edge templates to short groupings of pieces. If a
particular matching template scores above some threshold, the template pattern is
then used to define the contour.
where K(s,t) is the cost at stage n, and C_{n+1}(t) is the minimum cost for stages
n + 1 to the terminal stage.
The computation process is best understood through an example. Consider
the following 5 X 5 array of pixel cost values.
    19   7   6   5   1
    13   7   2   7   1
     4   1   5   2   1
     6   4   3   7   7
     8   7   2   2   3
Suppose we wish to find the optimal cost path from the lower left to the upper
right corner of the array. We could work from either direction, but we arbitrarily
choose to work forward from the lower left pixel with cost value 8. We first set
all values except 8 equal to some very large number, say M, and compute the
minimum cost of moving from the position with the 8 to all other pixels in the
bottom row by adding the cost of moving from pixel to neighboring pixel. This
results in the following cost array.
     M   M   M   M   M
     M   M   M   M   M
     M   M   M   M   M
     M   M   M   M   M
     8  15  17  19  22
Next, we compute the minimum neighbor path cost for the next to the last row to
obtain
     M   M   M   M   M
     M   M   M   M   M
     M   M   M   M   M
    14  14  17  24  29
     8  15  17  19  22
Note that the minimum cost path to the second, third and fourth positions in this
row is the diagonal path (position 5,1 to 4,2) followed by a horizontal right traversal
in the same row, whereas the minimum cost path for the last position in this row
is the path passing through the rightmost position of the bottom row. The remaining
minimum path costs are computed in a similar fashion, row by row, to obtain the
final cost array.
    27  24  23  22  21
    18  22  17  24  20
    18  15  19  19  26
    14  14  17  24  29
     8  15  17  19  22
From this final minimum cost array, the least cost path is easily found to be
    27  24  23  22  21
    18  22  17  24  20
    18  15  19  19  26
    14  14  17  24  29
     8  15  17  19  22
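A row-by-row computation of this kind can be sketched in code. The version below is an illustration under its own assumptions, not a reproduction of the worked example: it allows only moves to the right, upward, or diagonally up and to the right, charges the cost of each pixel entered, and uses a small hypothetical 3 x 3 cost array. It follows the usual dynamic programming idea that the best total to reach a pixel is that pixel's cost plus the smallest total among its permitted predecessors.

```python
def least_cost_path(cost):
    """Least-cost path from the lower left to the upper right pixel.

    Assumed moves: right, up, or diagonally up-right. Returns (total, path),
    where path is a list of (row, column) positions with row 0 at the top.
    """
    rows, cols = len(cost), len(cost[0])
    total = [[float("inf")] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]
    total[rows - 1][0] = cost[rows - 1][0]
    for r in range(rows - 1, -1, -1):                 # bottom row first
        for c in range(cols):
            # Predecessors: pixel to the left, pixel below, pixel below-left.
            for pr, pc in ((r, c - 1), (r + 1, c), (r + 1, c - 1)):
                if (0 <= pr < rows and 0 <= pc < cols
                        and total[pr][pc] + cost[r][c] < total[r][c]):
                    total[r][c] = total[pr][pc] + cost[r][c]
                    back[r][c] = (pr, pc)
    path, node = [], (0, cols - 1)
    while node is not None:                           # trace back to the start
        path.append(node)
        node = back[node[0]][node[1]]
    return total[0][cols - 1], path[::-1]


# Hypothetical pixel costs.
grid = [
    [5, 1, 2],
    [4, 9, 1],
    [1, 3, 6],
]
best, path = least_cost_path(grid)
print(best, path)
```

Allowing a different set of moves changes the totals, so the move set must be fixed before results from two computations are compared.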
Rather than defining regions with edges, it is possible to build them. For example,
global structures can be constructed from groups of pixels by locating, connecting,
and defining regions having homogeneous features such as color, texture, or intensity.
The resulting segmented regions are expected to correspond to surfaces of objects
in the real world. Such coherent regions do not always correspond to meaningful
regions, but they do offer another viable approach to the segmentation of an image.
When these methods are combined with other segmentation techniques, the confidence
level that the regions represent meaningful objects will be high.
Once an image has been segmented into disjointed object areas, the areas
can be labeled with their properties and their relationships to other objects, and
then identified through model matching or description satisfaction.
Region segmentation may be accomplished by region splitting, by region growing
(also called region merging), or by a combination of the two. When splitting
is used, the process proceeds in a top-down manner. The image is split successively
into smaller and smaller homogeneous pieces until some criteria are satisfied. When
growing regions, the process proceeds in a bottom-up fashion. Individual pixels or
small groups of pixels are successively merged into contiguous, homogeneous areas.
A combined splitting-growing approach will use both bottom-up and top-down tech-
niques.
Regions are usually assumed to be disjointed entities which partition the image
such that (1) a given pixel can appear in a single region only, (2) subregions are
composed of connected pixels, (3) different regions are disjoint areas, and (4) the
complete image area is given by the union of all regions. Regions are usually
defined by some homogeneous property such that all pixels belonging to the region
satisfy the property, and pixels not satisfying the property lie in a different region.
Note that a region need not consist of contiguous pixels only since some objects
may be split or covered by occluding surfaces. Condition 4 is needed to ensure
that all regions are accounted for and that they fill up the complete image area.
In region splitting, the process begins with an entire image which is successively
divided into smaller regions which exhibit some coherence in features. One effective
method is to apply multiple thresholding levels which can isolate regions having
homogeneous features. Histograms are first obtained to establish the threshold levels.
This may require masking portions of the image to achieve effective separation of
complex objects. Each threshold level can then produce a binary image consisting
of all of the objects which exceed the thresholded level. Once the binary regions
are formed, they are easily delineated, separated, and marked for subsequent processing.
This whole process of masking, computing, and analyzing a histogram, thresholding,
defining an area, masking, and so on can be performed in a recursive manner.
The process terminates when the masks produce monomodal histograms with the
image fully partitioned.
Segmentation techniques based on region growing start with small atomic
regions (one or a few pixels) and build coherent pixel regions in a bottom-up fashion.
Local features such as the intensity of a group of pixels relative to the average
intensity of neighboring pixels are used as criteria for the merging operation. A
low level of contrast between contiguous groups gives rise to the merging of areas,
while a higher level of contrast, such as found at boundaries, provides the criteria
for region segregation.
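Region growing from a single seed pixel might be sketched as below. The merging criterion used here, that a neighboring pixel joins when its intensity is within a tolerance of the current region average, is one simple reading of the contrast test described above; the four-neighbor connectivity, the function name, and the sample array are assumptions.

```python
from collections import deque


def grow_region(image, seed, tolerance):
    """Collect 4-connected pixels whose intensity is within tolerance of the region mean."""
    rows, cols = len(image), len(image[0])
    region = {seed}
    total = image[seed[0]][seed[1]]
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= rr < rows and 0 <= cc < cols and (rr, cc) not in region
                    and abs(image[rr][cc] - total / len(region)) <= tolerance):
                region.add((rr, cc))          # low contrast: merge the pixel
                total += image[rr][cc]
                frontier.append((rr, cc))
    return region


# Hypothetical image: a dark area in the upper left, a bright area elsewhere.
image = [
    [10, 11, 50],
    [12, 10, 52],
    [11, 49, 51],
]
dark = grow_region(image, (0, 0), tolerance=5)
bright = grow_region(image, (0, 2), tolerance=5)
print(sorted(dark), sorted(bright))
```

The two regions stop growing where the contrast between neighboring pixels jumps, so together they partition the array without overlapping.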
Split-and-merge techniques attempt to gain the advantages of both methods.
They combine top-down and bottom-up processing using both region splitting and
merging until some split-merge criterion is no longer met. At each step in the process,
split and merge threshold values can be compared and the appropriate operation
performed. In this way, over-splitting and under-merging can be avoided.
We continue in this section with further intermediate-level processing steps all aimed
at building higher levels of abstraction. The processing steps here are related to
describing and labeling the regions.
Once the image has been segmented into disjointed regions, their shapes,
spatial interrelationships, and other characteristics can be described and labeled for
subsequent interpretation. This process requires that the outlines or boundaries, vertices,
and surfaces of the objects be described in some way. It should be noted,
however, that a description for a region can be based on a two- or three-dimensional
image interpretation. Initially, we focus on the two-dimensional interpretation.
Typically, a region description will include attributes related to size, shape,
and general appearance. For example, some or all of the following features might
be included.
Region area
Contour length (perimeter) and orientation
Location of the center of mass
Minimum bounding rectangle
Compactness (area divided by perimeter squared)
Fitted scatter matrix of pixels
Number and characteristics of holes or internal occlusions
Degree and type of texture
Average intensity (or average intensities of base colors)
Type of boundary segments (sharp, fuzzy, and so on) and their location
Boundary contrast
Chain code (described below)
Shape classification number (task specific)
Position and types of vertices (number of adjoining segments)
Describing Boundaries
Figure 14.17 Curve fitting with linear segments, panels (a) through (d).
1. Starting with the two end points of the boundary curve, construct a straight
line between the points.
2. At successive intervals along the curve, compute the perpendicular distance
to the constructed line. If the maximum distance is within some specified
limit, stop and use the segmented line as an approximation to the boundary.
3. Otherwise, choose the point on the curve at which the largest distance occurs
and use this as a breakpoint with which to construct two new line segments
which connect to the two endpoints. Continue the process recursively with
each subcurve until the stopping condition of Step 2 is satisfied.
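The three steps above translate naturally into a recursive routine. In this sketch the boundary curve is assumed to be given as an ordered list of (x, y) points, and the names fit_segments and perpendicular_distance are illustrative.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the straight line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length


def fit_segments(curve, limit):
    """Return the breakpoints of a piecewise-linear approximation of curve."""
    first, last = curve[0], curve[-1]
    distances = [perpendicular_distance(p, first, last) for p in curve[1:-1]]
    if not distances or max(distances) <= limit:
        return [first, last]                       # Step 2: close enough, stop
    k = distances.index(max(distances)) + 1        # Step 3: farthest point
    left = fit_segments(curve[:k + 1], limit)
    right = fit_segments(curve[k:], limit)
    return left[:-1] + right                       # do not repeat the breakpoint


# Hypothetical boundary: nearly flat with one sharp peak.
curve = [(0, 0), (1, 0.1), (2, 0), (3, 3), (4, 0), (5, 0.1), (6, 0)]
print(fit_segments(curve, limit=0.5))
```

A smaller limit produces more breakpoints and a closer fit; a larger limit gives a coarser but more compact description of the boundary.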
Chain Codes
Some other descriptive features include the area, intensity, orientation, center of
mass, and bounding rectangle. These descriptions are determined in the following
way.
1. The area of a region can be given by a count of the number of pixels contained
in the region.
2. The average region intensity is just the average gray-level intensity taken over
all pixels in the region. If color is used in the image, the average is given as
the three base color intensity averages.
3. The center of mass M for a region can be computed as the average x-y vector
position (denoted as P_i), that is
    M = \frac{1}{n} \sum_{i=1}^{n} P_i
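These descriptions are straightforward to compute once a region is available as a list of pixel positions. The sketch assumes positions are (x, y) pairs, the image is indexed as image[y][x], and the bounding rectangle is reported as (min x, min y, max x, max y); the function name and the returned dictionary keys are illustrative.

```python
def describe_region(pixels, image):
    """Compute area, average intensity, center of mass, and bounding rectangle."""
    n = len(pixels)
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return {
        "area": n,                                              # pixel count
        "intensity": sum(image[y][x] for x, y in pixels) / n,   # mean gray level
        "center": (sum(xs) / n, sum(ys) / n),                   # average position
        "bounds": (min(xs), min(ys), max(xs), max(ys)),
    }


# Hypothetical 4 x 4 image with a 2 x 2 region in the middle.
image = [
    [0, 0, 0, 0],
    [0, 4, 6, 0],
    [0, 8, 2, 0],
    [0, 0, 0, 0],
]
region = [(1, 1), (2, 1), (1, 2), (2, 2)]
info = describe_region(region, image)
print(info)
```

For a color image the intensity entry would be computed three times, once for each base color array.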
Three-Dimensional Descriptions
possible. Once a match was obtained and all objects identified, the program demon-
strated its "understanding'' of the scene by producing a graphic display of it on a
monitor screen.
Guzman wrote a program called SEE which examined how surfaces from the
same object were linked together. The geometric relationships between different
types of line junctions (vertices) helped to determine the object types. Guzman
identified eight commonly occurring edge junctions for his three-dimensional blocks
world objects. The junctions were used by heuristic rules in his program to classify
the different objects by type (Figure 14.19).
Huffman and Clowes, working independently, extended this work by developing
a line labeling scheme which systematized the classification of polyhedral objects.
Their scheme was used to classify edges as either concave, convex, or occluding.
Concave edges are produced by two adjacent touching surfaces which produce a
concave (less than 180°) depth change. Conversely, convex edges produce a convexly
viewed depth change (greater than 180°), and an occluding edge outlines a surface
that obstructs other objects.
To label a concave edge, a minus sign is used. Convex edges are labeled
with a plus sign, and a right or left arrow is used to label the occluding or boundary
edges. By restricting vertices to be the intersection of three object faces (trihedral
vertices), it is possible to reduce the number of basic vertex types to only four: the
L, the T, the Fork, and the Arrow (Figure 14.20). Different label combinations
assigned to these four types then assist in the classification and identification of
objects.
When a three-dimensional object is viewed from all possible positions, the
four junction types, together with the valid edge labels, give rise to eighteen different
permissible junction configurations as depicted in Figure 14.20. From a dictionary
of these valid junction types, a program can classify objects by the sequence of
bounding vertices which describe it. Impossible object configurations such as the
one illustrated in Figure 14.21 can also be detected.
Geometric constraints, together with a consistent labeling scheme, can greatly
simplify the object identification process. A set of labeling rules which greatly
facilitates this process can be developed for different classes of objects. For example,
using the labels described above, the following rules will apply for many polyhedral
[Figure 14.19 Junction types: the L, the T, the fork, the X, the arrow, the psi, the peak, and the multi.]
[Figure 14.20 Valid junction labels for three-dimensional shapes: L types, fork types, T types, and arrow types.]
objects: (1) the arrow should be directed to mark boundaries by traversing the object
in a clockwise direction (the object face appears on the right of the arrow), (2)
unbroken lines should have the same label assigned at both ends, (3) when a fork
is labeled with a + edge, it must have all three edges labeled as +, and (4) arrow
junctions which have a - label on both barb edges must also have a + label on
the shaft.
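The fork and arrow rules, (3) and (4), lend themselves to a simple mechanical check. The following Python sketch is illustrative only: the function names and the use of the strings "+" and "-" for convex and concave labels are assumptions made for this example, and the checks cover just these two rules, not the full dictionary of valid junctions.

```python
def fork_is_consistent(labels):
    """Rule (3): a fork with any '+' edge must have all three edges '+'."""
    if len(labels) != 3:
        raise ValueError("a fork junction has exactly three edges")
    return "+" not in labels or all(lab == "+" for lab in labels)


def arrow_is_consistent(barb1, barb2, shaft):
    """Rule (4): if both barb edges are '-', the shaft must be '+'."""
    if barb1 == "-" and barb2 == "-":
        return shaft == "+"
    return True  # rule (4) places no restriction on other barb labels
```

A labeling that fails either check can be rejected without consulting the rest of the figure, which is what makes such local rules useful for pruning.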
These rules can be applied to a polygonal object as illustrated in Figure 14.22.
Starting with any edge having an object face on its right, the external boundary is
labeled with arrows in a clockwise direction. Interior lines are then labeled with +
or - consistent with the other labeling rules.
Continuing with this early work, David Waltz developed a method of vertex constraint
propagation which establishes the permissible types of vertices that can be associated
with a certain class of objects. He broadened the class of images that could be
analyzed by relaxing lighting conditions and extending the labeling vocabulary to
accommodate shadows, some multiline junctions, and other types of interior lines.
His constraint satisfaction algorithm was one of his most important contributions.
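The idea behind this kind of constraint propagation can be sketched in a few lines of Python. This is a minimal illustration, not Waltz's program: each junction carries a list of candidate labelings (one label per incident edge), and a candidate is discarded when some neighboring junction has no remaining candidate that gives the shared edge the same label. The junction names, edge names, and candidate sets are hypothetical and do not reproduce any figure in this chapter. Each junction is assumed to start with at least one candidate.

```python
def _supported(labeling, shared_edges, neighbour_candidates):
    """True if some neighbour candidate agrees on every shared edge."""
    if not shared_edges:
        return True
    return any(all(labeling[e] == other[e] for e in shared_edges)
               for other in neighbour_candidates)


def waltz_filter(candidates):
    """Repeatedly discard junction labelings that no neighbour can agree with."""
    domains = {j: [dict(lab) for lab in labs] for j, labs in candidates.items()}
    edges = {j: set(labs[0]) for j, labs in candidates.items()}
    changed = True
    while changed:
        changed = False
        for j in domains:
            kept = [lab for lab in domains[j]
                    if all(_supported(lab, edges[j] & edges[k], domains[k])
                           for k in domains if k != j)]
            if len(kept) < len(domains[j]):
                domains[j] = kept
                changed = True  # a removal may invalidate labels elsewhere
    return domains


# Hypothetical three-junction drawing with edges AB, AC, and BC.
example = {
    "A": [{"AB": "+", "AC": "+"}, {"AB": "-", "AC": "-"}],
    "B": [{"AB": "+", "BC": "-"}],
    "C": [{"AC": "+", "BC": "-"}, {"AC": "-", "BC": "-"}, {"AC": "-", "BC": "+"}],
}
filtered = waltz_filter(example)
```

In this toy case the single candidate at B eliminates one candidate at A, and that elimination in turn removes two candidates at C, leaving one labeling per junction. Any labelings that survive filtering would still need a search step to enumerate globally consistent combinations.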
To see how this procedure works, consider the image drawing of a pyramid
as illustrated in Figure 14.23. At the right side of the pyramid are all possible
labelings for the four junctions A, B, C, and D.
Using these labels as mutual constraints on connected junctions, permissible
labels for the whole pyramid can be determined. The constraint satisfaction procedure
works as follows:
-. Consequently, two of the possible labelings can be eliminated with the remaining
four being
This reduction, in turn, places a new restriction on BC, permitting the elimination
of one C label, since BC must now be labeled as a + only. This leaves the remaining
C labels as
Continuing with the above procedure, it will be found that further label elimina-
tions are not possible since all constraints have been satisfied. The above process
is completed by finding the different combinations of unique labelings that can be
assigned to the figure. This can be accomplished through a tree search process. A
simple enumeration of the remaining labels shows that it is possible to find only
Visual Image UnderStanding Chap. 14
Template Matching
HIGH-LEVEL PROCESSING
Before proceeding with a discussion of the final (high-level) steps in vision processing,
we shall briefly review the processing stages up to this point. We began with an
image of gray-level or tristimulus color intensity values and digitized this image to
obtain an array of numerical pixel values. Next, we used masks or some other
transform (such as Fourier) to perform smoothing and edge enhancement operations
to reduce the effects of noise and other unwanted features. This was followed by
edge detection to outline and segment the image into coherent regions. The product
of this step is a primal sketch of the objects. Region splitting and/or merging, the
dual of edge finding, can also be used separately or jointly with edge finding as
part of the segmentation process.
Histogram computations of intensity values and subsequent analyses are an
important part of the segmentation process. They help to establish threshold levels
which serve as cues for object separation. Other techniques such as minimum spanning
tree or dynamic programming are sometimes used in these early processing stages
to aid in edge finding.
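As a small illustration of the histogram and thresholding steps reviewed here, the following Python sketch counts gray-level occurrences and produces a binary image. The function names, the sample array, and the convention that pixels strictly above the threshold map to 1 are assumptions chosen for the example.

```python
from collections import Counter


def histogram(image):
    """Count how many pixels take each gray-level value."""
    return Counter(pixel for row in image for pixel in row)


def binarize(image, threshold):
    """Map pixels above the threshold to 1 and all others to 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]


# Hypothetical 2 x 3 image with a dark group and a bright group of pixels.
sample = [[10, 12, 200],
          [11, 198, 205]]
counts = histogram(sample)
binary = binarize(sample, 100)
```

In the sample, the gray levels cluster near 10 and near 200, so any threshold in the empty range between the two clusters separates them. Locating such valleys in the histogram is one common way to pick threshold levels.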
Following the segmentation process, regions are analyzed and labeled with
their characteristic features. The result of these final steps in intermediate-level
processing is a set of region descriptions (data structures). Such structures are used
as the input to the final high-level image processing stage. A summary of the data
structures produced from the lowest processing stage to the final interpretation stage
can then be depicted as follows.
Scene
↑
Objects
↑
Regions
↑
Edges or subregions
↑
Pixels
David Marr and his colleagues (1982, 1980, and 1978) proposed a theory of vision
which emphasized the importance of the representational scheme used at each stage
of the processing. His proposal was based on the assumption that processing would
be carried out in several steps similar to the summary description given above.
The steps and the corresponding representations are summarized as follows.
High-Level Processing
High-level processing techniques are less mechanical than either of the preceding
image processing levels. They are more closely related to classical AI symbolic
methods. In the high-level processing stage, the intermediate-level region descriptions
are transformed into high-level scene descriptions in one of the knowledge representation
formalisms described earlier in Part II (associative nets, frames, FOPL statements,
and so on; see Figure 14.24).
The end objective of this stage is to create high-level knowledge structures
which can be used by an inference program. Needless to say, the resulting structures
should uniquely and accurately describe the important objects in an image including
their interrelationships. In this regard, the particular vision application will dictate
the appropriate level of detail, and what is considered to be important in a scene
description.
There are various approaches to the scene description problem. At one extreme,
it will be sufficient to simply apply pattern recognition methods to classify certain
objects within a scene. This approach may require no more than application of the
methods described in the preceding chapter. At the other extreme, it may be desirable
to produce a detailed description of some general scene and provide an interpretation
of the function, purpose, intent, and expectations of the objects in the scene. Although
this requirement is beyond the current state-of-the-art, we can say that it will require
a great many prestored pattern descriptions and much general world knowledge. It
will also require improvements on many of the processing techniques described in
this chapter.
(region6
   (mass-center 2348)
   (shape-code 24)
   (area 245)
   (number-boundary-segments 6)
   (chain-code 1133300011 . . .)
   (orientation 85)
   (borders (region4 (position left-of) (contrast 5))
            (region7 (position above) (contrast 2)))
   (mean-intensity 0.6)
   (texture light regular))
[Figure 14.24 A region description and the corresponding scene network (nodes include scene, building, and region 1).]
matches rather than absolute ones. Rule conclusions will be rated by likelihood or
certainty factors instead of complete certainty. Identification of objects can then be
made on the basis of a likelihood score. In Figure 14.26(a), pairs of numbers are
given in the antecedent to suggest acceptable condition levels comparable to Dempster-
Shafer probabilities (the values in the figure are arbitrarily chosen with a scale of 0
to 1.0).
When rule-based identification is used, the vision system may be given an
initial goal of identifying each region. This can be accomplished with a high-level
goal statement of the following type.
(label region
   (or (rgn = building)
       (rgn = bushes)
       (rgn = car)
       (rgn = house)
       (rgn = road)
       (rgn = shadow)
       (rgn = tree)))
Other forms of matching may also be used in the interpretation process. For
example, a decision tree may be used in which region attributes and relation values
determine the branch taken at each node when descending the tree. The leaves of
the decision tree are labeled with the object identities as in Figure 14.27.
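A decision tree of this kind can be sketched as a nested dictionary in Python. The attribute names, branch values, and tree shape below are hypothetical and are not taken from Figure 14.27; the sketch only shows how attribute values select a branch at each node until a leaf (an object identity) is reached.

```python
# Internal nodes name the attribute to test; leaves are object identities.
TREE = {
    "attribute": "location",
    "branches": {
        "upper": {"attribute": "texture",
                  "branches": {"low": "sky", "high": "tree"}},
        "lower": {"attribute": "size",
                  "branches": {"large": "road", "small": "car"}},
    },
}


def classify(region, node):
    """Descend the tree, following the branch chosen by each attribute value."""
    while isinstance(node, dict):
        value = region[node["attribute"]]
        node = node["branches"][value]
    return node


label = classify({"location": "upper", "texture": "low"}, TREE)
```

A region lacking a tested attribute, or carrying a value with no matching branch, raises a KeyError in this sketch. A practical system would need some default or partial-match policy at that point.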
(R10-sky
   (and (location upper rgn)
        (intensity rgn bright (0.4 0.8))
        (color rgn (or (blue grey)) (0.7 1.0))
        (textural rgn low (0.8 1.0))
        (linear-boundary rgn rgn2 (0.4 0.7))))
Sec. 14.6 Vision System Architectures 317
[Figure 14.27 Decision tree for object identification. Branch values include large, medium, and small; leaves include sky, tree, road, car, building, sidewalk, and lawn.]
Objects, with their attributes and relations, are then used to construct an associative
net scene, a frame network, or other structure.
In this section we present two vision systems which are somewhat representative
of complete system architectures. The first system is a model-based system, one of
the earliest successful vision systems. The second is a color region analyzer recently
developed at the University of Kyoto, Japan.
The user descriptions are parsed and transformed by the system into geometric
and algebraic network representations. These representations provide volumetric
descriptions in local coordinate systems. A graphic presentation, the system's interpretation
of the input models created by the user, provides feedback to the user during
the modeling process. The completed representations are used by the system to
predict what features (e.g., shape, orientation, and position) of the modeled objects
can be observed from the input image components. The predicted models are stored
as prediction graphs.
The visual input consists of gray-level image processing arrays, a line finder,
and an edge linker. This part of the system provides descriptions of objects as
defined by segmented edge structures. The descriptions created from this unit are
represented as observation graphs. One output from the predictor serves as an input
to the edge mapping and linking module. This unit uses the predicted information
(predicted edges, ribbons, or ellipses in the modeled objects) to assist in finding
and identifying image objects appearing in the input image. Outputs from both the
predictor and the edge mapper and linker serve as inputs to the interpreter. The
interpreter is essentially a graph matcher. It tries to find the most matches among
subgraphs of the image observation graph and the prediction graph. Each match
becomes an interpretation graph. Partial matching is accommodated in the interpretation
process through consistency checks.
[Figure: block diagram of the system components: user, parser, algebra, geometry, graphics, predictor, edge mapping module, and interpreter.]
The basic interpretation process is summarized in Figure 14.29 where models
are given for two wide-bodied aircraft (a Boeing 747 and a Lockheed L-1011),
and the interpretation of an aircraft from gray-level image to ACRONYM's interpretation
is shown.
[Figure 14.29 Panels (a) through (d).]
[Figure: organization of the color region analyzer: preliminary segmentation, bottom-up process, structured data network, plan, top-down process, and production system.]
between objects. The rules also have weights which indicate the level of uncertainty
of the knowledge. Each rule in the top-down set is a condition-action pair, where
the condition is a fuzzy predicate which examines the situation of the data base.
The action part includes operations to construct the scene description. An agenda
manages the activation of production rules and schedules the executable actions.
Examples of a typical property rule and a relation rule are as follows:
The first rule is a property rule about the color of the sky (blue or gray). The
second rule is a relation rule about the boundary between a building and the sky.
The boundary between the two has a lot of linear parts, and the building is not on
the upper side of that boundary.
The final product of the analyzer is, of course, a description of the scene.
This is constructed as a hierarchical network as illustrated in Figure 14.31.
Ohta's system has demonstrated that it can deal with fairly complex scenes,
including objects with substructures. To validate this claim, a number of outdoor
scenes from the Kyoto University campus were analyzed correctly by the system.
[Figure 14.31 Hierarchical scene description network: scene, object, region, subregion, patch, pixel.]
14.7 SUMMARY
EXERCISES
14.2. Describe the types of world knowledge a vision system must have to "comprehend"
the scene portrayed in Figure 14.3.
14.3. Suppose the CPU in a vision system takes 200 nanoseconds to perform memory/
register transfers and 500 nanoseconds to perform basic arithmetic operations. Esti-
mate the time required to produce a binary image for a system with a resolution of
256 x 256 pixels.
14.4. How much memory is required to produce and compare five different binary images,
each with a different threshold level? Assume a system resolution of 512 x 512.
Can the binary images be compressed in some way to reduce memory requirements?
14.5. Find the binary image for the array given below when the threshold is set at 35.
23 132 35
36 30 42 38
2 9 34 36
37 36 35 33
14.6. Given the following histogram, what are the most likely threshold points? Explain
why you chose the given points and rejected others.
[Histogram]
14.7. What is the value of the smoothed pixel for the associated mask?
      MASK                PIXELS
      3/16                7  8  9
3/16  1/4   3/16          5  4  6
      3/16                4  6  2
14.8. Compare the effects of the eight- and four-neighbor filters described in Section 14.2
when applied to the following array of pixel gray-level values.
58 8 10 12 29 32 30
47 8 9 10 9 30 29
58 7 S II 33 31 34
69 8 10 34 3' 29 33
68 9 32 30 29 5 6
87 31 32 32 28 6 7
7 8 33 33 29 7 8 7
9 30 32 31 28 8 8 9-
14.9. Low noise systems should use little or no filtering to avoid unnecessary blurring.
This means that more weight should be given to the pixel being smoothed. Define
two low-noise filters, one a four-neighbor and one an eight-neighbor filter, and compare
their effects on the array of Problem 14.5.
14.10. Using a value of n = 1, apply D and D (horizontally) to the array of Problem
14.5 and comment on the trace of any apparent edges.
14.11. Apply the vector gradient to the array of Problem 14.5 and compare the results to
those of Problem 14.7.
14.12. This problem relates to the application of template matching using correlation
techniques. The objective is to try to match an unknown two-dimensional curve or
waveform with a known waveform. Assume that both waveforms are discrete and are
represented as arrays of unsigned numbers. Write a program in any suitable language to
match the unknown waveform to the known waveform using the correlation function
given as
$$c_i = \frac{\langle X, Z_i \rangle}{\|X\| \, \|Z_i\|}$$
where $X$ is the unknown pattern vector, $Z_i$ is the known pattern vector at position $i$,
$\langle X, Z \rangle$ denotes the inner product of $X$ and $Z$, and $\|X\|$ is the norm of $X$,
$$\|X\| = \sqrt{\langle X, X \rangle}$$
14.13. Write a program to apply the Sobel edge detection mask to an array consisting of
256 x 256 pixel gray-level values.
14.14. Color and texture are both potentially useful in defining regions. Describe an algorithm
that could be used to determine regions that are homogeneous in color.
14.15. Referring to Problem 14.14, develop an algorithm that can be used to define regions
that are homogeneous in texture.
14.16. Referring to the two previous problems, develop an algorithm that determines regions
on the basis of homogeneity in both color and texture.
15
Expert Systems Architectures
This chapter describes the basic architectures of knowledge-based systems with emphasis
placed on expert systems. Expert systems are a recent product of artificial intelligence.
They began to emerge as university research systems during the early 1970s.
They have now become one of the more important innovations of AI since they
have been shown to be successful commercial products as well as interesting research
tools.
Expert systems have proven to be effective in a number of problem domains
which normally require the kind of intelligence possessed by a human expert. The
areas of application are almost endless. Wherever human expertise is needed to
solve problems, expert systems are likely candidates for application. Application
domains include law, chemistry, biology, engineering, manufacturing, aerospace,
military operations, finance, banking, meteorology, geology, geophysics, and more.
In this chapter we explore expert system architectures and related building
tools. We also look at a few of the more important application areas as well. The
material is intended to acquaint the reader with the basic concepts underlying expert
systems and to provide enough of the fundamentals needed to build basic systems
or pursue further studies and conduct research in the area.
15.1 INTRODUCTION
Expert systems differ from conventional computer systems in several important ways.
1. Expert systems use knowledge rather than data to control the solution process.
"In the knowledge lies the power" is a theme repeatedly followed and supported
throughout this book. Much of the knowledge used is heuristic in nature rather
than algorithmic.
2. The knowledge is encoded and maintained as an entity separate from the
control program. As such, it is not compiled together with the control program
itself. This permits the incremental addition and modification (refinement) of the
knowledge base without recompilation of the control programs. Furthermore, it is
possible in some cases to use different knowledge bases with the same control
programs to produce different types of expert systems. Such systems are known as
expert system shells since they may be loaded with different knowledge bases.
3. Expert systems are capable of explaining how a particular conclusion was
reached, and why requested information is needed during a consultation. This is
important as it gives the user a chance to assess and understand the system's reasoning
ability, thereby improving the user's confidence in the system.
Background History
Expert systems first emerged from the research laboratories of a few leading U.S.
universities during the 1960s and 1970s. They were developed as specialized problem
solvers which emphasized the use of knowledge rather than algorithms and general
search methods. This approach marked a significant departure from conventional
AI systems architectures at the time. The accepted direction of researchers then
was to use AI systems that employed general problem solving techniques such as
hill-climbing or means-end analysis (Chapter 9) rather than specialized domain knowledge
and heuristics. This departure from the norm proved to be a wise choice. It
led to the development of a new class of successful systems and special system
designs.
The first expert system to be completed was DENDRAL, developed at Stanford
University in the late 1960s. This system was capable of determining the structure
of chemical compounds given a specification of the compound's constituent elements
and mass spectrometry data obtained from samples of the compound. DENDRAL
used heuristic knowledge obtained from experienced chemists to help constrain the
problem and thereby reduce the search space. During tests, DENDRAL discovered
a number of structures previously unknown to expert chemists.
As researchers gained more experience with DENDRAL, they found how
difficult it was to elicit expert knowledge from experts. This led to the development
of Meta-DENDRAL, a learning component for DENDRAL which was able to learn
rules from positive examples, a form of inductive learning described later in detail
(Chapters 18 and 19).
Shortly after DENDRAL was completed, the development of MYCIN began
at Stanford University. MYCIN is an expert system which diagnoses infectious
blood diseases and determines a recommended list of therapies for the patient. As
part of the Heuristic Programming Project at Stanford, several projects directly
related to MYCIN were also completed including a knowledge acquisition component
called TEIRESIAS, a tutorial component called GUIDON, and a shell component
called EMYCIN (for Essential MYCIN). EMYCIN was used to build other diagnostic
systems including PUFF, a diagnostic expert for pulmonary diseases. EMYCIN
also became the design model for several commercial expert system building tools.
MYCIN's performance improved significantly over a several year period as
additional knowledge was added. Tests indicate that MYCIN's performance now
equals or exceeds that of experienced physicians. The initial MYCIN knowledge
base contained only about 200 rules. This number was gradually increased to more
than 600 rules by the early 1980s. The added rules significantly improved MYCIN's
performance, leading to a 65% success record which compared favorably with experienced
physicians who demonstrated only an average 60% success rate (Lenat, 1984).
(An example of MYCIN's rules is given in Section 4.9, and the treatment of uncertain
knowledge by MYCIN is described in Section 6.5.)
Other early expert system projects included PROSPECTOR, a system that
assists geologists in the discovery of mineral deposits, and R1 (also known as XCON), a
system used by the Digital Equipment Corporation to select and configure components
of complex computer systems. Since the introduction of these early expert systems,
numerous commercial and military versions have been completed with a high degree
of success. Some of these application areas are itemized below.
Applications
Since the introduction of these early expert systems, the range and depth of applications
has broadened dramatically. Applications can now be found in almost all areas of
business and government. They include such areas as
The value of expert systems was well established by the early 1980s. A number of
successful applications had been completed by then and they proved to be cost
effective. An example which illustrates this point well is the diagnostic system
developed by the Campbell Soup Company.
Campbell Soup uses large sterilizers or cookers to cook soups and other canned
products at eight plants located throughout the country. Some of the larger cookers
hold up to 68,000 cans of food for short periods of cooking time. When difficult
maintenance problems occur with the cookers, the fault must be found and corrected
quickly or the batch of foods being prepared will spoil. Until recently, the company
had been depending on a single expert to diagnose and cure the more difficult problems,
flying him to the site when necessary. Since this individual will retire in a few
years taking his expertise with him, the company decided to develop an expert
system to diagnose these difficult problems.
After some months of development with assistance from Texas Instruments,
the company developed an expert system which ran on a PC. The system has about
150 rules in its knowledge base with which to diagnose the more complex cooker
problems. The system has also been used to provide training to new maintenance
personnel. Cloning multiple copies for each of the eight locations cost the company
only a few pennies per copy. Furthermore, the system cannot retire, and its perfor-
mance can continue to be improved with the addition of more rules. It has already
proven to be a real asset to the company. Similar cases now abound in many diverse
organizations.
The most common form of architecture used in expert and other types of knowledge-
based systems is the production system, also called the rule-based system. This
type of system uses knowledge encoded in the form of production rules, that is,
if . . . then rules. We may remember from Chapter 4 that rules have an antecedent
or condition part, the left-hand side, and a conclusion or action part, the right-
hand side.
A & B & C & D → E & F
Each rule represents a small chunk of knowledge relating to the given domain
of expertise. A number of related rules collectively may correspond to a chain of
inferences which lead from some initially known facts to some useful conclusions.
When the known facts support the conditions in the rule's left side, the conclusion
or action part of the rule is then accepted as known (or at least known with some
degree of certainty). Examples of some typical expert system rules were described
in earlier sections (for example, see Sections 4.9, 6.5, and 10.6).
Sec. 15.2 Rule-Based System Architectures 331
[Figure: system components, including the inference engine, case history file, input file, knowledge base, working memory, and learning module.]
The knowledge base contains facts and rules about some specialized knowledge
domain. An example of a simple knowledge base giving family relationships is
illustrated in Figure 15.2. The rules in this figure are given in the same LISP
format as those of Section 10.6 which is similar to the format given in the OPS5
language as presented by Brownston, Farrell, Kant, and Martin (1985). Each fact
and rule is identified with a name (a1, a2, . . . , r1, r2, . . .). For ease in
reading, the left side is separated from the right by the implication symbol ->.
Conjuncts on the left are given within single parentheses (sublists), and one or
more conclusions may follow the implication symbol. Variables are identified as a
symbol preceded by a question mark. It should be noted that rules found in real
working systems may have many conjuncts in the LHS. For example, as many as
eight or more are not uncommon.
      (male ?x))
 (r2 ((wife ?x ?y))
     ->
     (female ?x))
 (r3 ((wife ?x ?y))
     ->
     (husband ?y ?x))
 (r4 ((mother ?x ?y)
      (husband ?z ?x))
     ->
     (father ?z ?y))
 (r5 ((father ?x ?y)
      (wife ?z ?x))
     ->
     (mother ?z ?y))
 (r6 ((husband ?x ?y))
     ->
     (wife ?y ?x))
 (r7 ((father ?x ?z)
      (mother ?y ?z))
     ->
     (husband ?x ?y))
 (r8 ((father ?x ?z)
      (mother ?y ?z))
     ->
     (wife ?y ?x))
 (r9 ((father ?x ?y)
      (father ?y ?z))
     ->
     (grandfather ?x ?z)))
Figure 15.2 Facts and rules in a simple knowledge base.
In PROLOG, rules are written naturally as clauses with both a head and body.
For example, a rule about a patient's symptoms and the corresponding diagnosis
of hepatitis might read in English as the rule
The inference engine accepts user input queries and responses to questions through
the I/O interface and uses this dynamic information together with the static knowledge
(the rules and facts) stored in the knowledge base. The knowledge in the knowledge
base is used to derive conclusions about the current case or situation as presented
by the user's input.
The inferring process is carried out recursively in three stages: (1) match, (2)
select, and (3) execute. During the match stage, the contents of working memory
are compared to facts and rules contained in the knowledge base. When consistent
matches are found, the corresponding rules are placed in a conflict set. To find an
appropriate and consistent match, substitutions (instantiations) may be required.
Once all the matched rules have been added to the conflict set during a given
cycle, one of the rules is selected for execution. The criteria for selection may be
most recent use, rule condition specificity (the number of conjuncts on the left),
or simply the smallest rule number. The selected rule is then executed and the
right-hand side or action part of the rule is then carried out. Figure 15.3 illustrates
this match-select-execute cycle.
As an example, suppose the working memory contains the two clauses
When the match part of the cycle is attempted, a consistent match will be made
between these two clauses and rules r7 and r8 in the knowledge base. The match
is made by substituting Bob for ?x, Sam for ?z, and Sue for ?y. Consequently,
since all the conditions on the left of both r7 and r8 are satisfied, these two rules
will be placed in the conflict set. If there are no other working memory clauses to
match, the selection step is executed next. Suppose, for one or more of the selection
criteria stated above, r7 is the rule chosen to execute. The clause on the right side
of r7 is instantiated and the execution step is initiated. The execution step may
result in the right-hand clause (husband bob sue) being placed in working memory
or it may be used to trigger a message to the user. Following the execution step,
the match-select-execute cycle is repeated.
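The match-select-execute cycle can be sketched in Python as follows. This is a minimal illustration under several assumptions: clauses are tuples of strings, selection uses the smallest rule number, a rule instance whose conclusion is already in working memory is not placed in the conflict set, and the two working memory clauses are taken to be (father bob sam) and (mother sue sam), as implied by the substitutions described above. Rule r8 is assumed to conclude (wife ?y ?x). All function names are hypothetical.

```python
def is_var(term):
    return term.startswith("?")


def match(pattern, fact, bindings):
    """Match one condition against one fact; return extended bindings or None."""
    if len(pattern) != len(fact):
        return None
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if new.setdefault(p, f) != f:  # variable already bound differently
                return None
        elif p != f:
            return None
    return new


def match_all(conditions, facts, bindings=None):
    """Yield every consistent set of bindings satisfying all conditions."""
    bindings = {} if bindings is None else bindings
    if not conditions:
        yield bindings
        return
    for fact in facts:
        new = match(conditions[0], fact, bindings)
        if new is not None:
            yield from match_all(conditions[1:], facts, new)


def run(rules, facts):
    """Repeat match, select, execute until the conflict set is empty."""
    memory, fired = list(facts), []
    while True:
        conflict_set = []
        for number, lhs, rhs in rules:                      # match
            for b in match_all(lhs, memory):
                conclusion = tuple(b.get(t, t) for t in rhs)
                if conclusion not in memory:
                    conflict_set.append((number, conclusion))
        if not conflict_set:
            return memory, fired
        number, conclusion = min(conflict_set)              # select
        memory.append(conclusion)                           # execute
        fired.append(number)


RULES = [
    (7, [("father", "?x", "?z"), ("mother", "?y", "?z")], ("husband", "?x", "?y")),
    (8, [("father", "?x", "?z"), ("mother", "?y", "?z")], ("wife", "?y", "?x")),
]
FACTS = [("father", "bob", "sam"), ("mother", "sue", "sam")]
memory, fired = run(RULES, FACTS)
```

With these inputs, r7 fires on the first cycle and r8 on the second, after which no new rule instance can enter the conflict set and the loop stops. The list of fired rule numbers is also the kind of trace an explanation facility could later replay.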
[Figure 15.3 The production system inference cycle.]
As another example of matching, suppose the two facts (a6 (father sam bill))
and (a7 (father bill pam)) have been added to the knowledge base and the immediate
goal is a query about Pam's grandfather. When made, assume this query has resulted
in placement of the clause (grandfather ?x pam) into working memory. For this
goal to succeed, consistent substitutions must be made for the variables ?x and ?y
in rule r9 with a6 and a7. This will be the case if Sam and Bill are substituted for
?x and ?y in the subgoal left-hand conditions of r9. The right-hand side will then
correctly state that Pam's grandfather is Sam.
When the left side of a sequence of rules is instantiated first and the rules are
executed from left to right, the process is called forward chaining. This is also
known as data-driven inference since input data are used to guide the direction of
the inference process. For example, we can chain forward to show that when a
student is encouraged, is healthy, and has goals, the student will succeed.
On the other hand, when the right side of the rules is instantiated first, the
left-hand conditions become subgoals. These subgoals may in turn cause sub-subgoals
to be established, and so on until facts are found to match the lowest subgoal
conditions. When this form of inference takes place, we say that backward chaining
is performed. This form of inference is also known as goal-driven inference since
an initial goal establishes the backward direction of the inferring.
For example, in MYCIN the initial goal in a consultation is "Does the patient
have a certain disease?" This causes subgoals to be established such as "Are certain
bacteria present in the patient?" Determining if certain bacteria are present may
require such things as tests on cultures taken from the patient. This process of
setting up subgoals to confirm a goal continues until all the subgoals are eventually
satisfied or fail. If satisfied, the backward chain is established thereby confirming
the main goal.
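Backward chaining can be sketched in Python using the grandfather query as the goal. This is a minimal illustration, not MYCIN's control structure: clauses are flat tuples of strings, rule variables are renamed on each use so they cannot clash with variables in the goal, and there is no loop detection, so rule sets that can call each other indefinitely would not terminate. The function names are hypothetical.

```python
import itertools

_fresh = itertools.count()


def is_var(term):
    return term.startswith("?")


def walk(term, subst):
    """Follow variable bindings until a constant or unbound variable is reached."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify(a, b, subst):
    """Unify two flat clauses; return an extended substitution or None."""
    if len(a) != len(b):
        return None
    subst = dict(subst)
    for x, y in zip(a, b):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if is_var(x):
            subst[x] = y
        elif is_var(y):
            subst[y] = x
        else:
            return None
    return subst


def rename(rule):
    """Give the rule's variables fresh names for this use."""
    n = next(_fresh)
    fresh = lambda clause: tuple(f"{t}_{n}" if is_var(t) else t for t in clause)
    lhs, rhs = rule
    return [fresh(c) for c in lhs], fresh(rhs)


def prove(goals, facts, rules, subst):
    """Yield substitutions under which every goal is satisfied."""
    if not goals:
        yield subst
        return
    goal, rest = goals[0], goals[1:]
    for fact in facts:                       # a fact may satisfy the goal directly
        s = unify(goal, fact, subst)
        if s is not None:
            yield from prove(rest, facts, rules, s)
    for rule in rules:                       # or a rule's conditions become subgoals
        lhs, rhs = rename(rule)
        s = unify(goal, rhs, subst)
        if s is not None:
            yield from prove(lhs + rest, facts, rules, s)


FACTS = [("father", "sam", "bill"), ("father", "bill", "pam")]
R9 = ([("father", "?x", "?y"), ("father", "?y", "?z")], ("grandfather", "?x", "?z"))
answers = [walk("?x", s)
           for s in prove([("grandfather", "?x", "pam")], FACTS, [R9], {})]
```

The goal matches the right-hand side of r9, which turns the two father conditions into subgoals; these are then satisfied by the two stored facts, binding the grandfather variable to sam.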
When rules are executed, the resulting action may be the placement of some
new facts in working memory, a request for additional information from the user,
or simply the stopping of the search process. If the appropriate knowledge has
been stored in the knowledge base and all required parameter values have been
provided by the user, conclusions will be found and will be reported to the user.
The chaining continues as long as new matches can be found between clauses in
the working memory and rules in the knowledge base. The process stops when no
new rules can be placed in the conflict set.
Some systems use both forward and backward chaining, depending on the
type of problem and the information available. Likewise, rules may be tested exhaustively
or selectively, depending on the control structure. In MYCIN, rules in the
KB are tested exhaustively. However, when the number of rules exceeds a few
hundred, this can result in an intolerable amount of searching and matching. In
such cases, techniques such as those found in the RETE algorithm (Chapter 10)
may be used to limit the search.
Many expert systems must deal with uncertain information. This will be the
case when the evidence supporting a conclusion is vague, incomplete, or otherwise
uncertain. To accommodate uncertainties, some form of probabilities, certainty factors,
fuzzy logic, heuristics, or other methods must be introduced into the inference
process. These methods were introduced in Chapters 5 and 6. The reader is urged
at this time to review those methods to see how they may be applied to expert
systems.
The explanation module provides the user with an explanation of the reasoning
process when requested. This is done in response to a how query or a why query.
To respond to a how query, the explanation module traces the chain of rules
fired during a consultation with the user. The sequence of rules that led to the conclusion
is then printed for the user in an easy to understand human-language style. This
permits the user to actually see the reasoning process followed by the system in
arriving at the conclusion. If the user does not agree with the reasoning steps presented,
they may be changed using the editor.
To respond to a why query, the explanation module must be able to explain
why certain information is needed by the inference engine to complete a step in
the reasoning process before it can proceed. For example, in diagnosing a car that
will not start, a system might be asked why it needs to know the status of the
336 Expert Systems Architectures Chap. 15
distributor spark. In response, the system would reply that it needs this information
to determine if the problem can be isolated to the ignition system. Again, this
information allows the user to determine if the system's reasoning steps appear to
be sound. The explanation module programs give the user the important ability to
follow the inferencing steps at any time during the consultation.
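One simple way to support both kinds of query is to keep a record of the rules fired so far together with the rule currently under test. The sketch below assumes this design; the `Explainer` class, its method names, and the rule format are hypothetical and are not taken from any particular system.

```python
class Explainer:
    """Keeps enough history to answer how and why queries."""

    def __init__(self):
        self.fired = []      # rules fired so far, in order
        self.current = None  # rule whose premises are being checked now

    def record(self, rule):
        self.fired.append(rule)

    def how(self):
        # One readable line per fired rule, in the order they were used.
        return [f"{r['name']}: IF {' AND '.join(r['if'])} THEN {r['then']}"
                for r in self.fired]

    def why(self, question):
        # Explain the request in terms of the rule being evaluated.
        r = self.current
        return (f"I need to know '{question}' to decide whether "
                f"'{r['then']}' holds (rule {r['name']}).")
```

For the car example, a rule concluding `ignition_fault` from `no_spark` would let `why("distributor spark")` report that the answer is needed to decide whether the fault lies in the ignition system.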
The editor is used by developers to create new rules for addition to the knowledge
base, to delete outmoded rules, or to modify existing rules in some way. Some of
the more sophisticated expert system editors provide the user with features not
found in typical text editors, such as the ability to perform some types of consistency
tests for newly created rules, to add missing conditions to a rule, or to reformat a
newly created rule. Such systems also prompt the user for missing information,
and provide other general guidance in the KB creation process.
One of the most difficult tasks in creating and maintaining production systems
is the building and maintaining of a consistent but complete set of rules. This
should be done without adding redundant or unnecessary rules. Building a knowledge
base requires careful planning, accounting, and organization of the knowledge struc-
tures. It also requires thorough validation and verification of the completed knowledge
base, operations which have yet to be perfected. An "intelligent" editor can greatly
simplify the process of building a knowledge base.
TEIRESIAS (Davis, 1982) is an example of an intelligent editor developed
to assist users in building a knowledge base directly without the need for an intermediary
knowledge engineer. TEIRESIAS was developed to work with systems like
MYCIN in providing a direct user-to-system dialog. TEIRESIAS assists the user
in formulating, checking, and modifying rules for inclusion in the performance
program's knowledge base. For this, TEIRESIAS uses some metaknowledge, that
is, knowledge about MYCIN's knowledge. The dialog is carried out in a near English
form so that the user needs to know little about the internal form of the rules.
The input-output interface permits the user to communicate with the system in a
more natural way by permitting the use of simple selection menus or the use of a
restricted language which is close to a natural language. This means that the system
must have special prompts or a specialized vocabulary which encompasses the termi-
nology of the given domain of expertise. For example, MYCIN can recognize many
medical terms in addition to various common words needed to communicate. For
this, MYCIN has a vocabulary of some 2000 words.
Personal Consultant Plus, a commercial PC version of the MYCIN architecture,
uses menus and English prompts to communicate with the user. The prompts, written
in standard English, are provided by the developer during the system building stage.
How and why explanations are also given in natural language form.
Sec. 15.3 Nonproduction System Architectures 337
The learning module and history file are not common components of expert
systems. When they are provided, they are used to assist in building and refining
the knowledge base. Since learning is treated in great detail in later chapters, no
description is given here.
Other, less common expert system architectures (although no less important) are
those based on nonproduction rule-representation schemes. Instead of rules, these
systems employ more structured representation schemes like associative or semantic
networks, frame and rule structures, decision trees, or even specialized networks
like neural networks. In this section we examine some typical system architectures
based on these methods.
[Figure: CASNET causal network model. Three levels (disease categories, pathophysiological states, and patient observations) connected by classification links, causal links, and associational links.]
model as part of the cause and effect relationship relating symptoms and other
signs to diseases.
Inference is accomplished by traversing the network, following the most plausi-
ble paths of causes and effects. Once a sufficiently strong path has been determined
through the network, diagnostic conclusions are inferred using classification tables
that interpret patterns of the causal network. These tables are similar to rule interpreta-
tions.
The CASNET system was never used much beyond the initial research stage.
At the time, physicians were reluctant to use computer systems in spite of performance
tests in which CASNET scored well.
Frame Architectures
Typical findings
Logical decision criteria
Complementary relations to other frames
Differential diagnosis
Scoring
The patient findings are matched against frames, and when a close match is
found, a trigger status occurs. A trigger is a finding that is so strongly related to a
disorder that the system regards it as an active hypothesis, one to be pursued further.
A special is-sufficient slot is used to confirm the presence of a disease when key
findings correlate with the slot contents.
Knowledge for expert systems may be stored in the form of a decision tree when
the knowledge can be structured in a top-to-bottom manner. For example, the identification
of objects (equipment faults, physical objects, diseases, and the like) can be
made through a decision tree structure. Initial and intermediate nodes in the tree
correspond to object attributes, and terminal nodes correspond to the identities of
objects. Attribute values for an object determine a path to a leaf node in the tree
which contains the object's identification. Each object attribute corresponds to a
nonterminal node in the tree and each branch of the decision tree corresponds to
an attribute value or set of values.
A segment of a decision tree knowledge structure taken from an expert system
used to identify objects such as liquid chemical waste products is illustrated in
Figure 15.5 (Patterson, 197). Each node in the tree corresponds to an identifying
attribute such as molecular weight, boiling point, burn test color, or solubility test
results. Each branch emanating from a node corresponds to a value or range of
values for the attribute such as 20-37 degrees C, yellow, or nonsoluble in sulphuric
acid.
An identification is made by traversing a path through the tree (or network)
until the path leads to a unique leaf node which corresponds to the unknown object's
identity.
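The traversal can be sketched with a nested dictionary standing in for the tree. The attribute names, values, and identities below ("waste A" and so on) are invented for illustration and are not the contents of the actual waste identification system; ranges of values are omitted to keep the sketch short.

```python
# Hypothetical tree: nonterminal nodes name an attribute and map each
# attribute value to a subtree; leaf nodes carry the object's identity.
TREE = {
    "attribute": "burn_test_color",
    "branches": {
        "yellow": {
            "attribute": "soluble_in_sulphuric_acid",
            "branches": {
                True: {"identity": "waste A"},
                False: {"identity": "waste B"},
            },
        },
        "blue": {"identity": "waste C"},
    },
}

def identify(node, observations):
    """Follow the branch selected by each observed value until a leaf."""
    while "identity" not in node:
        value = observations[node["attribute"]]   # observed attribute value
        node = node["branches"][value]            # take the matching branch
    return node["identity"]
```

An object that burns yellow and is nonsoluble in sulphuric acid follows the path yellow, False and is identified as "waste B". An observation with no matching branch raises a `KeyError`, which a real system would handle by asking for more tests.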
The knowledge base, which is the decision tree for an identification system,
can be constructed with a special tree-building editor or with a learning module. In
either case, a set of the most discriminating attributes for the class of objects being
identified should be selected. Only those attributes that discriminate well among
different objects need be used. Permissible values for each of the attributes are
grouped into separable sets, and each such set determines a branch from the attribute
node to the next node.
New nodes and branches can be added to the tree when additional attributes
are needed to further discriminate among new objects. As the system gains experience,
the values associated with the branches can be modified for more accurate results.
[Figure 15.5: a segment of a decision tree for identifying liquid chemical waste products, with yes/no branches on tests such as the solubility test.]
1. There are a number of knowledge sources which are separate and independent
sets of coded knowledge. Each knowledge source may be thought of as a
specialist in some limited area needed to solve a given subset of problems.
The sources may contain knowledge in the form of procedures, rules, or other
schemes.
2. A globally accessible data base structure, called a blackboard, contains the
current problem state and information needed by the knowledge sources (input
data, partial solutions, control data, alternatives, final solutions). The knowledge
sources make changes to the blackboard data that incrementally lead to a
solution. Communication and interaction between the knowledge sources takes
place solely through the blackboard.
3. Control information may be contained within the sources, on the blackboard,
or possibly in a separate module. (There is no actual control unit specified as
Control information
H. Penny Nii (1986a) has aptly described the blackboard problem solving
strategy through the following analogy.
Imagine a room with a large blackboard on which a group of experts are piecing
together a jigsaw puzzle. Each of the experts has some special knowledge about solving
puzzles (e.g., a border expert, a shape expert, a color expert, etc.). Each member
examines his or her pieces and decides if they will fit into the partially completed
puzzle. Those members having appropriate pieces go up to the blackboard and update
the evolving solution. The whole puzzle can be solved in complete silence with no
direct communication among members of the group. Each person is self-activating,
knowing when he or she can contribute to the solution. The solution evolves in this
incremental way with each expert contributing dynamically on an opportunistic basis,
that is, as the opportunity to contribute to the solution arises.
The objects on the blackboard are hierarchically organized into levels which facilitate
analysis and solution. Information from one level serves as input to a set of knowledge
sources. The sources modify the knowledge and place it on the same or different
levels.
The control information is used by the control module to determine the focus
of attention. This determines the next item to be processed. The focus of attention
can be the choice of knowledge sources or the blackboard objects or both. If both,
the control determines which sources to apply to which objects.
Problem solving proceeds with a knowledge source making changes to the
blackboard objects. Each source indicates the contribution it can make to the new
solution state. Using this information, the control module chooses a focus of attention.
If the focus of attention is a knowledge source, a blackboard object is chosen as
the context of its invocation. If the focus of attention is a blackboard object, a
knowledge source which can process that object is chosen. If the focus of attention
is both a source and an object, that source is executed within that context.
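The control cycle can be illustrated with a toy blackboard. Everything in this sketch is an assumption made for the example: the `KnowledgeSource` class, the dictionary used as the blackboard, and the trivial control policy that simply picks the first source able to contribute. A real system such as HEARSAY uses far more elaborate scheduling.

```python
class KnowledgeSource:
    def __init__(self, name, can_contribute, contribute):
        self.name = name
        self.can_contribute = can_contribute  # blackboard -> bool
        self.contribute = contribute          # blackboard -> None (updates it)

def run_blackboard(blackboard, sources, max_cycles=100):
    """Repeatedly choose a focus of attention and let that source act."""
    log = []
    for _ in range(max_cycles):
        # Each source indicates whether it can contribute to the new state.
        ready = [ks for ks in sources if ks.can_contribute(blackboard)]
        if not ready:
            break                       # no source can contribute any more
        focus = ready[0]                # toy control: first ready source
        focus.contribute(blackboard)    # sources interact only via the board
        log.append(focus.name)
    return log

# Two hypothetical specialists: one parses raw input, one sums the numbers.
def parse(bb):
    bb["numbers"] = [int(tok) for tok in bb["raw"].split()]

def add(bb):
    bb["sum"] = sum(bb["numbers"])

SOURCES = [
    KnowledgeSource("adder",
                    lambda bb: bb["numbers"] is not None and bb["sum"] is None,
                    add),
    KnowledgeSource("parser", lambda bb: bb["numbers"] is None, parse),
]
```

Although the adder is listed first, it cannot act until the parser has posted its partial result, so the sources activate opportunistically in the order parser, adder.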
Blackboard systems have been gaining some popularity recently. They have
been applied to a number of different application areas. One of the first applications
was in the HEARSAY family of projects, which are speech understanding systems
(Reddy et al., 1976). More recently, systems have been developed to analyze complex
scenes, and to model the human cognitive processes (Nii, 1986b).
Little work has been done in the area of analogical reasoning systems. Yet this is
one of the most promising areas for general problem solving. We humans make
extensive use of our previous experience in solving everyday problems. This is
because new problems are frequently similar to previously encountered problems.
Neural networks are large networks of simple processing elements or nodes which
process information dynamically in response to external inputs. The nodes are simpli-
fied models of neurons. The knowledge in a neural network is distributed throughout
the network in the form of internode connections and weighted links which form
the inputs to the nodes. The link weights serve to enhance or inhibit the input
stimuli values which are then added together at the nodes. If the sum of all the
inputs to a node exceeds some threshold value T, the node executes and produces
an output which is passed on to other nodes or is used to produce some output
response. In the simplest case, no output is produced if the total input is less than
T. In more complex models, the output will depend on a nonlinear activation function.
Neural networks were originally inspired as being models of the human nervous
system. They are greatly simplified models to be sure (neurons are known to be
fairly complex processors). Even so, they have been shown to exhibit many "intelli-
gent" abilities, such as learning, generalization, and abstraction.
A single node is illustrated in Figure 15.7. The inputs to the node are the
values $x_1, x_2, \ldots, x_n$, which typically take on values of $-1$, $0$, $1$, or real values
within the range $(-1, 1)$. The weights $w_1, w_2, \ldots, w_n$ correspond to the synaptic
strengths of a neuron. They serve to increase or decrease the effects of the corresponding
$x_i$ input values. The sum of the products $x_i w_i$, $i = 1, 2, \ldots, n$, serves as
the total combined input to the node. If this sum is large enough to exceed the
threshold amount T, the node fires, and produces an output y, an activation function
value placed on the node's output links. This output may then be the input to
other nodes or the final output response from the network.
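The behavior of a single node in the simplest case can be written out directly. In this sketch the function name `node_output` and the choice of 1 and 0 to represent "fires" and "no output" are assumptions; more complex models would replace the comparison with a nonlinear activation function.

```python
def node_output(x, w, T):
    """Simplest threshold node: fire only if the weighted sum exceeds T."""
    total = sum(xi * wi for xi, wi in zip(x, w))  # sum of products x_i * w_i
    return 1 if total > T else 0
```

For example, with weights (0.5, -0.5, 2.0) and threshold 0.5, the inputs (1, -1, 0) give a combined input of 1.0 and the node fires, while the inputs (1, 1, 0) give a combined input of 0 and it does not.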
Figure 15.8 illustrates three layers of a number of interconnected nodes. The
first layer serves as the input layer, receiving inputs from some set of stimuli. The
second layer (called the hidden layer) receives inputs from the first layer and produces
a pattern of inputs to the third layer, the output layer. The pattern of outputs from
the final layer are the network's responses to the input stimuli patterns. Input links
to layer $j$ ($j = 1, 2, 3$) have weights $w_{ij}$ for $i = 1, 2, \ldots, n$.
General multilayer networks having n nodes (number of rows) in each of m
layers (number of columns of nodes) will have weights represented as an $n \times m$
matrix W. Using this representation, nodes having no interconnecting links will
have a weight value of zero. Networks consisting of more than three layers would,
of course, be correspondingly more complex than the network depicted in Figure
15.8.
A neural network can be thought of as a black box that transforms the input
vector x to the output vector y where the transformation performed is the result of
the pattern of connections and weights, that is, according to the values of the weight
matrix W.
Consider the vector product
$$x \cdot w = |x|\,|w| \cos\theta$$
where $|x|$ denotes the norm or length of the vector $x$ and $\theta$ is the angle between
the two vectors. Note that this product is
maximum when both vectors point in the same direction, that is, when $\theta = 0$.
Figure 15.8 A multilayer neural network (layer 1, layer 2, layer 3).
$$W_{\text{new}} = W_{\text{old}} + a \cdot D \cdot \frac{x}{|x|^2}$$
where $0 < a < 1$ is a learning constant that determines the rate of learning. When
the difference D is large, the adjustment to the weights W is large, but when the
output response y is close to the target response y' the adjustment will be small.
When the difference D is near zero, the training process terminates at which point
the network will produce the correct response for the given input patterns x.
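The effect of the learning constant can be seen in a small numerical sketch. It assumes the update has the form "new weights = old weights + a * D * x / |x|^2", that D is the difference between the target response y' and the actual response y, and that the node's response is simply the weighted sum of its inputs. These are simplifying assumptions made for illustration, and the function name `train_step` is hypothetical.

```python
def train_step(w, x, target, a=0.5):
    """One supervised update for a single linear node (illustrative only)."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # current response
    D = target - y                              # difference from the target
    norm_sq = sum(xi * xi for xi in x)          # |x|^2
    w_new = [wi + a * D * xi / norm_sq for wi, xi in zip(w, x)]
    return w_new, D
```

Under these assumptions each step moves the response a fraction a of the way toward the target, so the difference shrinks by a factor of (1 - a) per step. Starting from zero weights with x = (1, 1), target 1, and a = 0.5, the first step gives weights (0.25, 0.25), and repeated steps drive D toward zero.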
In unsupervised learning, the training examples consist of the input vectors x
only. No desired response y' is available to guide the system. Instead, the learning
process must find the weights $w_{ij}$ with no knowledge of the desired output response.
We leave the unsupervised learning description until Chapter 18 where learning is
covered in somewhat more detail.
Figure 15.10 A simple neural network expert system. From S. I. Gallant, ACM
Communications, Vol. 31, No. 2, p. 152, 1988. By permission.
of +1 (true). Negative symptoms are given an input value of -1 (false), and unknown
symptoms are given the value 0. Input symptom values are multiplied by their
corresponding weights $w_{ij}$. Numbers within the nodes are initial bias weights $w_{i0}$,
and numbers on the links are the other node input weights. When the sum of the
weighted products of the inputs exceeds 0, an output will be present on the corresponding
node output and serve as an input to the next layer of nodes.
As an example, suppose the patient has swollen feet ($u_1 = +1$) but not red
ears ($u_2 = -1$) nor hair loss ($u_3 = -1$). This gives a value of $u_7 = +1$ (since
$0 + (2)(1) + (-2)(-1) + (3)(-1) = 1$), suggesting the patient has superciliosis.
When it is also known that the other symptoms of the patient are false ($u_4 = u_5 =
u_6 = -1$), it may be concluded that namatosis is absent ($u_8 = -1$), and
therefore that birambio ($u_{10} = +1$) should be prescribed while placibin should not
be prescribed ($u_9 = -1$). In addition, it will be found that posiboost should also
be prescribed ($u_{11} = +1$).
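The arithmetic for the superciliosis node can be checked with a short function. The bias of 0 and the weights 2, -2, and 3 are taken from the computation in the text; the function name `node_value` and the convention of returning 0 when the sum is exactly zero are assumptions made for this sketch.

```python
def node_value(bias, weights, inputs):
    """Return +1, -1, or 0 according to the sign of the weighted sum."""
    total = bias + sum(w * u for w, u in zip(weights, inputs))
    if total > 0:
        return 1
    if total < 0:
        return -1
    return 0

# Superciliosis node: bias 0, weights for swollen feet, red ears, hair loss.
SUPERCILIOSIS_BIAS = 0
SUPERCILIOSIS_WEIGHTS = [2, -2, 3]
```

With swollen feet present and the other two symptoms absent, the inputs are (+1, -1, -1), the weighted sum is 0 + 2 + 2 - 3 = 1, and the node value is +1.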
The intermediate triangular shaped nodes were added by the training algorithm.
These additional nodes are needed so that weight assignments can be made which
permit the computations to work correctly for all training instances.
Knowledge Acquisition and Validation 347
Sec. 15.5
Deductions can be made just as well when only partial information is available.
For example, when a patient has swollen feet and suffers from hair loss, it may be
concluded the patient has superciliosis, regardless of whether or not the patient has
red ears. This is so because the unknown variable cannot force the sum to change
to negative.
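The reasoning about unknown inputs amounts to a simple bound: the known part of the sum must outweigh the largest change the unknown inputs could cause. The sketch below assumes the weights 2, -2, and 3 for swollen feet, red ears, and hair loss used in the earlier computation; the function name `conclusion_is_firm` is hypothetical.

```python
def conclusion_is_firm(bias, weights, inputs):
    """inputs use +1 (true), -1 (false), or 0 (unknown).

    Returns (sign, firm), where firm is True when no assignment of the
    unknown inputs could change the sign of the weighted sum.
    """
    known = bias + sum(w * u for w, u in zip(weights, inputs))
    # Largest possible swing from unknown inputs, each limited to +1 or -1.
    swing = sum(abs(w) for w, u in zip(weights, inputs) if u == 0)
    sign = 1 if known > 0 else -1 if known < 0 else 0
    return sign, abs(known) > swing
```

With swollen feet and hair loss true and red ears unknown, the known sum is 2 + 3 = 5 while red ears can shift it by at most 2, so the positive conclusion stands. If only swollen feet were known, the sum of 2 could be overturned by a swing of up to 5, and no firm conclusion could be drawn.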
A system such as this can also explain how or why a conclusion was reached.
For example, when inputs and outputs are regarded as rules, an output can be
explained as the conclusion to a rule. If placibin is true, the system might explain
why with a statement such as
One of the most difficult tasks in building knowledge-based systems is in the acquisi-
tion and encoding of the requisite domain knowledge. Knowledge for expert systems
must be derived from expert sources like experts in the given field, journal articles,
texts, reports, data bases, and so on. Elicitation of the right knowledge can take
several man years and cost hundreds of thousands of dollars. This process is now
recognized as one of the main bottlenecks in building expert and other knowledge-
based systems. Consequently, much effort has been devoted to more effective methods
of acquisition and coding.
Pulling together and correctly interpreting the right knowledge to solve a set
of complex tasks is an onerous job. Typically, experts do not know what specific
knowledge is being applied nor just how it is applied in the solution of a given
problem. Even if they do know, it is likely they are unable to articulate the problem
solving process well enough to capture the low-level knowledge used and the inferring
processes applied. This difficulty has led to the use of AI experts (called knowledge
engineers) who serve as intermediaries between the domain expert and the system.
The knowledge engineer elicits information from the experts and codes this knowledge
into a form suitable for use in the expert system.
The knowledge elicitation process is depicted in Figure 15.11. To elicit the
requisite knowledge, a knowledge engineer conducts extensive interviews with domain
experts. During the interviews, the expert is asked to solve typical problems in the
domain of interest and to explain his or her solutions.
Using the knowledge gained from experts and other sources, the knowledge
engineer codes the knowledge in the form of rules or some other representation
scheme. This knowledge is then used to solve sample problems for review and
validation by the experts. Errors and omissions are uncovered and corrected, and
additional knowledge is added as needed. The process is repeated until a sufficient
body of knowledge has been collected to solve a large class of problems in the
chosen domain. The whole process may take as many as tens of person years.
Penny Nii, an experienced knowledge engineer at Stanford University, has
described some useful practices to follow in solving acquisition problems through
a sequence of heuristics she uses. They have been summarized in the book The
Fifth Generation by Feigenbaum and McCorduck (1983) as follows.
You can't be your own expert. By examining the process of your own expertise you
risk becoming like the Centipede who got tangled up in her own legs and stopped
dead when she tried to figure out how she moved a hundred legs in harmony.
From the beginning, the knowledge engineer must count on throwing efforts away.
Writers make drafts, painters make preliminary sketches; knowledge engineers are no
different.
The problem must be well chosen. AI is a young field and isn't ready to take on
every problem the world has to offer. Expert systems work best when the problem is
well bounded, which is computer talk to describe a problem for which large amounts
of specialized knowledge may be needed, but not a general knowledge of the world.
If you want to do any serious application you need to meet the expert more than half
way; if he's had no exposure to computing, your job will be that much harder.
If none of the tools you normally use works, build a new one.
Dealing with anything but facts implies uncertainty. Heuristic knowledge is not hard
and fast and cannot be treated as factual. A weighting procedure has to be built into
the expert system to allow for expressions such as "I strongly believe that ..." or
"The evidence suggests that ..."
A high-performance program, or a program that will eventually be taken over by the
expert for his own use, must have very easy ways of allowing the knowledge to be
modified so that new information can be added and out-of-date information deleted.
The problem needs to be a useful, interesting one. There are knowledge-based programs
to solve arcane puzzles, but who cares? More important, the user has to understand
the system's real value to his work.
When Nii begins a project, she first persuades a human expert to commit the
considerable time that is required to have the expert's mind mined. Once this is
done, she immerses herself in the given field, reading texts, articles, and other
material to better understand the field and to learn the basic jargon used. She then
begins the interviewing process. She asks the expert to describe his or her tasks
and problem solving techniques. She asks the expert to choose a moderately difficult
problem to solve as an example of the basic approach. This information is then
collected, studied, and presented to other members of the development team so
that a quick prototype can be constructed for the expert to review. This serves
several purposes. First, it helps to keep the expert in the development loop and
interested. Secondly, it serves as a rudimentary model with which to uncover flaws
and other problems. It also helps both expert and developer in discovering the real
way the expert solves problems. This usually leads to a repeat of the problem
solving exercise, but this time in a step-by-step walk through of the sample problem.
Nii tests the accuracy of the expert's explanations by observing his or her behavior
and reliance on data and other sources of information. She is concerned more with
the manipulation of the knowledge than with the actual facts. Keeping the expert
focused on the immediate problem requires continual prompting and encouragement.
During the whole interview process Nii is mentally examining alternative ap-
proaches for the best knowledge representation and inferencing methods to see how
well each would best match the expert's behavior. The whole process of elicitation,
coding, and verification may take several iterations over a period of several months.
Recognizing the acquisition bottleneck in building expert systems, researchers
and vendors alike have sought new and better ways to reduce the burden and reliance
placed on knowledge engineers, and in general, ways to improve and speed up the
development process. This has led to a number of sophisticated building tools which
we consider next.
Since the introduction of the first successful expert systems in the late 1970s, a
large number of building tools have been introduced, both by the academic community
and industry. These tools range from high level programming languages to intelligent
editors to complete shell environment systems. A number of commercial products
are now available ranging in price from a few hundred dollars to tens of thousands
of dollars. Some are capable of running on medium size PCs while others require
larger systems such as LISP machines, minis, or even main frames.
When evaluating building tools for expert system development, the developer
should consider the following features and capabilities that may be offered in systems.
3. User interface characteristics (editor flexibility and ease of use, use of menus,
use of pop-up windows, developer provided text capabilities for prompts and
help messages, graphics capabilities, consistency checking for newly entered
knowledge, explanation of how and why capabilities, system help facilities,
screen formatting and color selection capabilities, network representation of
knowledge base, and forms of compilation available, batch or interactive).
4. General system characteristics and support available (types of applications
with which the system has been successfully used, the base programming
language in which the system was written, the types of hardware the systems
are supported on, general utilities available, debugging facilities, interfacing
flexibility to other languages and databases, vendor training availability and
cost, strength of software support, and company reputation).
A family of Personal Consultant expert system shells was developed by Texas Instruments,
Inc. (TI) in the early 1980s. These shells are rule-based building tools patterned
after the MYCIN system architecture and developed to run on a PC as well as on
larger systems such as the TI Explorer. The largest and most versatile of the Personal
Consultant family is Personal Consultant Plus.
Personal Consultant Plus permits the use of structures called frames (different
Sec. 15.6 Knowledge System Building Tools 351
Figure 15.12 Hierarchical frame structure in PC Plus (an electrical appliance frame with microwave cooker, iron, food blender, and toaster frames, and mechanical and electrical subsystem frames).
Radian Rulemaster
The Rulemaster system developed in the early 1980s by the Radian Corporation
was written in C language to run on a variety of mini- and microcomputer systems.
Rulemaster is a rule-based building tool which consists of two main components:
Radial, a procedural, block structured language for expressing decision rules related
to a finite state machine, and Rulemaker, a knowledge acquisition system which
induces decision trees from examples supplied by an expert. A program in Rulemaster
consists of a collection of related modules which interact to effect changes of state.
The modules may contain executable procedures, advice, or data. The building
system is illustrated in Figure 15.13.
Rulemaster's knowledge can be based on partial certainty using fuzzy logic
or heuristic methods defined by the developer. Users can define their own data
types or abstract types much the same as in Pascal. An explanation facility is provided
to explain its chain of reasoning. Programs in other languages can also be called
from Rulemaster.
One of the unique features of Rulemaster is the Rulemaker component which
has the ability to induce rules from examples. Experts are known to have difficulty
in directly expressing rules related to their decision processes. On the other hand,
they can usually come up with a wealth of examples in which they describe typical
solution steps. The examples provided by the expert offer a more accurate account of the way in
which the problem solving process is carried out. These examples are transformed
into rules by Rulemaker through an induction process.
[Figure 15.13: the Rulemaster building system, showing Rulemaker and an expert system of hierarchical Radial modules.]
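The idea of inducing a decision tree from expert examples can be sketched in a generic way. The code below is not Rulemaker's algorithm, which is not described here; it is a simple illustrative procedure that repeatedly splits on the attribute leaving the fewest misclassified examples. The attribute names and decisions are invented for the example.

```python
from collections import Counter

def induce(examples, attributes):
    """examples: list of (attribute-value dict, decision) pairs.

    Returns a decision (leaf) or a nested dict {attribute: {value: subtree}}.
    """
    decisions = [d for _, d in examples]
    if len(set(decisions)) == 1 or not attributes:
        return Counter(decisions).most_common(1)[0][0]   # leaf: majority

    def errors(attr):
        # Examples left misclassified if we split on attr and then stop.
        groups = {}
        for ex, d in examples:
            groups.setdefault(ex[attr], []).append(d)
        return sum(len(g) - Counter(g).most_common(1)[0][1]
                   for g in groups.values())

    best = min(attributes, key=errors)          # most discriminating attribute
    rest = [a for a in attributes if a != best]
    branches = {}
    for ex, d in examples:
        branches.setdefault(ex[best], []).append((ex, d))
    return {best: {v: induce(sub, rest) for v, sub in branches.items()}}

# Hypothetical examples an expert might supply.
EXAMPLES = [
    ({"temp": "high", "pressure": "low"}, "open_valve"),
    ({"temp": "high", "pressure": "high"}, "shut_down"),
    ({"temp": "low", "pressure": "low"}, "no_action"),
    ({"temp": "low", "pressure": "high"}, "no_action"),
]
```

For these four examples the procedure splits first on `temp`, since a low temperature always leads to the same decision, and only then on `pressure`. Each path from the root to a leaf can be read off as one rule.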
KEE is one of the more popular building tools for the development of larger-scale
systems. Developed by IntelliCorp, this system employs sophisticated representation
schemes structured around frames called units. The frames are made up of slots
and facets which contain object, attribute values, rules, methods, logical assertions,
text, or even other frames. The frames are organized into one or more knowledge
bases in the form of hierarchical structures which permit multiple inheritance down
hierarchical paths. Rules, procedures, and object oriented representation methods
are also supported.
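Slot inheritance down hierarchical paths, including inheritance from more than one parent, can be illustrated with a small class. This is a generic sketch, not KEE's implementation: the `Unit` class, its `get` method, and the depth-first, left-to-right search order are assumptions, as are the example frames.

```python
class Unit:
    """A frame with local slots and zero or more parent frames."""

    def __init__(self, name, parents=(), **slots):
        self.name = name
        self.parents = list(parents)
        self.slots = dict(slots)

    def get(self, slot):
        """Return a slot value, searching parents when it is not local."""
        if slot in self.slots:
            return self.slots[slot]
        for parent in self.parents:        # depth-first, left to right
            try:
                return parent.get(slot)
            except KeyError:
                continue                   # try the next parent
        raise KeyError(slot)

# Hypothetical frames: a toaster inherits from two parents.
appliance = Unit("appliance", power="electric")
heater = Unit("heater", hazard="burn")
toaster = Unit("toaster", parents=(appliance, heater), capacity=2)
```

Here `toaster.get("power")` is inherited from the appliance frame and `toaster.get("hazard")` from the heater frame, while `capacity` is stored locally. A slot found nowhere in the hierarchy raises `KeyError`.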
Inference is carried out through inheritance, forward-chaining, backward-chain-
ing, or a mixture of these methods. A form of hypothetical reasoning is also provided
through different viewpoints which may be explored concurrently. The viewpoints
represent different aspects of a situation, views of the same situation taken at different
times, hypothetical situations, or alternative courses of action. This feature permits
a user to compare competing courses of action or to reason in parallel about partial
solutions based on different approaches.
KEE's support environment includes a graphics-oriented debugging package,
flexible end-user interfaces using windows, menus, and an explanation capability
with graphic displays which can show inference chains. A graphics-based simulation
package called SimKit is available at additional cost.
KEE has been used for the development of intelligent user interfaces, genetics,
diagnosis and monitoring of complicated systems, planning, design, process control,
scheduling, and simulation. The system is LISP based, developed for operation on
systems such as Symbolics machines, Xerox 1100s, or TI Explorers. Systems can
also be ported to open architecture machines which support Common LISP without
extensive modification.
OPS5 System
The OPS5 and other OPS building tools were developed at Carnegie Mellon University
in conjunction with DEC during the late 1970s. This system was developed to
build the R1/XCON expert system which configures VAX and other DEC minicomputer
systems. The system is used to build rule-based production systems which use forward
chaining in the inference process (backward and mixed chaining is also possible).
The system was written in C language to run on the DEC VAX and other minicomputers.
It uses a sophisticated method of indexing rules (the Rete algorithm) to reduce the
matching times during the match-select-execute cycle. Examples of OPS5 rules
were given above in Section 15.2, and a description of the Rete match algorithm
was given in Section 10.6.
15.7 SUMMARY
Expert and other knowledge-based systems are usually composed of at least a knowledge
base, an inference engine, and some form of user interface. The knowledge
base, which is separate from the inference and control components, contains the
expert knowledge coded in some form such as production rules, networks of frames,
or other representation schemes. The inference engine manipulates the knowledge
structures in the knowledge base to perform a type of symbolic reasoning and draw
useful conclusions relating to the current task. The user interface provides the means
for dialog between the user and system. The user inputs commands, queries, and
responses to system messages, and the system, in turn, produces various messages
for the user. In addition to these three components, most systems have an editor
for use in creating and modifying the knowledge base structurea and an explanaton
module which provides the user with explanations of how a conclusion was reached
or why a piece of knowledge s needed. /-. te" s—terns also have some learning
capability and a ca . history file with whie . ,Omr'' " lonsultatio,'s.
A variety of expert system architectures have been constructed including rule-
based systems, frame-based systems, decision tree (discrimination network) systems,
analogical reasoning systems, blackboard architectures, theorem proving systems,
and even neural network architectures. These systems may differ in the direction
of rule chaining, in the handling of uncertainty, and in the search and pattern matching
methods employed. Rule-based and frame-based systems are by far the most popular
architectures used.
Since the introduction of the first expert systems in the late 1970s, a number
of building tools have been developed. Such tools may be as unsophisticated as a
bare high level language or as comprehensive as a complete shell development
environment. A few representative building tools have been described and some
general characteristics of tools for developers were given.
The acquisition of expert knowledge for knowledge-based systems remains
one of the main bottlenecks in building them. This has led to a new discipline
called knowledge engineering. Knowledge engineers build systems by eliciting
knowledge from experts, coding that knowledge in an appropriate form, validating the
knowledge, and ultimately constructing a system using a variety of building tools.
EXERCISES
15.1. What are the main advantages in keeping the knowledge base separate from the
control module in knowledge-based systems?
15.2. Why is it important that an expert system be able to explain the why and how
questions related to a problem solving session?
15.3. Give an example of the use of metaknowledge in expert systems inference.
15.4. Describe and compare the different types of problems solved by four of the earliest
expert systems DENDRAL, MYCIN, PROSPECTOR, and R1.
15.5. Identify and describe two good application areas for expert systems within a university
environment.
15.6. How do rules in PROLOG differ from general production system rules?
15.7. Make up a small knowledge-base of facts and rules using the same syntax as that
used in Figure 15.2 except that they should relate to an office working environment.
15.8. Name four different types of selection criteria that might be used to select the most
relevant rules for firing in a production system.
15.9. Describe a method in which rules could be grouped or organized in a knowledge
base to reduce the amount of search required during the matching part of the inference
cycle.
15.10. Using the knowledge base of Problem 15.7, simulate three match-select-execute cycles
for a query which uses several rules and/or facts.
15.11. Explain the difference between forward and backward chaining and under what condi-
tions each would be best to use for a given set of problems.
15.12. Under what conditions would it make sense to use both forward and backward chaining?
Give an example where both are used.
15.13. Explain why you think associative networks were never very popular forms of knowl-
edge representations in expert systems architectures.
15.14. Suppose you are diagnosing automobile engines using a system having a frame type
of architecture similar to PIP. Show how a trigger condition might be satisfied for
the distributor ignition system when it is learned that the spark at all spark plugs is
weak.
15.15. Give the advantages of expert system architectures based on decision trees over
those of production rules. What are the main disadvantages?
15.16. Two of the main problems in validating the knowledge contained in the knowledge
bases of expert systems are related to completeness and consistency, that is, whether
or not a system has an adequate breadth of knowledge to solve the class of problems
it was intended to solve and whether or not the knowledge is consistent. Is it easier
to check decision tree architectures or production rule systems for completeness and
consistency? Give supporting information for your conclusions.
15.17. Give three examples of applications for which blackboard architectures are well suited.
15.18. Give three examples of applications for which the use of analogical architectures
would be suitable in expert systems.
15.19. Consider a simple fully connected neural network containing three input nodes and
a single output node. The inputs to the network are the eight possible binary patterns
000, 001, ..., 111. Find weights w_i for which the network can differentiate between
the inputs by producing three distinct outputs.
15.20. For the preceding problem, draw projection vectors on the unit circle for the eight
different inputs using the weights determined there.
15.21. Explain how uncertainty is propagated through a chain of rules during a consultation
with an expert system which is based on the MYCIN architecture.
15.22. Select a problem domain that requires some special expertise and consult with an
expert in the domain to learn how he or she solves typical problems. After collecting
enough knowledge to solve a small subset of problems, create rules which could be
used in a knowledge base to solve the problems. Test the use of the rules on a few
problems which have been suggested by the expert and then get his or her confirmation.
15.23. Relate each of the heuristics given by Penny Nii in Section 15.5 to a real expert
system solving problem.
15.24. Discuss how each of the features of expert system building tools given in Section
15.6 can affect the performance of the systems developed.
15.25. Obtain a copy of an expert system building tool such as Personal Consultant Plus
and create an expert system to diagnose automobile engine problems. Consult with
a mechanic to see if your completed system is reasonably good.
PART 5
Knowledge Acquisition
16
General Concepts in Knowledge Acquisition
The success of knowledge-based systems lies in the quality and extent of the knowledge
available to the system. Acquiring and validating a large corpus of consistent, corre-
lated knowledge is not a trivial problem. This has given the acquisition process an
especially important role in the design and implementation of these systems. Conse-
quently, effective acquisition methods have become one of the principal challenges
for the AI research community.
16.1 INTRODUCTION
The goals in this branch of AI are the discovery and development of efficient,
cost-effective methods of acquisition. Some important progress has recently been made
in this area with the development of sophisticated editors and some impressive
machine learning programs. But much work still remains before truly general purpose
acquisition is possible. In this chapter, we consider general concepts related to
acquisition and learning. We begin with a taxonomy of learning based on definitions
of behavioral learning types, assess the difficulty in collecting and assimilating large
quantities of well correlated knowledge, describe a general model for learning, and
examine different performance measures related to the learning process.
Definitions
We all learn new knowledge through different methods, depending on the type of
material to be learned, the amount of relevant knowledge we already possess, and
the environment in which the learning takes place. It should not come as a surprise
to learn that many of these same types of learning methods have been extensively
studied in AI.
In what follows, it will be helpful to adopt a classification or taxonomy of
learning types to serve as a guide in studying or comparing differences among
them. One can develop learning taxonomies based on the type of knowledge represen-
tation used (predicate calculus, rules, frames), the type of knowledge learned (con-
cepts, game playing, problem solving), or by the area of application (medical diagno-
sis, scheduling, prediction, and so on). The classification we will use, however, is
intuitively more appealing and one which has become popular among machine learning
researchers. The classification is independent of the knowledge domain and the
representation scheme used. It is based on the type of inference strategy employed
or the methods used in the learning process.
The five different learning methods under this taxonomy are
The third type listed, analogical learning, is the process of learning a new
concept or solution through the use of similar known concepts or solutions. We
use this type of learning when solving problems on an exam where previously
learned examples serve as a guide or when we learn to drive a truck using our
knowledge of car driving. We make frequent use of analogical learning. This form
of learning requires still more inferring than either of the previous forms, since
difficult transformations must be made between the known and unknown situations.
The fourth type of learning is also one that is used frequently by humans. It
is a powerful form of learning which, like analogical learning, also requires more
inferring than the first two methods. This form of learning requires the use of
inductive inference, a form of invalid but useful inference. We use inductive learning
when we formulate a general concept after seeing a number of instances or examples
of the concept. For example, we learn the concepts of color or sweet taste after
experiencing the sensations associated with several examples of colored objects or
sweet foods.
The final type of acquisition is deductive learning. It is accomplished through
a sequence of deductive inference steps using known facts. From the known facts,
new facts or relationships are logically derived. For example, we could learn deduc-
tively that Sue is the cousin of Bill, if we have knowledge of Sue and Bill's parents
and rules for the cousin relationship. Deductive learning usually requires more infer-
ence than the other methods. The inference method used is, of course, a deductive
type, which is a valid form of inference.
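The cousin example can be sketched as a small deduction over stored facts. The
parent names below are hypothetical, and the rule used (two people are cousins
if a parent of one is a sibling of a parent of the other) is one reasonable
reading of "rules for the cousin relationship," not a definition taken from
the text.

```python
# Hypothetical known facts: each person maps to the set of his or her parents.
PARENTS = {
    "Sue": {"Ann", "Carl"},
    "Bill": {"Tom", "Dora"},
    "Ann": {"Grace", "Henry"},
    "Tom": {"Grace", "Henry"},
}


def siblings(a, b):
    """Two different people are siblings if they share at least one parent."""
    return a != b and bool(PARENTS.get(a, set()) & PARENTS.get(b, set()))


def cousins(a, b):
    """Rule: a and b are cousins if a parent of a is a sibling of a parent of b."""
    return any(
        siblings(pa, pb)
        for pa in PARENTS.get(a, set())
        for pb in PARENTS.get(b, set())
    )


# The new fact is derived from the stored facts; it was never stored itself.
print(cousins("Sue", "Bill"))  # True
```

Nothing new is observed here; the conclusion follows from facts already held, which is what makes the inference valid.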
In addition to the above classification, we will sometimes refer to learning
methods as either weak methods or knowledge-rich methods. Weak methods are
general purpose methods in which little or no initial knowledge is available. These
methods are more mechanical than the classical AI knowledge-rich methods. They
often rely on a form of heuristic search in the learning process. Examples of some
weak learning methods are given in the next chapter under the names of Learning
Automata and Genetic Algorithms. We will be studying these and many of the
more knowledge-rich forms of learning in more detail later, particularly various
types of inductive learning.
One of the important lessons learned by AI researchers during the 1970s and early
1980s is that knowledge is not easily acquired and maintained. It is a difficult and
time-consuming process. Yet expert and other knowledge-based systems require an
abundant amount of well correlated knowledge to achieve a satisfactory level of
intelligent performance. Typically, tens of person years are required to build
up a knowledge base to an acceptable level of performance. This was certainly
true for the early expert systems such as MYCIN, DENDRAL, PROSPECTOR,
and XCON. The acquisition effort encountered in building these systems provided
the impetus for researchers to search for new efficient methods of acquisition. It
helped to revitalize interest in general machine learning techniques.
Early expert systems initially had a knowledge base consisting of a few hundred
rules. This is equivalent to less than 10^6 bits of knowledge. In contrast, the capacity
of a mature human brain has been estimated at some 10^13 bits of knowledge (Sagan,
1977). If we expect to build expert systems that are highly competent and possess
knowledge in more than a single narrow domain, the amount of knowledge required
for such knowledge bases will be somewhere between these two extremes, perhaps
as much as 10^10 bits.
If we were able to build such systems at even ten times the rate these early
systems were built, it would still require on the order of 10^4 person years. This
estimate is based on the assumption that the time required is directly proportional
to the size of the knowledge base, a simplified assumption, since the complexity
of the knowledge and the interdependencies grow more rapidly with the size of the
knowledge base.
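The estimate can be checked with simple arithmetic. The sketch below assumes
the figures as read here: about 10^6 bits and roughly ten person years for an
early system, a target of 10^10 bits, and a tenfold faster building rate. All
four numbers are assumptions of this illustration, since the text gives only
orders of magnitude.

```python
early_bits = 10**6          # assumed size of an early knowledge base
early_person_years = 10     # "tens of person years", taken here as 10
target_bits = 10**10        # assumed size of a broad, multi-domain knowledge base
speedup = 10                # building ten times faster than the early systems

# Linear scaling: effort is taken as directly proportional to knowledge base size.
person_years = early_person_years * (target_bits // early_bits) // speedup
print(person_years)  # 10000
```

Because complexity and interdependencies grow faster than linearly, a figure obtained this way is better read as a lower bound than as a forecast.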
Clearly, this rate of acquisition is not acceptable. We must develop better
acquisition and learning methods before we can implement such systems within a
realistic time frame. Even with the progress made in the past few years through
the development of special editors and related tools, more significant breakthroughs
are needed before truly large knowledge bases can be assembled and maintained.
Because of this, we expect the research interest in knowledge acquisition and machine
learning to continue to grow at an accelerated rate for some years in the future.
It has been stated before that a system's performance is strongly dependent
on the level and quality of its knowledge, and that "in knowledge lies power." If
we accept this adage, we must also agree that the acquisition of knowledge is of
paramount importance and, in fact, that "the real power lies in the ability to acquire
new knowledge efficiently." To build a machine that can learn and continue to
improve its performance has been a long-time dream of mankind. The fulfillment
of that dream now seems closer than ever before with the modest successes achieved
by AI researchers over the past twenty years.
We will consider the complexity problem noted above again from a different
point of view when we study the different learning paradigms.
16.4 GENERAL LEARNING MODEL
Figure 16.1 General learning model. (Components shown: environment or teacher,
learner component, knowledge base, performance component, and critic or performance
evaluator. Flows shown: stimuli or examples, feedback, response, and tasks.)
organized training source such as a teacher which provides carefully selected training
examples for the learner component. The actual form of environment used will
depend on the particular learning paradigm. In any case, some representation language
must be assumed for communication between the environment and the learner. The
language may be the same representation scheme as that used in the knowledge
base (such as a form of predicate calculus). When they are chosen to be the same,
we say the single representation trick is being used. This usually results in a simpler
implementation since it is not necessary to transform between two or more different
representations.
For some systems the environment may be a user working at a keyboard.
Other systems will use program modules to simulate a particular environment. In
even more realistic cases, the system will have real physical sensors which interface
with some world environment.
Inputs to the learner component may be physical stimuli of some type or
descriptive, symbolic training examples. The information conveyed to the learner
component is used to create and modify knowledge structures in the knowledge
base. This same knowledge is used by the performance component to carry out
some tasks, such as solving a problem, playing a game, or classifying instances of
some concept.
When given a task, the performance component produces a response describing
its actions in performing the task. The critic module then evaluates this response
relative to an optimal response.
Feedback, indicating whether or not the performance was acceptable, is then
sent by the critic module to the learner component for its subsequent use in modifying
the structures in the knowledge base. If proper learning was accomplished, the
system's performance will have improved with the changes made to the knowledge
base.
The cycle described above may be repeated a number of times until the perfor-
mance of the system has reached some acceptable level, until a known learning
goal has been reached, or until changes cease to occur in the knowledge base after
some chosen number of training examples have been observed.
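The cycle of performance, criticism, and learning can be sketched in a few
lines. This is only an illustration of the control flow: the knowledge base is
a plain dictionary, the learner simply memorizes corrected answers, and the
"is a bird" training examples are hypothetical.

```python
def performance_component(knowledge_base, task):
    """Respond to a task using the current knowledge base (default answer: 'no')."""
    return knowledge_base.get(task, "no")


def critic(response, optimal_response):
    """Feedback: True if the response matches the optimal response."""
    return response == optimal_response


def learner(knowledge_base, task, optimal_response):
    """Modify the knowledge base after unfavorable feedback."""
    knowledge_base[task] = optimal_response


def train(environment, max_cycles=10):
    """Repeat the cycle until a full pass leaves the knowledge base unchanged."""
    knowledge_base = {}
    for cycle in range(1, max_cycles + 1):
        changed = False
        for task, optimal in environment:
            response = performance_component(knowledge_base, task)
            if not critic(response, optimal):
                learner(knowledge_base, task, optimal)
                changed = True
        if not changed:
            return knowledge_base, cycle
    return knowledge_base, max_cycles


# Hypothetical teacher supplying examples of the concept "is a bird".
teacher = [("robin", "yes"), ("trout", "no"), ("eagle", "yes")]
kb, cycles = train(teacher)
print(kb, cycles)  # {'robin': 'yes', 'eagle': 'yes'} 2
```

The stopping rule used here corresponds to the third condition above: training ends when a pass over the examples produces no further change.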
There are several important factors which influence a system's ability to learn
in addition to the form of representation used. They include the types of training
provided, the form and extent of any initial background knowledge, the type of
feedback provided, and the learning algorithms used (Figure 16.2).
The type of training used in a system can have a strong effect on performance,
much the same as it does for humans. Training may consist of randomly selected
instances or examples that have been carefully selected and ordered for presentation.
The instances may be positive examples of some concept or task being learned,
they may be negative, or they may be a mixture of both positive and negative.
The instances may be well focused using only relevant information, or they may
contain a variety of facts and details including irrelevant data.
Many forms of learning can be characterized as a search through a space of
possible hypotheses or solutions (Mitchell, 1982). To make learning more efficient,
it is necessary to constrain this search process or reduce the search space. One
method of achieving this is through the use of background knowledge which can
be used to constrain the search space or exercise control operations which limit the
search process. We will see several examples of this in the next three chapters.
Feedback is essential to the learner component since otherwise it would never
know if the knowledge structures in the knowledge base were improving or if they
were adequate for the performance of the given tasks. The feedback may be a
simple yes or no type of evaluation, or it may contain more useful information
describing why a particular action was good or bad. Also, the feedback may be
completely reliable, providing an accurate assessment of the performance, or it may
contain noise; that is, the feedback may actually be incorrect some of the time.
Intuitively, the feedback must be accurate more than 50% of the time; otherwise
the system would never learn. If the feedback is always reliable and carries useful
information, the learner should be able to build up a useful corpus of knowledge
quickly. On the other hand, if the feedback is noisy or unreliable, the learning
process may be very slow and the resultant knowledge incorrect.
Finally, the learning algorithms themselves determine to a large extent how
successful a learning system will be. The algorithms control the search to find and
build the knowledge structures. We then expect that the algorithms that extract
much of the useful information from training examples and take advantage of any
background knowledge outperform those that do not. In the following chapters we
will see examples of systems which illustrate many of the above points regarding
the effects of different factors on performance.
Figure 16.2 Factors affecting learning performance. (Factors shown: background
knowledge, feedback, learning algorithms, training scenario, and representation
scheme.)
16.6 SUMMARY
EXERCISES
17.1 INTRODUCTION
Attempts to develop autonomous learning systems began in the 1950s while cybernet-
ics was still an active area of research. These early designs were self-adapting
systems which modified their own structures in an attempt to produce an optimal
response to some input stimuli. Although several different approaches were pursued
during this period, we will consider only four of the more representative designs.
One approach was believed to be an approximate model of a small network of
neurons.
A second approach was initially based on a form of rote learning. It was
later modified to learn by adaptive parameter adjustment. The third approach used
self-adapting stochastic automata models, while the fourth approach was modeled
after survival of the fittest through population genetics.
In the remainder of this chapter, we will examine examples of the four methods
and reserve our descriptions of more classical AI learning approaches for Chapters
18, 19, and 20. The systems we use as examples here are famous ones that have
received much attention in the literature. They are Rosenblatt's perceptrons, Samuel's
checkers playing system, Learning Automata, and Genetic Algorithms.
Early Work in Machine Learning Chap. 17
17.2 PERCEPTRONS
Perceptrons are pattern recognition or classification devices that are crude
approximations of neural networks. They make decisions about patterns by summing up evidence
obtained from many small sources. They can be taught to recognize one or more
classes of objects through the use of stimuli in the form of labeled training examples.
A simplified perceptron system is illustrated in Figure 17.1. The inputs to
the system are through an array of sensors such as a rectangular grid of light sensitive
pixels. These sensors are randomly connected in groups to associative threshold
units (ATUs) where the sensor outputs are combined and added together. If the
combined outputs to an ATU exceed some fixed threshold, the ATU executes
and produces a binary output.
The outputs from the ATUs are each multiplied by adjustable parameters or
weights w_i (i = 1, 2, ..., k) and the results added together in a terminal comparator
unit. If the input to the comparator exceeds a given threshold level T, the perceptron
produces a positive response of 1 (yes) corresponding to a sample classification of
class-1. Otherwise, the output is 0 (no) corresponding to an object classification of
non-class-1.
All components of the system are fixed except the weights w_i, which are adjusted
through a punishment-reward process described below. This learning process continues
until optimal values of w are found, at which time the system will have learned
the proper classification of objects for the two different classes.
Figure 17.1 (simplified perceptron system: sensor array, associative units, yes/no output)
Figure 17.2 (two classes of objects, plotted as 0 and +, separated by a line)
The light sensors produce an output voltage that is proportional to the light
intensity striking them. This output is a measure of what is known in pattern recognition
as the object representation space parameters. The outputs from the ATUs which
combine several of the sensor outputs (x_i) are known as feature value measurements.
These feature values are each multiplied by the weights w_i and the results summed
in the comparator to give the vector product $r = \sum_{i=1}^{k} x_i w_i$. When enough
of the feature values are present and the weight vector is near optimal, the threshold
will be exceeded and a positive classification will result.
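The comparator's decision can be written out directly. In this minimal sketch
the three feature values, the weights, and the threshold are made-up numbers
chosen only to show one positive and one negative response.

```python
def perceptron_output(x, w, threshold):
    """Return 1 (class-1) if the weighted sum of feature values exceeds the threshold."""
    r = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if r > threshold else 0


# Hypothetical binary outputs from three ATUs and hypothetical weights.
weights = [0.6, 0.2, 0.5]
print(perceptron_output([1, 0, 1], weights, threshold=1.0))  # 1
print(perceptron_output([0, 1, 0], weights, threshold=1.0))  # 0
```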
Finding the optimal weight vector value w is equivalent to finding a separating
hyperplane in k-dimensional space. If there is some linear function of the x_i for
which objects in class-1 produce an output greater than T and non-class-1 objects
produce an output less than T, the space of objects is said to be linearly separable
(see Chapter 13 for example). Spaces which are linearly separable can be partitioned
into two or more disjoint regions which divide objects based on their feature
vector values as illustrated in Figure 17.2. It has been shown (Minsky and Papert,
1969) that an optimal w can always be found with a finite number of training
examples if the space is linearly separable.
One of the simplest algorithms for finding an optimum w (w*) is based on the
following perceptron learning algorithm. Given training objects from two distinct
classes, class-1 and class-2
$$w_{n+1} = w_n + d\,x$$
where
It should be recognized that the above learning algorithm is just a method for
finding a linear two-class decision function which separates the feature space into
two regions. For a generalized perceptron, we could just as well have found multiclass
decision functions which separate the feature space into c regions. This could be
done by terminating each of the ATU outputs at c > 2 comparator units. In this
case, class j would be selected as the object type whenever the response at the jth
comparator was greater than the response at all other c - 1 comparators.
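The two-class weight adjustment w + d * x can be sketched as follows. The
choice of d used here is an assumption, since the definition is not reproduced
above: d = +1 when a class-1 object is wrongly rejected, d = -1 when a class-2
object is wrongly accepted, and d = 0 when the response is correct. The
threshold is folded into the weights by giving every object a constant first
feature of 1, and the four training objects are hypothetical.

```python
def train_perceptron(samples, max_epochs=100):
    """samples: list of (x, cls) with cls 1 or 2; each x starts with a constant 1."""
    w = [0] * len(samples[0][0])
    for _ in range(max_epochs):
        updated = False
        for x, cls in samples:
            r = sum(wi * xi for wi, xi in zip(w, x))
            if cls == 1 and r <= 0:
                d = 1       # class-1 object rejected: reward its features
            elif cls == 2 and r > 0:
                d = -1      # class-2 object accepted: punish its features
            else:
                d = 0       # correct response: leave the weights alone
            if d:
                w = [wi + d * xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:     # a full pass with no change: learning is complete
            break
    return w


# Hypothetical, linearly separable training objects (bias feature first).
samples = [((1, 2, 2), 1), ((1, 3, 1), 1), ((1, 0, 0), 2), ((1, 1, 0), 2)]
w = train_perceptron(samples)
print(w)  # [-1, 1, 2]
```

On data that are not linearly separable the loop would simply stop at max_epochs without settling, which is consistent with the convergence result quoted above holding only for separable spaces.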
Perceptrons were studied intensely at first but later were found to have severe
limitations. Therefore, active research in this area faded during the late 1960s.
However, the findings related to this work later proved to be most valuable, especially
in the area of pattern recognition. More recently, there has been renewed interest
in similar architectures which attempt to model neural networks (Chapter 15). This
is partly due to a better understanding of the brain and significant advances realized
in network dynamics as well as in hardware over the past ten years. These advances
have made it possible to more closely model large networks of neurons.
17.3 CHECKERS PLAYING EXAMPLE
During the 1950s and 1960s Samuel (1959, 1967) developed a program which
could learn to play checkers at a master's level. This system remembered thousands
of board states and their estimated values. These stored values provided the means
to determine the best move to make at any point in the game.
Samuel's system learns while playing the game of checkers, either with a
human opponent or with a copy of itself. At each state of the game, the program
checks to see if it has remembered a best-move value for that state. If not, the
program explores ahead three moves (it determines all of its possible moves; for
each of these, it finds all of its opponent's moves; and for each of those, it determines
all of its next possible moves). The program then computes an advantage or win-
value estimate of all the ending board states. These values determine the best move
for the system from the current state. The current board state and its corresponding
value are stored using an indexed address scheme for subsequent recall.
The best move for each state is the move whose value is the largest of the minimums,
based on the theory of two-person zero-sum games. This move will always be the
best choice (for the next three moves) against an intelligent adversary.
As an example of the look-ahead process, a simple two-move sequence is
illustrated in Figure 17.3 in the form of a tree. At board state K, the program
looks ahead two moves and computes the value of each possible resultant board
state. It then works backward by first finding the minimum board values at state K
+ 2 in each group of moves made from state K + 1 (minimums = 4, 3, and 2).
These minimums correspond to the moves the opponent would make from
each position when at state K + 1. The program then chooses the maximum of
these minimums as the best (minimax) move it can make from the present board
state K (maximum = 4). By looking ahead three moves, the system can be assured
it can do no worse than this minimax value. The board state and the corresponding
minimax value for a three-move-ahead sequence are stored in Samuel's system.
These values are then available for subsequent use when the same state is encountered
during a new game.
Figure 17.3 A two-move look-ahead sequence. (Tree over states K, K + 1, and K + 2
with group minimums 4, 3, and 2.)
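The two-move minimax choice can be sketched as below. The leaf values are
hypothetical; they were picked only so that the group minimums come out as 4,
3, and 2, matching the numbers quoted in the text.

```python
def minimax_choice(tree):
    """tree[i] holds the board values at K + 2 reachable after our i-th move from K.

    Returns (index of best move, its minimax value, list of group minimums).
    """
    # The opponent is assumed to reply with the move that is worst for us.
    minimums = [min(group) for group in tree]
    # We then pick the move whose worst case is the best available.
    best = max(range(len(tree)), key=lambda i: minimums[i])
    return best, minimums[best], minimums


# Hypothetical leaf values for the three groups of replies.
tree = [[10, 7, 4], [3, 9], [2, 5, 2]]
best, value, minimums = minimax_choice(tree)
print(best, value, minimums)  # 0 4 [4, 3, 2]
```

Storing the returned value under the board state, as Samuel's system does, lets a later search reuse it instead of repeating the look-ahead.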
The look-ahead search process could be extended beyond three moves; however,
the combinatorial explosion that results makes this infeasible. But, when many
board states have been learned, it is likely that any given state will already have
look-ahead values for three moves, and some of those moves will in turn have
look-ahead values stored. Consequently, as more and more values are stored,
look-ahead values for six, nine, or even more moves may be prerecorded for rapid use.
Thus, when the system has played many games and recorded thousands of moves,
its ability to look ahead many moves and to show improved performance is greatly
increased.
The value of a board state is estimated by computing a linear function similar
to the perceptron linear decision function. In this case, however, Samuel selected
some 16 board features from a larger set of feature parameters. The features were
typically checkers concepts such as piece advantage, the number of kings and single
piece units, and the location of pieces. In the original system, the weighting parameters
were fixed. In subsequent experiments, however, the parameters were adjusted as
part of the learning process much like the weighting parameters were in the
perceptrons.
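A linear board evaluation of this kind can be sketched in a few lines. The three
feature names and their weights below are hypothetical placeholders; Samuel's
actual features and parameter values are not reproduced here.

```python
# Hypothetical feature weights for a linear board evaluation function.
WEIGHTS = {"piece_advantage": 4.0, "kings": 2.5, "center_control": 1.0}


def board_value(features):
    """Estimate a board's value as a weighted sum of its feature values."""
    return sum(WEIGHTS[name] * value for name, value in features.items())


v = board_value({"piece_advantage": 2, "kings": 1, "center_control": 3})
print(v)  # 13.5
```

Adjusting the entries of WEIGHTS from game results would correspond to the later experiments in which the parameters were learned rather than fixed.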
372 Early Work in Machine Learning Chap. 17
17.4 LEARNING AUTOMATA
The theory of learning automata was first introduced in 1961 (Tsetlin, 1961). Since
that time these systems have been studied intensely, both analytically and through
simulations (Lakshmivarahan, 1981). Learning automata systems are finite state
adaptive systems which interact iteratively with a general environment. Through a
probabilistic trial-and-error response process they learn to choose or adapt to a
behavior which produces the best response. They are, essentially, a form of weak,
inductive learners.
In Figure 17.4, we see that the learning model for learning automata has
been simplified to just two components, an automaton (learner) and an environment.
The learning cycle begins with an input to the learning automata system from the
environment. This input elicits one of a finite number of possible responses from
the automaton. The environment receives and evaluates the response and then provides
some form of feedback to the automaton in return. This feedback is used by the
automaton to alter its stimulus-response mapping structure to improve its behavior.
As a simple example, suppose a learning automaton is being used to learn the
best temperature control setting for your office each morning. It may select any
one of ten temperature range settings at the beginning of each day (Figure 17.5).
Without any prior knowledge of your temperature preferences, the automaton
randomly selects a first setting using the probability vector corresponding to the
temperature settings.
Since the probability values are uniformly distributed, any one of the settings
will be selected with equal likelihood. After the selected temperature has stabilized,
the environment may respond with a simple good-bad feedback response. If the
response is good, the automaton will modify its probability vector by rewarding the
probability corresponding to the good setting with a positive increment and reducing
all other probabilities proportionately to maintain the sum equal to 1. If the response
is bad, the automaton will penalize the selected setting by reducing the probability
corresponding to the bad setting and increasing all other values proportionately.
This process is repeated each day until the good selections have high probability
values and all bad choices have values near zero. Thereafter, the system will always
choose the good settings. If, at some point in the future, your temperature preferences
change, the automaton can easily readapt.
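The reward and penalty adjustments can be sketched as follows. The update
formulas are an assumed linear scheme with a hypothetical learning rate of 0.2;
the text specifies only that the chosen probability is raised or lowered and
the others are changed proportionately so that the sum stays at 1. The
environment is also hypothetical: it answers "good" only for one preferred
setting.

```python
import random


def update(probs, chosen, good, rate=0.2):
    """Return the probability vector after good/bad feedback on the chosen setting.

    Assumes probs[chosen] < 1 so a penalty can be shared among the other settings.
    """
    p = probs[chosen]
    if good:
        # Reward: shrink every other probability, give the freed mass to the choice.
        new = [q * (1 - rate) for q in probs]
        new[chosen] = p + rate * (1 - p)
    else:
        # Penalty: cut the chosen probability and share the cut proportionately.
        cut = rate * p
        new = [q + cut * q / (1 - p) for q in probs]
        new[chosen] = p - cut
    return new


def simulate(preferred, settings=10, days=100, seed=0):
    """Hypothetical environment: feedback is good only for the preferred setting."""
    rng = random.Random(seed)
    probs = [1 / settings] * settings          # uniform starting vector
    for _ in range(days):
        chosen = rng.choices(range(settings), weights=probs)[0]
        probs = update(probs, chosen, good=(chosen == preferred))
    return probs


probs = simulate(preferred=4)
# Index of the setting the automaton now favors most.
print(max(range(len(probs)), key=probs.__getitem__))
```

Because the vector never becomes exactly zero for any setting in this scheme, a later change in preferences can still be discovered, which is the readaptation property noted above.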
Figure 17.4 Learning automaton model.
Figure 17.5 (temperature range settings from 50 to 100 in steps of 5, with control selection)
Learning automata have been generalized and studied in various ways. One
such generalization has been given the special name of collective learning automata
(CLA). CLAs are standard learning automata systems except that feedback is not
provided to the automaton after each response. In this case, several collective
stimulus-response actions occur before feedback is passed to the automaton. It has been
argued (Bock, 1976) that this type of learning more closely resembles that of human
beings in that we usually perform a number or group of primitive actions before
receiving feedback on the performance of such actions, such as solving a complete
problem on a test or parking a car. We illustrate the operation of CLAs with an
example of learning to play the game of Nim in an optimal way.
Nim is a two-person zero-sum game in which the players alternate in removing
tokens from an array which initially has nine tokens. The tokens are arranged into
three rows with one token in the first row, three in the second row, and five in the
third row (Figure 17.6).
The first player must remove at least one token but not more than all the
tokens in any single row. Tokens can only be removed from a single row during
each player's move. The second player responds by removing one or more tokens
remaining in any row. Players alternate in this way until all tokens have been
removed; the loser is the player forced to remove the last token.
We will use the triple $(n_1, n_2, n_3)$ to represent the state of the game at a given
time, where $n_1$, $n_2$, and $n_3$ are the numbers of tokens in rows 1, 2, and 3, respectively.
We will also use a matrix to determine the moves made by the CLA for any given
state. The matrix of Figure 17.7 has column headings which correspond to the
state of the game when it is the CLA's turn to move, and row headings which
correspond to the new game state after the CLA has completed a move. Fractional
entries in the matrix are transition probabilities used by the CLA to execute each
of its moves. Asterisks in the matrix represent invalid moves.
Beginning with the initial state (1,3,5), suppose the CLA's opponent removes
two tokens from the third row resulting in the new state (1,3,3). If the CLA then
removes all three tokens from the second row, the resultant state is (1,0,3). Suppose
the opponent now removes all remaining tokens from the third row. This leaves
the CLA with a losing configuration of (1,0,0).
At the start of the learning sequence, the matrix is initialized such that the
elements in each column are equal (uniform) probability values. For example, since
there are eight valid moves from the state (1,3,4), each column element under this
state corresponding to a valid move has been given an initial value of 1/8. In a
similar manner all other columns have been given uniform probability values corresponding
to all valid moves for the given column state.
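The uniform initialization of a column can be checked with a short sketch. The helper names `valid_moves` and `uniform_column` are illustrative assumptions; the rule they encode (remove one or more tokens from a single row) is the one stated earlier in the text.

```python
from fractions import Fraction

def valid_moves(state):
    """Return every state reachable by removing 1..n tokens from one row."""
    moves = []
    for row, count in enumerate(state):
        for take in range(1, count + 1):
            nxt = list(state)
            nxt[row] = count - take
            moves.append(tuple(nxt))
    return moves

def uniform_column(state):
    """Map each valid successor state to the same initial probability."""
    moves = valid_moves(state)
    return {move: Fraction(1, len(moves)) for move in moves}

column = uniform_column((1, 3, 4))   # eight valid moves from this state
```

Using `Fraction` keeps the initial values exact, which makes it easy to confirm that each column sums to 1.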
The CLA selects moves probabilistically using the probability values in each
column. So, for example, if the CLA had the first move, any row intersecting
with the first column not containing an asterisk would be chosen with probability
1/9. This choice then determines the new game state from which the opponent must
select a move. The opponent might have a similar matrix to record game states
and choose moves. A complete game is played before the CLA is given any feedback,
at which time it is informed whether its responses were good or bad. This
is the collective feature of the CLA.
If the CLA wins a game, all moves made by the CLA during that game are
rewarded by increasing the probability value in each column corresponding to the
winning move. All nonwinning probabilities in those columns are reduced equally
to keep the sum in each column equal to 1. If the CLA loses a game, the moves
leading to that loss are penalized by reducing the probability values corresponding
to each losing move. All other probabilities in the columns having a losing move
are increased equally to keep the column totals equal to 1.
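A minimal sketch of this end-of-game update follows. The step size, the clamping that keeps every entry in [0, 1], and the dictionary layout of the matrix are assumptions made for illustration; the text does not specify them.

```python
def update_column(column, chosen, won, step=0.05):
    """Return a new column with `chosen` rewarded (won) or penalized (lost).

    The other valid moves are shifted by equal amounts so the column still
    sums to 1. The step is clamped so no entry leaves the range [0, 1].
    """
    others = [m for m in column if m != chosen]
    if not others:                      # a forced move: nothing to adjust
        return dict(column)
    if won:
        delta = min(step, len(others) * min(column[m] for m in others))
    else:
        delta = -min(step, column[chosen])
    share = delta / len(others)
    updated = {m: column[m] - share for m in others}
    updated[chosen] = column[chosen] + delta
    return updated

def apply_feedback(matrix, history, won, step=0.05):
    """Apply the single end-of-game signal to every (state, move) the CLA played."""
    for state, move in history:
        matrix[state] = update_column(matrix[state], move, won, step)
    return matrix

# One column for state (1,0,3); the CLA moved to (1,0,0) and won the game.
matrix = {(1, 0, 3): {(0, 0, 3): 0.25, (1, 0, 2): 0.25,
                      (1, 0, 1): 0.25, (1, 0, 0): 0.25}}
matrix = apply_feedback(matrix, [((1, 0, 3), (1, 0, 0))], won=True, step=0.06)
```

Because the adjustment is spread equally rather than proportionately, the clamp matters: without it a small entry could be pushed below zero.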
After a number of games have been played by the CLA, the matrix elements
which correspond to repeated wins will increase toward one, while all other elements
in the column will decrease toward zero. Consequently, the CLA will choose the
winning moves more frequently and thereby improve its performance.
[Figure 17.7: CLA transition matrix for Nim; column headings give the current state.]
Simulated games between a CLA and various types of opponents have been
performed and the results plotted (Bock, 1985). It was shown, for example, that
two CLAs playing against each other required about 300 games before each learned
to play optimally. Note, however, that convergence to optimality can be accomplished
with fewer games if the opponent always plays optimally (or poorly), since, in
such a case, the CLA will repeatedly lose (win) and quickly reduce (increase) the
losing (winning) elements to zero (one). It is also possible to speed up the
learning process through the use of other techniques such as learned heuristics.
Learning systems based on the learning automaton or CLA paradigm are fairly
general for applications in which a suitable state representation scheme can be found.
They are also quite robust learners. In fact, it has been shown that an LA will
converge to an optimal distribution under fairly general conditions if the feedback
is accurate with probability greater than 0.5 (Narendra and Thathachar, 1974). Of
course, the rate of convergence is strongly dependent on the reliability of the feedback.
Learning automata are not very efficient learners as was noted in the game
playing example above. They are, however, relatively easy to implement, provided
the number of states is not too large. When the number of states becomes large,
the amount of storage and the computation required to update the transition matrix
become excessive.
Potential applications for learning automata include adaptive telephone routing
and control. Such applications have been studied using simulation programs (Narendra
et al., 1977). Although they have been given favorable recommendations, few if
any actual systems have been implemented.
Genetic algorithm learning methods are based on models of natural adaptation and
evolution. These learning systems improve their performance through processes which
model population genetics and survival of the fittest. They have been studied since
the early 1960s (Holland, 1962, 1975).
In the field of genetics, a population is subjected to an environment which
places demands on the members. The members which adapt well are selected for
mating and reproduction. The offspring of these better performers inherit genetic
traits from both their parents. Members of this second generation of offspring which
also adapt well are then selected for mating and reproduction and the evolutionary
cycle continues. Poor performers die off without leaving offspring. Good performers
produce good offspring and they, in turn, perform well. After some number of
generations, the resultant population will have adapted optimally or at least very
well to the environment.
Genetic algorithm systems start with a fixed size population of data structures
376 Early Work in Machine Learning Chap. 17
which are used to perform some given tasks. After requiring the structures to execute
the specified tasks some number of times, the structures are rated on their performance,
and a new generation of data structures is then created. The new generation is
created by mating the higher performing structures to produce offspring. These
offspring and their parents are then retained for the next generation while the poorer
performing structures are discarded. The basic cycle is illustrated in Figure 17.8.
Mutations are also performed on the best performing structures to insure that
the full space of possible structures is reachable. This process is repeated for a
number of generations until the resultant population consists of only the highest
performing structures.
Data structures which make up the population can represent rules or any other
suitable types of knowledge structure. To illustrate the genetic aspects of the problem,
assume for simplicity that the population of structures are fixed-length binary strings
such as the eight bit string 11010001. An initial population of these eight-bit strings
would be generated randomly or with the use of heuristics at time zero. These
strings, which might be simple condition and action rules, would then be assigned
some tasks to perform (like predicting the weather based on certain physical and
geographic conditions or diagnosing a fault in a piece of equipment).
After multiple attempts at executing the tasks, each of the participating structures
would be rated and tagged with a utility value u commensurate with its performance.
The next population would then be generated using the higher performing structures
as parents and the process would be repeated with the newly produced generation.
After many generations the remaining population structures should perform the desired
tasks well.
Mating between two strings is accomplished with the crossover operation which
randomly selects a bit position in the eight-bit string and concatenates the head of
one parent to the tail of the second parent to produce the offspring. Suppose the
two parents are designated as xxxxxxxx and yyyyyyyy respectively, and suppose
the third bit position has been selected as the crossover point (at the position of
the colon in the structure xxx:xxxxx). After the crossover operation is applied, two
offspring are then generated, namely xxxyyyyy and yyyxxxxx. Such offspring and
their parents are then used to make up the next generation of structures.
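The crossover operation on fixed-length strings can be sketched in a few lines. The function name and the convention that `point` counts the bits kept from the head are assumptions for illustration.

```python
def crossover(parent_a, parent_b, point):
    """Swap tails at `point`: head of one parent joined to the tail of the other."""
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2

# Crossover after the third bit position, as in the example above.
children = crossover("xxxxxxxx", "yyyyyyyy", 3)
```

In a running system the crossover point would be drawn at random for each mating rather than passed in explicitly.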
A second genetic operation often used is called inversion. Inversion is a transformation
applied to a single string. A bit position is selected at random, and when
applied to a structure, the inversion operation concatenates the tail of the string to the
head of the same string. Thus, if the sixth position were selected ($x_1x_2x_3x_4x_5x_6 : x_7x_8$),
the inverted string would be $x_7x_8x_1x_2x_3x_4x_5x_6$.
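Inversion as defined here amounts to moving the tail of the string in front of its head. A minimal sketch, with the function name an assumption:

```python
def inversion(bits, point):
    """Concatenate the tail (after `point`) to the head of the same string."""
    return bits[point:] + bits[:point]

# Selecting the sixth position of an eight-symbol string.
inverted = inversion("12345678", 6)
```

The digits 1 through 8 stand in for the symbols $x_1$ through $x_8$ so the reordering is easy to read off.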
A third operator, mutation, is used to insure that all locations of the rule
space are reachable, that every potential rule in the rule space is available for evalua-
tion. This insures that the selection process does not get caught in a local minimum.
For example, it may happen that use of the crossover and inversion operators will
only produce a set of structures that are better than all local neighbors but not
optimal in a global sense. This can happen since crossover and inversion may not
be able to produce some undiscovered structures. The mutation operator can overcome
this by simply selecting any bit position in a string at random and changing it.
This operator is typically used only infrequently to prevent random wandering in
the search space.
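Mutation on a binary string can be sketched as a single random bit flip. The function name and the optional `rng` parameter (used only to make runs repeatable) are assumptions for illustration.

```python
import random

def mutate(bits, rng=random):
    """Flip one randomly chosen bit of a binary string."""
    pos = rng.randrange(len(bits))
    flipped = "1" if bits[pos] == "0" else "0"
    return bits[:pos] + flipped + bits[pos + 1:]

child = mutate("11010001", random.Random(1))
```

Since the operator is meant to be applied only infrequently, a caller would typically guard it with a small probability rather than mutate every offspring.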
The genetic paradigm is best understood through an example. To illustrate
similarities between the learning automaton paradigm and the genetic paradigm we
use the same learning task of the previous section, namely learning to play the
game of Nim optimally. We use a slightly different representation scheme here
since we want a population of structures that are easily transformed. To do this,
we let each member of the population consist of a pair of triples augmented with
a utility value u, $((n_1, n_2, n_3)\ (m_1, m_2, m_3)\ u)$, where the first triple is the game state
presented to the genetic algorithm system prior to its move, and the second triple
is the state after the move. The u values represent the worth or current utility of
the structure at any given time.
Before the game begins, the genetic system randomly generates an initial
population of K triple-pair members. The population size K is one of the important
parameters that must be selected. Here, we simply assume it is about 25 or 30,
which should be more than the number of moves needed for any optimal play. All
members are assigned an initial utility value of 0. The learning process then proceeds
as follows.
and stores the triple pair, the input, and newly generated triple as an addition
to the population.
3. The above two steps are repeated until the game is terminated, in which case
the genetic system is informed whether a win or loss occurred. If the system
wins, each of the participating member moves has its utility value increased.
If the system loses, each participating member has its utility value decreased.
4. The above steps are repeated until a fixed number of games have been played.
At this time a new generation is created.
The new generation is created from the old population by first selecting a
fraction (say one half) of the members having the highest utility values. From these,
offspring are obtained by application of appropriate genetic operators.
The three operators, crossover, inversion, and mutation, randomly modify
the parent moves to give new offspring move sequences. (The best choice of genetic
operators to apply in this example is left as an exercise). Each offspring inherits a
utility value from one of the parents. Population members having low utility values
are discarded to keep the population size fixed.
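The creation of a new generation might be sketched as below. Members are written as (state before, state after, utility) tuples. Because the text leaves the choice of genetic operator as an exercise, `make_offspring` is a placeholder callable supplied by the caller, and the trivial `copy_parent` used in the example is purely hypothetical.

```python
def next_generation(population, make_offspring, keep_fraction=0.5):
    """Keep the highest-utility members, add their offspring, hold size fixed."""
    size = len(population)
    ranked = sorted(population, key=lambda member: member[2], reverse=True)
    parents = ranked[:max(1, int(size * keep_fraction))]
    offspring = [make_offspring(parent) for parent in parents]
    return (parents + offspring)[:size]

def copy_parent(parent):
    """Hypothetical operator: the offspring inherits the parent's utility unchanged."""
    before, after, utility = parent
    return (before, after, utility)

population = [((1, 3, 5), (1, 3, 3), 3), ((1, 3, 3), (1, 0, 3), -1),
              ((1, 0, 3), (1, 0, 0), 2), ((1, 3, 5), (0, 3, 5), 0)]
new_population = next_generation(population, copy_parent)
```

Truncating the combined list to the original size is what discards the low-utility members.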
This whole process is repeated until the genetic system has learned all the
optimal moves. This can be determined when the parent population ceases to change
or when the genetic system repeatedly wins.
The similarity between the learning automaton and genetic paradigms should
be apparent from this example. Both rely on random move sequences, and the
better moves are rewarded while the poorer ones are penalized.
For additional details related to the overall process of eliciting, coding, organizing, and refining
knowledge from domain experts, see Chapter 13 which examines expert system architectures and building
tools.
Figure 17.9 Acquisition using an intelligent editor.
an expert can create, modify, and delete rules without a knowledge of the internal
structure of the rules.
The editor assists the expert in building and refining a knowledge base by
recalling rules related to some specific topic, and reviewing and modifying the
rules, if necessary, to better fit the expert's meaning and intent. Through the editor,
the expert can query the expert system for conclusions when given certain facts. If
the expert is unhappy with the results, a trace can be obtained of the steps followed
in the inference process. When faulty or deficient knowledge is found, the problem
can then be corrected.
Some editors have the ability to suggest reasonable alternatives and to prompt
the expert for clarifications when required. Some editors also have the ability to
make validity checks on newly entered knowledge and detect when inconsistencies
occur. More recently, a few commercial editors have incorporated features which
permit rules to be induced directly from examples of problem solutions. These
editors have greatly simplified the acquisition process, but they still require much
effort on the part of domain experts.
17.7 SUMMARY
EXERCISES
17.1. Given a simple perceptron with a 3 x 3 input sensor array, compute six learning
cycles to show how the weights $w_i$ change during the learning process. Assign random
weights to the initial $w_i$ values.
17.2. For the game of checkers with an assumed average number of 50 possible moves per
board position, determine the difference in the total number of moves for a four
move look-ahead as compared to a three move look-ahead system.
17.3. Design a learning automaton that selects TV channels based on day of week and
time of day (three evening hours only) for some family you are familiar with.
17.4. Write a computer program to simulate the learning automaton of the previous problem.
Determine the number of training examples required for the system to converge to
the optimal values.
17.5. Describe how a learning automaton could be developed to learn how to play the
game of tic-tac-toe optimally. Is this a CLA or a simple learning automaton system?
17.6. Describe the similarities and differences between learning automata and genetic algo-
rithms. Which learner would be best at finding optimal solutions to nonlinear functions?
Give reasons to support your answer.
17.7. Explain the differences among the genetic operators inversion, crossover, and mutation.
Which operator do you think is most effective in finding the optimal population in
the least time?
17.8. Explain why some editors can be distinguished as "intelligent."
17.9. Read an article on TEIRESIAS and make a list of all the intelligent functions it
performs that differ from so-called nonintelligent editors.
18
Learning by Induction
18.1 INTRODUCTION
Consider playing the following game. I will choose some concept which will remain
fixed throughout the game. You will then be given clues to the concept in the
form of simple descriptions. After each clue, you must attempt to guess the concept
I have chosen. I will continue with the clues until you are sure you have made the
right choice.
For your first guess did you choose the concept beautiful? If so you are wrong.
382 Learning by Induction Chap. 18
The above illustrates the process we use in inductive learning, namely, inductive
inference, an invalid, but useful form of inference.
In this chapter we continue the study of machine learning, but in a more
focused manner. Here, we study a single learning method, learning by inductive
inference.
Inductive learning is the process of acquiring generalized knowledge from
examples or instances of some class. This form of learning is accomplished through
inductive inference, the process of reasoning from a part to a whole, from
instances to generalizations, or from the individual to the universal. It is a powerful
form of learning which we humans do almost effortlessly. Even though it is not a
valid form of inference, it appears to work well much of the time. Because of its
importance, we have devoted two complete chapters to the subject.
When we conclude that October weather is always pleasant in El Paso after having
observed the weather there for a few seasons, or when we claim that all swans are
white after seeing only a small number of white swans, or when we conclude that
all Scots are tough negotiators after conducting business with only a few, we are
learning by induction. Our conclusions may not always be valid, however. For
example, there are also black Australian swans and some weather records show
that October weather in El Paso was inclement. Even so, these conclusions or
rules are useful. They are correct much or most of the time, and they allow us to
adjust our behavior and formulate important decisions with little cognitive effort.
One can only marvel at our ability to formulate a general rule for a whole
class of objects, finite or not, after having observed only a few examples. How is
it we are able to make this large inductive leap and arrive at an accurate conclusion
so easily? For example, how is it that a first time traveler to France will conclude
that all French people speak French after having spoken only to one Frenchman
named Henri? At the same time, our traveler would not incorrectly conclude that
all Frenchmen are named Henri!
Examples like this emphasize the fact that inductive learning is much more
than undirected search for general hypotheses. Indeed, it should be clear that inductive
generalization and rule formulation are performed in some context and with a purpose.
They are performed to satisfy some objectives and, therefore, are guided by related
background or other world knowledge. If this were not so, our generalizations would
be on shaky ground and our class descriptions might be no more than a complete
listing of all the examples we had observed.
The inductive process can be described symbolically through the use of predicates
P and Q. If we observe the repeated occurrence of events $P(a_1), P(a_2), \ldots, P(a_k)$,
we generalize by inductively concluding that $\forall x\, P(x)$, i.e., if (canary_1
color yellow), (canary_2 color yellow), ..., (canary_k color yellow) then (forall
x (if canary x)(x color yellow)). More generally, when we observe the implications
$P(a_1) \rightarrow Q(b_1)$
$P(a_2) \rightarrow Q(b_2)$
...
$P(a_k) \rightarrow Q(b_k)$
Sec. 18.3 Some Definitions
In this section we introduce some important terms and concepts related to inductive
learning. Many of the terms we define will have significance in other areas as
well.
Our model of learning is the model described in Section 16.4. As you may
recall, our learner component is presented with training examples from a teacher
or from the environment. The examples may be positive instances only or both
positive and negative. They may be selected, well-organized examples or they may
be presented in a haphazard manner and contain much irrelevant information. Finally,
the examples may be correctly labeled as positive or negative instances of some
concept, they may be unlabeled, or they may even contain erroneous labels. Whatever
the scenario, we must be careful to describe it appropriately.
Given (1) the observations, (2) certain background domain knowledge, and
(3) goals or other preference criteria, the task of the learner is to find an inductive
assertion or target concept that implies all of the observed examples and is consistent
with both the background knowledge and goals. We formalize the above ideas
with the following definitions (Hunt et al., 1966, and Rendell, 1985).
Definitions
Target concept. The target is one concept which correctly classifies all
objects in the universe.
Positive instances. These are example objects which belong to the target
concept.
Consistent classification rule. This is a rule that is true for all positive
instances and false for all negative instances.
Figure 18.1 Lattice of object classes. Each node represents a disjunction of objects
from a universe of four objects.
In this section we consider some techniques which are essential for the application
of inductive learning algorithms. Concept learning requires that a guess or estimate
of a larger class, the target concept, be made after having observed only some
fraction of the objects belonging to that class. This is essentially a process of generaliza-
tion, of formulating a description or a rule for a larger class but one which is still
consistent with the observed positive examples. For example, given the three positive
instances of objects
a proper generalization which implies the three instances is blue cube. Each of the
instances satisfies the general description.
Specialization is the opposite of generalization. To specialize the concept blue
cube, a more restrictive class of blue cubes is required such as small blue cube or
flexible blue cube or any of the original instances given above. Specialization may
be required if the learning algorithm over-generalizes in its search for the target
concept. An over-generalized hypothesis is inconsistent since it will include some
negative instances in addition to the positive ones.
There are many ways to form generalizations. We shall describe the most
commonly used rules below. They will be sufficient to describe all of the learning
paradigms which follow. In describing the rules, we distinguish between two basic
types of generalization, comparable to the corresponding types of induction (Section
18.2), selective generalization and constructive generalization (Michalski, 1983).
Selective generalization rules build descriptions using only the descriptors (attributes
and relations) that appear in the instances, whereas constructive generalization rules
do not. These concepts are described further below.
Generalization Rules
Since specialization rules are essentially the opposite of rules for generalization,
to specialize a description, one could change variables to constants, add a conjunct
or remove a disjunct from a description, and so forth. Of course, there are other
means of specialization such as taking exceptions in descriptions (a fish is anything
that swims in water but does not breathe air as do dolphins). Such methods will be
introduced as needed.
These methods are useful tools for constructing knowledge structures. They
give us methods with which to formulate and express inductive hypotheses. Unfortu-
nately, they do not give us much guidance on how to select hypotheses efficiently.
For this, we need methods which more directly limit the number of hypotheses
which must be considered.
Selective Generalization
Changing Constants to Variables.
An example of this rule was given in Section 18.1. Given instances of a
description or predicate $P(a_1), P(a_2), \ldots, P(a_k)$, the constants $a_i$ are changed to
a variable which may be any value in the given domain, that is, $\forall x\, P(x)$.
Sec. 18.4 Generalization and Specialization 317
Dropping Condition.
Dropping one or more conditions in a description has the effect of expanding
or increasing the size of the set. For example, the set of all small red spheres is
less general than the set of small spheres. Another way of stating this rule when
the conjunctive description is given is that a generalization results when one or
more of the conjuncts is dropped.
Adding an Alternative.
This is similar to the dropping condition rule. Adding a disjunctive term general-
izes the resulting description by adding an alternative to the possible objects. For
example, transforming red sphere to (red sphere) $\lor$ (green pyramid) expands the
class of red spheres to the class of red spheres or green pyramids. Note that the
internal disjunction could also be used to generalize. An internal disjunction is one
which appears inside the parentheses such as (red $\lor$ green sphere).
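The dropping condition and adding an alternative rules can both be illustrated with a small sketch in which a description is a list of conjunctions and each conjunction is a set of attributes. This set-based encoding, the `covers` helper, and the sample objects are assumptions chosen for the example, not a representation given in the text.

```python
def covers(description, obj):
    """A disjunction of conjunctions covers obj if some conjunct set is a subset of it."""
    return any(conjunct <= obj for conjunct in description)

# Dropping a condition: "small red sphere" generalizes to "small sphere".
small_red_sphere = [frozenset({"small", "red", "sphere"})]
small_sphere = [frozenset({"small", "sphere"})]

# Adding an alternative: "red sphere" generalizes to "red sphere or green pyramid".
red_sphere = [frozenset({"red", "sphere"})]
red_sphere_or_green_pyramid = red_sphere + [frozenset({"green", "pyramid"})]

objects = [
    frozenset({"small", "red", "sphere"}),
    frozenset({"small", "blue", "sphere"}),
    frozenset({"large", "green", "pyramid"}),
]

covered_before = [covers(small_red_sphere, o) for o in objects]
covered_after = [covers(small_sphere, o) for o in objects]
```

In both cases the generalized description covers everything the original covered plus at least one further object, which is what makes it more general.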
[Figure: generalization tree with nodes Anything, Living things, Mammal, Elephant, and Whale.]
where $d_i < d_j$, have been observed, the generalization $D = [d_i .. d_j]$ can be
made; that is, D can be any value in the interval $d_i$ to $d_j$.
Constructive Generalization
Generating Chain Properties.
If an order exists among a set of objects, they may be described by their
ordinal position such as first, second, ..., nth. For example, suppose the relations
for a four story building are given as
above(f2, f1) & above(f3, f2) & above(f4, f3)
then a constructive generalization is
most_above(f4) & least_above(f1).
The most_above and least_above relations are created. They did not occur in the original
descriptors.
Other forms of less frequently used generalization techniques are also available
including combinations of the above. We will introduce such methods as we need
them.
descriptions based on a semantic net can be very large. For example, a representation
using only two levels of gray (light and dark) in a 1024 by 1024 pixel array will
have a hypothesis space in excess of $2^{10^6}$. Compare this with the semantic net
space which uses no more than 10 to 20 objects and a limited number of position
relationships. Such a space would have no more than $10^4$ or $10^5$ object-position
relationships. We see then that the difference in the size of the search space for
these two representations can be immense.
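The scale of the pixel-level space can be checked with a short calculation. The sketch assumes one two-valued choice per pixel, so a 1024 by 1024 array admits $2^{1024 \times 1024}$ distinct images; it counts decimal digits with logarithms rather than constructing the number.

```python
import math

pixels = 1024 * 1024                 # one light/dark choice per pixel
# The number of distinct images is 2**pixels. Count its decimal digits
# from the logarithm instead of building the huge integer.
digits = math.floor(pixels * math.log10(2)) + 1

semantic_net_digits = len(str(10 ** 5))   # upper figure quoted for the semantic net
```

The pixel space runs to hundreds of thousands of decimal digits, against six digits for the semantic net figure.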
Another simple example which limits the number of hypotheses is illustrated
in Figure 18.4. The tree representation on the left contains more information and,
therefore, will permit a larger number of object descriptions to be created than
with the tree on the right. On the other hand, if one is only interested in learning
general descriptions of geometrical objects without regard to details of size, the
tree on the right will be a superior choice since the smaller tree will result in less
search.
Methods based on the second general type of bias limit the search through
preferential hypotheses selection. One way this can be achieved is through the use
of heuristic evaluation functions. If it is known that a target concept should not
contain some object or class of objects, all hypotheses which contain these objects
can be eliminated from consideration. Referring again to Figure 18.1, if it is known
that object $o_3$ (or the description of $o_3$) should not be included in the target set, all
nodes above $o_3$ and connected to $o_3$ can be eliminated from the search. In this
case, a heuristic which gives preferential treatment would not choose descriptions
which contain $o_3$.
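The effect of this kind of bias on the lattice of Figure 18.1 can be sketched by enumerating the nodes directly. Treating each node as a nonempty subset of the four objects follows the figure caption; the object names used as strings are an assumption for the example.

```python
from itertools import combinations

objects = ["o1", "o2", "o3", "o4"]

# Every lattice node is a disjunction (nonempty subset) of the four objects.
hypotheses = [frozenset(c)
              for r in range(1, len(objects) + 1)
              for c in combinations(objects, r)]

# Bias: the target concept is known not to contain o3.
allowed = [h for h in hypotheses if "o3" not in h]
```

Excluding one object removes more than half of the candidate nodes here, and the saving grows quickly with the size of the universe.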
Another simple example which relates to learning an optimal play of the game
of Nim might use a form of preference which introduces a heuristic to block consider-
ation of most moves which permit an opponent to leave only one token. This eliminates
a large fraction of the hypotheses which must be evaluated (see Chapter Ii).
Bias can be strong or weak, correct or incorrect. A strong bias is one which
focuses on a relatively small number of hypotheses. A weak bias does not. A
correct bias is one which allows the learner to consider the target concept, whereas
an incorrect bias does not. Obviously, a learner's task is simplified when the bias
is both strong and correct (Utgoff, 1986). Bias can also be implemented in a program
as either static or dynamic. When dynamic bias is employed, it is shifted automatically
by the program to improve the learner's performance. We will see different forms
of bias used in subsequent sections.
Many learning programs have been implemented which construct descriptions com-
posed of conjunctive features only. Few have been implemented to learn disjunctive
descriptions as well. This is because conjunctive learning algorithms are easier to
implement. Of course, a simple implementation for a disjunctive concept learner
would be one which simply forms the disjunction of all positive training instances
as the target concept. Obviously, this would produce an awkward description if
there were many positive instances.
There are many concepts which simply cannot be described well in conjunctive
terms only. One of the best examples is the concept of uncle since an uncle can be
either the brother of the father or the brother of the mother of a child. To state it
any other way is cumbersome.
The system we describe below was first implemented at M.I.T. (Iba, 1979).
It is a more traditional AI type of learner than the systems of the previous chapter
in that it builds symbolic English-like descriptions and the learning process is more
algorithmic in form. This system learns descriptions which are essentially in disjunctive
normal form. Consequently, a broad range of descriptions is possible. Furthermore,
the system can learn either concept descriptions from attribute values or
structural descriptions of objects.
The training set we use here consists of a sequence of labeled positive and
negative instances of the target concept. Each instance is presented to the learner
as an unordered list of attributes together with a label which specifies whether or
not the instance is positive or negative.
For this first example, we require our learner to learn the disjunctive concept
"something that is either a tall flower or a yellow object." One such instance of
this concept is represented as (short skinny yellow flower +), whereas a negative
instance is (brown fat tall weed -). Given a number of positive and negative training
instances such as these, the learner builds frame-like structures with groups of slots
we will call clusters as depicted in Figure 18.5.
[Figure 18.5: frame-like concept structure with slots for cluster descriptions and examples, and a negative part with examples.]
The target concept is given in the concept name. The actual description is
then built up as a group of slots labeled as clusters. All training examples, both
positive and negative, are retained in the example slots for use or reuse in building
up the descriptions. An example will illustrate the basic algorithm.
Each cluster in the frame of Figure 18.5 is treated as a disjunctive term, and descriptions
within each cluster are treated as conjuncts. A complete learning cycle will clarify
the way in which the clusters and frames (concepts) are created. We will use the
following training examples from a garden world to teach our learner the concept
"tall flower or yellow object."
After accepting the first training instance, the learner creates the tentative
concept hypothesis "a tall fat brown flower." This is accomplished by creating a
cluster in the positive part of the frame as follows:
With only a single example, the learner has concluded the tentative concept
must be the same as the instance. However, after the second training instance, a
new hypothesis is created by merging the two initial positive instances. Two instances
are merged by taking the set intersection of the two. This results in a more general
description, but one which is consistent with both positive examples. It produces
the following structure.
The next training example is also a positive one. Therefore, the set intersection
of this example and the current description is formed when the learner is presented
with this example. The resultant intersection and new hypothesis is an overgeneralization,
namely, the null set, which stands for anything.
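The merge step can be sketched in a few lines. This is only an illustration, assuming each instance is held as a Python set of attribute words; the second instance below is hypothetical, since only the first instance is named in the text.

```python
# Merging positive instances by set intersection (illustrative sketch).

def merge(description, instance):
    """Return the more general description shared by both arguments."""
    return description & instance

def covers(description, instance):
    """A description covers an instance when all of its terms appear in it."""
    return description <= instance

first = {"tall", "fat", "brown", "flower"}       # first instance, from the text
second = {"green", "tall", "skinny", "flower"}   # hypothetical second instance
third = {"skinny", "short", "yellow", "weed"}    # a positive instance with nothing in common

hypothesis = merge(first, second)        # only the shared terms survive
overgeneral = merge(hypothesis, third)   # nothing is shared: the null description

# The null description covers every instance, including a negative one.
negative = {"tall", "fat", "brown", "weed"}
covers_negative = covers(overgeneral, negative)
```

Because the empty set is a subset of every instance, the null description covers the negative example as well, which is why the learner must split the cluster.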
The fourth instance is a negative one. This instance is inconsistent with the
current hypothesis which includes anything. Consequently, the learner must revise
its hypothesis to exclude this last instance.
Sec. 18.6 Example of an Inductive Learner 393
It does this by splitting the first cluster into two new clusters which are then
both compatible with the negative instance. Each new cluster corresponds to a disjunctive
term in this description.
To build the new clusters, the learner uses the three remembered examples
from the first cluster. It merges the examples in such a way that each merge produces
new consistent clusters. After merging we get the following revised frame.
The reader should verify that this new description has now excluded the negative
instance.
The next training example is all that is required to arrive at the target concept. To
complete the description, the learner attempts to combine the new positive instance with
each cluster by merging as before, but only if the resultant merge is compatible with
all negative instances (one, in this case). If the new instance cannot be merged
with any existing cluster without creating an inconsistency, a new cluster is created.
Merging the new instance with the first cluster results in the same cluster.
Merging it with the second cluster produces a new, more general cluster description
of yellow. The final frame obtained is as follows.
cluster:description:(yellow)
       :examples:(skinny short yellow weed)
                 (fat yellow flower tall)
negative-part:
       :examples:(tall fat brown weed)
The completed concept now matches the target concept "tall flower or yellow object."
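The handling of a new positive instance described above can be sketched as follows. This is a minimal sketch, not the original system: the function names are invented for illustration, and the two cluster descriptions that exist before the last instance arrives are an assumption reconstructed from the narrative.

```python
# Adding a positive instance to a set of clusters (illustrative sketch).

def covers(description, instance):
    return description <= instance

def consistent(description, negatives):
    """True when the description covers none of the negative instances."""
    return not any(covers(description, n) for n in negatives)

def add_positive(clusters, negatives, instance):
    """Merge the instance into every cluster where the merge stays consistent;
    otherwise start a new cluster holding just this instance."""
    merged = False
    for cluster in clusters:
        candidate = cluster["description"] & instance
        if consistent(candidate, negatives):
            cluster["description"] = candidate
            cluster["examples"].append(instance)
            merged = True
    if not merged:
        clusters.append({"description": set(instance), "examples": [instance]})

# Assumed state before the final positive instance is presented.
clusters = [
    {"description": {"tall", "flower"}, "examples": []},
    {"description": {"skinny", "short", "yellow", "weed"},
     "examples": [{"skinny", "short", "yellow", "weed"}]},
]
negatives = [{"tall", "fat", "brown", "weed"}]

add_positive(clusters, negatives, {"fat", "yellow", "flower", "tall"})
descriptions = [c["description"] for c in clusters]
```

Under these assumptions the first cluster is unchanged and the second generalizes to the single term yellow, as in the final frame.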
394 Learning by Induction Chap. 18
The above example illustrates the basic cycle but omits some important factors.
First, the order in which the training instances are presented to the learner is important.
Different orders, in general, will result in different descriptions and may require
different numbers of training instances to arrive at the target concept.
Second, when splitting and rebuilding clusters after encountering a negative
example, it is possible to build clusters which are not concise or maximal in the
sense that some of the clusters could be merged without becoming inconsistent.
Therefore, after rebuilding new clusters it is necessary to check for this maximality
and merge clusters where possible without violating the inconsistency condition.
Another brief example will illustrate this point. Here we want to learn the concept
"something that is either yellow or spherical." For this, we use the following
training instances from a blocks world.
After the first three training examples have been given to the learner, the resultant
description is the empty set.
Since the next two examples are also positive, the only change to the above
frame is the addition of the fourth and fifth training instances to the cluster examples.
However, the sixth training instance is negative. This forces a split due to the
inconsistency. In rebuilding the clusters this time, we rebuild starting with the last
positive (fifth) example and work backwards as though the examples were put on a
stack. This is actually the order used in the original system. After the fifth and
fourth examples are processed, the following frame is produced.
negative-part:
examples:
Note that we still have not arrived at the target concept. The last training instance,
a negative one, is needed to do the trick. The first cluster is inconsistent with this
instance. Therefore, it must be split. After completing this split we get the new frame
All of the new clusters are now consistent with the negative examples. But
the clusters are not maximal, since it is possible to merge some clusters without
violating the inconsistency condition. To obtain maximality, the clusters must be
rewritten and merged where possible. This rewrite is accomplished by copying the
first cluster and then successively merging or copying the other clusters in combination.
Of course, a merge can be completed only when an inconsistency does not result.
The first two clusters cannot be merged as we know from the above. The
first and third clusters can be merged to give yellow. The second and fourth clusters
can also be merged to produce sphere. These are the only merges that can be
made that are compatible with both negative examples. The final frame then is
given as
It may have been noticed already by the astute reader that there is no reason
why negative-part clusters could not be created as well. Allowing this more symmetric
structure permits the creation of a broader range of concepts such as "neither yellow
nor spherical" as well as the positive type of concepts created above. This is implemented
by building clusters in the negative part of the frame using the negative
examples in the same way as the positive examples. In building both descriptions
concurrently, care must be taken to maintain consistency between the positive and
negative parts. Each time a negative example is presented, it is added to the negative
part of the model, and a check is made against each cluster in the positive part of
the model for inconsistencies. Any of the clusters which are inconsistent are split
into clusters which are maximal and consistent and which contain all the original
examples among them. We leave the details as an exercise.
Network Representations
((nodes (a b c))
 (links (ako a brick)
        (ako b brick)
        (ako c brick)
        (supports a c)
        (supports b c)) +)
Since an arch can support materials other than a brick, another positive example
of the concept arch might be identical to the one above except for the object supported,
say a wedge. Thus, substituting (ako c wedge) for (ako c brick) above we get a
second positive instance of arch. These two examples can now be generalized into
a single cluster by simply dropping the differing conjunctive ako terms to get the
following.
((nodes (a b c))
 (links (ako a brick)
        (ako b brick)
        (supports a c)
        (supports b c)))
((nodes (a b c))
 (links (ako a brick)
        (ako b brick)
        (ako c sphere)
        (supports a c)
        (supports b c)) -)
This example satisfies the current description of an arch. However, it has caused
an inconsistency. Therefore, the cluster must be split into a disjunctive description
as was done before in the previous examples. The process is essentially the same
except for the representation scheme.
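The same merge and consistency test carry over to the network form. The sketch below is an illustration only: it assumes each example is stored as a set of link tuples and that the node names a, b, c are already aligned between examples, so no graph matching is attempted.

```python
# Generalizing network descriptions by dropping differing links (sketch).

SUPPORT_LINKS = {("supports", "a", "c"), ("supports", "b", "c")}
PILLARS = {("ako", "a", "brick"), ("ako", "b", "brick")}

brick_arch = PILLARS | SUPPORT_LINKS | {("ako", "c", "brick")}    # positive
wedge_arch = PILLARS | SUPPORT_LINKS | {("ako", "c", "wedge")}    # positive
sphere_case = PILLARS | SUPPORT_LINKS | {("ako", "c", "sphere")}  # negative

def generalize(links_1, links_2):
    """Keep only the links common to both examples."""
    return links_1 & links_2

def satisfies(example, description):
    """An example satisfies a description that is a subset of its links."""
    return description <= example

arch = generalize(brick_arch, wedge_arch)
# The negative example still satisfies the generalized description,
# which is the inconsistency that forces a split.
needs_split = satisfies(sphere_case, arch)
```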
In a similar manner the concept of uncle can be learned with instances presented
and corresponding clusters created using a network representation as follows.
((nodes (a b c))
 (links (ako a person)
        (ako b person)
        (ako c person)
        (male c)
        (parent-of b a)
        (brother-of c b)) +)
18.7 SUMMARY
EXERCISES
18.1. Define inductive learning and explain why we still use it even though it is not a
"valid" form of learning.
18.2. What is the difference between a class and a concept?
18.3. What is the difference between selective, constructive, and expedient induction? Give
examples of each.
18.4. What is the purpose of inductive bias?
18.5. Give three examples in which inductive bias can be applied to constrain search.
18.6. Use the following training examples to simulate learning the concept "green flower
or skinny object." Build up the concept description in clusters using the same method
as that described in Section 18.6.
(green tall fat flower +)
(skinny green short flower +)
(tall skinny green flower +)
(red skinny short weed +)
(green short fat weed -)
(tall green flower skinny +)
18.7. Work out an example of concept learning using network structures. The concept to
be learned is the concept wife. Create both positive and negative training examples.
18.8. Write a computer program in LISP to build concept descriptions in the form of
clusters for the examples of Section 18.7.
18.9. The method described in Section 18.7 for learning concepts depends on the order in
which examples are presented. State what modifications would be required to make
the learner build the same structures independent of the order in which training
examples are presented.
18.10. Compare each of the generalization methods described in Section 18.4 and explain
when each method would be appropriate to use.
18.11. Referring to the previous problem, rank the generalization methods by estimated
computation time required to perform each.
18.12. Give an example of learning the negative of the concept "tall flower or red object,"
that is, something that is "neither a tall flower nor a red object."
19.1 INTRODUCTION
In this chapter we continue with our study of inductive learning. Here, we review
four other important learning systems based on the inductive paradigms.
Perhaps the most significant difference among these systems is the type of
knowledge representation scheme and the learning algorithms used. The first system
we describe, ID3, constructs a discrimination tree for use in classifying objects.
The second system, LEX, creates and refines heuristic rules for carrying out symbolic
integrations. The third system, INDUCE, constructs descriptions in an extended
form of predicate calculus. These descriptions are then used to classify objects
such as soybean diseases. Our final system, Winston's Arch, forms conjunctive
network structures similar to the ones described in the previous chapter.
ID3 was developed in the late 1970s (Quinlan, 1983) to learn object classifications
from labeled training examples. The basic algorithm is based on earlier research
programs known as Concept Learner Systems or CLSs (Hunt et al., 1966). This
402 Examples of Other Inductive Learners Chap. 19
system is also similar in many respects to the expert system architecture described
in Section 15.3.
The CLS algorithms start with a set of training objects $O = \{o_1, o_2, \ldots, o_n\}$
from a universe $U$, where each object is described by a set of $m$ attribute
values. An attribute $A_j$ having a small number of discrete values $a_{j1}, a_{j2}, \ldots, a_{jk_j}$
is selected and a tree node structure is formed to represent $A_j$. The node has $k_j$
branches emanating from it where each branch corresponds to one of the $a_{jk}$ values
(Figure 19.1). The set of training objects $O$ is then partitioned into at most $k_j$
subsets based on the objects' attribute values. The same procedure is then repeated
recursively for each of these subsets using the other $m - 1$ attributes to form
lower level nodes and branches. The process stops when all of the training objects
have been subdivided into single class entities which become labeled leaf nodes of
the tree.
The resulting discrimination tree or decision tree can then be used to classify
new unknown objects given a description consisting of its m attribute values. The
unknown is classified by moving down the learned tree branch by branch in concert
with the values of the object's attributes until a leaf node is reached. The leaf
node is labeled with the unknown's name or other identity.
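The recursive partitioning and the branch-by-branch classification can be sketched as follows. This is an illustrative sketch rather than the CLS program itself: attributes are taken in the order given, the majority-label fallback when attributes run out is an assumption, and the training data are hypothetical.

```python
from collections import Counter

def build_tree(objects, attributes):
    """objects: list of (attribute_dict, label); attributes: names not yet used."""
    labels = [label for _, label in objects]
    if len(set(labels)) == 1:
        return labels[0]                     # single-class subset becomes a leaf
    if not attributes:
        # Assumption: fall back to the majority label if attributes run out.
        return Counter(labels).most_common(1)[0][0]
    attribute, rest = attributes[0], attributes[1:]
    branches = {}
    for description, label in objects:       # partition on the attribute's values
        branches.setdefault(description[attribute], []).append((description, label))
    return (attribute, {value: build_tree(subset, rest)
                        for value, subset in branches.items()})

def classify(tree, description):
    """Move down the tree branch by branch until a leaf label is reached.
    Raises KeyError for an attribute value never seen in training."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[description[attribute]]
    return tree

# Hypothetical training objects.
training = [
    ({"color": "black", "horned": "yes"}, "+"),
    ({"color": "black", "horned": "no"}, "-"),
    ({"color": "brown", "horned": "yes"}, "-"),
]
tree = build_tree(training, ["color", "horned"])
label = classify(tree, {"color": "black", "horned": "yes"})
```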
ID3 is an implementation of the basic CLS algorithm with some modifications.
In the ID3 system, a relatively small number of training examples are randomly
selected from a large set of objects $O$ through a window. Using these training
examples, a preliminary discrimination tree is constructed. The tree is then tested
by scanning all the objects in $O$ to see if there are any exceptions to the tree. A
new subset or window is formed using the original examples together with some
of the exceptions found during the scan. This process is repeated until no exceptions
are found. The resulting discrimination tree can then be used to classify new objects.
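The window loop can be sketched independently of the tree builder. In the sketch below the function name, the `max_added` limit, and the fixed random seed are assumptions made for illustration; `build` and `classify` stand for any learner and its classifier.

```python
import random

def learn_with_window(objects, build, classify, window_size, max_added=5, seed=0):
    """objects: list of (description, label) pairs.
    Repeatedly build a model from the window, scan all objects for
    exceptions, and enlarge the window with some of them."""
    rng = random.Random(seed)
    window = rng.sample(objects, min(window_size, len(objects)))
    while True:
        model = build(window)
        exceptions = [(d, label) for d, label in objects
                      if classify(model, d) != label]
        new_items = [item for item in exceptions if item not in window][:max_added]
        if not new_items:
            # Either no exceptions remain, or the learner cannot fit
            # the ones already in the window; stop in both cases.
            return model
        window.extend(new_items)
```

Each pass either returns or strictly enlarges the window, so the loop terminates after at most as many passes as there are objects.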
Another important difference introduced in ID3 is the way in which the attributes
are ordered for use in the classification process. Attributes which discriminate best
are selected for evaluation first. This requires computing an estimate of the expected
information gain using all available attributes and then selecting the attribute having
the largest expected gain. This attribute is assigned to the root node. The attribute
having the next largest gain is assigned to the next level of nodes in the tree and
so on until the leaves of the tree have been reached. An example will help to
illustrate this process.
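For simplicity, we assume here a single-class classification problem, one where
all objects either belong to class C or U-C. Let $h$ denote the fraction of objects
in the training set of $n$ objects that belong to class C. For attribute $A_j$ with values
$a_{jk}$, let $c_{jk}$ be the number of training objects in C having value $a_{jk}$, $d_{jk}$ the
number not in C having that value, $p_{jk} = (c_{jk} + d_{jk})/n$, $f_{jk} = c_{jk}/(c_{jk} + d_{jk})$,
and $g_{jk} = 1 - f_{jk}$.
Now define
$$H_c(h) = -h \log h - (1 - h) \log(1 - h)$$
with $H(0) = 0$, and
$$H_{jk} = -f_{jk} \log f_{jk} - g_{jk} \log g_{jk}$$
as the information content for class C and attribute value $a_{jk}$, respectively. Then the expected
value (mean) of $H_{jk}$ is just
$$E(H_j) = \sum_k p_{jk} H_{jk}$$
and the gain for attribute $A_j$ is
$$G_j = H_c - E(H_j)$$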
Each $G_j$ is computed and ranked. The ones having the largest values determine
the order in which the corresponding attributes are selected in building the discrimination
tree.
In the above equation, quantities computed as
$$H = -\sum_i p_i \log p_i$$
are known as the information theoretic entropy, where $p_i$ is the probability of occurrence
of some event $i$. The quantities $H$ provide a measure of the dispersion or surprise
in the occurrence of a number of different events. The gains $G_j$ measure the information
to be gained in using a given attribute.
In using attributes which contain much information, one should expect that
the size of the decision tree will be minimized in some sense, for example, in total
number of nodes. Therefore, choosing those attributes which contain the largest
gains will, in general, result in a smaller attribute set. This amounts to choosing
those attributes which are more relevant in characterizing given classes of objects.
In concluding this section, an example of a small decision tree for objects
described by four attributes is given in Figure 19.2. The attributes and their values
are horned = {yes, no}, color = {black, brown, white, grey}, weight = {heavy,
medium, light}, and height = {tall, short}. One training set for this example consists
of the eight instances given below where members of the class C have been labeled
with + and nonmembers with - (class C might, for example, be the class of
cows).
Figure 19.2 Discrimination tree for three attributes ordered by information gain. Abbreviations
are br = brown, bk = black, w = white, g = gray, y = yes, n = no, h = heavy, m =
medium, l = light, t = tall, and s = short.
Attribute  k  Value    c_jk  c_jk+d_jk  f_jk  log f_jk  log g_jk  H_jk   Gain
color      1  brown    0     1          0     -         -         0
           2  black    2     3          2/3   -0.585    -1.585    0.918
           3  white    1     2          1/2   -1.0      -1.0      1.0    0.361
           4  grey     0     2          0     -         -         0
height     1  tall     3     6          1/2   -1.0      -1.0      1.0
           2  short    0     2          0     -         -         0      0.205
weight     1  heavy    2     4          1/2   -1.0      -1.0      1.0
           2  medium   1     2          1/2   -1.0      -1.0      1.0    0.205
           3  light    0     2          0     -         -         0
horned     1  yes      3     5          3/5   -0.737    -1.322    0.971
           2  no       0     3          0     -         -         0      0.348
The other gain values for horned, weight, and height are computed in a similar
manner.
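The gain computation can be checked with a short script. The sketch below takes the counts of class members and totals per attribute value from the table, assumes base-2 logarithms (which match the tabulated log values), and uses function names chosen only for illustration. Computed at full precision, the gains agree with the tabulated ones to within about 0.001, the difference being rounding of intermediate values.

```python
from math import log2

def entropy(fraction):
    """Two-outcome entropy in bits, with H(0) = H(1) = 0."""
    if fraction in (0.0, 1.0):
        return 0.0
    return -fraction * log2(fraction) - (1 - fraction) * log2(1 - fraction)

def gain(value_counts, n, h):
    """value_counts: one (c, c + d) pair per attribute value."""
    expected = sum((total / n) * entropy(c / total) for c, total in value_counts)
    return entropy(h) - expected

# (members of C, objects with that value), read from the table.
counts = {
    "color":  [(0, 1), (2, 3), (1, 2), (0, 2)],
    "height": [(3, 6), (0, 2)],
    "weight": [(2, 4), (1, 2), (0, 2)],
    "horned": [(3, 5), (0, 3)],
}
n = 8        # training instances
h = 3 / 8    # fraction of instances in class C

gains = {attribute: gain(pairs, n, h) for attribute, pairs in counts.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
```

The ranking places color first and horned second, with height and weight tied.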
LEX was developed during the early 1980s (Mitchell et al., 1983) to learn heuristic
rules for the solution of symbolic integration problems. The system is given about
40 integration operators which are expressed in the form of rewrite rules. Some of
the rules are shown in Figure 19.3a. Internal representations for some typical integral
expressions are given in Figure 19.3b:
Each of the operators has preconditions which must be satisfied before it can
be applied. For example, before OP6 can be applied, the general form of the integrand
must be the product of two real functions, that is, $u\,dv = f_1(x) * f_2(x)\,dx$. Each
operator also has associated with it the resultant states that can be produced by
that operator. For example, OP6 can have the result obtained with $u$ bound to $f_1$
and $dv$ to $f_2\,dx$, or the result obtained from the opposite bindings. The choice
of poor bindings, or of an incorrect or poor operator at a given stage
in the solution, will lead to failure or possibly to a lengthy solution. The learning
problem then is to create or refine heuristic rules which suggest when an operator
should be used.
All heuristics in LEX are of the form
$\int f(x) * \mathrm{trig}(x)\,dx \Rightarrow$ Apply OP6 with bindings $u = f(x)$ and $dv = \mathrm{trig}(x)\,dx$
Part of the refinement problem is the generalization or specialization of the
heuristics to apply to as many consistent instances as possible. Generalization and
specialization are achieved in LEX through the use of a hierarchical description
tree. A segment of this tree is depicted in Figure 19.4. Thus, when a rule applies
to more than a single trig function such as to both sin and cos, the more general
term trig would be substituted in the rule. Likewise, when a rule is found which
applies to both log and exp functions, the exp_log description would be used.
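Climbing the description tree amounts to finding the lowest node that lies above both terms. The sketch below assumes a small hypothetical fragment of the hierarchy, written as a child-to-parent table; the node names other than trig and exp_log, and the function names, are illustrative only.

```python
# Hypothetical fragment of a generalization hierarchy: child -> parent.
PARENT = {
    "sin": "trig", "cos": "trig", "tan": "trig",
    "log": "exp_log", "exp": "exp_log",
    "trig": "transc", "exp_log": "transc",
}

def ancestors(term):
    """The term itself followed by each node above it, up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def generalize(term_1, term_2):
    """Return the most specific description covering both terms, or None."""
    above_first = set(ancestors(term_1))
    for candidate in ancestors(term_2):
        if candidate in above_first:
            return candidate
    return None
```

With this table, `generalize("sin", "cos")` yields trig and `generalize("log", "exp")` yields exp_log, while terms from different branches meet only at the higher node.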
LEX is comprised of four major components as illustrated in Figure 19.5.
The Problem Generator selects and generates informative integration problems which
are submitted to the Problem Solver for solution. The Generator was included as
part of the system to provide well-ordered training examples and to make the system
a fully automatic learner. The Problem Solver attempts to find a solution to this
problem using available heuristics and operators. (A solution has been found when
an operator produces an expression not containing an integral.)
Figure 19.4 A segment of the LEX generalization tree grammar.
Output from the Problem Solver is some solution together with a complete
trace of the solution search. This is presented to the Critic unit for evaluation. The
Critic then analyzes the solution trace, comparing it to a least-cost path and passes
related positive or negative training instances to the Generalizer. A positive instance
is an operator which lies on the least-cost path, while a negative instance is one
lying off the path. Given these examples, the Generalizer modifies heuristics to
improve the selection of operators for best application during an attempted solution.
During the learning process each operator is given a version space of heuristic
rules associated with it. Rules in the version space give the conditions under which
the operator applies as well as the domain of states that can be produced. The
version space is actually stored as two bounding rules, a rule G which gives the
most general conditions of application and a rule S which gives the most specific
conditions. Between these two bounds are implicitly contained all plausible versions
of the heuristic. As the system learns, the bound S is made more general to include
all positive instances presented by the Critic while the bound G is made more
specific to exclude all negative instances. When the two bounds become equal
(G = S), the correct heuristic has been learned.
As an example of heuristic refinement for the operator OP6, suppose the Problem
Generator has submitted the following two integrals for solution.
$\int 2x * \sin(x)\,dx$
$\int 7x * \cos(x)\,dx$
Let the version space for the OP6 heuristic be initialized to the G and S
bounds illustrated in Figure 19.6. The functions $f_1(x)$ and $f_2(x)$ are any real-valued
functions of $x$, and S has been set to the first specific problem instance. (Operators
between G and S are implicitly contained in the version space.)
In solving the first integral above, the Problem Solver finds that operator
OP6, integration by parts, is applicable. For this operator, two different bindings
are possible:
(a) $u = 2x$, $dv = \sin(x)\,dx$
(b) $u = \sin(x)$, $dv = 2x\,dx$
If the bindings given by (a) are used, OP6 produces the new expression
$2x * (-\cos(x)) - \int 2 * (-\cos(x))\,dx$
which can be further reduced using OP2, ON and other simplification operators to
give the correct solution
$-2x * \cos(x) + 2 * \sin(x) + C$
For this binding, the Critic will label this as a positive instance.
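The solution obtained with the bindings $u = 2x$, $dv = \sin(x)\,dx$ can be checked numerically by differentiating it and comparing with the integrand. This is only a sanity check using a central difference; the step size and sample points are arbitrary choices.

```python
from math import sin, cos

def integrand(x):
    return 2 * x * sin(x)

def antiderivative(x):
    # Result of integration by parts with u = 2x, dv = sin(x) dx (C omitted).
    return -2 * x * cos(x) + 2 * sin(x)

def derivative(f, x, step=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + step) - f(x - step)) / (2 * step)

sample_points = [0.0, 0.5, 1.0, 2.0, 3.0]
max_error = max(abs(derivative(antiderivative, x) - integrand(x))
                for x in sample_points)
```

A `max_error` near zero indicates that the derivative of the result reproduces the integrand at the sampled points.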
On the other hand, if the variable bindings given in (b) are used, OP6 produces
the more complex expression
$x^2 * \sin(x) - \int x^2 * \cos(x)\,dx$
In this case the Critic will label that instance as a negative one. This negative
instance will be used to adjust the version space to exclude this instance by specializing
the G integrand to either
$\int \mathrm{poly}(x) * f_2(x)\,dx$ or to $\int f_1(x) * \mathrm{transc}(x)\,dx$
Several versions of the INDUCE system were developed beginning in the late 1970s
(Larson and Michalski, 1977, and Dietterich and Michalski, 1981). INDUCE is a
system which discovers similar patterns among positive examples of objects and
formulates generalized descriptions which characterize the class patterns. The description
language and the internal representation used in the system are an extension
of first order predicate calculus. This is one of the unique features of INDUCE.
Before outlining how the system operates, we introduce a few new terms:
This expression states that a particular leaf contains spots, is yellow, curled at the
sides, is 4 cm in length and 2.3 cm wide.
mation to create the new descriptors. These are then added to the ranked list at
their appropriate locations.
4. Each of the descriptions in the list is then tested for consistency and completeness.
A description is consistent if it does not cover any negative object. It is
complete if it covers all the (positive) objects. Those that pass the test are removed
from the ranked list and put on a solutions list. Incomplete but consistent descriptions
are put on a consistent list. Any descriptors remaining are specialized by appending
selectors from the original list. These modified descriptions are tested for consistency
and completeness and the process is repeated until predefined list size limits are
exceeded or until all descriptors have been put on either the solutions or the consistent
list.
5. Each of the descriptors on the consistent list is made more generic using
generalizations such as climbing a generalization tree or closing an interval. The
generalizations are then ranked and pruned using a Lexicographic Evaluation Function
(LEF) and the best m of these are chosen as the description. The LEF uses criteria
established by the user such as maximum examples covered, simplest descriptions
(fewest terms), or user defined least cost. The final descriptions on the solutions
list are the induced (generalized) descriptions which cover all the training instances.
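The consistency and completeness tests of step 4 can be sketched as follows. This is a simplification made for illustration: descriptions are reduced to attribute-value selectors stored in a dictionary, whereas the system itself works with structural descriptions in an extended predicate calculus. All names and the sample objects are hypothetical.

```python
def covers(description, obj):
    """A description covers an object that satisfies every selector."""
    return all(obj.get(attr) == value for attr, value in description.items())

def is_consistent(description, negatives):
    return not any(covers(description, obj) for obj in negatives)

def is_complete(description, positives):
    return all(covers(description, obj) for obj in positives)

def sort_descriptions(candidates, positives, negatives):
    """Split candidates into the solutions list, the consistent list,
    and those remaining to be specialized."""
    solutions, consistent, remaining = [], [], []
    for description in candidates:
        if not is_consistent(description, negatives):
            remaining.append(description)
        elif is_complete(description, positives):
            solutions.append(description)
        else:
            consistent.append(description)
    return solutions, consistent, remaining

# Hypothetical objects and candidate descriptions.
positives = [{"color": "red", "shape": "cube"}, {"color": "red", "shape": "sphere"}]
negatives = [{"color": "green", "shape": "cube"}]
candidates = [{"color": "red"}, {"shape": "sphere"}, {"shape": "cube"}]

solutions, consistent, remaining = sort_descriptions(candidates, positives, negatives)
```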
The following example will illustrate this process.
Assume the following three descriptions have been given for the object instances
displayed in Figure 19.7.
$\exists o_1, o_2\ [\mathrm{color}(o_1) = \mathrm{green}][\mathrm{shape}(o_1) = \mathrm{sphere}][\mathrm{size}(o_1) = \mathrm{large}]$
Using the procedure outlined above, INDUCE would discover the generalized
description for the examples as "a red object supports another object," that is
$\exists x, y\ [\mathrm{color}(x) = \mathrm{red}][\mathrm{supports}(x, y)]$
Winston's Arch system was developed early in 1970 (Winston, 1975). This work
has been noted as one of the most influential projects in recent AI research and
has been cited as being responsible for stimulating renewed research in the area of
machine learning.
The Arch system learns concepts in the form of associative network representations
(Figure 19.8) much like the cluster network representations of the previous
chapter. The Arch system, however, is not able to handle disjunctive descriptions.
Given positive and negative training examples like the ones in Figures 19.9
and 19.10, the system builds a generalized network description of an arch such as
initial concept description. The next example is then matched against this description
using a graph-matching algorithm. This produces a common subgraph and a list of
nodes and links which differ. The unmatched nodes and links are tagged with comments
which are used to determine how the current description should be modified.
If the new example is positive, the description is generalized (Figure 19.9) by
either dropping nodes or links or replacing them with more generalized ones obtained
from a hierarchical generalization tree. If the new example is a negative one, the
description is specialized to exclude that example (Figure 19.10).
The negative examples are called near misses, since they differ from a positive
example in only a single detail. Note the form of specialization used in Figure
19.10. This is an example of specialization by taking exception. The network representation
for these exceptions uses must and must not links to emphasize the fact that
an arch must not have these features.
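The generalize-or-specialize cycle can be sketched with sets of links. This is a rough illustration, not Winston's program: it assumes nodes are already aligned so that no graph matching is needed, it handles only near misses that add a single extra link, and the touches link in the usage data is hypothetical.

```python
def update(description, example_links, positive):
    """description: (must, must_not) pair of link sets."""
    must, must_not = description
    if positive:
        # Generalize: keep only the required links the new example shares.
        return must & example_links, must_not
    extra = example_links - must
    if len(extra) == 1:
        # Near miss: forbid the single link that makes the difference.
        return must, must_not | extra
    return must, must_not      # not a near miss; left unchanged in this sketch

def matches(description, links):
    must, must_not = description
    return must <= links and not (must_not & links)

arch_links = frozenset({("supports", "a", "c"), ("supports", "b", "c")})
near_miss = arch_links | {("touches", "a", "b")}   # hypothetical near miss

description = (arch_links, frozenset())
description = update(description, near_miss, positive=False)
```

After the update the description still matches the original arch but rejects the near miss.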
19.6 SUMMARY
Examples of four different inductive learning paradigms were presented in this chapter.
In the first paradigm, the ID3 system, classifications were learned from a set of
positive examples only. The examples were described as attribute values of objects.
The classifications were learned in the form of a discrimination tree. Once created,
the ID3 system used the tree to classify new unknown objects. Attributes are selected
on the basis of the information gain expected. This tends to result in a small tree.
LEX, the second system described, learned heuristics to choose when certain operators
should be used in symbolic integration problems. One of the interesting features of
LEX is the use of a version space which bounds the set of plausible heuristics that
are applicable in a given problem state. LEX uses a syntactic form of bias, its
grammar, to limit the size of the hypothesis space. In LEX, generalizations are
found by climbing a hierarchical description tree.
The third system considered, INDUCE, formed generalized descriptions of
classes of objects in an extended form of predicate calculus. This system builds both
attribute and structural types of descriptions. One weakness of this system, however,
is the amount of processing required when creating descriptions by removing the
selectors in all possible ways. When the number of object descriptions becomes
large, the computation becomes excessive.
The fourth and final inductive learner described in this chapter was Winston's
Arch. This system builds associative net representations of structural concepts. One
of the unique aspects of this system is the use of near miss negative examples
which differ from positive examples in only a single feature. This simplifies the
learning process somewhat. This system is similar in some ways to the one described
in the previous chapter. It is not able to build disjunctive descriptions like that
system, however.
Similar features shared by all of these systems include the use of symbolic
representations, and the methods of generalization and specialization. The principal
differences among these paradigms are the forms of symbolic representation schemes
used and the algorithms employed for generalizing and specializing.
EXERCISES
19.1. Derive the discrimination tree of Figure 19.2 using attributes arranged in the following
order: height, weight, horned, and color.
19.2. Prove that the entropy H is maximum when $p_i = p_j$ for all $i, j = 1, 2, \ldots, n$,
where
$H = -\sum_{i=1}^{n} p_i \log p_i$
19.3. Plot the entropy as a function of p for the case n = 2; that is, plot H as p ranges
from 0 to 1.
19.4. Describe how LEX generalizes from tan, cos, ln, and log to transc (transcendental).
19.5. Use the concepts related to version space bounding to illustrate how a system can
learn the concept of a large circle from both positive and negative examples of
objects described by the attributes shape (circle, square, triangle), size (small, medium,
large), and color (red, blue, green).
19.6. What is the difference between a standard disjunction and an internal disjunction?
19.7. What is the significance of a cover, and how does it relate to a concept?
20.1 INTRODUCTION
Sec. 20.2 Analogical Reasoning and Learning 417
familiar experience as a guide in dealing with the new experience. And, since so
many of our acts are near repetitions of previous acts, analogical learning has gained
a prominent place in our learning processes.
EBL methods of learning differ from other methods in that the learned knowledge
is valid knowledge. It is derived from a set of facts through a deductive reasoning
process and, therefore, is justified, consistent knowledge. These EBL methods will
most likely find use in conjunction with other learning paradigms where it is important
to validate newly learned knowledge.
A - B
(A is like B) or more generally
A : B :: C : D
(A is to B as C is to D)
Analogical and Explanation-Based Learning Chap. 20
418
where one of the components is missing. For example, the type of word-object
and geometrical analogies typically found in aptitude or GRE tests are given by
green : go :: red : ____
Examples of more abstract analogies are the planetary system or atomic model
noted above, the proof of a theorem based on a similar known proof, solving a
new problem from knowledge of an old familiar problem solving technique, learning
to play the card game of bridge from a knowledge of hearts, or producing a new
algorithm for a program using previously learned programming examples and concepts.
Although applications of analogical methods have received less attention in
AI than other methods, some important results have been published including (Burstein,
1986, Carbonell, 1983 and 1986, Greiner, 1988, Kling, 1971, McDermott,
1979, and Winston, 1980). Researchers in related fields have also made important
contributions: cognitive science (Gentner, 1983), and psychology (Rumelhart and
Norman, 1981). Several of the researchers in AI have produced working programs
based on their models of the analogical process. We examine some of these models
in the following section. In the remainder of this section we investigate the analogical
reasoning process in some detail.
2. Access and recall: The similarity of the new problem to previously experienced
ones serves as an index with which to access and recall one or more candidate
experiences (analogues).
3. Selection and mapping: Relevant parts of the recalled experiences are selected
for their similarities and mapped from the base to the target domain.
4. Extending the mapped experience: The newly mapped analogues are modified
and extended to fit the target domain situation.
5. Validation and generalization: The newly formulated solution is validated for
its applicability through some form of trial process (such as theorem provers
or simulation). If the validation is supported, a generalized solution is formed
which accounts for both the old and the new situations.
Next, the newly created solution must be tested for its suitability. This test
can be as informal as a simulated solution trial in the target domain or as formal
as a deductive test of logical entailment.
Finally, having found an analogy and tested it successfully, the resultant episode
should be generalized if possible and then summarized, encoded, indexed and stored
for subsequent use in reasoning or learning.
A simple example will help to illustrate this process. Suppose we are given
the problem of determining the flow rate of a fluid from a simple Y junction of
pipes (Figure 20.2b). We are asked to determine the value of $Q_3$ given only knowledge
of the flow rates $Q_1$ and $Q_2$. A description of this unknown problem reminds us of
a similar known problem, that of finding the flow rate of electrical current in a
circuit junction. We recall the solution method to the electrical problem as being
that based on Kirchhoff's current flow law, namely, that the sum of the currents at
a junction is zero. We use this knowledge in an attempt to solve the hydraulic
flow problem using the same principles, that is, we map the electrical flow solution
to the hydraulic flow problem domain. This requires that corresponding objects,
attributes and relations be suitably mapped from the electrical to the hydraulic domain.
We then test the conjectured solution in some way.
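The mapping in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the correspondence table `ELECTRICAL_TO_HYDRAULIC` and all function names are hypothetical, and the sign convention (two inflows, one outflow) is assumed rather than taken from the figure.

```python
# Hypothetical correspondence between base (electrical) and target
# (hydraulic) domain terms; the entries are illustrative assumptions.
ELECTRICAL_TO_HYDRAULIC = {
    "current": "flow_rate",
    "wire": "pipe",
    "junction": "junction",
}

def map_term(term, mapping=ELECTRICAL_TO_HYDRAULIC):
    """Translate a base-domain term into the target domain if a mapping exists."""
    return mapping.get(term, term)

def junction_outflow(inflows):
    """Junction law stated once, abstractly: the outflow equals the sum of inflows."""
    return sum(inflows)

def solve_hydraulic_junction(q1, q2):
    """Conjectured target solution: reuse the mapped junction law for flow rates."""
    return junction_outflow([q1, q2])

def validate(q1, q2, q3, tol=1e-9):
    """Informal test of the conjecture: the net flow at the junction is zero."""
    return abs(q1 + q2 - q3) <= tol

q3 = solve_hydraulic_junction(2.0, 3.5)
print(map_term("current"), q3, validate(2.0, 3.5, q3))
```

The law itself is written only once; what the analogy contributes is the term mapping and the conjecture that the same relation holds in the target domain, which `validate` then checks.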
In the reminding process, we may alternatively be given a direct hint from a
teacher that the hydraulic flow problem is like the electrical flow problem. Otherwise,
we must infer this likeness in some way and proceed with the conjecture that they
are alike, justifying the likeness on the basis of the consistency of nature or some
other means.
Next, we examine some representative examples of analogical learning systems.
20.3 EXAMPLES OF ANALOGICAL LEARNING SYSTEMS
Winston's System
Patrick Winston (1980) developed programs that reason about relationships, motives,
and consequent actions that occur among people. Using relationships and actions of
actors in one story (such as Macbeth) the program was able to demonstrate that
analogous results occurred in different stories (such as Hamlet) when there were
similarities among the relationships and motives of the second group of characters.
The programs could also learn through the analogical reasoning process. For example,
when a teacher declared that voltage, current, and resistance relationships were
like those of water pressure, flow, and pipe resistance, the system was able to
learn basic results in electrical circuits and related laws such as Ohm's law (the
opposite of the learning problem described above).
The analogical mapping and learning process for this example is illustrated
in Figure 20.3. The items in the figure labeled as voltage-value-3, current-value-3,
and resistance-value-3 represent specific values of voltage, current, and resistance,
respectively.
The important features of Winston's system can be summarized as follows:
1. Knowledge representation: Winston's system used frame structures as part
of the Frame Representation Language (FRL) developed by Roberts and Goldstein
(1977). Slots within the frames were given special meanings, such as AKO, appears-in,
and the like. Individual frames were linked together in a network for easy access
to related items.
2. Recall of analogous situations: When presented with a current situation,
candidate analogues were retrieved from memory using a hierarchical indexing
scheme. This was accomplished by storing a situation's (frame) name in the slots
of all object frames that appeared in the situation. For example, in the Cinderella
story, the prince is one of the central parts. Therefore, prince would be used as a
node in a hierarchical tree structure with sublinks to the man and the person,
and the Cinderella label Cl would be stored in an appears-in slot of the frames
belonging to the prince, the man, and the person. These slots were always searched
as part of the recall reminding process when looking for candidate analogues.
3. Similarity matching: In selecting the best of the known situations during
the reminding process described above, a similarity matching score is computed
for each of the recalled candidates. A score is computed for all slot pairings between
two frames, and the pairing having the highest score is selected as the proper analogue.
The scoring takes into account the number of values that match between slots as
well as matching relationships found between like parts having causal relations as
noted in comment fields.
4. Mapping base-to-target situations: The base-to-target analogue mapping process
used in this system depends on the similarity of parts between base and target
domains and role links that can be established between the two. For example, when
both base and target situations share the same domain, such as water-flow in pipes,
parts between the two are easily matched for equality. A specific pipe, pipe1, is
matched with a general pipe, pipe#, specific water-flow, with general flow, and
so forth. The relationships from the general case are then mapped directly to the
specific case without change.
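The recall and similarity-matching features described above can be sketched as follows. This is a simplified illustration, not Winston's FRL implementation: frames are plain Python dictionaries, the names `with_ancestors`, `index_situation`, `recall`, and `match_score` are hypothetical, and the score counts only shared slot values (the causal-relation component of the scoring is omitted).

```python
# Frames as dicts mapping slot name -> set of values. All names are illustrative.
frames = {
    "prince": {"AKO": {"man"}, "appears-in": set()},
    "king": {"AKO": {"man"}, "appears-in": set()},
    "man": {"AKO": {"person"}, "appears-in": set()},
    "person": {"AKO": set(), "appears-in": set()},
}

def with_ancestors(obj, frames):
    """Return obj followed by every frame reachable through AKO links."""
    seen, stack = [], [obj]
    while stack:
        name = stack.pop()
        if name in frames and name not in seen:
            seen.append(name)
            stack.extend(sorted(frames[name]["AKO"]))
    return seen

def index_situation(label, obj, frames):
    """Store the situation label in the appears-in slot of obj and its ancestors."""
    for name in with_ancestors(obj, frames):
        frames[name]["appears-in"].add(label)

def recall(obj, frames):
    """Candidate analogues: labels found on obj or on any of its AKO ancestors."""
    found = set()
    for name in with_ancestors(obj, frames):
        found |= frames[name]["appears-in"]
    return found

def match_score(frame_a, frame_b):
    """Simplified similarity: number of values shared across common slots."""
    common_slots = frame_a.keys() & frame_b.keys()
    return sum(len(frame_a[s] & frame_b[s]) for s in common_slots)

index_situation("Cinderella", "prince", frames)
candidates = recall("king", frames)
score = match_score(frames["prince"], frames["king"])
print(candidates, score)
```

Because the label is stored on the more general frames as well, a new situation involving a king can still be reminded of the Cinderella story through the shared man and person frames.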
In cases where base and target are different domains, the mapping is more
difficult as parts, in general, will differ. Before mapping is attempted, links are
established between corresponding parts. For example, if the two domains are water-flow
and electricity-flow (it is known the two are alike), determining the electrical
1. A finite collection of consistent propositions (rules, facts, and the like) called
a theory (Th).
2. An analogical hint about the source and target analogues A and B respectively,
written $A \sim B$ (A is like B). Here $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_m\}$
are sets of arbitrary formulae or knowledge related to some problem situations.
3. A problem to be solved in the target domain (target problem) denoted as PT.
The output from the system is a set of new propositions or conjectures $\varphi(A)$
related to the set B that can be used to solve PT.
The above process can be summarized as follows:
$Th \cup \{\varphi(A)\} \models PT$.
This definition of useful analogical learning is summarized in Figure 20.4
where we have named the conditions described in the above definition as unknown,
consistent, common, and useful.
These notions are illustrated in terms of our earlier example for the hydraulics
or electrical flow problem (Figure 20.2). Thus, B here corresponds to the known
theory Th that, given a Y junction as in Figure 20.2a, $I_3 = I_1 + I_2$.
The problem PT is to find $Q_3$, the fluid flow rate in a similar Y junction of pipes
when $Q_1$ and $Q_2$ are known. The analogical hint, $A \sim B$, is that the hydraulics
flow problem is like the electrical current flow problem. The useful formula needed
to solve the problem is of course $Q_3 = Q_1 + Q_2$. A more complex example would,
of course, require more than a single formula.
Since many analogies may satisfy the theory and the analogical hint, it is
necessary to restrict the analogies considered to those which are useful. For this,
NLAG uses heuristics to select only those formulae which are likely to be useful.
One heuristic used by the system is based on the idea that relationships in
one domain should hold in other, similar domains (in two domains governed by
physical laws). In our example, this is the law related to zero flow rates into a
junction and the corresponding reusable zero-sum formula for general, physical
flows. Thus, only those uninstantiated abstract formulas found in the base analogue
would be permitted as conjectures in the target analogue solution. Furthermore,
this same heuristic requires that the selected formulae be atomic. That is, formulae of the
form $f(a_1, a_2, \ldots, a_k)$ are permitted, whereas formulae containing multiple
conjunctive or disjunctive terms such as
$f_1(a,b) \land f_2(b,c) \lor f_3(c,d,e)$
are not permitted. These formulae must also satisfy the usefulness condition (Figure
20.4).
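The atomic-formula restriction can be illustrated with a small filter. The nested-tuple encoding of formulae, the connective names, and the predicate `zero_sum` are assumptions made for this sketch; they are not NLAG's actual representation.

```python
# A formula is a tuple: (head, arg1, arg2, ...). Heads in CONNECTIVES mark
# compound formulae; this encoding is an illustrative assumption.
CONNECTIVES = {"and", "or", "not"}

def is_atomic(formula):
    """True for f(a1, ..., ak): a predicate applied directly to plain terms."""
    head, *args = formula
    return head not in CONNECTIVES and all(isinstance(a, str) for a in args)

candidates = [
    # Atomic: a zero-sum relation over three flows (hypothetical predicate).
    ("zero_sum", "q1", "q2", "q3"),
    # Compound: f1(a,b) and f2(b,c), or f3(c,d,e).
    ("or",
     ("and", ("f1", "a", "b"), ("f2", "b", "c")),
     ("f3", "c", "d", "e")),
]

permitted = [f for f in candidates if is_atomic(f)]
print(permitted)
```

Only the first candidate survives the filter, which mirrors the restriction that conjectures be single atomic formulae rather than conjunctions or disjunctions.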
A heuristic which helps to further prune the abstract solutions described above
uses the target domain problem PT to suggest a related query in the source domain.
The corresponding source domain query PS is then used in turn to select only
those formulae which are both abstractions (as selected above) and are also relevant.
For example, the target query PT = "Find the flowrate in the given pipe structure"
is used to find the analogous source domain query PS = "Find the current in the
given electrical structure." This query is then used to determine which facts are
used to solve PS. These facts are then used as a guide in selecting facts (formulae)
for the target domain to solve PT.
Other system heuristics require that formulae must all come from the same
domain, that more general abstractions be preferred over less general ones, and
that for a given abstraction, instances which require the fewest conjectures be chosen.
In general, these heuristics are all based on choosing only enough new information
about the target analogue to solve PT and then stopping.
Carbonell's Systems
Carbonell developed two analogical systems for problem solving, each based on a
different perception of the analogical process (1983 and 1986). The first system
was based on what he termed transformational analogy and the second on derivational
analogy. The major differences between the two methods lie in the amount of details
remembered and stored for situations such as problem solution traces and the methods
used in the base-to-target domain transformational process.
Both methods used a knowledge representation and memory indexing and
recall scheme similar to the memory organization packets or MOPs of Roger Schank
(Chapter 11). Both methods also essentially followed the five-step analogical learning
process outlined in the previous section. The main differences between the two
methods can be summarized as follows:
initial state, a goal state, and a sequence of actions (operators) which, when applied
to the initial and intermediate states, result in a transformation to the goal state.
When a new problem is encountered, it is matched against potentially relevant
known ones using a suitable similarity measurement. The partial match producing
the highest similarity measure is transformed to satisfy the requirements of the
new problem. In finding a solution to the new problem using the known mapped
solution, it is often necessary to disturb some states to find operators which reduce
the current-to-goal state differences. With this method, the focus is on the sequence
of actions in a given solution, and not on the derivation process itself.
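The retrieval step of this method can be sketched as below. The text does not say which similarity measurement was used, so the Jaccard overlap here is a stand-in chosen for illustration; the `Solved` record, the feature names, and the example problems are likewise hypothetical. The later transformation of the retrieved solution is not modeled.

```python
from dataclasses import dataclass

@dataclass
class Solved:
    """A remembered problem: initial state, goal state, and operator sequence."""
    initial: frozenset
    goal: frozenset
    operators: tuple

def jaccard(a, b):
    """Overlap of two feature sets, in [0, 1]; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def similarity(new_initial, new_goal, known):
    """Assumed measure: average overlap of initial states and of goal states."""
    return (jaccard(new_initial, known.initial) + jaccard(new_goal, known.goal)) / 2

def best_match(new_initial, new_goal, memory):
    """Return the stored problem with the highest similarity to the new one."""
    return max(memory, key=lambda k: similarity(new_initial, new_goal, k))

memory = [
    Solved(frozenset({"at_home", "has_car"}), frozenset({"at_office"}), ("drive",)),
    Solved(frozenset({"at_home", "has_bike"}), frozenset({"at_park"}), ("cycle",)),
]

new_initial = frozenset({"at_home", "has_car"})
new_goal = frozenset({"at_store"})
base = best_match(new_initial, new_goal, memory)
print(base.operators)
```

The retrieved operator sequence is only a starting point: in the transformational approach it would then be edited until it reaches the new goal state.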
One of the most active areas of learning research to emerge during the early 1980s
is explanation-based learning (also known as explanation-based generalization). This
is essentially a form of deductive generalization. It has been successfully demonstrated
for the general area of concept learning.
In EBL, four kinds of information must be known to the learner in advance.
Known:
Concept Definition:
    Pairs of objects <x,y> are SAFE-TO-STACK when
    SAFE-TO-STACK(x,y) ↔ ¬FRAGILE(y) ∨ LIGHTER(x,y).
Training Example:
    ON(obj1,obj2)
    ISA(obj1,box)
    ISA(obj2,endtable)
    COLOR(obj1,red)
    COLOR(obj2,blue)
    VOLUME(obj1,1)
    DENSITY(obj1,0.1)
Domain Theory:
    VOLUME(p1,v1) & DENSITY(p1,d1) → WEIGHT(p1,v1*d1)
    WEIGHT(p1,w1) & WEIGHT(p2,w2) & LESS(w1,w2) → LIGHTER(p1,p2)
    ISA(p1,endtable) → WEIGHT(p1,5) (i.e., a default)
    LESS(0.1,5)
Operationality Criteria:
    The learned concept should be expressed in terms of the same predicates used to describe
    the example, i.e., COLOR, DENSITY, VOLUME, etc., or simple predicates from the domain
    theory (e.g., LESS).
Determine:
    A generalization of the training example that is a sufficient goal concept definition which
    satisfies the operationality criteria.
Continuing with the explanation process, we find rules relating the weights
with volume and density and to an endtable. After instantiation of these terms, we
are led to the complete explanation tree structure given below in which the leaf
terms are seen to satisfy the operationality criteria.
SAFE-TO-STACK(obj1,obj2)
    LIGHTER(obj1,obj2)
        WEIGHT(obj1,0.1)
            VOLUME(obj1,1)
            DENSITY(obj1,0.1)
        LESS(0.1,5)
        WEIGHT(obj2,5)
            ISA(obj2,endtable)
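The explanation step for this example can be sketched in Python. This is a hand-coded illustration of the three domain rules rather than a general theorem prover; the tuple encoding of facts and the helper names `lookup`, `explain_weight`, and `explain_safe_to_stack` are assumptions.

```python
# Training example as (predicate, object, value) facts.
facts = {
    ("ON", "obj1", "obj2"), ("ISA", "obj1", "box"), ("ISA", "obj2", "endtable"),
    ("COLOR", "obj1", "red"), ("COLOR", "obj2", "blue"),
    ("VOLUME", "obj1", 1.0), ("DENSITY", "obj1", 0.1),
}

def lookup(pred, obj):
    """Return v for a fact pred(obj, v), or None if no such fact exists."""
    for p, o, v in facts:
        if p == pred and o == obj:
            return v
    return None

def explain_weight(obj):
    """Prove WEIGHT(obj, w); return (w, leaf facts used) or None."""
    v, d = lookup("VOLUME", obj), lookup("DENSITY", obj)
    if v is not None and d is not None:       # VOLUME & DENSITY -> WEIGHT
        return v * d, [("VOLUME", obj, v), ("DENSITY", obj, d)]
    if ("ISA", obj, "endtable") in facts:     # default weight of an endtable
        return 5, [("ISA", obj, "endtable")]
    return None

def explain_safe_to_stack(x, y):
    """Prove SAFE-TO-STACK(x, y) via LIGHTER(x, y); return the leaf facts used."""
    wx, wy = explain_weight(x), explain_weight(y)
    if wx and wy and wx[0] < wy[0]:           # LESS(w1, w2) -> LIGHTER
        return wx[1] + [("LESS", wx[0], wy[0])] + wy[1]
    return None

leaves = explain_safe_to_stack("obj1", "obj2")
print(leaves)
```

The returned leaves mention VOLUME, DENSITY, LESS, and ISA only. The ON and COLOR facts play no part in the proof, which is how the explanation separates relevant from irrelevant features of the example.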
The next step in developing the final target concept is to generalize the above
structure by regressing or back-propagating formulae through rules in the above
structure step by step, beginning with the top expression SAFE-TO-STACK.
Regressing SAFE-TO-STACK(x,y) through the goal rule (the FRAGILE disjunct
is omitted since it was not used), LIGHTER(p1,p2) → SAFE-TO-STACK(p1,p2),
yields the term LIGHTER(x,y) as a sufficient condition for inferring
SAFE-TO-STACK(x,y). In a similar way LIGHTER(x,y) is regressed back through the next
step to yield WEIGHT(x,w1) & WEIGHT(y,w2) & LESS(w1,w2). This expression
is then regressed through the final steps of the explanation structure to yield the
following generalized, operational definition of the safe-to-stack concept.
VOLUME(x,v1)
& DENSITY(x,d1)
& LESS(v1*d1,5)
& ISA(y,endtable) → SAFE-TO-STACK(x,y)
The complete regression process is summarized in Figure 20.6 where the under-
lined expressions are the results of the regression steps and the substitutions are as
noted within the braces.
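Once learned, the operational definition can be applied directly without re-deriving the explanation. The sketch below assumes a small dictionary knowledge base keyed by (predicate, object); the function name `safe_to_stack` and the extra objects (vase, crate, table2) are hypothetical and added only to show the rule applying beyond the original training example.

```python
def safe_to_stack(x, y, kb):
    """Learned rule: VOLUME(x,v1) & DENSITY(x,d1) & LESS(v1*d1,5) & ISA(y,endtable)."""
    v1 = kb.get(("VOLUME", x))
    d1 = kb.get(("DENSITY", x))
    return (
        v1 is not None
        and d1 is not None
        and v1 * d1 < 5                        # LESS(v1*d1, 5)
        and kb.get(("ISA", y)) == "endtable"   # ISA(y, endtable)
    )

kb = {
    # Original training example.
    ("VOLUME", "obj1"): 1.0, ("DENSITY", "obj1"): 0.1,
    ("ISA", "obj2"): "endtable", ("COLOR", "obj1"): "red",
    # Hypothetical new objects, not from the text.
    ("VOLUME", "vase"): 2.0, ("DENSITY", "vase"): 0.5,
    ("VOLUME", "crate"): 10.0, ("DENSITY", "crate"): 1.0,
    ("ISA", "table2"): "endtable",
}

print(safe_to_stack("obj1", "obj2", kb))
print(safe_to_stack("vase", "table2", kb))
print(safe_to_stack("crate", "table2", kb))
```

The rule never consults COLOR or ON, and it covers any light object on any endtable, which is the sense in which the learned definition is broader than the single training example.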
In summary, the process described above produces a justified generalization
of a single training example as a learned concept. It does this in a two step process.
The first step creates an explanation that contains only relevant predicates. The
second step uses the explanation structure to establish constraints on the predicate
values that are sufficient for the explanation to apply in general. This differs from
inductive learning in that a single training example is adequate to learn a valid
description of the concept. Of course there is a trade-off here. The EBL method
also requires appropriate domain knowledge as well as a definition of the target
concept.
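Figure 20.6 (regression steps):
Goal concept: SAFE-TO-STACK(x,y)   {x/p1, y/p2}
Rule: LIGHTER(p1,p2) → SAFE-TO-STACK(p1,p2)
    Regressed: LIGHTER(x,y)
Rule: WEIGHT(p1,w1) & LESS(w1,w2) & WEIGHT(p2,w2) → LIGHTER(p1,p2)
    Regressed: WEIGHT(x,w1) & LESS(w1,w2) & WEIGHT(y,w2)
Rules: VOLUME(p1,v1) & DENSITY(p1,d1) → WEIGHT(p1,v1*d1); ISA(p2,endtable) → WEIGHT(p2,5)
    Regressed: VOLUME(x,v1) & DENSITY(x,d1) & LESS(v1*d1,5) & ISA(y,endtable)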
There is an apparent paradox in the EBL method in that it may appear that
no actual learning takes place since the system must be given a definition of the
very concept to be learned! The answer to this dilemma is that a broader, generalized,
and more useable definition is being learned. With the EBL method, existing knowl-
edge is being transformed to a more useful form. And, the learned concept applies
to a broader class than the supplied definition. This newly learned concept is a
valid definition since it has been logically justified through the explanation process.
The same claim cannot be made of other nondeductive learning techniques like
inductive or analogical learning.
The notion of operationality used in EBL systems should depend on the purpose
of the learning system. As such, it should be treated as a dynamic property. For
example, a robot may need to learn the concept of a pencil in order to recognize
pencils. In this case, operationality should be interpreted in terms of the structural
properties of the pencil. On the other hand, if the purpose of learning the pencil
concept relates to design, the robot would be better served with a functional definition
of a pencil. Keller (1988) discusses the operationality problem and its application
in a program called MetaLEX (a successor to LEX described in Chapter 19).
20.5 SUMMARY
We have described two of the more promising approaches to machine learning, the
analogical and explanation-based learning paradigms. These methods, unlike similar-
ity-based methods, are capable of creating new knowledge from a single training
example. Both methods offer great potential as autonomous learning methods.
Learning by analogy requires that similar, known experiences be available
for use in explaining or solving newly encountered experiences. The complete process
can be described in five steps: (1) a newly encountered situation serves as a reminder
of a known situation, (2) the most relevant of the reminded situations are accessed
and recalled, (3) the appropriate parts of the recalled analogues are mapped from
the base domain to the target domain, (4) the mapped situation or solution is extended
to fit the current problem, and (5) the extended solution is tested, generalized, and
stored for subsequent recall.
A number of analogical research learning systems have been developed. Four
representative systems have been described in this chapter.
Explanation-based learning is a form of deductive learning where the learner
develops a logical explanation of how a positive training example defines a concept.
The explanation is developed using the example, the concept definition, and relevant
domain theory. A key aspect of the explanation is that it satisfy some operational
criteria, possibly through the use of only attributes and predicates that are used in
the domain theory and/or the example.
The EBL method is a two step process. In the first step, an explanation of
the concept is formulated using the training example and domain theory. In the
second, this explanation is generalized by regressing formulae step-by-step back
EXERCISES
20.1. Describe two examples of analogical learning you have experienced recently.
20.2. Why is it that things similar in some ways tend to be similar in other ways?
20.3. Consult a good dictionary and determine the differences between the definitions of
analogies, metaphors, and similes.
20.4. Make up three new analogies like the examples given in Section 20.2.
20.5. Relate each of the five steps followed in analogical learning to the following example:
Riding a motorcycle is like riding a bicycle with an engine in it.
20.6. Compare the analogical system of Winston to that of Greiner. In what ways do they
differ?
20.7. What appears to be more important in mapping from base to target domain, object
attributes, object relationships, or both? Give examples to support your conclusions.
20.8. What are the main differences between Carbonell's transformational and derivational
systems?
20.9. Define operationality as it applies to explanation-based learning and give an example
of it as applied to some task.
20.10. Explain why each of the four kinds of information (concept definition, positive training
example, domain theory, and operational criteria) is needed in EBL.
20.11. If the end table in the Safe-to-Stack example had not been given a default weight
value, what additional steps would be required in the explanation to complete the
tree structure?
20.12. What is the purpose of the regression process in EBL?
20.13. Work out a complete explanation for the concept safe to cross the street. This requires
domain theory about traffic lights and traffic, a positive example of a safe crossing,
and operational criteria.
20.14. The learning methods described in Part V have mostly been single paradigm methods,
yet we undoubtedly use combined learning for much of our knowledge learning.
Describe how analogical learning could be combined with EBL as well as inductive
learning to provide a more comprehensive form of learning.