AI Unit V
APPLICATIONS
AI applications – Language Models – Information Retrieval-
Information Extraction – Natural Language Processing -
Machine Translation – Speech Recognition – Robot –
Hardware – Perception – Planning – Moving
LANGUAGE MODELS
• Language models: models that predict the probability distribution
of language expressions.
• Languages are specified by a set of rules called a grammar
• Natural languages cannot be characterized as a definitive set of
sentences
• Natural languages are ambiguous
• Natural languages are difficult to deal with because they are very
large, and constantly changing. Thus, our language models are, at
best, an approximation.
N-gram character models
• In English, written text is composed of characters: letters, digits,
punctuation, and spaces.
• One of the simplest language models is a probability distribution
over sequences of characters.
• A sequence of written symbols of length n is called an n-gram with
special case “unigram” for 1-gram, “bigram” for 2-gram, and
“trigram” for 3-gram.
• A model of the probability distribution of n-letter sequences is thus
called an n-gram model.
• An n-gram model is defined as a Markov chain of order n − 1: the
probability of character ci depends only on the immediately
preceding n − 1 characters, not on any other characters:
P(ci | c1:i−1) = P(ci | ci−n+1:i−1)
• In a trigram model (Markov chain of order 2) we have P(ci | ci−2:i−1).
The trigram, bigram, and unigram estimates can be combined by
linear interpolation:
P̂(ci | ci−2:i−1) = λ3 P(ci | ci−2:i−1) + λ2 P(ci | ci−1) + λ1 P(ci),
• where λ3 + λ2 + λ1 = 1.
• The parameter values λi can be fixed, or they can be trained with an
expectation–maximization algorithm.
• It is also possible to have the values of λi depend on the counts: if
we have a high count of trigrams, then we weigh them relatively
more; if we have only a low count, then we put more weight on the
bigram and unigram models.
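• As a concrete illustration of the counts and interpolation weights
described above, here is a minimal sketch in Python. The toy corpus and
the fixed λ values are illustrative assumptions, not values from the text.

    # Character-trigram model with linear-interpolation smoothing (sketch).
    from collections import Counter

    def ngram_counts(text, n):
        # Count every length-n character sequence in the text.
        return Counter(text[i:i+n] for i in range(len(text) - n + 1))

    text = "the cat sat on the mat"          # toy corpus (assumption)
    uni, bi, tri = (ngram_counts(text, n) for n in (1, 2, 3))
    total = sum(uni.values())
    lam3, lam2, lam1 = 0.6, 0.3, 0.1         # lambda3 + lambda2 + lambda1 = 1

    def p_interp(c, prev2):
        # Interpolated P(c | prev2): weighted trigram, bigram, unigram estimates.
        p3 = tri[prev2 + c] / bi[prev2] if bi[prev2] else 0.0
        p2 = bi[prev2[1] + c] / uni[prev2[1]] if uni[prev2[1]] else 0.0
        p1 = uni[c] / total
        return lam3 * p3 + lam2 * p2 + lam1 * p1

    print(p_interp("t", " a"))   # unseen trigram: bigram and unigram carry the weight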
Model evaluation
• With so many possible n-gram models (unigram, bigram, trigram, etc.),
how do we know which model to choose?
• We can evaluate a model with cross-validation.
• Split the corpus into a training corpus and a validation corpus.
Determine the parameters of the model from the training data.
• Then evaluate the model on the validation corpus.
• The evaluation can be a task-specific metric, such as measuring
accuracy on language identification.
• Alternatively, we can have a task-independent measure of language
quality: compute the probability the model assigns to a held-out
corpus; the higher the probability, the better the model. This is
inconvenient because the probability of a large corpus will be a
very small number, and floating-point underflow becomes an issue.
• A different way of describing the probability of a sequence is with a
measure called perplexity, defined as
Perplexity(c1:N) = P(c1:N)^(−1/N)
• Perplexity can be thought of as the reciprocal of probability,
normalized by sequence length: the lower the perplexity, the better
the model.
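• A minimal sketch of computing perplexity, assuming the model is given
as a conditional-probability function. Summing log probabilities avoids
the underflow problem mentioned above; the uniform 26-letter model is
an illustrative assumption.

    import math

    def perplexity(seq, prob):
        # prob(c, history) returns P(c | history); perplexity is P(seq)^(-1/N).
        log_p = sum(math.log(prob(c, seq[:i])) for i, c in enumerate(seq))
        return math.exp(-log_p / len(seq))

    # A uniform model over 26 letters has perplexity 26 on any sequence:
    print(perplexity("hello", lambda c, h: 1 / 26))   # -> 26.0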
• This stage is the first one in the cascade where the output is placed
into a database template as well as being placed in the output stream.
Structure merging:
• The final stage merges structures that were built up in the previous
step.
• If the next sentence says “The joint venture will start production in
January,” then this step will notice that there are two references to a
joint venture, and that they should be merged into one.
• In general, finite-state template-based information extraction works
well for a restricted domain in which it is possible to predetermine
what subjects will be discussed, and how they will be mentioned.
• The cascaded transducer model helps modularize the necessary
knowledge, easing construction of the system.
• These systems work especially well when they are reverse-
engineering text that has been generated by a program.
• For example, a shopping site on the Web is generated by a program
that takes database entries and formats them into Web pages;
• A template-based extractor then recovers the original database.
• Finite-state information extraction is less successful at recovering
information in highly variable format, such as text written by
humans on a variety of subjects.
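• To make the reverse-engineering idea concrete, here is a minimal
sketch; the page snippet and the regular expression below are invented
for illustration, not taken from any real site.

    import re

    page = "<li><b>ThinkBook 970</b>. Our price: $399.00</li>"
    template = re.compile(
        r"<b>(?P<product>[^<]+)</b>\.\s*Our price:\s*\$(?P<price>[0-9]+(?:\.[0-9][0-9])?)")

    m = template.search(page)
    if m:
        # Recover a (product, price) database entry from the generated page.
        print(m.group("product"), m.group("price"))   # ThinkBook 970 399.00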
• Feature f2, for example, is true if the current state is SPEAKER and the next word is “said.”
• Note that both f1 and f2 can hold at the same time for a sentence like “Andrew said
. . . .”.
• In this case, the two features overlap and both boost the belief in x1 =
SPEAKER.
• Because of its independence assumption, an HMM cannot use overlapping
features, but a CRF can.
• Furthermore, a feature in a CRF can use any part of the sequence e1:N . Features
can also be defined over transitions between states.
• The features we defined here were binary, but in general, a feature function can be
any real-valued function.
• For domains where we have some knowledge about the types of features we would
like to include, the CRF formalism gives us a great deal of flexibility in defining
them.
• This flexibility can lead to accuracies that are higher than with less flexible models
such as HMMs.
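• Here is a minimal sketch of binary feature functions like f1 and f2. f2
follows the definition given above; f1's definition is not shown in this
excerpt, so the version below (current word is "Andrew") is only a
plausible guess. A real linear-chain CRF would learn a weight for each
feature.

    def f1(state, prev_state, words, i):
        # 1 if the current state is SPEAKER and the current word is "Andrew" (assumed form).
        return int(state == "SPEAKER" and words[i] == "Andrew")

    def f2(state, prev_state, words, i):
        # 1 if the current state is SPEAKER and the next word is "said".
        return int(state == "SPEAKER" and i + 1 < len(words) and words[i + 1] == "said")

    words = ["Andrew", "said", "hello"]
    # Both features fire at position 0: overlapping evidence a CRF can use.
    print(f1("SPEAKER", None, words, 0), f2("SPEAKER", None, words, 0))   # 1 1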
4. Ontology extraction from large corpora
• A different application of extraction technology is building a large
knowledge base or ontology of facts from a corpus.
• This task differs in three ways:
• First, it is open-ended: we want to acquire facts about all types of
domains, not just one specific domain.
• Second, with a large corpus, the task is dominated by precision, not
recall, just as with question answering on the Web.
• Third, the results can be statistical aggregates gathered from
multiple sources, rather than being extracted from one specific text.
• Here is one of the most productive templates:
    NP such as NP (, NP)* (,)? ((and | or) NP)?
• Here the fixed words ("such as," "and," "or") and the commas must
appear literally in the text, but the parentheses are for grouping, the
asterisk means repetition of zero or more, and the question mark means
optional.
• NP is a variable standing for a noun phrase;
• This template matches the texts “diseases such as rabies affect your
dog” and “supports network protocols such as DNS,” concluding
that rabies is a disease and DNS is a network protocol.
• Similar templates can be constructed with the key words
“including,” “especially,” and “or other.” Of course these templates
will fail to match many relevant passages, like “Rabies is a disease.”
That is intentional.
• The “NP is a NP” template does indeed sometimes denote a
subcategory relation, but it often means something else, as in “There
is a God” or “She is a little tired.”
• With a large corpus we can afford to be picky, and use only the high-
precision templates.
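• A minimal sketch of matching the "such as" template with a regular
expression. Real systems use a chunker to find full noun phrases and
handle the (, NP)* list part; the single-word stand-in for NP below is a
deliberate simplification.

    import re

    pattern = re.compile(r"(\w+) such as (\w+)")

    for text in ["diseases such as rabies affect your dog",
                 "supports network protocols such as DNS"]:
        m = pattern.search(text)
        if m:
            # Conclude that the second NP is an instance of the first.
            print(f"{m.group(2)} IsA {m.group(1)}")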
5. Automated template construction
• The subcategory relation is so fundamental that it is worthwhile to
handcraft a few templates to help identify instances of it occurring
in natural language text.
• But what about the thousands of other relations in the world?
• Fortunately, it is possible to learn templates from a few examples,
then use the templates to learn more examples, from which more
templates can be learned, and so on.
• In one of the first experiments of this kind, Brin (1999) started with
a data set of just five example (author, title) pairs.
• These were clearly examples of the author–title relation, but the
learning system had no knowledge of authors or titles.
• The words in these examples were used in a search over a Web
corpus, resulting in 199 matches.
• Each match is defined as a tuple of seven strings,
(Author, Title, Order, Prefix, Middle, Postfix, URL)
• where Order is true if the author came first and false if the title came
first.
• Middle is the characters between the author and title, Prefix is the
10 characters before the match, Postfix is the 10 characters after the
match, and URL is the Web address where the match was made.
• Given a set of matches, a simple template-generation scheme can
find templates to explain the matches.
• The language of templates was designed to have a close mapping to
the matches themselves, to be amenable to automated learning, and
to emphasize high precision.
• Each template has the same seven components as a match.
• The Author and Title are regexes consisting of any characters and
constrained to have a length from half the minimum length of the
examples to twice the maximum length.
• The prefix, middle, and postfix are restricted to literal strings, not
regexes.
• The middle is the easiest to learn: each distinct middle string in the
set of matches is a distinct candidate template.
• For each such candidate, the template’s Prefix is then defined as the
longest common suffix of all the prefixes in the matches, and the
Postfix is defined as the longest common prefix of all the postfixes
in the matches.
• If either of these is of length zero, then the template is rejected. The
URL of the template is defined as the longest prefix of the URLs in
the matches.
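• The scheme above can be sketched in a few lines: group matches by their
Middle string, then take the longest common suffix of the Prefixes and
the longest common prefix of the Postfixes. The match tuples below are
invented for illustration.

    import os

    def common_prefix(strings):
        return os.path.commonprefix(strings)

    def common_suffix(strings):
        return common_prefix([s[::-1] for s in strings])[::-1]

    # (Author, Title, Order, Prefix, Middle, Postfix, URL) matches sharing one Middle:
    matches = [
        ("A. Author", "First Title", True, "<li><i>", ", ", "</i> (", "example.com/books/1"),
        ("B. Writer", "Other Title", True, "<ul><i>", ", ", "</i>,", "example.com/books/2"),
    ]

    prefix = common_suffix([m[3] for m in matches])    # "><i>"
    postfix = common_prefix([m[5] for m in matches])   # "</i>"
    url = common_prefix([m[6] for m in matches])       # "example.com/books/"
    if prefix and postfix:                             # reject if either has length zero
        print((prefix, matches[0][4], postfix, url))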
• The disadvantage of this approach is its sensitivity to noise.
• If one of the first few templates is incorrect, errors can propagate
quickly.
• One way to limit this problem is to not accept a new example unless
it is verified by multiple templates, and not accept a new template
unless it discovers multiple examples that are also found by other
templates.
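• The verification rule can be sketched directly; the template and
example labels below are invented for illustration.

    from collections import defaultdict

    found_by = defaultdict(set)   # example -> set of templates that extracted it
    extractions = [("T1", "example-A"), ("T2", "example-A"), ("T1", "example-B")]
    for template, example in extractions:
        found_by[example].add(template)

    # Accept only examples verified by at least two distinct templates:
    accepted = [e for e, ts in found_by.items() if len(ts) >= 2]
    print(accepted)   # -> ['example-A']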
6. Machine reading
• To build a large ontology with many thousands of relations, even that
amount of human work per relation would be onerous.
• We would like to have an extraction system that requires no human
input of any kind: a system that could read on its own and build up its
own database.
• Such a system would be relation-independent: it would work for any
relation.
• In practice, these systems work on all relations in parallel, because
of the I/O demands of large corpora.
• Because they behave less like a traditional information extraction
system targeted at a few relations, and more like a human reader who
learns from the text itself, the field has been called machine reading.
• A representative machine-reading system is TEXTRUNNER (Banko
and Etzioni, 2008).
• TEXTRUNNER uses cotraining to boost its performance, but it
needs something to bootstrap from.
• For TEXTRUNNER, the original inspiration was a taxonomy of
eight very general syntactic templates.
• It was felt that a small number of templates like this could cover
most of the ways that relationships are expressed in English.
• The actual bootstrapping starts from a set of labeled examples that
are extracted from the Penn Treebank, a corpus of parsed sentences.
• For example, from the parse of the sentence “Einstein received the
Nobel Prize in 1921,” TEXTRUNNER is able to extract the relation
(“Einstein,” “received,” “Nobel Prize”).
• Given a set of labeled examples of this type, TEXTRUNNER trains
a linear-chain CRF to extract further examples from unlabeled text.
• The features in the CRF include function words like “to” and “of”
and “the,” but not nouns and verbs (and not noun phrases or verb
phrases).
• Because TEXTRUNNER is domain-independent, it cannot rely on
predefined lists of nouns and verbs.
• TEXTRUNNER achieves a precision of 88% and recall of 45% (F1
of 60%) on a large Web corpus.
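• As a check, F1 is the harmonic mean of precision and recall:
F1 = 2PR / (P + R) = (2 × 0.88 × 0.45) / (0.88 + 0.45) ≈ 0.60.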
• TEXTRUNNER has extracted hundreds of millions of facts from a
corpus of a half-billion Web pages.
• For example, even though it has no predefined medical knowledge,
it has extracted over 2000 answers to the query [what kills bacteria];
• Correct answers include antibiotics, ozone, chlorine, Cipro, and
broccoli sprouts. Questionable answers include “water,” which came
from the sentence “Boiling water for at least 10 minutes will kill
bacteria.”
• It would be better to attribute this to “boiling water” rather than just
“water.”
• With the techniques outlined in this chapter and continual new
inventions, we are starting to get closer to the goal of machine
reading.
Machine Translation
• Machine translation is the automatic translation of text from one
natural language (the source) to another (the target).
• There are three main applications of machine translation:
• Rough translation - Provided by free online services, gives the “gist”
of a foreign sentence or document, but contains errors.
• Pre-edited translation - Used by companies to publish their
documentation and sales materials in multiple languages.
• The original source text is written in a constrained language that is
easier to translate automatically, and the results are usually edited by
a human to correct any errors.
• Restricted-source translation - Works fully automatically, but only
on highly stereotypical language, such as a weather report.
• Translation is difficult because, in the fully general case, it requires
in-depth understanding of the text.
• Consider the word "Open" on the door of a store and the same word
"Open" on a large banner outside a newly constructed store.
• The two signs use the identical word to convey different meanings: on
the door it means the store is accepting customers at the moment, while
on the banner it means the store has recently opened for business, and
a target language may use different words for the two senses.
• The problem is that different languages categorize the world
differently. For example, the French word “doux” covers a wide
range of meanings corresponding approximately to the English
words “soft,” “sweet,” and “gentle.”
• Representing the meaning of a sentence is more difficult for
translation than it is for single-language understanding.
• A representation language that makes all the distinctions necessary
for a set of languages is called an interlingua.
• A translator (human or machine) often needs to understand the actual
situation described in the source text, not just the individual words.
• For example, to translate the English word "him" into Korean, a choice
must be made between the humble and honorific form, a choice that
depends on the social relationship between the speaker and the referent of
"him."
• Translators (both machine and human) sometimes find it difficult to make
this choice.
• To get the translation right, one must understand physics as well as
language.
Machine translation systems:
• All translation systems must model the source and target languages, but
systems vary in the type of models they use.
• Some systems attempt to analyze the source language text all the way into
an interlingua knowledge representation and then generate sentences in the
target language from that representation.
• This is difficult because it involves three unsolved problems:
1. Creating a complete knowledge representation of everything.
2. Parsing into that representation.
3. Generating sentences from that representation.
• Other systems are based on a transfer model. They keep a database
of translation rules, and whenever a rule matches, they translate
directly.
• Transfer can occur at the lexical, syntactic, or semantic level.
• For example, a strictly syntactic rule maps English [Adjective Noun]
to French [Noun Adjective].
• A mixed syntactic and lexical rule maps French [S1 “et puis” S2] to
English [S1 “and then” S2]. Figure 23.12 diagrams the various
transfer points.
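• A minimal sketch of such a transfer rule at the syntactic level, with a
tiny invented lexical dictionary for the word-level step:

    en_to_fr = {"red": "rouge", "car": "voiture"}   # toy lexicon (assumption)

    def transfer_adj_noun(adjective, noun):
        # English [Adjective Noun] -> French [Noun Adjective], then transfer each word.
        return f"{en_to_fr[noun]} {en_to_fr[adjective]}"

    print(transfer_adj_noun("red", "car"))   # -> "voiture rouge"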