
UNIT V

APPLICATIONS
AI applications – Language Models – Information Retrieval – Information Extraction – Natural Language Processing – Machine Translation – Speech Recognition – Robot – Hardware – Perception – Planning – Moving
LANGUAGE MODELS
• Language models: models that predict the probability distribution
of language expressions.
• Formal languages, such as programming languages, are specified by a set of rules called a grammar.
• Natural languages cannot be characterized as a definitive set of sentences, and they are ambiguous.
• Natural languages are difficult to deal with because they are very
large, and constantly changing. Thus, our language models are, at
best, an approximation.
N-gram character models
• In English a written text is composed of characters-letters, digits,
punctuation, and spaces
• One of the simplest language models is a probability distribution
over sequences of characters.
• A sequence of written symbols of length n is called an n-gram with
special case “unigram” for 1-gram, “bigram” for 2-gram, and
“trigram” for 3-gram.
• A model of the probability distribution of n-letter sequences is thus called an n-gram model.
• An n-gram model is defined as a Markov chain of order n − 1: in such a chain, the probability of character ci depends only on the immediately preceding n − 1 characters, not on any other characters.
• In a trigram model (a Markov chain of order 2) we have:
  P(ci | c1:i−1) = P(ci | ci−2:i−1)
• We can define the probability of a sequence of characters P(c1:N) under the trigram model by first factoring with the chain rule and then using the Markov assumption:
  P(c1:N) = Πi=1..N P(ci | c1:i−1) = Πi=1..N P(ci | ci−2:i−1)
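• As an illustration (not part of the original slides), here is a minimal Python sketch of a character trigram model estimated from raw counts; the corpus string and function names are assumptions chosen for the example.

from collections import Counter

def trigram_model(text):
    # Estimate P(ci | ci-2:i-1) from raw trigram and context counts.
    tri = Counter(text[i:i+3] for i in range(len(text) - 2))
    bi = Counter(text[i:i+2] for i in range(len(text) - 2))
    return {t: tri[t] / bi[t[:2]] for t in tri}

def sequence_probability(text, model):
    # P(c1:N) by the chain rule with the Markov assumption; the first two
    # characters are treated as given context.
    p = 1.0
    for i in range(2, len(text)):
        p *= model.get(text[i-2:i+1], 0.0)   # zero for unseen trigrams (see smoothing below)
    return p

corpus = "the cat sat on the mat. the dog sat on the log."
model = trigram_model(corpus)
print(sequence_probability("the cat", model))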

• Language identification: given a text, determine what natural language it is written in. This is a relatively easy task.
• Even with short texts such as “Hello, world” or “Wie geht es dir,” it is easy to identify the first as English and the second as German.
• Computer systems identify languages with greater than 99%
accuracy; occasionally, closely related languages, such as Swedish
and Norwegian, are confused
• One approach to language identification is to first build a trigram
character model of each candidate language.
• The model is P(ci | ci−2:i−1, l), where the variable l ranges over languages.
• For each language l the model is built by counting trigrams in a corpus of that language; this gives us a model of P(Text | Language).
• But we want to select the most probable language given the text, so we apply Bayes’ rule followed by the Markov assumption to get the most probable language:
  l* = argmaxl P(l | c1:N) = argmaxl P(l) P(c1:N | l) = argmaxl P(l) Πi P(ci | ci−2:i−1, l)
• The trigram model can be learned from a corpus, but what about the prior probability P(l)?
• We may have some estimate of these values: for example, if we are
selecting a random Web page we know that English is the most
likely language and that the probability of Macedonian will be less
than 1%.
• The exact number we select for these priors is not critical because
the trigram model usually selects one language that is several orders
of magnitude more probable than any other.
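• The following sketch (an illustration, not from the slides) applies the argmax rule above, with one smoothed trigram model per language plus a prior; the tiny corpora and the add-alpha smoothing constant are assumptions.

import math
from collections import Counter

def char_trigram_logprob(corpus, alpha=1.0):
    # Returns a function giving smoothed log P(ci | ci-2:i-1) for one language.
    tri = Counter(corpus[i:i+3] for i in range(len(corpus) - 2))
    bi = Counter(corpus[i:i+2] for i in range(len(corpus) - 2))
    vocab = len(set(corpus)) or 1
    return lambda t: math.log((tri[t] + alpha) / (bi[t[:2]] + alpha * vocab))

def identify(text, corpora, priors):
    # argmax over languages l of log P(l) + sum_i log P(ci | ci-2:i-1, l)
    best, best_score = None, float("-inf")
    for lang, corpus in corpora.items():
        logp = char_trigram_logprob(corpus)
        score = math.log(priors[lang]) + sum(
            logp(text[i:i+3]) for i in range(len(text) - 2))
        if score > best_score:
            best, best_score = lang, score
    return best

corpora = {"en": "hello world this is english text ",
           "de": "wie geht es dir das ist deutscher text "}
priors = {"en": 0.5, "de": 0.5}
print(identify("hello, how are you", corpora, priors))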
• Other tasks for character models include spelling correction, genre
classification, and named-entity recognition.
• Genre classification means deciding if a text is a news story, a legal
document, a scientific article, etc.
• Named-entity recognition is the task of finding names of things in a
document and deciding what class they belong to.
Smoothing n-gram models
• Our language models can be adjusted so that sequences that have a
count of zero in the training corpus will be assigned a small nonzero
probability
• Other counts will be adjusted downward slightly so that the
probability still sums to 1.
• The process of adjusting the probability of low-frequency counts is
called smoothing.
• The simplest type of smoothing was suggested by Pierre-Simon
Laplace in the 18th century.
• It states that if a random Boolean variable X has been false in all n
observations so far then the estimate for P(X = true) should be 1/(n+
2).
• Laplace smoothing (also called add-one smoothing) performs relatively poorly.
• A better approach is a backoff model, in which we start by estimating n-gram counts, but for any particular sequence that has a low (or zero) count, we back off to (n − 1)-grams.
• Linear interpolation smoothing is a backoff model that combines trigram, bigram, and unigram models by linear interpolation, defining
  P̂(ci | ci−2:i−1) = λ3 P(ci | ci−2:i−1) + λ2 P(ci | ci−1) + λ1 P(ci)
• where λ3 + λ2 + λ1 = 1.
• The parameter values λi can be fixed, or they can be trained with an
expectation–maximization algorithm.
• It is also possible to have the values of λi depend on the counts: if
we have a high count of trigrams, then we weigh them relatively
more;
• If only a low count, then we put more weight on the bigram and
unigram models.
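• A small sketch (not from the slides) contrasting add-one (Laplace) smoothing with fixed-weight linear interpolation over trigram, bigram, and unigram estimates; the text and the λ values are assumptions.

from collections import Counter

text = "the cat sat on the mat"
uni = Counter(text)
bi  = Counter(text[i:i+2] for i in range(len(text) - 1))
tri = Counter(text[i:i+3] for i in range(len(text) - 2))
V = len(set(text))                              # character vocabulary size

def p_laplace(c, context):
    # Add-one smoothed trigram estimate P(c | context).
    return (tri[context + c] + 1) / (bi[context] + V)

def p_interpolated(c, context, lambdas=(0.7, 0.2, 0.1)):
    # Fixed-weight interpolation; the lambdas must sum to 1.
    l3, l2, l1 = lambdas
    p3 = tri[context + c] / bi[context] if bi[context] else 0.0
    p2 = bi[context[-1] + c] / uni[context[-1]] if uni[context[-1]] else 0.0
    p1 = uni[c] / len(text)
    return l3 * p3 + l2 * p2 + l1 * p1

print(p_laplace("t", "ca"), p_interpolated("t", "ca"))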
Model evaluation
• With so many possible n-gram models (unigram, bigram, trigram, etc.), how do we know which model to choose?
• We can evaluate a model with cross-validation.
• Split the corpus into a training corpus and a validation corpus.
Determine the parameters of the model from the training data.
• Then evaluate the model on the validation corpus.
• The evaluation can be a task-specific metric, such as measuring
accuracy on language identification.
• Alternatively, we can have a task-independent measure of language quality: calculate the probability that the model assigns to the validation corpus; the higher the probability, the better.
• This metric is inconvenient because the probability of a large corpus is a very small number, and floating-point underflow becomes an issue.
• A different way of describing the probability of a sequence is with a measure called perplexity, defined as
  Perplexity(c1:N) = P(c1:N)^(−1/N)
• Perplexity can be thought of as the reciprocal of probability, normalized by sequence length.
• It can also be thought of as the weighted average branching factor of a model.
• Suppose there are 100 characters in our language, and our model
says they are all equally likely.
• Then for a sequence of any length, the perplexity will be 100.
• If some characters are more likely than others, and the model
reflects that, then the model will have a perplexity less than 100.
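• A minimal sketch (not from the slides) of the perplexity formula, computed in log space so that small probabilities do not underflow; the probability lists are made-up examples.

import math

def perplexity(char_probs):
    # Perplexity(c1:N) = P(c1:N)^(-1/N), computed via logs to avoid underflow.
    n = len(char_probs)
    log_p = sum(math.log(p) for p in char_probs)
    return math.exp(-log_p / n)

print(perplexity([1 / 100] * 50))           # 100 equally likely characters -> 100.0
print(perplexity([0.5, 0.5, 0.25, 0.125]))  # a sharper model -> much lower perplexity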
Information retrieval
• Task of finding documents that are relevant to a user’s need for
information.
• Example: search engines on the World Wide Web.
• A Web user can type a query into a search engine and see a list of
relevant pages.
• An IR system can be characterized by:
 A corpus of documents
 Queries posed in a query language
 A result set
 A presentation of the result set
A corpus of documents:
• Each system must decide what unit of text to treat as a document: a paragraph, a page, or a multipage text.
Queries posed in a query language:
• Specifies what the user wants to know.
• The query language can be just a list of words or it can specify a
phrase of words.
• It can contain Boolean operators and also non-Boolean operators.
A result set :
• The subset of documents that the IR system judges to be relevant to
the query.
A presentation of the result set:
• A ranked list of document titles or as complex as a rotating color
map of the result set.
• The earliest IR systems worked on a Boolean keyword model.
• Each word in the document collection is treated as a Boolean feature that
is true of a document if the word occurs in the document and false if it
does not.
Advantage:
Simple to explain and implement
Disadvantage:
i) The degree of relevance of a document is a single bit
ii) Boolean expressions are unfamiliar to users who are not programmers or logicians.
IR scoring functions:
• A scoring function takes a document and a query and returns a
numeric score.
• The most relevant documents have the highest scores.
• In the BM25 function, the score is a linear weighted combination of
scores for each of the words that make up the query.
• Three factors affect the weight of a query term:
i) The frequency with which a query term appears in a document (TF)
ii) The inverse document frequency of the term (IDF).
iii) The length of the document.
• The BM25 function takes all three of these into account:
  BM25(dj, q1:N) = Σi=1..N IDF(qi) · TF(qi, dj) · (k + 1) / (TF(qi, dj) + k · (1 − b + b · |dj| / L))
• TF(qi, dj) is the count of the number of times word qi appears in document dj; k and b are tuned parameters, typically k = 2.0 and b = 0.75.
• |dj| is the length of document dj in words.
• L is the average document length in the corpus: L = Σi |di| / N, where N is the number of documents.
• IDF(qi) is the inverse document frequency of word qi, given by:
  IDF(qi) = log( (N − DF(qi) + 0.5) / (DF(qi) + 0.5) )
  where DF(qi) is the number of documents that contain word qi.
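• The sketch below (an illustration, not from the slides) implements the BM25 formula as reconstructed above; the toy documents and the default k and b values are assumptions.

import math

def bm25_score(query_words, doc_words, docs, k=2.0, b=0.75):
    # Sum over query terms of IDF * saturated, length-normalized term frequency.
    N = len(docs)
    L = sum(len(d) for d in docs) / N                  # average document length
    score = 0.0
    for q in query_words:
        tf = doc_words.count(q)                        # term frequency in this document
        df = sum(1 for d in docs if q in d)            # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5))
        denom = tf + k * (1 - b + b * len(doc_words) / L)
        score += idf * tf * (k + 1) / denom
    return score

docs = [["ibm", "announces", "new", "server"],
        ["apple", "releases", "phone"],
        ["weather", "report", "tuesday"]]
print(bm25_score(["ibm", "server"], docs[0], docs))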
IR system evaluation
• There have been two measures used in the scoring: recall and
precision
• Imagine that an IR system has returned a result set for a single
query, for which we know which documents are and are not
relevant, out of a corpus of 100 documents.
• The document counts in each category are given in the following table (out of the 100-document corpus):

                  In result set    Not in result set
  Relevant              30                 20
  Not relevant          10                 40
• Precision measures the proportion of documents in the result set that are actually relevant. In our example, the precision is 30/(30 + 10) = 0.75.
• Recall measures the proportion of all the relevant documents in the collection that are in the result set. In our example, recall is 30/(30 + 20) = 0.60.
• High recall, low precision: Most of the positive examples are
correctly recognized but there are a lot of false positives.
• Low recall, high precision: We miss a lot of positive examples but
those we predict as positive are indeed positive
• We should know the following terms before calculating the
Precision and recall:
1. Positive (P) : Observation is positive (for example: is an apple).
2. Negative (N) : Observation is not positive (for example: is not an apple).
3. True Positive (TP) : Observation is positive, and is predicted to be
positive.
4. False Negative (FN) : Observation is positive, but is predicted negative.
5. True Negative (TN) : Observation is negative, and is predicted to be
negative.
6. False Positive (FP) : Observation is negative, but is predicted positive.
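• A tiny sketch (not from the slides) that reproduces the precision and recall numbers from the example above.

def precision_recall(tp, fp, fn):
    # Precision = TP / (TP + FP); Recall = TP / (TP + FN).
    return tp / (tp + fp), tp / (tp + fn)

# 30 relevant documents returned, 10 irrelevant returned, 20 relevant missed.
p, r = precision_recall(tp=30, fp=10, fn=20)
print(p, r)   # 0.75 0.6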
IR refinements
• One common refinement is a better model of the effect of document
length on relevance.
• Singhal et al. (1996) observed that simple document length
normalization schemes tend to favour short documents too much
and long documents not enough.
• They propose a pivoted document length normalization scheme: The
pivot is the document length at which the old-style normalization is
correct
• Documents shorter than that get a boost and longer ones get a
penalty.
• The BM25 scoring function uses a word model that treats all words
as completely independent, but we know that some words are
correlated:
Example: couch is closely related to both couches and sofa.
• Some systems address this with stemming, reducing word forms such as “couches” to the stem “couch”.
• A further refinement is to recognize synonyms, such as “sofa” for “couch”.
• As a final refinement, IR can be improved by considering metadata (data
outside of the text of the document).
The Page Rank algorithm
• One of the two original ideas that set Google’s search apart from other Web
search engines when it was introduced in 1997
• Invented by Larry Page to solve the problem of the tyranny of TF scores:
• If the query is [IBM], how do we make sure that IBM’s home page,
ibm.com, is the first result, even if another page mentions the term “IBM”
more frequently?
• The idea is that ibm.com has many in-links (links to the page), so it
should be ranked higher.
• Each in-link is a vote for the quality of the linked-to page.
• But if we only counted in-links, then it would be possible for a Web
spammer to create a network of pages and have them all point to a
page of his choosing, increasing the score of that page.
• The Page Rank algorithm is designed to weight links from high-
quality sites more heavily.
• What is a high quality site? One that is linked to by other high-
quality sites.
• The PageRank for a page p is defined as:
  PR(p) = (1 − d) / N + d · Σi PR(ini) / C(ini)
• PR(p) is the PageRank of page p.
• N is the total number of pages in the corpus.
• ini are the pages that link in to p.
• C(ini) is the count of the total number of out-links on page ini.
• The constant d is a damping factor.
• It can be understood through the random surfer model.
• Imagine a Web surfer who starts at some random page and begins
exploring.
• With probability d (assume d = 0.85) the surfer clicks on one of the
links on the page and with probability 1 − d she gets bored with the
page and restarts on a random page anywhere on the Web.
• The Page Rank of page p is then the probability that the random
surfer will be at page p at any point in time.
• Page Rank can be computed by an iterative procedure: start with all
pages having PR(p)=1, and iterate the algorithm, updating ranks
until they converge.
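• The iterative procedure can be written in a few lines; the sketch below (not from the slides) uses a tiny hypothetical link graph and the damping factor d = 0.85.

def pagerank(links, d=0.85, iterations=50):
    # PR(p) = (1 - d) / N + d * sum over in-links q of PR(q) / C(q)
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}   # any starting values converge to the same fixed point
    for _ in range(iterations):
        pr = {p: (1 - d) / N + d * sum(pr[q] / len(links[q])
                                       for q in pages if p in links[q])
              for p in pages}
    return pr

# links[page] = set of pages it links out to (a made-up three-page web)
links = {"ibm.com": {"news.com"},
         "news.com": {"ibm.com", "blog.com"},
         "blog.com": {"ibm.com"}}
print(pagerank(links))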
The HITS algorithm
• The Hyperlink-Induced Topic Search algorithm(HITS) is another
influential link-analysis algorithm.
• HITS differs from Page Rank in several ways.
• First, it is a query-dependent measure: it rates pages with respect to
a query.
• Given a query, HITS first finds a set of pages that are relevant to the
query.
• It does that by intersecting hit lists of query words, and then adding
pages in the link neighbourhood of these pages: pages that link to or
are linked from one of the pages in the original relevant set.
• Each page in this set is considered an authority on the query to the
degree that other pages in the relevant set point to it.
• A page is considered a hub to the degree that it points to other
authoritative pages in the relevant set.
• we want to give more value to the high-quality hubs and authorities.
• Thus, as with Page Rank, we iterate a process that updates the
authority score of a page to be the sum of the hub scores of the
pages that point to it.
• The hub score is the sum of the authority scores of the pages it
points to.
• If we then normalize the scores and repeat k times, the process will
converge.
• Both Page Rank and HITS played important roles in developing our
understanding of Web information retrieval.
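• A minimal sketch of the hub/authority iteration (an illustration, not from the slides); the link graph is a made-up example and k is the number of iterations.

def hits(links, k=20):
    # links[p] = set of pages that p points to.
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(k):
        # authority of p = sum of hub scores of the pages that point to p
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # hub of p = sum of authority scores of the pages that p points to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

links = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print(hits(links))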
Question answering
• In Question answering the query really is a question, and the answer
is not a ranked list of documents but rather a short response-a
sentence, or even just a phrase.
• The ASKMSR system (Banko et al., 2002) is a typical Web-based
question-answering system.
• It is based on the intuition that most questions will be answered
many times on the Web.
• Question answering should be thought of as a problem in precision,
not recall.
• We don’t have to deal with all the different ways that an answer
might be phrased - we only have to find one of them.
• For example, consider the query [who killed Abraham Lincoln]?
• Suppose a system had to answer that question with access only to a single encyclopedia whose one relevant sentence is “John Wilkes Booth will forever be known as the man who ended Abraham Lincoln’s life.”
• Answering from that sentence alone would require real understanding; the Web, in contrast, states the answer plainly many times.
• ASKMSR knows 15 different kinds of questions, and how they can be rewritten as queries to a search engine.
• It knows that [Who killed Abraham Lincoln] can be rewritten as the
query [* killed Abraham Lincoln] and as [Abraham Lincoln was
killed by *].
• It issues these rewritten queries and examines the results that come
back- not the full Web pages, just the short summaries of text that
appear near the query terms.
• The results are broken into 1-, 2-, and 3-grams and tallied for
frequency in the result sets and for weight: an n-gram that came
back from a very specific query rewrite would get more weight than
one from a general query rewrite.
• Once the n-grams are scored, they are filtered by expected type.
• If the original query starts with “who,” then we filter on names of
people; for “how many” we filter on numbers, for “when,” on a
date or time.
• There is also a filter that says the answer should not be part of the
question.
• Together these should allow us to return “John Wilkes Booth” (and
not “Abraham Lincoln”) as the highest-scoring response.
• In some cases the answer will be longer than three words; since the component responses only go up to 3-grams, a longer response would have to be pieced together from shorter pieces.
• For example, in a system that used only bigrams, the answer “John
Wilkes Booth” could be pieced together from high-scoring pieces
“John Wilkes” and “Wilkes Booth.”
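• The tally-and-filter step can be sketched as follows (an illustration, not the actual ASKMSR code); the snippets, the rewrite weights, and the question-word filter are assumptions.

from collections import Counter

def tally_ngrams(snippets, question_words):
    # Score 1- to 3-grams from result snippets, weighted by rewrite specificity,
    # and filter out n-grams that just repeat words from the question.
    scores = Counter()
    for text, weight in snippets:              # (snippet text, rewrite weight)
        words = text.lower().split()
        for n in (1, 2, 3):
            for i in range(len(words) - n + 1):
                gram = " ".join(words[i:i+n])
                if not any(q in gram for q in question_words):
                    scores[gram] += weight
    return scores.most_common(5)

snippets = [("abraham lincoln was killed by john wilkes booth", 2.0),
            ("john wilkes booth shot lincoln at ford theatre", 1.0)]
print(tally_ngrams(snippets, {"who", "killed", "abraham", "lincoln"}))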
• At the Text Retrieval Evaluation Conference (TREC), ASKMSR
was rated as one of the top systems, beating out competitors with the
ability to do far more complex language understanding.
• ASKMSR relies upon the breadth of the content on the Web rather
than on its own depth of understanding.
• It won’t be able to handle complex inference patterns like
associating “who killed” with “ended the life of”.
• But it knows that the Web is so vast that it can afford to ignore
passages like that and wait for a simple passage it can handle.
Information extraction
• Information extraction is the process of acquiring knowledge by
skimming a text and looking for occurrences of a particular class of
object and for relationships among objects.
• A typical task is to extract instances of addresses from Web pages,
with database fields for street, city, state, and zip code or instances
of storms from weather reports, with fields for temperature, wind
speed, and precipitation.
• In a limited domain, this can be done with high accuracy.
• As the domain gets more general, more complex linguistic models
and more complex learning techniques are necessary.
• There are six different approaches to information extraction, presented in order of increasing complexity along the following dimensions:
 - deterministic to stochastic
 - domain-specific to general
 - hand-crafted to learned
 - small-scale to large-scale
1) Finite-state automata for information extraction
• The simplest type of information extraction system is an attribute-based
extraction system
• It assumes that the entire text refers to a single object and the task is to
extract attributes of that object
• The template is defined by a finite state automaton, the simplest example
of which is the regular expression or regex.
• Regular expressions are used in Unix commands such as grep, in
programming languages such as Perl, and in word processors such as
Microsoft Word.
• Templates are often defined with three parts: a prefix regex, a target
regex, and a postfix regex.
• For example, consider the problem of extracting from the text “IBM ThinkBook 970. Our price: $399.00” the set of attributes {Manufacturer=IBM, Model=ThinkBook970, Price=$399.00}.
• For prices, the target regex might be [$][0-9]+([.][0-9][0-9])?, the prefix would look for strings such as “price:”, and the postfix could be empty.
• If a regular expression for an attribute matches the text exactly once,
then we can pull out the portion of the text that is the value of the
attribute.
• If there is no match, all we can do is give a default value or leave the
attribute missing;
• If there are several matches, we need a process to choose among them.
• One strategy is to have several templates for each attribute, ordered by
priority.
• For example the top-priority template for price might look for the prefix
“our price:”;
• if that is not found, we look for the prefix “price:” and if that is not found,
the empty prefix.
• Another strategy is to take all the matches and find some way to choose
among them.
• For example, we could take the lowest price that is within 50% of the
highest price. That will select $78.00 as the target from the text “List price
$99.00, special sale price $78.00, shipping $3.00.”
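• A small sketch (not from the slides) of prioritized prefix/target templates for the price attribute; the regexes and the helper name are assumptions.

import re

# Hypothetical (prefix, target) templates for a price attribute, in priority order.
TEMPLATES = [
    (r"our price:\s*", r"\$[0-9]+(?:\.[0-9][0-9])?"),
    (r"price:\s*",     r"\$[0-9]+(?:\.[0-9][0-9])?"),
    (r"",              r"\$[0-9]+(?:\.[0-9][0-9])?"),   # empty prefix as a last resort
]

def extract_price(text):
    # Try the templates in priority order; return the first target that matches.
    for prefix, target in TEMPLATES:
        m = re.search(prefix + "(" + target + ")", text, re.IGNORECASE)
        if m:
            return m.group(1)
    return None   # no match: leave the attribute missing

print(extract_price("IBM ThinkBook 970. Our price: $399.00"))   # $399.00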
• One step up from attribute-based extraction systems are relational
extraction systems, which deal with multiple objects and the relations
among them.
• Thus, when these systems see the text “$249.99,” they need to determine
not just that it is a price, but also which object has that price.
• A typical relational-based extraction system is FASTUS, which handles
news stories about corporate mergers and acquisitions.
• It can read the story: Bridgestone Sports Co. said Friday it has set up
a joint venture in Taiwan with a local concern and a Japanese trading
house to produce golf clubs to be shipped to Japan.
• From this story the system extracts a relation describing the joint venture: the entities involved (Bridgestone Sports Co., a local concern, a Japanese trading house) and the activity (producing golf clubs to be shipped to Japan).
• A relational extraction system can be built as a series of cascaded finite-state transducers.
• That is, the system consists of a series of small, efficient finite-state automata (FSAs), where each automaton receives text as input, transduces the text into a different format, and passes it along to the next automaton.
• FASTUS consists of five stages:
1. Tokenization
2. Complex-word handling
3. Basic-group handling
4. Complex-phrase handling
5. Structure merging
Tokenization :
• Segments the stream of characters into tokens (words, numbers, and
punctuation).
• For English, tokenization can be fairly simple; just separating
characters at white space or punctuation does a fairly good job.
• Some tokenizers also deal with markup languages such as HTML,
SGML, and XML.
Complex word handling:
• The second stage handles complex words, including collocations
such as “set up” and “joint venture,” as well as proper names such as
“Bridgestone Sports Co.”
• These are recognized by a combination of lexical entries and finite-
state grammar rules.
• For example, a company name might be recognized by a rule such as:
  CapitalizedWord+ (“Company” | “Co” | “Inc” | “Ltd”)
Basic group handling:
• The third stage handles basic groups, meaning noun groups and verb groups.
• The idea is to chunk these into units that will be managed by the later stages.
• The example sentence would emerge from this stage as the following sequence of tagged groups:
  [NG Bridgestone Sports Co.] [VG said] [NG Friday] [NG it] [VG had set up] [NG a joint venture] [PR in] [NG Taiwan] [PR with] [NG a local concern] [CJ and] [NG a Japanese trading house] [VG to produce] [NG golf clubs] [VG to be shipped] [PR to] [NG Japan]
• Here NG means noun group, VG is verb group, PR is preposition, and CJ is conjunction.
Complex-phrase handling:
• The fourth stage combines the basic groups into complex phrases.
• Again, the aim is to have rules that are finite-state and thus can be
processed quickly, and that result in unambiguous (or nearly
unambiguous) output phrases.
• One type of combination rule deals with domain-specific events
• For example, the rule below captures one way to describe the formation of a joint venture:
  Company+ SetUp JointVenture (“with” Company+)?
• This stage is the first one in the cascade where the output is placed
into a database template as well as being placed in the output stream.
Structure merging:
• The final stage merges structures that were built up in the previous
step.
• If the next sentence says “The joint venture will start production in
January,” then this step will notice that there are two references to a
joint venture, and that they should be merged into one.
• In general, finite-state template-based information extraction works
well for a restricted domain in which it is possible to predetermine
what subjects will be discussed, and how they will be mentioned.
• The cascaded transducer model helps modularize the necessary
knowledge, easing construction of the system.
• These systems work especially well when they are reverse-
engineering text that has been generated by a program.
• For example, a shopping site on the Web is generated by a program
that takes database entries and formats them into Web pages;
• A template-based extractor then recovers the original database.
• Finite-state information extraction is less successful at recovering
information in highly variable format, such as text written by
humans on a variety of subjects.

2) Probabilistic models for information extraction:
• When information extraction must be attempted from noisy or varied input, simple finite-state approaches fare poorly.
• It is too hard to get all the rules and their priorities right; it is better to use a probabilistic model rather than a rule-based model.
• The simplest probabilistic model for sequences with hidden state is the Hidden Markov Model (HMM).
• An HMM models a progression through a sequence of hidden states,
xt, with an observation et at each step.
• To apply HMMs to information extraction, we can either build one
big HMM for all the attributes or build a separate HMM for each
attribute.
• We’ll do the second. The observations are the words of the text, and
the hidden states are whether we are in the target, prefix, or postfix
part of the attribute template, or in the background (not part of a
template).
• For example, a brief announcement text can be labeled with the most probable (Viterbi) path for two HMMs, one trained to recognize the speaker in a talk announcement and one trained to recognize dates, where a “–” label indicates a background state.
• HMMs have two big advantages over FSAs for extraction.
• HMMs are probabilistic and thus tolerant to noise.
• In a regular expression, if a single expected character is missing, the
regex fails to match; with HMMs there is graceful degradation with
missing characters/words, and we get a probability indicating the
degree of match, not just a Boolean match/fail.
• HMMs can be trained from data, they don’t require laborious
engineering of templates, and thus they can more easily be kept up
to date as text changes over time.
• We have assumed a certain level of structure in our HMM templates that makes them easier to learn.
• With a partially specified structure, the forward-backward algorithm can be used to learn both the transition probabilities between states and the observation model.
• For example, the word “Friday” would have high probability in one
or more of the target states of the date HMM, and lower probability
elsewhere.
• With sufficient training data, the HMM automatically learns a
structure of dates that we find intuitive.
• In our example, the prefix covers expressions such as “Speaker:” and
“seminar by,” and the target has one state that covers titles and first
names and another state that covers initials and last names.
• Once the HMMs have been learned, we can apply them to a text,
using the Viterbi algorithm to find the most likely path through the
HMM states.
• One approach is to apply each attribute HMM separately; in this case
you would expect most of the HMMs to spend most of their time in
background states.
• This is appropriate when the extraction is sparse— when the number
of extracted words is small compared to the length of the text.
• The other approach is to combine all the individual attributes into
one big HMM, which would then find a path that wanders through
different target attributes, first finding a speaker target, then a date
target, etc.
• Separate HMMs are better when we expect just one of each
attribute in a text and one big HMM is better when the texts are
more free-form and dense with attributes.
• With either approach, in the end we have a collection of target
attribute observations, and have to decide what to do with them.
• If every expected attribute has one target filler then the decision is
easy.
• HMMs have the advantage of supplying probability numbers that
can help make the choice.
• If some targets are missing, we need to decide if this is an instance
of the desired relation at all, or if the targets found are false
positives.
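• A minimal Viterbi sketch (an illustration, not from the slides) for a single attribute HMM with prefix, target, and background states; all probabilities here are made-up toy values rather than a trained model.

import math

def viterbi(words, states, start_p, trans_p, emit_p):
    # Most likely hidden-state path for the observed words, in log space.
    def logp(p):
        return math.log(p) if p > 0 else -1e9
    best = {s: (logp(start_p[s]) + logp(emit_p[s].get(words[0], 1e-6)), [s])
            for s in states}
    for w in words[1:]:
        new = {}
        for s in states:
            prev_score, prev_path = max(
                (best[r][0] + logp(trans_p[r][s]), best[r][1]) for r in states)
            new[s] = (prev_score + logp(emit_p[s].get(w, 1e-6)), prev_path + [s])
        best = new
    return max(best.values())[1]

states = ["PREFIX", "SPEAKER", "BACKGROUND"]
start_p = {"PREFIX": 0.4, "SPEAKER": 0.1, "BACKGROUND": 0.5}
trans_p = {"PREFIX":     {"PREFIX": 0.3, "SPEAKER": 0.6, "BACKGROUND": 0.1},
           "SPEAKER":    {"PREFIX": 0.1, "SPEAKER": 0.5, "BACKGROUND": 0.4},
           "BACKGROUND": {"PREFIX": 0.3, "SPEAKER": 0.1, "BACKGROUND": 0.6}}
emit_p = {"PREFIX":     {"speaker:": 0.8, "seminar": 0.2},
          "SPEAKER":    {"andrew": 0.5, "mccallum": 0.5},
          "BACKGROUND": {"the": 0.3, "talk": 0.3, "is": 0.2, "today": 0.2}}
print(viterbi("speaker: andrew mccallum".split(), states, start_p, trans_p, emit_p))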
3. Conditional random fields for information extraction
• One issue with HMMs for the information extraction task is that
they model a lot of probabilities that we don’t really need.
• An HMM is a generative model; it models the full joint probability
of observations and hidden states, and thus can be used to generate
samples.
• That is, we can use the HMM model not only to parse a text and
recover the speaker and date, but also to generate a random instance
of a text containing a speaker and a date.
• All we need in order to understand a text is a discriminative model,
one that models the conditional probability of the hidden attributes
given the observations (the text).
• A framework for this type of model is the conditional random field,
or CRF, which models a conditional probability distribution of a set
of target variables given a set of observed variables.
• Like Bayesian networks, CRFs can represent many different
structures of dependencies among the variables.
• One common structure is the linear-chain conditional random field
for representing Markov dependencies among variables in a
temporal sequence.
• HMMs are the temporal version of naive Bayes models, and linear-
chain CRFs are the temporal version of logistic regression, where
the predicted target is an entire state sequence rather than a single
binary variable.
• Let e1:N be the observations (e.g., words in a document), and x1:N
be the sequence of hidden states (e.g., the prefix, target, and postfix
states).
• A linear-chain conditional random field defines a conditional probability distribution:
  P(x1:N | e1:N) = α exp( Σi=1..N F(xi−1, xi, e, i) )
• α is a normalization factor (to make sure the probabilities sum to 1).
• F is a feature function defined as the weighted sum of a collection of k component feature functions:
  F(xi−1, xi, e, i) = Σk λk fk(xi−1, xi, e, i)
• The λk parameter values are learned with a MAP (maximum a posteriori) estimation procedure that maximizes the conditional likelihood of the training data.
• The feature functions are the key components of a CRF.
• The function fk has access to a pair of adjacent states, xi−1 and xi, the entire observation (word) sequence e, and the current position in the temporal sequence, i. This gives us a lot of flexibility in defining features.
• We can define a simple feature function, for example one that produces a value of 1 if the current word is ANDREW and the current state is SPEAKER:
  f1(xi−1, xi, e, i) = 1 if xi = SPEAKER and ei = ANDREW, otherwise 0.
• If λ1 > 0, then whenever f1 is true, it increases the probability of the hidden state sequence x1:N.
• This is another way of saying “the CRF model should prefer the
target state SPEAKER for the word ANDREW.”
• If on the other hand λ1 < 0, the CRF model will try to avoid this
association, and if λ1 = 0, this feature is ignored.
• Parameter values can be set manually or can be learned from data.
• Now consider a second feature function:
  f2(xi−1, xi, e, i) = 1 if xi = SPEAKER and ei+1 = said, otherwise 0.
• This feature is true if the current state is SPEAKER and the next word is “said.”
• Note that both f1 and f2 can hold at the same time for a sentence like “Andrew said . . .”.
• In this case, the two features overlap each other and both boost the belief in x1 = SPEAKER.
• Because of the independence assumption, HMMs cannot use overlapping features, but CRFs can.
• Furthermore, a feature in a CRF can use any part of the sequence e1:N . Features
can also be defined over transitions between states.
• The features we defined here were binary, but in general, a feature function can be
any real-valued function.
• For domains where we have some knowledge about the types of features we would
like to include, the CRF formalism gives us a great deal of flexibility in defining
them.
• This flexibility can lead to accuracies that are higher than with less flexible models
such as HMMs.
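• A small sketch (not from the slides) of how component feature functions and their λ weights combine into the unnormalized score; the features mirror f1 and f2 above, and the λ values are made-up rather than learned by MAP estimation.

def f1(x_prev, x_cur, words, i):
    # Fires when the current state is SPEAKER and the current word is "andrew".
    return 1.0 if x_cur == "SPEAKER" and words[i] == "andrew" else 0.0

def f2(x_prev, x_cur, words, i):
    # Fires when the current state is SPEAKER and the next word is "said".
    return 1.0 if (x_cur == "SPEAKER" and i + 1 < len(words)
                   and words[i + 1] == "said") else 0.0

FEATURES = [(1.5, f1), (0.8, f2)]   # (lambda_k, f_k) pairs

def sequence_score(states, words):
    # Unnormalized score: sum over positions i and features k of lambda_k * f_k.
    total = 0.0
    for i in range(len(words)):
        x_prev = states[i - 1] if i > 0 else None
        total += sum(lam * f(x_prev, states[i], words, i) for lam, f in FEATURES)
    return total

words = "andrew said hello".split()
print(sequence_score(["SPEAKER", "BACKGROUND", "BACKGROUND"], words))   # both features fire
print(sequence_score(["BACKGROUND", "BACKGROUND", "BACKGROUND"], words))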
4. Ontology extraction from large corpora
• A different application of extraction technology is building a large
knowledge base or ontology of facts from a corpus.
• This is different in three ways:
• First it is open-ended: we want to acquire facts about all types of
domains, not just one specific domain.
• Second, with a large corpus, this task is dominated by precision, not recall, just as with question answering on the Web.
• Third, the results can be statistical aggregates gathered from
multiple sources, rather than being extracted from one specific text.
• Here is one of the most productive templates:
  NP such as NP (, NP)* (,)? ((and | or) NP)?
• Here the words “such as” and the commas must appear literally in the text, but the parentheses are for grouping, the asterisk means repetition of zero or more, and the question mark means optional.
• NP is a variable standing for a noun phrase;
• This template matches the texts “diseases such as rabies affect your
dog” and “supports network protocols such as DNS,” concluding
that rabies is a disease and DNS is a network protocol.
• Similar templates can be constructed with the key words
“including,” “especially,” and “or other.” Of course these templates
will fail to match many relevant passages, like “Rabies is a disease.”
That is intentional.
• The “NP is a NP” template does indeed sometimes denote a
subcategory relation, but it often means something else, as in “There
is a God” or “She is a little tired.”
• With a large corpus we can afford to be picky; to use only the high-
precision templates.
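• A rough regex version of the “such as” template (an illustration, not from the slides); approximating an NP by a single lowercase word is a large simplification made for the example.

import re

NP = r"[a-z]+"
PATTERN = re.compile(rf"({NP}) such as ({NP}(?:, {NP})*(?:,? (?:and|or) {NP})?)")

def subcategories(text):
    # Return (category, [instances]) pairs matched by the "such as" template.
    results = []
    for category, items in PATTERN.findall(text.lower()):
        parts = [p.strip() for p in re.split(r",|\band\b|\bor\b", items) if p.strip()]
        results.append((category, parts))
    return results

print(subcategories("diseases such as rabies affect your dog"))
print(subcategories("supports network protocols such as DNS, FTP and SSH"))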
5. Automated template construction
• The subcategory relation is so fundamental that it is worthwhile to handcraft a few templates to help identify instances of it occurring in natural language text.
• But what about the thousands of other relations in the world?
• Fortunately, it is possible to learn templates from a few examples,
then use the templates to learn more examples, from which more
templates can be learned, and so on.
• In one of the first experiments of this kind, Brin (1999) started with a data set of just five examples, each pairing an author with the title of one of their books, for instance (Isaac Asimov, The Robots of Dawn).
• Clearly these are examples of the author–title relation, but the learning system had no knowledge of authors or titles.
• The words in these examples were used in a search over a Web
corpus, resulting in 199 matches.
• Each match is defined as a tuple of seven strings,
(Author, Title, Order, Prefix, Middle, Postfix, URL)
• where Order is true if the author came first and false if the title came
first.
• Middle is the characters between the author and title, Prefix is the 10 characters before the match, Postfix is the 10 characters after the match, and URL is the Web address where the match was made.
• Given a set of matches, a simple template-generation scheme can
find templates to explain the matches.
• The language of templates was designed to have a close mapping to
the matches themselves, to be amenable to automated learning, and
to emphasize high precision.
• Each template has the same seven components as a match.
• The Author and Title are regexes consisting of any characters and
constrained to have a length from half the minimum length of the
examples to twice the maximum length.
• The prefix, middle, and postfix are restricted to literal strings, not
regexes.
• The middle is the easiest to learn: each distinct middle string in the
set of matches is a distinct candidate template.
• For each such candidate, the template’s Prefix is then defined as the
longest common suffix of all the prefixes in the matches, and the
Postfix is defined as the longest common prefix of all the postfixes
in the matches.
• If either of these is of length zero, then the template is rejected. The
URL of the template is defined as the longest prefix of the URLs in
the matches.
• The disadvantage of this approach is its sensitivity to noise.
• If one of the first few templates is incorrect, errors can propagate
quickly.
• One way to limit this problem is to not accept a new example unless
it is verified by multiple templates, and not accept a new template
unless it discovers multiple examples that are also found by other
templates.
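• The prefix/postfix generalization step can be sketched as follows (an illustration, not Brin’s actual system); the two matches below, with their HTML context strings, are made-up examples.

def common_prefix(strings):
    # Longest common prefix of a list of strings.
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix

def common_suffix(strings):
    return common_prefix([s[::-1] for s in strings])[::-1]

def make_template(matches):
    # Generalize a group of matches that share the same Middle string:
    # Prefix = longest common suffix of the prefixes,
    # Postfix = longest common prefix of the postfixes.
    prefix = common_suffix([m["prefix"] for m in matches])
    postfix = common_prefix([m["postfix"] for m in matches])
    if not prefix or not postfix:
        return None                      # reject templates with empty context
    return {"prefix": prefix, "middle": matches[0]["middle"], "postfix": postfix}

matches = [{"prefix": "ooks like <i>", "middle": "</i> by ", "postfix": "</b><br>"},
           {"prefix": "tles: <i>",     "middle": "</i> by ", "postfix": "</b><hr>"}]
print(make_template(matches))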
6. Machine reading
• To build a large ontology with many thousands of relations, even that amount of handcrafting would be onerous;
• we would like an extraction system with no human input of any kind: a system that could read on its own and build up its own database.
• Such a system would be relation-independent; it would work for any relation.
• In practice, these systems work on all relations in parallel, because
of the I/O demands of large corpora.
• They behave less like a traditional information extraction system
that is targeted at a few relations and more like a human reader who
learns from the text itself, because of this the field has been called
machine reading.
• A representative machine-reading system is TEXTRUNNER (Banko
and Etzioni, 2008).
• TEXTRUNNER uses cotraining to boost its performance, but it
needs something to bootstrap from.
• For TEXTRUNNER, the original inspiration was a taxonomy of
eight very general syntactic templates.
• It was felt that a small number of templates like this could cover
most of the ways that relationships are expressed in English.
• The actual bootstrapping starts from a set of labelled examples that are extracted from the Penn Treebank, a corpus of parsed sentences.
• For example, from the parse of the sentence “Einstein received the
Nobel Prize in 1921,” TEXTRUNNER is able to extract the relation
(“Einstein,” “received,” “Nobel Prize”).
• Given a set of labeled examples of this type, TEXTRUNNER trains
a linear-chain CRF to extract further examples from unlabeled text.
• The features in the CRF include function words like “to” and “of”
and “the,” but not nouns and verbs (and not noun phrases or verb
phrases).
• Because TEXTRUNNER is domain-independent, it cannot rely on
predefined lists of nouns and verbs.
• TEXTRUNNER achieves a precision of 88% and recall of 45% (F1
of 60%) on a large Web corpus.
• TEXTRUNNER has extracted hundreds of millions of facts from a
corpus of a half-billion Web pages.
• For example, even though it has no predefined medical knowledge,
it has extracted over 2000 answers to the query [what kills bacteria];
• Correct answers include antibiotics, ozone, chlorine, Cipro, and
broccoli sprouts. Questionable answers include “water,” which came
from the sentence “Boiling water for at least 10 minutes will kill
bacteria.”
• It would be better to attribute this to “boiling water” rather than just
“water.”
• With the techniques outlined in this chapter and continual new
inventions, we are starting to get closer to the goal of machine
reading.
Machine Translation
• Machine translation is the automatic translation of text from one
natural language (the source) to another (the target).
• There are three main applications of machine translation:
• Rough translation, as provided by free online services, gives the “gist” of a foreign sentence or document, but contains errors.
• Pre-edited translation is used by companies to publish their documentation and sales materials in multiple languages. The original source text is written in a constrained language that is easier to translate automatically, and the results are usually edited by a human to correct any errors.
• Restricted-source translation works fully automatically, but only on highly stereotypical language, such as a weather report.
• Translation is difficult because, in the fully general case, it requires
in-depth understanding of the text.
• Consider the word “Open” on the door of a store and the same word “Open” on a large banner outside a newly constructed store.
• The two signs use the identical word to convey different meanings: the first says the store is accepting customers at the moment, while the second says the store is now in operation.
• The problem is that different languages categorize the world
differently. For example, the French word “doux” covers a wide
range of meanings corresponding approximately to the English
words “soft,” “sweet,” and “gentle.”
• Representing the meaning of a sentence is more difficult for
translation than it is for single-language understanding.
• A representation language that makes all the distinctions necessary
for a set of languages is called an interlingua.
• Translator (human or machine) often needs to understand the actual
situation described in the source, not just the individual words.
• For example, to translate the English word “him” into Korean, a choice must be made between the humble and honorific form, a choice that depends on the social relationship between the speaker and the referent of “him.”
• Translators (both machine and human) sometimes find it difficult to make this choice.
• As another example, translating the pronoun “it” in “The baseball hit the window. It broke.” requires knowing which object broke; to get the translation right, one must understand physics as well as language.
Machine translation systems:
• All translation systems must model the source and target languages, but
systems vary in the type of models they use.
• Some systems attempt to analyze the source language text all the way into
an interlingua knowledge representation and then generate sentences in the
target language from that representation.
• This is difficult because it involves three unsolved problems:
1.Creating a complete knowledge representation of everything.
2.Parsing into that representation.
3.Generating sentences from that representation.
• Other systems are based on a transfer model. They keep a database
of translation rules, and whenever the rule matches, they translate
directly.
• Transfer can occur at the lexical, syntactic, or semantic level.
• For example, a strictly syntactic rule maps English [Adjective Noun]
to French [Noun Adjective].
• A mixed syntactic and lexical rule maps French [S1 “et puis” S2] to
English [S1 “and then” S2]. Figure 23.12 diagrams the various
transfer points.