Artificial Intelligence: Natural Language Processing
UNIT – III
Chapter – 2
3.2.1. LANGUAGE MODELS :
▪ Formal languages, such as the programming languages Java or Python, have precisely defined
language models.
▪ A language can be defined as a set of strings; “print(2 + 2)” is a legal program in the language
Python, whereas “2)+(2 print” is not.
▪ Since there are an infinite number of legal programs, they cannot be enumerated; instead they are
specified by a set of rules called a grammar.
▪ Formal languages also have rules that define the meaning or semantics of a program; for example,
the rules say that the “meaning” of “2 + 2” is 4, and the meaning of “1/0” is that an error is signaled.
▪ Natural languages, such as English or Spanish, cannot be characterized as a definitive set of
sentences.
▪ Everyone agrees that “Not to be invited is sad” is a sentence of English, but people disagree on the
grammaticality of “To be not invited is sad.”
▪ Therefore, it is more fruitful to define a natural language model as a probability distribution over
sentences rather than a definitive set.
P(S = words), the probability that a random sentence S is a particular string of words.
▪ Natural languages are also ambiguous: we cannot speak of a single meaning for a sentence, but rather of a probability distribution over possible meanings.
▪ Finally, natural languages are difficult to deal with because they are very large, and constantly
changing.
▪ Thus, our language models are, at best, an approximation. We start with the simplest possible
approximations and move up from there.
▪ We can define the probability of a sequence of characters P(c1:N) under the trigram model by first factoring with the chain rule and then using the Markov assumption:
P(c1:N) = ∏ i=1..N P(ci | ci−2:i−1)
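▪ As a concrete illustration (a minimal sketch, not the textbook's code; the function names and toy corpus are my own), the following Python snippet estimates the trigram conditional probabilities by counting and scores a string by multiplying them:

```python
from collections import defaultdict

def train_char_trigrams(corpus):
    """Count character trigrams and their two-character contexts (illustrative sketch)."""
    trigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    padded = "  " + corpus                 # pad so the first characters have a context
    for i in range(2, len(padded)):
        context = padded[i - 2:i]
        trigram_counts[(context, padded[i])] += 1
        context_counts[context] += 1
    return trigram_counts, context_counts

def sequence_probability(text, trigram_counts, context_counts):
    """P(c1:N) as the product of P(ci | ci-2:i-1) under the Markov assumption."""
    prob = 1.0
    padded = "  " + text
    for i in range(2, len(padded)):
        context = padded[i - 2:i]
        count = trigram_counts.get((context, padded[i]), 0)
        total = context_counts.get(context, 0)
        if count == 0 or total == 0:
            return 0.0                     # unseen trigram; smoothing (discussed below) fixes this
        prob *= count / total
    return prob

tri, ctx = train_char_trigrams("the rain in spain stays mainly in the plain")
print(sequence_probability("the rain", tri, ctx))   # -> 0.5 with this toy corpus
```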
▪ We call a body of text a corpus (plural corpora), from the Latin word for body.
▪ What can we do with n-gram character models? One task for which they are well suited is language identification.
▪ For example, given a text, determine what natural language it is written in.
▪ This is a relatively easy task; even with short texts such as “Hello, world” or “Wie geht es dir,” it is
easy to identify the first as English and the second as German.
▪ Computer systems identify languages with greater than 99% accuracy; occasionally, closely related
languages, such as Swedish and Norwegian, are confused.
▪ One approach to language identification is to first build a trigram character model of each candidate
language, P(ci | ci−2:i−1, L), where the variable L ranges over languages.
▪ That gives us a model of P(Text | Language), but we want to select the most probable language given
the text, so we apply Bayes’ rule followed by the Markov assumption to get the most probable
language:
L* = argmax L P(L | c1:N) = argmax L P(L) P(c1:N | L) = argmax L P(L) ∏ i=1..N P(ci | ci−2:i−1, L)
▪ To avoid assigning zero probability to trigrams that never occur in the training corpus, the trigram, bigram, and unigram estimates can be combined by linear interpolation smoothing:
P̂(ci | ci−2:i−1) = λ3 P(ci | ci−2:i−1) + λ2 P(ci | ci−1) + λ1 P(ci)
▪ where λ3 + λ2 + λ1 = 1. The parameter values λi can be fixed, or they can be trained with an
expectation–maximization algorithm.
▪ It is also possible to have the values of λi depend on the counts: if we have a high count of trigrams,
then we weigh them relatively more; if only a low count, then we put more weight on the bigram and
unigram models.
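▪ A minimal sketch of this approach in Python, assuming tiny hand-made corpora, fixed λ values, and a crude floor for unseen characters (the function names and training strings are illustrative, not the textbook's code):

```python
import math
from collections import defaultdict

def train_language_model(corpus):
    """Character unigram, bigram, and trigram counts for one language (toy sketch)."""
    counts = {1: defaultdict(int), 2: defaultdict(int), 3: defaultdict(int)}
    padded = "  " + corpus
    for i in range(2, len(padded)):
        for n in (1, 2, 3):
            counts[n][padded[i - n + 1:i + 1]] += 1
    counts["total"] = max(len(corpus), 1)
    return counts

def interpolated_prob(counts, context, c, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation smoothing: lambda1*P(c) + lambda2*P(c|c-1) + lambda3*P(c|c-2 c-1)."""
    l1, l2, l3 = lambdas
    p1 = counts[1][c] / counts["total"]
    p2 = counts[2][context[1] + c] / counts[1][context[1]] if counts[1][context[1]] else 0.0
    p3 = counts[3][context + c] / counts[2][context] if counts[2][context] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

def identify_language(text, models, priors):
    """argmax over L of log P(L) + sum_i log P(ci | ci-2:i-1, L)."""
    best_lang, best_score = None, float("-inf")
    padded = "  " + text
    for lang, counts in models.items():
        score = math.log(priors[lang])
        for i in range(2, len(padded)):
            p = interpolated_prob(counts, padded[i - 2:i], padded[i])
            score += math.log(p) if p > 0 else -100.0   # floor for characters never seen
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

# Toy usage; real systems train on much larger corpora (on the order of 100,000 characters per language).
models = {"en": train_language_model("hello world how are you today"),
          "de": train_language_model("hallo welt wie geht es dir heute")}
print(identify_language("wie geht es", models, {"en": 0.5, "de": 0.5}))   # -> de
```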
(iii) Model evaluation :
▪ With so many possible n-gram models—unigram, bigram, trigram, interpolated smoothing with
different values of λ, etc.—how do we know what model to choose? We can evaluate a model with
cross-validation.
▪ Split the corpus into a training corpus and a validation corpus. Determine the parameters of the
model from the training data. Then evaluate the model on the validation corpus.
▪ The evaluation can be a task-specific metric, such as measuring accuracy on language identification.
▪ Alternatively we can have a task-independent model of language quality: calculate the probability
assigned to the validation corpus by the model; the higher the probability the better.
▪ This metric is inconvenient because the probability of a large corpus will be a very small number,
and floating-point underflow becomes an issue.
▪ A different way of describing the probability of a sequence is with a measure called perplexity, defined as:
Perplexity(c1:N) = P(c1:N)^(−1/N)
▪ Perplexity can be thought of as the reciprocal of probability, normalized by sequence length; a model that assigns probability 1/100 to every character has perplexity 100.
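▪ A small sketch of the underflow-free computation, working with log probabilities (the function name and example values are illustrative):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-symbol log probabilities (natural log).

    Working in log space avoids floating-point underflow on large corpora:
    Perplexity(c1:N) = P(c1:N) ** (-1/N) = exp(-(1/N) * sum of log P(ci | context)).
    """
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 1/100 to every symbol has perplexity 100,
# no matter how long the sequence is:
print(perplexity([math.log(1 / 100)] * 5000))   # -> approximately 100.0
```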
3.2.2. TEXT CLASSIFICATION :
▪ We now consider in depth the task of text classification, also known as categorization: given a text of some kind, decide which of a predefined set of classes it belongs to.
▪ Language identification and genre classification are examples of text classification.
▪ Another example is spam detection: classifying an email message as spam or not-spam (ham).
▪ A training set is readily available: the positive (spam) examples are in my spam folder, the negative
(ham) examples are in my inbox.
▪ Note that we have two complementary ways of talking about classification.
▪ In the language-modeling approach, we define one n-gram language model for P(Message | spam)
by training on the spam folder, and one model for P(Message | ham) by training on the inbox.
▪ Then we can classify a new message with an application of Bayes’ rule:
argmax c∈{spam,ham} P(c | message) = argmax c∈{spam,ham} P(message | c) P(c)
▪ where P(c) is estimated just by counting the total number of spam and ham messages. This approach works well for spam detection, just as it did for language identification.
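▪ A minimal sketch of this language-modeling classifier, using unigram word models with add-one (Laplace) smoothing; the helper names, toy training messages, and equal priors are illustrative assumptions, not the textbook's code:

```python
import math
from collections import Counter

def train_word_model(messages):
    """Unigram word counts over one class's training messages (toy sketch)."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts

def log_likelihood(message, counts, vocab_size):
    """log P(message | class) under a unigram model with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[word] + 1) / (total + vocab_size))
               for word in message.lower().split())

def classify(message, spam_counts, ham_counts, p_spam, p_ham):
    """argmax over classes of log P(class) + log P(message | class)."""
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam_score = math.log(p_spam) + log_likelihood(message, spam_counts, vocab_size)
    ham_score = math.log(p_ham) + log_likelihood(message, ham_counts, vocab_size)
    return "spam" if spam_score > ham_score else "ham"

# Toy usage; real systems train on the full spam folder and inbox.
spam = train_word_model(["win money now", "free money offer"])
ham = train_word_model(["meeting at noon", "project status report"])
print(classify("free offer win", spam, ham, p_spam=0.5, p_ham=0.5))   # -> spam
```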
▪ In the machine-learning approach, we instead represent the message as a set of feature/value pairs, for example the count of each word, and apply a classification algorithm to the resulting feature vector.
▪ If there are 100,000 words in the language model, then the feature vector has length 100,000, but for a short email message almost all the features will have count zero.
▪ This unigram representation has been called the bag of words model.
▪ You can think of the model as putting the words of the training corpus in a bag and then selecting
words one at a time.
▪ The notion of order of the words is lost; a unigram model gives the same probability to any
permutation of a text.
▪ Higher-order n-gram models maintain some local notion of word order.
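▪ The bag-of-words idea can be shown in a couple of lines: word order is discarded, so any permutation of a text yields the same counts (the function name is illustrative):

```python
from collections import Counter

def bag_of_words(text):
    """Unigram counts of a text; the order of the words is lost."""
    return Counter(text.lower().split())

print(bag_of_words("to be or not to be"))   # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
print(bag_of_words("be to not or be to") == bag_of_words("to be or not to be"))   # True
```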
▪ It can be expensive to run algorithms on a very large feature vector, so often a process of feature
selection is used to keep only the features that best discriminate between spam and ham.
▪ Once we have chosen a set of features, we can apply any of the supervised learning techniques we
have seen; popular ones for text categorization include k-nearest-neighbors, support vector machines,
decision trees, naive Bayes, and logistic regression.
▪ All of these have been applied to spam detection, usually with accuracy in the 98%–99% range.
With a carefully designed feature set, accuracy can exceed 99.9%.
(i) Classification by data compression :
▪ Another way to think about classification is as a problem in data compression.
▪ A lossless compression algorithm takes a sequence of symbols, detects repeated patterns in it, and
writes a description of the sequence that is more compact than the original.
▪ For example, the text “0.142857142857142857” might be compressed to “0.[142857]*3.”
▪ To do classification by compression, we first lump together all the spam training messages and
compress them as a unit.
▪ We do the same for the ham. Then when given a new message to classify, we append it to the spam
messages and compress the result.
▪ We also append it to the ham and compress that. Whichever class compresses better—adds the smaller number of additional bytes for the new message—is the predicted class.
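▪ A rough sketch of this idea using Python's zlib module as the lossless compressor (the toy corpora are illustrative; any off-the-shelf compression algorithm could be substituted):

```python
import zlib

def compressed_size(text):
    """Size in bytes of the losslessly compressed text."""
    return len(zlib.compress(text.encode("utf-8")))

def classify_by_compression(message, spam_corpus, ham_corpus):
    """Predict the class whose compressed corpus grows least when the message is appended."""
    spam_cost = compressed_size(spam_corpus + message) - compressed_size(spam_corpus)
    ham_cost = compressed_size(ham_corpus + message) - compressed_size(ham_corpus)
    return "spam" if spam_cost < ham_cost else "ham"

# Toy usage; in practice the corpora are all the training messages lumped together.
spam_corpus = "win free money now claim your prize " * 50
ham_corpus = "meeting agenda project report schedule review " * 50
print(classify_by_compression("claim your free money now", spam_corpus, ham_corpus))   # -> spam (likely)
```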
3.2.3. INFORMATION RETRIEVAL :
▪ Information retrieval is the task of finding documents that are relevant to a user’s need for information.
▪ The best-known examples of information retrieval systems are search engines on the World Wide
Web.
▪ A Web user can type a query such as “AI book” into a search engine and see a list of relevant pages.
▪ Early IR systems used a Boolean keyword model: each word in the document collection is treated as a Boolean feature that is true of a document exactly when the word occurs in it, and a query is a Boolean expression over those features. This model has several drawbacks:
First, the degree of relevance of a document is a single bit, so there is no guidance as to how to order the relevant documents for presentation.
Second, Boolean expressions are unfamiliar to users who are not programmers or logicians.
Third, it can be hard to formulate an appropriate query, even for a skilled user.
(ii) IR evaluation :
▪ Imagine that an IR system has returned a result set for a single query for which we know which documents are and are not relevant: the result set contains 30 relevant and 10 irrelevant documents, while 20 relevant documents in the collection were not returned.
▪ Precision measures the proportion of the documents in the result set that are actually relevant.
▪ In our example, the precision is 30/(30 + 10) = .75. The false positive rate is 1 - .75 = .25.
▪ Recall measures the proportion of all the relevant documents in the collection that are in the result
set.
▪ In our example, recall is 30/(30 + 20) = .60. The false negative rate is 1 - .60 = .40.
▪ In a very large document collection, such as the World Wide Web, recall is difficult to compute,
because there is no easy way to examine every page on the Web for relevance.
▪ All we can do is either estimate recall by sampling or ignore recall completely and just judge
precision.
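▪ Both measures follow directly from the counts in the example above (the function name is illustrative):

```python
def precision_recall(relevant_returned, irrelevant_returned, relevant_missed):
    """Precision and recall of a result set, given the relevant/irrelevant counts."""
    precision = relevant_returned / (relevant_returned + irrelevant_returned)
    recall = relevant_returned / (relevant_returned + relevant_missed)
    return precision, recall

# The example from the text: 30 relevant and 10 irrelevant documents in the result set,
# and 20 relevant documents in the collection that were not returned.
print(precision_recall(30, 10, 20))   # -> (0.75, 0.6)
```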
(iii) IR refinements :
▪ There are many possible refinements to the system described here, and indeed Web search engines
are continually updating their algorithms as they discover new approaches and as the Web grows and
changes.
▪ One common refinement is a better model of the effect of document length on relevance.
▪ Singhal et al. (1996) observed that simple document length normalization schemes tend to favor
short documents too much and long documents not enough.
▪ They propose a pivoted document length normalization scheme; the idea is that the pivot is the
document length at which the old-style normalization is correct; documents shorter than that get a
boost and longer ones get a penalty.
▪ The BM25 scoring function uses a word model that treats all words as completely independent, but
we know that some words are correlated.
▪ Many IR systems attempt to account for these correlations.
▪ One such refinement is case folding and stemming: converting words to lowercase and reducing them to a common stem, for example treating “couches” as an instance of “couch.”
▪ The next step is to recognize synonyms, such as “sofa” for “couch.” As with stemming, this has the potential for small gains in recall, but can hurt precision.
▪ As a final refinement, IR can be improved by considering metadata—data outside of the text of the
document. Examples include human-supplied keywords and publication data.
▪ On the Web, hypertext links between documents are a crucial source of information.
▪ The PageRank algorithm counts the links that point in to a page, weighting each in-link by the importance (PageRank) of the page it comes from:
PR(p) = (1 − d)/N + d Σ i PR(ini) / C(ini)
▪ where PR(p) is the PageRank of page p, N is the total number of pages in the corpus, ini are the pages that link in to p, and C(ini) is the count of the total number of out-links on page ini.
▪ The constant d is a damping factor. It can be understood through the random surfer model: imagine a Web surfer who starts at some random page and begins exploring.
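▪ A small iterative sketch of the PageRank computation under the formula above; the link graph, damping factor, and iteration count are illustrative assumptions (real implementations also handle pages with no out-links):

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PR(p) = (1 - d)/N + d * sum over in-links of PR(in)/C(in).

    `links` maps each page to the list of pages it links out to.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum contributions from every page that links in to p.
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) / n + d * incoming
        pr = new_pr
    return pr

# Toy link graph: A and C both link to B, so B ends up with the highest rank.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]}))
```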
3.2.4. INFORMATION EXTRACTION :
▪ Information extraction is the process of acquiring knowledge by skimming a text and looking for
occurrences of a particular class of object and for relationships among objects.
▪ A typical task is to extract instances of addresses from Web pages, with database fields for street,
city, state, and zip code; or instances of storms from weather reports, with fields for temperature,
wind speed, and precipitation.
▪ In a limited domain, this can be done with high accuracy. As the domain gets more general, more
complex linguistic models and more complex learning techniques are necessary.
▪ The simplest kind of extraction system is an attribute-based system, which assumes that the entire text refers to a single object and tries to extract attributes of that object, such as a price, using regular-expression templates.
▪ Templates are often defined with three parts: a prefix regex, a target regex, and a postfix regex.
▪ For prices, the target regex would match the price itself (for example, a dollar sign followed by digits), the prefix would look for strings such as “price:” and the postfix could be empty.
▪ The idea is that some clues about an attribute come from the attribute value itself and some come
from the surrounding text.
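▪ A sketch of such a template using Python regular expressions; the exact prefix and target patterns here are hypothetical examples, not a standard template:

```python
import re

# Hypothetical price template: prefix, target, and (empty) postfix regexes joined into one pattern.
PREFIX = r"(?:price|our price|cost)\s*[:=]?\s*"
TARGET = r"\$\d+(?:\.\d\d)?"
POSTFIX = r""                         # empty postfix: nothing is required after the price

PRICE_TEMPLATE = re.compile(PREFIX + "(" + TARGET + ")" + POSTFIX, re.IGNORECASE)

def extract_price(text):
    """Return the first price matched by the template, or None if there is no match."""
    match = PRICE_TEMPLATE.search(text)
    return match.group(1) if match else None

print(extract_price("our price: $249.99"))        # -> $249.99
print(extract_price("no price mentioned here"))   # -> None
```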
▪ One step up from attribute-based extraction systems are relational extraction systems, which deal
with multiple objects and the relations among them.
▪ Thus, when these systems see the text “$249.99,” they need to determine not just that it is a price, but
also which object has that price.
▪ A typical relational-based extraction system is FASTUS, which handles news stories about corporate
mergers and acquisitions.
▪ A relational extraction system can be built as a series of cascaded finite-state transducers.
▪ That is, the system consists of a series of small, efficient finite-state automata (FSAs), where each
automaton receives text as input, transduces the text into a different format, and passes it along to the
next automaton.
▪ FASTUS consists of five stages:
1. Tokenization
2. Complex-word handling
3. Basic-group handling
4. Complex-phrase handling
5. Structure merging
1. FASTUS’s first stage is tokenization, which segments the stream of characters into tokens (words,
numbers, and punctuation). Some tokenizers also deal with markup languages such as HTML,
SGML, and XML.
2. The second stage handles complex words, including collocations such as “set up” and “joint
venture,” as well as proper names such as “Bridgestone Sports Co.”
3. The third stage handles basic groups, meaning noun groups and verb groups. The idea is to chunk
these into units that will be managed by the later stages.
4. The fourth stage combines the basic groups into complex phrases.
5. The final stage merges structures that were built up in the previous step.
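▪ The cascade idea (each stage consumes the previous stage's output and passes a transformed version along) can be sketched for the first two FASTUS-style stages; the token pattern and collocation list below are illustrative, not FASTUS's actual rules:

```python
import re

def tokenize(text):
    """Stage 1: segment the character stream into word, number, and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def merge_complex_words(tokens):
    """Stage 2: merge known collocations such as 'set up' and 'joint venture' into single tokens."""
    collocations = {("set", "up"), ("joint", "venture")}
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i].lower(), tokens[i + 1].lower()) in collocations:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def cascade(text, stages):
    """Pass the text through each stage in turn, as in a cascade of transducers."""
    result = text
    for stage in stages:
        result = stage(result)
    return result

print(cascade("Bridgestone Sports Co. said it will set up a joint venture.",
              [tokenize, merge_complex_words]))
```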