

ARTIFICIAL INTELLIGENCE

UNIT – III

Chapter – 2

Natural Language Processing

3.2.1. LANGUAGE MODELS :

▪ Formal languages, such as the programming languages Java or Python, have precisely defined
language models.
▪ A language can be defined as a set of strings; “print(2 + 2)” is a legal program in the language
Python, whereas “2)+(2 print” is not.
▪ Since there are an infinite number of legal programs, they cannot be enumerated; instead they are
specified by a set of rules called a grammar.
▪ Formal languages also have rules that define the meaning or semantics of a program; for example,
the rules say that the “meaning” of “2 + 2” is 4, and the meaning of “1/0” is that an error is signaled.
▪ Natural languages, such as English or Spanish, cannot be characterized as a definitive set of
sentences.
▪ Everyone agrees that “Not to be invited is sad” is a sentence of English, but people disagree on the
grammaticality of “To be not invited is sad.”
▪ Therefore, it is more fruitful to define a natural language model as a probability distribution over
sentences rather than a definitive set: we write P(S = words) for the probability that a sentence S
consists of a particular sequence of words.
▪ Natural languages are also ambiguous: we cannot speak of a single meaning for a sentence, but
rather of a probability distribution over possible meanings.
▪ Finally, natural languages are difficult to deal with because they are very large, and constantly
changing.
▪ Thus, our language models are, at best, an approximation. We start with the simplest possible
approximations and move up from there.

(i) N-gram character models :


▪ Ultimately, a written text is composed of characters—letters, digits, punctuation, and spaces in
English (and more exotic characters in some other languages).
▪ Thus, one of the simplest language models is a probability distribution over sequences of characters.
▪ We write P(C1:N) for the probability of a sequence of N characters, C1 through CN.
▪ In one Web collection, P(“the”) = 0.027 and P(“zgq”) = 0.000000002.
▪ A sequence of written symbols of length n is called an n-gram (from the Greek root for writing or
letters), with special case “unigram” for 1-gram, “bigram” for 2-gram, and “trigram” for 3-gram.
▪ A model of the probability distribution of n-letter sequences is thus called an n-gram model. (But
be careful: we can have n-gram models over sequences of words, syllables, or other units; not just
over characters.)
▪ An n-gram model is defined as a Markov chain of order n - 1.
▪ In a Markov chain of order n - 1, the probability of character ci depends only on the n - 1 immediately
preceding characters, not on any other characters.
▪ So in a trigram model (Markov chain of order 2) we have :

P(ci | c1:i-1) = P(ci | ci-2:i-1)
▪ We can define the probability of a sequence of characters P(c1:N) under the trigram model by first
factoring with the chain rule and then using the Markov assumption:

P(c1:N) = Π(i=1 to N) P(ci | c1:i-1) = Π(i=1 to N) P(ci | ci-2:i-1)
▪ We call a body of text a corpus (plural corpora), from the Latin word for body.
▪ What can we do with n-gram character models? One task for which they are well suited is language
identification .
▪ For Example, given a text, determine what natural language it is written in.
▪ This is a relatively easy task; even with short texts such as “Hello, world” or “Wie geht es dir,” it is
easy to identify the first as English and the second as German.
▪ Computer systems identify languages with greater than 99% accuracy; occasionally, closely related
languages, such as Swedish and Norwegian, are confused.
▪ One approach to language identification is to first build a trigram character model P(ci | ci-2:i-1, L)
of each candidate language, where the variable L ranges over languages.
▪ That gives us a model of P(Text | Language), but we want to select the most probable language given
the text, so we apply Bayes' rule followed by the Markov assumption to get the most probable
language:

L* = argmax over L of P(L | c1:N)
   = argmax over L of P(L) P(c1:N | L)
   = argmax over L of P(L) Π(i=1 to N) P(ci | ci-2:i-1, L)
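▪ A sketch of this approach in Python, reusing the char_ngram_counts function from the earlier sketch;
the add-one smoothing and the assumed alphabet of roughly 256 characters are illustrative choices
made to avoid zero probabilities:

```python
import math

def identify_language(text, models, priors, n=3):
    """argmax over languages L of P(L) * product of P(ci | ci-2:i-1, L), in log space."""
    best_lang, best_score = None, float("-inf")
    padded = " " * (n - 1) + text
    for lang, (counts, ctx) in models.items():
        score = math.log(priors[lang])
        for i in range(n - 1, len(padded)):
            context, c = padded[i - n + 1:i], padded[i]
            num = counts.get((context, c), 0) + 1     # add-one smoothing
            den = ctx.get(context, 0) + 256           # assumed alphabet size
            score += math.log(num / den)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

models = {"en": char_ngram_counts("hello world how are you"),
          "de": char_ngram_counts("hallo welt wie geht es dir")}
print(identify_language("wie geht es", models, priors={"en": 0.5, "de": 0.5}))
```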

(ii) Smoothing n-gram models :


▪ The major complication of n-gram models is that the training corpus provides only an estimate of the
true probability distribution.
▪ For common character sequences such as " th" any English corpus will give a good estimate: about
1.5% of all trigrams. On the other hand, " ht" is very uncommon; no dictionary word starts with ht, so
its count in a standard training corpus is likely to be zero.
▪ Assigning such sequences probability zero would be a mistake, because then any text that happened
to contain one of them would itself receive probability zero under the model.
▪ The process of adjusting the probability of low-frequency (and zero-frequency) counts is called smoothing.
▪ The simplest type of smoothing was suggested by Pierre-Simon Laplace in the 18th century: he said
that, in the absence of further information, if a random Boolean variable X has been false in all n
observations so far then the estimate for P(X = true) should be 1/(n+2).
▪ That is, he assumes that with two more trials, one might be true and one false. Laplace smoothing
(also called add-one smoothing) is a step in the right direction, but performs relatively poorly.
▪ A better approach is a backoff model, in which we start by estimating n-gram counts, but for any
particular sequence that has a low (or zero) count, we back off to (n - 1)-grams.
▪ Linear interpolation smoothing is a backoff model that combines trigram, bigram, and unigram
models by linear interpolation. It defines the probability estimate as :

P-hat(ci | ci-2:i-1) = λ3 P(ci | ci-2:i-1) + λ2 P(ci | ci-1) + λ1 P(ci)
▪ where λ3 + λ2 + λ1 = 1. The parameter values λi can be fixed, or they can be trained with an
expectation–maximization algorithm.
▪ It is also possible to have the values of λi depend on the counts: if we have a high count of trigrams,
then we weigh them relatively more; if only a low count, then we put more weight on the bigram and
unigram models.
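▪ A sketch of the interpolated estimate in Python, assuming the unigram, bigram, and trigram
probability tables have already been computed from counts; the λ values shown are placeholders that
would normally be trained on held-out data or with EM:

```python
def interpolated_prob(c, context, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """P-hat(ci | ci-2:i-1) = l3*P(ci | ci-2:i-1) + l2*P(ci | ci-1) + l1*P(ci).
    uni, bi, tri are dicts mapping n-grams to maximum-likelihood probabilities."""
    l1, l2, l3 = lambdas                      # must sum to 1
    p_uni = uni.get(c, 0.0)
    p_bi = bi.get((context[-1:], c), 0.0)
    p_tri = tri.get((context[-2:], c), 0.0)
    return l3 * p_tri + l2 * p_bi + l1 * p_uni
```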
(iii) Model evaluation :
▪ With so many possible n-gram models—unigram, bigram, trigram, interpolated smoothing with
different values of λ, etc.—how do we know what model to choose? We can evaluate a model with
cross-validation.
▪ Split the corpus into a training corpus and a validation corpus. Determine the parameters of the
model from the training data. Then evaluate the model on the validation corpus.
▪ The evaluation can be a task-specific metric, such as measuring accuracy on language identification.
▪ Alternatively we can have a task-independent model of language quality: calculate the probability
assigned to the validation corpus by the model; the higher the probability the better.
▪ This metric is inconvenient because the probability of a large corpus will be a very small number,
and floating-point underflow becomes an issue.
▪ A different way of describing the probability of a sequence is with a measure called perplexity,
defined as :

Perplexity(c1:N) = P(c1:N) ^ (-1/N)
▪ Perplexity can be thought of as the reciprocal of probability, normalized by sequence length.


▪ It can also be thought of as the weighted average branching factor of a model. Suppose there are 100
characters in our language, and our model says they are all equally likely. Then for a sequence of any
length, the perplexity will be 100.
▪ If some characters are more likely than others, and the model reflects that, then the model will have a
perplexity less than 100.
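▪ A short Python sketch of the perplexity computation; it works in log space precisely to avoid the
floating-point underflow mentioned above, and the add-one smoothing over an assumed 256-character
alphabet is only illustrative:

```python
import math

def perplexity(text, counts, ctx, n=3):
    """Perplexity(c1:N) = P(c1:N) ** (-1/N), computed via log probabilities."""
    padded = " " * (n - 1) + text
    log_prob, N = 0.0, len(text)
    for i in range(n - 1, len(padded)):
        context, c = padded[i - n + 1:i], padded[i]
        num = counts.get((context, c), 0) + 1        # add-one smoothing
        den = ctx.get(context, 0) + 256              # assumed alphabet size
        log_prob += math.log(num / den)
    return math.exp(-log_prob / N)
```

▪ As a check against the branching-factor intuition above: if a model assigns each of 100 characters
probability 1/100, every term contributes log(1/100), so the perplexity is exp(log 100) = 100 regardless
of the sequence length.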

(iv) N-gram word models :


▪ Now we turn to n-gram models over words rather than characters.
▪ All the same mechanism applies equally to word and character models. The main difference is that
the vocabulary—the set of symbols that make up the corpus and the model—is larger.
▪ There are only about 100 characters in most languages, and sometimes we build character models
that are even more restrictive, for example by treating “A” and “a” as the same symbol or by treating
all punctuation as the same symbol.
▪ But with word models we have at least tens of thousands of symbols, and sometimes millions.
▪ The wide range is because it is not clear what constitutes a word.
▪ In English a sequence of letters surrounded by spaces is a word, but in some languages, like Chinese,
words are not separated by spaces, and even in English many decisions must be made to have a clear
policy on word boundaries: for example, how many words are in a hyphenated expression such as
"ne'er-do-well"?
▪ Word n-gram models need to deal with out of vocabulary words.
▪ With character models, we didn’t have to worry about someone inventing a new letter of the
alphabet.
▪ But with word models there is always the chance of a new word that was not seen in the training
corpus, so we need to model that explicitly in our language model.
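▪ One common way to model unknown words explicitly is to replace rare training words with a special
<UNK> token, so that unseen words at test time can be scored as <UNK>; a minimal sketch (the
min_count threshold is an arbitrary illustrative choice):

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    """Keep only words seen at least min_count times; everything else becomes <UNK>."""
    freq = Counter(tokens)
    return {w for w, c in freq.items() if c >= min_count}

def map_oov(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train, min_count=2)
print(map_oov("the dog sat".split(), vocab))   # ['the', '<UNK>', '<UNK>']
```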

3.2.2. TEXT CLASSIFICATION :

▪ We now consider in depth the task of text classification, also known as categorization: given a text
of some kind, decide which of a predefined set of classes it belongs to.
▪ Language identification and genre classification are examples of text classification.
▪ Another example is spam detection: classifying an email message as spam or not-spam (also called ham).
▪ A training set is readily available: the positive (spam) examples are in my spam folder, the negative
(ham) examples are in my inbox.
▪ Note that we have two complementary ways of talking about classification.
▪ In the language-modeling approach, we define one n-gram language model for P(Message | spam)
by training on the spam folder, and one model for P(Message | ham) by training on the inbox.
▪ Then we can classify a new message with an application of Bayes' rule:

argmax over c in {spam, ham} of P(c | message) = argmax over c of P(message | c) P(c)
▪ where P (c) is estimated just by counting the total number of spam and ham messages. This approach
works well for spam detection, just as it did for language identification.
▪ In the machine-learning approach we instead represent the message as a set of feature/value pairs
(for example, the unigram words with their counts) and apply a classification algorithm to the
resulting feature vector.
▪ If there are 100,000 words in the language model, then the feature vector has length 100,000, but for
a short email message almost all the features will have count zero.
▪ This unigram representation has been called the bag of words model.
▪ You can think of the model as putting the words of the training corpus in a bag and then selecting
words one at a time.
▪ The notion of order of the words is lost; a unigram model gives the same probability to any
permutation of a text.
▪ Higher-order n-gram models maintain some local notion of word order.
▪ It can be expensive to run algorithms on a very large feature vector, so often a process of feature
selection is used to keep only the features that best discriminate between spam and ham.
▪ Once we have chosen a set of features, we can apply any of the supervised learning techniques we
have seen; popular ones for text categorization include k-nearest-neighbors, support vector machines,
decision trees, naive Bayes, and logistic regression.
▪ All of these have been applied to spam detection, usually with accuracy in the 98%–99% range.
With a carefully designed feature set, accuracy can exceed 99.9%.
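▪ A minimal bag-of-words naive Bayes spam classifier in Python, combining the language-modeling
view (one unigram model per class) with add-one smoothing; the tiny training messages are purely
illustrative:

```python
import math
from collections import Counter

class NaiveBayesSpam:
    """Bag-of-words naive Bayes classifier with add-one (Laplace) smoothing."""

    def fit(self, messages, labels):
        self.priors = Counter(labels)
        self.word_counts = {c: Counter() for c in self.priors}
        for text, c in zip(messages, labels):
            self.word_counts[c].update(text.lower().split())
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total = sum(self.priors.values())
        best, best_score = None, float("-inf")
        for c in self.priors:
            score = math.log(self.priors[c] / total)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for w in text.lower().split():
                score += math.log((self.word_counts[c][w] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best

clf = NaiveBayesSpam().fit(["win money now", "meeting at noon"], ["spam", "ham"])
print(clf.predict("win a prize now"))   # 'spam'
```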
(i) Classification by data compression :
▪ Another way to think about classification is as a problem in data compression.
▪ A lossless compression algorithm takes a sequence of symbols, detects repeated patterns in it, and
writes a description of the sequence that is more compact than the original.
▪ For example, the text “0.142857142857142857” might be compressed to “0.[142857]*3.”
▪ To do classification by compression, we first lump together all the spam training messages and
compress them as a unit.
▪ We do the same for the ham. Then when given a new message to classify, we append it to the spam
messages and compress the result.
▪ We also append it to the ham and compress that. Whichever class compresses better (adds the
smaller number of additional bytes for the new message) is the predicted class.
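▪ A sketch of this idea using Python's zlib module as the off-the-shelf compressor (any standard
lossless compression algorithm could play the same role):

```python
import zlib

def classify_by_compression(message, spam_corpus, ham_corpus):
    """Append the new message to each class's training text; predict the class whose
    compressed size grows by the smaller number of additional bytes."""
    def extra_bytes(corpus):
        base = len(zlib.compress(corpus.encode()))
        combined = len(zlib.compress((corpus + "\n" + message).encode()))
        return combined - base
    return "spam" if extra_bytes(spam_corpus) < extra_bytes(ham_corpus) else "ham"
```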

3.2.3. INFORMATION RETRIEVAL :

▪ Information retrieval is the task of finding documents that are relevant to a user’s need for
information.
▪ The best-known examples of information retrieval systems are search engines on the World Wide
Web.
▪ A Web user can type a query such as “AI book” into a search engine and see a list of relevant pages.

▪ An information retrieval (henceforth IR) system can be characterized by :


 A corpus of documents. Each system must decide what it wants to treat as a document: a paragraph,
a page, or a multipage text.
 Queries posed in a query language. A query specifies what the user wants to know. The query
language can be just a list of words, such as [AI book]; or it can specify a phrase of words that must
be adjacent, as in [“AI book”]; it can contain Boolean operators as in [AI AND book]; it can include
non-Boolean operators such as [AI NEAR book].
 A result set. This is the subset of documents that the IR system judges to be relevant to the query.
 A presentation of the result set. This can be as simple as a ranked list of document titles or as
complex as a rotating color map of the result set projected onto a three-dimensional space, rendered
as a two-dimensional display.
▪ The earliest IR systems worked on a Boolean keyword model. Each word in the document
collection is treated as a Boolean feature that is true of a document if the word occurs in the
document and false if it does not.
▪ The query language is the language of Boolean expressions over features. A document is relevant
only if the expression evaluates to true.
▪ This model has the advantage of being simple to explain and implement.
▪ However, it has some disadvantages:

 First, the degree of relevance of a document is a single bit, so there is no guidance as to how to order
the relevant documents for presentation.
 Second, Boolean expressions are unfamiliar to users who are not programmers or logicians.
 Third, it can be hard to formulate an appropriate query, even for a skilled user.

(i) IR scoring functions :


▪ Most IR systems have abandoned the Boolean model and use models based on the statistics of word
counts. We describe the BM25 scoring function.
▪ A scoring function takes a document and a query and returns a numeric score; the most relevant
documents have the highest scores.
▪ In the BM25 function, the score is a linear weighted combination of scores for each of the words that
make up the query.
▪ Three factors affect the weight of a query term:
 First, the frequency with which a query term appears in a document (also known as TF for term
frequency). For a query containing the word "farming", documents that mention "farming" frequently
will have higher scores.
 Second, the inverse document frequency of the term, or IDF. The word “in” appears in almost every
document, so it has a high document frequency, and thus a low inverse document frequency, and
thus it is not as important to the query.
 Third, the length of the document. A million-word document will probably mention all the query
words, but may not actually be about the query. A short document that mentions all the words is a
much better candidate.
▪ The BM25 function takes all three of these into account.
▪ Then, given a document dj and a query consisting of the words q1:N, we have :

BM25(dj, q1:N) = Σ(i=1 to N) IDF(qi) · TF(qi, dj) · (k + 1) / (TF(qi, dj) + k · (1 - b + b · |dj| / L))

where TF(qi, dj) is the count of word qi in document dj, |dj| is the length of dj in words, L is the
average document length in the corpus, and k and b are tuned parameters (typical values are
k = 2.0 and b = 0.75).
▪ IDF(qi) is the inverse document frequency of word qi, given by :

IDF(qi) = log( (N - DF(qi) + 0.5) / (DF(qi) + 0.5) )

where N is the total number of documents in the corpus and DF(qi) is the number of documents that
contain the word qi.
(ii) IR system evaluation :


▪ How do we know whether an IR system is performing well? We undertake an experiment in which
the system is given a set of queries and the result sets are scored with respect to human relevance
judgments.
▪ Traditionally, there have been two measures used in the scoring:
 recall
 precision.
▪ Precision measures the proportion of documents in the result set that are actually relevant.
▪ For example, suppose a query over a collection of 100 documents returns a result set of 40 documents,
30 of which are relevant, while 20 relevant documents in the collection were missed.
▪ In our example, the precision is 30/(30 + 10) = 0.75. The false positive rate is 1 - 0.75 = 0.25.
▪ Recall measures the proportion of all the relevant documents in the collection that are in the result
set.
▪ In our example, recall is 30/(30 + 20) = 0.60. The false negative rate is 1 - 0.60 = 0.40.
▪ In a very large document collection, such as the World Wide Web, recall is difficult to compute,
because there is no easy way to examine every page on the Web for relevance.
▪ All we can do is either estimate recall by sampling or ignore recall completely and just judge
precision.
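▪ A small Python sketch computing both measures for the example above (the document identifiers
here are made up):

```python
def precision_recall(result_set, relevant):
    """Precision and recall of a result set, given the set of truly relevant documents."""
    result_set, relevant = set(result_set), set(relevant)
    true_pos = len(result_set & relevant)
    precision = true_pos / len(result_set) if result_set else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall

# The example above: 40 documents returned, 30 of them relevant, 20 relevant documents missed.
returned = [f"d{i}" for i in range(40)]
relevant = [f"d{i}" for i in range(30)] + [f"m{i}" for i in range(20)]
print(precision_recall(returned, relevant))   # (0.75, 0.6)
```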

(iii) IR refinements :
▪ There are many possible refinements to the system described here, and indeed Web search engines
are continually updating their algorithms as they discover new approaches and as the Web grows and
changes.
▪ One common refinement is a better model of the effect of document length on relevance.
▪ Singhal et al. (1996) observed that simple document length normalization schemes tend to favor
short documents too much and long documents not enough.
▪ They propose a pivoted document length normalization scheme; the idea is that the pivot is the
document length at which the old-style normalization is correct; documents shorter than that get a
boost and longer ones get a penalty.
▪ The BM25 scoring function uses a word model that treats all words as completely independent, but
we know that some words are correlated.
▪ Many IR systems attempt to account for these correlations.
▪ The next step is to recognize synonyms, such as "sofa" for "couch." As with stemming (reducing
words such as "couches" to a common stem "couch"), this has the potential for small gains in recall,
but can hurt precision.
▪ As a final refinement, IR can be improved by considering metadata—data outside of the text of the
document. Examples include human-supplied keywords and publication data.
▪ On the Web, hypertext links between documents are a crucial source of information.

(iv) The PageRank algorithm :


▪ PageRank was one of the two original ideas that set Google’s search apart from other Web search
engines when it was introduced in 1997. (The other innovation was the use of anchor text—the
underlined text in a hyperlink).
▪ PageRank was invented to solve the problem of the tyranny of TF scores: if the query is [IBM],
how do we make sure that IBM’s home page, ibm.com, is the first result, even if another page
mentions the term “IBM” more frequently?
▪ The idea is that ibm.com has many in-links (links to the page), so it should be ranked higher: each
in-link is a vote for the quality of the linked-to page.
▪ But if we only counted in-links, then it would be possible for a Web spammer to create a network of
pages and have them all point to a page of his choosing, increasing the score of that page.
▪ Therefore, the PageRank algorithm is designed to weight links from high-quality sites more heavily.
▪ What is a high-quality site? One that is linked to by other high-quality sites.
▪ The definition is recursive, but we will see that the recursion bottoms out properly. The PageRank
for a page p is defined as:

PR(p) = (1 - d)/N + d · Σi PR(ini) / C(ini)

▪ where PR(p) is the PageRank of page p, N is the total number of pages in the corpus, ini are the
pages that link in to p, and C(ini) is the count of the total number of out-links on page ini.
▪ The constant d is a damping factor. It can be understood through the random surfer model :
imagine a Web surfer who starts at some random page and begins exploring. With probability d the
surfer follows one of the links on the current page, chosen at random; with probability 1 - d she gets
bored and jumps to a random page anywhere on the Web. The PageRank of a page is then proportional
to the fraction of time the random surfer would spend on that page.
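▪ A minimal iterative implementation of the formula above in Python; the link graph and the choice
d = 0.85 are illustrative, and pages with no out-links are simply ignored for brevity:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PR(p) = (1 - d)/N + d * sum(PR(q)/C(q) for pages q linking to p).
    links maps each page to the list of pages it links out to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        new_pr = {p: (1 - d) / N for p in pages}
        for q, outs in links.items():
            if outs:
                share = d * pr[q] / len(outs)   # q's rank is split among its out-links
                for p in outs:
                    new_pr[p] += share
        pr = new_pr
    return pr

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```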

(v) The HITS algorithm :


▪ The Hyperlink-Induced Topic Search algorithm, also known as “Hubs and Authorities” or HITS, is
another influential link-analysis algorithm .
▪ HITS differs from PageRank in several ways.
▪ First, it is a query-dependent measure: it rates pages with respect to a query.
▪ Given a query, HITS first finds a set of pages that are relevant to the query. It does that by
intersecting hit lists of query words, and then adding pages in the link neighborhood of these pages.
▪ Each page in this set is scored as an authority to the degree that other pages in the relevant set point
to it, and as a hub to the degree that it points to authoritative pages in the relevant set; the two scores
are computed iteratively until they converge.
▪ Both PageRank and HITS played important roles in developing our understanding of Web
information retrieval.
▪ These algorithms and their extensions are used in ranking billions of queries daily as search engines
steadily develop better ways of extracting yet finer signals of search relevance.
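▪ A sketch of the iterative hub/authority computation on the query-relevant subgraph; the
normalization step keeps the scores from growing without bound:

```python
import math

def hits(links, iterations=50):
    """Hub and authority scores for a query-relevant subgraph.
    links maps each page to the pages it links to within the subgraph."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs) for p in pages}
        # A page's hub score is the sum of the authority scores of the pages it links to.
        hub = {q: sum(auth[p] for p in links.get(q, [])) for q in pages}
        # Normalize both score vectors.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth
```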
(vi) Question answering :
▪ Information retrieval is the task of finding documents that are relevant to a query, where the query
may be a question, or just a topic area or concept.
▪ Question answering is a somewhat different task, in which the query really is a question, and the
answer is not a ranked list of documents but rather a short response—a sentence, or even just a
phrase.
▪ There have been question-answering NLP (natural language processing) systems since the 1960s,
but only since 2001 have such systems used Web information retrieval to radically increase their
breadth of coverage.

3.2.4. INFORMATION EXTRACTION :

▪ Information extraction is the process of acquiring knowledge by skimming a text and looking for
occurrences of a particular class of object and for relationships among objects.
▪ A typical task is to extract instances of addresses from Web pages, with database fields for street,
city, state, and zip code; or instances of storms from weather reports, with fields for temperature,
wind speed, and precipitation.
▪ In a limited domain, this can be done with high accuracy. As the domain gets more general, more
complex linguistic models and more complex learning techniques are necessary.

(i) Finite-state automata for information extraction:


▪ The simplest type of information extraction system is an attribute-based extraction system that
assumes that the entire text refers to a single object and the task is to extract attributes of that object.
▪ For example, consider the problem of extracting from the text "IBM ThinkBook 970. Our price:
$399.00" the set of attributes {Manufacturer=IBM, Model=ThinkBook970, Price=$399.00}.
▪ We can address this problem by defining a template (also known as a pattern) for each attribute we
would like to extract. The template is defined by a finite state automaton, the simplest example of
which is the regular expression, or regex.
▪ Here we show how to build up a regular expression template for prices in dollars:
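▪ A sketch of such a build-up in Python's re syntax; the exact character classes shown are one common
formulation, not the only possibility:

```python
import re

# Building up a target regex for prices in dollars, step by step:
#   [0-9]                        matches any single digit
#   [0-9]+                       one or more digits
#   [.][0-9][0-9]                a period followed by exactly two digits
#   ([.][0-9][0-9])?             the cents part is optional
#   [$][0-9]+([.][0-9][0-9])?    a dollar sign, digits, and optional cents
target = r"[$][0-9]+([.][0-9][0-9])?"

# A full template with prefix, target, and (empty) postfix parts; the prefix looks for "price:".
template = re.compile(r"(price:\s*)(" + target + r")", re.IGNORECASE)
match = template.search("IBM ThinkBook 970. Our price: $399.00")
if match:
    print(match.group(2))   # $399.00
```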

▪ Templates are often defined with three parts: a prefix regex, a target regex, and a postfix regex.
▪ For prices, the target regex is as above, the prefix would look for strings such as “price:” and the
postfix could be empty.
▪ The idea is that some clues about an attribute come from the attribute value itself and some come
from the surrounding text.
▪ One step up from attribute-based extraction systems are relational extraction systems, which deal
with multiple objects and the relations among them.
▪ Thus, when these systems see the text “$249.99,” they need to determine not just that it is a price, but
also which object has that price.
▪ A typical relational-based extraction system is FASTUS, which handles news stories about corporate
mergers and acquisitions.
▪ A relational extraction system can be built as a series of cascaded finite-state transducers.
▪ That is, the system consists of a series of small, efficient finite-state automata (FSAs), where each
automaton receives text as input, transduces the text into a different format, and passes it along to the
next automaton.
▪ FASTUS consists of five stages:

1. Tokenization
2. Complex-word handling
3. Basic-group handling
4. Complex-phrase handling
5. Structure merging
1. FASTUS’s first stage is tokenization, which segments the stream of characters into tokens (words,
numbers, and punctuation). Some tokenizers also deal with markup languages such as HTML,
SGML, and XML.
2. The second stage handles complex words, including collocations such as “set up” and “joint
venture,” as well as proper names such as “Bridgestone Sports Co.”
3. The third stage handles basic groups, meaning noun groups and verb groups. The idea is to chunk
these into units that will be managed by the later stages.
4. The fourth stage combines the basic groups into complex phrases.
5. The final stage merges structures that were built up in the previous step.

(ii) Probabilistic models for information extraction:


▪ When information extraction must be attempted from noisy or varied input, simple finite-state
approaches fare poorly.
▪ It is too hard to get all the rules and their priorities right; it is better to use a probabilistic model
rather than a rule-based model.
▪ The simplest probabilistic model for sequences with hidden state is the hidden Markov model, or
HMM.
▪ An HMM models a progression through a sequence of hidden states, xt, with an observation et at
each step.
▪ To apply HMMs to information extraction, we can either build one big HMM for all the attributes or
build a separate HMM for each attribute. We’ll do the second.
▪ HMMs have two big advantages over FSAs for extraction.
 First, HMMs are probabilistic, and thus tolerant to noise.
 Second, HMMs can be trained from data; they don't require laborious engineering of
templates, and thus they can more easily be kept up to date as text changes over time.

(iii) Conditional random fields for information extraction :


▪ One issue with HMMs for the information extraction task is that they model a lot of probabilities
that we don't really need: an HMM is a generative model of the joint probability of the hidden states
and the observations, whereas all we need for extraction is the conditional probability of the hidden
states given the observations.
▪ Modeling this conditional distribution directly gives us some freedom. We don't need the
independence assumptions of the Markov model; we can have an xt that is dependent on x1.
▪ A framework for this type of model is the conditional random field, or CRF, which models a
conditional probability distribution of a set of target variables given a set of observed variables.
▪ Like Bayesian networks, CRFs can represent many different structures of dependencies among the
variables.
▪ One common structure is the linear-chain conditional random field for representing Markov
dependencies among variables in a temporal sequence.
▪ Thus, HMMs are the temporal version of naive Bayes models, and linear-chain CRFs are the
temporal version of logistic regression.

(iv) Ontology extraction from large corpora :


▪ So far we have thought of information extraction as finding a specific set of relations (e.g.,speaker,
time, location) in a specific text (e.g., a talk announcement).
▪ A different application of extraction technology is building a large knowledge base or ontology of
facts from a corpus.
▪ This is different in three ways:
▪ First, it is open-ended: we want to acquire facts about all types of domains, not just one specific
domain.
▪ Second, with a large corpus, this task is dominated by precision, not recall—just as with question
answering on the Web .
▪ Third, the results can be statistical aggregates gathered from multiple sources, rather than being
extracted from one specific text.

(v) Automated template construction :


▪ Fortunately, it is possible to learn templates from a few examples, then use the templates to learn
more examples, from which more templates can be learned, and so on.
▪ In one of the first experiments of this kind, Brin (1999) started with a data set of just five examples:

(“Isaac Asimov”, “The Robots of Dawn”)


(“David Brin”, “Startide Rising”)
(“James Gleick”, “Chaos—Making a New Science”)
(“Charles Dickens”, “Great Expectations”)
(“William Shakespeare”, “The Comedy of Errors”)
▪ Clearly these are examples of the author–title relation, but the learning system had no knowledge of
authors or titles.
▪ The words in these examples were used in a search over a Web corpus, resulting in 199 matches.
Each match is defined as a tuple of seven strings,

(Author, Title, Order, Prefix, Middle, Postfix, URL) ,


▪ where Order is true if the author came first and false if the title came first, Middle is the characters
between the author and title, Prefix is the 10 characters before the match, Postfix is the 10 characters
after the match, and URL is the Web address where the match was made.
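▪ A sketch of how one such match tuple might be assembled in Python; the page text, URL, and the
10-character window are all illustrative assumptions:

```python
def make_match_tuple(page_text, url, author, title, window=10):
    """Build the seven-string tuple (Author, Title, Order, Prefix, Middle, Postfix, URL)
    for one occurrence of an author and a title on a page."""
    a, t = page_text.find(author), page_text.find(title)
    if a < 0 or t < 0:
        return None
    order = a < t                                   # True if the author came first
    first, second = (a, author), (t, title)
    if not order:
        first, second = second, first
    start, end = first[0], second[0] + len(second[1])
    prefix = page_text[max(0, start - window):start]
    middle = page_text[first[0] + len(first[1]):second[0]]
    postfix = page_text[end:end + window]
    return (author, title, order, prefix, middle, postfix, url)

print(make_match_tuple("... by Isaac Asimov, author of The Robots of Dawn, was ...",
                       "http://example.com", "Isaac Asimov", "The Robots of Dawn"))
```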

(vi) Machine reading :


▪ Automated template construction is a big step up from handcrafted template construction, but it still
requires a handful of labeled examples of each relation to get started.
▪ To build a large ontology with many thousands of relations, even that amount of work would be
onerous; we would like to have an extraction system with no human input of any kind—a system that
could read on its own and build up its own database.
▪ Such a system would be relation-independent; it would work for any relation. In practice, these
systems work on all relations in parallel, because of the I/O demands of large corpora.
▪ They behave less like a traditional information extraction system that is targeted at a few relations
and more like a human reader who learns from the text itself; because of this the field has been called
machine reading.
