

ARTIFICIAL INTELLIGENCE

UNIT – III

Chapter – 2

Natural Language Processing

3.2.1. LANGUAGE MODELS :

▪ Formal languages, such as the programming languages Java or Python, have precisely defined
language models.
▪ A language can be defined as a set of strings; “print(2 + 2)” is a legal program in the language
Python, whereas “2)+(2 print” is not.
▪ Since there are an infinite number of legal programs, they cannot be enumerated; instead they are
specified by a set of rules called a grammar.
▪ Formal languages also have rules that define the meaning or semantics of a program; for example,
the rules say that the “meaning” of “2 + 2” is 4, and the meaning of “1/0” is that an error is signaled.
▪ Natural languages, such as English or Spanish, cannot be characterized as a definitive set of
sentences.
▪ Everyone agrees that “Not to be invited is sad” is a sentence of English, but people disagree on the
grammaticality of “To be not invited is sad.”
▪ Therefore, it is more fruitful to define a natural language model as a probability distribution over
sentences rather than a definitive set: we write P(S = words) for the probability that a sentence S
consists of a particular sequence of words.
▪ Natural languages are also ambiguous: we cannot speak of a single meaning for a sentence, but
rather of a probability distribution over possible meanings.
▪ Finally, natural languages are difficult to deal with because they are very large, and constantly
changing.
▪ Thus, our language models are, at best, an approximation. We start with the simplest possible
approximations and move up from there.

(i) N-gram character models :


▪ Ultimately, a written text is composed of characters—letters, digits, punctuation, and spaces in
English (and more exotic characters in some other languages).
▪ Thus, one of the simplest language models is a probability distribution over sequences of characters.
▪ We write P(C1:N) for the probability of a sequence of N characters, C1 through CN.
▪ In one Web collection, P(“the”) = 0.027 and P(“zgq”) = 0.000000002.
▪ A sequence of written symbols of length n is called an n-gram (from the Greek root for writing or
letters), with special case “unigram” for 1-gram, “bigram” for 2-gram, and “trigram” for 3-gram.
▪ A model of the probability distribution of n-letter sequences is thus called an n-gram model. (But
be careful: we can have n-gram models over sequences of words, syllables, or other units; not just
over characters.)
▪ An n-gram model is defined as a Markov chain of order n - 1.
▪ In a Markov chain of order n - 1, the probability of character ci depends only on the n - 1 immediately
preceding characters, not on any other characters.
▪ So in a trigram model (Markov chain of order 2) we have :

P(ci | c1:i-1) = P(ci | ci-2:i-1)
▪ We can define the probability of a sequence of characters P(c1:N) under the trigram model by first
factoring with the chain rule and then using the Markov assumption:

P(c1:N) = Π(i=1 to N) P(ci | c1:i-1) = Π(i=1 to N) P(ci | ci-2:i-1)
▪ We call a body of text a corpus (plural corpora), from the Latin word for body.
▪ What can we do with n-gram character models? One task for which they are well suited is language
identification .
▪ For Example, given a text, determine what natural language it is written in.
▪ This is a relatively easy task; even with short texts such as “Hello, world” or “Wie geht es dir,” it is
easy to identify the first as English and the second as German.
▪ Computer systems identify languages with greater than 99% accuracy; occasionally, closely related
languages, such as Swedish and Norwegian, are confused.
▪ One approach to language identification is to first build a trigram character model P(ci | ci-2:i-1, L)
of each candidate language, where the variable L ranges over languages.
▪ That gives us a model of P(Text | Language), but we want to select the most probable language given
the text, so we apply Bayes' rule followed by the Markov assumption to get the most probable
language:

L* = argmax over L of P(L | c1:N)
   = argmax over L of P(L) P(c1:N | L)
   = argmax over L of P(L) Π(i=1 to N) P(ci | ci-2:i-1, L)
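▪ A sketch of this approach in Python, reusing the char_ngram_counts function from the earlier sketch;
the add-one smoothing and the assumed alphabet of roughly 256 characters are illustrative choices
made to avoid zero probabilities:

```python
import math

def identify_language(text, models, priors, n=3):
    """argmax over languages L of P(L) * product of P(ci | ci-2:i-1, L), in log space."""
    best_lang, best_score = None, float("-inf")
    padded = " " * (n - 1) + text
    for lang, (counts, ctx) in models.items():
        score = math.log(priors[lang])
        for i in range(n - 1, len(padded)):
            context, c = padded[i - n + 1:i], padded[i]
            num = counts.get((context, c), 0) + 1     # add-one smoothing
            den = ctx.get(context, 0) + 256           # assumed alphabet size
            score += math.log(num / den)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

models = {"en": char_ngram_counts("hello world how are you"),
          "de": char_ngram_counts("hallo welt wie geht es dir")}
print(identify_language("wie geht es", models, priors={"en": 0.5, "de": 0.5}))
```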

(ii) Smoothing n-gram models :


▪ The major complication of n-gram models is that the training corpus provides only an estimate of the
true probability distribution.
▪ For common character sequences such as " th" any English corpus will give a good estimate: about
1.5% of all trigrams. On the other hand, " ht" is very uncommon; no dictionary word starts with ht, so
its count in a standard training corpus is likely to be zero.
▪ Assigning such sequences probability zero would be a mistake, because then any text that happened
to contain one of them would itself receive probability zero under the model.
▪ The process of adjusting the probability of low-frequency (and zero-frequency) counts is called smoothing.
▪ The simplest type of smoothing was suggested by Pierre-Simon Laplace in the 18th century: he said
that, in the absence of further information, if a random Boolean variable X has been false in all n
observations so far then the estimate for P(X = true) should be 1/(n+2).
▪ That is, he assumes that with two more trials, one might be true and one false. Laplace smoothing
(also called add-one smoothing) is a step in the right direction, but performs relatively poorly.
▪ A better approach is a backoff model, in which we start by estimating n-gram counts, but for any
particular sequence that has a low (or zero) count, we back off to (n - 1)-grams.
▪ Linear interpolation smoothing is a backoff model that combines trigram, bigram, and unigram
models by linear interpolation. It defines the probability estimate as :

P-hat(ci | ci-2:i-1) = λ3 P(ci | ci-2:i-1) + λ2 P(ci | ci-1) + λ1 P(ci)
▪ where λ3 + λ2 + λ1 = 1. The parameter values λi can be fixed, or they can be trained with an
expectation–maximization algorithm.
▪ It is also possible to have the values of λi depend on the counts: if we have a high count of trigrams,
then we weigh them relatively more; if only a low count, then we put more weight on the bigram and
unigram models.
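▪ A sketch of the interpolated estimate in Python, assuming the unigram, bigram, and trigram
probability tables have already been computed from counts; the λ values shown are placeholders that
would normally be trained on held-out data or with EM:

```python
def interpolated_prob(c, context, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """P-hat(ci | ci-2:i-1) = l3*P(ci | ci-2:i-1) + l2*P(ci | ci-1) + l1*P(ci).
    uni, bi, tri are dicts mapping n-grams to maximum-likelihood probabilities."""
    l1, l2, l3 = lambdas                      # must sum to 1
    p_uni = uni.get(c, 0.0)
    p_bi = bi.get((context[-1:], c), 0.0)
    p_tri = tri.get((context[-2:], c), 0.0)
    return l3 * p_tri + l2 * p_bi + l1 * p_uni
```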
(iii) Model evaluation :
▪ With so many possible n-gram models—unigram, bigram, trigram, interpolated smoothing with
different values of λ, etc.—how do we know what model to choose? We can evaluate a model with
cross-validation.
▪ Split the corpus into a training corpus and a validation corpus. Determine the parameters of the
model from the training data. Then evaluate the model on the validation corpus.
▪ The evaluation can be a task-specific metric, such as measuring accuracy on language identification.
▪ Alternatively we can have a task-independent model of language quality: calculate the probability
assigned to the validation corpus by the model; the higher the probability the better.
▪ This metric is inconvenient because the probability of a large corpus will be a very small number,
and floating-point underflow becomes an issue.
▪ A different way of describing the probability of a sequence is with a measure called perplexity,
defined as :

Perplexity(c1:N) = P(c1:N) ^ (-1/N)
▪ Perplexity can be thought of as the reciprocal of probability, normalized by sequence length.


▪ It can also be thought of as the weighted average branching factor of a model. Suppose there are 100
characters in our language, and our model says they are all equally likely. Then for a sequence of any
length, the perplexity will be 100.
▪ If some characters are more likely than others, and the model reflects that, then the model will have a
perplexity less than 100.
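▪ A short Python sketch of the perplexity computation; it works in log space precisely to avoid the
floating-point underflow mentioned above, and the add-one smoothing over an assumed 256-character
alphabet is only illustrative:

```python
import math

def perplexity(text, counts, ctx, n=3):
    """Perplexity(c1:N) = P(c1:N) ** (-1/N), computed via log probabilities."""
    padded = " " * (n - 1) + text
    log_prob, N = 0.0, len(text)
    for i in range(n - 1, len(padded)):
        context, c = padded[i - n + 1:i], padded[i]
        num = counts.get((context, c), 0) + 1        # add-one smoothing
        den = ctx.get(context, 0) + 256              # assumed alphabet size
        log_prob += math.log(num / den)
    return math.exp(-log_prob / N)
```

▪ As a check against the branching-factor intuition above: if a model assigns each of 100 characters
probability 1/100, every term contributes log(1/100), so the perplexity is exp(log 100) = 100 regardless
of the sequence length.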

(iv) N-gram word models :


▪ Now we turn to n-gram models over words rather than characters.
▪ All the same mechanism applies equally to word and character models. The main difference is that
the vocabulary—the set of symbols that make up the corpus and the model—is larger.
▪ There are only about 100 characters in most languages, and sometimes we build character models
that are even more restrictive, for example by treating “A” and “a” as the same symbol or by treating
all punctuation as the same symbol.
▪ But with word models we have at least tens of thousands of symbols, and sometimes millions.
▪ The wide range is because it is not clear what constitutes a word.
▪ In English a sequence of letters surrounded by spaces is a word, but in some languages, like Chinese,
words are not separated by spaces, and even in English many decisions must be made to have a clear
policy on word boundaries: for example, how many words are in a hyphenated expression such as
"ne'er-do-well"?
▪ Word n-gram models need to deal with out of vocabulary words.
▪ With character models, we didn’t have to worry about someone inventing a new letter of the
alphabet.
▪ But with word models there is always the chance of a new word that was not seen in the training
corpus, so we need to model that explicitly in our language model.
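▪ One common way to model unknown words explicitly is to replace rare training words with a special
<UNK> token, so that unseen words at test time can be scored as <UNK>; a minimal sketch (the
min_count threshold is an arbitrary illustrative choice):

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    """Keep only words seen at least min_count times; everything else becomes <UNK>."""
    freq = Counter(tokens)
    return {w for w, c in freq.items() if c >= min_count}

def map_oov(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train, min_count=2)
print(map_oov("the dog sat".split(), vocab))   # ['the', '<UNK>', '<UNK>']
```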

3.2.2. TEXT CLASSIFICATION :

▪ We now consider in depth the task of text classification, also known as categorization: given a text
of some kind, decide which of a predefined set of classes it belongs to.
▪ Language identification and genre classification are examples of text classification.
▪ Another example is spam detection: classifying an email message as spam or not-spam (also called ham).
▪ A training set is readily available: the positive (spam) examples are in my spam folder, the negative
(ham) examples are in my inbox.
▪ Note that we have two complementary ways of talking about classification.
▪ In the language-modeling approach, we define one n-gram language model for P(Message | spam)
by training on the spam folder, and one model for P(Message | ham) by training on the inbox.
▪ Then we can classify a new message with an application of Bayes' rule:

argmax over c in {spam, ham} of P(c | message) = argmax over c of P(message | c) P(c)
▪ where P (c) is estimated just by counting the total number of spam and ham messages. This approach
works well for spam detection, just as it did for language identification.
▪ In the machine-learning approach we instead represent the message as a set of feature/value pairs
(for example, the unigram words with their counts) and apply a classification algorithm to the
resulting feature vector.
▪ If there are 100,000 words in the language model, then the feature vector has length 100,000, but for
a short email message almost all the features will have count zero.
▪ This unigram representation has been called the bag of words model.
▪ You can think of the model as putting the words of the training corpus in a bag and then selecting
words one at a time.
▪ The notion of order of the words is lost; a unigram model gives the same probability to any
permutation of a text.
▪ Higher-order n-gram models maintain some local notion of word order.
▪ It can be expensive to run algorithms on a very large feature vector, so often a process of feature
selection is used to keep only the features that best discriminate between spam and ham.
▪ Once we have chosen a set of features, we can apply any of the supervised learning techniques we
have seen; popular ones for text categorization include k-nearest-neighbors, support vector machines,
decision trees, naive Bayes, and logistic regression.
▪ All of these have been applied to spam detection, usually with accuracy in the 98%–99% range.
With a carefully designed feature set, accuracy can exceed 99.9%.
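▪ A minimal bag-of-words naive Bayes spam classifier in Python, combining the language-modeling
view (one unigram model per class) with add-one smoothing; the tiny training messages are purely
illustrative:

```python
import math
from collections import Counter

class NaiveBayesSpam:
    """Bag-of-words naive Bayes classifier with add-one (Laplace) smoothing."""

    def fit(self, messages, labels):
        self.priors = Counter(labels)
        self.word_counts = {c: Counter() for c in self.priors}
        for text, c in zip(messages, labels):
            self.word_counts[c].update(text.lower().split())
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total = sum(self.priors.values())
        best, best_score = None, float("-inf")
        for c in self.priors:
            score = math.log(self.priors[c] / total)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for w in text.lower().split():
                score += math.log((self.word_counts[c][w] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best

clf = NaiveBayesSpam().fit(["win money now", "meeting at noon"], ["spam", "ham"])
print(clf.predict("win a prize now"))   # 'spam'
```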
(i) Classification by data compression :
▪ Another way to think about classification is as a problem in data compression.
▪ A lossless compression algorithm takes a sequence of symbols, detects repeated patterns in it, and
writes a description of the sequence that is more compact than the original.
▪ For example, the text “0.142857142857142857” might be compressed to “0.[142857]*3.”
▪ To do classification by compression, we first lump together all the spam training messages and
compress them as a unit.
▪ We do the same for the ham. Then when given a new message to classify, we append it to the spam
messages and compress the result.
▪ We also append it to the ham and compress that. Whichever class compresses better (adds the
smaller number of additional bytes for the new message) is the predicted class.
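▪ A sketch of this idea using Python's zlib module as the off-the-shelf compressor (any standard
lossless compression algorithm could play the same role):

```python
import zlib

def classify_by_compression(message, spam_corpus, ham_corpus):
    """Append the new message to each class's training text; predict the class whose
    compressed size grows by the smaller number of additional bytes."""
    def extra_bytes(corpus):
        base = len(zlib.compress(corpus.encode()))
        combined = len(zlib.compress((corpus + "\n" + message).encode()))
        return combined - base
    return "spam" if extra_bytes(spam_corpus) < extra_bytes(ham_corpus) else "ham"
```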

3.2.3. INFORMATION RETRIEVAL :

▪ Information retrieval is the task of finding documents that are relevant to a user’s need for
information.
▪ The best-known examples of information retrieval systems are search engines on the World Wide
Web.
▪ A Web user can type a query such as “AI book” into a search engine and see a list of relevant pages.

▪ An information retrieval (henceforth IR) system can be characterized by :


 A corpus of documents. Each system must decide what it wants to treat as a document: a paragraph,
a page, or a multipage text.
 Queries posed in a query language. A query specifies what the user wants to know. The query
language can be just a list of words, such as [AI book]; or it can specify a phrase of words that must
be adjacent, as in [“AI book”]; it can contain Boolean operators as in [AI AND book]; it can include
non-Boolean operators such as [AI NEAR book].
 A result set. This is the subset of documents that the IR system judges to be relevant to the query.
 A presentation of the result set. This can be as simple as a ranked list of document titles or as
complex as a rotating color map of the result set projected onto a three-dimensional space, rendered
as a two-dimensional display.
▪ The earliest IR systems worked on a Boolean keyword model. Each word in the document
collection is treated as a Boolean feature that is true of a document if the word occurs in the
document and false if it does not.
▪ The query language is the language of Boolean expressions over features. A document is relevant
only if the expression evaluates to true.
▪ This model has the advantage of being simple to explain and implement.
▪ However, it has some disadvantages:

 First, the degree of relevance of a document is a single bit, so there is no guidance as to how to order
the relevant documents for presentation.
 Second, Boolean expressions are unfamiliar to users who are not programmers or logicians.
 Third, it can be hard to formulate an appropriate query, even for a skilled user.

(i) IR scoring functions :


▪ Most IR systems have abandoned the Boolean model and use models based on the statistics of word
counts. We describe the BM25 scoring function.
▪ A scoring function takes a document and a query and returns a numeric score; the most relevant
documents have the highest scores.
▪ In the BM25 function, the score is a linear weighted combination of scores for each of the words that
make up the query.
▪ Three factors affect the weight of a query term:
 First, the frequency with which a query term appears in a document (also known as TF for term
frequency). For a query containing the word "farming", documents that mention "farming" frequently
will have higher scores.
 Second, the inverse document frequency of the term, or IDF. The word “in” appears in almost every
document, so it has a high document frequency, and thus a low inverse document frequency, and
thus it is not as important to the query.
 Third, the length of the document. A million-word document will probably mention all the query
words, but may not actually be about the query. A short document that mentions all the words is a
much better candidate.
▪ The BM25 function takes all three of these into account.
▪ Then, given a document dj and a query consisting of the words q1:N, we have :

BM25(dj, q1:N) = Σ(i=1 to N) IDF(qi) · TF(qi, dj) · (k + 1) / (TF(qi, dj) + k · (1 - b + b · |dj| / L))

where TF(qi, dj) is the count of word qi in document dj, |dj| is the length of dj in words, L is the
average document length in the corpus, and k and b are tuned parameters (typical values are
k = 2.0 and b = 0.75).
▪ IDF(qi) is the inverse document frequency of word qi, given by :

IDF(qi) = log( (N - DF(qi) + 0.5) / (DF(qi) + 0.5) )

where N is the total number of documents in the corpus and DF(qi) is the number of documents that
contain the word qi.
(ii) IR system evaluation :


▪ How do we know whether an IR system is performing well? We undertake an experiment in which
the system is given a set of queries and the result sets are scored with respect to human relevance
judgments.
▪ Traditionally, there have been two measures used in the scoring:
 recall
 precision.
▪ Precision measures the proportion of documents in the result set that are actually relevant.
▪ For example, suppose a query over a collection of 100 documents returns a result set of 40 documents,
30 of which are relevant, while 20 relevant documents in the collection were missed.
▪ In our example, the precision is 30/(30 + 10) = 0.75. The false positive rate is 1 - 0.75 = 0.25.
▪ Recall measures the proportion of all the relevant documents in the collection that are in the result
set.
▪ In our example, recall is 30/(30 + 20) = 0.60. The false negative rate is 1 - 0.60 = 0.40.
▪ In a very large document collection, such as the World Wide Web, recall is difficult to compute,
because there is no easy way to examine every page on the Web for relevance.
▪ All we can do is either estimate recall by sampling or ignore recall completely and just judge
precision.
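▪ A small Python sketch computing both measures for the example above (the document identifiers
here are made up):

```python
def precision_recall(result_set, relevant):
    """Precision and recall of a result set, given the set of truly relevant documents."""
    result_set, relevant = set(result_set), set(relevant)
    true_pos = len(result_set & relevant)
    precision = true_pos / len(result_set) if result_set else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall

# The example above: 40 documents returned, 30 of them relevant, 20 relevant documents missed.
returned = [f"d{i}" for i in range(40)]
relevant = [f"d{i}" for i in range(30)] + [f"m{i}" for i in range(20)]
print(precision_recall(returned, relevant))   # (0.75, 0.6)
```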

(iii) IR refinements :
▪ There are many possible refinements to the system described here, and indeed Web search engines
are continually updating their algorithms as they discover new approaches and as the Web grows and
changes.
▪ One common refinement is a better model of the effect of document length on relevance.
▪ Singhal et al. (1996) observed that simple document length normalization schemes tend to favor
short documents too much and long documents not enough.
▪ They propose a pivoted document length normalization scheme; the idea is that the pivot is the
document length at which the old-style normalization is correct; documents shorter than that get a
boost and longer ones get a penalty.
▪ The BM25 scoring function uses a word model that treats all words as completely independent, but
we know that some words are correlated.
▪ Many IR systems attempt to account for these correlations.
▪ The next step is to recognize synonyms, such as "sofa" for "couch." As with stemming (reducing
words such as "couches" to a common stem "couch"), this has the potential for small gains in recall,
but can hurt precision.
▪ As a final refinement, IR can be improved by considering metadata—data outside of the text of the
document. Examples include human-supplied keywords and publication data.
▪ On the Web, hypertext links between documents are a crucial source of information.

(iv) The PageRank algorithm :


▪ PageRank was one of the two original ideas that set Google’s search apart from other Web search
engines when it was introduced in 1997. (The other innovation was the use of anchor text—the
underlined text in a hyperlink).
▪ PageRank was invented to solve the problem of the tyranny of TF scores: if the query is [IBM],
how do we make sure that IBM’s home page, ibm.com, is the first result, even if another page
mentions the term “IBM” more frequently?
▪ The idea is that ibm.com has many in-links (links to the page), so it should be ranked higher: each
in-link is a vote for the quality of the linked-to page.
▪ But if we only counted in-links, then it would be possible for a Web spammer to create a network of
pages and have them all point to a page of his choosing, increasing the score of that page.
▪ Therefore, the PageRank algorithm is designed to weight links from high-quality sites more heavily.
▪ What is a high-quality site? One that is linked to by other high-quality sites.
▪ The definition is recursive, but we will see that the recursion bottoms out properly. The PageRank
for a page p is defined as:

PR(p) = (1 - d)/N + d · Σi PR(ini) / C(ini)

▪ where PR(p) is the PageRank of page p, N is the total number of pages in the corpus, ini are the
pages that link in to p, and C(ini) is the count of the total number of out-links on page ini.
▪ The constant d is a damping factor. It can be understood through the random surfer model :
imagine a Web surfer who starts at some random page and begins exploring. With probability d the
surfer follows one of the links on the current page, chosen at random; with probability 1 - d she gets
bored and jumps to a random page anywhere on the Web. The PageRank of a page is then proportional
to the fraction of time the random surfer would spend on that page.
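▪ A minimal iterative implementation of the formula above in Python; the link graph and the choice
d = 0.85 are illustrative, and pages with no out-links are simply ignored for brevity:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PR(p) = (1 - d)/N + d * sum(PR(q)/C(q) for pages q linking to p).
    links maps each page to the list of pages it links out to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        new_pr = {p: (1 - d) / N for p in pages}
        for q, outs in links.items():
            if outs:
                share = d * pr[q] / len(outs)   # q's rank is split among its out-links
                for p in outs:
                    new_pr[p] += share
        pr = new_pr
    return pr

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```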

(v) The HITS algorithm :


▪ The Hyperlink-Induced Topic Search algorithm, also known as “Hubs and Authorities” or HITS, is
another influential link-analysis algorithm .
▪ HITS differs from PageRank in several ways.
▪ First, it is a query-dependent measure: it rates pages with respect to a query.
▪ Given a query, HITS first finds a set of pages that are relevant to the query. It does that by
intersecting hit lists of query words, and then adding pages in the link neighborhood of these pages.
▪ Each page in this set is scored as an authority to the degree that other pages in the relevant set point
to it, and as a hub to the degree that it points to authoritative pages in the relevant set; the two scores
are computed iteratively until they converge.
▪ Both PageRank and HITS played important roles in developing our understanding of Web
information retrieval.
▪ These algorithms and their extensions are used in ranking billions of queries daily as search engines
steadily develop better ways of extracting yet finer signals of search relevance.
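▪ A sketch of the iterative hub/authority computation on the query-relevant subgraph; the
normalization step keeps the scores from growing without bound:

```python
import math

def hits(links, iterations=50):
    """Hub and authority scores for a query-relevant subgraph.
    links maps each page to the pages it links to within the subgraph."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs) for p in pages}
        # A page's hub score is the sum of the authority scores of the pages it links to.
        hub = {q: sum(auth[p] for p in links.get(q, [])) for q in pages}
        # Normalize both score vectors.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth
```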
(vi) Question answering :
▪ Information retrieval is the task of finding documents that are relevant to a query, where the query
may be a question, or just a topic area or concept.
▪ Question answering is a somewhat different task, in which the query really is a question, and the
answer is not a ranked list of documents but rather a short response—a sentence, or even just a
phrase.
▪ There have been question-answering NLP (natural language processing) systems since the 1960s,
but only since 2001 have such systems used Web information retrieval to radically increase their
breadth of coverage.

3.2.4. INFORMATION EXTRACTION :

▪ Information extraction is the process of acquiring knowledge by skimming a text and looking for
occurrences of a particular class of object and for relationships among objects.
▪ A typical task is to extract instances of addresses from Web pages, with database fields for street,
city, state, and zip code; or instances of storms from weather reports, with fields for temperature,
wind speed, and precipitation.
▪ In a limited domain, this can be done with high accuracy. As the domain gets more general, more
complex linguistic models and more complex learning techniques are necessary.

(i) Finite-state automata for information extraction:


▪ The simplest type of information extraction system is an attribute-based extraction system that
assumes that the entire text refers to a single object and the task is to extract attributes of that object.
▪ For example, consider the problem of extracting from the text "IBM ThinkBook 970. Our price:
$399.00" the set of attributes {Manufacturer=IBM, Model=ThinkBook970, Price=$399.00}.
▪ We can address this problem by defining a template (also known as a pattern) for each attribute we
would like to extract. The template is defined by a finite state automaton, the simplest example of
which is the regular expression, or regex.
▪ Here we show how to build up a regular expression template for prices in dollars:
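▪ A sketch of such a build-up in Python's re syntax; the exact character classes shown are one common
formulation, not the only possibility:

```python
import re

# Building up a target regex for prices in dollars, step by step:
#   [0-9]                        matches any single digit
#   [0-9]+                       one or more digits
#   [.][0-9][0-9]                a period followed by exactly two digits
#   ([.][0-9][0-9])?             the cents part is optional
#   [$][0-9]+([.][0-9][0-9])?    a dollar sign, digits, and optional cents
target = r"[$][0-9]+([.][0-9][0-9])?"

# A full template with prefix, target, and (empty) postfix parts; the prefix looks for "price:".
template = re.compile(r"(price:\s*)(" + target + r")", re.IGNORECASE)
match = template.search("IBM ThinkBook 970. Our price: $399.00")
if match:
    print(match.group(2))   # $399.00
```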

▪ Templates are often defined with three parts: a prefix regex, a target regex, and a postfix regex.
▪ For prices, the target regex is as above, the prefix would look for strings such as “price:” and the
postfix could be empty.
▪ The idea is that some clues about an attribute come from the attribute value itself and some come
from the surrounding text.
▪ One step up from attribute-based extraction systems are relational extraction systems, which deal
with multiple objects and the relations among them.
▪ Thus, when these systems see the text “$249.99,” they need to determine not just that it is a price, but
also which object has that price.
▪ A typical relational-based extraction system is FASTUS, which handles news stories about corporate
mergers and acquisitions.
▪ A relational extraction system can be built as a series of cascaded finite-state transducers.
▪ That is, the system consists of a series of small, efficient finite-state automata (FSAs), where each
automaton receives text as input, transduces the text into a different format, and passes it along to the
next automaton.
▪ FASTUS consists of five stages:

1. Tokenization
2. Complex-word handling
3. Basic-group handling
4. Complex-phrase handling
5. Structure merging
1. FASTUS’s first stage is tokenization, which segments the stream of characters into tokens (words,
numbers, and punctuation). Some tokenizers also deal with markup languages such as HTML,
SGML, and XML.
2. The second stage handles complex words, including collocations such as “set up” and “joint
venture,” as well as proper names such as “Bridgestone Sports Co.”
3. The third stage handles basic groups, meaning noun groups and verb groups. The idea is to chunk
these into units that will be managed by the later stages.
4. The fourth stage combines the basic groups into complex phrases.
5. The final stage merges structures that were built up in the previous step.

(ii) Probabilistic models for information extraction:


▪ When information extraction must be attempted from noisy or varied input, simple finite-state
approaches fare poorly.
▪ It is too hard to get all the rules and their priorities right; it is better to use a probabilistic model
rather than a rule-based model.
▪ The simplest probabilistic model for sequences with hidden state is the hidden Markov model, or
HMM.
▪ An HMM models a progression through a sequence of hidden states, xt, with an observation et at
each step.
▪ To apply HMMs to information extraction, we can either build one big HMM for all the attributes or
build a separate HMM for each attribute. We’ll do the second.
▪ HMMs have two big advantages over FSAs for extraction.
 First, HMMs are probabilistic, and thus tolerant to noise.
 Second, HMMs can be trained from data; they don't require laborious engineering of
templates, and thus they can more easily be kept up to date as text changes over time.

(iii) Conditional random fields for information extraction :


▪ One issue with HMMs for the information extraction task is that they model a lot of probabilities
that we don't really need: an HMM is a generative model of the joint probability of the hidden states
and the observations, whereas all we need for extraction is the conditional probability of the hidden
states given the observations.
▪ Modeling this conditional distribution directly gives us some freedom. We don't need the
independence assumptions of the Markov model; we can have an xt that is dependent on x1.
▪ A framework for this type of model is the conditional random field, or CRF, which models a
conditional probability distribution of a set of target variables given a set of observed variables.
▪ Like Bayesian networks, CRFs can represent many different structures of dependencies among the
variables.
▪ One common structure is the linear-chain conditional random field for representing Markov
dependencies among variables in a temporal sequence.
▪ Thus, HMMs are the temporal version of naive Bayes models, and linear-chain CRFs are the
temporal version of logistic regression.

(iv) Ontology extraction from large corpora :


▪ So far we have thought of information extraction as finding a specific set of relations (e.g.,speaker,
time, location) in a specific text (e.g., a talk announcement).
▪ A different application of extraction technology is building a large knowledge base or ontology of
facts from a corpus.
▪ This is different in three ways:
▪ First, it is open-ended: we want to acquire facts about all types of domains, not just one specific
domain.
▪ Second, with a large corpus, this task is dominated by precision, not recall—just as with question
answering on the Web .
▪ Third, the results can be statistical aggregates gathered from multiple sources, rather than being
extracted from one specific text.

(v) Automated template construction :


▪ Fortunately, it is possible to learn templates from a few examples, then use the templates to learn
more examples, from which more templates can be learned, and so on.
▪ In one of the first experiments of this kind, Brin (1999) started with a data set of just five examples:

(“Isaac Asimov”, “The Robots of Dawn”)


(“David Brin”, “Startide Rising”)
(“James Gleick”, “Chaos—Making a New Science”)
(“Charles Dickens”, “Great Expectations”)
(“William Shakespeare”, “The Comedy of Errors”)
▪ Clearly these are examples of the author–title relation, but the learning system had no knowledge of
authors or titles.
▪ The words in these examples were used in a search over a Web corpus, resulting in 199 matches.
Each match is defined as a tuple of seven strings,

(Author, Title, Order, Prefix, Middle, Postfix, URL) ,


▪ where Order is true if the author came first and false if the title came first, Middle is the characters
between the author and title, Prefix is the 10 characters before the match, Postfix is the 10 characters
after the match, and URL is the Web address where the match was made.
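▪ A sketch of how one such match tuple might be assembled in Python; the page text, URL, and the
10-character window are all illustrative assumptions:

```python
def make_match_tuple(page_text, url, author, title, window=10):
    """Build the seven-string tuple (Author, Title, Order, Prefix, Middle, Postfix, URL)
    for one occurrence of an author and a title on a page."""
    a, t = page_text.find(author), page_text.find(title)
    if a < 0 or t < 0:
        return None
    order = a < t                                   # True if the author came first
    first, second = (a, author), (t, title)
    if not order:
        first, second = second, first
    start, end = first[0], second[0] + len(second[1])
    prefix = page_text[max(0, start - window):start]
    middle = page_text[first[0] + len(first[1]):second[0]]
    postfix = page_text[end:end + window]
    return (author, title, order, prefix, middle, postfix, url)

print(make_match_tuple("... by Isaac Asimov, author of The Robots of Dawn, was ...",
                       "http://example.com", "Isaac Asimov", "The Robots of Dawn"))
```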

(vi) Machine reading :


▪ Automated template construction is a big step up from handcrafted template construction, but it still
requires a handful of labeled examples of each relation to get started.
▪ To build a large ontology with many thousands of relations, even that amount of work would be
onerous; we would like to have an extraction system with no human input of any kind—a system that
could read on its own and build up its own database.
▪ Such a system would be relation-independent; it would work for any relation. In practice, these
systems work on all relations in parallel, because of the I/O demands of large corpora.
▪ They behave less like a traditional information extraction system that is targeted at a few relations
and more like a human reader who learns from the text itself; because of this the field has been called
machine reading.
