UNIT 4 Part1

The document discusses natural language communication in artificial intelligence, focusing on phrase structure grammars, syntactic analysis, and semantic interpretation. It outlines various types of grammars, including context-free and probabilistic context-free grammars, and their applications in parsing and understanding language. Additionally, it addresses the challenges of data sparsity and the integration of semantic meaning into grammatical structures.

Uploaded by

Vijay Dhanush

lOMoARcPSD|47823449

R19-AI- UNIT-IV-Chapter-1

Computer Science & Engineering (Jawaharlal Nehru Technological University, Anantapur)



Downloaded by Riyaz Naik SMD ([email protected])

ARTIFICIAL
INTELLIGENCE

UNIT – IV

Syllabus
Natural Language for Communication: Phrase structure grammars, Syntactic Analysis,
Augmented Grammars and semantic Interpretation, Machine Translation, Speech
Recognition.

Perception: Image Formation, Early Image Processing Operations, Object Recognition by


appearance, Reconstructing the 3D World, Object Recognition from Structural information,
Using Vision.


Chapter – 1

Natural Language for


Communication


INTRODUCTION
Communication is the intentional exchange of information brought about by the production and perception of signs drawn from a shared system of conventional signs. Most animals use signs to
represent important messages: food here, predator nearby, approach, withdraw, let’s mate.

4.1.1. PHRASE STRUCTURE GRAMMARS :

 The n-gram language models were based on sequences of words.

 The big issue for these models is data sparsity—with a vocabulary of, say, 10^5 words, there are 10^15 trigram probabilities to
estimate, and so a corpus of even a trillion words will not be able to supply reliable estimates for all
of them.
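The sparsity arithmetic can be checked directly; the vocabulary and corpus sizes below are illustrative round numbers:

```python
# Rough parameter-count arithmetic for n-gram models (illustrative figures).
vocab_size = 10**5          # a plausible English vocabulary size
trigrams = vocab_size ** 3  # one probability per possible word triple
corpus = 10**12             # a trillion-word training corpus

print(f"{trigrams:.0e} trigram probabilities to estimate")
print(f"{corpus / trigrams:.3f} training tokens per trigram on average")
```

Even a trillion words gives, on average, one-thousandth of an observation per trigram, which is why generalization is needed.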

 We can address the problem of sparsity through generalization.

 Despite the exceptions, the notion of a lexical category (also known as a part of speech) such as
noun or adjective is a useful generalization—useful in its own right, but more so when we string
together lexical categories to form syntactic categories such as noun phrase or verb phrase, and
combine these syntactic categories into trees representing the phrase structure of sentences: nested
phrases, each marked with a category.

 GENERATIVE CAPACITY :

 Grammatical formalisms can be classified by their generative capacity: the set of languages they
can represent.

 Chomsky (1957) describes four classes of grammatical formalisms that differ only in the form of the
rewrite rules.

 The classes can be arranged in a hierarchy, where each class can be used to describe all the
languages that can be described by a less powerful class, as well as some additional languages.

 Here we list the hierarchy, most powerful class first:

1. Recursively enumerable grammars use unrestricted rules: both sides of the rewrite rules can have
any number of terminal and nonterminal symbols, as in the rule A B C → D E.

 These grammars are equivalent to Turing machines in their expressive power.

2. Context-sensitive grammars are restricted only in that the right-hand side must contain at least as
many symbols as the left-hand side.

 The name “context-sensitive” comes from the fact that a rule such as A X B → A Y B
says that an X can be rewritten as a Y in the context of a preceding A and a following B.

Context-sensitive grammars can represent languages such as aⁿbⁿcⁿ (a sequence of n copies of a followed
by the same number of bs and then cs).

3. In context-free grammars (or CFGs), the left-hand side consists of a single nonterminal symbol.
Thus, each rule licenses rewriting the nonterminal as the right-hand side in any context.

CFGs are popular for natural-language and programming-language grammars, although it is now
widely accepted that at least some natural languages have constructions that are not context-free
(Pullum, 1991).

Context-free grammars can represent aⁿbⁿ, but not aⁿbⁿcⁿ.

4. Regular grammars are the most restricted class. Every rule has a single nonterminal on the left-hand
side and a terminal symbol optionally followed by a nonterminal on the right- hand side.

Regular grammars are equivalent in power to finite state machines. They are poorly suited
for programming languages, because they cannot represent constructs such as balanced opening and
closing parentheses .

The closest they can come is representing a∗b∗, a sequence of any number of as followed
by any number of bs.
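As a minimal sketch, a regular expression (the programming-language counterpart of a regular grammar) recognizes a*b* directly; equal counts of as and bs would require the counting power of a context-free grammar:

```python
import re

# A regular grammar / regular expression can describe a*b*:
# any number of as followed by any number of bs.
A_STAR_B_STAR = re.compile(r"a*b*")

def accepts(s):
    # fullmatch requires the whole string to fit the pattern
    return A_STAR_B_STAR.fullmatch(s) is not None

print(accepts("aaabb"), accepts(""), accepts("aba"))  # True True False
```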

 There have been many competing language models based on the idea of phrase structure; we will
describe a popular model called the probabilistic context-free grammar, or PCFG.

 A grammar is a collection of rules that defines a language as a set of allowable strings of words.
Probabilistic means that the grammar assigns a probability to every string.

 Here is a PCFG rule:


VP → Verb   [0.70]
   | VP NP  [0.30]

 Here VP (verb phrase) and NP (noun phrase) are non-terminal symbols. The grammar also
refers to actual words, which are called terminal symbols.

 This rule is saying that with probability 0.70 a verb phrase consists solely of a verb, and with
probability 0.30 it is a VP followed by an NP.
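To make the rule probabilities concrete, here is a minimal sketch that scores a parse tree under a tiny PCFG. The two VP rule probabilities come from the text; the lexical entries and their probabilities are invented placeholders:

```python
# A parse tree's probability is the product of the probabilities of every
# rule application in it. Syntactic rule probabilities are from the text;
# the lexicon below is a hypothetical placeholder.
RULES = {
    ("VP", ("Verb",)): 0.70,
    ("VP", ("VP", "NP")): 0.30,
}
LEXICON = {("Verb", "eat"): 0.10, ("NP", "a banana"): 0.05}  # invented

def tree_prob(tree):
    """tree = (category, child, ...); a leaf child is the word itself."""
    cat, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return LEXICON[(cat, children[0])]            # lexical rule X -> word
    p = RULES[(cat, tuple(c[0] for c in children))]   # syntactic rule
    for child in children:
        p *= tree_prob(child)
    return p

# "eat a banana" parsed as VP -> VP NP, inner VP -> Verb
tree = ("VP", ("VP", ("Verb", "eat")), ("NP", "a banana"))
print(tree_prob(tree))  # 0.30 * 0.70 * 0.10 * 0.05 = 0.00105
```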

(i) The lexicon of E0 :

 First we define the lexicon, or list of allowable words. The words are grouped into the lexical
categories familiar to dictionary users: nouns, pronouns, and names to denote things; verbs to denote
events; adjectives to modify nouns; adverbs to modify verbs; and function words: articles (such as
the), prepositions (in), and conjunctions (and).

 Each of the categories ends in . . . to indicate that there are other words in the category.

 For nouns, names, verbs, adjectives, and adverbs, it is infeasible even in principle to list all the
words. Not only are there tens of thousands of members in each class, but new ones—like iPod or
biodiesel—are being added constantly.

 These five categories are called open classes.

 For the categories of pronoun, relative pronoun, article, preposition, and conjunction we could have
listed all the words with a little more work. These are called closed classes; they have a small
number of words (a dozen or so).

 Closed classes change over the course of centuries, not months. For example, “thee” and “thou”
were commonly used pronouns in the 17th century, were on the decline in the 19th, and are seen
today only in poetry and some regional dialects.

(ii) The grammar of E0 :

 The next step is to combine the words into phrases.

A grammar for E0 has rules for each of the six syntactic categories and an example for each rewrite
rule.

4.1.2. SYNTACTIC ANALYSIS (PARSING) :

 Parsing is the process of analyzing a string of words to uncover its phrase structure, according to
the rules of a grammar.

 Consider the following two sentences:

1. Have the students in section 2 of Computer Science 101 take the exam.

2. Have the students in section 2 of Computer Science 101 taken the exam?

 Even though they share the first 10 words, these sentences have very different parses, because the
first is a command and the second is a question.

 A left-to-right parsing algorithm would have to guess whether the first word is part of a
command or a question, and will not be able to tell whether the guess is correct until at least the eleventh
word, take or taken.

 If the algorithm guesses wrong, it will have to backtrack all the way to the first word and reanalyze
the whole sentence under the other interpretation.

 To avoid this source of inefficiency we can use dynamic programming: every time we analyze a
substring, store the results so we won’t have to reanalyze it later.

 For example, once we discover that “the students in section 2 of Computer Science 101” is an NP,
we can record that result in a data structure known as a chart.

 Algorithms that do this are called chart parsers.

 There are many types of chart parsers; we describe a bottom-up version called the CYK algorithm,
after its inventors, John Cocke, Daniel Younger, and Tadao Kasami.

 CYK algorithm :


 The CYK algorithm requires a grammar with all rules in one of two very specific formats: lexical
rules of the form X → word, and syntactic rules of the form X → Y Z .

 This grammar format, called Chomsky Normal Form, may seem restrictive, but it is not: any
context-free grammar can be automatically transformed into Chomsky Normal Form.
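A minimal sketch of the CYK idea in Python, using an invented toy grammar that is already in Chomsky Normal Form; chart[(i, j)] holds the categories that can span words i..j:

```python
from collections import defaultdict

# CYK chart parsing for a toy CNF grammar (the grammar itself is made up).
# Lexical rules X -> word, and binary syntactic rules X -> Y Z.
LEX = {"the": {"Det"}, "students": {"N"}, "take": {"V"}, "exam": {"N"}}
BIN = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

def cyk(words):
    n = len(words)
    chart = defaultdict(set)      # chart[(i, j)]: categories covering words[i:j]
    for i, w in enumerate(words):             # fill in the lexical rules
        chart[(i, i + 1)] |= LEX.get(w, set())
    for length in range(2, n + 1):            # grow spans bottom-up
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):         # every split point
                for y in chart[(i, k)]:
                    for z in chart[(k, j)]:
                        chart[(i, j)] |= BIN.get((y, z), set())
    return chart[(0, n)]          # categories spanning the whole sentence

print(cyk("the students take the exam".split()))  # {'S'}
```

Because each substring's categories are stored once in the chart, nothing is ever reanalyzed, which is exactly the dynamic-programming point made above.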

(i) Learning probabilities for PCFGs :

 A PCFG has many rules, with a probability for each rule.

 This suggests that learning the grammar from data might be better than a knowledge engineering
approach.

 Learning is easiest if we are given a corpus of correctly parsed sentences, commonly called a
treebank.

The Penn Treebank is the best known; it consists of 3 million words which have been annotated with part of
speech and parse-tree structure, using human labor assisted by some automated tools.

 Annotated tree from the Penn Treebank :


 Given a corpus of trees, we can create a PCFG just by counting (and smoothing).

 In the example above, there are two nodes of the form [S[NP . . .][VP . . .]]. We would count these,
and all the other subtrees with root S in the corpus.

If there are 100,000 S nodes of which 60,000 are of this form, then we create the rule:
S → NP VP [0.60] .
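The counting step can be sketched as follows; the three-tree "treebank" here is invented purely for illustration:

```python
from collections import Counter

# Estimate PCFG rule probabilities by counting rule uses in a treebank.
# Trees are (category, child, ...) tuples; this tiny treebank is invented.
treebank = [
    ("S", ("NP", ("Pronoun", "I")), ("VP", ("Verb", "smell"))),
    ("S", ("NP", ("Noun", "wumpus")), ("VP", ("Verb", "stinks"))),
    ("S", ("VP", ("Verb", "go"))),          # an imperative sentence
]

rule_counts, lhs_counts = Counter(), Counter()

def count(tree):
    if isinstance(tree, str):               # a word: nothing to count
        return
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(lhs, rhs)] += 1
    lhs_counts[lhs] += 1
    for c in children:
        count(c)

for t in treebank:
    count(t)

def prob(lhs, rhs):
    # relative frequency; real systems would also apply smoothing
    return rule_counts[(lhs, rhs)] / lhs_counts[lhs]

print(prob("S", ("NP", "VP")))  # 2 of 3 S nodes expand as NP VP
```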

(ii) Comparing context-free and Markov models :

 The problem with PCFGs is that they are context-free.

 That means that the difference between P (“eat a banana”) and P (“eat a bandanna”) depends only on
P (Noun → “banana”) versus
P (Noun → “bandanna”) and not on the relation between “eat” and the respective objects.

 A Markov model of order two or more, given a sufficiently large corpus, will know that “eat a
banana” is more probable.

 We can combine a PCFG and Markov model to get the best of both. The simplest approach is to
estimate the probability of a sentence with the geometric mean of the probabilities computed by both
models.
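The geometric-mean combination is a one-liner; the two sentence probabilities below are invented:

```python
import math

# Combining a PCFG estimate with a Markov-model estimate by taking the
# geometric mean of the two probabilities (both values are invented).
p_pcfg = 1e-9     # P(sentence) under the PCFG
p_markov = 4e-11  # P(sentence) under a trigram (Markov) model

combined = math.sqrt(p_pcfg * p_markov)   # geometric mean of two numbers
print(combined)  # 2e-10
```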

Another problem with PCFGs is that they tend to have too strong a preference for shorter sentences.

4.1.3. AUGMENTED GRAMMARS AND SEMANTIC INTERPRETATION :

 In this concept, we see how to extend context-free grammars.


(i) Lexicalized PCFGs :

To get at the relationship between the verb “eat” and the nouns “banana” versus “bandanna,” we can
use a lexicalized PCFG, in which the probabilities for a rule depend on the relationship between
words in the parse tree, not just on the adjacency of words in a sentence.

 Of course, we can’t have the probability depend on every word in the tree, because we won’t have
enough training data to estimate all those probabilities.

 It is useful to introduce the notion of the head of a phrase—the most important word. Thus, “eat” is
the head of the VP “eat a banana” and “banana” is the head of the NP “a banana.”

 We use the notation VP(v) to denote a phrase with category VP whose head word is v. We say that
the category VP is augmented with the head variable v.

Here is an augmented grammar that describes the verb–object relation:

(ii) Formal definition of augmented grammar rules:

 Augmented rules are complicated, so we will give them a formal definition by showing how an
augmented rule can be translated into a logical sentence.

 The sentence will have the form of a definite clause, so the result is called a definite clause
grammar, or DCG.

That gives us:

Article(a, s1) ∧ Adjs(j, s2) ∧ Noun(n, s3) ∧ Compatible(j, n) ⇒ NP(n, Append(s1, s2, s3)) .

 This definite clause says that if the predicate Article is true of a head word a and a string s1, and Adjs
is similarly true of a head word j and a string s2, and Noun is true of a head word n and a string s3,
and if j and n are compatible, then the predicate NP is true of the head
word n and the result of appending strings s1, s2, and s3.

 The translation from grammar rule to definite clause allows us to talk about parsing as logical
inference.

 This makes it possible to reason about languages and strings in many different ways.

For example, it means we can do bottom-up parsing using forward chaining or top-down parsing
using backward chaining .

 In fact, parsing natural language with DCGs was one of the first applications of (and motivations for)
the Prolog logic programming language.

 It is sometimes possible to run the process backward and do language generation as well as
parsing.

(iii) Case agreement and subject–verb agreement:

 We split NP into two categories, NPS and NPO, to stand for noun phrases in the subjective and
objective case, respectively.

 We would also need to split the category Pronoun into the two categories PronounS (which includes
“I”) and PronounO (which includes “me”).

 The top part of the figure shows the grammar for case agreement; we call the resulting language E1.

 Unfortunately, E1 still overgenerates. English requires subject–verb agreement for person and
number of the subject and main verb of a sentence.


For example, if “I” is the subject, then “I smell” is grammatical, but “I smells” is not. If “it” is the
subject, we get the reverse.

(iv) Semantic interpretation :

To show how to add semantics to a grammar, we start with an example that is simpler than English:
the semantics of arithmetic expressions.

 Figure shows a grammar for arithmetic expressions, where each rule is augmented with a variable
indicating the semantic interpretation of the phrase.

 The semantics of a digit such as “3” is the digit itself. The semantics of an expression such as “3 +
4” is the operator “+” applied to the semantics of the phrase “3” and the phrase “4.”

The rules obey the principle of compositional semantics.

 The semantics of a phrase is a function of the semantics of the subphrases. Figure shows the parse
tree for 3 + (4 ÷ 2) according to this grammar. The root of the parse tree is Exp(5), an expression
whose semantic interpretation is 5.
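A minimal sketch of compositional semantics for this arithmetic grammar: the meaning of each phrase is computed from the meanings of its subphrases. Like the simple grammar, this sketch ignores operator precedence and just follows the bracketing:

```python
# Compositional semantics for a tiny arithmetic grammar:
# Exp -> Exp Op Exp | ( Exp ) | Digit, with the semantics of a phrase
# built from the semantics of its subphrases.
def parse_exp(tokens):
    """Parse a prefix of the token list; return (semantics, remaining tokens)."""
    if tokens[0] == "(":
        left, rest = parse_exp(tokens[1:])
        rest = rest[1:]                          # consume the closing ")"
    else:
        left, rest = int(tokens[0]), tokens[1:]  # Digit: its semantics is itself
    if rest and rest[0] in "+÷":
        op = rest[0]
        right, rest = parse_exp(rest[1:])
        left = left + right if op == "+" else left // right  # apply the operator
    return left, rest

semantics, _ = parse_exp(list("3+(4÷2)"))
print(semantics)  # Exp(5): the interpretation of "3 + (4 ÷ 2)" is 5
```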

(v) Complications :


 The grammar of real English is endlessly complex. We will briefly mention some examples.

1. Time and tense:

Suppose we want to represent the difference between “John loves Mary” and “John loved Mary.”

 English uses verb tenses (past, present, and future) to indicate the relative time of an event. One good
choice to represent the time of events is the event calculus notation

 In event calculus we have

John loves Mary: E1 ∈ Loves(John, Mary) ∧ During(Now, Extent(E1))

John loved Mary: E2 ∈ Loves(John, Mary) ∧ After(Now, Extent(E2)) .

2. Quantification :

 Consider the sentence “Every agent feels a breeze.”

The sentence has only one syntactic parse under E0, but it is actually semantically ambiguous; the
preferred meaning is “For every agent there exists a breeze that the agent feels,” but an acceptable
alternative meaning is “There exists a breeze that every agent feels.”

3. Pragmatics:

 We have shown how an agent can perceive a string of words and use a grammar to derive a set of
possible semantic interpretations.

4. Long-distance dependencies:

Questions introduce a new grammatical complexity. In “Who did the agent tell you to give
the gold to?” the final word “to” should be parsed as [PP to _], where the “_” denotes a gap or trace
where an NP is missing; the missing NP is licensed by the first word of the sentence, “who.”

5. Ambiguity :

In some cases, hearers are consciously aware of ambiguity in an utterance.

Types of ambiguities :

→ Lexical ambiguity, in which a word has more than one meaning.

→ Syntactic ambiguity, in which a phrase has multiple parses.

→ Semantic ambiguity, in which a phrase has multiple meanings; syntactic ambiguity leads to
semantic ambiguity, because each parse corresponds to a different meaning.

 Disambiguation, is the process of recovering the most probable intended meaning of an utterance.

 To do disambiguation properly, we need to combine four models:



→ world model

→ mental model

→ language model

→ acoustic model

4.1.4. MACHINE TRANSLATION :

 Machine translation is the automatic translation of text from one natural language (the source) to
another (the target).

It was one of the first application areas envisioned for computers (Weaver, 1949), but it is only in
the past decade that the technology has seen widespread usage.

 Historically, there have been three main applications of machine translation.

→ Rough translation, as provided by free online services, gives the “gist” of a foreign sentence or
document, but contains errors.

→ Pre-edited translation is used by companies to publish their documentation and sales materials in
multiple languages.

The original source text is written in a constrained language that is easier to translate automatically,
and the results are usually edited by a human to correct any errors.

→ Restricted-source translation works fully automatically, but only on highly stereotypical language,
such as a weather report.

 Translation is difficult because, in the fully general case, it requires in-depth understanding of the
text. This is true even for very simple texts—even “texts” of one word.

(i) Machine translation systems :

 All translation systems must model the source and target languages, but systems vary in the type of
models they use.

 Some systems attempt to analyze the source language text all the way into an interlingua knowledge
representation and then generate sentences in the target language from that representation.

 This is difficult because it involves three unsolved problems:

 creating a complete knowledge representation of everything;

 parsing into that representation; and

 generating sentences from that representation.



 Other systems are based on a transfer model.

 They keep a database of translation rules (or examples), and whenever the rule (or example)
matches, they translate directly.

 Transfer can occur at the lexical, syntactic, or semantic level.

 For example, a strictly syntactic rule maps English [Adjective Noun] to French [Noun Adjective]. A
mixed syntactic and lexical rule maps French [S1 “et puis” S2] to English [S1 “and then” S2].

(ii) Statistical machine translation :

 Now that we have seen how complex the translation task can be, it should come as no surprise that
the most successful machine translation systems are built by training a probabilistic model using
statistics gathered from a large corpus of text.

 This approach does not need a complex ontology of interlingua concepts, nor does it need
handcrafted grammars of the source and target languages, nor a hand-labeled treebank.

 All it needs is data—sample translations from which a translation model can be learned. To translate
a sentence in, say, English (e) into French (f), we find the string of words f* that maximizes:

f* = argmax over f of P(f | e) = argmax over f of P(e | f) P(f) .

 Here the factor P(f) is the target language model for French; it says how probable a given sentence
is in French. P(e | f) is the translation model.
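A sketch of the noisy-channel choice: among candidate target sentences, pick the f that maximizes P(e | f) · P(f). The candidate sentences and both probabilities below are invented for illustration:

```python
# Noisy-channel decoding sketch: score each candidate translation f by
# translation model P(e|f) times language model P(f). All values invented.
candidates = {
    "le chien brun": {"p_lm": 1e-6, "p_tm": 0.3},  # P(f), P(e|f)
    "le brun chien": {"p_lm": 1e-8, "p_tm": 0.4},  # fluent-sounding? no
}

def decode(candidates):
    # f* = argmax_f P(e|f) * P(f)
    return max(candidates, key=lambda f: candidates[f]["p_tm"] * candidates[f]["p_lm"])

print(decode(candidates))  # le chien brun
```

Note how the language model overrides a slightly better translation-model score, favoring the word order that is actually probable in the target language.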

 All that remains is to learn the phrasal and distortion probabilities. We sketch the procedure:


 Find parallel texts:

First, gather a parallel bilingual corpus. For example, a Hansard is a record of parliamentary debate.
Canada, Hong Kong, and other countries produce bilingual Hansards, the European Union publishes its
official documents in 11 languages, and the United Nations publishes multilingual documents.

 Segment into sentences: The unit of translation is a sentence, so we will have to break the corpus
into sentences. Periods are strong indicators of the end of a sentence.

 Align sentences: For each sentence in the English version, determine what sentence(s) it corresponds
to in the French version.

Usually, the next sentence of English corresponds to the next sentence of French in a 1:1
match, but sometimes there is variation: one sentence in one language will be split, giving a 2:1 match,
or the order of two sentences will be swapped, resulting in a 2:2 match.

 Align phrases: Within a sentence, phrases can be aligned by a process that is similar to that used for
sentence alignment, but requiring iterative improvement.

 Extract distortions: Once we have an alignment of phrases we can define distortion probabilities.
Simply count how often distortion occurs in the corpus for each distance d = 0, ±1, ±2, . . ., and apply
smoothing.

 Improve estimates with EM: Use expectation–maximization to improve the estimates of P(f | e) and
P(d) values.

We compute the best alignments with the current values of these parameters in the E step, then
update the estimates in the M step and iterate the process until convergence.

4.1.5. SPEECH RECOGNITION :

 Speech recognition is the task of identifying a sequence of words uttered by a speaker, given the
acoustic signal.

 It has become one of the mainstream applications of AI—millions of people interact with speech
recognition systems every day to navigate voice mail systems, search the Web from mobile phones,
and other applications.

 Speech is an attractive option when hands-free operation is necessary, as when operating machinery.

Speech recognition is difficult because the sounds made by a speaker are ambiguous and, well, noisy.

 Several issues that make speech problematic :



 First, segmentation: written words in English have spaces between them, but in fast speech there
are no pauses in “wreck a nice” that would distinguish it as a multiword phrase as opposed to the
single word “recognize.”

 Second, coarticulation: when speaking quickly the “s” sound at the end of “nice” merges with the
“b” sound at the beginning of “beach,” yielding something that is close to a “sp.”

 Another problem that does not show up in this example is homophones—words like “to,” “too,” and
“two” that sound the same but differ in meaning.

 As usual, the most likely sequence can be computed with the help of Bayes’ rule to be:

argmax over word 1:t of P(word 1:t | sound 1:t) = argmax over word 1:t of P(sound 1:t | word 1:t) P(word 1:t) .

 Here P(sound 1:t | word 1:t) is the acoustic model. It describes the sounds of words—that “ceiling”
begins with a soft “c” and sounds the same as “sealing.”

P(word 1:t) is known as the language model. It specifies the prior probability of each utterance—
for example, that “ceiling fan” is about 500 times more likely as a word sequence than “sealing fan.”

 This approach was named the noisy channel model by Claude Shannon (1948).

 He described a situation in which an original message (the words in our example) is transmitted over
a noisy channel (such as a telephone line) such that a corrupted message (the sounds in our example)
is received at the other end.

 Once we define the acoustic and language models, we can solve for the most likely sequence of
words using the Viterbi algorithm.
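A toy sketch of Viterbi decoding under the noisy-channel model; the word set, the phone-string "observations," and all transition and emission probabilities below are invented for illustration:

```python
import math

# Toy Viterbi decoding: find the word sequence maximizing
# P(sounds|words) * P(words). LM = language model (transitions),
# AM = acoustic model (emissions); all numbers are invented.
WORDS = ["ceiling", "sealing", "fan"]
LM = {("<s>", "ceiling"): 0.5, ("<s>", "sealing"): 0.5,
      ("ceiling", "fan"): 0.500, ("sealing", "fan"): 0.001}
AM = {("ceiling", "s iy l ih ng"): 1.0, ("sealing", "s iy l ih ng"): 1.0,
      ("fan", "f ae n"): 1.0}

def viterbi(sounds):
    # best[w] = (log-probability, word path) of the best sequence ending in w
    best = {"<s>": (0.0, [])}
    for sound in sounds:
        new = {}
        for w in WORDS:
            am = AM.get((w, sound), 0.0)
            if am == 0.0:
                continue
            for prev, (lp, path) in best.items():
                lm = LM.get((prev, w), 0.0)
                if lm == 0.0:
                    continue
                score = lp + math.log(lm) + math.log(am)
                if w not in new or score > new[w][0]:
                    new[w] = (score, path + [w])   # keep only the best path
        best = new
    return max(best.values())[1]

print(viterbi(["s iy l ih ng", "f ae n"]))  # ['ceiling', 'fan']
```

Both words fit the first sound equally well; the language model's strong preference for "ceiling fan" over "sealing fan" decides the output, exactly the 500:1 effect described above.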

 Most speech recognition systems use a language model that makes the Markov assumption—that the
current state Word t depends only on a fixed number n of previous states—and represent Word t as a
single random variable taking on a finite set of values, which makes it a Hidden Markov Model
(HMM).

Thus, speech recognition becomes a simple application of the HMM methodology.

(i) Acoustic model :

 Sound waves are periodic changes in pressure that propagate through the air. When these waves
strike the diaphragm of a microphone, the back-and-forth movement generates an electric current.

 An analog-to-digital converter measures the size of the current—which approximates the amplitude
of the sound wave—at discrete intervals called the sampling rate.


Speech sounds, which are mostly in the range of 100 Hz (100 cycles per second) to 1000 Hz, are
typically sampled at a rate of 8 kHz. (CDs and mp3 files are sampled at 44.1 kHz.)

 The precision of each measurement is determined by the quantization factor; speech recognizers
typically keep 8 to 12 bits.

 That means that a low-end system, sampling at 8 kHz with 8-bit quantization, would require nearly
half a megabyte per minute of speech.
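The storage figure follows from simple arithmetic:

```python
# Raw-speech storage at the low-end settings from the text:
# 8 kHz sampling with 8-bit (one-byte) quantization.
samples_per_second = 8_000
bits_per_sample = 8
bytes_per_minute = samples_per_second * bits_per_sample // 8 * 60

print(bytes_per_minute / 1e6, "MB per minute")  # 0.48 MB per minute
```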

 Since we only want to know what words were spoken, not exactly what they sounded like, we don’t
need to keep all that information.

 We only need to distinguish between different speech sounds. Linguists have identified about 100
speech sounds, or phones, that can be composed to form all the words in all known human
languages.

 Roughly speaking, a phone is the sound that corresponds to a single vowel or consonant, but there
are some complications: combinations of letters, such as “th” and “ng,” produce single phones, and
some letters produce different phones in different contexts. A standard phone alphabet covers all the
phones used in English.

 A phoneme is the smallest unit of sound that has a distinct meaning to speakers of a particular
language.

 For example, the “t” in “stick” sounds similar enough to the “t” in “tick” that speakers of English
consider them the same phoneme.

 First, we observe that although the sound frequencies in speech may be several kHz, the changes in
the content of the signal occur much less often, perhaps at no more than 100 Hz.

 Therefore, speech systems summarize the properties of the signal over time slices called frames.

(ii) Language model :

 For general-purpose speech recognition, the language model can be an n-gram model of text learned
from a corpus of written sentences.

 However, spoken language has different characteristics than written language, so it is better to get a
corpus of transcripts of spoken language.

 For task-specific speech recognition, the corpus should be task-specific: to build your airline
reservation system, get transcripts of prior calls.

It also helps to have task-specific vocabulary, such as a list of all the airports and cities served, and
all the flight numbers.

(iii) Building a speech recognizer :

 The quality of a speech recognition system depends on the quality of all of its components— the
language model, the word-pronunciation models, the phone models, and the signal processing
algorithms used to extract spectral features from the acoustic signal.

 The accuracy of a system depends on a number of factors. First, the quality of the signal matters: a
high-quality directional microphone aimed at a stationary mouth in a padded room will do much
better than a cheap microphone transmitting a signal over phone lines from a car in traffic with the
radio playing.
