Unit V-AI-KCS071

Download as pdf or txt
Download as pdf or txt
You are on page 1of 28

Galgotias College of Engineering and Technology

ARTIFICIAL INTELLIGENCE
KCS 071
B.TECH IV YR / VII SEM
(2022-23)

DEPARTMENT OF COMPUTER SCIENCE &


ENGENEERING

1
Course Outcomes
Course
Statement (On completion of this course, the student will be able to)
Outcome
Understand the basics of the theory and practice of Artificial Intelligence as a
CO1
discipline and about intelligent agents.
CO2 Understand search techniques and gaming theory.
The student will learn to apply knowledge representation techniques and
CO3 problem solving strategies to common AI applications.
CO4 Student should be aware of techniques used for classification and clustering.

CO5 Student should aware of basics of pattern recognition and steps required for it.

UNIT 1
Introduction–Definition – Future of Artificial Intelligence – Characteristics of Intelligent
Agents–Typical Intelligent Agents – Problem Solving Approach to Typical AI problems.
UNIT 2
Problem solving Methods – Search Strategies- Uninformed – Informed – Heuristics – Local
Search Algorithms and Optimization Problems – Searching with Partial Observations –
Constraint Satisfaction Problems – Constraint Propagation – Backtracking Search – Game
Playing – Optimal Decisions in Games – Alpha – Beta Pruning – Stochastic Games.
UNIT 3
First Order Predicate Logic – Prolog Programming – Unification – Forward Chaining-
Backward Chaining – Resolution – Knowledge Representation – Ontological Engineering-
Categories and Objects – Events – Mental Events and Mental Objects – Reasoning Systems for
Categories – Reasoning with Default Information.
UNIT 4
Architecture for Intelligent Agents – Agent communication – Negotiation and Bargaining –
Argumentation among Agents – Trust and Reputation in Multi-agent systems.
UNIT 5
AI applications – Language Models – Information Retrieval- Information Extraction – Natural
Language Processing – Machine Translation – Speech Recognition.

2
UNIT 5

APPLICATIONS
AI applications – Language Models – Information Retrieval- Information Extraction – Natural
Language Processing – Machine Translation – Speech Recognition.

5.1 LANGUAGE MODELS

Language can be defined as a set of strings; “print(2+2)” is a legal program in the


language Python, where “2) + (2 print” is not. Since the are an infinite number of legal
programs, they cannot be enumerated; instead they are specified by a set of rules called a
grammar. Formal languages also have rules that defined the meaning semantics of a program;
for example, the rules say that the “meaning” of “2 + 2” is 4, and the meaning of “1/0” is that
an error is signated.

1. Natural languages, such an English or Spanish, cannot be characterized as a definite set


of sentences. Example: Everyone agrees that “Not to be invited is sad” is a sentence of
English, but people disagree on the grammatically of “To be not invited is said”.
Therefore, it is more fruitful to define a natural language model as a probability
distribution over sentences rather than a definitive set. That is, rather than asking if a
string of words is or is not a member of the set defining the language, we instead ask
for P(S = word) - what is the probability that a random sentence would to words. Natural
languages are also ambiguous. “He saw her duck” can mean either that he saw a
waterfowl belonging to her, or that he saw her move to evade something. Thus, again,
we cannot speak of a single meaning for a sentence, but rather of a probability
distribution over possible meanings.

2. Finally, natural language are difficult to deal with because they are very large, and
constantly changing. Thus, our language models are, at best, an approximation. We start
with simplest possible approximation and move up from there.

5.2 N-GRAM CHARACTER MODELS

1. An n-gram model is defined as a Markov chain of order n – 1. In Markov chain the


probability of character ci depends only on the immediately preceding characters, not
on any other characters. So in a trigram model (Markov chain of order 2) we have

P(c4 | c1:i-1) = P(c4 | c4 – 2:i -1)

We can define the probability of a sequence of characters P(c1:N) under the trigram
model by first factoring with the chain rule and then using the Markov assumption:
N N

P(C1 : N) =  P(c1 | c1:i−1) =  P(ci | ci−2:i−1)


i=1 i=1

3
For a trigram character model in a language with 100 characters, P(Ci|Ci-2:i=1) has a
million entries, and can be accurately estimated by counting character sequences in a
body of text of 10 million characters or more. We call a body of text a corpus (plural
corpora), from the Latin word for body.

2. Language identification: Given a text, determine what natural language it is written


in. This is a relatively easy task, even with short texts such as “Hello, world” or
“Wiegehtesdir”, it is easy to identify the first as English and the second as German.
Computer systems identify languages with greater than 99% accuracy; occasionally,
closely related languages, such as Swedish and Norwegian, are confused.

3. One approach to language identification is to first build a trigram character model of


each candidate language, P (ci | ci-2; i-1), where the variable_ranges over languages.
For each_the model is built by counting trigrams in a corpus of that language. (About
100,000 characters of each language are needed). That gives us a model of P (Text |
Language), but we want to select the most probable language given the text, so we apply
Bayes’ rule followed by the Markov assumption to get the most probable language:

argmax
𝑙* = argmaxP(𝑙 | e1 : N)
𝑙

argmax
= P(𝑙)O(C 𝑙)
1:N
𝑙

argmax N
= P( 𝑙 ) P(c4 | c4 − 2 : i −1, 𝑙)

𝑙 i=1

4. Other tasks for character models include spelling correction, genre classification, and
named-entity recognition. Genre classification means decided if a text is a new story, a
legal document, a scientific article, etc. While many features help make this
classification, counts of punctuation and other character n-gram features go a long way
(Kessler et al., 1997).

5. Named-entity recognition is the task of finding names of things in a document and


deciding what class they belong to. For example, in the text “MrSopersteen was
prescribed aciphex”, we should recognize that “Mr. Sopersteen” is the name of a person
and “aciphex” is the name of a drug. Character-level models are good for this task
because they can associate the character sequence “ex” (“ex” followed by a space) with
a drug name and “steen” with a person name, and thereby identify words that they have
never seen before.

4
5.3 SMOOTHING N-GRAM MODELS

1. The major complication of n-gram models is that the training corpus provides only an
estimate of the true probability distribution.

2. For common character sequence such as “th” any English corpus will give a good
estimate: about 1.5% of all trigrams.

3. On the other hand, “ht” is very uncommon – no dictionary words start5 with ht. It is
likely that the sequence would

4. Have a count of zero in a training corpus of standard English. Does that mean we should
assign P(“th”) = 0? If we did, then the text “The program issues an http request” would
have an English probability of zero, which seems wrong.

5. The process adjusting the probability of low-frequency counts is called smoothing.

6. A better approach is a backoff model, in which we start by estimating n-gram counts,


but for any particular sequence that has a low (or zero) count, we back off to (n-1)
grams. Linear interpolation smoothing is a backoff model that combines trigram, and
unigram models by linear interpolation. It defines the probability estimate as

* P(ci|ci-2:i=1) = 3P(ci|ci-2:i-1) + 2P(ci|ci-1) + 1P(ci), where 3 + 2 + 1 = 1.

It is also possible to have the values of I depend on the counts: if we have a high count
of trigrams, then we weigh them relatively more; if only a low count, then we put more
weight on the bigram and unigram models.

5.4 MODEL EVALUATION

1. The evaluation can be a task-specific metric, such as meaning accuracy on language


identification.

2. Alternatively, we can have a task-independent model of language quality: calculate the


probability assigned to the validation corpus by the model; the higher the probability
the better. This metric is inconvenient because the probability of a large corpus will be
a very small number, and floating-point underflow becomes an issue. A different way
of describing the probability of a sequence is with a measure called perplexity, defined
as

Perplexity (c1:N) = P(c1:N)-1/N

3. Perplexity can be thought of as the reciprocal of probability, normalized by sequence


length.

4. It can also be thought of as the weighted average branching factor5 of a model. Suppose
there are 100 characters in our language, and our model says they are all equally likely.

5
Then for a sequence of any length, the perplexity will be 100. If some characters are
more likely than others, and the model reflects that, then the model will have a
perplexity less than 100.

5.5 N-GRAM WORD MODEL

1. n-gram models over words rather than characters

2. All the same mechanism applies equally to word and character models.

3. The main difference is that the vocabulary— the set of symbols that make up the corpus
and the model—is larger.

4. There are only about 100 characters in most languages, and sometimes we build
character models that are even more restrictive, for example by treating “A” and “a” as
the same symbol or by treating all punctuation as the same symbol. But with word
models we have at least tens of thousands of symbols, and sometimes millions. The
wide range is because it is not clear what constitutes a word.

5. Word n-gram models need to deal with out of vocabulary words.

6. With word models there is always the chance of a new word that was not seen in the
training corpus, so we need to model that explicitly in our language model.

7. This can be done by adding just one new word to the vocabulary: <UNK>, standing for
the unknown word.

8. Sometimes multiple unknown-word symbols are used, for different classes. For
example, any string of digits might be replaced with <NUM>, or any email address with
<EMAIL>.

5.6 INFORMATION RETRIEVAL

DEFINITION: Information retrieval is the task of documents that are relevant to a


user’s need for information.

The best-known examples of information retrieval systems are search engines on the
World Wide Web. A Web user can type a query such as [AI book] into a search engine and see
a list of relevant pages. In this section, we will see how such systems are built. An information
retrieval (henceforth IR) system can be characterized by

1. A corpus of documents. Each system must decide what it wants to treat as a document:
a paragraph, a page, or a multipage text.

2. Queries posed in a query language. A query specifies what the user wants to know. The
query language can be just a list of words, such as [AI book]; or it can specify a phrase
of words that must be adjacent, as in [“AI book”]; it can contain Boolean operators as

6
in [AI AND book]; it can include non-Boolean operators such as [AI NEAR book] or
[AI book site: www.aaai.org].

3. A result set. This is the subset of documents that the IR system judges to be relevant to
the query. By relevant, we mean likely to be of use to the person who posed the query,
for the particular information need expressed in the query.

4. A presentation of the result set. This can be as simple as a ranked list of document titles
or as complex as a rotating color map of the result set projected onto a three-
dimensional space, rendered as a two-dimensional display.

The earliest IR systems worked on a Boolean keyword model. Each word in the
document collection is treated as a Boolean feature that is true of a document if the word occurs
in the document and false if it does not.

Advantage

1. Simple to explain and implement.

Disadvantages

1. The degree of relevance of a document is a single bit, so there is no guidance as to how


to order the relevant documents for presentation.

2. Boolean expressions are unfamiliar to users who are not programmers or logicians.
Users find it unintuitive that when they want to know about farming in the states of
Kansas and Nebraska they need to issue the query [farming (Kansas OR Nebraska)].

3. It can be hard to formulate an appropriate query, even for a skilled user. Suppose we
try [information AND retrieval AND models AND optimization] and get an empty
result set. We could try [information OR retrieval OR models OR optimization], but if
that returns too many results, it is difficult to know what to try next.

5.7 IR SCORING FUNCTIONS

1. Most IR systems have abandoned the Boolean model and use models based on the
statistics of word counts.

2. A scoring function takes a document and a query and returns a numeric score; the most
relevant documents have the highest scores

3. In the BM25 scoring function, the score is a linear weighted combination of scores for
each of the words that make up the query

4. Three factors affect the weight of a query term:

7
• First, the frequency with which a query term appears in a document (also known as TF
for term frequency). For the query [farming in Kansas], documents that mention
“farming” frequently will have higher scores.

• Second, the inverse document frequency of the term, or IDF. The word “in” appears in
almost every document,

• so it has a high document frequency, and thus a low inverse document frequency, and
thus it is not as important to the query as “farming” or “Kansas.”

• Third, the length of the document. A million-word document will probably mention all
the query words, but may not actually be about the query. A short document that
mentions all the words is a much better candidate.

5. The BM25 function takes all three of these into account

BM25(d , q1: N) =  IDF(q ).


N TF(qi , d j ).(k + 1)

j i
| dj |
TF(qi , d j ) + k.(1− b + a.
i=1

where |dj| is the length of document dj in words, and L is the average document length
in the corpus, L = i |di| N. We have two parameters, k and b, that can be tuned by cross-
validation; typical values are k = 2.0 and b = 0.75. IDF (qi) is the inverse document.

6. Systems create an index ahead of time that lists, for each vocabulary word, the
documents that contain the word. This is called the hit list for the word. Then when
given an query, we intersect the hit lists of the query words and only score the
documents in the intersection.

5.8 IR SYSTEM EVALUATION

Imagine that IR system has returned a result set for a single query, for which we know
which documents are and are not relevant, out of a corpus of 100 documents. The document
counts in each category are given in the following table.

Table 5.1

In result set Not in result set


Relevant 30 20
Not relevant 10 40

Precision measures the proportion of documents in the result set that are actually
relevant. In our example, the precision is 30/(30 + 10)=.75. The false positive rate is 1 −
.75=.25. 2. Recall measures the proportion of all the relevant documents in the collection that
8
are in the result set. In our example, recall is 30/(30 + 20)=.60. The false negative rate is 1
−.60=.40. 3. In a very large document collection, such as the WorldWideWeb, recall is
difficult to compute, because there is no easy way to examine every page on the Web for
relevance.

5.9 IR REFINEMENTS

1. One common refinement is a better model of the effect of document length on


relevance.
2. The BM25 scoring function uses a word model that treats all words as completely
independent, but we know that some words are correlated: “couch” is closely related to
both “couches” and “sofa.” Many IR systems attempt to account for these correlations.
3. For example, if the query is [couch], it would be a shame to exclude from the result set
those documents that mention “COUCH” or “couches” but not “couch.” Most IR
systems do case folding of “COUCH” to “couch,” and some use a stemming algorithm
to reduce “couches” to the stem form “couch,” both in the query and the documents.
4. The next step is to recognize synonyms, such as “sofa” for “couch.” As with stemming,
this has the potential for small gains in recall, but can hurt precision. Synonyms and
related words can be found in dictionaries or by looking for correlations in documents
or in queries.
5. As a final refinement, IR can be improved by considering metadata—data outside of
the text of the document. Examples include human-supplied keywords and publication
data. On the Web, hypertext links between documents are a crucial source of
information.

5.10 PAGE RANK ALGORITHM

PageRank (PR) is an algorithm used by Google Search to rank web pages in their search
engine results. Page Rank was named after Larry Page, one of the founders of Google.
PageRank is a way of measuring the importance of website pages. According to Google:
PageRank works by counting the number and quality of links to a page to determine a rough
estimate of how important the website is. The underlying assumption is that more important
websites are likely to receive more links from other website

9
Figure 5.1 The HITS algorithm for computing hubs and authorities with respect to a
query. REELVANT-PAGES fetches the pages that match the query and EXPAND-
PAGES adds in every page that links to or is linked from one of the relevant pages.
NORMALIZE divides each page’s score by the sum of the sequence of all pages’
scores (separately for both the authority and hubs scores)

We will see that the recursion bottom out property. The PageRank for a page p is
defined as
1− d
PR(p) = + d PR(ini )

N i C(ini )

where PR(p) is the PageRank of page p, N is the total number of pages in the corpus,
ini are the pages that link into p, and C(ini) is the count of the total number of out-links
on page ini. The constant d is a damping factor. It can be understood through the
random surface model: imagine a Web surface who starts at some random page and
begins exploring.

5.11 THE HITS ALGORITHM

1. The Hyperlink-Induced Topic Search algorithm, also known as “Hubs and Authorities”
or HITS, is another influential link-analysis algorithm.

2. HITS differ from PageRank in several ways. First, it is a query-dependent measure: it


rates pages with respect to a query.

3. HITS first find a set of pages that are relevant to the query. It does that by intersecting
hit lists of queries

4. Words, and then adding pages in the link neighborhood of these pages—pages that link
to or are linked from one of the pages in the original relevant set.

5. Each page in this set is considered an authority on the query to the degree that other
pages in the relevant set point to it. A page is considered a hub to the degree that it
points to other authoritative pages in the relevant set.

6. Just as with PageRank, we don’t want to merely count the number of links; we want to
give more value to the high-quality hubs and authorities.

7. Thus, as with PageRank, we iterate a process that updates the authority score of a page
to be the sum of the hub scores of the pages that point to it, and the hub score to be the
sum of the authority scores of the pages it points to.

8. Both PageRank and HITS played important roles in developing our understanding of
Web information retrieval. These algorithms and their extensions are used in ranking

10
billions of queries daily as search engines steadily develop better ways of extracting yet
finer signals of search relevance.

5.12 IMAGE EXTRACTION

Information extraction is the process of acquiring INFORMATION knowledge by


skimming a text and looking for occurrences of a particular class of object and for relationships
among objects. A typical task is to extract instances of addresses from Web pages, with
database fields for street, city, state, and zip code; or instances of storms from weather reports,
with fields for temperature, wind speed, and precipitation. In a limited domain, this can be done
with high accuracy.

5.13 FINITE-STATE AUTOMATA FOR INFORMATION EXTRACTION

1. The simplest type of information extraction system is an attribute-based extraction


system that assumes that the entire text refers to a single object and the task is to extract
attributes of that object.

2. One step up from attribute-based extraction systems are relational extraction systems,
which deal with multiple objects and the relations among them.

3. A typical relational-based extraction system is FASTUS, which handles news stories


about corporate mergers and acquisitions.

4. A relational extraction system can be built as a series of cascaded finite-state


transducers.

5. That is, the system consists of a series of small, efficient finite-state automata (FSAs),
where each automaton receives text as input, transduces the text into a different format,
and passes it along to the next automaton. FASTUS consists of five stages: 1.
Tokenization 2. Complex-word handling 3. Basic-group handling 4. Complex-phrase
handling 5. Structure merging

6. FASTUS’s first stage is tokenization, which segments the stream of characters into
tokens (words, numbers, and punctuation). For English, tokenization can be fairly
simple; just separating characters at white space or punctuation does a fairly good job.
Some tokenizers also deal with markup languages such as HTML, SGML, and XML.

7. The second stage handles complex words, including collocations such as “set up” and
“joint venture,” as well as proper names such as “Bridgestone Sports Co.” These are
recognized by a combination of lexical entries and finite- state grammar rules.

8. The third stage handles basic groups, meaning noun groups and verb groups. The idea
is to chunk these into units that will be managed by the later stages.

9. The fourth stage combines the basic groups into complex phrases. Again, the aim is to
have rules that are finite- state and thus can be processed quickly, and that result in

11
unambiguous (or nearly unambiguous) output phrases. One type of combination rule
deals with domain-specific events.

10. The final stage merges structures that were built up in the previous step. If the next
sentence says “The joint venture will start production in January,” then this step will
notice that there are two references to a joint venture, and that they should be merged
into one. This is an instance of the identity uncertainty problem.

5.14 PROBABILISTIC MODELS FOR INFORMATION EXTRACTION

1. The simplest probabilistic model for sequences with hidden state is the hidden Markov
model, or HMM.

2. HMMs have two big advantages over FSAs for extraction.

• First, HMMs are probabilistic, and thus tolerant to noise. In a regular expression, if a
single expected character is missing, the regex fails to match; with HMMs there is
graceful degradation with missing characters/words, and we get a probability indicating
the degree of match, not just a Boolean match/fail.

• Second, HMMs can be trained from data; they don’t require laborious engineering of
templates, and thus they can more easily be kept up to date as text changes over time.

Figure 5.2 Hidden Markov model for the speaker of a talk


announcement. The two square states are the target (note the second
target state has a self-loop, so the target can match a strain of any
length), the four circles to the left are prefix, and the one of the right is
the postfix. For each state, only a few of the high-probability words are
shown. From Freitag and McCallum (2000)
3. Once the HMMs have been learned, we can apply them to a text, using the Viterbi
algorithm to find the most likely path through the HMM states. One approach is to apply
each attribute HMM separately; in this case you would expect most of the HMMs to
spend most of their time in background states. This is appropriate when the extractionis
sparse - when the number of extracted words is small compared to the length of thetext.

12
4. The other approach is to combine all the individual attributes into one big HMM, which
would then find a path that wanders through different target attributes, first finding a
speaker target, then a date target, etc. Separate HMMs are better when we expect just
one of each attribute in a text and one big HMM is better when the texts are more free-
form and dense with attributes.

5. HMMs have the advantage of supplying probability numbers that can help make the
choice. If some targets are missing, we need to decide if this is an instance of the desired
relation at all, or if the targets found are false positives. A machine learning algorithm
can be trained to make this choice.

5.15 ONTOLOGY EXTRACTION FROM LARGE CORPORA

1. A different application of extraction technology is building a large knowledge base or


ontology of facts from a corpus. This is different in three ways:

• First it is open-ended—we want to acquire facts about all types of domains, not just one
specific domain.

• Second, with a large corpus, this task is dominated by precision, not recall—just as with
question answering on the Web

• Third, the results can be statistical aggregates gathered from multiple sources, rather
than being extracted from one specific text.
2. Here is one of the most productive templates: NP such as NP (, NP)* (,)? ((and | or)
NP)?.
3. Here the bold words and commas must appear literally in the text, but the parentheses
are for grouping, the asterisk means repetition of zero or more, and the question mark
means optional.
4. NP is a variable standing for a noun phrase
5. This template matches the texts “diseases such as rabies affect your dog” and “supports
network protocols such as DNS,” concluding that rabies is a disease and DNS is a
network protocol.
6. Similar templates can be constructed with the key words “including,” “especially,” and
“or other.” Of course these templates will fail to match many relevant passages, like
“Rabies is a disease.” That is intentional.
7. The “NP is a NP” template does indeed sometimes denote a subcategory relation, but
it often means something else, as in “There is a God” or “She is a little tired.” With a
large corpus we can afford to be picky; to use only the high-precision templates.
8. We’ll miss many statements of a subcategory relationship, but most likely we’ll find a
paraphrase of the statement somewhere else in the corpus in a form we can use.

13
5.16 AUTOMATED TEMPLATE CONSTRUCTION

Clearly these are examples of the author–title relation, but the learning system had no
knowledge of authors or titles. The words in these examples were used in a search over a Web
corpus, resulting in 199 matches. Each match is defined as a tuple of seven strings, (Author,
Title, Order, Prefix, Middle, Postfix, URL), where Order is true if the author came first and
false if the title came first, Middle is the characters between the author and title, Prefix is the
10 characters before the match, Suffix is the 10 characters after the match, and URL is the Web
address where the match was made.

1. Each template has the same seven components as a match.


2. The Author and Title are regexes consisting of any characters (but beginning and ending
in letters) and constrained to have a length from half the minimum length of the
examples to twice the maximum length.
3. The prefix, middle, and postfix are restricted to literal strings, not regexes.
4. The middle is the easiest to learn: each distinct middle string in the set of matches is a
distinct candidate template. For each such candidate, the template’s Prefix is then
defined as the longest common suffix of all the prefixes in the matches, and the Postfix
is defined as the longest common prefix of all the postfixes in the matches.
5. If either of these is of length zero, then the template is rejected.
6. The URL of the template is defined as the longest prefix of the URLs in the matches.
The biggest weakness in this approach is the sensitivity to noise. If one of the first few
templates is incorrect, errors can propagate quickly. One way to limit this problem is to
not accept a new example unless it is verified by multiple templates, and not accept a
new template unless it discovers multiple examples that are also found by other
templates.

5.17 MACHINE READING

1. Traditional information extraction system that is targeted at a few relations and more
like a human reader who learns from the text itself; because of this the field has been
called machine reading.

2. A representative machine-reading system is TEXTRUNNER. TEXTRUNNER uses


cotraining to boost its performance, but it needs something to bootstrap from.

3. Because TEXTRUNNER is domain-independent, it cannot rely on predefined lists of


nouns and verbs.

14
Figure 5.3 Eight general templates that cover about 95% of the ways that
relations are expressed in English

TEXTRUNNER achieves a precision of 88% and recall of 45% (F1 of 60%) on a large
Web corpus. TEXTRUNNER has extracted hundreds of millions of facts from a corpus of half-
billion Web pages.

5.18 NATURAL LANGUAGE PROCESSING

Machine translation, sometimes referred to by the abbreviation MT (not to be confused


with computer-aided translation, machine-aided human translation (MAHT) or interactive
translation) is a sub-field of computational linguistics that investigates the use of software to
translate text or speech from one language to another.

On a basic level, MT performs simple substitution of words in one language for words
in another, but that alone usually cannot produce a good translation of a text because
recognition of whole phrases and their closest counterparts in the target language is needed.
Solving this problem with corpus statistical, and neural techniques is a rapidly growing field
that is leading to better translations, handling differences in linguistic typology, translation of
idioms and the isolation of anomalies.

MACHINE TRANSLATION SYSTEMS

All translation systems must model the source and target languages, but systems vary
in the type of models they use. Some systems, attempt to analyse the sourced language text all
the way in to an interlingua knowledge representation and then generate sentences in the target
language from that representation. This is difficult because it involves three unsolved problems:
creating a complete knowledge representation of everything; parsing into that representation;
and generating sentences from that representation. Other systems are based on a transfer model.

15
Figure 5.4 The Vauquois triangle: schematic diagram of the choice for a machine
translation system (Vauquois, 1968). We start with English test as the top. An
interlingua-based system follows the solid lines, pursuing English first into a syntactic
form, then into a semantic representation and an interlingua representation, and then
through generation to a semantic, syntactic and lexical form in French. A transfer-
based system uses the dashed lines as a shortcut. Different systems make the transfer
at different points: some make it at multiple points.

STATISTICAL MACHINE TRANSLATION

1. Find parallel texts: First, gather a parallel bilingual corpus. For example, a Hansard
is a record of parliamentary debate. Canada, Hong Kong, and other countries produce
bilingual Hansards, the European Union publishes its official documents in 11
languages, and the United Nations publishes multilingual documents. Bilingual text is
also available online; some Web sites publish parallel content with parallel URLs, for
example, /en/ for the English page and /fr/ for the corresponding French page. The
leading statistical translation systems train on hundreds of millions of words of parallel
text and billions of words of monolingual text.

2. Segment into sentences: The unit of translation is a sentence, so we will have to break
the corpus into sentences. Periods are strong indicators of the end of a sentences, but
consider “Dr. J.R. Smith of Rodeo Dr. PAD $ 29.99 ON 9.9.09”; only the final period
ends a sentence. One way to decide if a period ends a sentence is to train a model that
takes as features the surrounding words and their parts of speech. This approach
achieves about 98% accuracy.

3. Align sentences: For each sentence in the En1glish version, determine what sentence(s)
it corresponds to in the French version. Usually, the next sentence of English
corresponds to the next sentence of French in a 1:1 match, but sometimes there is
variation: one sentence in one language will be split into a 2:1 match, or the order of
two sentences will be swapped, resulting in a 2:2 match. By looking at the sentence
lengths alone (i.e. short sentences should align with short sentences), it is possible to
align (1:1, 1:2, or 2;2, etc) with accuracy in the 90% to 99% range using a variation on
the Viterbi algorithm.

16
4. Align phrases: Within a sentence, phrases can be aligned by a process that is similar
to that used for sentence alignment, but requiring iterative improvement. When we start,
we have no way of knowing that “qui dort” aligns with “sleeping”, but we can arrive at
that alignment by a process of aggregation of evidence.

5. Extract distortions: Once we have an alignment of phrases, we can define distortion


probabilities. Simply count how often distortion occurs in the corpus for each distance.

6. Improve estimates with EM: Use expectation – maximization to improve the estimate
of P(f|e) and P(d) values. We compute the best alignments with the current values of
these parameters in the E step, then update the estimates in the M step and iterate the
process until convergence.

SPEECH RECOGNITION

Definition: Speech recognition is the task of identifying a sequence of SPEECH words


uttered by a speaker, given the acoustic signal. It has become one of the mainstream
applications of AI.

1. Example: The phrase “recognize speech” sounds almost the same as “wreak a nice
beach” when spoken quickly. Even this short example shows several of the issues that
make speech problematic.

2. First segmentation: written words in English have spaces between them, but in fast
speech there are no pauses in “wreck a nice” that would distinguish it as a multiword
phrase as opposed to the single word “recognize”.

3. Second, coarticulation: when speaking quickly the “s” sound at the end of “nice”
merges with the “b” sound at the beginning of “beach” yielding something that is close
to a “sp”. Another problem that does not show up in this example is homophones –
words like “to”, “too” and “two” that sound the same but differe in meaning

argmax P(word1:t | sound1:t) = argmax P(sound1:t | word1:t) P(word1:t).


word1:t word1:t

Heere P (sound1:t |sound1:t) is the acrostic model. It describes the sound of words – that
“ceiling” begins with a soft “c” and sounds the same as “sealing”. P (word1:t) is known
as the language model. It specifies the prior probability of each utterance – for
example, that “ceiling fan” is about 500 times more likely as a word sequence than
“sealing fan”.

4. Once we define the acoustic and language models, we can solve for the most likely
sequence of words using the Viterbi algorithm.

Acoustic Model

17
1. An analog-to-digital converter measures the size of the current – which approximates
the amplitude of the sound wave at discrete intervals called as sampling rate.

2. The precision of each measurement is determined by the quantization factor, speech


recognizers typically keep 8 to 12 bits. That means that a low-end system, sampling at
8 kHz with 8-bit quantization, would require nearly half a megabyte per minute of
speech.

3. A phoneme is the smallest unit of sound that has a distinct meaning to speakers of a
particular language.

For example the “t” in “stick” sounds similar enough to the “t” in “tick” that speakers
of English consider them the same phoneme.

Each frame is summarized by a vector of features. Below picture represents phone


model.

Figure 5.5 Translating the acoustic signal into a sequence of frames. In this
diagram each frame is described by the discretized values of three acoustic
features; a real system would have dozens of features

Building a speech recognizer

1. The quality of a speech recognition system depends on the quality of all of its
components – the language model, the word-pronunciation models, the phone models,
and the signal processing algorithms used to extract spectral features from the acoustic
signals.

2. The systems with the highest accuracy work by training a different models for each
speaker, thereby capturing differences in dialect as well as male / female and other
variations. This training can require several hours of interaction with the speaker, so the
systems with the most widespread adoption do not create speaker-specific models.

3. The accuracy of a system depends on a number of factors. First, the quality of the signal
matters: a high-quality directional microphone aimed at a stationary mouth in a padded
room will do much better than a cheap microphone transmitting a signal over phone
lines from a car in traffic with the radio playing. The vocabulary size matters: when
18
recognizing digit strings with a vocabulary of 11 words (1-9 plus “oh” and “zero)”, the
word error rate will be below 0.5%, where as it rises to about 10% on news stories with
a 20,000-word vocabulary, and 20% on a corpus with a 64,000-word vocabulary. The
task matters too: when the system is trying to accomplish a specific task – book a flight
or give directions to a restaurant – the task can often be a accomplished perfectly even
with a word error rate of 109% or more.

Figure 5.6A schematic diagram of the choices for a machine translation system.
We start with English text at the top. An interlingua-based system follows the
solid lines, parsing English first into a syntactic form, then into a semantic
representation and an interlingua representation, and then through generation
to a semantic, syntactic, and lexical form in French. A transfer-based system
uses the dashed lines as a shortcut. Different systems make the transfer at
different points; some it at multiple points.

5.19 ROBOT

1. Robots are physical agents that perform tasks by manipulating the physical world.

2. To do so, they are equipped with effectors such as legs, wheels, joints, and grippers.

3. Effectors have a single purpose: to assert physical forces on the environment.

4. Robots are also equipped with sensors, which allow them to perceive their environment.

5. Present day robotics employs a diverse set of sensors, including cameras and lasers to
measure the environment, and gyroscopes and accelerometers to measure the robot’s
own motion.

6. Most of today’s robots fall into one of three primary categories. Manipulators, or robot
arms are physically anchored to their workplace, for example in a factory assembly line
or on the International Space Station.

Robot Hardware

19
1. Sensors are the perceptual interface between robot and environment.

2. Passive sensors, such as cameras, are true observers of the environment: they capture
signals that are generated by other sources in the environment.

3. Active sensors, such as sonar, send energy into the environment. They rely on the fact
that this energy is reflected back to the sensor. Active sensors tend to provide more
information than passive sensors, but at the expense of increased power consumption
and with a danger of interference when multiple active sensors are used at the same
time. Whether active or passive, sensors can be divided into three types, depending on
whether they sense the environment, the robot’s location, or the robot’s internal
configuration.

4. Range finders are sensors that measure the distance to nearby objects. In the early days
of robotics, robots were commonly equipped with sonar sensors. Sonar sensors emit
directional sound waves, which are reflected by objects, with some of the sound making
it.

5. Stereo vision relies on multiple cameras to image the environment from slightly
different viewpoints, analyzing the resulting parallax in these images to compute the
range of surrounding objects. For mobile ground robots, sonar and stereo vision are
now rarely used, because they are not reliably accurate.

6. Other range sensors use laser beams and special 1-pixel cameras that can be directed
using complex arrangements of mirrors or rotating elements. These sensors are called
scanning lidars (short for light detection and ranging).

7. Other common range sensors include radar, which is often the sensor of choice for
UAVs. Radar sensors can measure distances of multiple kilometers. On the other
extreme end of range sensing are tactile sensors such as whiskers, bump panels, and
touch-sensitive skin. These sensors measure range based on physical contact, and can
be deployed only for sensing objects very close to the robot.

8. A second important class of sensors is location sensors. Most location sensors use range
sensing as a primary component to determine location. Outdoors, the Global
Positioning System (GPS) is the most common solution to the localization problem.

9. The third important class is proprioceptive sensors, which inform the robot of its own
motion. To measure the exact configuration of a robotic joint, motors are often equipped
with shaft decoders that count the revolution of motors in small increments.

10. Inertial sensors, such as gyroscopes, rely on the resistance of mass to the change of
velocity. They can help reduce uncertainty.

20
11. Other important aspects of robot state are measured by force sensors and torque sensors.
These are indispensable when robots handle fragile objects or objects whose exact
shape and location is unknown.

Robotic Perception

1. Perception is the process by which robots map sensor measurements into internal
representations of the environment. Perception is difficult because sensors are noisy,
and the environment is partially observable, unpredictable, and often dynamic. In other
words, robots have all the problems of state estimation (or filtering)

2. As a rule of thumb, good internal representations for robots have three properties: they
contain enough information for the robot to make good decisions, they are structured
so that they can be updated efficiently, and they are natural in the sense that internal
variables correspond to natural state variables in the physical world.

Figure 5.7 Robot perception can be viewed as temporal inference from


sequence of actions and measurements, as illustrated by this dynamic Bayers
network

2. Another machine learning technique enables robots to continuously adapt to broad


changes in sensor measurements.

3. Adaptive perception techniques enable robots to adjust to such changes. Methods that
make robots collect their own training data (with labels!) are called self-supervised. In
this instance, the robot machine learning to leverage a short-range sensor that works
well for terrain classification into a sensor that can see much farther.

5.20 PLANNING TO MOVE

1. All of a robot’s deliberations ultimately come down to deciding how to move effectors.

2. The point-to-point motion problem is to deliver the robot or its end effector to a
designated target location.

3. A greater challenge is the compliant motion problem, in which a robot moves while
being in physical contact with an obstacle.

4. An example of compliant motion is a robot manipulator that screws in a light bulb, or


a robot that pushes a box across a table top. We begin by finding a suitable
representation in which motion-planning problems can be described and solved. It turns

21
out that the configuration space—the space of robot states defined by location,
orientation, and joint angles—is a better place to work than the original 3D space.

5. The path planning problem is to find a path from one configuration to another in
configuration space. We have already encountered various versions of the path-
planning problem throughout this book; the complication added by robotics is that path
planning involves continuous spaces. T

6. Here are two main approaches: cell decomposition and skeletonization. Each reduces
the continuous path- planning problem to a discrete graph-search problem. In this
section, we assume that motion is deterministic and that localization of the robot is
exact. Subsequent sections will relax these assumptions.

7. The second major family of path-planning algorithms is based on the idea of


skeletonization. These algorithms reduce the robot’s free space to a one-dimensional
representation, for which the planning problem is easier. This lower-dimensional
representation is called a skeleton of the configuration space.

5.20.1 Configuration Space

1. It has two joints that move independently. Moving the joints alters the (x, y) coordinates
of the elbow and the gripper. (The arm cannot move in the z direction.) This suggests
that the robot’s configuration can be described by a four-dimensional coordinate: (xe,
ye) for the location of the elbow relative to the environment and (xg, yg) for the location
of the gripper. Clearly, these four coordinates characterize the full state of the robot.
They constitute what is known as workspace representation.

2. Configuration spaces have their own problems. The task of a robot is usually expressed
in workspace coordinates, not in configuration space coordinates. This raises the
question of how to map between workspace coordinates and configuration space.

3. These transformations are linear for prismatic joints and trigonometric for revolute
joints. This chain of coordinate transformation is known as kinematics.

4. The inverse problem of calculating the configuration of a robot whose effector location
is specified in workspace coordinates is known as inverse kinematics.The
configuration space can be decomposed into two subspaces: the space of all
configurations that a robot may attain, commonly called free space, and the space of
unattainable configurations, called occupied space.

5.20.2 Cell Decomposition Methods

1. The simplest cell decomposition consists of a regularly spaced grid.

22
2. Grayscale shading indicates the value of each free-space grid cell—i.e., the cost of the
shortest path from that cell to the goal.

3. Cell decomposition methods can be improved in a number of ways, to alleviate some


of these problems. The first approach allows further subdivision of the mixed cells—
perhaps using cells of half the original size. This can be continued recursively until a
path is found that lies entirely within free cells. (Of course, the method only works if
there is a way to decide if a given cell is a mixed cell, which is easy only if the
configuration space boundaries

4. Have relatively simple mathematical descriptions.) This method is complete provided


there is a bound on the smallest passageway through which a solution must pass. One
HYBRID A* algorithm that implements this is hybrid A*.

5.20.3 Modified Cost Functions

1. A potential field is a function defined over state space, whose value grows with the
distance to the closet obstacle.

2. The potential field can be used as an additional cost term in the shortest-path
calculation.

3. This induces an interesting trade off. On the one hand, the robot seeks to minimize path
length to the goal. On the other hand, it tries to stay away from obstacles by virtue to
minimizing the potential function.

4. There exist many other ways to modify the cost function. For example, it may be
desirable to smooth the control parameters over time.

5.20.4 Skeletonization methods

1. The second major family of path-planning algorithms is based on the idea of


skeletonization.

2. These algorithms reduce the robot’s free space to a one-dimensional representation, for
which the planning problem is easier.

3. This lower-dimensional representation is called a skeleton of the configuration space.

4. It is a Voronoi graph of the free space - the set of all points that are equidistant to two
or more obstacles. To do path planning with a Voronoi graph, the robot first changes
its present configuration to a point on the Voronoi graph.

5. It is easy to show that this can always be achieved by a straight-line motion in


configuration space. Second, the robot follows the Voronoi graph until it reaches the
point nearest to the target configuration. Finally, the robot leaves the Voronoi graph

23
and moves to the target. Again, this final step involves straight-line motion in
configuration space.

Figure 5.8(a) A repelling potential field pushes the robot away from
obstacles, (b) Path found by simultaneously minimizing path length and
the potential

Figure 5.9 (a) The Voronoi graph is the set of points equidistance to two
or more obstacles in configuration space (b) A probabilistic moodmap,
composed of 100 randomly chosen points in free space.

Robust methods

1. A robust method is one that assumes a bounded amount of uncertainty in each aspect
of a problem, but does not assign probabilities to values within the allowed interval.

2. A robust solution is one that works no matter what actual values occur, provided they
are within the assumed intervals.

3. An extreme form of robust method is the conformant planning approach.

24
Figure 5.10A two-dimensional environment, velocity uncertainty cone,
and envelope of possible robot motions. The intended velocity is , but
with uncertainty the actual velocity could be anywhere in C, resulting in
a final configuration somewhere in the motion envelope, which means we
wouldn’t know if we hit the hole or not

Figure 5.11The first motion command and the resulting envelope of


possible robot motions. No matter what the error, we know the final
configuration will be to the left of the hole

Define Speech Recognition. Explain different types of speech recognition systems.

Speech recognition is an interdisciplinary subfield of computer science and


computational linguistics that develops methodologies and technologies that enable the
recognition and translation of spoken language into text by computers. It is also known as
automatic speech recognition (ASR), computer speech recognition or speech to text (STT).
Speech recognition is the process by which computer maps an acoustic speech signal to some
form of abstract meaning of the speech. The fundamental aspect of speech recognition is the
translation of sound into text and commands.

This process is highly difficult since sound has to be matched with stored sound bites
on which further analysis has to be done because sound bites do not match with pre-existing
sound pieces. Feature extraction technique and pattern matching techniques plays important
role in speech recognition system to maximize the rate of speech recognition of various
persons.

Types of speech recognition system based on utterances

1. Isolated Words Isolated word recognition system which recognizes single utterances
i.e. single word. Isolated word recognition is suitable for situations where the user
is required to give only one word response or commands. It is simple and easiest for
implementation because word boundaries are obvious and the words tend to be clearly
pronounced which is the major advantage of this type. Eg. Yes/No, Listen / Not-Listen

2. Connected Words A connected words system is similar to isolated words, but it allows
separate utterances to be - run-together‟ with a minimal pause between them.

25
Utterance is the vocalization of a word or words that represent a single
meaning to the computer. Eg. drive car(action + object), baby book(agent + object) \

3. Continuous Speech Continuous speech recognition system allows users to speak almost
naturally, while the computer determines its content. Continuous speech recognition
system is difficult to develop. Eg. dictation

4. Spontaneous Speech Spontaneous speech recognition system recognizes the natural


speech. Spontaneous speech is natural that comes suddenly through mouth. An ASR
system with spontaneous speech is able to handle a variety of natural speech
features such as words being run together. Spontaneous speech may include
mispronunciation, false-starts and non words. Eg. Spontaneous speech may include
mispronunciation, false-starts (will you – will you go to your college?) and non
words("ums").

Explain about Automatic Speech Recognition System

Speech recognition is an interdisciplinary subfield of computer science and


computational linguistics that develops methodologies and technologies that enable the
recognition and translation of spoken language into text by computers. It is also known as
automatic speech recognition (ASR), computer speech recognition or speech to text (STT).

Figure 5.12 System Architecture of for Automatic Speech Recognition System

Pre-processing/Digital Processing The recorded acoustic signal is an analog signal. An


analog signal cannot directly transfer to the ASR systems. So these speech signals need to
transform in the form of digital signals and then only they can be processed. These digital
signals are move to the first order filters to spectrally flatten the signals. This procedure
increases the energy of signal at higher frequency.

Feature Extraction Feature extraction step finds the set of parameters of utterances that
have acoustic correlation with speech signals and these parameters are computed through
processing of the acoustic waveform. These parameters are known as features or feature vectors
(x₁, x₂, x₃, ...). The main focus of feature extractor is to keep the relevant information and
26
discard irrelevant one. Feature extractor divides the acoustic signal into 10-25 ms. Data
acquired in these frames is multiplied by window function. There are many types of window
functions that can be used such as hamming Rectangular, Blackman, Welch or Gaussian etc.
Feature extraction methods: Principal component Analysis (PCA), Linear Discriminate
Analysis (LDA), Wavelet, Independent component Analysis(ICA) etc.

27
TEXT/REFERENCE BOOKS

1. S. Russell and P. Norvig, “Artificial Intelligence: A Modern Approach, Prentice Hall,


3rd Edition, 2009.

2. M. Tim Jones, “Artificial Intelligence: A Systems Approach (Computer Science)”,


Jones and Bartlett Publishers, Inc.; 1st Edition, 2008.

3. Nils J. Nilsson, “The Quest for Artificial Intelligence”, Cambridge University Press,
2009.

4. William F. Clocksin and Christopher S. Mellish, “Programming in Prolog: Using the


ISO Standard”, 5th Edition, Springer, 2003.

5. Gerhard Weiss, “Multi Agent Systems”, 2nd Edition, MIT Press, 2013.

6. David L. Poole and Alan K. Mackworth, “Artificial Intelligence: Foundations of


Computational Agents”, Cambridge University Press, 2010

28

You might also like