
MULTILINGUAL SPEECH PROCESSING
Veronica Gosteva – 1002
Linguistic and marketing
◦ Multilingual speech processing provides a great opportunity to revisit lingering challenges.

◦ First, current speech technology is challenged by the peculiarities of many languages, which increases the likelihood of detecting inappropriate modeling assumptions.
◦ Second, recognition of multiple languages,
especially their simultaneous recognition,
can be viewed as an extreme instance of
model mismatch and can therefore serve
as a testbed for model adaptation and
other robustness techniques.
◦ Third, due to the need for technology-educated language experts, we are forced to think about speech and language technology education in general. Last but not least, the high demand for speech processing systems in new languages encourages the development of tools and methods that automate the building process.
◦ The difficulties of speech processing are compounded in multilingual systems, and few if any commercial multilingual speech services exist to date. Yet intense research activity is underway in areas of potential commercial interest, aiming at:
Spoken Language Identification: By determining a speaker's language automatically, callers could be routed to human translation services. This is of particular interest to public services such as police and government offices.

Multilingual Speech Recognition and Understanding: Future spoken language services could be provided in multiple languages. Dictation systems and spoken language database access systems, for example, could operate in multiple languages and deliver text or information in the language of the input speech.

Speech Translation: Voice-activated dictionaries, phrase books or spoken language translators, telephone-based speech translation services, and/or automatic translation of foreign broadcasts and speeches.
Statistical Language Modeling
◦ A language model is a probability assignment over all
possible word sequences in a natural (human)
language. Its goal, loosely stated, is to assign relatively
large probability to meaningful, grammatical, or merely
frequent word sequences compared to rare,
ungrammatical, or nonsensical ones.
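To make the definition concrete, here is a minimal sketch (not from the slides) of a bigram language model with add-one smoothing; the toy corpus and function names are invented for illustration, and real systems use far larger corpora and better smoothing.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over a toy corpus (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sentence_probability(words, unigrams, bigrams):
    """P(w1..wn) ~ product of P(w_i | w_{i-1}), with add-one smoothing."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob

# A frequent, grammatical word order should score higher than a scrambled one.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
uni, bi = train_bigram_lm(corpus)
print(sentence_probability(["the", "cat", "sat"], uni, bi))
print(sentence_probability(["sat", "the", "cat"], uni, bi))
```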

◦ The classical communication channel model of automatic speech recognition.
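Written out, the channel model usually takes the following standard textbook form (not a reproduction of the slide's figure), where A denotes the acoustic observations, W a candidate word sequence, P(A | W) the acoustic model, and P(W) the language model:

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W} \, P(W \mid A)
        \;=\; \operatorname*{arg\,max}_{W} \, P(A \mid W)\, P(W)
```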
Translation-Aware Language Modeling
◦ A speech recognition system, which contains a language model, often serves as the front end of a translation system. The back end of such a system is either another human language or a database access and manipulation language. In the case of spoken interaction with databases, considerable progress has been made by paying special attention to the parts of the source-language sentence that matter for the application.
Non-native Speech

There has been much progress in the past few years in the areas of large vocabulary speech recognition, dialog systems, and robustness of recognizers to noisy environments, making speech processing systems ready for real-world applications.

Non-native speech, however, tends to have a large impact on the accuracy of current speech recognition systems. This is the case for small vocabulary, isolated word recognition tasks as well as for large vocabulary, spontaneous speech recognition tasks.
Non-native speech
◦ The differences between native and non-native speech can be quantified in a variety of ways, all relevant to the problem of improving recognition for non-native speakers.

◦ Differences in articulation, speaking rate, and pause distribution can affect acoustic modeling, which looks for patterns in phone pronunciation, duration, and cross-word behavior.

◦ Differences in disfluency distribution, word choice, syntax, and discourse style can affect language modeling. And, of course, as these components are not independent of one another, all affect overall recognizer performance.
◦ When speaking a foreign language, one must
concentrate not only on the meaning of the
message but also on getting the syntax right,
articulating the sounds, capturing the cadence of
the sequence of words, speaking with the right
level of formality, and mastering other elements
of spoken language that are more or less
automatic for a native speaker. The additional
cognitive load can result in slower speech, with
more pauses as the speaker stops to think. The
fluidity of speech is called fluency, and offers a
number of opportunities for quantification.
◦ The table compares fluency for native speakers of English, Japanese, and Chinese speaking English in read and spontaneous speech tasks.

◦ The overall word rate (number of words per second) is much lower for the non-native speakers for both types of speaking tasks. The main factor in the decrease, though, seems to be the number of pauses inserted.
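As a hedged illustration of how such fluency measures could be computed (the pause threshold and data layout below are assumptions, not the measurements behind the table), word rate and pause counts can be derived from time-aligned transcripts:

```python
def fluency_stats(word_intervals, min_pause=0.3):
    """Compute words per second and pause count from (word, start, end) tuples.

    word_intervals: list of (word, start_sec, end_sec), sorted by start time.
    min_pause: minimum silent gap (seconds) counted as a pause (assumed value).
    """
    if not word_intervals:
        return {"words_per_sec": 0.0, "pauses": 0}
    total_words = len(word_intervals)
    duration = word_intervals[-1][2] - word_intervals[0][1]
    pauses = sum(
        1
        for (_, _, prev_end), (_, next_start, _) in zip(word_intervals, word_intervals[1:])
        if next_start - prev_end >= min_pause
    )
    return {"words_per_sec": total_words / duration, "pauses": pauses}

# Example: a short utterance with one long hesitation between "is" and "difficult".
utterance = [("speaking", 0.0, 0.5), ("is", 0.6, 0.8), ("difficult", 1.6, 2.2)]
print(fluency_stats(utterance))
```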
Coupling Speech Recognition and Translation

◦ Due to the peculiarities of spoken language, an effective solution to speech translation cannot be expected to be a mere sequential connection of automatic speech recognition (ASR) and machine translation components, but rather a coupling between the two.
This coupling can be characterized by three orthogonal dimensions:

1) the complexity of the search algorithm,

2) the incrementality,

3) the tightness, which describes how closely ASR and MT interact while searching for a solution (Ringger, 1995).

A sketch contrasting a loose and a slightly tighter coupling is given below.
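The following minimal sketch shows the loosest form of coupling, where only the recognizer's single best hypothesis is handed to the translation component, next to a slightly tighter variant that rescores the ASR n-best list with the MT score. All component interfaces and names here are placeholders invented for the example, not any particular system's API.

```python
from typing import Callable, List, Tuple

# Assumed interfaces: an ASR returning n-best hypotheses with scores,
# and an MT component returning a translation plus a score.
ASRHypotheses = List[Tuple[str, float]]          # (transcript, ASR score)
Recognizer = Callable[[bytes], ASRHypotheses]
Translator = Callable[[str], Tuple[str, float]]  # (translation, MT score)

def loosely_coupled(audio: bytes, asr: Recognizer, mt: Translator) -> str:
    """Sequential connection: translate only the 1-best ASR hypothesis."""
    best_transcript, _ = max(asr(audio), key=lambda h: h[1])
    translation, _ = mt(best_transcript)
    return translation

def nbest_coupled(audio: bytes, asr: Recognizer, mt: Translator,
                  weight: float = 0.5) -> str:
    """Tighter coupling: rescore the ASR n-best list with the MT score."""
    scored = []
    for transcript, asr_score in asr(audio):
        translation, mt_score = mt(transcript)
        scored.append((weight * asr_score + (1 - weight) * mt_score, translation))
    return max(scored)[1]
```

Truly tight coupling would go further and let the MT model search over the recognizer's word lattice directly rather than over a fixed n-best list.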
◦ State-of-the-art translation systems use a variety of different coupling strategies. Examples of loosely coupled systems are IBM's MASTOR (Liu et al., 2003), ATR-MATRIX (Takezawa et al., 1998c), and NESPOLE! (Lavie et al., 2001a), which uses the interlingua-based JANUS system. Examples of tightly coupled systems are EuTrans (Pastor et al., 2001), developed at UPV, and AT&T's Transnizer (Mohri and Riley, 1997).
◦ A generic SDS consists of (a minimal pipeline sketch follows this list):

• A speech recognizer that transcribes input speech into text

• A natural language understanding component that transforms the recognition output into a semantic representation (typically via parsing)

• A discourse and dialog manager that handles the inheritance of discourse history, content retrieval (often via database access), and dialog turn-taking between the human and the computer

• A spoken response generator that verbalizes the retrieved content

• A text-to-speech synthesizer to generate a spoken presentation of the verbalized content
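A minimal, hedged sketch of how these five components might be chained; every class, method, and stub below is invented for illustration rather than taken from any particular SDS.

```python
class SpokenDialogSystem:
    """Toy pipeline mirroring the five components listed above."""

    def __init__(self, recognizer, parser, dialog_manager, generator, synthesizer):
        self.recognizer = recognizer          # speech -> text
        self.parser = parser                  # text -> semantic frame
        self.dialog_manager = dialog_manager  # frame + history -> retrieved content
        self.generator = generator            # content -> response text
        self.synthesizer = synthesizer        # response text -> audio

    def handle_turn(self, audio):
        text = self.recognizer(audio)
        frame = self.parser(text)
        content = self.dialog_manager(frame)
        response_text = self.generator(content)
        return self.synthesizer(response_text)

# Stub components so the pipeline can be exercised end to end (stocks domain).
sds = SpokenDialogSystem(
    recognizer=lambda audio: "quote for acme",
    parser=lambda text: {"intent": "get_quote", "ticker": "ACME"},
    dialog_manager=lambda frame: {"ticker": frame["ticker"], "price": 12.3},
    generator=lambda content: f"{content['ticker']} is trading at {content['price']}",
    synthesizer=lambda text: text.encode("utf-8"),  # stand-in for TTS audio
)
print(sds.handle_turn(b"..."))
```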
Development of a multilingual SDS sees a drastic increase in system complexity for every additional language supported.

Very often, a multilingual SDS involves multiple speech recognizers, one for each supported language.

---> This naturally creates the need for language identification as a preprocessing step unless the selected language is explicitly stated by the user (a routing sketch is given below).
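One way to realize that preprocessing step, sketched under the assumption of a separate language-ID classifier and one recognizer per language; all names and stubs here are placeholders.

```python
def route_by_language(audio, identify_language, recognizers, default="en"):
    """Run language ID first, then dispatch to the matching recognizer.

    identify_language: callable returning a language code, e.g. "en" or "de".
    recognizers: dict mapping language code -> speech recognizer callable.
    """
    language = identify_language(audio)
    recognizer = recognizers.get(language, recognizers[default])
    return language, recognizer(audio)

# Hypothetical per-language recognizers (stubs standing in for real ASR systems).
recognizers = {
    "en": lambda audio: "english transcript",
    "de": lambda audio: "deutsches Transkript",
}
lang, transcript = route_by_language(b"...", lambda audio: "de", recognizers)
print(lang, transcript)
```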
Multilingual Spoken Dialog Systems

An SDS encompasses a suite of speech and language technologies to offer a conversational interface to dynamic information, including speech recognition, natural language understanding, dialog modeling, and speech synthesis. Hence, the user can present queries to the system by speaking naturally, and the SDS can respond in real time in synthetic speech. Numerous commercial SDS have been deployed for multiple languages.
An example dialog in the stocks domain illustrating the capabilities of a state-of-the-art spoken dialog system (source: www.speechworks.com).
Multilingual Speech Recognition
◦ The speech recognition component uses
an HMM-based approach with context-
dependent acoustic models. In order to
efficiently capture contextual and
temporal variations in the input while
constraining the number of parameters, the
system uses the successive state splitting
(SSS) algorithm in combination with a
minimum description length criterion.
◦ This algorithm constructs appropriate context-
dependent model topologies by iteratively
identifying an HMM state that should be split into
two independent states. It then reestimates the
parameters of the resulting HMMs based on the
standard maximum-likelihood criterion. Two types
of splitting are supported:

◦ Contextual splitting

◦ Temporal splitting
◦ Contextual splitting and temporal splitting.
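A highly simplified sketch of the splitting loop described above; the HMM object, the MDL-gain evaluation, the split operation, and the re-estimation routine are all placeholder callables, since the actual SSS implementation is considerably more involved.

```python
def successive_state_splitting(hmm, max_states, mdl_gain, apply_split, reestimate):
    """Grow an HMM topology by repeated contextual/temporal state splits.

    hmm: placeholder model object assumed to expose .states and .num_states.
    mdl_gain(hmm, state, kind): improvement in the MDL criterion if `state`
        is split contextually or temporally (positive = worth splitting).
    apply_split(hmm, state, kind): return a new HMM with that state split in two.
    reestimate(hmm): maximum-likelihood re-estimation of the HMM parameters.
    """
    while hmm.num_states < max_states:
        # Evaluate every candidate (state, split type) pair.
        candidates = [
            (mdl_gain(hmm, state, kind), state, kind)
            for state in hmm.states
            for kind in ("contextual", "temporal")
        ]
        gain, state, kind = max(candidates, key=lambda c: c[0])
        if gain <= 0:            # no split improves the MDL criterion
            break
        hmm = apply_split(hmm, state, kind)
        hmm = reestimate(hmm)    # standard maximum-likelihood re-estimation
    return hmm
```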
◦ In the past decade, the performance of
automatic speech processing systems
(such as automatic speech recognizers,
speech translation systems, and speech
synthesizers) has improved dramatically,
resulting in an increasingly widespread use
of speech technology in real-world
scenarios.
◦ The challenge of rapidly adapting existing speech processing systems to new languages is currently one of the major remaining bottlenecks in their development.
