
Lecture # 1

Session 2003
Introduction to Automatic Speech Recognition

Lectures: Jim Glass & guest lecturers


Introduction to ASR
  Problem definition
  State of the art examples
Course overview
  Lecture outline
  Assignments
  Term Project
  Grading

6.345 Automatic Speech Recognition Introduction 1


Communication via Spoken Language

[Figure: spoken communication between human and computer. Input path: speech, recognition, text, understanding, meaning. Output path: meaning, generation, text, synthesis, speech.]



Virtues of Spoken Language

Natural: Requires no special training


Flexible: Leaves hands and eyes free
Efficient: Has high data rate
Economical: Can be transmitted/received inexpensively

Speech interfaces are ideal for information access and management when:
The information space is broad and complex,
The users are technically naive, or
Only telephones are available.



Diverse Sources of Constraint for
Spoken Language Communication
Acoustic: human vocal tract
Phonetic: "let us pray" vs. "lettuce spray"
Phonological: "gas shortage", "fish sandwich"
Phonotactic: "blit" vs. "vnuk"
Syntactic: "I am flying to Chicago tomorrow" vs. "tomorrow I flying Chicago am to"
Semantic: "Is the baby crying" vs. "Is the bay bee crying"
Contextual: "It is easy to recognize speech" vs. "It is easy to wreck a nice beach"
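One way to see how a linguistic constraint separates acoustically similar word strings is to score both candidates with a small language model. The sketch below is only an illustration: the four-sentence corpus is invented, the names `train_bigrams` and `log_prob` are hypothetical, and the add-one smoothing is a crude stand-in for the estimation methods a real recognizer would use.

```python
import math
from collections import Counter

# Toy training text, invented purely for illustration (not course data).
CORPUS = [
    "it is easy to recognize speech",
    "it is hard to recognize speech in noise",
    "we walked on a nice beach",
    "it is easy to walk on a beach",
]

def train_bigrams(sentences):
    """Count unigrams and bigrams, adding <s> and </s> sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a word sequence.

    Unseen words simply get zero counts, which is a rough approximation.
    """
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

UNIGRAMS, BIGRAMS = train_bigrams(CORPUS)
hypotheses = [
    "it is easy to recognize speech",
    "it is easy to wreck a nice beach",
]
# Pick the hypothesis the toy language model considers more likely.
best = max(hypotheses, key=lambda h: log_prob(h, UNIGRAMS, BIGRAMS))
print(best)
```

With this toy corpus the model prefers "it is easy to recognize speech", because every bigram in it was seen in training while "to wreck" and "wreck a" were not.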
Automatic Speech Recognition

[Figure: Speech Signal -> ASR System -> Recognized Words]

An ASR system converts the speech signal into words


The recognized words can be
The final output, or
The input to natural language processing



Application Areas for Speech Based Interfaces

Mostly input (recognition only)
  Simple command and control
  Simple data entry (over the phone)
  Dictation
Interactive conversation (understanding needed)
  Information kiosks
  Transactional processing
  Intelligent agents



Basic Speech Recognition Challenges

Co-articulation
Speaker independence
  Dialect variations
  Non-native speakers
Spontaneous speech
  Disfluencies
  Out-of-vocabulary words
Language modeling
Noise robustness



Phonological Variation Example
The acoustic realization of a phoneme depends strongly on
the context in which it occurs

[Figure: spectrograms (frequency vs. time) of the words TEA, TREE, STEEP, CITY, BEATEN]



Examples Contrasting
Read and Spontaneous Speech (Navigation Domain)

Filled and unfilled pauses: read, spontaneous


Lengthened words: read, spontaneous
False starts: read, spontaneous



Sometimes Real Data will Dictate
Technology Requirements (City Name Domain)
Technology Required      Example
Simple word spotting     Um, Braintree
Complex word spotting    Eh yes, Avis rent-a-car in Boston
                         Hello, please Brighton, uh, can I have the number of Earthscape, in, uh, on Nonantum Street
Speech understanding     Woburn, uh, Somerville. I'm sorry



Parameters that Characterize the
Capabilities of ASR Systems

Parameters         Range
Speaking Mode:     Isolated word to continuous speech
Speaking Style:    Read speech to spontaneous speech
Enrollment:        Speaker-dependent to speaker-independent
Vocabulary:        Small (<20 words) to large (>50,000 words)
Language Model:    Finite-state to context-sensitive
Perplexity:        Small (<10) to large (>200)
SNR:               High (>30 dB) to low (<10 dB)
Transducer:        Noise-cancelling microphone to cell phone
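The table gives a range for perplexity without defining it. Under the usual definition, perplexity is the inverse geometric mean of the probabilities a language model assigns to each word of a test text, so it can be read as the average number of equally likely word choices at each point. The sketch below assumes the per-word probabilities are already available; the function name `perplexity` is illustrative.

```python
import math

def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability per word."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)

# A model that spreads probability evenly over the ten digits assigns
# probability 1/10 to every word of a seven-digit string.
uniform_digits = [1 / 10] * 7
print(perplexity(uniform_digits))
```

For the uniform digit model the result is 10 (up to floating-point rounding), which matches the intuition that each position offers ten equally likely choices.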



ASR Trends*: Then and Now

Recognition Units:
  before mid 70s: whole-word and sub-word units
  mid 70s - mid 80s: sub-word units
  after mid 80s: sub-word units
Modeling Approaches:
  before mid 70s: heuristic and ad hoc; rule-based and declarative
  mid 70s - mid 80s: template matching; deterministic and data-driven
  after mid 80s: mathematical and formal; probabilistic and data-driven
Knowledge Representation:
  before mid 70s: heterogeneous and complex
  mid 70s - mid 80s: homogeneous and simple
  after mid 80s: homogeneous and simple
Knowledge Acquisition:
  before mid 70s: intense knowledge engineering
  mid 70s - mid 80s: embedded in simple structure
  after mid 80s: automatic learning

* There are, of course, many exceptions.



Speech Recognition: Where Are We Now?

High performance, speaker-independent speech recognition is now possible
  Large vocabulary (for cooperative speakers in benign environments)
  Moderate vocabulary (for spontaneous speech over the phone)
Commercial recognition systems are now available
  Dictation (e.g., Dragon, IBM, L&H, Philips) Scansoft
  Telephone transactions (e.g., AT&T, Nuance, Philips, SpeechWorks, TellMe, etc.)
When well-matched to applications, technology is able to help perform real work



Examples of ASR Performance
Speaker-independent, continuous-speech ASR now possible
Digit recognition over the telephone with word error rate of 0.3%
Error rate cut in half every two years for moderate vocabulary tasks
Error for spontaneous speech more than twice that of read speech
Conversational speech, involving multiple speakers and poor acoustic environment, remains a challenge
Tens of hours of training data to port to a different domain
Statistical modeling using automatic training achieves significant advances
[Figure: word error rate (%, log scale from 0.1 to 100) versus year (1987 to 2001) for six tasks: Digits; 1K, Read; 2K, Spontaneous; 20K, Read; 64K, Broadcast; 10K, Conversational]
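The claim that error rate is cut in half every two years can be written as an exponential decay. The symbols below are assumptions introduced for illustration: $E(t)$ is the word error rate in year $t$ and $t_0$ is a reference year. The 10% starting value in the worked line is hypothetical, not a figure from the slide.

```latex
E(t) = E(t_0)\, 2^{-(t - t_0)/2}

% Hypothetical worked example: starting from E(t_0) = 10\%,
% six years later the rate would be
E(t_0 + 6) = 10\% \times 2^{-3} = 1.25\%
```

On a logarithmic error axis such a trend appears as a straight line, which is how the performance figure on this slide is plotted.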
Important Lessons Learned

Statistical modeling and data-driven approaches have proved to be powerful
Research infrastructure is crucial:
  Large amounts of linguistic data
  Evaluation methodologies
Availability and affordability of computing power lead to shorter technology development cycles and real-time systems
Performance-driven paradigm accelerates technology development
Interdisciplinary collaboration produces enhanced capabilities (e.g., spoken language understanding)



Major Components in a Speech Recognition System

[Figure: Speech Signal -> Representation -> Search -> Recognized Words. The search applies constraints from acoustic, lexical, and language models, which are derived from training data.]

Speech recognition is the problem of deciding on
How to represent the signal
How to model the constraints
How to search for the best answer
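These three decisions line up with the standard statistical formulation of speech recognition, which the slide does not state explicitly. The symbols are assumptions introduced here: $A$ is the acoustic representation of the signal, $W$ is a candidate word sequence, and $\hat{W}$ is the recognizer output.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

Read this way, the representation determines $A$, the acoustic and lexical models supply $P(A \mid W)$, the language model supplies $P(W)$, and the search carries out the $\arg\max$. The denominator $P(A)$ can be dropped because it does not depend on $W$.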
Demo: Continuous Dictation

IBM ViaVoice running on a ThinkPad


Trained for a quiet office (classroom performance not optimal)



Demo: Simple Telephone Transactions

Developed by SpeechWorks International (there are others)


Shipping cost information for Fedex (1-800-GO-FEDEX)
Provides information on:
* Package types
* Source and destination zip codes
* Weight, size, value
* Service type
Handles all US rate information calls
Automated Brokerage System for E*Trade
Supports quotes and trades
* Using symbols or names
* For stocks, options, and mutual funds
Users can barge in at any time
Nationwide deployment for over 450,000 customers



Conversational Interfaces: The Next Generation

Enables us to converse with machines (in much the same way we communicate with one another) in order to create, access, and manage information and to solve problems
Augments speech recognition technology with natural language technology in order to understand the verbal input
Can engage in a dialogue with a user during the interaction
Uses natural language to speak the desired response
Is what Hollywood and every futurist say we should have!



A Conversational System Architecture

[Figure: conversational system architecture. Components: speech recognition, language understanding, discourse context, dialogue management, database, language generation, speech synthesis. Labels on the connections: speech, words, meaning, meaning representation, sentence, graphs & tables.]



Demo: Conversational Interface

Jupiter weather information system


Access through telephone
500 cities worldwide
Harvest weather information from the Web several times daily

Jupiter
A conversational interface for on-line
weather information over the phone.

1-888-573-8255
(outside the USA: 1-617-258-0300)
http://www.sls.lcs.mit.edu/jupiter
Spoken Language Systems Group,
MIT Laboratory for Computer Science



(Real) Data Improves Performance (Weather Domain)
[Figure: word error rate (%, left axis, 0 to 45) and amount of training data (x1000, right axis, log scale from 1 to 100) across successive evaluations from 97 to 99, labeled Apr, May, Jun, Jul, Aug, Nov, Apr, Nov, May]

Longitudinal evaluations show improvements
Collecting real data improves performance:
  Enables increased complexity and improved robustness for acoustic and language models
  Better match than laboratory recording conditions
  Users come in all kinds
But We Are Far from Done!

Corpus                   Speech Type     Lexicon Size   Word Error Rate (%)   Human Error Rate (%)
Digit Strings (phone)    spontaneous     10             0.3                   0.009
Resource Management      read            1000           3.6                   0.1
ATIS                     spontaneous     2000           2                     ----
Wall Street Journal      read            64000          6.6                   1
Radio News               mixed           64000          13.5                  ----
Switchboard (phone)      conversation    10000          19.3                  4
Call Home (phone)        conversation    10000          30                    ----
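The word error rates in this table are conventionally computed by aligning the recognizer output against a reference transcript and counting substitutions, deletions, and insertions. The sketch below shows that computation with a word-level edit distance; the function name `word_error_rate` is illustrative, and scoring tools used in formal evaluations also apply text normalization that is omitted here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("it is easy to recognize speech",
                      "it is easy to wreck a nice beach")
print(f"{100 * wer:.1f}%")
```

In this example the reference has six words and the hypothesis differs by two substitutions and two insertions, so the rate is 4/6, about 67%. Because insertions count as errors, the rate can exceed 100%.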



Course Outline
[Figure: course topics arranged around the recognizer diagram (Speech Signal -> Representation -> Search -> Recognized Words, with Acoustic, Lexical, and Language Models)]
Paralinguistic Information
Speech Understanding
Multi-Modal Interfaces
Acoustic-Phonetic Modeling
Pattern Recognition
Finite-State Transducers
Language Modeling
Acoustic Theory of Speech Production
Robust ASR
Adaptation
Properties of Speech Sounds
Signal Representation
Search Algorithms
Vector Quantization & Clustering
Hidden Markov Modeling
Graphical Models
Segmental Models



Course Logistics

Lectures: Two sessions/week, 1.5 hours/session


Labs: All week during school hours

Grading
9 Assignments 45%
2 Quizzes 30%
Term Project (about 4 weeks) 25%



Assignments

There will be 9 weekly assignments


Problems that expand on the lecture material
Lab assignments to reinforce the lecture material
Assignments are due the following week on Wednesday
Lab work will be done in the computer lab
Lab sign-up (on the course web page) is necessary
Solutions will be provided



Term Project
Investigate a contrasting condition in an ASR experiment
We will provide different recognizers and domains for you to
select from, and will work with you to select a topic
You choose:
Evaluation condition (e.g., phonetic classification, word recognition)
Database (e.g., TIMIT, RM, Jupiter, Aurora, ...)
Recognizer (e.g., Sphinx, Summit, GMTK, ...)
Contrasting condition (e.g., signal representation, acoustic model,
language model)
Requirements:
Proposal
Experiments (the bulk of the work)
Write-up
Presentation on extended last day of class



References (on reserve at Barker)
Huang, Acero, & Hon, Spoken Language Processing, Prentice-Hall, 2001.
Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.
Rabiner & Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.
Duda, Hart, & Stork, Pattern Classification, Wiley & Sons, 2001.
Stevens, Acoustic Phonetics, MIT Press, 1998.
