Computer Based Automatic Speech Processing: Pham Van Tuan
Speech Processing
Pham Van Tuan
Electronic & Telecommunication Engineering
Danang University of Technology
Course Administration
• Course sequence: Signals & Linear Systems; Digital Signal Processing 1, 2; DSP Using Matlab; Computer Based Speech Processing
• Credits: 2 (lecture + project)
• Grading policy: project (50%) + midterm (20%) + final exam (30%)
• Goals: to provide students with fundamental knowledge of
  ◦ Human speech production and perception
  ◦ Fundamental theory and practice in speech analysis
  ◦ Speech enhancement and speech coding
  ◦ Automatic speech recognition
• Textbook:
  ◦ Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing, Prentice-Hall, May 2001.
Course Content
• Introduction to Computer Speech Processing
• Human Speech Production and Perception
• Speech Spectrogram Reading
• Speech Analysis
• Speech Coding and Speech Enhancement
• Automatic Speech Recognition
• Lab project (Noise Reduction)
Lecture 1
Introduction to Computer Speech Processing
Outline
Grand challenges in Speech and Language
The role of speech
Technology introduction
Videos and projects
State-of-the-art Voice-based Products
Matlab tools for speech processing
Turing Test
The Turing test is a proposal for a test of a
machine's ability to demonstrate intelligence.
It proceeds as follows: a human judge
engages in a natural language conversation
with one human and one machine, each of
which tries to appear human. All participants
are placed in isolated locations. If the judge
cannot reliably tell the machine from the
human, the machine is said to have passed
the test. In order to test the machine's
intelligence rather than its ability to render
words into audio, the conversation is limited
to a text-only channel such as a
computer keyboard and screen.
Alan M. Turing, 1950
Computing machinery and intelligence. Mind, Vol. LIX. 433-460
Turing Test
I believe that in about fifty years' time it will be possible to
programme computers, with a storage capacity of about
10^9, to make them play the imitation game so well that an
average interrogator will not have more than 70 per cent
chance of making the right identification after five minutes of
questioning. The original question, "Can machines think?"
I believe to be too meaningless to deserve discussion.
Nevertheless I believe that at the end of the century the use
of words and general educated opinion will have altered so
much that one will be able to speak of machines thinking
without expecting to be contradicted.
Alan M. Turing, 1950
Computing machinery and intelligence. Mind, Vol. LIX. 433-460
Prediction 59 Years Later
• Turing's technology forecast was great!
  ◦ Gigabyte memory is common
• A computer beat the world chess champion
  ◦ with some help from its programming staff!
• Computers help design most things today
• The intelligence forecast was optimistic
• The Turing test still stands as a long-term challenge
Challenges Implicit in the Turing Test
1. Read and understand as well as a human
2. Think and write as well as a human
3. Hear as well as a native speaker:
   → Speech Recognition (speech to text)
4. Speak as well as a native speaker:
   → Speech Synthesis (text to speech)
5. Remember what is heard and quickly return it on request.
Grand Challenges
Within 10 years speech will be in every device.
Things like speech and ink are so natural, when
they get the right quality level they will be in
everything. As technical hurdles such as
background noise and context are overcome,
major adoption of speech technology will arrive.
Soon, dictating to PCs and giving commands to
cell phones will be basic modes of interacting with
technology
Bill Gates, March 2004
Role of Speech in Different Devices
[Figure: devices plotted by ease of text input (keyboard/pen) against ease of GUI (screen/pointer), each axis running from low to high. Devices shown: Phone, Screen Phone, PDA, Tablet PC, PC, Car, Internet TV.]
A Roadmap for Speech
[Figure: ease of text input (keyboard/pen) against ease of GUI (screen/pointer), each axis running from low to high, with regions labeled Speech-Only Telephony, Dictation, and Multimodal Command/Control.]
Speech Technology Market Opportunity
[Figure: applications positioned against three criteria: Technology Readiness, Customer Need, and Poor Alternative. Applications shown: Meeting / Voicemail Transcription, Mobile Devices / Cars, Telephony / Call Center, Accessibility, Desktop Dictation, Desktop Command & Control.]
Voice-enabled System Technology
[Diagram: the voice-enabled system loop. Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech. All components share Data and Rules.]
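Speech Recognition
[Diagram: Input Speech → Feature Extraction → Pattern Classification (Decoding, Search) → Confidence Scoring → recognized words with confidence scores, e.g. "Hello" (0.9), "World" (0.8). Pattern Classification draws on the Acoustic Model, the Word Lexicon, and the Language Model.]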
Text-to-Speech Systems
TTS Engine
Raw text or tagged text
→ Text Analysis: Document Structure Detection, Text Normalization, Linguistic Analysis
→ (tagged text) Phonetic Analysis: Homograph Disambiguation, Grapheme-to-Phoneme Conversion
→ (tagged phones) Prosodic Analysis: Pitch & Duration Attachment
→ (controls) Speech Synthesis: Voice Rendering
→ Speech Audio Out
An example sentence
Show me flights from Seattle to New York
would populate the application schema as
<itinerary>
<origin>
<city>Seattle</city>
<state></state>
</origin>
<destination>
<city>New York</city>
<state></state>
</destination>
<date></date>
</itinerary>
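The slot filling shown above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the regular expression, the function names `fill_itinerary` and `to_xml`, and the restriction to the single sentence shape "flights from X to Y" are all hypothetical and not part of the course material. A real SLU component would use a grammar or a statistical model.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern covering only "... flights from <origin> to <destination>".
FLIGHT_PATTERN = re.compile(
    r"flights?\s+from\s+(?P<origin>.+?)\s+to\s+(?P<destination>.+?)\s*[.?!]?$",
    re.IGNORECASE,
)

def fill_itinerary(utterance):
    """Return an <itinerary> element filled from the utterance, or None."""
    match = FLIGHT_PATTERN.search(utterance.strip())
    if match is None:
        return None
    itinerary = ET.Element("itinerary")
    for slot in ("origin", "destination"):
        node = ET.SubElement(itinerary, slot)
        ET.SubElement(node, "city").text = match.group(slot)
        ET.SubElement(node, "state").text = ""  # not mentioned, left empty
    ET.SubElement(itinerary, "date").text = ""  # not mentioned, left empty
    return itinerary

def to_xml(element):
    """Serialize with explicit closing tags so empty slots stay visible."""
    return ET.tostring(element, encoding="unicode", short_empty_elements=False)

if __name__ == "__main__":
    print(to_xml(fill_itinerary("Show me flights from Seattle to New York")))
```

Slots that the user did not mention (state, date) stay empty, which is what lets a dialog manager decide what to ask next.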
Who manages the Dialog?
Directed Dialog
  ◦ "Who would you like to contact?"
  ◦ Finite State Machine
  ◦ Simple CFG
  ◦ MSConnect
User Initiative Dialog
  ◦ "What can I do for you?"
  ◦ N-grams
  ◦ Windows Airlines
[Figure: dialog initiative, with example airline menu topics: Reservations, Flight Status, Baggage Claim, Special Announcements.]
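A directed dialog driven by a finite state machine can be sketched as a table of states, prompts, and allowed transitions. The states, prompts, and function names below are hypothetical, loosely modelled on the airline menu topics; they are not taken from MSConnect or any real product.

```python
# Hypothetical dialog table: each state has a prompt and the inputs it accepts.
# The key "*" means "any input is accepted".
DIALOG = {
    "start": {
        "prompt": "Reservations, flight status, or baggage claim?",
        "next": {
            "reservations": "ask_destination",
            "flight status": "ask_flight_number",
            "baggage claim": "done",
        },
    },
    "ask_destination": {"prompt": "Where would you like to fly?", "next": {"*": "done"}},
    "ask_flight_number": {"prompt": "Which flight number?", "next": {"*": "done"}},
    "done": {"prompt": "Thank you, goodbye.", "next": {}},
}

def step(state, user_words):
    """Return the next state; stay in place if the input is not expected."""
    transitions = DIALOG[state]["next"]
    key = user_words.strip().lower()
    if key in transitions:
        return transitions[key]
    return transitions.get("*", state)

def run(inputs, state="start"):
    """Feed recognized utterances through the machine; return visited states."""
    visited = [state]
    for words in inputs:
        state = step(state, words)
        visited.append(state)
    return visited

if __name__ == "__main__":
    for name in run(["Flight status", "BA 123"]):
        print(name, "->", DIALOG[name]["prompt"])
```

Because the system asks the questions, each state only has to recognize a small set of answers, which is why a simple grammar is enough for this style of dialog.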
Multimodal System Technology
[Diagram: the voice-enabled system loop (ASR → SLU → DM → SLG → TTS, sharing Data and Rules) extended with Visual, Pen, and Gesture modalities.]
MiPad
• Multimodal Interactive Pad
• MiPad
  ◦ Tap and Talk combines speech and pen
  ◦ Use context to simplify recognition
  ◦ Dictation allows complex command entry
• Usability studies show double throughput for English
• Speech is mostly useful in cases with lots of alternatives
Multimodality Benefits
• Compared to speech-only:
  ◦ User sees system response more quickly
  ◦ User sees what system understood
  ◦ User can know what system expects
• Compared to GUI-only:
  ◦ Faster entry
  ◦ Better use of small screen
Advanced Topics
From JHU 2002 SuperSID Final Presentation - Reynolds et al.
Introducing today!
Language Identification: Why?
• Multi-lingual society
  ◦ Applications should be able to deal with anyone
• Businesses
  ◦ Automated help systems
  ◦ Reservations, account access, etc.
• Travel
  ◦ Airport kiosks
  ◦ Train stations
• Government
  ◦ Funds research to identify languages
  ◦ Runs language identification evaluations
Flavors of Speaker Recognition
From JHU 2002 SuperSID Final Presentation - Reynolds et al.
Our Focus!
Speaker Recognition: Why?
• Personal applications
  ◦ Voice-print passwords
  ◦ Voicemail transcription: who left that message?
• Business applications
  ◦ Calling your bank
• Government
  ◦ Is that Osama calling from Pakistan?
  ◦ Prison call monitoring
  ◦ Automated parolee calling: is he where you think?
Projects
• Robust speech interfaces for mobile workers (EU-FP6 SNOW, EADS, Siemens-SBS, SAP, FHG-FIRST, ACV, Loquendo, TUGraz, DUT)
• Intelligent and secure information retrieval for multimedia databases (FIT-IT MISTRAL, KNOW, Hyperwave Graz, SailLabs Vienna, TUGraz, DUT)
Audio Unit Results
The application works with multichannel audio signals
recorded by a linear microphone array. It first estimates the
background noise, suppresses it as much as possible,
and then extracts the following speech-related features:
• Voice activity detection (VAD) regions
• Speaker's position inside each detected VAD region
• Speaker's gender (male/female) and mean pitch value (e.g. 210 Hz) for each VAD region
• Speaker indexing for each VAD region
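The slides do not say how the VAD regions are found. One common and simple approach is frame-energy thresholding against an estimated noise floor, sketched below for a single channel. The function names, the frame length, the 10 dB margin, and the assumption that the recording starts with noise only are all illustrative choices, not the project's actual method.

```python
import math

def frame_energies_db(samples, frame_len):
    """Mean power of each non-overlapping frame, in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # avoid log(0)
    return energies

def detect_vad_regions(samples, frame_len=160, margin_db=10.0, noise_frames=5):
    """Return (start_sample, end_sample) pairs of frames above the noise floor."""
    energies = frame_energies_db(samples, frame_len)
    if not energies:
        return []
    # Assumption: the first few frames contain background noise only.
    lead = energies[:noise_frames]
    noise_floor = sum(lead) / len(lead)
    regions, start = [], None
    for i, energy in enumerate(energies):
        active = energy > noise_floor + margin_db
        if active and start is None:
            start = i                      # a region opens
        elif not active and start is not None:
            regions.append((start * frame_len, i * frame_len))
            start = None                   # the region closes
    if start is not None:                  # signal ended while still active
        regions.append((start * frame_len, len(energies) * frame_len))
    return regions
```

At an 8 kHz sampling rate the default frame length of 160 samples corresponds to 20 ms. Each detected region could then be passed on to pitch estimation or speaker indexing.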
• Air traffic safety enhancement by speaker verification in analog voice communication (Eurocontrol Brétigny, TUGraz, DUT)
• Wavelet-based acoustic-phonetic classification and noise reduction (Univ. Maribor, Loquendo, Mistral, TUGraz, DUT)
• COAST-ROBUST Dictation System Improvement (Philips PSRS, SailLabs Vienna, TUGraz, DUT)
Demo
• Applications in Multimedia Processing
  ◦ Smart Information Retrieval
    ▪ Mistral Project at TUGraz
    ▪ Real-time Detection, Tracking, Recognition: voice & image: ICG tracking
  ◦ Automatic Speech Recognition (ASR)
    ▪ Nuance: Dragon NaturallySpeaking
    ▪ Microsoft: Windows Vista; IBM: Embedded ViaVoice
  ◦ ASR-based Voice Search
• Applications in Communication
  ◦ Modulation: PCM and its variants
  ◦ Speech Enhancement / Coding: LPC and spectral approaches
  ◦ CDMA: Spread Spectrum
• Applications in Adaptive Filtering
  ◦ LMS: System Modeling
  ◦ Adaptive Channel Equalization
  ◦ Acoustic Echo Cancellation
Digital Sounds
Recording
Recording: Mic. Frequency Response
Sampling
Sampling: Nyquist - Aliasing
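As a small illustration of aliasing, the sketch below computes the apparent frequency of a sinusoid after sampling and generates the samples so the effect can be checked numerically. The function names are hypothetical and the code is not taken from the course demos.

```python
import math

def aliased_frequency(f_signal_hz, f_sample_hz):
    """Apparent frequency of a real sinusoid after sampling at f_sample_hz."""
    f = f_signal_hz % f_sample_hz       # fold into [0, fs)
    return min(f, f_sample_hz - f)      # reflect into [0, fs/2]

def sample_cosine(f_hz, f_sample_hz, count):
    """First `count` samples of a unit cosine at f_hz, sampled at f_sample_hz."""
    return [math.cos(2.0 * math.pi * f_hz * n / f_sample_hz) for n in range(count)]

if __name__ == "__main__":
    fs = 8000
    for f in (1000, 3000, 5000, 9000):
        print(f, "Hz is seen as", aliased_frequency(f, fs), "Hz")
```

With an 8 kHz sampling rate the Nyquist frequency is 4 kHz, so a 5 kHz cosine produces exactly the same samples as a 3 kHz cosine and the two cannot be told apart after sampling.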
Quantization
Recording: SNR
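To connect quantization and SNR, the sketch below quantizes a near full-scale sine with a uniform quantizer and measures the resulting signal-to-noise ratio. It is an illustration only: the function names, the test frequency, and the choice of a mid-tread quantizer are assumptions, and the slides may have had a different SNR measurement in mind. The measured value can be compared with the usual rule of thumb of roughly 6.02·N + 1.76 dB for an N-bit quantizer driven by a full-scale sine.

```python
import math

def quantize(x, bits):
    """Uniform mid-tread quantizer for x in [-1, 1)."""
    step = 2.0 / (2 ** bits)
    q = round(x / step) * step
    return max(-1.0, min(1.0 - step, q))    # clip to the representable range

def snr_db(clean, degraded):
    """SNR in dB, treating (clean - degraded) as the noise."""
    p_signal = sum(s * s for s in clean)
    p_noise = sum((s - d) ** 2 for s, d in zip(clean, degraded))
    return 10.0 * math.log10(p_signal / p_noise)

def sine_quantization_snr_db(bits, count=10000):
    """Measured SNR of a sine just below full scale after quantization."""
    amplitude = 1.0 - 2.0 / (2 ** bits)     # largest level that does not clip
    cycles_per_sample = math.sqrt(2.0) / 100.0  # avoids a short repeating pattern
    clean = [amplitude * math.sin(2.0 * math.pi * cycles_per_sample * n)
             for n in range(count)]
    degraded = [quantize(s, bits) for s in clean]
    return snr_db(clean, degraded)

if __name__ == "__main__":
    for bits in (8, 12, 16):
        print(bits, "bits:", round(sine_quantization_snr_db(bits), 1), "dB measured,",
              round(6.02 * bits + 1.76, 1), "dB rule of thumb")
```

Each extra bit halves the quantization step, which is where the gain of about 6 dB per bit comes from.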
Sound File Formats