
CCS369 TEXT AND SPEECH ANALYSIS

UNIT 5
By
C. Jerin Mahibha
Assoc. Prof / CSE
UNIT V AUTOMATIC SPEECH RECOGNITION

Speech recognition: Acoustic modelling - Feature Extraction - HMM, HMM-DNN systems

COURSE OBJECTIVES:
Develop a speech recognition system
COURSE OUTCOME:
CO5: Apply deep learning models for building speech recognition systems
Text Book: "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition" by Daniel Jurafsky and James H. Martin - Chapter 26

Reference :
https://maelfabien.github.io/machinelearning/speech_reco/#acoustic-model
Speech Recognition - Automatic Speech Recognition (ASR)
• Speech - natural interface for communication
• Task - map any waveform to the appropriate string of words
• Application
• Smart home appliances
• Personal assistants
• Cellphones
• Telephony - call-routing
• Sophisticated dialogue applications
• Transcription - automatically generating captions for audio or video
• Field of law - dictation plays an important role
• Augmentative communication - difficulties or inabilities in typing

➢The blind Milton dictated Paradise Lost to his daughters
➢Henry James dictated his later novels after a repetitive stress injury
Factors to be considered :
1.Vocabulary size
• High accuracy
➢2-word vocabulary - yes / no
➢11-word vocabulary - digit recognition (0-9)
• Much harder
➢Large vocabularies of up to 60,000 words
➢ Open-ended tasks like transcribing videos or human conversations
2. Who the speaker is talking to
• Easy to recognize
➢Humans speaking to machines
➢Read speech – humans reading out loud - audio book
➢Talking more slowly and more clearly
• Difficult
➢Conversational speech - humans speaking to humans - transcribing a business meeting
3. Channel and noise
• Easy to recognize - if recorded
➢ in a quiet room
➢ head-mounted microphones
• Difficult - if recorded
➢distant microphone
➢noisy city street
➢car with the window open
4. Accent or speaker-class characteristics
• Easy to recognize
➢same dialect or variety that the system was trained on
• Difficult
➢Regional or ethnic dialects
➢speech by children
ASR Corpora
Acoustic modelling
Ref : https://maelfabien.github.io/machinelearning/speech_reco/#acoustic-model

• Model the relationship between the audio signal and the phonetic units in
the language
• Isolated word/pattern recognition - acoustic features (Y) - used as an input
to a classifier - output is the correct word (see the sketch after this list)
• Continuous speech recognition – Input / Output is a sequence
• acoustic model goes further than a simple classifier
• Output - sequence of phonemes
• Hidden Markov Models
➢are natural candidates for Acoustic Models
➢they are great at modeling sequences
➢have states s_i, and at each state, observations o_i are generated
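A minimal sketch of the isolated-word setting above, assuming averaged MFCC features fed to a nearest-neighbour classifier; librosa, scikit-learn, and the file names are illustrative assumptions, not part of the slides:

```python
import numpy as np
import librosa                      # assumed feature-extraction library
from sklearn.neighbors import KNeighborsClassifier

def word_features(path):
    """Average MFCC vectors over time -> one fixed-size vector per utterance."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, frames)
    return mfcc.mean(axis=1)                             # (13,)

# Hypothetical training files, one label (the correct word) per utterance
paths  = ["yes_01.wav", "yes_02.wav", "no_01.wav", "no_02.wav"]
labels = ["yes", "yes", "no", "no"]

X = np.stack([word_features(p) for p in paths])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(clf.predict([word_features("test.wav")]))          # -> predicted word
```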
Feature Extraction

• Transform the input waveform into a sequence of acoustic feature vectors - each vector representing the information in a small time window of the signal - here, sequences of log mel spectrum vectors
1. Convert the analog representation into a digital signal - Sampling and quantization
2. Extract spectral features - Windowing, Discrete Fourier Transform
Sampling and Quantization
1.Sampling
• Signal is sampled by measuring its amplitude at a particular time
• Sampling rate - number of samples taken per second
• To accurately measure a wave - need to have at least two samples in each cycle
➢ one measuring the positive part
➢ one measuring the negative part
• More than two samples per cycle - increases the amplitude accuracy
• Less than two samples - cause the frequency of the wave to be completely missed
• Maximum frequency wave that can be measured is one whose frequency is half the
sample rate (since every cycle needs two samples)
• Maximum frequency for a given sampling rate - called the Nyquist frequency
➢ Human speech - frequencies below 10,000 Hz - needs a 20,000 Hz sampling rate
➢ Telephone speech - bandlimited below 4,000 Hz - an 8,000 Hz sampling rate suffices
➢ Microphone speech - commonly sampled at 16,000 Hz
• Higher sampling rates produce higher ASR accuracy
• Cannot combine different sampling rates for training and testing ASR systems
➢ If testing on a telephone corpus like Switchboard - downsample the training corpus to 8 kHz
2.Quantization
• Process of representing real-valued numbers as integers
• Amplitude measurements are stored as integers
• either 8 bit (values from -128 to 127) or
• 16 bit (values from -32768 to 32767)
• All values that are closer together than the minimum granularity (the
quantum size) are represented identically
• Each sample at time index n in the digitized, quantized waveform is referred to as x[n]
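A minimal numpy sketch of both steps, assuming a synthetic sine wave stands in for the analog signal:

```python
import numpy as np

sr = 16000                       # sampling rate in Hz (16 kHz microphone speech)
t = np.arange(0, 0.02, 1 / sr)   # 20 ms of sample times
analog = 0.8 * np.sin(2 * np.pi * 440 * t)   # 440 Hz tone; 440 < sr/2 (Nyquist), so it is captured

# 16-bit quantization: map real values in [-1, 1] to integers in [-32768, 32767]
x = np.round(analog * 32767).astype(np.int16)
print(x[:5])                     # x[n]: the digitized, quantized waveform
```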
Windowing
• Small window of speech - characterizes part of a particular phoneme
• Extract spectral features from the window
• Inside the small window - the signal is considered stationary - its statistical properties are constant
• Extract the rough stationary portion of speech by using a window which is non-
zero inside a region and zero elsewhere
• Run the window across the speech signal
• Multiply it by the input waveform to produce a windowed waveform
• Speech extracted from each window is called a frame
• Windowing is characterized by three parameters:
➢ Window size or frame size - the width of the window in milliseconds
➢ Frame stride or shift or offset between successive windows
➢ Shape of the window
• To extract the signal - multiply the value of the signal at time n, s[n], by the value of the window at time n, w[n]:

  y[n] = w[n] s[n]
• Rectangular window shape
➢ extracted windowed signal looks just like the original signal
➢ abruptly cuts off the signal at its boundaries - creates problems during Fourier analysis
• To avoid discontinuities - shrink the values of the signal toward zero at the window boundaries - Hamming window
Equations - assuming a window that is L frames long:

  Rectangular:  w[n] = 1,                          0 ≤ n ≤ L-1  (0 otherwise)
  Hamming:      w[n] = 0.54 - 0.46 cos(2πn / L),   0 ≤ n ≤ L-1  (0 otherwise)
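A sketch of framing with a Hamming window; the 25 ms window / 10 ms stride values are common defaults assumed here, not prescribed by the slides:

```python
import numpy as np

def frames(x, sr, size_ms=25, stride_ms=10):
    """Slice waveform x into overlapping windowed frames y[n] = w[n] * s[n]."""
    L = int(sr * size_ms / 1000)           # window size in samples
    step = int(sr * stride_ms / 1000)      # frame stride in samples
    w = np.hamming(L)                      # numpy uses 0.54 - 0.46*cos(2*pi*n/(L-1))
    starts = range(0, len(x) - L + 1, step)
    return np.stack([w * x[s:s + L] for s in starts])

x = np.random.randn(16000)                 # stand-in for 1 s of speech at 16 kHz
print(frames(x, 16000).shape)              # (number of frames, L)
```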
Discrete Fourier Transform
• Extract spectral information for the windowed signal
• Know the amount of energy the signal contains at different frequency bands.
• Discrete Fourier transform or DFT - tool for extracting spectral information for
discrete frequency bands for a discrete-time (sampled) signal
➢Input to the DFT - windowed signal x[n], ..., x[m]
➢Output, for each of N discrete frequency bands - a complex number X[k] representing the magnitude and phase of that frequency component in the original signal
• Fourier analysis relies on Euler's formula, with j as the imaginary unit:

  e^{jθ} = cos θ + j sin θ

• DFT is defined as follows:

  X[k] = Σ_{n=0}^{N-1} x[n] e^{-j 2πkn/N}
• A commonly used algorithm for computing the DFT is the Fast Fourier
transform or FFT
• This implementation of the DFT is very efficient but only works for values
of N that are powers of 2
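A sketch of this step using numpy's FFT on one Hamming-windowed frame; N = 512 is an assumed, power-of-2 FFT size:

```python
import numpy as np

sr, N = 16000, 512                     # N is a power of 2, as the FFT requires
frame = np.hamming(400) * np.random.randn(400)   # one 25 ms windowed frame

X = np.fft.rfft(frame, n=N)            # complex X[k]: magnitude and phase per band
magnitude = np.abs(X)                  # spectrum over N/2 + 1 frequency bands
freqs = np.fft.rfftfreq(N, d=1 / sr)   # center frequency of each band, up to sr/2
print(magnitude.shape, freqs[-1])      # (257,) 8000.0 (the Nyquist frequency)
```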
Mel Filter Bank and Log
• Results of FFT – represent the energy at each frequency band
• Human hearing - not equally sensitive at all frequency bands- less sensitive at
higher frequencies
• Modeling - human perceptual property - improves speech recognition
performance
• Implemented - by collecting energies - not equally at each frequency band,
but according to the mel scale, an auditory frequency scale
• A mel - is a unit of pitch
• Pairs of sounds that are perceptually equidistant in pitch are separated by an
equal number of mels
• The mel frequency m - computed from the raw acoustic frequency f by a log transformation:

  mel(f) = 1127 ln(1 + f / 700)

• Implemented - by creating a bank of filters that collect energy from each frequency band, spread logarithmically - very fine resolution at low frequencies, less resolution at high frequencies
• Multiply the filter bank by the spectrum - mel spectrum
• Take the log of each of the mel spectrum values
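A sketch that applies the mel mapping above to build a small triangular filter bank and take the log mel spectrum; the 26-filter count and FFT size are illustrative assumptions:

```python
import numpy as np

def hz_to_mel(f):  return 1127 * np.log(1 + f / 700)
def mel_to_hz(m):  return 700 * (np.exp(m / 1127) - 1)

def log_mel(power_spectrum, sr=16000, n_fft=512, n_filters=26):
    # Filter center frequencies: equally spaced in mels, so logarithmic in Hz
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):              # triangular filter between neighbors
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(fbank @ power_spectrum + 1e-10)   # log of each mel energy

spec = np.abs(np.fft.rfft(np.random.randn(400), n=512)) ** 2
print(log_mel(spec).shape)                  # (26,) log mel spectrum vector
```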
HMM systems
• 1 phoneme
➢represented by a 3- or 5-state linear HMM
➢generally the beginning, middle and end of the phoneme
• Topology of HMMs is
➢ flexible by nature
➢ each phoneme - represented by a single state, or 3 states
The HMM
• supposes observation independence
• can also output context-dependent phonemes - triphones
➢Triphones are simply a group of 3 phonemes, the left one being the left context, and the right one, the right context
• trained using the Baum-Welch algorithm
• learns to give the probability of each end of phoneme at time t
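A minimal sketch of a 3-state left-to-right phone HMM with the forward algorithm; the transition and emission numbers are made up for illustration (real systems learn them with Baum-Welch):

```python
import numpy as np

# 3 states: beginning, middle, end of one phoneme (left-to-right topology)
A = np.array([[0.6, 0.4, 0.0],     # transitions: self-loop or advance only
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])     # always start in the "beginning" state

# B[s, o]: probability of observing acoustic symbol o in state s (toy values)
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.1, 0.2, 0.7]])

obs = [0, 0, 1, 2, 2]              # a toy sequence of discretized acoustic frames
alpha = pi * B[:, obs[0]]          # forward probabilities at t = 0
for o in obs[1:]:                  # observation independence: emit given state only
    alpha = (alpha @ A) * B[:, o]
print(alpha[2])                    # probability mass in the "end" state at time t
```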
HMM-DNN systems
• do not care about the acoustic model P(X∣W)
• directly tackle P(W∣X) - as the probability of observing state sequences given X
• Target / aim of the DNN - model the posterior probabilities over HMM states, P(s∣x_t) for each HMM state s at frame t
Considerations on the HMM-DNN framework:
• large number of hidden layers
• the input features - extracted from large windows - to have a large context
• early stopping can be used
• uses Bayes rule to convert posteriors to scaled likelihoods: p(x∣s) = P(s∣x) P(x) / P(s)
• Probability of the acoustic feature, P(x), is not known - it scales all the likelihoods by the same factor - does not modify the alignment
• Training of HMM-DNN architectures - based on the hybrid HMM-DNN framework, using EM (Expectation Maximization)
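A sketch of the hybrid step described above: a small (here untrained) network outputs posteriors P(s∣x) over HMM states, and Bayes rule converts them to scaled likelihoods for HMM decoding. Plain numpy stands in for a real DNN framework, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_states = 39, 64, 48        # e.g. feature window -> HMM states

# An (untrained) one-hidden-layer network standing in for the deep acoustic model
W1, b1 = rng.normal(size=(n_in, n_hid)), np.zeros(n_hid)
W2, b2 = rng.normal(size=(n_hid, n_states)), np.zeros(n_states)

def posteriors(x):
    h = np.maximum(0, x @ W1 + b1)        # ReLU hidden layer
    z = h @ W2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()                    # softmax: P(state | x)

x = rng.normal(size=n_in)                 # one frame of features with context
prior = np.full(n_states, 1 / n_states)   # P(s); counted from alignments in practice
scaled_likelihood = posteriors(x) / prior # ∝ p(x|s); P(x) drops out of the alignment
print(scaled_likelihood.argmax())         # best-scoring state for this frame
```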
THANK YOU
