
CCS369 TEXT AND SPEECH ANALYSIS

UNIT 5
By
C. Jerin Mahibha
Assoc. Prof / CSE
UNIT V AUTOMATIC SPEECH RECOGNITION

Speech recognition: Acoustic modelling - Feature Extraction - HMM, HMM-DNN systems

COURSE OBJECTIVES:
Develop a speech recognition system
COURSE OUTCOME:
CO5: Apply deep learning models for building speech recognition systems
Text Book: "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition" by Daniel Jurafsky and James H. Martin - Chapter 26

Reference :
https://maelfabien.github.io/machinelearning/speech_reco/#acoustic-model
Speech Recognition - Automatic Speech Recognition (ASR)
• Speech - natural interface for communication
• Task - map any waveform to the appropriate string of words
• Application
• Smart home appliances
• Personal assistants
• Cellphones
• Telephony - call-routing
• Sophisticated dialogue applications
• Transcription - automatically generating captions for audio or video
• Field of law - dictation plays an important role
• Augmentative communication - difficulties or inabilities in typing

➢The blind Milton dictated Paradise Lost to his daughters
➢Henry James dictated his later novels after a repetitive stress injury
Factors to be considered :
1.Vocabulary size
• High accuracy
➢2-word vocabulary - yes / no
➢11-word vocabulary - digit recognition (0-9)
• Much harder
➢Large vocabularies of up to 60,000 words
➢ Open-ended tasks like transcribing videos or human conversations
2. Who the speaker is talking to
• Easy to recognize
➢Humans speaking to machines
➢Read speech – humans reading out loud - audio book
➢Talking more slowly and more clearly
• Difficult
➢Conversational speech - humans speaking to humans - transcribing a business meeting
3. Channel and noise
• Easy to recognize - if recorded
➢ in a quiet room
➢ head-mounted microphones
• Difficult - if recorded
➢distant microphone
➢noisy city street
➢car with the window open
4. Accent or speaker-class characteristics
• Easy to recognize
➢same dialect or variety that the system was trained on
• Difficult
➢Regional or ethnic dialects
➢speech by children
ASR Corpora
Acoustic modelling
Ref : https://maelfabien.github.io/machinelearning/speech_reco/#acoustic-model

• Model the relationship between the audio signal and the phonetic units in
the language
• Isolated word/pattern recognition - acoustic features (Y) - used as an input
to a classifier - output is the correct word (see the sketch after this list)
• Continuous speech recognition – Input / Output is a sequence
• acoustic model goes further than a simple classifier
• Output - sequence of phonemes
• Hidden Markov Models
➢are natural candidates for Acoustic Models
➢they are great at modeling sequences
➢have states s_i, and at each state, observations o_i are generated
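A minimal sketch of the isolated-word setting above, assuming averaged MFCC features fed to a nearest-neighbour classifier; librosa, scikit-learn, and the file names are illustrative assumptions, not part of the slides:

```python
import numpy as np
import librosa                      # assumed feature-extraction library
from sklearn.neighbors import KNeighborsClassifier

def word_features(path):
    """Average MFCC vectors over time -> one fixed-size vector per utterance."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, frames)
    return mfcc.mean(axis=1)                             # (13,)

# Hypothetical training files, one label (the correct word) per utterance
paths  = ["yes_01.wav", "yes_02.wav", "no_01.wav", "no_02.wav"]
labels = ["yes", "yes", "no", "no"]

X = np.stack([word_features(p) for p in paths])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(clf.predict([word_features("test.wav")]))          # -> predicted word
```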
Feature Extraction

• Transform the input waveform into a sequence of acoustic feature vectors - each vector representing the information in a small time window of the signal - here, sequences of log mel spectrum vectors
1. Convert the analog representation into a digital signal - Sampling and quantization
2. Extract spectral features - Windowing, Discrete Fourier Transform
Sampling and Quantization
1.Sampling
• Signal is sampled by measuring its amplitude at a particular time
• Sampling rate - number of samples taken per second
• To accurately measure a wave - need to have at least two samples in each cycle
➢ one measuring the positive part
➢ one measuring the negative part
• More than two samples per cycle - increases the amplitude accuracy
• Less than two samples - cause the frequency of the wave to be completely missed
• Maximum frequency wave that can be measured is one whose frequency is half the
sample rate (since every cycle needs two samples)
• Maximum frequency for a given sampling rate - called the Nyquist frequency
➢ Human speech - frequencies below 10,000 Hz - needs a 20,000 Hz sampling rate
➢ Telephone speech - bandlimited below 4,000 Hz - an 8,000 Hz sampling rate suffices
➢ Microphone speech - commonly sampled at 16,000 Hz
• Higher sampling rates produce higher ASR accuracy
• Cannot combine different sampling rates for training and testing ASR systems
➢ If testing on a telephone corpus like Switchboard - downsample the training corpus to 8 kHz
2.Quantization
• Process of representing real-valued numbers as integers
• Amplitude measurements are stored as integers
• either 8 bit (values from -128 to 127) or
• 16 bit (values from -32768 to 32767)
• All values that are closer together than the minimum granularity (the
quantum size) are represented identically
• Each sample at time index n in the digitized, quantized waveform is referred to as x[n]
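A minimal numpy sketch of both steps, assuming a synthetic sine wave stands in for the analog signal:

```python
import numpy as np

sr = 16000                       # sampling rate in Hz (16 kHz microphone speech)
t = np.arange(0, 0.02, 1 / sr)   # 20 ms of sample times
analog = 0.8 * np.sin(2 * np.pi * 440 * t)   # 440 Hz tone; 440 < sr/2 (Nyquist), so it is captured

# 16-bit quantization: map real values in [-1, 1] to integers in [-32768, 32767]
x = np.round(analog * 32767).astype(np.int16)
print(x[:5])                     # x[n]: the digitized, quantized waveform
```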
Windowing
• Small window of speech - characterizes part of a particular phoneme
• Extract spectral features from the window
• Inside the small window - the signal is considered stationary - its statistical properties are constant
• Extract the rough stationary portion of speech by using a window which is non-
zero inside a region and zero elsewhere
• Run the window across the speech signal
• Multiply it by the input waveform to produce a windowed waveform
• Speech extracted from each window is called a frame
• Windowing is characterized by three parameters:
➢ Window size or frame size - the width of the window in milliseconds
➢ Frame stride or shift or offset between successive windows
➢ Shape of the window
• To extract the signal - multiply the value of the signal at time n, s[n], by the value of the window at time n, w[n]:

  y[n] = w[n] s[n]
• Rectangular window shape
➢ extracted windowed signal looks just like the original signal
➢ abruptly cuts off the signal at its boundaries - creates problems during Fourier analysis
• To avoid discontinuities - shrink the values of the signal toward zero at the window boundaries - Hamming window
Equations - assuming a window that is L frames long:

  Rectangular:  w[n] = 1,                          0 ≤ n ≤ L-1  (0 otherwise)
  Hamming:      w[n] = 0.54 - 0.46 cos(2πn / L),   0 ≤ n ≤ L-1  (0 otherwise)
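A sketch of framing with a Hamming window; the 25 ms window / 10 ms stride values are common defaults assumed here, not prescribed by the slides:

```python
import numpy as np

def frames(x, sr, size_ms=25, stride_ms=10):
    """Slice waveform x into overlapping windowed frames y[n] = w[n] * s[n]."""
    L = int(sr * size_ms / 1000)           # window size in samples
    step = int(sr * stride_ms / 1000)      # frame stride in samples
    w = np.hamming(L)                      # numpy uses 0.54 - 0.46*cos(2*pi*n/(L-1))
    starts = range(0, len(x) - L + 1, step)
    return np.stack([w * x[s:s + L] for s in starts])

x = np.random.randn(16000)                 # stand-in for 1 s of speech at 16 kHz
print(frames(x, 16000).shape)              # (number of frames, L)
```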
Discrete Fourier Transform
• Extract spectral information for the windowed signal
• Know the amount of energy the signal contains at different frequency bands.
• Discrete Fourier transform or DFT - tool for extracting spectral information for
discrete frequency bands for a discrete-time (sampled) signal
➢Input to the DFT - windowed signal x[n], ..., x[m]
➢Output, for each of N discrete frequency bands - a complex number X[k] representing the magnitude and phase of that frequency component in the original signal
• Fourier analysis relies on Euler's formula, with j as the imaginary unit:

  e^{jθ} = cos θ + j sin θ

• DFT is defined as follows:

  X[k] = Σ_{n=0}^{N-1} x[n] e^{-j 2πkn/N}
• A commonly used algorithm for computing the DFT is the Fast Fourier
transform or FFT
• This implementation of the DFT is very efficient but only works for values
of N that are powers of 2
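A sketch of this step using numpy's FFT on one Hamming-windowed frame; N = 512 is an assumed, power-of-2 FFT size:

```python
import numpy as np

sr, N = 16000, 512                     # N is a power of 2, as the FFT requires
frame = np.hamming(400) * np.random.randn(400)   # one 25 ms windowed frame

X = np.fft.rfft(frame, n=N)            # complex X[k]: magnitude and phase per band
magnitude = np.abs(X)                  # spectrum over N/2 + 1 frequency bands
freqs = np.fft.rfftfreq(N, d=1 / sr)   # center frequency of each band, up to sr/2
print(magnitude.shape, freqs[-1])      # (257,) 8000.0 (the Nyquist frequency)
```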
Mel Filter Bank and Log
• Results of FFT – represent the energy at each frequency band
• Human hearing - not equally sensitive at all frequency bands- less sensitive at
higher frequencies
• Modeling - human perceptual property - improves speech recognition
performance
• Implemented - by collecting energies - not equally at each frequency band,
but according to the mel scale, an auditory frequency scale
• A mel - is a unit of pitch
• Pairs of sounds that are perceptually equidistant in pitch are separated by an
equal number of mels
• The mel frequency m - computed from the raw acoustic frequency f by a log transformation:

  mel(f) = 1127 ln(1 + f / 700)

• Implemented - by creating a bank of filters that collect energy from each frequency band, spread logarithmically - very fine resolution at low frequencies, less resolution at high frequencies
• Multiply the filter bank by the spectrum - mel spectrum
• Take the log of each of the mel spectrum values
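A sketch that applies the mel mapping above to build a small triangular filter bank and take the log mel spectrum; the 26-filter count and FFT size are illustrative assumptions:

```python
import numpy as np

def hz_to_mel(f):  return 1127 * np.log(1 + f / 700)
def mel_to_hz(m):  return 700 * (np.exp(m / 1127) - 1)

def log_mel(power_spectrum, sr=16000, n_fft=512, n_filters=26):
    # Filter center frequencies: equally spaced in mels, so logarithmic in Hz
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):              # triangular filter between neighbors
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(fbank @ power_spectrum + 1e-10)   # log of each mel energy

spec = np.abs(np.fft.rfft(np.random.randn(400), n=512)) ** 2
print(log_mel(spec).shape)                  # (26,) log mel spectrum vector
```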
HMM systems
• 1 phoneme
➢represented by a 3- or 5-state linear HMM
➢generally the beginning, middle and end of the phoneme
• Topology of HMMs is
➢ flexible by nature
➢ each phoneme - represented by a single state, or 3 states
The HMM
• supposes observation independence
• can also output context-dependent phonemes - triphones
➢Triphones are simply a group of 3 phonemes, the left one being the left context, and the right one, the right context
• trained using the Baum-Welch algorithm
• learns to give the probability of each end of phoneme at time t
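A minimal sketch of a 3-state left-to-right phone HMM with the forward algorithm; the transition and emission numbers are made up for illustration (real systems learn them with Baum-Welch):

```python
import numpy as np

# 3 states: beginning, middle, end of one phoneme (left-to-right topology)
A = np.array([[0.6, 0.4, 0.0],     # transitions: self-loop or advance only
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])     # always start in the "beginning" state

# B[s, o]: probability of observing acoustic symbol o in state s (toy values)
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.1, 0.2, 0.7]])

obs = [0, 0, 1, 2, 2]              # a toy sequence of discretized acoustic frames
alpha = pi * B[:, obs[0]]          # forward probabilities at t = 0
for o in obs[1:]:                  # observation independence: emit given state only
    alpha = (alpha @ A) * B[:, o]
print(alpha[2])                    # probability mass in the "end" state at time t
```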
HMM-DNN systems
• do not care about the acoustic model P(X∣W)
• directly tackle P(W∣X) - as the probability of observing state sequences given X
• Target / aim of the DNN - model the posterior probabilities over HMM states, P(s∣x_t) for each HMM state s at frame t
Considerations on the HMM-DNN framework:
• large number of hidden layers
• the input features - extracted from large windows - to have a large context
• early stopping can be used
• uses Bayes rule to convert posteriors to scaled likelihoods: p(x∣s) = P(s∣x) P(x) / P(s)
• Probability of the acoustic feature, P(x), is not known - it scales all the likelihoods by the same factor - does not modify the alignment
• Training of HMM-DNN architectures - based on the hybrid HMM-DNN framework, using EM (Expectation Maximization)
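A sketch of the hybrid step described above: a small (here untrained) network outputs posteriors P(s∣x) over HMM states, and Bayes rule converts them to scaled likelihoods for HMM decoding. Plain numpy stands in for a real DNN framework, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_states = 39, 64, 48        # e.g. feature window -> HMM states

# An (untrained) one-hidden-layer network standing in for the deep acoustic model
W1, b1 = rng.normal(size=(n_in, n_hid)), np.zeros(n_hid)
W2, b2 = rng.normal(size=(n_hid, n_states)), np.zeros(n_states)

def posteriors(x):
    h = np.maximum(0, x @ W1 + b1)        # ReLU hidden layer
    z = h @ W2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()                    # softmax: P(state | x)

x = rng.normal(size=n_in)                 # one frame of features with context
prior = np.full(n_states, 1 / n_states)   # P(s); counted from alignments in practice
scaled_likelihood = posteriors(x) / prior # ∝ p(x|s); P(x) drops out of the alignment
print(scaled_likelihood.argmax())         # best-scoring state for this frame
```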
THANK YOU
