The Diagram Outlines The Key Steps Involved in Co
Recorded Speech: This represents the raw audio signal captured by a microphone when
someone speaks. It's a continuous analog waveform that encodes the sound variations over
time.
Signal Analysis: The analog speech signal is converted into a digital format suitable for
processing by a computer. This typically involves Analog-to-Digital Conversion (ADC), where
the signal is sampled at a specific rate and its amplitude values are discretized into digital
bits.
Acoustic Model: This component analyzes the characteristics of the digitized speech signal.
It extracts features like Mel-Frequency Cepstral Coefficients (MFCCs) that represent the
speech's spectral information relevant for ASR.
Search Space: This represents the possible sequences of words or sounds that the ASR
system considers as potential matches for the speech input. It can be vast, encompassing all
possible words and word combinations in the target language.
Training Data: To function accurately, the ASR system needs to be trained on a large corpus
of speech data with corresponding text transcripts. This data allows the system to learn the
relationships between acoustic features and words or phonemes (basic units of sound).
Language Model: This component incorporates knowledge about language structure and
grammar. It helps the ASR system choose the most likely word sequence based on the
extracted features and the context of the surrounding words.
Decoded Text (Transcription): After processing the speech signal and considering both the
acoustic features and language rules, the ASR system outputs the recognized text, which is
the transcription of the spoken input.
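The acoustic model and language model described above combine in the standard decoding formulation of ASR: given acoustic features X, the system searches for the word sequence W that maximizes the product of the acoustic likelihood and the language-model prior. A sketch of that formula (standard notation, not taken verbatim from the diagram):

```latex
\hat{W} = \arg\max_{W} \; \underbrace{P(X \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```

The "search space" component is exactly the set of candidate sequences W over which this argmax is evaluated.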
Component                     Description
Recorded Speech               Analog audio signal from the microphone
Decoded Text (Transcription)  Recognized text output by the ASR system
1. Distinguishing Phones:
Pre-emphasis
But we cannot simply chop the signal off at the edge of a frame. A sudden drop in
amplitude creates noise that shows up in the high frequencies. To slice the audio,
the amplitude should instead taper off gradually near the edges of each frame.
Let w be the window applied to the original audio clip in the time domain.
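The framing and windowing step can be sketched as follows. The frame length, hop size, and choice of a Hamming window are assumptions here (typical ASR defaults), not values stated in the text:

```python
import numpy as np

# Assumed parameters: 25 ms frames, 10 ms hop, 16 kHz sampling -- common
# ASR defaults, not specified in the text above.
sample_rate = 16000
frame_len = int(0.025 * sample_rate)   # 400 samples per frame
hop_len = int(0.010 * sample_rate)     # 160 samples between frame starts

signal = np.random.randn(sample_rate)  # stand-in for one second of audio

# The window w tapers each frame toward the edges, so the amplitude
# drops off gradually instead of being chopped abruptly.
w = np.hamming(frame_len)

n_frames = 1 + (len(signal) - frame_len) // hop_len
frames = np.stack([signal[i * hop_len : i * hop_len + frame_len] * w
                   for i in range(n_frames)])
print(frames.shape)  # one windowed frame per row
```

Each row of `frames` is then passed to the DFT in the next step.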
First, we square the magnitude of the DFT output. This reflects the power
of the speech at each frequency (|X[k]|²), and we call it the DFT
power spectrum. We then apply triangular Mel-scale filter
banks to transform it into a Mel-scale power spectrum. The output
for each Mel-scale slot represents the total energy of the
frequency bands that it covers. This mapping is called Mel binning.
The energy for slot m is (reconstructed here in standard notation,
with Hₘ[k] the m-th triangular filter):

s(m) = Σₖ |X[k]|² · Hₘ[k]
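A minimal sketch of Mel binning, assuming a 512-point DFT, 26 filters, and a 16 kHz sample rate (typical choices, not values given in the text):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    """Triangular Mel-scale filters H_m[k] over the DFT bins."""
    # Filter centers are equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):            # rising edge of triangle
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):           # falling edge of triangle
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

# Mel binning: power spectrum |X[k]|^2 -> one energy per Mel slot
frame = np.random.randn(400) * np.hamming(400)
power = np.abs(np.fft.rfft(frame, n=512)) ** 2   # DFT power spectrum
mel_energies = mel_filterbank() @ power          # s(m) for each slot m
```

Each entry of `mel_energies` is the s(m) of the equation above: the filter-weighted sum of DFT power over the bands the m-th triangle covers.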
Log
After Mel binning, we take the logarithm of each Mel-scale energy. This compresses
the dynamic range, mirroring how human hearing perceives loudness.
Cepstrum — IDFT
Finally, an inverse DFT (implemented in practice as a discrete cosine transform)
converts the log Mel spectrum into the cepstrum. The first few cepstral
coefficients are the MFCC features passed to the acoustic model.
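The log and cepstrum steps can be sketched as follows. The 26 Mel energies, the DCT-II in place of a full IDFT, and keeping 13 coefficients are assumptions (common MFCC defaults, not stated in the text):

```python
import numpy as np

# Stand-in Mel filter-bank output (26 positive energies).
mel_energies = np.random.rand(26) + 1e-3

# Log step: compress the dynamic range of the Mel energies.
log_mel = np.log(mel_energies)

# Cepstrum step: project the log Mel spectrum onto DCT-II basis
# vectors (the usual stand-in for the IDFT on a real, even spectrum).
M = len(log_mel)
n_ceps = 13                                  # keep the first 13 coefficients
n = np.arange(n_ceps)[:, None]
m = np.arange(M)[None, :]
mfcc = np.cos(np.pi * n * (m + 0.5) / M) @ log_mel
print(mfcc.shape)
```

The resulting `mfcc` vector is the per-frame feature described in the Acoustic Model component above.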