Speech Signal Processing
Speech signals
From prehistory to the new media of the future, speech has been, and will remain, a primary form of communication between humans. Nevertheless, there are often conditions under which we measure the speech and transform it into another form, the speech signal, in order to enhance our ability to communicate. The speech signal is extended through technological media such as telephony, movies, radio, television, and now the Internet. This trend reflects the primacy of speech communication in human psychology, and speech is expected to become a major trend in the personal computer market in the near future.
Speech signal processing is a diverse field that relies on knowledge of language at many levels:
- Language-independent levels: signal processing, acoustics, phonetics
- Language-dependent levels: phonology, morphology, syntax, semantics, pragmatics
Computer Science & Electronic Engineering
It is based on two facts: most speech energy lies between 20 Hz and about 7 kHz, and the human ear is most sensitive to energy between 50 Hz and 4 kHz. These features are considered in acoustic and perceptual terms. The path from speech to speech signal in terms of phonetics (speech production), and the digital model of the speech signal, will be discussed in Chapter 2.
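These frequency ranges can be checked numerically. The sketch below is a made-up illustration, not from the text: it estimates the fraction of a signal's spectral energy that falls inside a given band, using a hypothetical two-tone test signal in place of real speech.

```python
import numpy as np

def band_energy_fraction(x, fs, f_lo, f_hi):
    """Fraction of total spectral energy lying between f_lo and f_hi (Hz)."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    in_band = (freqs >= f_lo) & (freqs <= f_hi)
    return power[in_band].sum() / power.sum()

# Toy "speech-like" signal: a strong 200 Hz component plus a weak 6 kHz one.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 6000 * t)
print(band_energy_fraction(x, fs, 50, 4000))   # almost all energy below 4 kHz
```

With real speech the result is less clean, but the same computation shows why the 50 Hz to 4 kHz telephone band preserves most of the perceptually important energy.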
An example:
- Voiced sounds: created by air forced through the vibrating vocal cords (e.g., "ah", "v")
- Unvoiced sounds: created by air passing through a constriction in the mouth or at the lips, without vocal-cord vibration (e.g., "s", "f")
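The voiced/unvoiced distinction is often illustrated with the zero-crossing rate: voiced sounds are quasi-periodic and low-frequency, so they cross zero rarely, while unvoiced sounds are noise-like and cross zero often. A minimal sketch, in which the 0.2 threshold and the synthetic "frames" are illustrative assumptions rather than standard values:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return np.mean(signs[1:] != signs[:-1])

def is_voiced(frame, threshold=0.2):
    # Voiced speech crosses zero rarely; unvoiced (noise-like) speech
    # crosses zero roughly every other sample. Threshold is illustrative.
    return zero_crossing_rate(frame) < threshold

fs = 8000
t = np.arange(fs // 10) / fs                   # one 100 ms frame
voiced_like = np.sin(2 * np.pi * 150 * t)      # "ah"-like periodic signal
unvoiced_like = np.random.default_rng(0).standard_normal(len(t))  # "s"-like noise
print(is_voiced(voiced_like), is_voiced(unvoiced_like))   # True False
```

Real voiced/unvoiced classifiers combine the zero-crossing rate with short-time energy and other cues, but the basic contrast is already visible here.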
Historical Review of Speech Signal Processing
- 1876: invention of the telephone; waveform based.
- 1939: the vocoder (Homer Dudley), a parametric model of speech.
- 1947: the spectrograph (Bell Labs), a powerful tool for speech analysis.
- 1950s: the first speech printer; simple word recognition.
- 1960s: Acoustic Theory of Speech Production (G. Fant); digital signal processing techniques such as digital filters and the FFT were adopted in speech processing.
- 1970s: an ambitious speech understanding project funded by ARPA (Advanced Research Projects Agency) led to many seminal systems and technologies; LPC (Linear Predictive Coding) analysis and processing.
- 1980s: VQ (Vector Quantization) used in coding; HMM (Hidden Markov Model) based speech analysis,
processing, and recognition.
- From the 1990s: more practical and commercial applications, such as speech-to-speech translation, automatic speech recognition, and text-to-speech conversion. Techniques include model-based signal analysis (e.g., HMMs), multiresolution analysis (wavelets), and artificial neural networks (ANNs).

Modification
- The goal in speech modification is to alter the speech signal to have some desired property.
- Modifications of interest include time-scale, pitch, and spectral changes.
- Applications of time-scale modification include fitting radio and TV commercials into an allocated time slot and synchronizing audio and video presentations.
- Speeding up speech is useful in message playback, voice mail, and reading machines and books for the blind, while slowing down speech helps in learning a foreign language.

Coding
- In speech coding, the goal is to reduce the information rate, measured in bits per second, while maintaining the quality of the original speech waveform.
- By quality we mean speech attributes such as naturalness, intelligibility, and speaker recognizability.
- Broadly, there are three classes of speech coders:
  1) Waveform coders, which represent the speech waveform directly and do not rely on a speech production model. They operate in the high range of 16-64 kbps (kilobits per second).
  2) Vocoders, which are largely speech-model based and rely on a small set of model parameters. They operate in the low range of 1.2-4.8 kbps and tend to be of lower quality than waveform coders.
  3) Hybrid coders, which are partly waveform based and partly speech-model based. They operate in the 4.8-16 kbps range, with quality between that of waveform coders and vocoders.
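A classic concrete instance of a waveform coder is mu-law companding, used in 64 kbps digital telephony (ITU-T G.711). The sketch below illustrates only the companding idea, compressing the dynamic range before a crude 8-bit-style quantizer; it does not reproduce the exact G.711 bit format.

```python
import numpy as np

MU = 255.0  # mu-law parameter used in North American / Japanese telephony

def mu_law_compress(x):
    """Compand samples in [-1, 1]: loud samples are squeezed, quiet ones boosted."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

# Simulate a coarse quantizer acting in the companded domain.
x = np.linspace(-1.0, 1.0, 9)
quantized = np.round(mu_law_compress(x) * 127) / 127
restored = mu_law_expand(quantized)
print(np.max(np.abs(restored - x)))   # small reconstruction error
```

Because the quantizer steps are uniform in the companded domain, quiet samples (where speech spends most of its time) get finer effective resolution than loud ones, which is exactly the trade-off a waveform coder wants.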
Applications of speech coders include digital telephony over constrained bandwidth channels, such as cellular, satellite, and Internet communications. Other applications are video phones where bits are traded off between speech and image data, secure speech links for government and military communications, and voice storage as with computer voice mail where storage capacity is limited.
Enhancement
- In speech enhancement, the goal is to improve the quality of degraded speech.
- One approach is to preprocess the speech waveform before it is degraded; another is to postprocess it after degradation.
- Preprocessing techniques include increasing the transmitted power, constrained by a peak-power transmission limit, for example with automatic gain control (AGC) in a noisy environment.
- Postprocessing techniques include:
  1) Reduction of additive noise in digital telephony and in vehicle and aircraft communication.
  2) Reduction of interfering backgrounds and speakers for the hearing-impaired.
  3) Removal of unwanted convolutional channel distortion and reverberation.

Speaker Recognition
- This area of speech signal processing exploits the variability of speech model parameters across speakers.
- Applications include verifying a person's identity for entrance to a secure facility or personal account, and voice identification in forensic investigation.
- An understanding of the speech model features that cue a person's identity is also important in speech modification (e.g., speaker conversion).
- Thus, speech modification and speaker recognition can be developed synergistically.

Speech Synthesis
What: make a virtual voice (or music).
Why: for technology to communicate when a display would be inconvenient because it is: (a) too big, (b) eyes busy, (c) via phone, (d) in the dark, (e) moving around.
Problems:
- The spelling of words doesn't match their sound.
- Some words have multiple meanings and sounds.
- Simplistic speech models sound mechanical.
- Speech sounds are influenced by adjacent phonemes.
- Important words must be slightly louder.
- Voice pitch and talking speed must vary smoothly throughout a sentence.

Speech Recognition
What: to convert a speech waveform into text.
Why: to communicate with and control technology when a keyboard would be inconvenient because it is: (a) too big, (b) hands busy, (c) via phone, (d) in the dark, (e) moving around.
Problems:
- The spelling of words doesn't match their sound.
- The waveform of a word varies a lot between different speakers (or even for the same speaker).
- The extracted features won't be exactly repeatable.
- Speech sounds are influenced by adjacent phonemes.
- Speaking speed varies enormously.
- There is no clear boundary between words or phonemes.
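The waveform-variability problems above are why recognizers never work on raw samples: the waveform is first cut into short overlapping frames and each frame is reduced to a feature vector. As a minimal stand-in for a real front end (which would compute MFCCs or similar), the sketch below frames the signal and takes the log energy per frame; the 400-sample / 160-sample sizes are the common 25 ms / 10 ms choice at 16 kHz, used here purely for illustration.

```python
import numpy as np

def log_energy_features(x, frame_len=400, hop=160):
    """Log energy per frame: a toy stand-in for a real ASR feature extractor."""
    n_frames = 1 + (len(x) - frame_len) // hop
    feats = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop: i * hop + frame_len]
        feats[i] = np.log(np.sum(frame ** 2) + 1e-10)  # small floor avoids log(0)
    return feats

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)    # one second of a 440 Hz tone as toy input
feats = log_energy_features(x)
print(len(feats))                  # 98 frames for 1 s of audio
```

Even this crude feature shows the repeatability problem: recording the same word twice yields feature sequences that are similar in shape but never identical, which is why recognizers compare features statistically (e.g., with HMMs) rather than sample by sample.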