Speech Coding
1 INTRODUCTION
   1.1 Motivation
   1.2 Overview of Speech Coding
   1.3 Applications of Speech Coders
   1.4 Objective of Present Work
   1.5 Report Organization
2 LITERATURE REVIEW
   2.1 Introduction
3 PRESENT WORK
   3.1 Structure of Speech Coders
   3.2 Classification of Speech Coders
      3.2.1 Classification by Bit-Rate
      3.2.2 Classification by Coding Techniques
   3.3 About Algorithms
   3.4 Pulse Code Modulation
      3.4.1 Modulation
      3.4.2 Demodulation
      3.4.3 Digitization
   3.5 Differential Pulse Code Modulation
   3.6 Other Popular Algorithms
4 RESULTS AND DISCUSSIONS
   4.1 Implementation Details
   4.2 Results
5 CONCLUSION
6 FUTURE SCOPE
REFERENCES
List of Abbreviations
3G Third Generation
AbS Analysis-by-Synthesis
ACELP Algebraic Code-Excited Linear Prediction
ACR Absolute Category Rating
ADPCM Adaptive Differential Pulse Code Modulation
CDMA Code Division Multiple Access
CELP Code-Excited Linear Prediction
DMOS Degradation Mean Opinion Score
DoD U.S. Department of Defense
DPCM Differential Pulse Code Modulation
DSVD Digital Simultaneous Voice and Data
DTAD Digital Telephone Answering Device
GSM Groupe Spécial Mobile
ICASSP International Conference on Acoustics, Speech, and Signal Processing
IDFT Inverse Discrete Fourier Transform
IEC International Electrotechnical Commission
IEEE Institute of Electrical and Electronics Engineers
IP Internet Protocol
ITU International Telecommunication Union
ITU–R ITU Radiocommunication Sector
ITU–T ITU Telecommunication Standardization Sector
MOS Mean Opinion Score
NCS National Communications System
PC Personal Computer
PCM Pulse Code Modulation
POTS Plain Old Telephone Service
PSTN Public Switched Telephone Network
RAM Random Access Memory
RC Reflection Coefficient
RCR Research and Development Center for Radio Systems of Japan
RMS Root Mean Square
ROM Read Only Memory
SNR Signal-to-Noise Ratio
TDMA Time Division Multiple Access
TI Texas Instruments
TIA Telecommunications Industry Association
TTS Text-to-Speech
UMTS Universal Mobile Telecommunications System
VBR Variable Bit-rate
VoIP Voice over Internet Protocol
VSELP Vector Sum Excited Linear Prediction
ACKNOWLEDGEMENT
We take this opportunity to acknowledge the cooperation, goodwill, and support,
both moral and technical, extended by the many individuals involved in the
preparation of this report. We shall always cherish our associations with them.
INTRODUCTION
1.1 MOTIVATION
In the era of third-generation (3G) wireless personal communications standards,
despite the emergence of broad-band access network standard proposals, the most
important mobile radio services are still based on voice communications. Even when
the predicted surge of wireless data and Internet services becomes a reality, voice will
remain the most natural means of human communication, although it may be
delivered via the Internet, predominantly after compression.
Due to the increasing demand for speech communication, speech coding
technology has received increasing interest from the research,
standardization, and business communities. Advances in microelectronics and the vast
availability of low-cost programmable processors and dedicated chips have enabled
rapid technology transfer from research to product development; this encourages the
research community to investigate alternative schemes for speech coding, with the
objectives of overcoming deficiencies and limitations [2]. The standardization
community pursues the establishment of standard speech coding methods for various
applications that will be widely accepted and implemented by the industry. The
business communities capitalize on the ever-increasing demand and opportunities in
the consumer, corporate, and network environments for speech processing products.
1.2 OVERVIEW OF SPEECH CODING
This section describes the structure, properties, and applications of speech coding
technology.
Speech coding is the art of creating a minimally redundant representation of the
speech signal that can be efficiently transmitted or stored in digital media, and
decoding the signal with the best possible perceptual quality. Like any other
continuous-time signal, speech may be represented digitally through the processes of
sampling and quantization; speech is typically quantized using either 16-bit uniform
or 8-bit companded quantization [2]. Like many other signals, however, a sampled
speech signal contains a great deal of information that is either redundant (nonzero
mutual information between successive samples in the signal) or perceptually
irrelevant (information that is not perceived by human listeners). Most
telecommunications coders are lossy, meaning that the synthesized speech is
perceptually similar to the original but may be physically dissimilar.
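The 8-bit companded quantization mentioned above is usually based on the μ-law curve of ITU-T G.711. The following sketch illustrates the companding idea in Python (the formulas follow the standard μ-law definition with μ = 255; treating samples as floats in [-1, 1] is an assumption made purely for illustration):

```python
import math

MU = 255  # mu-law companding constant used by G.711

def mu_law_compress(x):
    """Map a sample in [-1, 1] through the logarithmic mu-law curve."""
    return math.copysign(math.log(1 + MU * abs(x)) / math.log(1 + MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress: recover the linear-domain sample."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

# Companding stretches small amplitudes, so a uniform 8-bit quantizer
# applied after compression gives finer effective resolution near zero.
print(mu_law_compress(0.01), mu_law_compress(0.5))
```

After compression the value would be uniformly quantized to 8 bits; expansion at the decoder undoes the curve.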
Speech coding is performed using numerous steps or operations specified as
an algorithm. An algorithm is any well-defined computational procedure that takes
some value, or set of values, as input and produces some value, or set of values, as
output. An algorithm is thus a sequence of computational steps that transform the
input into the output. Many signal processing problems—including speech coding—
can be formulated as a well-specified computational problem; hence, a particular
coding scheme can be defined as an algorithm. In general, an algorithm is specified
with a set of instructions, providing the computational steps needed to perform a task.
With these instructions, a computer or processor can execute them so as to complete
the coding task. The instructions can also be translated to the structure of a digital
circuit, carrying out the computation directly at the hardware level [2].
LITERATURE REVIEW
The history of audio and music compression began in the 1930s with research into
pulse-code modulation (PCM). Compression of digital audio started in the 1960s,
driven by telephone companies concerned with the cost of transmission
bandwidth. The 1990s saw improvements to these earlier algorithms and increases
in compression ratios at given audio quality levels. Speech compression is often
referred to as speech coding, which is defined as a method for reducing the
amount of information needed to represent a speech signal. Most forms
of speech coding are usually based on a lossy algorithm. Lossy algorithms are
considered acceptable when encoding speech because the loss of quality is often
undetectable to the human ear.
2.1 INTRODUCTION
Speech coding is fundamental to the operation of the public switched telephone
network (PSTN), videoconferencing systems, digital cellular communications, and
emerging voice over Internet protocol (VoIP) applications. The goal of speech coding
is to represent speech in digital form with as few bits as possible while maintaining
the intelligibility and quality required for the particular application [4]. Interest in
speech coding is motivated by the evolution to digital communications and the
requirement to minimize bit rate, and hence, conserve bandwidth. There is always a
tradeoff between lowering the bit rate and maintaining the delivered voice quality and
intelligibility; however, depending on the application, many other constraints also
must be considered, such as complexity, delay, and performance with bit errors or
packet losses.
Based on these developments, it is possible today, and it is likely in the near
future, that our day-to-day voice communications will involve multiple hops
across heterogeneous networks. This is a considerable departure from the plain old
telephone service (POTS) on the PSTN, and indeed, these future voice connections
will differ greatly even from the digital cellular calls connected through the PSTN
today. As the networks supporting our voice calls become less homogeneous and
include more wireless links, many new challenges and opportunities emerge. The
1990s saw nearly exponential growth in speech coding standards for a wide
range of networks and applications, including the PSTN, digital cellular, and
multimedia streaming.
In order to compare the various speech coding methods and standards, it is
necessary to have methods for establishing the quality and intelligibility produced by
a speech coder. It is a difficult task to find objective measures of speech quality, and
often, the only acceptable approach is to perform subjective listening tests [5].
However, there have been some recent successes in developing objective quantities,
experimental procedures, and mathematical expressions that have a good correlation
with speech quality and intelligibility.
PRESENT WORK
The goal of all speech coding systems is to transmit speech with the highest possible
quality using the least possible channel capacity. In general, there is a positive
correlation between coder bit-rate efficiency and the algorithmic complexity required
to achieve it. The more complex an algorithm is, the greater its processing delay
and cost of implementation. A speech coder converts a digitized speech signal into a
coded representation, which is usually transmitted in frames. A speech decoder
receives coded frames and synthesizes reconstructed speech. Standards typically
dictate the input–output relationships of both coder and decoder. The input–output
relationship is specified using a reference implementation, but novel implementations
are allowed, provided that input–output equivalence is maintained. Speech coders
differ primarily in bit rate (measured in bits per sample or bits per second),
complexity (measured in operations per second), delay (measured in milliseconds
between recording and playback), and perceptual quality of the synthesized speech.
The above bit-rate, also known as input bit-rate, is what the source encoder
attempts to reduce (Figure 3.1). The output of the source encoder represents the
encoded digital speech and in general has substantially lower bit-rate than the input.
The linear prediction coding algorithm, for instance, has an output rate of 2.4 kbps, a
reduction of more than 53 times with respect to the input. The encoded digital speech
data is further processed by the channel encoder, providing error protection to the bit-
stream before transmission to the communication channel, where various noise and
interference can corrupt the transmitted data. Even though in Figure 3.1 the
source encoder and channel encoder are shown separately, it is also possible to
implement them jointly, so that source and channel encoding are done in a single step.
channel decoder processes the error-protected data to recover the encoded data, which
is then passed to the source decoder to generate the output digital speech signal,
having the original rate. This output digital speech signal is converted to continuous-
time analog form through standard procedures: digital-to-analog conversion followed
by anti-aliasing filtering [2].
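The rate figures quoted above can be checked with simple arithmetic; a small Python sketch using the numbers from the text:

```python
fs = 8000              # sampling frequency in Hz (toll-quality speech)
bits_per_sample = 16   # uniform quantization, as in the text
input_rate = fs * bits_per_sample   # 128,000 bps = 128 kbps input bit-rate
lpc_rate = 2400                     # LPC source encoder output, in bps

ratio = input_rate / lpc_rate       # reduction factor of the source encoder
print(f"{input_rate // 1000} kbps -> {lpc_rate / 1000} kbps, {ratio:.1f}x smaller")
```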
The input speech (a discrete-time signal having a bit-rate of 128 kbps) enters
the encoder to produce the encoded bit-stream, or compressed speech data. Bit-rate of
the bit-stream is normally much lower than that of the input speech.
The decoder takes the encoded bit-stream as its input to produce the output
speech signal, which is a discrete-time signal having the same rate as the input speech.
As described later in this report, many diverse approaches can be used to design the
encoder/decoder pair. Different methods provide differing speech quality and bit-rate,
as well as implementation complexity. The encoder/decoder structure represented in
Figure 3.2 is known as a speech coder, where the input speech is encoded to produce a
low-rate bit-stream. This bit-stream is input to the decoder, which constructs an
approximation of the original signal.
Coding Delay
Consider the delay measured using the topology shown in Figure 3.3. The delay
obtained in this way is known as coding delay, or one-way coding delay [Chen,
1995], which is given by the elapsed time from the instant a speech sample arrives at
the encoder input to the instant when the same speech sample appears at the decoder
output [2]. The definition does not consider exterior factors, such as communication
distance or equipment, which are not controllable by the algorithm designer.
Based on the definition, the coding delay can be decomposed into four major
components (see Figure 3.4):
1. Encoder Buffering Delay: Many speech encoders require the collection of a
certain number of samples before processing. For instance, typical linear
prediction (LP)-based coders need to gather one frame of samples ranging
from 160 to 240 samples, or 20 to 30 ms, before proceeding with the actual
encoding process.
2. Encoder Processing Delay: The encoder consumes a certain amount of time to
process the buffered data and construct the bit-stream. This delay can be
shortened by increasing the computational power of the underlying platform
and by utilizing efficient algorithms. The processing delay must be shorter
than the buffering delay; otherwise the encoder will not be able to handle data
from the next frame.
3. Transmission Delay: Once the encoder finishes processing one frame of input
samples, the resultant bits representing the compressed bit-stream are
transmitted to the decoder. Many transmission modes are possible and the
choice depends on the particular system requirements.
4. Decoder Processing Delay: This is the time required to decode in order to
produce one frame of synthetic speech. As for the case of the encoder
processing delay, its upper limit is given by the encoder buffering delay, since
a whole frame of synthetic speech data must be completed within this time
frame in order to be ready for the next frame.
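Since the four components above simply add up, the one-way coding delay can be sketched as below (Python; the 20 ms buffering delay matches the LP frame example in component 1, while the processing and transmission figures are illustrative assumptions):

```python
def coding_delay_ms(buffering, encoder_proc, transmission, decoder_proc):
    """One-way coding delay: the sum of its four components, in ms."""
    # Processing delays must not exceed the buffering delay, or the
    # coder cannot keep up with the next frame in real time.
    assert encoder_proc <= buffering and decoder_proc <= buffering
    return buffering + encoder_proc + transmission + decoder_proc

total = coding_delay_ms(buffering=20, encoder_proc=10,
                        transmission=20, decoder_proc=10)
print(total, "ms one-way coding delay")
```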
A given method works well over a certain bit-rate range, but the quality of the
decoded speech drops drastically if the bit-rate falls below a certain threshold. The
minimum bit-rate that speech coders can achieve is limited by the information content
of the speech signal. Judging from the message rate recoverable from a linguistic
perspective for typical speech signals, it is reasonable to say that the minimum lies
somewhere around 100 bps. Current coders can produce good quality at 2 kbps and
above, suggesting that there is plenty of room for future improvement.
Parametric Coders: Within the framework of parametric coders, the speech signal is
assumed to be generated from a model, which is controlled by some parameters.
During encoding, parameters of the model are estimated from the input speech signal,
with the parameters transmitted as the encoded bit-stream. This type of coder makes
no attempt to preserve the original shape of the waveform, and hence SNR is a useless
quality measure. Perceptual quality of the decoded speech is directly related to the
accuracy and sophistication of the underlying model. Due to this limitation, the coder
is signal specific, having poor performance for non-speech signals.
There are several proposed models in the literature. The most successful,
however, is based on linear prediction. In this approach, the human speech production
mechanism is summarized using a time-varying filter, with the coefficients of the
filter found using the linear prediction analysis procedure.
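The linear prediction analysis procedure mentioned above amounts to solving a set of normal equations, usually via the Levinson-Durbin recursion. A Python sketch (the autocorrelation values fed in correspond to an ideal first-order process with coefficient 0.9, chosen purely for illustration):

```python
def levinson_durbin(r, order):
    """Solve the LP normal equations for predictor coefficients a[0..order-1],
    given the autocorrelation sequence r[0..order]."""
    a = [0.0] * order
    e = r[0]                                  # prediction error energy
    for i in range(order):
        # reflection coefficient for stage i
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / e
        a_new = a[:]
        a_new[i] = k
        for j in range(i):
            a_new[j] = a[j] - k * a[i - 1 - j]
        a = a_new
        e *= 1 - k * k                        # error shrinks at each stage
    return a

# autocorrelation of an ideal AR(1) process x[n] = 0.9 x[n-1] + e[n]
coeffs = levinson_durbin([1.0, 0.9, 0.81], order=2)
print(coeffs)   # first coefficient close to 0.9, second close to 0
```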
Hybrid Coders: As its name implies, a hybrid coder combines the strength of a
waveform coder with that of a parametric coder. Like a parametric coder, it relies on a
speech production model; during encoding, parameters of the model are located.
Additional parameters of the model are optimized in such a way that the decoded
speech is as close as possible to the original waveform, with the closeness often
measured by a perceptually weighted error signal. As in waveform coders, an attempt
is made to match the original signal with the decoded signal in the time domain.
This class dominates the medium bit-rate coders, with the code-excited linear
prediction algorithm and its variants the most outstanding representatives. From a
technical perspective, the difference between a hybrid coder and a parametric coder is
that the former attempts to quantize or represent the excitation signal to the speech
production model, which is transmitted as part of the encoded bit-stream. The latter,
however, achieves low bit-rate by discarding all detail information of the excitation
signal; only coarse parameters are extracted. A hybrid coder tends to behave like a
waveform coder for high bit-rate, and like a parametric coder at low bit-rate, with fair
to good quality for medium bit-rate.
3.3 ABOUT ALGORITHMS
A speech coder is generally specified as an algorithm, which is defined as a
computational procedure that takes some input values to produce some output values.
An algorithm can be implemented as software (i.e., a program to command a
processor) or as hardware (direct execution through digital circuitry) [6]. With the
widespread availability of low-cost high-performance digital signal processors (DSPs)
and general-purpose microprocessors, many signal processing tasks—done in the old
days using analog circuitry—are predominantly executed in the digital domain.
Advantages of going digital are many: programmability, reliability, and the ability to
handle very complex procedures, such as the operations involved in a speech coder,
which would be impractical to realize with analog circuitry. In this section the
various aspects of algorithmic implementation are explained.
3.4 PULSE CODE MODULATION
3.4.1 Modulation
A sine wave (red curve) is sampled and quantized for PCM. The sine wave is sampled
at regular intervals, shown as ticks on the x-axis. For each sample, one of the
available values is chosen by some algorithm (in this case, the floor function is used).
This produces a fully discrete representation of the input signal (shaded area) that can
be easily encoded as digital data for storage or manipulation [8]. For the sine wave
example of Fig. 3.5, we can verify that the quantized values at the sampling moments are
7, 9, 11, 12, 13, 14, 14, 15, 15, 15, 14, etc. Encoding these values as binary numbers
would result in the following set of nibbles: 0111, 1001, 1011, 1100, 1101, 1110,
1110, 1111, 1111, 1111, 1110, etc. These digital values could then be further
processed or analyzed by a purpose-specific digital signal processor or general
purpose CPU. Several Pulse Code Modulation streams could also be multiplexed into
a larger aggregate data stream, generally for transmission of multiple streams over a
single physical link. This technique is called time-division multiplexing, or TDM, and
is widely used, notably in the modern public telephone system. The sampling and
quantization of a sine wave are illustrated below:
Fig. 3.5 Sampling and quantization of a signal (red) for 4-bit PCM
There are many ways to implement a real device that performs this task. In real
systems, such a device is commonly implemented on a single integrated circuit that
lacks only the clock necessary for sampling, and is generally referred to as an ADC
(Analog-to-Digital converter). These devices will produce on their output a binary
representation of the input whenever they are triggered by a clock signal, which
would then be read by a processor of some sort.
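The floor-based quantization just described can be sketched in a few lines of Python (4-bit codes over an assumed input range of [-1, 1]; the sketch does not try to reproduce the exact values of the figure, which depend on its sampling instants):

```python
import math

def pcm_quantize(samples, bits=4):
    """Uniform PCM: map samples in [-1, 1] to integer codes 0 .. 2**bits - 1
    using the floor function, as in the description above."""
    levels = 2 ** bits
    return [min(levels - 1, math.floor((x + 1) / 2 * levels)) for x in samples]

# one period of a sine wave, 16 samples
sine = [math.sin(2 * math.pi * n / 16) for n in range(16)]
codes = pcm_quantize(sine)
nibbles = [format(c, "04b") for c in codes]   # each code as a 4-bit word
print(codes)
print(nibbles)
```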
3.4.2 Demodulation
To produce output from the sampled data, the procedure of modulation is applied in
reverse [8]. After each sampling period has passed, the next value is read and the
output of the system is shifted instantaneously (in an idealized system) to the new
value. As a result of these instantaneous transitions, the discrete signal will have a
significant amount of inherent high frequency energy, mostly harmonics of the
sampling frequency. To smooth out the signal and remove these undesirable
harmonics, the signal would be passed through analog filters that suppress artifacts
outside the expected frequency range (i.e. greater than ½ fs, the maximum resolvable
frequency). Some systems use digital filtering to remove part of this
high-frequency energy. In some systems, no explicit filtering is done at all; as it is impossible for
any system to reproduce a signal with infinite bandwidth, inherent losses in the
system compensate for the artifacts — or the system simply does not require much
precision. The sampling theorem suggests that practical PCM devices, provided a
sampling frequency that is sufficiently greater than that of the input signal, can
operate without introducing significant distortions within their designed frequency
bands. The electronics involved in producing an accurate analog signal from the
discrete data are similar to those used for generating the digital signal. These devices
are DACs (digital-to-analog converters), and operate similarly to ADCs. They
produce on their output a voltage or current (depending on type) that represents the
value presented on their inputs. This output would then generally be filtered and
amplified for use.
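The reverse mapping described in this section can be sketched as follows (Python; mapping each code back to the midpoint of its quantization interval is an illustrative assumption, and the analog smoothing filter is omitted):

```python
def pcm_dequantize(codes, bits=4):
    """Map integer PCM codes back to the midpoints of their quantization
    intervals in [-1, 1]; the result would then be held and smoothed."""
    levels = 2 ** bits
    step = 2 / levels                 # width of one quantization interval
    return [-1 + (c + 0.5) * step for c in codes]

# reconstruction error is at most half a step (1/16 for 4 bits)
values = pcm_dequantize([0, 8, 15])
print(values)
```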
3.5 DIFFERENTIAL PULSE CODE MODULATION
DPCM encodes the PCM values as differences between the current and the
predicted value. An algorithm predicts the next sample based on the previous
samples, and the encoder stores only the difference between this prediction
and the actual value. If the prediction is reasonable, fewer bits can be used to
represent the same information. For audio, this type of encoding reduces the
number of bits required per sample by about 25% compared to PCM.
Adaptive DPCM (ADPCM) is a variant of DPCM that varies the size of the
quantization step, to allow further reduction of the required bandwidth for a
given signal-to-noise ratio.
Delta modulation, another variant, uses one bit per sample.
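A closed-loop sketch of the DPCM idea just described, in Python (the first-order predictor coefficient of 0.45 mirrors the one used later in Chapter 4; the uniform difference quantizer with step 0.05 is an illustrative assumption):

```python
def dpcm_encode(samples, a=0.45, step=0.05):
    """Quantize only the difference between each sample and its prediction."""
    codes, pred = [], 0.0
    for x in samples:
        q = round((x - pred) / step)       # quantized prediction error
        codes.append(q)
        pred = a * (pred + q * step)       # predict from the reconstructed value
    return codes

def dpcm_decode(codes, a=0.45, step=0.05):
    """Rebuild the signal by adding each decoded difference to the prediction."""
    out, pred = [], 0.0
    for q in codes:
        x = pred + q * step
        out.append(x)
        pred = a * x
    return out

codes = dpcm_encode([0.1, 0.2, 0.15])
rebuilt = dpcm_decode(codes)
```

Because the encoder predicts from the reconstructed value rather than the original, encoder and decoder stay in lockstep and the per-sample error never exceeds half a quantization step.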
There are other waveform algorithms used in speech coding, such as A-law PCM,
µ-law PCM, and ADPCM. An ADPCM algorithm is used to map a series
of 8-bit µ-law or A-law PCM samples into a series of 4-bit ADPCM samples. In this
way, the capacity of the line is doubled. The technique is detailed in the G.726
standard. Some ADPCM techniques are used in Voice over IP communications.
Similarly, other types of coders are also available, like CELP and VSELP. Code-excited
linear prediction (CELP) is a speech coding algorithm originally proposed by
M.R. Schroeder and B.S. Atal in 1985. At the time, it provided significantly better
quality than existing low bit-rate algorithms, such as RELP and LPC vocoders (e.g.
FS-1015). Along with its variants, such as ACELP, RCELP, LD-CELP and VSELP, it
is currently the most widely used speech coding algorithm. CELP is now used as a
generic term for a class of algorithms and not for a particular codec. Vector sum
excited linear prediction (VSELP) is a speech coding method used in several cellular
standards. Variations of this codec have been used in several 2G cellular telephony
standards, including IS-54, IS-136 (D-AMPS) and GSM (Half Rate speech). It was
also used in the first version of RealAudio for audio over the Internet. The IS-54
VSELP standard was published by the Telecommunications Industry Association in
1989.
CHAPTER 4
RESULTS AND DISCUSSIONS
4.1 IMPLEMENTATION DETAILS
We studied the basics of speech coding system and various coding techniques with
their design procedures and application scopes. Then we implemented PCM and
DPCM coders in MATLAB 7. We then compared the PCM and DPCM coders on
criteria such as speech quality, error, and execution time, by varying the bit-rate
and sampling frequency.
Input speech has been sampled at 8 kHz (for comparison with standard
coders).
For PCM, we have used a uniform quantizer with 2^16 = 65536 levels. The
bit-rate is 8000 × 16 = 128 kbps.
For DPCM, we have used an adaptive first-order linear predictor with
coefficient α = 0.45, again with a uniform quantizer with 2^16 = 65536
levels. Here also, the bit-rate is 128 kbps.
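The speech-quality criterion of the comparison can be made concrete as a signal-to-noise ratio; a Python sketch (the report's actual experiments are in MATLAB 7, and the 440 Hz tone here is only a stand-in for recorded speech):

```python
import math

def uniform_quantize(samples, bits=16):
    """Uniform quantizer over [-1, 1] with 2**bits levels, as in the PCM coder."""
    levels = 2 ** bits
    return [round((x + 1) / 2 * (levels - 1)) for x in samples]

def uniform_dequantize(codes, bits=16):
    levels = 2 ** bits
    return [2 * c / (levels - 1) - 1 for c in codes]

def snr_db(original, reconstructed):
    """Signal-to-noise ratio of the coded signal, in dB."""
    sig = sum(x * x for x in original)
    err = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    return float("inf") if err == 0 else 10 * math.log10(sig / err)

fs = 8000   # sampling frequency used in the experiments
tone = [0.5 * math.sin(2 * math.pi * 440 * n / fs) for n in range(fs)]
rebuilt = uniform_dequantize(uniform_quantize(tone))
print(f"16-bit PCM SNR: {snr_db(tone, rebuilt):.1f} dB")
```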
4.2 RESULTS
Table 4.1 Results for Quantization Bits = 16 and Sampling Frequency = 8 kHz
Table 4.3 Results for Quantization Bits = 16 and Sampling Frequency = 16 kHz
Note: The execution time also depends on the computer system on which the codes
are tested. All these simulations were run on a Windows Vista-based system with a
2.8 GHz processor and 2 GB of RAM.
CHAPTER 5
CONCLUSIONS
Coded speech signals offer several advantages: lower sensitivity to channel
noise; easier error protection, encryption, multiplexing, and packetization; and
efficient transmission over bandwidth-constrained channels owing to the lower bit
rate. PCM coders produce better quality speech than most parametric coders, but
parametric coders achieve a relatively higher compression ratio. Further, the error
introduced by the DPCM algorithm is greater than that introduced by the PCM
algorithm, but PCM in turn achieves a lower compression ratio. Thus the choice of
coder is based on the application requirements. Furthermore, hybrid coders, having
the characteristics of both waveform coders and parametric coders, are being
designed to provide better compression ratios while maintaining reasonable speech quality.
For this project, speech coding based on PCM and DPCM was successfully
implemented in MATLAB. The output obtained from the input speech is of good,
acceptable quality, and the targeted objective has been achieved.
CHAPTER 6
FUTURE SCOPE
In recent years, there has been significant progress in the fundamental building blocks
of source coding: flexible methods of time-frequency analysis, adaptive vector
quantization, and noiseless coding. Compelling applications of these techniques to
speech coding are relatively less mature. The present research is focused on meeting
the critical need for high quality speech transmission over digital cellular channels at
4 kbps. Research on properly coordinated source and channel coding is needed to
realize a good solution to this problem. Although high-quality low-delay coding at 16
kbps has been achieved, low-delay coding at lower rates remains a challenging
problem. Improving the performance of low-rate coders operating over noisy channels
is also an open problem. Additionally, there is a demand for robust low-rate coders
that can accommodate signals other than speech, such as music. Further, current
research is also focused on the area of VoIP.
REFERENCES
[1] T. P. Barnwell III, K. Nayebi, and C. H. Richardson, Speech Coding: A
Computer Laboratory Textbook, John Wiley & Sons, Inc., 1996.
[2] W. C. Chu, Speech Coding Algorithms: Foundation and Evolution of
Standardized Coders, Wiley-Interscience, 2003.
[3] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007.
[4] M. Hasegawa-Johnson and A. Alwan, "Speech Coding: Fundamentals and
Applications".
[5] L. R. Rabiner and R. W. Schafer, "Introduction to Digital Speech
Processing", Foundations and Trends in Signal Processing, Vol. 1, Nos. 1–2,
pp. 1–194, 2007.
[6] DDVPC, CELP Speech Coding Standard, Technical Report FS-1016, U.S.
Department of Defense Voice Processing Consortium, 1989.
[7] A. Das and A. Gersho, "Low-rate multimode multiband spectral coding of
speech", Int. J. Speech Tech., 2(4): 317–327, 1999.
[8] J. H. Chung and R. W. Schafer, "Performance evaluation of analysis-by-synthesis
homomorphic vocoders", Proceedings of IEEE ICASSP, vol. 2, pp. 117–120,
March 1992.
[9] R. Goldberg and L. Riek, A Practical Handbook of Speech Coders, CRC Press,
Boca Raton, FL, 2000.
[10] http://www.mathworks.com