0% found this document useful (0 votes)
64 views80 pages

Manual

Uploaded by

supmaxred
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
64 views80 pages

Manual

Uploaded by

supmaxred
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 80

The Speex Codec Manual

(version 1.2-beta1)
Jean-Marc Valin

August 12, 2006

1
2

Copyright (c) 2002-2006 Jean-Marc Valin/Xiph.org Foundation


Permission is granted to copy, distribute and/or modify this document under the
terms of the GNU Free Documentation License, Version 1.1 or any later version pub-
lished by the Free Software Foundation; with no Invariant Section, with no Front-Cover
Texts, and with no Back-Cover. A copy of the license is included in the section entitled
"GNU Free Documentation License".
CONTENTS 3

Contents
1 Introduction to Speex 6

2 Codec description 8
2.1 Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Codec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Preprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Adaptive Jitter Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Acoustic Echo Canceller . . . . . . . . . . . . . . . . . . . . . . . . 10

3 Compiling 11

4 Command-line encoder/decoder 12
4.1 speexenc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 speexdec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

5 Programming with Speex (the libspeex API) 15


5.1 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5.2 Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.3 Preprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.4 Echo Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.4.1 Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . 20
5.5 Codec Options (speex_*_ctl) . . . . . . . . . . . . . . . . . . . . . . 21
5.6 Mode queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.7 Preprocessor options . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.8 Packing and in-band signalling . . . . . . . . . . . . . . . . . . . . . 24

6 Formats and standards 25


6.1 RTP Payload Format . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.2 MIME Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.3 Ogg file format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

7 Introduction to CELP Coding 28


7.1 Source-Filter Model of Speech Prediction . . . . . . . . . . . . . . . 28
7.2 Linear Prediction (LPC) . . . . . . . . . . . . . . . . . . . . . . . . 28
7.3 Pitch Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
7.4 Innovation Codebook . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.5 Noise Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
CONTENTS 4

7.6 Analysis-by-Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . 32

8 Speex narrowband mode 33


8.1 Whole-Frame Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 33
8.2 Sub-Frame Analysis-by-Synthesis . . . . . . . . . . . . . . . . . . . 33
8.3 Bit allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
8.4 Perceptual enhancement . . . . . . . . . . . . . . . . . . . . . . . . 36

9 Speex wideband mode (sub-band CELP) 38


9.1 Linear Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
9.2 Pitch Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
9.3 Excitation Quantization . . . . . . . . . . . . . . . . . . . . . . . . . 38
9.4 Bit allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

A FAQ 40

B Sample code 44
B.1 sampleenc.c . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
B.2 sampledec.c . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

C IETF RTP Profile 48

D Speex License 71

E GNU Free Documentation License 72


LIST OF TABLES 5

List of Tables
1 In-band signalling codes . . . . . . . . . . . . . . . . . . . . . . . . 25
2 Ogg/Speex header packet . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Bit allocation for narrowband modes . . . . . . . . . . . . . . . . . . 36
4 Quality versus bit-rate . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5 Bit allocation for high-band in wideband mode . . . . . . . . . . . . 39
1 INTRODUCTION TO SPEEX 6

1 Introduction to Speex
The Speex project (http://www.speex.org/) has been started because there was a
need for a speech codec that was open-source and free from software patents. These
are essential conditions for being used by any open-source software. There is already
Vorbis that does general audio, but it is not really suitable for speech. Also, unlike
many other speech codecs, Speex is not targeted at cell phones but rather at voice over
IP (VoIP) and file-based compression.
As design goals, we wanted to have a codec that would allow both very good quality
speech and low bit-rate (unfortunately not at the same time!), which led us to develop-
ing a codec with multiple bit-rates. Of course very good quality also meant we had to
do wideband (16 kHz sampling rate) in addition to narrowband (telephone quality, 8
kHz sampling rate).
Designing for VoIP instead of cell phone use means that Speex must be robust to
lost packets, but not to corrupted ones since packets either arrive unaltered or don’t ar-
rive at all. Also, the idea was to have a reasonable complexity and memory requirement
without compromising too much on the efficiency of the codec.
All this led us to the choice of CELP as the encoding technique to use for Speex.
One of the main reasons is that CELP has long proved that it could do the job and
scale well to both low bit-rates (think DoD CELP @ 4.8 kbps) and high bit-rates (think
G.728 @ 16 kbps).
The main characteristics can be summarized as follows:

• Free software/open-source, patent and royalty-free

• Integration of narrowband and wideband using an embedded bit-stream

• Wide range of bit-rates available (from 2 kbps to 44 kbps)

• Dynamic bit-rate switching and Variable Bit-Rate (VBR)

• Voice Activity Detection (VAD, integrated with VBR)

• Variable complexity

• Ultra-wideband mode at 32 kHz (up to 48 kHz)

• Intensity stereo encoding option

• Fixed-point implementation (work in progress)


1 INTRODUCTION TO SPEEX 7

This document is divided in the following way. Section 2 describes the different Speex
features and defines some terms that will be used in later sections. Section 4 provides
information about the standard command-line tools, while 5 contains information about
programming using the Speex API. Section 6 has some information related to Speex
and standards. The three last sections describe the internals of the codec and require
some signal processing knowledge. Section 7 explains the general idea behind CELP,
while sections 8 and 9 are specific to Speex. Note that if you are only interested in
using Speex, those three last sections are not required.
2 CODEC DESCRIPTION 8

2 Codec description
This section describes the main features provided by Speex.

2.1 Concepts
Here are some concepts in speech coding that help better understand the rest of the
manual. Emphasis is placed on the Speex features.

Sampling rate
Speex is mainly designed for 3 different sampling rates: 8 kHz, 16 kHz, and 32 kHz.
These are respectively refered to as narrowband, wideband and ultra-wideband. For a
sampling rate of Fs kHz, the highest frequency that can be represented is equal to Fs /2
kHz. This is a consequence of Nyquist’s sampling theorem (and Fs /2 is known as the
Nyquist frequency).

Quality
Speex encoding is controlled most of the time by a quality parameter that ranges from
0 to 10. In constant bit-rate (CBR) operation, the quality parameter is an integer, while
for variable bit-rate (VBR), the parameter is a float.

Complexity (variable)
With Speex, it is possible to vary the complexity allowed for the encoder. This is done
by controlling how the search is performed with an integer ranging from 1 to 10 in
a way that’s similar to the -1 to -9 options to gzip and bzip2 compression utilities.
For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at
complexity 10, but the CPU requirements for complexity 10 is about 5 times higher
than for complexity 1. In practice, the best trade-off is between complexity 2 and 4,
though higher settings are often useful when encoding non-speech sounds like DTMF
tones.

Variable Bit-Rate (VBR)


Variable bit-rate (VBR) allows a codec to change its bit-rate dynamically to adapt to
the “difficulty” of the audio being encoded. In the example of Speex, sounds like
vowels and high-energy transients require a higher bit-rate to achieve good quality,
2 CODEC DESCRIPTION 9

while fricatives (e.g. s,f sounds) can be coded adequately with less bits. For this
reason, VBR can achive lower bit-rate for the same quality, or a better quality for a
certain bit-rate. Despite its advantages, VBR has two main drawbacks: first, by only
specifying quality, there’s no guaranty about the final average bit-rate. Second, for
some real-time applications like voice over IP (VoIP), what counts is the maximum
bit-rate, which must be low enough for the communication channel.

Average Bit-Rate (ABR)


Average bit-rate solves one of the problems of VBR, as it dynamically adjusts VBR
quality in order to meet a specific target bit-rate. Because the quality/bit-rate is adjusted
in real-time (open-loop), the global quality will be slightly lower than that obtained by
encoding in VBR with exactly the right quality setting to meet the target average bit-
rate.

Voice Activity Detection (VAD)


When enabled, voice activity detection detects whether the audio being encoded is
speech or silence/background noise. VAD is always implicitly activated when encoding
in VBR, so the option is only useful in non-VBR operation. In this case, Speex detects
non-speech periods and encode them with just enough bits to reproduce the background
noise. This is called “comfort noise generation” (CNG).

Discontinuous Transmission (DTX)


Discontinuous transmission is an addition to VAD/VBR operation, that allows to stop
transmitting completely when the background noise is stationary. In file-based opera-
tion, since we cannot just stop writing to the file, only 5 bits are used for such frames
(corresponding to 250 bps).

Perceptual enhancement
Perceptual enhancement is a part of the decoder which, when turned on, tries to reduce
(the perception of) the noise produced by the coding/decoding process. In most cases,
perceptual enhancement make the sound further from the original objectively (if you
use SNR), but in the end it still sounds better (subjective improvement).
2 CODEC DESCRIPTION 10

Algorithmic delay
Every speech codec introduces a delay in the transmission. For Speex, this delay is
equal to the frame size, plus some amount of “look-ahead” required to process each
frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16
kHz), the delay is 34 ms. These values don’t account for the CPU time it takes to
encode or decode the frames.

2.2 Codec

2.3 Preprocessor
This part refers to the preprocessor module introduced in the 1.1.x branch. The prepro-
cessor is designed to be used on the audio before running the encoder. The preprocessor
provides three main functionalities:

• denoising

• automatic gain control (AGC)

• voice activity detection (VAD)

The denoiser can be used to reduce the amount of background noise present in the
input signal. This provides higher quality speech whether or not the denoised signal
is encoded with Speex (or at all). However, when using the denoised signal with the
codec, there is an additional benefit. Speech codecs in general (Speex included) tend to
perform poorly on noisy input, which tends to amplify the noise. The denoiser greatly
reduces this effect.
Automatic gain control (AGC) is a feature that deals with the fact that the record-
ing volume may vary by a large amount between different setups. The AGC provides a
way to adjust a signal to a reference volume. This is useful for voice over IP because it
removes the need for manual adjustment of the microphone gain. A secondary advan-
tage is that by setting the microphone gain to a conservative (low) level, it is easier to
avoid clipping.
The voice activity detector (VAD) provided by the preprocessor is more advanced
than the one directly provided in the codec.

2.4 Adaptive Jitter Buffer

2.5 Acoustic Echo Canceller


3 COMPILING 11

3 Compiling
Compiling Speex under UNIX or any platform supported by autoconf (e.g. Win32/cygwin)
is as easy as typing:

% ./configure [options]
% make
% make install

The options supported by the Speex configure script are:

–prefix=<path> Specifies where to install Speex

–enable-shared/–disable-shared Whether to compile shared libraries

–enable-static/–disable-static Whether to compile static libraries

–disable-wideband Disable the wideband part of Speex (typically to same space)

–enable-valgrind Enable extra information when (and only when) running with val-
grind

–enable-sse Enable use of SSE instructions (x86/float only)

–enable-fixed-point Compile Speex for a processor that does not have a floating point
unit (FPU)

–enable-arm4-asm Enable assembly specific to the ARMv4 architecture (gcc only)

–enable-arm5e-asm Enable assembly specific to the ARMv5E architecture (gcc only)

–enable-fixed-point-debug Use only for debugging the fixed-point code (very slow)

–enable-epic-48k Enable a special (and non-compatible) 4.8 kbps narrowband mode

–enable-ti-c55x Enable support for the TI C5x family

–enable-blackfin-asm Enable assembly specific to the Blackfin DSP architecture (gcc


only)

–enable-16bit-precision Reduces precision to 16 bits in time-critical areas (fixed-


point only)
4 COMMAND-LINE ENCODER/DECODER 12

4 Command-line encoder/decoder
The base Speex distribution includes a command-line encoder (speexenc) and decoder
(speexdec). This section describes how to use these tools.

4.1 speexenc
The speexenc utility is used to create Speex files from raw PCM or wave files. It can
be used by calling:

speexenc [options] input_file output_file

The value ’-’ for input_file or output_file corresponds respectively to stdin and stdout.
The valid options are:

–narrowband (-n) Tell Speex to treat the input as narrowband (8 kHz). This is the
default

–wideband (-w) Tell Speex to treat the input as wideband (16 kHz)

–ultra-wideband (-u) Tell Speex to treat the input as “ultra-wideband” (32 kHz)

–quality n Set the encoding quality (0-10), default is 8

–bitrate n Encoding bit-rate (use bit-rate n or lower)

–vbr Enable VBR (Variable Bit-Rate), disabled by default

–abr n Enable ABR (Average Bit-Rate) at n kbps, disabled by default

–vad Enable VAD (Voice Activity Detection), disabled by default

–dtx Enable DTX (Discontinuous Transmission), disabled by default

–nframes n Pack n frames in each Ogg packet (this saves space at low bit-rates)

–comp n Set encoding speed/quality tradeoff. The higher the value of n, the slower
the encoding (default is 3)

-V Verbose operation, print bit-rate currently in use

–help (-h) Print the help

–version (-v) Print version information


4 COMMAND-LINE ENCODER/DECODER 13

Speex comments

–comment Add the given string as an extra comment. This may be used multiple
times.

–author Author of this track.

–title Title for this track.

Raw input options

–rate n Sampling rate for raw input

–stereo Consider raw input as stereo

–le Raw input is little-endian

–be Raw input is big-endian

–8bit Raw input is 8-bit unsigned

–16bit Raw input is 16-bit signed

4.2 speexdec
The speexdec utility is used to decode Speex files and can be used by calling:

speexdec [options] speex_file [output_file]

The value ’-’ for input_file or output_file corresponds respectively to stdin and stdout.
Also, when no output_file is specified, the file is played to the soundcard. The valid
options are:

–enh enable post-filter (default)

–no-enh disable post-filter

–force-nb Force decoding in narrowband

–force-wb Force decoding in wideband

–force-uwb Force decoding in ultra-wideband

–mono Force decoding in mono

–stereo Force decoding in stereo


4 COMMAND-LINE ENCODER/DECODER 14

–rate n Force decoding at n Hz sampling rate

–packet-loss n Simulate n % random packet loss

-V Verbose operation, print bit-rate currently in use

–help (-h) Print the help

–version (-v) Print version information


5 PROGRAMMING WITH SPEEX (THE LIBSPEEX API) 15

5 Programming with Speex (the libspeex API)


This section explains how to use the Speex API. Examples of code can also be found
in appendix B.

5.1 Encoding
In order to encode speech using Speex, you first need to:

#include <speex/speex.h>

You then need to declare a Speex bit-packing struct

SpeexBits bits;

and a Speex encoder state

void *enc_state;

The two are initialized by:

speex_bits_init(&bits);
enc_state = speex_encoder_init(&speex_nb_mode);

For wideband coding, speex_nb_mode will be replaced by speex_wb_mode. In most


cases, you will need to know the frame size used by the mode you are using. You can
get that value in the frame_size variable with:

speex_encoder_ctl(enc_state,SPEEX_GET_FRAME_SIZE,&frame_size);

In practice, frame_size will correspond to 20 ms when using 8, 16, or 32 kHz sampling


rate.
Once the initialization is done, for every input frame:

speex_bits_reset(&bits);
speex_encode_int(enc_state, input_frame, &bits);
nbBytes = speex_bits_write(&bits, byte_ptr, MAX_NB_BYTES);

where input_frame is a (short *) pointing to the beginning of a speech frame, byte_ptr


is a (char *) where the encoded frame will be written, MAX_NB_BYTES is the maxi-
mum number of bytes that can be written to byte_ptr without causing an overflow and
5 PROGRAMMING WITH SPEEX (THE LIBSPEEX API) 16

nbBytes is the number of bytes actually written to byte_ptr (the encoded size in bytes).
Before calling speex_bits_write, it is possible to find the number of bytes that need to
be written by calling speex_bits_nbytes(&bits), which returns a number of bytes.
It is still possible to use the speex_encode() function, which takes a (float *) for the
audio. However, this would make an eventual port to an FPU-less platform (like ARM)
more complicated. Internally, speex_encode() and speex_encode_int() are processed in
the same way. Whether the encoder uses the fixed-point version is only decided by the
compile-time flags, not at the API level.
After you’re done with the encoding, free all resources with:

speex_bits_destroy(&bits);
speex_encoder_destroy(enc_state);

That’s about it for the encoder.

5.2 Decoding
In order to decode speech using Speex, you first need to:

#include <speex/speex.h>

You also need to declare a Speex bit-packing struct

SpeexBits bits;

and a Speex decoder state

void *dec_state;

The two are initialized by:

speex_bits_init(&bits);
dec_state = speex_decoder_init(&speex_nb_mode);

For wideband decoding, speex_nb_mode will be replaced by speex_wb_mode. If you


need to obtain the size of the frames that will be used by the decoder, you can get that
value in the frame_size variable with:

speex_decoder_ctl(dec_state, SPEEX_GET_FRAME_SIZE, &frame_size);


5 PROGRAMMING WITH SPEEX (THE LIBSPEEX API) 17

There is also a parameter that can be set for the decoder: whether or not to use a
perceptual enhancer. This can be set by:

speex_decoder_ctl(dec_state, SPEEX_SET_ENH, &enh);

where enh is an int with value 0 to have the enhancer disabled and 1 to have it enabled.
As of 1.2-beta1, the default is now to enable the enhancer.
Again, once the decoder initialization is done, for every input frame:

speex_bits_read_from(&bits, input_bytes, nbBytes);


speex_decode_int(dec_state, &bits, output_frame);

where input_bytes is a (char *) containing the bit-stream data received for a frame,
nbBytes is the size (in bytes) of that bit-stream, and output_frame is a (short *) and
points to the area where the decoded speech frame will be written. A NULL value as
the first argument indicates that we don’t have the bits for the current frame. When a
frame is lost, the Speex decoder will do its best to "guess" the correct signal.
As for the encoder, the speex_decode() function can still be used, with a (float *) as
the output for the audio.
After you’re done with the decoding, free all resources with:

speex_bits_destroy(&bits);
speex_decoder_destroy(dec_state);

5.3 Preprocessor
In order to use the Speex preprocessor, you first need to:

#include <speex/speex_preprocess.h>

Then, a preprocessor state can be created as:

SpeexPreprocessState *preprocess_state = speex_preprocess_state_init(frame_size, sampling_

It is recommended to use the same value for frame_size as is used by the encoder (20
ms).
For each input frame, you need to call:

speex_preprocess(preprocess_state, audio_frame, echo_residue);


5 PROGRAMMING WITH SPEEX (THE LIBSPEEX API) 18

where audio_frame is used both as input and output and echo_residue is either an
array filled by the echo canceller, or NULL if the preprocessor is used without the echo
canceller.
In cases where the output audio is not useful for a certain frame, it is possible to
use instead:

speex_preprocess_estimate_update(preprocess_state, audio_frame, echo_residue);

This call will update all the preprocessor internal state variables without computing the
output audio, thus saving some CPU cycles.
The behaviour of the preprocessor can be changed using:

speex_preprocess_ctl(preprocess_state, request, ptr);

which is used in the same way as the encoder and decoder equivalent. Options are
listed in Section .
The preprocessor state can be destroyed using:

speex_preprocess_state_destroy(preprocess_state);

5.4 Echo Cancellation


The Speex library now includes an echo cancellation algorithm suitable for Acoustic
Echo Cancellation (AEC). In order to use the echo canceller, you first need to

#include <speex/speex_echo.h>

Then, an echo canceller state can be created by:

SpeexEchoState *echo_state = speex_echo_state_init(frame_size, filter_length);

where frame_size is the amount of data (in samples) you want to process at once and
filter_length is the length (in samples) of the echo cancelling filter you want to use
(also known as tail length). It is recommended to use a frame size in the order of 20
ms (or equal to the codec frame size) and make sure it is easy to perform an FFT of
that size (powers of two are better than prime sizes). The recommended tail length is
approximately the third of the room reverberation time. For example, in a small room,
reverberation time is in the order of 300 ms, so a tail length of 100 ms is a good choice
(800 samples at 8000 Hz sampling rate).
Once the echo canceller state is created, audio can be processed by:
5 PROGRAMMING WITH SPEEX (THE LIBSPEEX API) 19

speex_echo_cancel(echo_state, input_frame, echo_frame, output_frame, residue);

where input_frame is the audio as captured by the microphone, echo_frame is the


signal that was played in the speaker (and needs to be removed) and output_frame
is the signal with echo removed. The residue parameter is optional (you can set it to
NULL) and is used to return the estimated power spectrum of the echo residue so it can
be removed by the preprocessor (if you with to use it).
One important thing to keep in mind is the relationship between input_frame and
echo_frame. It is important that, at any time, any echo that is present in the input
has already been sent to the echo canceller as echo_frame. In other words, the echo
canceller cannot remove a signal that it hasn’t yet received. On the other hand, the delay
between the input signal and the echo signal must be small enough because otherwise
part of the echo cancellation filter is inefficient. In the ideal case, you code would look
like:

write_to_soundcard(echo_frame, frame_size);
read_from_soundcard(input_frame, frame_size);
speex_echo_cancel(echo_state, input_frame, echo_frame, output_frame, residue);

As stated above, if you wish to further reduce the echo present in the signal, you can
do so by passing residue as the last parameter of speex_preprocess() function (see
Section 5.3).
As of version 1.2-beta1, there is an alternative, simpler API that can be used in-
stead of speex_echo_cancel(). When audio capture and playback are handled asyn-
chronously (e.g. in different threads or using the poll() or select() system call), it can
be difficult to keep track of what input_frame comes with what echo_frame. Instead,
the playback comtext/thread can simply call:

speex_echo_playback(echo_state, echo_frame);

every time an audio frame is played. Then, the capture context/thread calls:

speex_echo_capture(echo_state, input_frame, output_frame, residue);

for every frame captured. Internally, speex_echo_playback() simply buffers the play-
back frame so it can be used by speex_echo_capture() to call speex_echo_cancel().
When capture and playback are done synchronously, speex_echo_cancel() is still pref-
ered since it gives better control on the exact input/echo timing.
The echo cancellation state can be destroyed with:
5 PROGRAMMING WITH SPEEX (THE LIBSPEEX API) 20

speex_echo_state_destroy(echo_state);

It is also possible to reset the state of the echo canceller so it can be reused without the
need to create another state with:

speex_echo_state_reset(echo_state);

5.4.1 Troubleshooting

There are several things that may prevent the echo canceller from working properly.
One of them is a bug (or something suboptimal) in the code, but there are many others
you should consider first

• Using a different soundcard to do the capture and plaback will *not* work, re-
gardless of what you may think. The only exception to that is if the two cards
can be made to have their sampling clock “locked” on the same clock source.

• The delay between the record and playback signals must be minimal. Any signal
played has to “appear” on the playback (far end) signal slightly before the echo
canceller “sees” it in the near end signal, but excessive delay means that part
of the filter length is wasted. In the worst situations, the delay is such that it is
longer than the filter length, in which case, no echo can be cancelled.

• When it comes to echo tail length (filter length), longer is *not* better. Actually,
the longer the tail length, the longer it takes for the filter to adapt. Of course, a
tail length that is too short will not cancel enough echo, but the most common
problem seen is that people set a very long tail length and then wonder why no
echo is being cancelled.

• Non-linear distortion cannot (by definition) be modeled by the linear adaptive


filter used in the echo canceller and thus cannot be cancelled. Use good audio
gear and avoid saturation/clipping.

Also useful is reading Echo Cancellation Demystified by Alexey Frunze1 , which ex-
plains the fundamental principles of echo cancellation. The details of the algorithm
described in the article are different, but the general ideas of echo cancellation through
adaptive filters are the same.
1 http://www.embeddedstar.com/articles/2003/7/article20030720-1.html
5 PROGRAMMING WITH SPEEX (THE LIBSPEEX API) 21

5.5 Codec Options (speex_*_ctl)


The Speex encoder and decoder support many options and requests that can be accessed
through the speex_encoder_ctl and speex_decoder_ctl functions. These functions are
similar to the ioctl system call and their prototypes are:

void speex_encoder_ctl(void *encoder, int request, void *ptr);


void speex_decoder_ctl(void *encoder, int request, void *ptr);

The different values of request allowed are (note that some only apply to the encoder
or the decoder):

SPEEX_SET_ENH** Set perceptual enhancer to on (1) or off (0) (integer)

SPEEX_GET_ENH** Get perceptual enhancer status (integer)

SPEEX_GET_FRAME_SIZE Get the frame size used for the current mode (integer)

SPEEX_SET_QUALITY* Set the encoder speech quality (integer 0 to 10)

SPEEX_GET_QUALITY* Get the current encoder speech quality (integer 0 to 10)

SPEEX_SET_MODE*†

SPEEX_GET_MODE*†

SPEEX_SET_LOW_MODE*†

SPEEX_GET_LOW_MODE*†

SPEEX_SET_HIGH_MODE*†

SPEEX_GET_HIGH_MODE*†

SPEEX_SET_VBR* Set variable bit-rate (VBR) to on (1) or off (0) (integer)

SPEEX_GET_VBR* Get variable bit-rate (VBR) status (integer)

SPEEX_SET_VBR_QUALITY* Set the encoder VBR speech quality (float 0 to 10)

SPEEX_GET_VBR_QUALITY* Get the current encoder VBR speech quality (float


0 to 10)

SPEEX_SET_COMPLEXITY* Set the CPU resources allowed for the encoder (in-
teger 1 to 10)
5 PROGRAMMING WITH SPEEX (THE LIBSPEEX API) 22

SPEEX_GET_COMPLEXITY* Get the CPU resources allowed for the encoder (in-
teger 1 to 10)

SPEEX_SET_BITRATE* Set the bit-rate to use to the closest value not exceeding
the parameter (integer in bps)

SPEEX_GET_BITRATE Get the current bit-rate in use (integer in bps)

SPEEX_SET_SAMPLING_RATE Set real sampling rate (integer in Hz)

SPEEX_GET_SAMPLING_RATE Get real sampling rate (integer in Hz)

SPEEX_RESET_STATE Reset the encoder/decoder state to its original state (zeros


all memories)

SPEEX_SET_VAD* Set voice activity detection (VAD) to on (1) or off (0) (integer)

SPEEX_GET_VAD* Get voice activity detection (VAD) status (integer)

SPEEX_SET_DTX* Set discontinuous transmission (DTX) to on (1) or off (0) (inte-


ger)

SPEEX_GET_DTX* Get discontinuous transmission (DTX) status (integer)

SPEEX_SET_ABR* Set average bit-rate (ABR) to a value n in bits per second (inte-
ger in bps)

SPEEX_GET_ABR* Get average bit-rate (ABR) setting (integer in bps)

SPEEX_SET_PLC_TUNING* Tell the encoder to optimize encoding for a certain


percentage of packet loss (integer in percent)

SPEEX_GET_PLC_TUNING* Get the current tuning of the encoder for PLC (inte-
ger in percent)

* applies only to the encoder

** applies only to the decoder

† normally only used internally


5 PROGRAMMING WITH SPEEX (THE LIBSPEEX API) 23

5.6 Mode queries


Speex modes have a query system similar to the speex_encoder_ctl and speex_decoder_ctl
calls. Since modes are read-only, it is only possible to get information about a particular
mode. The function used to do that is:

void speex_mode_query(SpeexMode *mode, int request, void *ptr);

The admissible values for request are (unless otherwise note, the values are returned
through ptr):

SPEEX_MODE_FRAME_SIZE Get the frame size (in samples) for the mode

SPEEX_SUBMODE_BITRATE Get the bit-rate for a submode number specified


through ptr (integer in bps).

5.7 Preprocessor options


SPEEX_PREPROCESS_SET_DENOISE Turns denoising on(1) or off(2) (integer)

SPEEX_PREPROCESS_GET_DENOISE Get denoising status (integer)

SPEEX_PREPROCESS_SET_AGC Turns automatic gain control (AGC) on(1) or


off(2) (integer)

SPEEX_PREPROCESS_GET_AGC Get AGC status (integer)

SPEEX_PREPROCESS_SET_VAD Turns voice activity detector (VAD) on(1) or


off(2) (integer)

SPEEX_PREPROCESS_GET_VAD Get VAD status (integer)

SPEEX_PREPROCESS_SET_AGC_LEVEL

SPEEX_PREPROCESS_GET_AGC_LEVEL

SPEEX_PREPROCESS_SET_DEREVERB Turns reverberation removal on(1) or


off(2) (integer)

SPEEX_PREPROCESS_GET_DEREVERB Get reverberation removal status (in-


teger)

SPEEX_PREPROCESS_SET_DEREVERB_LEVEL
5 PROGRAMMING WITH SPEEX (THE LIBSPEEX API) 24

SPEEX_PREPROCESS_GET_DEREVERB_LEVEL

SPEEX_PREPROCESS_SET_DEREVERB_DECAY

SPEEX_PREPROCESS_GET_DEREVERB_DECAY

5.8 Packing and in-band signalling


Sometimes it is desirable to pack more than one frame per packet (or other basic unit
of storage). The proper way to do it is to call speex_encode N times before writing the
stream with speex_bits_write. In cases where the number of frames is not determined
by an out-of-band mechanism, it is possible to include a terminator code. That termi-
nator consists of the code 15 (decimal) encoded with 5 bits, as shown in Table 4. Note
that as of version 1.0.2, calling speex_bits_write automatically inserts the terminator
so as to fill the last byte. This doesn’t involves any overhead and makes sure Speex can
always detect when there is no more frame in a packet.
It is also possible to send in-band “messages” to the other side. All these messages
are encoded as “pseudo-frames” of mode 14 which contain a 4-bit message type code,
followed by the message. Table 1 lists the available codes, their meaning and the size
of the message that follows. Most of these messages are requests that are sent to the
encoder or decoder on the other end, which is free to comply or ignore them. By
default, all in-band messages are ignored.
Finally, applications may define custom in-band messages using mode 13. The size
of the message in bytes is encoded with 5 bits, so that the decoder can skip it if it
doesn’t know how to interpret it.
6 FORMATS AND STANDARDS 25

Code Size (bits) Content


0 1 Asks decoder to set perceptual enhancement off (0) or on(1)
1 1 Asks (if 1) the encoder to be less “agressive” due to high packet loss
2 4 Asks encoder to switch to mode N
3 4 Asks encoder to switch to mode N for low-band
4 4 Asks encoder to switch to mode N for high-band
5 4 Asks encoder to switch to quality N for VBR
6 4 Request acknowloedge (0=no, 1=all, 2=only for in-band data)
7 4 Asks encoder to set CBR (0), VAD(1), DTX(3), VBR(5), VBR+DTX(7)
8 8 Transmit (8-bit) character to the other end
9 8 Intensity stereo information
10 16 Announce maximum bit-rate acceptable (N in bytes/second)
11 16 reserved
12 32 Acknowledge receiving packet N
13 32 reserved
14 64 reserved
15 64 reserved

Table 1: In-band signalling codes

6 Formats and standards


Speex can encode speech in both narrowband and wideband and provides different bit-
rates. However, not all features need to be supported by a certain implementation or
device. In order to be called “Speex compatible” (whatever that means), an implemen-
tation must implement at least a basic set of features.
At the minimum, all narrowband modes of operation MUST be supported at the
decoder. This includes the decoding of a wideband bit-stream by the narrowband de-
coder2. If present, a wideband decoder MUST be able to decode a narrowband stream,
and MAY either be able to decode all wideband modes or be able to decode the em-
bedded narrowband part of all modes (which includes ignoring the high-band bits).
For encoders, at least one narrowband or wideband mode MUST be supported. The
main reason why all encoding modes do not have to be supported is that some platforms
may not be able to handle the complexity of encoding in some modes.

6.1 RTP Payload Format


The RTP payload draft is included in appendix C and the latest version is available at
http://www.speex.org/drafts/latest. This draft has been sent (2003/02/26) to
2 The wideband bit-stream contains an embedded narrowband bit-stream which can be decoded alone
6 FORMATS AND STANDARDS 26

the Internet Engineering Task Force (IETF) and will be discussed at the March 18th
meeting in San Francisco.

6.2 MIME Type


For now, you should use the MIME type audio/x-speex for Speex-in-Ogg. We will
apply for type audio/speex in the near future.

6.3 Ogg file format


Speex bit-streams can be stored in Ogg files. In this case, the first packet of the Ogg
file contains the Speex header described in table 2. All integer fields in the headers
are stored as little-endian. The speex_string field must contain the “Speex ” (with
3 trailing spaces), which identifies the bit-stream. The next field, speex_version
contains the version of Speex that encoded the file. For now, refer to speex_header.[ch]
for more info. The beginning of stream (b_o_s) flag is set to 1 for the header. The
header packet has packetno=0 and granulepos=0.
The second packet contains the Speex comment header. The format used is the Vor-
bis comment format described here: http://www.xiph.org/ogg/vorbis/doc/v-comment.html
. This packet has packetno=1 and granulepos=0.
The third and subsequent packets each contain one or more (number found in
header) Speex frames. These are identified with packetno starting from 2 and the
granulepos is the number of the last sample encoded in that packet. The last of these
packets has the end of stream (e_o_s) flag is set to 1.
6 FORMATS AND STANDARDS 27

Field Type Size


speex_string char[] 8
speex_version char[] 20
speex_version_id int 4
header_size int 4
rate int 4
mode int 4
mode_bitstream_version int 4
nb_channels int 4
bitrate int 4
frame_size int 4
vbr int 4
frames_per_packet int 4
extra_headers int 4
reserved1 int 4
reserved2 int 4

Table 2: Ogg/Speex header packet


7 INTRODUCTION TO CELP CODING 28

7 Introduction to CELP Coding


Speex is based on CELP, which stands for Code Excited Linear Prediction. This section
attempts to introduce the principles behind CELP, so if you are already familiar with
CELP, you can safely skip to section 8. The CELP technique is based on three ideas:

1. The use of a linear prediction (LP) model to model the vocal tract

2. The use of (adaptive and fixed) codebook entries as input (excitation) of the LP
model

3. The search performed in closed-loop in a “perceptually weighted domain”

This section describes the basic ideas behind CELP. This is still a work in progress.

7.1 Source-Filter Model of Speech Prediction


The source-filter model of speech production assumes that the vocal cords are the
source of spectrally flat sound (the excitation signal), and that the vocal tract acts as a
filter to spectrally shape the various sounds of speech. While still an approximation,
the model is widely used in speech coding because of its simplicity.Its use is also the
reason why most speech codecs (Speex included) perform badly on music signals. The
different phonemes can be distinguished by their excitation (source) and spectral shape
(filter). Voiced sounds (e.g. vowels) have an excitation signal that is periodic and that
can be approximated by an impulse train in the time domain or by regularly-spaced
harmonics in the frequency domain. On the other hand, fricatives (such as the "s",
"sh" and "f" sounds) have an excitation signal that is similar to white Gaussian noise.
So called voice fricatives (such as "z" and "v") have excitation signal composed of an
harmonic part and a noisy part.
The source-filter model is usually tied with the use of Linear prediction. The CELP
model is based on source-filter model, as can be seen from the CELP decoder illustrated
in Figure 1.

7.2 Linear Prediction (LPC)


Linear prediction is at the base of many speech coding techniques, including CELP.
The idea behind it is to predict the signal x[n] using a linear combination of its past
samples:
7 INTRODUCTION TO CELP CODING 29

Figure 1: The CELP model of speech synthesis (decoder)

N
y[n] = ∑ ai x[n − i]
i=1

where y[n] is the linear prediction of x[n]. The prediction error is thus given by:

N
e[n] = x[n] − y[n] = x[n] − ∑ ai x[n − i]
i=1

The goal of the LPC analysis is to find the best prediction coefficients ai which
minimize the quadratic error function:
" #2
L−1 L−1 N
E= ∑ [e[n]] 2
= ∑ x[n] − ∑ ai x[n − i]
n=0 n=0 i=1

∂E
That can be done by making all derivatives ∂ai equal to zero:

" #2
∂E ∂ L−1 N

∂ai
= ∑ x[n] − ∑ ai x[n − i] = 0
∂ai n=0 i=1

For an order N filter, the filter coefficients ai are found by solving the system N × N
7 INTRODUCTION TO CELP CODING 30

linear system Ra = r, where


 
R(0) R(1) ··· R(N − 1)
R(1) R(0) ··· R(N − 2)
 
 
R= .. .. .. .. 
.
 
 . . . 
R(N − 1) R(N − 2) · · · R(0)

 
R(1)
R(2)
 
 
r=
 .. 

 . 
R(N)

with R(m), the auto-correlation of the signal x[n], computed as:

N−1
R(m) = ∑ x[i]x[i − m]
i=0

Because R is toeplitz hermitian, the Levinson-Durbin algorithm can be used, mak-


ing the solution to the problem O N 2 instead of O N 3 . Also, it can be proven that
 

all the roots of A(z) are within the unit circle, which means that 1/A(z) is always stable.
This is in theory; in practice because of finite precision, there are two commonly used
techniques to make sure we have a stable filter. First, we multiply R(0) by a number
slightly above one (such as 1.0001), which is equivalent to adding noise to the signal.
Also, we can apply a window to the auto-correlation, which is equivalent to filtering in
the frequency domain, reducing sharp resonances.

7.3 Pitch Prediction


During voiced segments, the speech signal is periodic, so it is possible to take advantage
of that property by approximating the excitation signal e[n] by a gain times the past of
the excitation:

e[n] ≃ p[n] = βe[n − T]

where T is the pitch period, β is the pitch gain. We call that long-term prediction
since the excitation is predicted from e[n − T ] with T ≫ N.
7 INTRODUCTION TO CELP CODING 31

7.4 Innovation Codebook


The final excitation e[n] will be the sum of the pitch prediction and an innovation signal
c[n] taken from a fixed codebook, hence the name Code Excited Linear Prediction. The
final excitation is given by:

e[n] = p[n] + c[n] = βe[n − T] + c[n]

The quantization of c[n] is where most of the bits in a CELP codec are allocated. It
represents the information that couldn’t be obtained either from linear prediction or
pitch prediction. In the z-domain we can represent the final signal X(z) as

C(z)
X(z) =
A(z) (1 − βz−T )

7.5 Noise Weighting


Most (if not all) modern audio codecs attempt to “shape” the noise so that it appears
mostly in the frequency regions where the ear cannot detect it. For example, the ear is
more tolerant to noise in parts of the spectrum that are louder and vice versa. In order to
maximize speech quality, CELP codecs minimize the mean square of the error (noise)
in the perceptually weighted domain. This means that a perceptual noise weighting
filter W (z) is applied to the error signal in the encoder. In most CELP codecs, W (z)
is a pole-zero weighting filter derived from the linear prediction coefficients (LPC),
generally using bandwidth expansion. Let the spectral envelope be represented by the
synthesis filter 1/A(z), CELP codecs typically derive the noise weighting filter as:

A(z/γ1 )
W (z) = (1)
A(z/γ2 )

where γ1 = 0.9 and γ2 = 0.6 in the Speex reference implementation. If a filter A(z) has
(complex) poles at pi in the z-plane, the filter A(z/γ) will have its poles at p′i = γpi ,
making it a flatter version of A(z).
The weighting filter is applied to the error signal used to optimize the codebook
search through analysis-by-synthesis (AbS). This results in a spectral shape of the noise
that tends towards 1/W (z). While the simplicity of the model has been an important
reason for the success of CELP, it remains that W (z) is a very rough approximation for
the perceptually optimal noise weighting function. Fig. 2 illustrates the noise shaping
that results from Eq. 1. Throughout this paper, we refer to W (z) as the noise weighting
filter and to 1/W (z) as the noise shaping filter (or curve).
7 INTRODUCTION TO CELP CODING 32

30
Speech signal
LPC synthesis filter
Reference shaping
20

10
Response (dB)

-10

-20

-30

-40
0 500 1000 1500 2000 2500 3000 3500 4000
Frequency (Hz)

Figure 2: Standard noise shaping in CELP. Arbitrary y-axis offset.

7.6 Analysis-by-Synthesis
One of the main principles behind CELP is called Analysis-by-Synthesis (AbS), mean-
ing that the encoding (analysis) is performed by perceptually optimising the decoded
(synthesis) signal in a closed loop. In theory, the best CELP stream would be pro-
duced by trying all possible bit combinations and selecting the one that produces the
best-sounding decoded signal. This is obviously not possible in practice for two rea-
sons: the required complexity is beyond any currently available hardware and the “best
sounding” selection criterion implies a human listener.
In order to achieve real-time encoding using limited computing resources, the CELP
optimisation is broken down into smaller, more manageable, sequential searches using
the perceptual weighting function described earlier.
8 SPEEX NARROWBAND MODE 33

8 Speex narrowband mode


This section looks at how Speex works for narrowband (8 kHz sampling rate) operation.
The frame size for this mode is 20 ms, corresponding to 160 samples. Each frame is
also subdivided into 4 sub-frames of 40 samples each.
Also many design decisions were based on the original goals and assumptions:

• Minimizing the amount of information extracted from past frames (for robust-
ness to packet loss)

• Dynamically-selectable codebooks (LSP, pitch and innovation)

• sub-vector fixed (innovation) codebooks

8.1 Whole-Frame Analysis


In narrowband, Speex frames are 20 ms long (160 samples) and are subdivided in 4 sub-
frames of 5 ms each (40 samples). For most narrowband bit-rates (8 kbps and above),
the only parameters encoded at the frame level are the Line Spectral Pairs (LSP) and a
global excitation gain g f rame , as shown in Fig. 3. All other parameters are encoded at
the sub-frame level.
Linear prediction analysis is performed once per frame using an asymmetric Ham-
ming window centered on the fourth sub-frame. Because linear prediction coefficients
(LPC) are not robust to quantization, they are first are converted to line spectral pairs
(LSP). The LSP’s are considered to be associated to the 4th sub-frames and the LSP’s
associated to the first 3 sub-frames are linearly interpolated using the current and pre-
vious LSP coefficients. The LSP coefficients and converted back to the LPC filter Â(z).
The non-quantized interpolated filter is denoted A(z) and can be used for the weighting
filter W (z) because it does not need to be available to the decoder.
To make Speex more robust to packet loss, no prediction is applied on the LSP
coefficients prior to quantization. The LSPs are encoded using vector quantizatin (VQ)
with 30 bits for higher quality modes and 18 bits for lower quality.

8.2 Sub-Frame Analysis-by-Synthesis


The analysis-by-synthesis (AbS) encoder loop is described in Fig. 4. There are three
main aspects where Speex significantly differs from most other CELP codecs. First,
while most recent CELP codecs make use of fractional pitch estimation with a single
8 SPEEX NARROWBAND MODE 34

Figure 3: Frame open-loop analysis

Figure 4: Analysis-by-synthesis closed-loop optimization on a sub-frame.


8 SPEEX NARROWBAND MODE 35

gain, Speex uses an integer to encode the pitch period, but uses a 3-tap predictor (3
gains). The adaptive codebook contribution ea [n] can thus be expressed as:

ea [n] = g0 e[n − T − 1] + g1e[n − T ] + g2e[n − T + 1] (2)

where g0 , g1 and g2 are the jointly quantized pitch gains and e[n] is the codec excitation
memory. It is worth noting that when the pitch is smaller than the sub-frame size, we
repeat the excitation at a period T . For example, when n − T + 1 ≥ 0, we use n − 2T + 1
instead. In most modes, the pitch period is encoded with 7 bits in the [17, 144] range
and the βi coefficients are vector-quantized using 7 bits at higher bit-rates (15 kbps
narrowband and above) and 5 bits at lower bit-rates (11 kbps narrowband and below).
Many current CELP codecs use moving average (MA) prediction to encode the
fixed codebook gain. This provides slightly better coding at the expense of introducing
a dependency on previously encoded frames. A second difference is that Speex encodes
the fixed codebook gain as the product of the global excitation gain g f rame with a sub-
frame gain corrections gsub f . This increases robustness to packet loss by eliminating
the inter-frame dependency. The sub-frame gain correction is encoded before the fixed
codebook is searched (not closed-loop optimized) and uses between 0 and 3 bits per
sub-frame, depending on the bit-rate.
The third difference is that Speex uses sub-vector quantization of the innovation
(fixed codebook) signal instead of an algebraic codebook. Each sub-frame is divided
into sub-vectors of lengths ranging between 5 and 20 samples. Each sub-vector is
chosen from a bitrate-dependent codebook and all sub-vectors are concatenated to form
a sub-frame. As an example, the 3.95 kbps mode uses a sub-vector size of 20 samples
with 32 entries in the codebook (5 bits). This means that the innovation is encoded
with 10 bits per sub-frame, or 2000 bps. On the other hand, the 18.2 kbps mode uses a
sub-vector size of 5 samples with 256 entries in the codebook (8 bits), so the innovation
uses 64 bits per sub-frame, or 12800 bps.

8.3 Bit allocation


There are 7 different narrowband bit-rates defined for Speex, ranging from 250 bps to
24.6 kbps, although the modes below 5.9 kbps should not be used for speech. The
bit-allocation for each mode is detailed in table 3. Each frame starts with the mode ID
encoded with 4 bits which allows a range from 0 to 15, though only the first 7 values are
used (the others are reserved). The parameters are listed in the table in the order they
are packed in the bit-stream. All frame-based parameters are packed before sub-frame
8 SPEEX NARROWBAND MODE 36

parameters. The parameters for a certain sub-frame are all packed before the following
sub-frame is packed. Note that the “OL” in the parameter description means that the
parameter is an open loop estimation based on the whole frame.

Parameter Update rate 0 1 2 3 4 5 6 7 8


Wideband bit frame 1 1 1 1 1 1 1 1 1
Mode ID frame 4 4 4 4 4 4 4 4 4
LSP frame 0 18 18 18 18 30 30 30 18
OL pitch frame 0 7 7 0 0 0 0 0 7
OL pitch gain frame 0 4 0 0 0 0 0 0 4
OL Exc gain frame 0 5 5 5 5 5 5 5 5
Fine pitch sub-frame 0 0 0 7 7 7 7 7 0
Pitch gain sub-frame 0 0 5 5 5 7 7 7 0
Innovation gain sub-frame 0 1 0 1 1 3 3 3 0
Innovation VQ sub-frame 0 0 16 20 35 48 64 96 10
Total frame 5 43 119 160 220 300 364 492 79

Table 3: Bit allocation for narrowband modes

So far, no MOS (Mean Opinion Score) subjective evaluation has been performed
for Speex. In order to give an idea of the quality achivable with it, table 4 presents my
own subjective opinion on it. It sould be noted that different people will perceive the
quality differently and that the person that designed the codec often has a bias (one way
or another) when it comes to subjective evaluation. Last thing, it should be noted that
for most codecs (including Speex) encoding quality sometimes varies depending on
the input. Note that the complexity is only approximate (within 0.5 mflops and using
the lowest complexity setting). Decoding requires approximately 0.5 mflops in most
modes (1 mflops with perceptual enhancement).

8.4 Perceptual enhancement


This section was only valid for version 1.1.12 and earlier. It does not apply to
version 1.2-beta1 (and later), for which the new perceptual enhancement is not
yet documented.
This part of the codec only applies to the decoder and can even be changed without
affecting inter-operability. For that reason, the implementation provided and described
here should only be considered as a reference implementation. The enhancement sys-
tem is divided into two parts. First, the synthesis filter S(z) = 1/A(z) is replaced by an
enhanced filter:
A (z/a2 ) A (z/a3 )
S′ (z) =
A (z) A (z/a1 )
8 SPEEX NARROWBAND MODE 37

Mode Bit-rate (bps) mflops Quality/description


0 250 N/A No transmission (DTX)
1 2,150 6 Vocoder (mostly for comfort noise)
2 5,950 9 Very noticeable artifacts/noise, good intelligibility
3 8,000 10 Artifacts/noise sometimes noticeable
4 11,000 14 Artifacts usually noticeable only with headphones
5 15,000 11 Need good headphones to tell the difference
6 18,200 17.5 Hard to tell the difference even with good headphones
7 24,600 14.5 Completely transparent for voice, good quality music
8 3,950 10.5 Very noticeable artifacts/noise, good intelligibility
9 N/A N/A reserved
10 N/A N/A reserved
11 N/A N/A reserved
12 N/A N/A reserved
13 N/A N/A Application-defined, interpreted by callback or skipped
14 N/A N/A Speex in-band signaling
15 N/A N/A Terminator code

Table 4: Quality versus bit-rate

 
1 1−ra1
where a1 and a2 depend on the mode in use and a3 = r 1 − 1−ra 2
with r = .9. The
second part of the enhancement consists of using a comb filter to enhance the pitch in
the excitation domain.
9 SPEEX WIDEBAND MODE (SUB-BAND CELP) 38

9 Speex wideband mode (sub-band CELP)


For wideband, the Speex approach uses a quadrature mirror f ilter (QMF) to split the
band in two. The 16 kHz signal is thus divided into two 8 kHz signals, one repre-
senting the low band (0-4 kHz), the other the high band (4-8 kHz). The low band is
encoded with the narrowband mode described in section 8 in such a way that the re-
sulting “embedded narrowband bit-stream” can also be decoded with the narrowband
decoder. Since the low band encoding has already been described, only the high band
encoding is described in this section.

9.1 Linear Prediction


The linear prediction part used for the high-band is very similar to what is done for
narrowband. The only difference is that we use only 12 bits to encode the high-band
LSP’s using a multi-stage vector quantizer (MSVQ). The first level quantizes the 10
coefficients with 6 bits and the error is then quantized using 6 bits, too.

9.2 Pitch Prediction


That part is easy: there’s no pitch prediction for the high-band. There are two reasons
for that. First, there is usually little harmonic structure in this band (above 4 kHz).
Second, it would be very hard to implement since the QMF folds the 4-8 kHz band into
4-0 kHz (reversing the frequency axis), which means that the location of the harmonics
is no longer at multiples of the fundamental (pitch).

9.3 Excitation Quantization


The high-band excitation is coded in the same way as for narrowband.

9.4 Bit allocation


For the wideband mode, the entire narrowband frame is packed before the high-band
is encoded. The narrowband part of the bit-stream is as defined in table 3. The high-
band follows, as described in table 5. This also means that a wideband frame may be
correctly decoded by a narrowband decoder with the only caveat that if more than one
frame is packed in the same packet, the decoder will need to skip the high-band parts
in order to sync with the bit-stream.
9 SPEEX WIDEBAND MODE (SUB-BAND CELP) 39

Parameter Update rate 0 1 2 3 4


Wideband bit frame 1 1 1 1 1
Mode ID frame 3 3 3 3 3
LSP frame 0 12 12 12 12
Excitation gain sub-frame 0 5 4 4 4
Excitation VQ sub-frame 0 0 20 40 80
Total frame 4 36 112 192 352

Table 5: Bit allocation for high-band in wideband mode


A FAQ 40

A FAQ
Vorbis is open-source and patent-free; why do we need Speex?
Vorbis is a great project but its goals are not the same as Speex. Vorbis is mostly aimed
at compressing music and audio in general, while Speex targets speech only. For that
reason Speex can achieve much better results than Vorbis on speech, typically 2-4 times
higher compression at equal quality.

Isn’t there an open-source implementation of the GSM-FR codec?


Why is Speex necessary?
First of all, it’s not clear whether GSM-FR is covered by a Philips patent (see http://kbs.cs.tu-
berlin.de/~jutta/toast.html). Also, GSM-FR offers mediocre quality at a relatively high
bit-rate, while Speex can offer equivalent quality at almost half the bit-rate. Last but
not least, Speex offers a wide range of bit-rates and sampling rates, while GSM-FR is
limited to 8 kHz speech at 13 kbps.

Under what license is Speex released?


As of version 1.0 beta 1, Speex is released under Xiph’s version of the (revised) BSD
license (see Appendix D). This license is one of the most permissive open-source
licenses.

Am I allowed to use Speex in commercial software?


Yes. As long as you comply with the license. This basically means you have to keep
the copyright notice and you can’t use our name to promote your product without
authorization. For more details, see license in Appendix D.

Ogg, Speex, Vorbis, what’s the difference?


Ogg is a container format for holding multimedia data. Vorbis is an audio codec that
uses Ogg to store its bit-streams as files, hence the name Ogg Vorbis. Speex also
uses the Ogg format to store its bit-streams as files, so technically they would be “Ogg
Speex” files (I prefer to call them just Speex files). One difference with Vorbis however,
is that Speex is less tied with Ogg. Actually, if you just do Voice over IP (VoIP), you
don’t need Ogg at all.
A FAQ 41

What’s the extension for Speex?


Speex files have the .spx extension. Note, however that the Speex tools (speexenc,
speexdec) do not rely on the extension at all, so any extension will work.

Can I use Speex for compressing music?


Just like Vorbis is not really adapted to speech, Speex is really not adapted for music.
In most cases, you’ll be better of with Vorbis when it comes to music.

I converted some MP3s to Speex and the quality is bad. What’s


wrong?
This is called transcoding and it will always result in much poorer quality than the
original MP3. Unless you have a really good (size) reason to do so, never transcode
speech. This is even valid for self transcoding (tandeming), i.e. If you decode a Speex
file and re-encode it again at the same bit-rate, you will lose quality.

Does Speex run on Windows?


Compilation on Windows has been supported since version 0.8.0. There are also sev-
eral front-ends available from the website.

Why is encoding so slow compared to decoding?


For most kinds of compression, encoding is inherently slower than decoding. In the
case of Speex, encoding consists of finding, for each vector of 5 to 10 samples, the
entry that matches the best within a codebook consisting of 16 to 256 entries. On
the other hand, at decoding all that needs to be done is look up the right entry in the
codebook using the encoded index. Since a lookup is much faster than a search, the
decoder works much faster than the encoder.

Why is Speex so slow on my iPaq (or insert any platform without an


FPU)?
You probably didn’t build Speex with the fixed-point option (–enable-fixed-point).
Even if you did, not all modes have been ported to use fixed-point arithmetic, so the
code may be slowed down by a few float operations left (e.g. in the wideband mode).
A FAQ 42

I’m getting unusual background noise (hiss) when using libspeex in


my application. How do I fix that?
One of the causes could be scaling of the input speech. Speex expects signals to have
a ±215 (signed short) dynamic range. If the dynamic range of your signals is too small
(e.g. ±1.0), you will suffer important quantization noise. A good target is to have a
dynamic range around ±8000 which is large enough, but small enough to make sure
there’s no clipping when converting back to signed short.

I get very distorted speech when using libspeex in my application.


What’s wrong?
There are many possible causes for that. One of them is errors in the way the bits
are manipulated. Another possible cause is the use of the same encoder or decoder
state for more than one audio stream (channel), which produces strange effects with
the filter memories. If the input speech has an amplitude close to ±215 , it is possible
that at decoding, the amplitude be a bit higher than that, causing clipping when saving
as 16-bit PCM.

How does Speex compare to other proprietary codecs?


It’s hard to give precise figures since no formal listening tests have been performed yet.
All I can say is that in terms of quality, Speex competes on the same ground as other
proprietary codecs (not necessarily the best, but not the worst either). Speex also has
many features that are not present in most other codecs. These include variable bit-rate
(VBR), integration of narrowband and wideband, as well as stereo support. Of course,
another area where Speex is really hard to beat is the quality/price ratio. Unlike many
very expensive codecs, Speex is free and anyone may distribute or modify it at will.

Can Speex pass DTMF?


I guess it all depends on the bit-rate used. Though no formal testing has yet been
performed, I’d say is correctly at 8 kbps and above. Also, make sure you don’t use the
lowest complexity (see SPEEX_SET_COMPLEXITY or –comp option), as it causes
significant noise.
A FAQ 43

Can Speex pass V.9x modem signals correctly?


If I could do that I’d be very rich by now :-) Seriously, that would break fundamental
laws of information theory.

What is your (Jean-Marc) relationship with the University of Sher-


brooke and how does Speex fit into that?
I have completed my Ph.D. at the University of Sherbrooke in 2005 in mobile robotics.
Although I did my master with the Sherbrooke speech coding group (in speech en-
hancement, not coding), was no longer associated with them when developing Speex.
It should not be understood that they or the University of Sherbrooke have anything
to do with the Speex project. Furthermore, Speex does not make use of any code or
proprietary technology developed in the Sherbrooke speech coding group.

CELP, ACELP, what’s the difference?


CELP stands for “Code Excited Linear Prediction”, while ACELP stands for “Alge-
braic Code Excited Linear Prediction”. That means ACELP is a CELP technique that
uses an algebraic codebook represented as a sum of unit pulses, thus making the code-
book search much more efficient. This technique was invented at the University of
Sherbrooke and is now one of the most widely used form of CELP. Unfortunately,
since it is patented, it cannot be used in Speex.
B SAMPLE CODE 44

B Sample code
This section shows sample code for encoding and decoding speech using the Speex
API. The commands can be used to encode and decode a file by calling:
% sampleenc in_file.sw | sampledec out_file.sw
where both files are raw (no header) files encoded at 16 bits per sample (in the machine
natural endianness).

B.1 sampleenc.c
sampleenc takes a raw 16 bits/sample file, encodes it and outputs a Speex stream to
stdout. Note that the packing used is NOT compatible with that of speexenc/speexdec.

#include <speex/speex.h>
#include <stdio.h>

/*The frame size in hardcoded for this sample code but it doesn’t have to be*/
#define FRAME_SIZE 160
int main(int argc, char **argv)
{
char *inFile;
FILE *fin;
short in[FRAME_SIZE];
float input[FRAME_SIZE];
char cbits[200];
int nbBytes;
/*Holds the state of the encoder*/
void *state;
/*Holds bits so they can be read and written to by the Speex routines*/
SpeexBits bits;
int i, tmp;

/*Create a new encoder state in narrowband mode*/


state = speex_encoder_init(&speex_nb_mode);

/*Set the quality to 8 (15 kbps)*/


tmp=8;
speex_encoder_ctl(state, SPEEX_SET_QUALITY, &tmp);
B SAMPLE CODE 45

inFile = argv[1];
fin = fopen(inFile, "r");

/*Initialization of the structure that holds the bits*/


speex_bits_init(&bits);
while (1)
{
/*Read a 16 bits/sample audio frame*/
fread(in, sizeof(short), FRAME_SIZE, fin);
if (feof(fin))
break;
/*Copy the 16 bits values to float so Speex can work on them*/
for (i=0;i<FRAME_SIZE;i++)
input[i]=in[i];

/*Flush all the bits in the struct so we can encode a new frame*/
speex_bits_reset(&bits);

/*Encode the frame*/


speex_encode(state, input, &bits);
/*Copy the bits to an array of char that can be written*/
nbBytes = speex_bits_write(&bits, cbits, 200);

/*Write the size of the frame first. This is what sampledec expects but
it’s likely to be different in your own application*/
fwrite(&nbBytes, sizeof(int), 1, stdout);
/*Write the compressed data*/
fwrite(cbits, 1, nbBytes, stdout);

/*Destroy the encoder state*/


speex_encoder_destroy(state);
/*Destroy the bit-packing struct*/
speex_bits_destroy(&bits);
fclose(fin);
B SAMPLE CODE 46

return 0;
}

B.2 sampledec.c
sampledec reads a Speex stream from stdin, decodes it and outputs it to a raw 16
bits/sample file. Note that the packing used is NOT compatible with that of speex-
enc/speexdec.

#include <speex/speex.h>
#include <stdio.h>

/*The frame size in hardcoded for this sample code but it doesn’t have to be*/
#define FRAME_SIZE 160
int main(int argc, char **argv)
{
char *outFile;
FILE *fout;
/*Holds the audio that will be written to file (16 bits per sample)*/
short out[FRAME_SIZE];
/*Speex handle samples as float, so we need an array of floats*/
float output[FRAME_SIZE];
char cbits[200];
int nbBytes;
/*Holds the state of the decoder*/
void *state;
/*Holds bits so they can be read and written to by the Speex routines*/
SpeexBits bits;
int i, tmp;

/*Create a new decoder state in narrowband mode*/


state = speex_decoder_init(&speex_nb_mode);

/*Set the perceptual enhancement on*/


tmp=1;
speex_decoder_ctl(state, SPEEX_SET_ENH, &tmp);

outFile = argv[1];
B SAMPLE CODE 47

fout = fopen(outFile, "w");

/*Initialization of the structure that holds the bits*/


speex_bits_init(&bits);
while (1)
{
/*Read the size encoded by sampleenc, this part will likely be
different in your application*/
fread(&nbBytes, sizeof(int), 1, stdin);
fprintf (stderr, "nbBytes: %d\n", nbBytes);
if (feof(stdin))
break;

/*Read the "packet" encoded by sampleenc*/


fread(cbits, 1, nbBytes, stdin);
/*Copy the data into the bit-stream struct*/
speex_bits_read_from(&bits, cbits, nbBytes);

/*Decode the data*/


speex_decode(state, &bits, output);

/*Copy from float to short (16 bits) for output*/


for (i=0;i<FRAME_SIZE;i++)
out[i]=output[i];

/*Write the decoded audio to file*/


fwrite(out, sizeof(short), FRAME_SIZE, fout);
}

/*Destroy the decoder state*/


speex_decoder_destroy(state);
/*Destroy the bit-stream truct*/
speex_bits_destroy(&bits);
fclose(fout);
return 0;
}
C IETF RTP PROFILE 48

C IETF RTP Profile

AVT Working Group G. Herlein


Internet-Draft S. Morlat
Expires: October 3, 2005 J. Jean-Marc
R. Hardiman
P. Kerr
April 04, 2005

draft-herlein-speex-rtp-profile-02
RTP Payload Format for the Speex Codec

Status of this Memo

This document is an Internet-Draft and is subject to all provisions


of section 3 of RFC 3667. By submitting this Internet-Draft, each
author represents that any applicable patent or other IPR claims of
which he or she is aware have been or will be disclosed, and any of
which he or she become aware will be disclosed, in accordance with
RFC 3668.

Internet-Drafts are working documents of the Internet Engineering


Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months


and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at


http://www.ietf.org/ietf/1id-abstracts.txt.
C IETF RTP PROFILE 49

The list of Internet-Draft Shadow Directories can be accessed at


http://www.ietf.org/shadow.html.

This Internet-Draft will expire on October 3, 2005.

Copyright Notice

Copyright (C) The Internet Society (2005).

Abstract

Speex is an open-source voice codec suitable for use in Voice over IP


(VoIP) type applications. This document describes the payload format
for Speex generated bit streams within an RTP packet. Also included
here are the necessary details for the use of Speex with the Session
Description Protocol (SDP) and a preliminary method of using Speex

Herlein, et al. Expires October 3, 2005 [Page 1]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

within H.323 applications.

Table of Contents

1. Conventions used in this document . . . . . . . . . . . . . 3


2. Overview of the Speex Codec . . . . . . . . . . . . . . . . 3
3. RTP payload format for Speex . . . . . . . . . . . . . . . . 3
4. RTP Header . . . . . . . . . . . . . . . . . . . . . . . . . 3
5. Speex payload . . . . . . . . . . . . . . . . . . . . . . . 5
6. Example Speex packet . . . . . . . . . . . . . . . . . . . . 6
7. Multiple Speex frames in a RTP packet . . . . . . . . . . . 6
8. MIME registration of Speex . . . . . . . . . . . . . . . . . 7
9. SDP usage of Speex . . . . . . . . . . . . . . . . . . . . . 8
C IETF RTP PROFILE 50

10. ITU H.323/H.245 Use of Speex . . . . . . . . . . . . . . . . 10


11. NonStandardMessage format . . . . . . . . . . . . . . . . . 10
12. RTP Payload Types . . . . . . . . . . . . . . . . . . . . . 11
13. Security Considerations . . . . . . . . . . . . . . . . . . 11
14. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . 12
15. References . . . . . . . . . . . . . . . . . . . . . . . . . 12
15.1 Normative References . . . . . . . . . . . . . . . . . . . 12
15.2 Informative References . . . . . . . . . . . . . . . . . . 13
Authors’ Addresses . . . . . . . . . . . . . . . . . . . . . 13
Intellectual Property and Copyright Statements . . . . . . . 15
C IETF RTP PROFILE 51

Herlein, et al. Expires October 3, 2005 [Page 2]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

1. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1].

2. Overview of the Speex Codec

Speex is based on the CELP [10] encoding technique with support for
either narrowband (nominal 8kHz), wideband (nominal 16kHz) or
ultra-wideband (nominal 32kHz), and (non-optimal) rates up to 48 kHz
sampling also available. The main characteristics can be summarized
as follows:

o Free software/open-source
o Integration of wideband and narrowband in the same bit-stream
o Wide range of bit-rates available
o Dynamic bit-rate switching and variable bit-rate (VBR)
o Voice Activity Detection (VAD, integrated with VBR)
o Variable complexity

3. RTP payload format for Speex

For RTP based transportation of Speex encoded audio the standard RTP
header [2] is followed by one or more payload data blocks. An
optional padding terminator may also be used.

0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RTP Header |
C IETF RTP PROFILE 52

+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| one or more frames of Speex .... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| one or more frames of Speex .... | padding |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4. RTP Header

0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X| CC |M| PT | sequence number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| timestamp |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| synchronization source (SSRC) identifier |

Herlein, et al. Expires October 3, 2005 [Page 3]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| contributing source (CSRC) identifiers |
| ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The RTP header begins with an octet of fields (V, P, X, and CC) to
support specialized RTP uses (see [2] and [7] for details). For
Speex the following values are used.

Version (V): 2 bits

This field identifies the version of RTP. The version used by this
C IETF RTP PROFILE 53

specification is two [2].

Padding (P): 1 bit

If the padding bit is set, the packet contains one or more additional
padding octets at the end which are not part of the payload. P is
set if the total packet size is less than the MTU.

Extension (X): 1 bit

If the extension, X, bit is set, the fixed header MUST be followed by


exactly one header extension, with a format defined in Section 5.3.1.
of [2].

CSRC count (CC): 4 bits

The CSRC count contains the number of CSRC identifiers.

Marker (M): 1 bit

The M bit indicates if the packet contains comfort noise. This field
is used in conjunction with the cng SDP attribute and is detailed
further in section 5 below. In normal usage this bit is set if the
packet contains comfort noise.

Payload Type (PT): 7 bits

An RTP profile for a class of applications is expected to assign a


payload type for this format, or a dynamically allocated payload type
SHOULD be chosen which designates the payload as Speex.

Sequence number: 16 bits

The sequence number increments by one for each RTP data packet sent,
and may be used by the receiver to detect packet loss and to restore
packet sequence. This field is detailed further in [2].
C IETF RTP PROFILE 54

Herlein, et al. Expires October 3, 2005 [Page 4]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

Timestamp: 32 bits

A timestamp representing the sampling time of the first sample of the


first Speex packet in the RTP packet. The clock frequency MUST be
set to the sample rate of the encoded audio data. Speex uses 20 msec
frames and a variable sampling rate clock. The RTP timestamp MUST be
in units of 1/X of a second where X is the sample rate used. Speex
uses a nominal 8kHz sampling rate for narrowband use, a nominal 16kHz
sampling rate for wideband use, and a nominal 32kHz sampling rate for
ultra-wideband use.

SSRC/CSRC identifiers:

These two fields, 32 bits each with one SSRC field and a maximum of
16 CSRC fields, are as defined in [2].

5. Speex payload

For the purposes of packetizing the bit stream in RTP, it is only


necessary to consider the sequence of bits as output by the Speex
encoder [9], and present the same sequence to the decoder. The
payload format described here maintains this sequence.

A typical Speex frame, encoded at the maximum bitrate, is approx.


110 octets and the total number of Speex frames SHOULD be kept less
than the path MTU to prevent fragmentation. Speex frames MUST NOT be
fragmented across multiple RTP packets,

An RTP packet MAY contain Speex frames of the same bit rate or of
varying bit rates, since the bit-rate for a frame is conveyed in band
C IETF RTP PROFILE 55

with the signal.

The encoding and decoding algorithm can change the bit rate at any 20
msec frame boundary, with the bit rate change notification provided
in-band with the bit stream. Each frame contains both "mode"
(narrowband, wideband or ultra-wideband) and "sub-mode" (bit-rate)
information in the bit stream. No out-of-band notification is
required for the decoder to process changes in the bit rate sent by
the encoder.

It is RECOMMENDED that values of 8000, 16000 and 32000 be used for


normal internet telephony applications, though the sample rate is
supported at rates as low as 6000 Hz and as high as 48 kHz.

The RTP payload MUST be padded to provide an integer number of octets


as the payload length. These padding bits are LSB aligned in network
octet order and consist of a 0 followed by all ones (until the end of
the octet). This padding is only required for the last frame in the

Herlein, et al. Expires October 3, 2005 [Page 5]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

packet, and only to ensure the packet contents ends on an octet


boundary.

6. Example Speex packet

In the example below we have a single Speex frame with 5 bits of


padding to ensure the packet size falls on an octet boundary.

0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
C IETF RTP PROFILE 56

|V=2|P|X| CC |M| PT | sequence number |


+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| timestamp |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| synchronization source (SSRC) identifier |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| contributing source (CSRC) identifiers |
| ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ..speex data.. |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ..speex data.. |0 1 1 1 1|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

7. Multiple Speex frames in a RTP packet

Below is an example of two Speex frames contained within one RTP


packet. The Speex frame length in this example fall on an octet
boundary so there is no padding.

Speex codecs [9] are able to detect the the bitrate from the payload
and are responsible for detecting the 20 msec boundaries between each
frame.

0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X| CC |M| PT | sequence number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| timestamp |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
C IETF RTP PROFILE 57

Herlein, et al. Expires October 3, 2005 [Page 6]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

| synchronization source (SSRC) identifier |


+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| contributing source (CSRC) identifiers |
| ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ..speex data.. |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ..speex data.. | ..speex data.. |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ..speex data.. |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

8. MIME registration of Speex

Full definition of the MIME [3] type for Speex will be part of the
Ogg Vorbis MIME type definition application [8].

MIME media type name: audio

MIME subtype: speex

Optional parameters:

Required parameters: to be included in the Ogg MIME specification.

Encoding considerations:
C IETF RTP PROFILE 58

Security Considerations:

See Section 6 of RFC 3047.

Interoperability considerations: none

Published specification:

Applications which use this media type:

Additional information: none

Person & email address to contact for further information:

Greg Herlein <[email protected]>


Jean-Marc Valin <[email protected]>

Intended usage: COMMON

Herlein, et al. Expires October 3, 2005 [Page 7]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

Author/Change controller:

Author: Greg Herlein <[email protected]>


Change controller: Greg Herlein <[email protected]>
Change controller: IETF AVT Working Group

This transport type signifies that the content is to be interpreted


according to this document if the contents are transmitted over RTP.
Should this transport type appear over a lossless streaming protocol
such as TCP, the content encapsulation should be interpreted as an
C IETF RTP PROFILE 59

Ogg Stream in accordance with [8], with the exception that the
content of the Ogg Stream may be assumed to be Speex audio and Speex
audio only.

9. SDP usage of Speex

When conveying information by SDP [4], the encoding name MUST be set
to "speex". An example of the media representation in SDP for
offering a single channel of Speex at 8000 samples per second might
be:

m=audio 8088 RTP/AVP 97


a=rtpmap:97 speex/8000

Note that the RTP payload type code of 97 is defined in this media
definition to be ’mapped’ to the speex codec at an 8kHz sampling
frequency using the ’a=rtpmap’ line. Any number from 96 to 127 could
have been chosen (the allowed range for dynamic types).

The value of the sampling frequency is typically 8000 for narrow band
operation, 16000 for wide band operation, and 32000 for ultra-wide
band operation.

If for some reason the offerer has bandwidth limitations, the client
may use the "b=" header, as explained in SDP [4]. The following
example illustrates the case where the offerer cannot receive more
than 10 kbit/s.

m=audio 8088 RTP/AVP 97


b=AS:10
a=rtmap:97 speex/8000

In this case, if the remote part agrees, it should configure its


Speex encoder so that it does not use modes that produce more than 10
kbit/s. Note that the "b=" constraint also applies on all payload
types that may be proposed in the media line ("m=").
C IETF RTP PROFILE 60

An other way to make recommendations to the remote Speex encoder is

Herlein, et al. Expires October 3, 2005 [Page 8]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

to use its specific parameters via the a=fmtp: directive. The


following parameters are defined for use in this way:

ptime: duration of each packet in milliseconds.

sr: actual sample rate in Hz.

ebw: encoding bandwidth - either ’narrow’ or ’wide’ or ’ultra’


(corresponds to nominal 8000, 16000, and 32000 Hz sampling rates).

vbr: variable bit rate - either ’on’ ’off’ or ’vad’ (defaults


to off). If on, variable bit rate is enabled. If off, disabled.
If set to ’vad’ then constant bit rate is used but silence will be
encoded with special short frames to indicate a lack of voice for
that period.

cng: comfort noise generation - either ’on’ or ’off’. If off


then silence frames will be silent; if ’on’ then those frames will
be filled with comfort noise.

mode: Speex encoding mode. Can be {1,2,3,4,5,6,any} defaults to


3 in narrowband, 6 in wide and ultra-wide.

penh: use of perceptual enhancement. 1 indicates to the decoder


that perceptual enhancement is recommended, 0 indicates that it is
not. Defaults to on (1).
C IETF RTP PROFILE 61

Examples:

m=audio 8008 RTP/AVP 97


a=rtpmap:97 speex/8000
a=fmtp:97 mode=4

This examples illustrate an offerer that wishes to receive a Speex


stream at 8000Hz, but only using speex mode 3.

The offerer may suggest to the remote decoder to activate its


perceptual enhancement filter like this:

m=audio 8088 RTP/AVP 97


a=rtmap:97 speex/8000
a=fmtp:97 penh=1

Several Speex specific parameters can be given in a single a=fmtp


line provided that they are separated by a semi-colon:

Herlein, et al. Expires October 3, 2005 [Page 9]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

a=fmtp:97 mode=any;penh=1

The offerer may indicate that it wishes to send variable bit rate
frames with comfort noise:

m=audio 8088 RTP/AVP 97


a=rtmap:97 speex/8000
a=fmtp:97 vbr=on;cng=on
C IETF RTP PROFILE 62

The "ptime" attribute is used to denote the packetization interval


(ie, how many milliseconds of audio is encoded in a single RTP
packet). Since Speex uses 20 msec frames, ptime values of multiples
of 20 denote multiple Speex frames per packet. Values of ptime which
are not multiples of 20 MUST be ignored and clients MUST use the
default value of 20 instead.

In the example below the ptime value is set to 40, indicating that
there are 2 frames in each packet.

m=audio 8008 RTP/AVP 97


a=rtpmap:97 speex/8000
a=ptime:40

Note that the ptime parameter applies to all payloads listed in the
media line and is not used as part of an a=fmtp directive.

Values of ptime not multiple of 20 msec are meaningless, so the


receiver of such ptime values MUST ignore them. If during the life
of an RTP session the ptime value changes, when there are multiple
Speex frames for example, the SDP value must also reflect the new
value.

Care must be taken when setting the value of ptime so that the RTP
packet size does not exceed the path MTU.

10. ITU H.323/H.245 Use of Speex

Application is underway to make Speex a standard ITU codec. However,


until that is finalized, Speex MAY be used in H.323 [5] by using a
non-standard codec block definition in the H.245 [6] codec capability
negotiations.

11. NonStandardMessage format

For Speex use in H.245 [6] based systems, the fields in the
NonStandardMessage should be:
C IETF RTP PROFILE 63

Herlein, et al. Expires October 3, 2005 [Page 10]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

t35CountryCode = Hex: B5
t35Extension = Hex: 00
manufacturerCode = Hex: 0026
[Length of the Binary Sequence (8 bit number)]
[Binary Sequence consisting of an ASCII string, no NULL
terminator]

The binary sequence is an ascii string merely for ease of use. The
string is not null terminated. The format of this string is

speex [optional variables]

The optional variables are identical to those used for the SDP a=fmtp
strings discussed in section 5 above. The string is built to be all
on one line, each key-value pair separated by a semi-colon. The
optional variables MAY be omitted, which causes the default values to
be assumed. They are:

ebw=narrow;mode=3;vbr=off;cng=off;ptime=20;sr=8000;penh=no;

The fifth octet of the block is the length of the binary sequence.

NOTE: this method can result in the advertising of a large number of


Speex ’codecs’ based on the number of variables possible. For most
VoIP applications, use of the default binary sequence of ’speex’ is
RECOMMENDED to be used in addition to all other options. This
maximizes the chances that two H.323 based applications that support
C IETF RTP PROFILE 64

Speex can find a mutual codec.

12. RTP Payload Types

Dynamic payload type codes MUST be negotiated ’out-of-band’ for the


assignment of a dynamic payload type from the range of 96-127. H.323
applications MUST use the H.245 H2250LogicalChannelParameters
encoding to accomplish this.

13. Security Considerations

RTP packets using the payload format defined in this specification


are subject to the security considerations discussed in the RTP
specification [2], and any appropriate RTP profile. This implies
that confidentiality of the media streams is achieved by encryption.
Because the data compression used with this payload format is applied
end-to-end, encryption may be performed after compression so there is
no conflict between the two operations.

A potential denial-of-service threat exists for data encodings using


compression techniques that have non-uniform receiver-end

Herlein, et al. Expires October 3, 2005 [Page 11]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

computational load. The attacker can inject pathological datagrams


into the stream which are complex to decode and cause the receiver to
be overloaded. However, this encoding does not exhibit any
significant non-uniformity.

As with any IP-based protocol, in some circumstances a receiver may


be overloaded simply by the receipt of too many packets, either
desired or undesired. Network-layer authentication may be used to
C IETF RTP PROFILE 65

discard packets from undesired sources, but the processing cost of


the authentication itself may be too high.

14. Acknowledgments

The authors would like to thank Equivalence Pty Ltd of Australia for
their assistance in attempting to standardize the use of Speex in
H.323 applications, and for implementing Speex in their open source
OpenH323 stack. The authors would also like to thank Brian C. Wiles
<[email protected]> of StreamComm for his assistance in developing
the proposed standard for Speex use in H.323 applications.

The authors would also like to thank the following members of the
Speex and AVT communities for their input: Ross Finlayson, Federico
Montesino Pouzols, Henning Schulzrinne, Magnus Westerlund.

15. References

15.1 Normative References

[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", RFC 2119.

[2] Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,


"RTP: A Transport Protocol for real-time applications", RFC
3550.

[3] "Multipurpose Internet Mail Extensions (MIME) Part One: Format


of Internet Message Bodies", RFC 2045.

[4] Jacobson, V. and M. Handley, "SDP: Session Description


Protocol", RFC 2327.

[5] "Packet-based Multimedia Communications Systems", ITU-T


Recommendation H.323.

[6] "Control of communications between Visual Telephone Systems and


C IETF RTP PROFILE 66

Terminal Equipment", ITU-T Recommendation H.245.

[7] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video

Herlein, et al. Expires October 3, 2005 [Page 12]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

Conferences with Minimal Control.", RFC 3551.

[8] Walleij, L., "The application/ogg Media Type", RFC 3534.

15.2 Informative References

[9] "Speexenc/speexdec, reference command-line encoder/decoder",


Speex website http://www.speex.org/.

[10] "CELP, U.S. Federal Standard 1016.", National Technical


Information Service (NTIS) website http://www.ntis.gov/.

Authors’ Addresses

Greg Herlein
2034 Filbert Street
San Francisco, California 94123
United States

EMail: [email protected]

Simon Morlat
35, av de Vizille App 42
Grenoble 38000
C IETF RTP PROFILE 67

France

EMail: [email protected]

Jean-Marc Valin
Department of Electrical and Computer Engineering
University of Sherbrooke
2500 blvd Universite
Sherbrooke, Quebec J1K 2R1
Canada

EMail: [email protected]

Roger Hardiman
49 Nettleton Road
Cheltenham, Gloucestershire GL51 6NR
England

EMail: [email protected]

Herlein, et al. Expires October 3, 2005 [Page 13]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

Phil Kerr
England

EMail: [email protected]
C IETF RTP PROFILE 68
C IETF RTP PROFILE 69

Herlein, et al. Expires October 3, 2005 [Page 14]

Internet-Draft draft-herlein-speex-rtp-profile-02 April 2005

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any


Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any


assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
[email protected].
C IETF RTP PROFILE 70

Disclaimer of Validity

This document and the information contained herein are provided on an


"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2005). This document is subject


to the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the


Internet Society.

Herlein, et al. Expires October 3, 2005 [Page 15]


D SPEEX LICENSE 71

D Speex License
Redistribution and use in source and binary forms, with or without modification, are
permitted provided that the following conditions are met:

• Redistributions of source code must retain the above copyright notice, this list of
conditions and the following disclaimer.

• Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or other
materials provided with the distribution.

• Neither the name of the Xiph.org Foundation nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CON-


TRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, IN-
CLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MER-
CHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CON-
SEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCURE-
MENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROF-
ITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THE-
ORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
E GNU FREE DOCUMENTATION LICENSE 72

E GNU Free Documentation License


Version 1.1, March 2000
Copyright (C) 2000 Free Software Foundation, Inc. 59 Temple Place, Suite 330,
Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim
copies of this license document, but changing it is not allowed.

0. PREAMBLE
The purpose of this License is to make a manual, textbook, or other written document
"free" in the sense of freedom: to assure everyone the effective freedom to copy and
redistribute it, with or without modifying it, either commercially or noncommercially.
Secondarily, this License preserves for the author and publisher a way to get credit for
their work, while not being considered responsible for modifications made by others.
This License is a kind of "copyleft", which means that derivative works of the
document must themselves be free in the same sense. It complements the GNU General
Public License, which is a copyleft license designed for free software.
We have designed this License in order to use it for manuals for free software,
because free software needs free documentation: a free program should come with
manuals providing the same freedoms that the software does. But this License is not
limited to software manuals; it can be used for any textual work, regardless of sub-
ject matter or whether it is published as a printed book. We recommend this License
principally for works whose purpose is instruction or reference.

1. APPLICABILITY AND DEFINITIONS


This License applies to any manual or other work that contains a notice placed by
the copyright holder saying it can be distributed under the terms of this License. The
"Document", below, refers to any such manual or work. Any member of the public is a
licensee, and is addressed as "you".
A "Modified Version" of the Document means any work containing the Document
or a portion of it, either copied verbatim, or with modifications and/or translated into
another language.
A "Secondary Section" is a named appendix or a front-matter section of the Doc-
ument that deals exclusively with the relationship of the publishers or authors of the
Document to the Document’s overall subject (or to related matters) and contains noth-
ing that could fall directly within that overall subject. (For example, if the Document
E GNU FREE DOCUMENTATION LICENSE 73

is in part a textbook of mathematics, a Secondary Section may not explain any mathe-
matics.) The relationship could be a matter of historical connection with the subject or
with related matters, or of legal, commercial, philosophical, ethical or political position
regarding them.
The "Invariant Sections" are certain Secondary Sections whose titles are desig-
nated, as being those of Invariant Sections, in the notice that says that the Document is
released under this License.
The "Cover Texts" are certain short passages of text that are listed, as Front-Cover
Texts or Back-Cover Texts, in the notice that says that the Document is released under
this License.
A "Transparent" copy of the Document means a machine-readable copy, repre-
sented in a format whose specification is available to the general public, whose con-
tents can be viewed and edited directly and straightforwardly with generic text editors
or (for images composed of pixels) generic paint programs or (for drawings) some
widely available drawing editor, and that is suitable for input to text formatters or for
automatic translation to a variety of formats suitable for input to text formatters. A
copy made in an otherwise Transparent file format whose markup has been designed
to thwart or discourage subsequent modification by readers is not Transparent. A copy
that is not "Transparent" is called "Opaque".
Examples of suitable formats for Transparent copies include plain ASCII without
markup, Texinfo input format, LATEX input format, SGML or XML using a publicly
available DTD, and standard-conforming simple HTML designed for human modifi-
cation. Opaque formats include PostScript, PDF, proprietary formats that can be read
and edited only by proprietary word processors, SGML or XML for which the DTD
and/or processing tools are not generally available, and the machine-generated HTML
produced by some word processors for output purposes only.
The "Title Page" means, for a printed book, the title page itself, plus such following
pages as are needed to hold, legibly, the material this License requires to appear in the
title page. For works in formats which do not have any title page as such, "Title Page"
means the text near the most prominent appearance of the work’s title, preceding the
beginning of the body of the text.

2. VERBATIM COPYING
You may copy and distribute the Document in any medium, either commercially or
noncommercially, provided that this License, the copyright notices, and the license
notice saying this License applies to the Document are reproduced in all copies, and
E GNU FREE DOCUMENTATION LICENSE 74

that you add no other conditions whatsoever to those of this License. You may not
use technical measures to obstruct or control the reading or further copying of the
copies you make or distribute. However, you may accept compensation in exchange
for copies. If you distribute a large enough number of copies you must also follow the
conditions in section 3.
You may also lend copies, under the same conditions stated above, and you may
publicly display copies.

3. COPYING IN QUANTITY
If you publish printed copies of the Document numbering more than 100, and the Doc-
ument’s license notice requires Cover Texts, you must enclose the copies in covers that
carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover,
and Back-Cover Texts on the back cover. Both covers must also clearly and legibly
identify you as the publisher of these copies. The front cover must present the full title
with all words of the title equally prominent and visible. You may add other mate-
rial on the covers in addition. Copying with changes limited to the covers, as long as
they preserve the title of the Document and satisfy these conditions, can be treated as
verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should
put the first ones listed (as many as fit reasonably) on the actual cover, and continue the
rest onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than
100, you must either include a machine-readable Transparent copy along with each
Opaque copy, or state in or with each Opaque copy a publicly-accessible computer-
network location containing a complete Transparent copy of the Document, free of
added material, which the general network-using public has access to download anony-
mously at no charge using public-standard network protocols. If you use the latter op-
tion, you must take reasonably prudent steps, when you begin distribution of Opaque
copies in quantity, to ensure that this Transparent copy will remain thus accessible at
the stated location until at least one year after the last time you distribute an Opaque
copy (directly or through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well
before redistributing any large number of copies, to give them a chance to provide you
with an updated version of the Document.
E GNU FREE DOCUMENTATION LICENSE 75

4. MODIFICATIONS
You may copy and distribute a Modified Version of the Document under the conditions
of sections 2 and 3 above, provided that you release the Modified Version under pre-
cisely this License, with the Modified Version filling the role of the Document, thus
licensing distribution and modification of the Modified Version to whoever possesses a
copy of it. In addition, you must do these things in the Modified Version:

• A. Use in the Title Page (and on the covers, if any) a title distinct from that of the
Document, and from those of previous versions (which should, if there were any,
be listed in the History section of the Document). You may use the same title as
a previous version if the original publisher of that version gives permission.

• B. List on the Title Page, as authors, one or more persons or entities responsible
for authorship of the modifications in the Modified Version, together with at least
five of the principal authors of the Document (all of its principal authors, if it has
less than five).

• C. State on the Title page the name of the publisher of the Modified Version, as
the publisher.

• D. Preserve all the copyright notices of the Document.

• E. Add an appropriate copyright notice for your modifications adjacent to the


other copyright notices.

• F. Include, immediately after the copyright notices, a license notice giving the
public permission to use the Modified Version under the terms of this License,
in the form shown in the Addendum below.

• G. Preserve in that license notice the full lists of Invariant Sections and required
Cover Texts given in the Document’s license notice.

• H. Include an unaltered copy of this License.

• I. Preserve the section entitled "History", and its title, and add to it an item stating
at least the title, year, new authors, and publisher of the Modified Version as
given on the Title Page. If there is no section entitled "History" in the Document,
create one stating the title, year, authors, and publisher of the Document as given
on its Title Page, then add an item describing the Modified Version as stated in
the previous sentence.
E GNU FREE DOCUMENTATION LICENSE 76

• J. Preserve the network location, if any, given in the Document for public access
to a Transparent copy of the Document, and likewise the network locations given
in the Document for previous versions it was based on. These may be placed in
the "History" section. You may omit a network location for a work that was pub-
lished at least four years before the Document itself, or if the original publisher
of the version it refers to gives permission.

• K. In any section entitled "Acknowledgements" or "Dedications", preserve the


section’s title, and preserve in the section all the substance and tone of each of
the contributor acknowledgements and/or dedications given therein.

• L. Preserve all the Invariant Sections of the Document, unaltered in their text and
in their titles. Section numbers or the equivalent are not considered part of the
section titles.

• M. Delete any section entitled "Endorsements". Such a section may not be in-
cluded in the Modified Version.

• N. Do not retitle any existing section as "Endorsements" or to conflict in title


with any Invariant Section.

If the Modified Version includes new front-matter sections or appendices that qualify
as Secondary Sections and contain no material copied from the Document, you may at
your option designate some or all of these sections as invariant. To do this, add their
titles to the list of Invariant Sections in the Modified Version’s license notice. These
titles must be distinct from any other section titles.
You may add a section entitled "Endorsements", provided it contains nothing but
endorsements of your Modified Version by various parties–for example, statements of
peer review or that the text has been approved by an organization as the authoritative
definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage
of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the
Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text
may be added by (or through arrangements made by) any one entity. If the Document
already includes a cover text for the same cover, previously added by you or by arrange-
ment made by the same entity you are acting on behalf of, you may not add another;
but you may replace the old one, on explicit permission from the previous publisher
that added the old one.
E GNU FREE DOCUMENTATION LICENSE 77

The author(s) and publisher(s) of the Document do not by this License give per-
mission to use their names for publicity for or to assert or imply endorsement of any
Modified Version.

5. COMBINING DOCUMENTS
You may combine the Document with other documents released under this License,
under the terms defined in section 4 above for modified versions, provided that you
include in the combination all of the Invariant Sections of all of the original documents,
unmodified, and list them all as Invariant Sections of your combined work in its license
notice.
The combined work need only contain one copy of this License, and multiple iden-
tical Invariant Sections may be replaced with a single copy. If there are multiple In-
variant Sections with the same name but different contents, make the title of each such
section unique by adding at the end of it, in parentheses, the name of the original au-
thor or publisher of that section if known, or else a unique number. Make the same
adjustment to the section titles in the list of Invariant Sections in the license notice of
the combined work.
In the combination, you must combine any sections entitled "History" in the vari-
ous original documents, forming one section entitled "History"; likewise combine any
sections entitled "Acknowledgements", and any sections entitled "Dedications". You
must delete all sections entitled "Endorsements."

6. COLLECTIONS OF DOCUMENTS
You may make a collection consisting of the Document and other documents released
under this License, and replace the individual copies of this License in the various
documents with a single copy that is included in the collection, provided that you follow
the rules of this License for verbatim copying of each of the documents in all other
respects.
You may extract a single document from such a collection, and distribute it individ-
ually under this License, provided you insert a copy of this License into the extracted
document, and follow this License in all other respects regarding verbatim copying of
that document.
E GNU FREE DOCUMENTATION LICENSE 78

7. AGGREGATION WITH INDEPENDENT WORKS


A compilation of the Document or its derivatives with other separate and independent
documents or works, in or on a volume of a storage or distribution medium, does not
as a whole count as a Modified Version of the Document, provided no compilation
copyright is claimed for the compilation. Such a compilation is called an "aggregate",
and this License does not apply to the other self-contained works thus compiled with
the Document, on account of their being thus compiled, if they are not themselves
derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these copies of the Doc-
ument, then if the Document is less than one quarter of the entire aggregate, the Docu-
ment’s Cover Texts may be placed on covers that surround only the Document within
the aggregate. Otherwise they must appear on covers around the whole aggregate.

8. TRANSLATION
Translation is considered a kind of modification, so you may distribute translations of
the Document under the terms of section 4. Replacing Invariant Sections with trans-
lations requires special permission from their copyright holders, but you may include
translations of some or all Invariant Sections in addition to the original versions of
these Invariant Sections. You may include a translation of this License provided that
you also include the original English version of this License. In case of a disagreement
between the translation and the original English version of this License, the original
English version will prevail.

9. TERMINATION
You may not copy, modify, sublicense, or distribute the Document except as expressly
provided for under this License. Any other attempt to copy, modify, sublicense or
distribute the Document is void, and will automatically terminate your rights under
this License. However, parties who have received copies, or rights, from you under this
License will not have their licenses terminated so long as such parties remain in full
compliance.

10. FUTURE REVISIONS OF THIS LICENSE


The Free Software Foundation may publish new, revised versions of the GNU Free
Documentation License from time to time. Such new versions will be similar in spirit
E GNU FREE DOCUMENTATION LICENSE 79

to the present version, but may differ in detail to address new problems or concerns.
See http://www.gnu.org/copyleft/.
Each version of the License is given a distinguishing version number. If the Docu-
ment specifies that a particular numbered version of this License "or any later version"
applies to it, you have the option of following the terms and conditions either of that
specified version or of any later version that has been published (not as a draft) by
the Free Software Foundation. If the Document does not specify a version number of
this License, you may choose any version ever published (not as a draft) by the Free
Software Foundation.
Index
ACELP, 43 perceptual enhancement, 9, 21, 36
acoustic echo cancellation, 18 pitch, 30
algorithmic delay, 10 preprocessor, 17
analysis-by-synthesis, 31
quadrature mirror filter, 38
API, 15
auto-correlation, 30 quality, 8
average bit-rate, 9, 22 RTP, 25
bit-rate, 37
sampling rate, 8
speexdec, 13
CELP, 6, 28
complexity, 6, 8, 36, 37 speexenc, 12
standards, 25
constant bit-rate, 8

discontinuous transmission, 9, 22 tail length, 18


DTMF, 8, 42
ultra-wideband, 8
echo cancellation, 18 variable bit-rate, 6, 8, 21
error weighting, 31
voice activity detection, 6, 9, 22
Vorbis, 40
fixed-point, 11

in-band signalling, 24 wideband, 6, 8, 38

Levinson-Durbin, 30
libspeex, 15
line spectral pair, 33
linear prediction, 28, 33

mean opinion score, 36


music, 41

narrowband, 6, 8, 33

Ogg, 26, 40
open-source, 6, 40

patent, 6, 40

80

You might also like