Manual
(version 1.2-beta1)
Jean-Marc Valin
Contents
1 Introduction to Speex
2 Codec description
2.1 Concepts
2.2 Codec
2.3 Preprocessor
2.4 Adaptive Jitter Buffer
2.5 Acoustic Echo Canceller
3 Compiling
4 Command-line encoder/decoder
4.1 speexenc
4.2 speexdec
7.6 Analysis-by-Synthesis
A FAQ
B Sample code
B.1 sampleenc.c
B.2 sampledec.c
D Speex License
List of Tables
1 In-band signalling codes
2 Ogg/Speex header packet
3 Bit allocation for narrowband modes
4 Quality versus bit-rate
5 Bit allocation for high-band in wideband mode
1 Introduction to Speex
The Speex project (http://www.speex.org/) was started because there was a
need for a speech codec that was open-source and free from software patents. These
are essential conditions for use in any open-source software. Vorbis already
handles general audio, but it is not really suitable for speech. Also, unlike
many other speech codecs, Speex is not targeted at cell phones but rather at voice over
IP (VoIP) and file-based compression.
As design goals, we wanted a codec that would allow both very good quality
speech and low bit-rate (unfortunately not at the same time!), which led us to develop
a codec with multiple bit-rates. Of course, very good quality also meant we had to
support wideband (16 kHz sampling rate) in addition to narrowband (telephone quality, 8
kHz sampling rate).
Designing for VoIP instead of cell phone use means that Speex must be robust to
lost packets, but not to corrupted ones since packets either arrive unaltered or don’t ar-
rive at all. Also, the idea was to have a reasonable complexity and memory requirement
without compromising too much on the efficiency of the codec.
All this led us to the choice of CELP as the encoding technique to use for Speex.
One of the main reasons is that CELP has long proved that it could do the job and
scale well to both low bit-rates (think DoD CELP @ 4.8 kbps) and high bit-rates (think
G.728 @ 16 kbps).
The main characteristics can be summarized as follows:
• Variable complexity
This document is organized as follows. Section 2 describes the different Speex
features and defines some terms that will be used in later sections. Section 4 provides
information about the standard command-line tools, while Section 5 contains information about
programming using the Speex API. Section 6 has some information related to Speex
and standards. The last three sections describe the internals of the codec and require
some signal processing knowledge. Section 7 explains the general idea behind CELP,
while Sections 8 and 9 are specific to Speex. Note that if you are only interested in
using Speex, those last three sections are not required.
2 Codec description
This section describes the main features provided by Speex.
2.1 Concepts
Here are some concepts in speech coding that help better understand the rest of the
manual. Emphasis is placed on the Speex features.
Sampling rate
Speex is mainly designed for 3 different sampling rates: 8 kHz, 16 kHz, and 32 kHz.
These are respectively referred to as narrowband, wideband and ultra-wideband. For a
sampling rate of Fs kHz, the highest frequency that can be represented is equal to Fs/2
kHz. This is a consequence of Nyquist’s sampling theorem (and Fs/2 is known as the
Nyquist frequency).
Quality
Speex encoding is controlled most of the time by a quality parameter that ranges from
0 to 10. In constant bit-rate (CBR) operation, the quality parameter is an integer, while
for variable bit-rate (VBR), the parameter is a float.
Complexity (variable)
With Speex, it is possible to vary the complexity allowed for the encoder. This is done
by controlling how the search is performed with an integer ranging from 1 to 10, in
a way that’s similar to the -1 to -9 options of the gzip and bzip2 compression utilities.
For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at
complexity 10, but the CPU requirements for complexity 10 are about 5 times higher
than for complexity 1. In practice, the best trade-off is between complexity 2 and 4,
though higher settings are often useful when encoding non-speech sounds like DTMF
tones.
while fricatives (e.g. s, f sounds) can be coded adequately with fewer bits. For this
reason, VBR can achieve a lower bit-rate for the same quality, or a better quality for a
certain bit-rate. Despite its advantages, VBR has two main drawbacks: first, by only
specifying quality, there’s no guarantee about the final average bit-rate. Second, for
some real-time applications like voice over IP (VoIP), what counts is the maximum
bit-rate, which must be low enough for the communication channel.
Perceptual enhancement
Perceptual enhancement is a part of the decoder which, when turned on, tries to reduce
(the perception of) the noise produced by the coding/decoding process. In most cases,
perceptual enhancement makes the sound further from the original objectively (if you
use SNR), but in the end it still sounds better (subjective improvement).
Algorithmic delay
Every speech codec introduces a delay in the transmission. For Speex, this delay is
equal to the frame size, plus some amount of “look-ahead” required to process each
frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16
kHz), the delay is 34 ms. These values don’t account for the CPU time it takes to
encode or decode the frames.
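The delay figures above can be checked with a little arithmetic. The sketch below assumes a 20 ms frame (160 samples at 8 kHz, 320 samples at 16 kHz); the look-ahead values of 80 and 224 samples are not stated in this section and are simply inferred from the 30 ms and 34 ms totals. The function names are hypothetical helpers, not part of the Speex API.

```c
#include <assert.h>

/* Convert a duration expressed in samples to milliseconds. */
static double samples_to_ms(int samples, int rate_hz)
{
    assert(rate_hz > 0);
    return 1000.0 * samples / rate_hz;
}

/* Algorithmic delay = frame size + look-ahead, both in samples. */
static double algorithmic_delay_ms(int frame_samples, int lookahead_samples,
                                   int rate_hz)
{
    return samples_to_ms(frame_samples + lookahead_samples, rate_hz);
}
```

For example, algorithmic_delay_ms(160, 80, 8000) gives 30 ms for narrowband and algorithmic_delay_ms(320, 224, 16000) gives 34 ms for wideband.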
2.2 Codec
2.3 Preprocessor
This part refers to the preprocessor module introduced in the 1.1.x branch. The prepro-
cessor is designed to be used on the audio before running the encoder. The preprocessor
provides three main functionalities:
• denoising
The denoiser can be used to reduce the amount of background noise present in the
input signal. This provides higher quality speech whether or not the denoised signal
is encoded with Speex (or at all). However, when using the denoised signal with the
codec, there is an additional benefit. Speech codecs in general (Speex included) tend to
perform poorly on noisy input, which tends to amplify the noise. The denoiser greatly
reduces this effect.
Automatic gain control (AGC) is a feature that deals with the fact that the record-
ing volume may vary by a large amount between different setups. The AGC provides a
way to adjust a signal to a reference volume. This is useful for voice over IP because it
removes the need for manual adjustment of the microphone gain. A secondary advan-
tage is that by setting the microphone gain to a conservative (low) level, it is easier to
avoid clipping.
The voice activity detector (VAD) provided by the preprocessor is more advanced
than the one directly provided in the codec.
3 Compiling
Compiling Speex under UNIX or any platform supported by autoconf (e.g. Win32/cygwin)
is as easy as typing:
% ./configure [options]
% make
% make install
--enable-valgrind Enable extra information when (and only when) running with valgrind
--enable-fixed-point Compile Speex for a processor that does not have a floating point
unit (FPU)
--enable-fixed-point-debug Use only for debugging the fixed-point code (very slow)
4 Command-line encoder/decoder
The base Speex distribution includes a command-line encoder (speexenc) and decoder
(speexdec). This section describes how to use these tools.
4.1 speexenc
The speexenc utility is used to create Speex files from raw PCM or wave files. It can
be used by calling:
The value ’-’ for input_file or output_file corresponds respectively to stdin and stdout.
The valid options are:
--narrowband (-n) Tell Speex to treat the input as narrowband (8 kHz). This is the
default
--wideband (-w) Tell Speex to treat the input as wideband (16 kHz)
--ultra-wideband (-u) Tell Speex to treat the input as “ultra-wideband” (32 kHz)
--nframes n Pack n frames in each Ogg packet (this saves space at low bit-rates)
--comp n Set encoding speed/quality tradeoff. The higher the value of n, the slower
the encoding (default is 3)
Speex comments
--comment Add the given string as an extra comment. This may be used multiple
times.
4.2 speexdec
The speexdec utility is used to decode Speex files and can be used by calling:
The value ’-’ for input_file or output_file corresponds respectively to stdin and stdout.
Also, when no output_file is specified, the file is played to the soundcard. The valid
options are:
5.1 Encoding
In order to encode speech using Speex, you first need to:
#include <speex/speex.h>
SpeexBits bits;
void *enc_state;
speex_bits_init(&bits);
enc_state = speex_encoder_init(&speex_nb_mode);
speex_encoder_ctl(enc_state,SPEEX_GET_FRAME_SIZE,&frame_size);
speex_bits_reset(&bits);
speex_encode_int(enc_state, input_frame, &bits);
nbBytes = speex_bits_write(&bits, byte_ptr, MAX_NB_BYTES);
nbBytes is the number of bytes actually written to byte_ptr (the encoded size in bytes).
Before calling speex_bits_write, it is possible to find the number of bytes that need to
be written by calling speex_bits_nbytes(&bits), which returns a number of bytes.
It is still possible to use the speex_encode() function, which takes a (float *) for the
audio. However, this would make an eventual port to an FPU-less platform (like ARM)
more complicated. Internally, speex_encode() and speex_encode_int() are processed in
the same way. Whether the encoder uses the fixed-point version is only decided by the
compile-time flags, not at the API level.
After you’re done with the encoding, free all resources with:
speex_bits_destroy(&bits);
speex_encoder_destroy(enc_state);
5.2 Decoding
In order to decode speech using Speex, you first need to:
#include <speex/speex.h>
SpeexBits bits;
void *dec_state;
speex_bits_init(&bits);
dec_state = speex_decoder_init(&speex_nb_mode);
There is also a parameter that can be set for the decoder: whether or not to use a
perceptual enhancer. This can be set by:
where enh is an int with value 0 to have the enhancer disabled and 1 to have it enabled.
As of 1.2-beta1, the default is now to enable the enhancer.
Again, once the decoder initialization is done, for every input frame:
where input_bytes is a (char *) containing the bit-stream data received for a frame,
nbBytes is the size (in bytes) of that bit-stream, and output_frame is a (short *) and
points to the area where the decoded speech frame will be written. A NULL value as
the first argument indicates that we don’t have the bits for the current frame. When a
frame is lost, the Speex decoder will do its best to "guess" the correct signal.
As for the encoder, the speex_decode() function can still be used, with a (float *) as
the output for the audio.
After you’re done with the decoding, free all resources with:
speex_bits_destroy(&bits);
speex_decoder_destroy(dec_state);
5.3 Preprocessor
In order to use the Speex preprocessor, you first need to:
#include <speex/speex_preprocess.h>
It is recommended to use the same value for frame_size as is used by the encoder (20
ms).
For each input frame, you need to call:
where audio_frame is used both as input and output and echo_residue is either an
array filled by the echo canceller, or NULL if the preprocessor is used without the echo
canceller.
In cases where the output audio is not useful for a certain frame, it is possible to
use instead:
This call will update all the preprocessor internal state variables without computing the
output audio, thus saving some CPU cycles.
The behaviour of the preprocessor can be changed using:
which is used in the same way as the encoder and decoder equivalent. Options are
listed in Section .
The preprocessor state can be destroyed using:
speex_preprocess_state_destroy(preprocess_state);
#include <speex/speex_echo.h>
where frame_size is the amount of data (in samples) you want to process at once and
filter_length is the length (in samples) of the echo cancelling filter you want to use
(also known as tail length). It is recommended to use a frame size in the order of 20
ms (or equal to the codec frame size) and make sure it is easy to perform an FFT of
that size (powers of two are better than prime sizes). The recommended tail length is
approximately the third of the room reverberation time. For example, in a small room,
reverberation time is in the order of 300 ms, so a tail length of 100 ms is a good choice
(800 samples at 8000 Hz sampling rate).
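As a quick check of the rule of thumb above, here is a minimal sketch that turns a reverberation time into a tail length in samples. The helper name is hypothetical and the one-third factor is just the recommendation from the text, not something enforced by the library.

```c
#include <assert.h>

/* Tail length in samples, taken as one third of the room reverberation time. */
static int tail_length_samples(int reverb_ms, int rate_hz)
{
    int tail_ms;
    assert(reverb_ms > 0 && rate_hz > 0);
    tail_ms = reverb_ms / 3;
    return tail_ms * rate_hz / 1000;
}
```

With a 300 ms reverberation time at 8000 Hz, this returns 800, which is the value one would pass as filter_length.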
Once the echo canceller state is created, audio can be processed by:
5 PROGRAMMING WITH SPEEX (THE LIBSPEEX API) 19
write_to_soundcard(echo_frame, frame_size);
read_from_soundcard(input_frame, frame_size);
speex_echo_cancel(echo_state, input_frame, echo_frame, output_frame, residue);
As stated above, if you wish to further reduce the echo present in the signal, you can
do so by passing residue as the last parameter of speex_preprocess() function (see
Section 5.3).
As of version 1.2-beta1, there is an alternative, simpler API that can be used instead
of speex_echo_cancel(). When audio capture and playback are handled asynchronously
(e.g. in different threads or using the poll() or select() system call), it can
be difficult to keep track of which input_frame goes with which echo_frame. Instead,
the playback context/thread can simply call:
speex_echo_playback(echo_state, echo_frame);
every time an audio frame is played. Then, the capture context/thread calls:
for every frame captured. Internally, speex_echo_playback() simply buffers the playback
frame so it can be used by speex_echo_capture() to call speex_echo_cancel().
When capture and playback are done synchronously, speex_echo_cancel() is still preferred
since it gives better control over the exact input/echo timing.
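To illustrate the buffering idea, the sketch below shows a small frame queue that a playback context could fill and a capture context could drain. It is a hypothetical illustration only: the type and function names are invented, it is not how libspeex implements the buffering internally, and it is not thread-safe (a real application would add locking or use a lock-free queue).

```c
#include <assert.h>
#include <string.h>

#define FRAME_SIZE 160    /* samples per frame (20 ms at 8 kHz) */
#define QUEUE_FRAMES 4    /* hypothetical queue depth */

typedef struct {
    short frames[QUEUE_FRAMES][FRAME_SIZE];
    int head;    /* index of the oldest queued frame */
    int count;   /* number of queued frames */
} PlaybackQueue;

static void queue_init(PlaybackQueue *q)
{
    q->head = 0;
    q->count = 0;
}

/* Playback side: remember the frame that was just played.
   If the queue is full, the oldest frame is dropped. */
static void queue_push(PlaybackQueue *q, const short *frame)
{
    int tail;
    assert(q != NULL && frame != NULL);
    if (q->count == QUEUE_FRAMES) {
        q->head = (q->head + 1) % QUEUE_FRAMES;
        q->count--;
    }
    tail = (q->head + q->count) % QUEUE_FRAMES;
    memcpy(q->frames[tail], frame, sizeof(q->frames[tail]));
    q->count++;
}

/* Capture side: fetch the oldest played frame to pair with the captured
   frame. Returns 1 on success, 0 if nothing has been played yet. */
static int queue_pop(PlaybackQueue *q, short *frame)
{
    assert(q != NULL && frame != NULL);
    if (q->count == 0)
        return 0;
    memcpy(frame, q->frames[q->head], sizeof(q->frames[q->head]));
    q->head = (q->head + 1) % QUEUE_FRAMES;
    q->count--;
    return 1;
}
```

The frame returned by queue_pop() plays the role of echo_frame, paired with the input_frame that was just captured.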
The echo cancellation state can be destroyed with:
speex_echo_state_destroy(echo_state);
It is also possible to reset the state of the echo canceller so it can be reused without the
need to create another state with:
speex_echo_state_reset(echo_state);
5.4.1 Troubleshooting
There are several things that may prevent the echo canceller from working properly.
One of them is a bug (or something suboptimal) in the code, but there are many others
you should consider first:
• Using a different soundcard to do the capture and playback will *not* work, regardless
of what you may think. The only exception is if the two cards
can be made to have their sampling clocks “locked” to the same clock source.
• The delay between the record and playback signals must be minimal. Any signal
played has to “appear” on the playback (far end) signal slightly before the echo
canceller “sees” it in the near end signal, but excessive delay means that part
of the filter length is wasted. In the worst situations, the delay is such that it is
longer than the filter length, in which case, no echo can be cancelled.
• When it comes to echo tail length (filter length), longer is *not* better. Actually,
the longer the tail length, the longer it takes for the filter to adapt. Of course, a
tail length that is too short will not cancel enough echo, but the most common
problem seen is that people set a very long tail length and then wonder why no
echo is being cancelled.
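The interaction between delay and filter length in the list above can be made concrete with a small helper. This is only an illustrative sketch with a hypothetical function name: it computes how much of the filter is left to model the echo once the playback-to-capture delay has consumed part of it.

```c
#include <assert.h>

/* Portion of the filter (in samples) still available to model the echo.
   Returns 0 when the delay is at least as long as the filter, in which
   case no echo can be cancelled. */
static int usable_tail_samples(int filter_length, int delay_samples)
{
    assert(filter_length >= 0 && delay_samples >= 0);
    if (delay_samples >= filter_length)
        return 0;
    return filter_length - delay_samples;
}
```

For instance, with an 800-sample filter and a 160-sample delay, only 640 samples (80 ms at 8 kHz) are left to cover the actual echo tail.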
Also useful is reading Echo Cancellation Demystified by Alexey Frunze (1), which
explains the fundamental principles of echo cancellation. The details of the algorithm
described in the article are different, but the general ideas of echo cancellation through
adaptive filters are the same.
(1) http://www.embeddedstar.com/articles/2003/7/article20030720-1.html
The different values of request allowed are (note that some only apply to the encoder
or the decoder):
SPEEX_GET_FRAME_SIZE Get the frame size used for the current mode (integer)
SPEEX_SET_MODE*†
SPEEX_GET_MODE*†
SPEEX_SET_LOW_MODE*†
SPEEX_GET_LOW_MODE*†
SPEEX_SET_HIGH_MODE*†
SPEEX_GET_HIGH_MODE*†
SPEEX_SET_COMPLEXITY* Set the CPU resources allowed for the encoder (in-
teger 1 to 10)
SPEEX_GET_COMPLEXITY* Get the CPU resources allowed for the encoder (in-
teger 1 to 10)
SPEEX_SET_BITRATE* Set the bit-rate to use to the closest value not exceeding
the parameter (integer in bps)
SPEEX_SET_VAD* Set voice activity detection (VAD) to on (1) or off (0) (integer)
SPEEX_SET_ABR* Set average bit-rate (ABR) to a value n in bits per second (inte-
ger in bps)
SPEEX_GET_PLC_TUNING* Get the current tuning of the encoder for PLC (inte-
ger in percent)
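The "closest value not exceeding the parameter" rule described for SPEEX_SET_BITRATE can be sketched as a simple table lookup. The table below is an assumption (a plausible set of narrowband bit-rates, only some of which are mentioned in this manual), and the fallback to the lowest rate is a choice made for this sketch, not documented library behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed narrowband bit-rates in bps, sorted in ascending order. */
static const int nb_bitrates[] = {
    2150, 3950, 5950, 8000, 11000, 15000, 18200, 24600
};

/* Return the highest table entry that does not exceed target_bps, or the
   lowest entry if the target is below every available rate. */
static int closest_bitrate_not_exceeding(int target_bps)
{
    size_t n = sizeof(nb_bitrates) / sizeof(nb_bitrates[0]);
    size_t i;
    int best = nb_bitrates[0];
    assert(target_bps > 0);
    for (i = 0; i < n; i++) {
        if (nb_bitrates[i] <= target_bps)
            best = nb_bitrates[i];
    }
    return best;
}
```

Under these assumptions, a request for 12000 bps would select 11000 bps, while a request for exactly 18200 bps would be honoured as is.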
The admissible values for request are (unless otherwise noted, the values are returned
through ptr):
SPEEX_MODE_FRAME_SIZE Get the frame size (in samples) for the mode
SPEEX_PREPROCESS_SET_AGC_LEVEL
SPEEX_PREPROCESS_GET_AGC_LEVEL
SPEEX_PREPROCESS_SET_DEREVERB_LEVEL
SPEEX_PREPROCESS_GET_DEREVERB_LEVEL
SPEEX_PREPROCESS_SET_DEREVERB_DECAY
SPEEX_PREPROCESS_GET_DEREVERB_DECAY
the Internet Engineering Task Force (IETF) and will be discussed at the March 18th
meeting in San Francisco.
1. The use of a linear prediction (LP) model to model the vocal tract
2. The use of (adaptive and fixed) codebook entries as input (excitation) of the LP
model
This section describes the basic ideas behind CELP. This is still a work in progress.
\[ y[n] = \sum_{i=1}^{N} a_i x[n-i] \]
where $y[n]$ is the linear prediction of $x[n]$. The prediction error is thus given by:
\[ e[n] = x[n] - y[n] = x[n] - \sum_{i=1}^{N} a_i x[n-i] \]
The goal of the LPC analysis is to find the best prediction coefficients $a_i$ which
minimize the quadratic error function:
\[ E = \sum_{n=0}^{L-1} \left[ e[n] \right]^2 = \sum_{n=0}^{L-1} \left[ x[n] - \sum_{i=1}^{N} a_i x[n-i] \right]^2 \]
That can be done by making all derivatives $\partial E / \partial a_i$ equal to zero:
\[ \frac{\partial E}{\partial a_i} = \frac{\partial}{\partial a_i} \sum_{n=0}^{L-1} \left[ x[n] - \sum_{i=1}^{N} a_i x[n-i] \right]^2 = 0 \]
For an order N filter, the filter coefficients ai are found by solving the system N × N
7 INTRODUCTION TO CELP CODING 30
\[ r = \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(N) \end{bmatrix} \]
\[ R(m) = \sum_{i=0}^{N-1} x[i] x[i-m] \]
all the roots of A(z) are within the unit circle, which means that 1/A(z) is always stable.
This is in theory; in practice because of finite precision, there are two commonly used
techniques to make sure we have a stable filter. First, we multiply R(0) by a number
slightly above one (such as 1.0001), which is equivalent to adding noise to the signal.
Also, we can apply a window to the auto-correlation, which is equivalent to filtering in
the frequency domain, reducing sharp resonances.
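To make the LPC analysis more concrete, here is a minimal sketch that computes the auto-correlation of a frame and solves the resulting Toeplitz system with the Levinson-Durbin recursion, a standard technique for this problem. It also applies the 1.0001 scaling of R(0) mentioned above. The function names and the MAX_ORDER limit are assumptions for illustration; this is not the libspeex implementation, and the lag window is left out for brevity.

```c
#include <assert.h>

#define MAX_ORDER 16   /* hypothetical upper bound on the filter order */

/* Auto-correlation R(m) = sum over n of x[n] * x[n-m], for m = 0..order. */
static void autocorr(const float *x, int len, int order, double *R)
{
    int m, n;
    for (m = 0; m <= order; m++) {
        double acc = 0.0;
        for (n = m; n < len; n++)
            acc += (double)x[n] * x[n - m];
        R[m] = acc;
    }
}

/* Levinson-Durbin recursion. On return a[0..order-1] holds a_1..a_N, the
   coefficients of the predictor y[n] = sum a_i x[n-i].
   Returns the final prediction error energy. */
static double levinson_durbin(const double *R, int order, double *a)
{
    double tmp[MAX_ORDER];
    double err = R[0];
    int i, j;
    assert(order >= 1 && order <= MAX_ORDER);
    for (i = 0; i < order; i++) {
        double k, acc;
        if (err <= 0.0) {            /* silent frame: nothing left to predict */
            for (j = i; j < order; j++)
                a[j] = 0.0;
            break;
        }
        acc = R[i + 1];
        for (j = 0; j < i; j++)
            acc -= a[j] * R[i - j];
        k = acc / err;               /* reflection coefficient */
        for (j = 0; j < i; j++)
            tmp[j] = a[j] - k * a[i - 1 - j];
        for (j = 0; j < i; j++)
            a[j] = tmp[j];
        a[i] = k;
        err *= (1.0 - k * k);
    }
    return err;
}

/* LPC analysis of one frame, with R(0) scaled by 1.0001 for stability. */
static double lpc_from_signal(const float *x, int len, int order, double *a)
{
    double R[MAX_ORDER + 1];
    assert(order >= 1 && order <= MAX_ORDER);
    autocorr(x, len, order, R);
    R[0] *= 1.0001;
    return levinson_durbin(R, order, a);
}
```

As a sanity check, a decaying exponential x[n] = 0.9^n is almost perfectly predicted from its previous sample, so an order-2 analysis should give a first coefficient close to 0.9 and a second one close to zero.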
where T is the pitch period, β is the pitch gain. We call that long-term prediction
since the excitation is predicted from e[n − T ] with T ≫ N.
The quantization of c[n] is where most of the bits in a CELP codec are allocated. It
represents the information that couldn’t be obtained either from linear prediction or
pitch prediction. In the z-domain we can represent the final signal X(z) as
\[ X(z) = \frac{C(z)}{A(z) \left( 1 - \beta z^{-T} \right)} \]
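The factor 1/(1 − βz^−T) in the expression above corresponds, in the time domain, to the recursion e[n] = c[n] + β e[n − T]. The sketch below builds the excitation that way for a single-tap predictor. The function name is hypothetical, and samples before the start of the buffer are assumed to be zero for simplicity (a real codec would keep the excitation memory from previous frames).

```c
#include <assert.h>

/* Build the excitation e[n] = c[n] + beta * e[n - T] for n = 0..len-1.
   Samples before the start of the buffer are treated as zero. */
static void pitch_synthesis(const float *c, int len, int T, float beta,
                            float *e)
{
    int n;
    assert(T > 0);
    for (n = 0; n < len; n++) {
        float past = (n >= T) ? e[n - T] : 0.0f;
        e[n] = c[n] + beta * past;
    }
}
```

Feeding a single impulse through this recursion produces a train of pulses spaced T samples apart and decaying by β each time, which is the periodic structure of voiced speech that the long-term predictor is meant to capture.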
\[ W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} \tag{1} \]
where γ1 = 0.9 and γ2 = 0.6 in the Speex reference implementation. If a filter A(z) has
(complex) poles at pi in the z-plane, the filter A(z/γ) will have its poles at p′i = γpi ,
making it a flatter version of A(z).
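Computing A(z/γ) from A(z) only requires scaling each coefficient: if A(z) = 1 − Σ a_i z^−i, then A(z/γ) = 1 − Σ a_i γ^i z^−i. The following sketch applies that scaling; the function name is an assumption, and the same routine would be called once with γ1 and once with γ2 to obtain the numerator and denominator of W(z).

```c
#include <assert.h>

/* Replace each coefficient a_i (i = 1..order, stored at a[i-1]) by
   a_i * gamma^i, which gives the coefficients of A(z/gamma). */
static void bandwidth_expand(const float *a, int order, float gamma,
                             float *out)
{
    int i;
    float g = gamma;
    assert(order >= 0);
    for (i = 0; i < order; i++) {
        out[i] = a[i] * g;
        g *= gamma;
    }
}
```

Because γ is smaller than one, higher-order coefficients are attenuated more strongly, which moves the roots towards the origin and flattens the response.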
The weighting filter is applied to the error signal used to optimize the codebook
search through analysis-by-synthesis (AbS). This results in a spectral shape of the noise
that tends towards 1/W (z). While the simplicity of the model has been an important
reason for the success of CELP, it remains that W (z) is a very rough approximation for
the perceptually optimal noise weighting function. Fig. 2 illustrates the noise shaping
that results from Eq. 1. Throughout this paper, we refer to W (z) as the noise weighting
filter and to 1/W (z) as the noise shaping filter (or curve).
[Figure 2: Response (dB) versus Frequency (Hz, 0 to 4000). Curves: speech signal, LPC synthesis filter, reference shaping.]
7.6 Analysis-by-Synthesis
One of the main principles behind CELP is called Analysis-by-Synthesis (AbS), mean-
ing that the encoding (analysis) is performed by perceptually optimising the decoded
(synthesis) signal in a closed loop. In theory, the best CELP stream would be pro-
duced by trying all possible bit combinations and selecting the one that produces the
best-sounding decoded signal. This is obviously not possible in practice for two rea-
sons: the required complexity is beyond any currently available hardware and the “best
sounding” selection criterion implies a human listener.
In order to achieve real-time encoding using limited computing resources, the CELP
optimisation is broken down into smaller, more manageable, sequential searches using
the perceptual weighting function described earlier.
8 SPEEX NARROWBAND MODE 33
• Minimizing the amount of information extracted from past frames (for robust-
ness to packet loss)
gain, Speex uses an integer to encode the pitch period, but uses a 3-tap predictor (3
gains). The adaptive codebook contribution ea [n] can thus be expressed as:
where g0 , g1 and g2 are the jointly quantized pitch gains and e[n] is the codec excitation
memory. It is worth noting that when the pitch is smaller than the sub-frame size, we
repeat the excitation at a period T . For example, when n − T + 1 ≥ 0, we use n − 2T + 1
instead. In most modes, the pitch period is encoded with 7 bits in the [17, 144] range
and the βi coefficients are vector-quantized using 7 bits at higher bit-rates (15 kbps
narrowband and above) and 5 bits at lower bit-rates (11 kbps narrowband and below).
Many current CELP codecs use moving average (MA) prediction to encode the
fixed codebook gain. This provides slightly better coding at the expense of introducing
a dependency on previously encoded frames. A second difference is that Speex encodes
the fixed codebook gain as the product of the global excitation gain g_frame with a
sub-frame gain correction g_subf. This increases robustness to packet loss by eliminating
the inter-frame dependency. The sub-frame gain correction is encoded before the fixed
codebook is searched (not closed-loop optimized) and uses between 0 and 3 bits per
sub-frame, depending on the bit-rate.
The third difference is that Speex uses sub-vector quantization of the innovation
(fixed codebook) signal instead of an algebraic codebook. Each sub-frame is divided
into sub-vectors of lengths ranging between 5 and 20 samples. Each sub-vector is
chosen from a bitrate-dependent codebook and all sub-vectors are concatenated to form
a sub-frame. As an example, the 3.95 kbps mode uses a sub-vector size of 20 samples
with 32 entries in the codebook (5 bits). This means that the innovation is encoded
with 10 bits per sub-frame, or 2000 bps. On the other hand, the 18.2 kbps mode uses a
sub-vector size of 5 samples with 256 entries in the codebook (8 bits), so the innovation
uses 64 bits per sub-frame, or 12800 bps.
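The bit counts quoted above follow from simple arithmetic, sketched below. The 40-sample sub-frame, the 4 sub-frames per frame and the 50 frames per second are assumptions (they are consistent with the figures in the text and with a 20 ms frame at 8 kHz, but are not stated in this paragraph). The function names are hypothetical.

```c
#include <assert.h>

/* Bits spent on the innovation in one sub-frame: one codebook index per
   sub-vector. */
static int innovation_bits_per_subframe(int subframe_size, int subvector_size,
                                        int bits_per_index)
{
    assert(subvector_size > 0 && subframe_size % subvector_size == 0);
    return (subframe_size / subvector_size) * bits_per_index;
}

/* Resulting innovation bit-rate in bits per second. */
static int innovation_bps(int bits_per_subframe, int subframes_per_frame,
                          int frames_per_second)
{
    return bits_per_subframe * subframes_per_frame * frames_per_second;
}
```

For the 3.95 kbps mode this gives 2 sub-vectors of 5 bits, hence 10 bits per sub-frame and 10 × 4 × 50 = 2000 bps; for the 18.2 kbps mode it gives 8 sub-vectors of 8 bits, hence 64 bits per sub-frame and 12800 bps.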
parameters. The parameters for a certain sub-frame are all packed before the following
sub-frame is packed. Note that the “OL” in the parameter description means that the
parameter is an open loop estimation based on the whole frame.
So far, no MOS (Mean Opinion Score) subjective evaluation has been performed
for Speex. In order to give an idea of the quality achievable with it, table 4 presents my
own subjective opinion on it. It should be noted that different people will perceive the
quality differently and that the person who designed the codec often has a bias (one way
or another) when it comes to subjective evaluation. Finally, it should be noted that
for most codecs (including Speex) encoding quality sometimes varies depending on
the input. Note that the complexity is only approximate (within 0.5 mflops and using
the lowest complexity setting). Decoding requires approximately 0.5 mflops in most
modes (1 mflops with perceptual enhancement).
where $a_1$ and $a_2$ depend on the mode in use and
\[ a_3 = \frac{1}{r} \left( 1 - \frac{1 - r a_1}{1 - r a_2} \right) \]
with $r = 0.9$. The second part of the enhancement consists of using a comb filter to
enhance the pitch in the excitation domain.
9 SPEEX WIDEBAND MODE (SUB-BAND CELP) 38
A FAQ
Vorbis is open-source and patent-free; why do we need Speex?
Vorbis is a great project but its goals are not the same as Speex. Vorbis is mostly aimed
at compressing music and audio in general, while Speex targets speech only. For that
reason Speex can achieve much better results than Vorbis on speech, typically 2-4 times
higher compression at equal quality.
B Sample code
This section shows sample code for encoding and decoding speech using the Speex
API. The commands can be used to encode and decode a file by calling:
% sampleenc in_file.sw | sampledec out_file.sw
where both files are raw (no header) files encoded at 16 bits per sample (in the machine
natural endianness).
B.1 sampleenc.c
sampleenc takes a raw 16 bits/sample file, encodes it and outputs a Speex stream to
stdout. Note that the packing used is NOT compatible with that of speexenc/speexdec.
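#include <speex/speex.h>
#include <stdio.h>
/*The frame size is hardcoded for this sample code but it doesn’t have to be*/
#define FRAME_SIZE 160
int main(int argc, char **argv)
{
   char *inFile;
   FILE *fin;
   short in[FRAME_SIZE];
   float input[FRAME_SIZE];
   char cbits[200];
   int nbBytes;
   /*Holds the state of the encoder*/
   void *state;
   /*Holds bits so they can be read and written to by the Speex routines*/
   SpeexBits bits;
   int i, tmp;
   if (argc < 2) return 1;
   /*Create a new encoder state in narrowband mode*/
   state = speex_encoder_init(&speex_nb_mode);
   /*Set the quality to 8 (15 kbps)*/
   tmp = 8;
   speex_encoder_ctl(state, SPEEX_SET_QUALITY, &tmp);
   inFile = argv[1];
   fin = fopen(inFile, "rb");
   if (fin == NULL) return 1;
   /*Initialization of the structure that holds the bits*/
   speex_bits_init(&bits);
   while (1)
   {
      /*Read a 16 bits/sample audio frame*/
      fread(in, sizeof(short), FRAME_SIZE, fin);
      if (feof(fin))
         break;
      /*Copy the 16 bits values to float so Speex can work on them*/
      for (i = 0; i < FRAME_SIZE; i++)
         input[i] = in[i];
      /*Flush all the bits in the struct so we can encode a new frame*/
      speex_bits_reset(&bits);
      /*Encode the frame*/
      speex_encode(state, input, &bits);
      /*Copy the bits to an array of char that can be written*/
      nbBytes = speex_bits_write(&bits, cbits, 200);
      /*Write the size of the frame first. This is what sampledec expects but
        it’s likely to be different in your own application*/
      fwrite(&nbBytes, sizeof(int), 1, stdout);
      /*Write the compressed data*/
      fwrite(cbits, 1, nbBytes, stdout);
   }
   /*Destroy the encoder state and the bit-packing struct*/
   speex_encoder_destroy(state);
   speex_bits_destroy(&bits);
   fclose(fin);
   return 0;
}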
B.2 sampledec.c
sampledec reads a Speex stream from stdin, decodes it and outputs it to a raw 16
bits/sample file. Note that the packing used is NOT compatible with that of speex-
enc/speexdec.
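#include <speex/speex.h>
#include <stdio.h>
/*The frame size is hardcoded for this sample code but it doesn’t have to be*/
#define FRAME_SIZE 160
int main(int argc, char **argv)
{
   char *outFile;
   FILE *fout;
   /*Holds the audio that will be written to file (16 bits per sample)*/
   short out[FRAME_SIZE];
   /*Speex handles samples as float, so we need an array of floats*/
   float output[FRAME_SIZE];
   char cbits[200];
   int nbBytes;
   /*Holds the state of the decoder*/
   void *state;
   /*Holds bits so they can be read and written to by the Speex routines*/
   SpeexBits bits;
   int i, tmp;
   outFile = argv[1];
   fout = fopen(outFile, "wb");
   state = speex_decoder_init(&speex_nb_mode);
   tmp = 1;
   speex_decoder_ctl(state, SPEEX_SET_ENH, &tmp);
   speex_bits_init(&bits);
   /*Each packet is a byte count followed by the encoded frame*/
   while (fread(&nbBytes, sizeof(int), 1, stdin) == 1)
   {
      fread(cbits, 1, nbBytes, stdin);
      speex_bits_read_from(&bits, cbits, nbBytes);
      /*Decode one frame, then convert from float to 16-bit samples*/
      speex_decode(state, &bits, output);
      for (i = 0; i < FRAME_SIZE; i++) out[i] = output[i];
      fwrite(out, sizeof(short), FRAME_SIZE, fout);
   }
   speex_decoder_destroy(state);
   speex_bits_destroy(&bits);
   fclose(fout);
   return 0;
}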
draft-herlein-speex-rtp-profile-02
RTP Payload Format for the Speex Codec
Copyright Notice
Abstract
Table of Contents
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1].
Speex is based on the CELP [10] encoding technique with support for
either narrowband (nominal 8kHz), wideband (nominal 16kHz) or
ultra-wideband (nominal 32kHz), and (non-optimal) rates up to 48 kHz
sampling also available. The main characteristics can be summarized
as follows:
o Free software/open-source
o Integration of wideband and narrowband in the same bit-stream
o Wide range of bit-rates available
o Dynamic bit-rate switching and variable bit-rate (VBR)
o Voice Activity Detection (VAD, integrated with VBR)
o Variable complexity
For RTP based transportation of Speex encoded audio the standard RTP
header [2] is followed by one or more payload data blocks. An
optional padding terminator may also be used.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RTP Header |
C IETF RTP PROFILE 52
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| one or more frames of Speex .... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| one or more frames of Speex .... | padding |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
4. RTP Header
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X| CC |M| PT | sequence number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| timestamp |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| synchronization source (SSRC) identifier |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| contributing source (CSRC) identifiers |
| ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The RTP header begins with an octet of fields (V, P, X, and CC) to
support specialized RTP uses (see [2] and [7] for details). For
Speex the following values are used.
This field identifies the version of RTP. The version used by this
If the padding bit is set, the packet contains one or more additional
padding octets at the end which are not part of the payload. P is
set if the total packet size is less than the MTU.
The M bit indicates if the packet contains comfort noise. This field
is used in conjunction with the cng SDP attribute and is detailed
further in section 5 below. In normal usage this bit is set if the
packet contains comfort noise.
The sequence number increments by one for each RTP data packet sent,
and may be used by the receiver to detect packet loss and to restore
packet sequence. This field is detailed further in [2].
Timestamp: 32 bits
SSRC/CSRC identifiers:
These two fields, 32 bits each with one SSRC field and a maximum of
16 CSRC fields, are as defined in [2].
5. Speex payload
An RTP packet MAY contain Speex frames of the same bit rate or of
varying bit rates, since the bit-rate for a frame is conveyed in band
The encoding and decoding algorithm can change the bit rate at any 20
msec frame boundary, with the bit rate change notification provided
in-band with the bit stream. Each frame contains both "mode"
(narrowband, wideband or ultra-wideband) and "sub-mode" (bit-rate)
information in the bit stream. No out-of-band notification is
required for the decoder to process changes in the bit rate sent by
the encoder.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| contributing source (CSRC) identifiers |
| ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ..speex data.. |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ..speex data.. |0 1 1 1 1|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Speex codecs [9] are able to detect the bitrate from the payload
and are responsible for detecting the 20 msec boundaries between each
frame.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Full definition of the MIME [3] type for Speex will be part of the
Ogg Vorbis MIME type definition application [8].
Optional parameters:
Encoding considerations:
Security Considerations:
Published specification:
Author/Change controller:
Ogg Stream in accordance with [8], with the exception that the
content of the Ogg Stream may be assumed to be Speex audio and Speex
audio only.
When conveying information by SDP [4], the encoding name MUST be set
to "speex". An example of the media representation in SDP for
offering a single channel of Speex at 8000 samples per second might
be:
Note that the RTP payload type code of 97 is defined in this media
definition to be ’mapped’ to the speex codec at an 8kHz sampling
frequency using the ’a=rtpmap’ line. Any number from 96 to 127 could
have been chosen (the allowed range for dynamic types).
The value of the sampling frequency is typically 8000 for narrow band
operation, 16000 for wide band operation, and 32000 for ultra-wide
band operation.
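To make the mapping between payload type, encoding name and sampling frequency concrete, the following C sketch formats an 'a=rtpmap' line. The helper speex_rtpmap and the enum speex_band are hypothetical names introduced only for illustration; the "speex" encoding name, the 96 to 127 dynamic range and the three sampling rates are taken from the text above.

```c
#include <stdio.h>

/* Sampling rates named in the text for each Speex band (hypothetical enum). */
enum speex_band {
    SPEEX_NARROW    = 8000,
    SPEEX_WIDE      = 16000,
    SPEEX_ULTRAWIDE = 32000
};

/* Hypothetical helper: writes "a=rtpmap:<pt> speex/<rate>" into buf.
 * Returns the number of characters written, or -1 if pt is outside the
 * dynamic payload type range (96..127) or the buffer is too small. */
static int speex_rtpmap(char *buf, size_t len, int pt, enum speex_band band)
{
    int n;

    if (pt < 96 || pt > 127)
        return -1;
    n = snprintf(buf, len, "a=rtpmap:%d speex/%d", pt, (int)band);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```

Called with payload type 97 and SPEEX_NARROW, the helper produces the line "a=rtpmap:97 speex/8000", which matches the narrowband case described above.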
If for some reason the offerer has bandwidth limitations, the client
may use the "b=" header, as explained in SDP [4]. The following
example illustrates the case where the offerer cannot receive more
than 10 kbit/s.
Examples:
a=fmtp:97 mode=any;penh=1
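A receiver has to pick individual parameters out of such a list. The sketch below, in C, looks up one key in a semicolon-separated "key=value" string such as the part of the a=fmtp line that follows the payload type. The function name fmtp_get is hypothetical, and the sketch assumes there is no whitespace around the separators.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the value of `key` from a "k1=v1;k2=v2"
 * list into out (NUL-terminated). Returns 1 if the key was found and its
 * value fits in out, 0 otherwise. A trailing ';' is tolerated. */
static int fmtp_get(const char *params, const char *key,
                    char *out, size_t out_len)
{
    size_t klen = strlen(key);
    const char *p = params;

    while (*p != '\0') {
        const char *end = strchr(p, ';');
        size_t item_len = end ? (size_t)(end - p) : strlen(p);

        /* An item matches when it starts with "<key>=". */
        if (item_len > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t vlen = item_len - klen - 1;
            if (vlen >= out_len)
                return 0;
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            return 1;
        }
        p += item_len;          /* skip this item */
        if (*p == ';')
            p++;                /* and its separator */
    }
    return 0;
}
```

For the example above, looking up "mode" in "mode=any;penh=1" yields "any" and looking up "penh" yields "1"; a key that is absent returns 0, in which case the caller would fall back to the default value.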
The offerer may indicate that it wishes to send variable bit rate
frames with comfort noise:
In the example below the ptime value is set to 40, indicating that
there are 2 frames in each packet.
Note that the ptime parameter applies to all payloads listed in the
media line and is not used as part of an a=fmtp directive.
Care must be taken when setting the value of ptime so that the RTP
packet size does not exceed the path MTU.
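The relationship between ptime, the number of frames per packet and the resulting packet size can be sketched as follows in C. All names are hypothetical. The sketch assumes 20 msec frames as stated earlier, a fixed 12-octet RTP header with no CSRC entries or extensions, and UDP over IPv4 without options; the number of bits per frame depends on the mode in use and is supplied by the caller.

```c
#define SPEEX_FRAME_MS    20  /* frame duration stated in the text */
#define RTP_HEADER_BYTES  12  /* assumption: no CSRC list, no extension */
#define UDP_HEADER_BYTES   8
#define IPV4_HEADER_BYTES 20  /* assumption: IPv4 without options */

/* Frames carried in one packet for a given ptime, or -1 if ptime is not
 * a positive multiple of the frame duration. */
static int speex_frames_per_packet(int ptime_ms)
{
    if (ptime_ms <= 0 || ptime_ms % SPEEX_FRAME_MS != 0)
        return -1;
    return ptime_ms / SPEEX_FRAME_MS;
}

/* Total packet size in octets: headers plus the frames packed back to
 * back and padded up to an octet boundary. Returns -1 on bad input. */
static long speex_packet_bytes(int ptime_ms, long bits_per_frame)
{
    int frames = speex_frames_per_packet(ptime_ms);
    long payload_bits;

    if (frames < 0 || bits_per_frame <= 0)
        return -1;
    payload_bits = frames * bits_per_frame;
    return IPV4_HEADER_BYTES + UDP_HEADER_BYTES + RTP_HEADER_BYTES
         + (payload_bits + 7) / 8;
}

/* Non-zero if a packet built with these settings fits in mtu_bytes. */
static int speex_fits_mtu(int ptime_ms, long bits_per_frame, long mtu_bytes)
{
    long size = speex_packet_bytes(ptime_ms, bits_per_frame);
    return size > 0 && size <= mtu_bytes;
}
```

With ptime set to 40 the first function returns 2, in line with the example above. At 160 bits per frame (8 kbit/s) the two frames occupy 40 octets, so the whole packet is 80 octets under these header assumptions.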
For Speex use in H.245 [6] based systems, the fields in the
NonStandardMessage should be:
t35CountryCode = Hex: B5
t35Extension = Hex: 00
manufacturerCode = Hex: 0026
[Length of the Binary Sequence (8 bit number)]
[Binary Sequence consisting of an ASCII string, no NULL
terminator]
The binary sequence is an ASCII string merely for ease of use. The
string is not null terminated. The format of this string is
The optional variables are identical to those used for the SDP a=fmtp
strings discussed in section 5 above. The string is built to be all
on one line, each key-value pair separated by a semi-colon. The
optional variables MAY be omitted, which causes the default values to
be assumed. They are:
ebw=narrow;mode=3;vbr=off;cng=off;ptime=20;sr=8000;penh=no;
The fifth octet of the block is the length of the binary sequence.
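The block layout described above can be sketched in C as follows. The function name speex_h245_block is hypothetical. The sketch assumes that the two octets of manufacturerCode are written most significant octet first, which the text does not state explicitly; the other values and the position of the length octet come from the field list above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: builds the NonStandardMessage block in out.
 *   octet 1      t35CountryCode   = 0xB5
 *   octet 2      t35Extension     = 0x00
 *   octets 3-4   manufacturerCode = 0x0026 (byte order is an assumption)
 *   octet 5      length of the ASCII string (8-bit number)
 *   octets 6..   the ASCII string, without a NUL terminator
 * Returns the total block length, or -1 if the string is longer than
 * 255 octets or out is too small. */
static int speex_h245_block(unsigned char *out, size_t out_len,
                            const char *params)
{
    size_t n = strlen(params);

    if (n > 255 || out_len < 5 + n)
        return -1;
    out[0] = 0xB5;
    out[1] = 0x00;
    out[2] = 0x00;
    out[3] = 0x26;
    out[4] = (unsigned char)n;
    memcpy(out + 5, params, n);   /* copied without the terminating NUL */
    return (int)(5 + n);
}
```

Passing the default string shown above would produce a block whose fifth octet holds the string length, followed directly by the characters of the string.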
14. Acknowledgments
The authors would like to thank Equivalence Pty Ltd of Australia for
their assistance in attempting to standardize the use of Speex in
H.323 applications, and for implementing Speex in their open source
OpenH323 stack. The authors would also like to thank Brian C. Wiles
<[email protected]> of StreamComm for his assistance in developing
the proposed standard for Speex use in H.323 applications.
The authors would also like to thank the following members of the
Speex and AVT communities for their input: Ross Finlayson, Federico
Montesino Pouzols, Henning Schulzrinne, Magnus Westerlund.
15. References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", RFC 2119.
[7] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video
Authors’ Addresses
Greg Herlein
2034 Filbert Street
San Francisco, California 94123
United States
EMail: [email protected]
Simon Morlat
35, av de Vizille App 42
Grenoble 38000
France
EMail: [email protected]
Jean-Marc Valin
Department of Electrical and Computer Engineering
University of Sherbrooke
2500 blvd Universite
Sherbrooke, Quebec J1K 2R1
Canada
EMail: [email protected]
Roger Hardiman
49 Nettleton Road
Cheltenham, Gloucestershire GL51 6NR
England
EMail: [email protected]
Phil Kerr
England
EMail: [email protected]
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
[email protected].
Disclaimer of Validity
Copyright Statement
Acknowledgment
D Speex License
Redistribution and use in source and binary forms, with or without modification, are
permitted provided that the following conditions are met:
• Redistributions of source code must retain the above copyright notice, this list of
conditions and the following disclaimer.
• Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or other
materials provided with the distribution.
• Neither the name of the Xiph.org Foundation nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.
0. PREAMBLE
The purpose of this License is to make a manual, textbook, or other written document
"free" in the sense of freedom: to assure everyone the effective freedom to copy and
redistribute it, with or without modifying it, either commercially or noncommercially.
Secondarily, this License preserves for the author and publisher a way to get credit for
their work, while not being considered responsible for modifications made by others.
This License is a kind of "copyleft", which means that derivative works of the
document must themselves be free in the same sense. It complements the GNU General
Public License, which is a copyleft license designed for free software.
We have designed this License in order to use it for manuals for free software,
because free software needs free documentation: a free program should come with
manuals providing the same freedoms that the software does. But this License is not
limited to software manuals; it can be used for any textual work, regardless of subject
matter or whether it is published as a printed book. We recommend this License
principally for works whose purpose is instruction or reference.
is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.)
The relationship could be a matter of historical connection with the subject or
with related matters, or of legal, commercial, philosophical, ethical or political position
regarding them.
The "Invariant Sections" are certain Secondary Sections whose titles are designated,
as being those of Invariant Sections, in the notice that says that the Document is
released under this License.
The "Cover Texts" are certain short passages of text that are listed, as Front-Cover
Texts or Back-Cover Texts, in the notice that says that the Document is released under
this License.
A "Transparent" copy of the Document means a machine-readable copy, represented
in a format whose specification is available to the general public, whose contents
can be viewed and edited directly and straightforwardly with generic text editors
or (for images composed of pixels) generic paint programs or (for drawings) some
widely available drawing editor, and that is suitable for input to text formatters or for
automatic translation to a variety of formats suitable for input to text formatters. A
copy made in an otherwise Transparent file format whose markup has been designed
to thwart or discourage subsequent modification by readers is not Transparent. A copy
that is not "Transparent" is called "Opaque".
Examples of suitable formats for Transparent copies include plain ASCII without
markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly
available DTD, and standard-conforming simple HTML designed for human modification.
Opaque formats include PostScript, PDF, proprietary formats that can be read
and edited only by proprietary word processors, SGML or XML for which the DTD
and/or processing tools are not generally available, and the machine-generated HTML
produced by some word processors for output purposes only.
The "Title Page" means, for a printed book, the title page itself, plus such following
pages as are needed to hold, legibly, the material this License requires to appear in the
title page. For works in formats which do not have any title page as such, "Title Page"
means the text near the most prominent appearance of the work’s title, preceding the
beginning of the body of the text.
2. VERBATIM COPYING
You may copy and distribute the Document in any medium, either commercially or
noncommercially, provided that this License, the copyright notices, and the license
notice saying this License applies to the Document are reproduced in all copies, and
E GNU FREE DOCUMENTATION LICENSE 74
that you add no other conditions whatsoever to those of this License. You may not
use technical measures to obstruct or control the reading or further copying of the
copies you make or distribute. However, you may accept compensation in exchange
for copies. If you distribute a large enough number of copies you must also follow the
conditions in section 3.
You may also lend copies, under the same conditions stated above, and you may
publicly display copies.
3. COPYING IN QUANTITY
If you publish printed copies of the Document numbering more than 100, and the
Document’s license notice requires Cover Texts, you must enclose the copies in covers that
carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover,
and Back-Cover Texts on the back cover. Both covers must also clearly and legibly
identify you as the publisher of these copies. The front cover must present the full title
with all words of the title equally prominent and visible. You may add other material
on the covers in addition. Copying with changes limited to the covers, as long as
they preserve the title of the Document and satisfy these conditions, can be treated as
verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should
put the first ones listed (as many as fit reasonably) on the actual cover, and continue the
rest onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than
100, you must either include a machine-readable Transparent copy along with each
Opaque copy, or state in or with each Opaque copy a publicly-accessible computer-
network location containing a complete Transparent copy of the Document, free of
added material, which the general network-using public has access to download anonymously
at no charge using public-standard network protocols. If you use the latter option,
you must take reasonably prudent steps, when you begin distribution of Opaque
copies in quantity, to ensure that this Transparent copy will remain thus accessible at
the stated location until at least one year after the last time you distribute an Opaque
copy (directly or through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well
before redistributing any large number of copies, to give them a chance to provide you
with an updated version of the Document.
4. MODIFICATIONS
You may copy and distribute a Modified Version of the Document under the conditions
of sections 2 and 3 above, provided that you release the Modified Version under precisely
this License, with the Modified Version filling the role of the Document, thus
licensing distribution and modification of the Modified Version to whoever possesses a
copy of it. In addition, you must do these things in the Modified Version:
• A. Use in the Title Page (and on the covers, if any) a title distinct from that of the
Document, and from those of previous versions (which should, if there were any,
be listed in the History section of the Document). You may use the same title as
a previous version if the original publisher of that version gives permission.
• B. List on the Title Page, as authors, one or more persons or entities responsible
for authorship of the modifications in the Modified Version, together with at least
five of the principal authors of the Document (all of its principal authors, if it has
less than five).
• C. State on the Title page the name of the publisher of the Modified Version, as
the publisher.
• F. Include, immediately after the copyright notices, a license notice giving the
public permission to use the Modified Version under the terms of this License,
in the form shown in the Addendum below.
• G. Preserve in that license notice the full lists of Invariant Sections and required
Cover Texts given in the Document’s license notice.
• I. Preserve the section entitled "History", and its title, and add to it an item stating
at least the title, year, new authors, and publisher of the Modified Version as
given on the Title Page. If there is no section entitled "History" in the Document,
create one stating the title, year, authors, and publisher of the Document as given
on its Title Page, then add an item describing the Modified Version as stated in
the previous sentence.
• J. Preserve the network location, if any, given in the Document for public access
to a Transparent copy of the Document, and likewise the network locations given
in the Document for previous versions it was based on. These may be placed in
the "History" section. You may omit a network location for a work that was published
at least four years before the Document itself, or if the original publisher
of the version it refers to gives permission.
• L. Preserve all the Invariant Sections of the Document, unaltered in their text and
in their titles. Section numbers or the equivalent are not considered part of the
section titles.
• M. Delete any section entitled "Endorsements". Such a section may not be included
in the Modified Version.
If the Modified Version includes new front-matter sections or appendices that qualify
as Secondary Sections and contain no material copied from the Document, you may at
your option designate some or all of these sections as invariant. To do this, add their
titles to the list of Invariant Sections in the Modified Version’s license notice. These
titles must be distinct from any other section titles.
You may add a section entitled "Endorsements", provided it contains nothing but
endorsements of your Modified Version by various parties–for example, statements of
peer review or that the text has been approved by an organization as the authoritative
definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage
of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the
Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text
may be added by (or through arrangements made by) any one entity. If the Document
already includes a cover text for the same cover, previously added by you or by arrangement
made by the same entity you are acting on behalf of, you may not add another;
but you may replace the old one, on explicit permission from the previous publisher
that added the old one.
The author(s) and publisher(s) of the Document do not by this License give permission
to use their names for publicity for or to assert or imply endorsement of any
Modified Version.
5. COMBINING DOCUMENTS
You may combine the Document with other documents released under this License,
under the terms defined in section 4 above for modified versions, provided that you
include in the combination all of the Invariant Sections of all of the original documents,
unmodified, and list them all as Invariant Sections of your combined work in its license
notice.
The combined work need only contain one copy of this License, and multiple identical
Invariant Sections may be replaced with a single copy. If there are multiple Invariant
Sections with the same name but different contents, make the title of each such
section unique by adding at the end of it, in parentheses, the name of the original author
or publisher of that section if known, or else a unique number. Make the same
adjustment to the section titles in the list of Invariant Sections in the license notice of
the combined work.
In the combination, you must combine any sections entitled "History" in the various
original documents, forming one section entitled "History"; likewise combine any
sections entitled "Acknowledgements", and any sections entitled "Dedications". You
must delete all sections entitled "Endorsements."
6. COLLECTIONS OF DOCUMENTS
You may make a collection consisting of the Document and other documents released
under this License, and replace the individual copies of this License in the various
documents with a single copy that is included in the collection, provided that you follow
the rules of this License for verbatim copying of each of the documents in all other
respects.
You may extract a single document from such a collection, and distribute it individually
under this License, provided you insert a copy of this License into the extracted
document, and follow this License in all other respects regarding verbatim copying of
that document.
8. TRANSLATION
Translation is considered a kind of modification, so you may distribute translations of
the Document under the terms of section 4. Replacing Invariant Sections with translations
requires special permission from their copyright holders, but you may include
translations of some or all Invariant Sections in addition to the original versions of
these Invariant Sections. You may include a translation of this License provided that
you also include the original English version of this License. In case of a disagreement
between the translation and the original English version of this License, the original
English version will prevail.
9. TERMINATION
You may not copy, modify, sublicense, or distribute the Document except as expressly
provided for under this License. Any other attempt to copy, modify, sublicense or
distribute the Document is void, and will automatically terminate your rights under
this License. However, parties who have received copies, or rights, from you under this
License will not have their licenses terminated so long as such parties remain in full
compliance.
to the present version, but may differ in detail to address new problems or concerns.
See http://www.gnu.org/copyleft/.
Each version of the License is given a distinguishing version number. If the Document
specifies that a particular numbered version of this License "or any later version"
applies to it, you have the option of following the terms and conditions either of that
specified version or of any later version that has been published (not as a draft) by
the Free Software Foundation. If the Document does not specify a version number of
this License, you may choose any version ever published (not as a draft) by the Free
Software Foundation.