
6-Audio Basics FileType MP3Compr

1. The document provides an overview of sound as a multimedia element, including the basic principles of how sound works: vibration at a source, transmission through a medium such as air, and reception and perception by the ear and brain.
2. It discusses guidelines for using sound in multimedia, such as maintaining consistency in style and quality, coordinating sound with other media, and using sound cues for specific events.
3. Key aspects of digital sound are explained, including the conversion of analog sound to digital audio through sampling and quantization, and the factors that determine sound quality, such as sample rate and audio resolution.

Uploaded by

علی احمد
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
137 views145 pages

6-Audio Basics FileType MP3Compr

1. The document provides an overview of sound as a multimedia element, including the basic principles of how sound works through vibration, transmission through a medium like air, and reception and perception by the ear and brain. 2. It discusses guidelines for using sound in multimedia, such as maintaining consistency in style and quality, coordinating with other media, and using sound cues for specific events. 3. Key aspects of digital sound are explained, including the conversion of analog sound to digital audio through sampling and quantization, as well as factors that determine sound quality like sample rate and audio resolution.

Uploaded by

علی احمد
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 145

Multimedia Systems and Design
Overview of Sound as a Multimedia Element
What is SOUND?
• Sound comprises the spoken word, voices, music and even noise. It is a complex relationship involving:
– a vibrating object (the sound source)
– a transmission medium (usually air)
– a receiver (the ear); and
– a perceptor (the brain).

2
The Power of Sound

Something vibrates, creating waves of pressure in the air; the eardrums translate these changes in waveform into what we perceive as sound.
• Sound pressure is measured in dB (decibels).
• Sound waves are known as waveforms.
3
Use of Sound

 Providing controls (such as skip, pause, mute,


volume adjustment) is an effective way of
keeping users involved and motivated.
 The decision to incorporate sound into a
multimedia product should have solid reasoning
behind it.

4
Use of Sound
 Sounds are either content sounds or ambient
sounds.
 Content sounds furnish information
 Narration, dialogue are content sounds.
 Music and other sounds can be considered as content
sounds if they are parts of the topic themselves.
 Ambient sounds reinforce messages and set the
mood
 Background sounds and special effects are ambient
sounds.
 Special sound effects can reinforce or enliven a
message.
5
Guidelines for Using Sound
 Use the same style of music (if multiple sound
files are needed) to maintain a sense of unity
 Coordinate sound files with other media
elements
 Sound quality should be kept consistent
 Record at a rate and resolution that is
appropriate to the delivery mode

6
Guidelines for Using Sound
 Use the same voice for narration and
voiceovers, but different voices for different
characters
 Optimize files for background music
 Use sound cues for specific events
 During voice-overs, background music should be
turned off or adjusted to a low volume such that
the spoken words can be understood without
difficulty

7
Example of Waveforms
[Waveform illustrations: piano, pan flute, snare drum]
5
Sound
• A pleasant sound has a regular wave pattern. The pattern is repeated over and over.
 But the waves of noise are irregular; they do not have a repeated pattern.
Basic Principles of Sound

 When we speak, vibrations, called sound


waves, are created.
 Sound waves have a recurring pattern or an
analog wave pattern called a waveform.

This analog wave pattern represents the volume and frequency of a sound.

10
Basic Principles of Sound
 Amplitude: Distance between the valley and the
peak of a waveform; determines volume
 Volume is measured in decibels (dB)
 Decibel (dB) is a logarithmic unit used to describe a ratio.
 One dB is close to Just Noticeable Difference (JND) for
sound level.
 Frequency: Number of wave peaks (cycles) that occur in
one second; determines pitch

11
Decibel Table

dB    Watts           Example
195   25–40 million   Saturn rocket
170   100,000         Jet engine with afterburner
160   10,000          Turbojet engine at 7,000 lb thrust
150   1,000           ALSETEX splinterless stun grenade
140   100             Two JBL 2226 speakers
130   10              75-piece orchestra, at fortissimo
120   1               Large chipping hammer
110   0.1             Riveting machine
100   0.01            Automobile on highway
90    0.001           Subway train; a shouting voice
80    0.0001          Inside a 1952 Corvette at 60 mph
70    0.00001         Voice conversation; freight train 100 feet away
60    0.000001        Large department store
50    0.0000001       Average residence or small business
Basic Principles of Sound
 Analog sound is a continuous stream of sound waves.
For sound to be included in multimedia applications,
analog sound must be converted to digital form.
 Digitizing (or sound sampling): the process of
converting analog sound to numbers
 Digital Audio: An analog sound that has been
converted to numbers

14
Basic Principles of Sound

Sound sampling converts analog sound to digital audio.

15
Basic Principles of Sound
 Quantization of Sound: The process of converting a
continuous range of values into a finite range of
discrete values.
- This is a function of analog-to-digital converters,
which create a series of digital values to represent the
original analog signal. 
- bit depth (number of bits available) determines the
accuracy and quality of the quantized value.
Basic Principles of Sound

 During digitizing, sound samples are taken at


regular time instants.
 Time instants are discrete.
 Sound samples (the volumes of sound at time
instants) cannot be stored precisely. Instead, only
quantified values can be stored.
 The feasible quantified values are known as
quantization levels.

17
Basic Principles of Sound

 The number of quantization levels is related to the


quality of digital audio.
 If more quantization levels are allowed, the
difference between the original value and the
quantified value will be smaller and we will get a
better quality of the digital representation.
 However, this would also mean a higher cost for
storage and processing of these values inside a
computer (disks of larger capacity and more
powerful CPUs are required)

18
19
Characteristic of Sound Waves
• Sound is described in terms of two
characteristics:
– Frequency (or pitch)
– Amplitude (or loudness)
Frequency 20

• Frequency is a measure of how many vibrations


occur in one second. This is measured in Hertz
(abbreviation Hz) and directly corresponds to the pitch
of a sound.
– The more frequent vibration occurs the higher the pitch of
the sound.

Low pitch High pitch


 Optimally, people can hear from 20 Hz to 20,000 Hz (20 kHz).
 Sounds below 20 Hz are infrasonic; sounds above 20 kHz are ultrasonic.
Amplitude 21

Amplitude is the maximum displacement of a wave


from an equilibrium position.
– The louder a sound, the more energy it has. This means loud
sounds have a large amplitude.

Quiet
Loud

Low amplitude High Amplitude


The amplitude relates to how loud a sound is.
22
Characteristic of Sound Waves
[Diagram of a sound wave labelled: time for one cycle, amplitude, wavelength, distance along wave, cycle]
Sound Quality
 Factors that determine the sound quality of digital
audio
 sample rate
 audio resolution
 Sample rate
 Number of sound samples taken per second
 Also known as sampling rate
 Measured in kilohertz (kHz),
 Common values: 11.025 kHz, 22.05 kHz, 44.1 kHz
 CD quality: 44.1 kHz

23
Sound Quality
 Audio resolution
 Also known as sample size or bit resolution
 Number of binary bits used to represent each sound
sample
 As the audio resolution increases, the quality of the
digital audio also improves.
 Audio resolution determines the accuracy with which
sound can be digitized.
 Common values: 8 bits, 16 bits
 CD quality: 16 bits

24
Sound Quality

Audio Resolution    Number of Quantization Levels
16-bit              2^16 = 65,536 possible values for the sound sample
8-bit               2^8 = 256 possible values for the sound sample
4-bit               2^4 = 16 possible values for the sound sample
2-bit               2^2 = 4 possible values for the sound sample


Downloaded vs. Streamed

 Web audio: downloaded or streamed


 Downloaded audio file must be entirely saved to
the user’s computer before it can be played.
 Streaming: a more advanced process that allows
audio file to be played as it is downloading (i.e.
before the entire file is transferred to the user’s
computer)
 If we want our audio files to be streamed over the
Internet, the web-hosting service must support
streaming.

26
Downloaded vs. Streamed

 If the user’s computer receives streaming audio


data more quickly than required, the excess data
will be stored in a buffer.
 If the user’s computer receives streaming audio
data slower than required, the data stored in the
buffer will be used. If the buffer becomes empty, the
user will experience a break.

27
Monophonic vs. Stereo Sound

 Monophonic sounds: flat and unrealistic when


compared to stereo sounds
 Stereo sounds: much more dynamic and lifelike
 Monophonic sound files are sometimes a more
appropriate choice where storage and transfer time are
major concerns.
 Narration and voiceovers can effectively be saved in a
monophonic format.
 Music almost always must be recorded and saved in
stereo.

28
Digital Audio File Size
File size of a digital audio recording (in bytes), assuming no compression:

    file size = sampling rate (Hz) × (resolution in bits ÷ 8) × duration (s) × number of channels*

* Where number of channels is either 1 for monophonic or 2 for stereo.

File size of a monophonic digital audio recording (in bytes), assuming no compression:

    file size = sampling rate × (resolution ÷ 8) × duration × 1

File size of a stereo digital audio recording (in bytes), assuming no compression:

    file size = sampling rate × (resolution ÷ 8) × duration × 2

29
Digital Audio File Size
Calculate the file size of a digital audio recording with a sampling rate of 44.1
KHz, recording resolution was 16 bit in stereo channel for a duration of 3 Mins.

= (44.1 × 1000) × (16 ÷ 8) × (3 × 60) × 2

= 31,752,000 B

= 31,752,000 ÷ 1024 = 31,007.8125 KB
= 31,007.8125 ÷ 1024 ≈ 30.28 MB

Calculate the file size of a digital audio recording with a voice quality of 11 KHz,
recording in 8 bit mono channel for a duration of 1,000 Mins.

Find the audio file size of a high quality music, recorded for 61 Mins in
44KHz rate at 16 bit resolution in stereo.
30
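A minimal Python sketch of the file-size formula above, applied to the worked example and the two exercises (the exercise figures of 11 KHz and 44 KHz are taken literally here rather than as 11.025/44.1 kHz):

def audio_file_size_bytes(sample_rate_hz, bits_per_sample, duration_s, channels):
    """Uncompressed size = sampling rate x (resolution / 8) x duration x channels."""
    return sample_rate_hz * (bits_per_sample / 8) * duration_s * channels

# Worked example: 44.1 kHz, 16-bit, stereo, 3 minutes
size = audio_file_size_bytes(44_100, 16, 3 * 60, 2)
print(size, "B =", round(size / 1024 / 1024, 2), "MB")          # 31,752,000 B = 30.28 MB

# Exercise 1: 11 kHz, 8-bit, mono, 1,000 minutes
print(audio_file_size_bytes(11_000, 8, 1_000 * 60, 1) / 1024 / 1024, "MB")

# Exercise 2: 44 kHz, 16-bit, stereo, 61 minutes
print(audio_file_size_bytes(44_000, 16, 61 * 60, 2) / 1024 / 1024, "MB")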
Digital Audio
• Digital audio data is the representation of sound, stored in the form of sample points.
• The quality of a digital recording depends on the sampling rate, that is, the number of sample points taken per second (Hz).

[Sampling rates diagram: the same waveform sampled at a high sampling rate and at a low sampling rate, with the samples stored in digital form]
It is impossible to reconstruct the original waveform if the sampling frequency is too low.
Digital Audio
• The three sampling frequencies most often used in
multimedia are 44.1
kHz, 22.05 kHz and 11.025 kHz.
– The higher the sampling rate, the more measurements are taken (better quality).
– The lower the sampling rate, the fewer measurements are taken (lower quality).

High Sampling Rate Low Sampling Rate


Principles of Digitization
 Sampling: Divide the horizontal axis (time) into discrete pieces
 Quantization: Divide the vertical axis (signal strength - voltage)
into pieces/steps. For example, 8-bit quantization divides the
vertical axis into 256 levels. 16 bit gives you 65536 levels. Lower
the quantization, lower the quality of the sound
 Linear (equal steps) vs. non-linear quantization (unequal steps):
 If the scale used for the vertical axis is linear, we say it is linear quantization;
 If it is logarithmic, then we call it non-linear (μ-law, or A-law in Europe).
- The non-linear scale is used because small-amplitude signals are more likely to occur than large-amplitude signals, and they are less likely to mask any noise.
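As a rough illustration of linear vs. non-linear quantization, here is a Python sketch of μ-law companding (using μ = 255, the North American telephony value); it is a simplified model, not the exact ITU G.711 bit layout:

import numpy as np

MU = 255.0   # mu-law parameter (North American telephony value)

def mu_law_compress(x, mu=MU):
    """Non-linear mapping: fine steps for small amplitudes, coarse for large ones."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=MU):
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def uniform_quantize(x, bits):
    """Linear (equal-step) quantizer for values in [-1, 1]."""
    levels = 2 ** bits
    return np.round((x + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1

x = np.array([0.001, 0.01, 0.1, 0.9])
linear = uniform_quantize(x, 8)
nonlinear = mu_law_expand(uniform_quantize(mu_law_compress(x), 8))
print(np.abs(linear - x))      # larger relative error for the small amplitudes
print(np.abs(nonlinear - x))   # small amplitudes preserved much better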
Sampling and Quantization

[Diagrams: sample value vs. time, for sampling and for 3-bit quantization]

Sampling
 Sampling rate: Number of samples per second (measured in Hz)
 E.g., CD standard audio uses a sampling rate of 44,100 Hz (44,100 samples per second)

3-bit quantization
 3-bit quantization gives 8 possible sample values
 E.g., CD standard audio uses 16-bit quantization, giving 65,536 values
 Why quantize? To digitize!
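A small Python sketch of both steps on a sine wave (the 440 Hz tone and 8 kHz sampling rate are arbitrary illustrative choices):

import numpy as np

sample_rate = 8_000                               # samples per second (Hz)
t = np.arange(0, 0.01, 1 / sample_rate)           # discrete sampling instants (10 ms)
signal = np.sin(2 * np.pi * 440 * t)              # sampled 440 Hz sine, values in [-1, 1]

bits = 3
levels = 2 ** bits                                # 3-bit quantization -> 8 feasible values
quantized = np.round((signal + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1

print(sorted(set(np.round(quantized, 4))))        # at most 8 distinct quantization levels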
Nyquist Sampling Theorem

Consider a sine wave:
 Sampling once per cycle: appears as a constant signal.
 Sampling 1.5 times per cycle: appears as a low-frequency sine signal.
 For lossless digitization, the sampling rate should be at least twice the maximum frequency component.
37
Sampling Theorem

A signal can be reconstructed exactly


if it is sampled at a rate at least twice
the maximum frequency component
in it!

• Nyquist rate = 2 × f_max, where f_max is the maximum frequency component.
What should be the sampling rate for audio?

• Audible range: 20 Hz to 20 kHz
  → Sample (music) at > 40k samples/sec
• Voice range: 0 to 4 kHz
  → Sample (speech) at > 8k samples/sec
39
Quality      Sampling rate (kHz)  Bits per sample  Mono/Stereo       Bitrate if uncompressed (kB/s)  Signal bandwidth (Hz)
Telephone    8                    8                Mono              8                               200 – 3,400
AM Radio     11.025               8                Mono              11.0                            100 – 5,500
FM Radio     22.05                16               Stereo            88.2                            20 – 11,000
CD           44.1                 16               Stereo            176.4                           5 – 20,000
DVD Audio    192 (max)            24 (max)         Up to 6 channels  1,200.0 (max)                   0 – 96,000
40
Digital Audio
• Other than that, it also depends on:
– The quality of original audio source.
– The quality of capture device & supporting
hardware.
– The characteristics used for capture.
– The capability of the playback environment.
Editing Digital Recordings
Once a recording has been made, it will almost certainly need to be edited.

Trimming
Trimming Removing “dead air” or blank space from the front of
a recording and any unnecessary extra time off the end is your
first sound editing task.
Trimming even a few seconds here and there might make a big
difference in your file size.
Trimming is typically accomplished by dragging the mouse
cursor over a graphic representation of your recording and
choosing a menu command such as Cut, Clear, Erase, or
Silence.
Trimming
(Audacity)
Splicing and Assembly
 Splicing and Assembly Using the same tools mentioned for
trimming, you will probably want to remove the extraneous
noises that certainly creep into a recording.
 Even the most controlled studio voice-overs require touch-up. Also, you may need to assemble longer recordings by cutting and pasting together many shorter ones.
 In the old days, this was done by splicing and assembling
actual pieces of magnetic tape.
Split & Merge
Volume Adjustments
Volume Adjustments
 If you are trying to assemble ten
different recordings into a single
sound track, there is little
chance that all the segments
will have the same volume.
 To provide a consistent volume
level, select all the data in the
file, and raise or lower the
overall volume by a certain
amount. Don’t increase the
volume too much, or you may
distort the file.
 It is best to use a sound editor
to normalize the assembled
audio file to a particular level,
say 80 percent to 90 percent of
maximum (without clipping), or
about –16 dB
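A minimal Python sketch of peak normalization to about 90 percent of full scale, as suggested above (the function name and test signals are illustrative choices, not from the slides):

import numpy as np

def normalize_peak(samples, target=0.9):
    """Scale the assembled track so its loudest sample sits at ~90% of full
    scale, keeping segment levels consistent without clipping."""
    peak = np.max(np.abs(samples))
    return samples if peak == 0 else samples * (target / peak)

# Two segments recorded at different levels, joined and then normalized together
quiet = 0.1 * np.sin(2 * np.pi * np.linspace(0, 10, 1_000))
loud = 0.7 * np.sin(2 * np.pi * np.linspace(0, 10, 1_000))
track = normalize_peak(np.concatenate([quiet, loud]))
print(np.max(np.abs(track)))   # ~0.9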
Resampling or Down sampling
Resampling or Down sampling If you have recorded and
edited your sounds at a high resolution and sampling rate (e.g. 16-bit,
44.1 kHz) but are using lower rates and resolutions in your project, you must
resample or down sample the file.
Your software will examine the existing digital recording and
work through it to reduce the number of samples. This
process may save considerable disk space.
Resampling
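A sketch of downsampling from 44.1 kHz to 22.05 kHz using SciPy's resample_poly, which applies an anti-aliasing low-pass filter before reducing the number of samples (SciPy is an assumption here; the slides do not name a tool):

import numpy as np
from scipy.signal import resample_poly

original_rate, target_rate = 44_100, 22_050
t = np.arange(original_rate) / original_rate            # one second of samples
x = np.sin(2 * np.pi * 1_000 * t)                       # a 1 kHz test tone

# Low-pass filters, then keeps every 2nd sample, halving the sample count
y = resample_poly(x, up=1, down=original_rate // target_rate)
print(len(x), "->", len(y), "samples")                  # 44100 -> 22050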
Fade-ins and Fade-outs
Fade-ins and Fade-
outs
Most programs offer
enveloping capability,
useful for long
sections that you
wish to fade in or
fade out gradually.
This enveloping helps
to smooth out the
very beginning and
the very end of a
sound file.
Smooth
Transition
Equalization
Some programs offer digital equalization (EQ) capabilities
that allow you to modify a recording’s frequency content so
that it sounds brighter (more high frequencies) or darker
(low, ominous rumbles).
Equalization
Time Stretching
Time Stretching Advanced programs let you alter the length (in time) of a sound file
without changing its pitch.

This feature can be very useful, but watch out: most time-stretching algorithms will severely
degrade the audio quality of the file if the length is altered more than a few percent in either
direction.
CM3106 Chapter 4: Introduction to
Digital Audio

Prof David Marshall


[email protected]
and
Dr Kirill Sidorov
[email protected]
www.facebook.com/kirill.sidorov

School of Computer Science & Informatics


Cardiff University, UK
What is Sound? (Recap from CM2202)
Sound Generation
Source — Generates Sound
Air Pressure changes
Electrical — Loud Speaker
Acoustic — Direct Pressure Variations

Sound Reception
Destination — Receives Sound
Electrical — Microphone produces electric signal
Ears — Respond to pressure, hear sound
(MPEG Audio — exploits this fact)
Digitising Sound

Microphone:
Receives sound
Converts to analog signal.
Computers like discrete entities
Need to convert Analog-to-Digital — Dedicated
Hardware (e.g. Soundcard)

Also known as Digital Sampling


Sample Rates and Bit Size
Bit Size — Quantisation
How do we store each sample value (Quantisation)?
8 Bit Value (0-255)
16 Bit Value (Integer) (0-65535)

Sample Rate
How many Samples to take?
11.025 KHz — Speech (Telephone 8 KHz)
22.05 KHz — Low Grade Audio
(WWW Audio, AM Radio)
44.1 KHz — CD Quality



Digital Sampling (1)

Sampling process basically involves:


Measuring the analog signal at regular discrete
intervals
Recording the value at these points



Digital Sampling (2)



Nyquist’s Sampling Theorem

The Sampling Frequency is critical to


the accurate reproduction of a digital
version of an analog waveform

Nyquist’s Sampling Theorem


The Sampling frequency for a signal must be at least twice
the highest frequency component in the signal.



Sampling at Signal Frequency

Sampling at Twice Nyquist Frequency

Sampling at above Nyquist Frequency


If you get Nyquist Sampling Wrong? (1)

Digital Sampling Artefacts Arise — Effect known as Aliasing


Affects Audio, Imagery and Video
If you get Nyquist Sampling Wrong? (2)



If you get Nyquist Sampling Wrong? (3)

What does aliasing sound like?


(Click on Images to play sounds)
[Plots: a sine wave and an aliased sine wave, amplitude −1 to 1 over samples 0–100; audio example: aliased piano]

MATLAB Code for Sine Demos above: Plot Version ,


Audio Version
More on image and video sampling artefacts later.
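The MATLAB demos referenced above are not included here; the following is a rough Python equivalent of the aliased-sine idea (the 100 Hz sampling rate and 90 Hz tone are illustrative choices):

import numpy as np

fs = 100                                   # sampling rate; Nyquist limit is 50 Hz
n = np.arange(fs)                          # one second of sample indices
high = np.sin(2 * np.pi * 90 * n / fs)     # 90 Hz tone, above the Nyquist limit
low = np.sin(2 * np.pi * 10 * n / fs)      # the 10 Hz tone it folds down to

# Sampled this way, the 90 Hz tone is indistinguishable from a phase-inverted 10 Hz tone
print(np.allclose(high, -low))             # True: the high tone aliases to 100 - 90 = 10 Hz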
Implications of Sample Rate and Bit Size (1)

Affects Quality of Audio

Ears do not respond to sound in a linear fashion


Decibel (dB) a logarithmic measurement of sound
16-Bit has a signal-to-noise ratio of 98 dB — virtually
inaudible
8-bit has a signal-to-noise ratio of 50 dB
Therefore, 8-bit is roughly 48 dB (eight 6 dB increments) noisier
6 dB increment is twice as loud



Implications of Sample Rate and Bit Size (2)

Audio Sample Rate and Bit Size Examples

File Type Audio File (all mono)


44KHz 16-bit
44KHz 8-bit
22 KHz 16-bit
22KHz 8-Bit
11KHz 8-bit

Web Link:
Click Here to Hear Sound Examples
Implications of Sample Rate and Bit Size (3)

Affects Size of Data

File Type 44.1 KHz 22.05 KHz 11.025 KHz


16 Bit Stereo 10.1 Mb 5.05 Mb 2.52 Mb
16 Bit Mono 5.05 Mb 2.52 Mb 1.26 Mb
8 Bit Mono 2.52 Mb 1.26 Mb 630 Kb
Memory Required for 1 Minute of Digital Audio



Practical Implications of Nyquist Sampling Theory

Filtering of Signal

Must (low pass) filter signal before sampling:

Otherwise strange artefacts from high frequency


(above Nyquist Limit) signals would appear in the
sampled signal.



Why are CD Sample Rates 44.1 KHz?

Upper range of human hearing is around 20–22 KHz —
apply Nyquist's Theorem.


Common Digital Audio Formats

Popular audio file formats include


.au (Origin: Unix, Sun),
.aiff (MAC, SGI),
.wav (PC, DEC)
Compression can be utilised in some of the above but is
not Mandatory
A simple and widely used (by above) audio compression
method is Adaptive Delta Pulse Code Modulation
(ADPCM).
Based on past samples, it predicts the next sample and
encodes the difference between the actual value and the
predicted value.
More on this later (Audio Compression)



Common Audio Formats (Cont.)

Many formats linked to audio applications


Most use some compression
Common ones:
Soundblaster — .voc (Can use Silence Deletion (More on
this later (Audio Compression))
Protools/Sound Designer – .sd2
Realaudio — .ra.
Ogg Vorbis — .ogg
AAC , Apple, mp4 — More Later
Flac — .flac, More Later
Dolby AC coding — More Later
MPEG AUDIO — More Later (MP3 and
MPEG-4)



Synthetic Sounds — reducing bandwidth?

Synthesis Pipeline

Synthesise sounds — hardware or software (more later)


Client produces sound — only send parameters to control
sound (MIDI/MP4/HTML5 later)



Synthesis Methods (More Later)
FM (Frequency Modulation) Synthesis – used in low-end Sound
Blaster cards, OPL-4 chip, Yamaha DX Synthesiser range popular
in Early 1980’s.
Wavetable synthesis – wavetable generated from sampled sound
waves of real instruments
Additive synthesis — make up signal from smaller simpler
waveforms
Subtractive synthesis — modify a (complex) waveform but taking
out (Filtering) elements
Granular Synthesis — use small fragments of existing samples to
make new sounds
Physical Modelling — model how acoustic sound is generated in
software
Sample-based synthesis — record and play back recorded audio,
often small fragments and audio processed.
Most modern Synthesisers use a mixture of sample and synthesis
methods.
Synthetic Sounds — Analogies with Vector
Graphics
Use more high-level descriptions to represent signals.
Recorded sounds and digital images: regular sampling;
large data size; difficult to modify
Synthetic sounds and vector graphics: high level
descriptions; small data size; easier to edit. Conversion
is needed before display – synthesis or rasterisation
Difference: 1D vs 2D

More on how sound synthesis works soon


CM3106 Multimedia

MPEG Audio Compression


Dr Kirill Sidorov
[email protected]
www.facebook.com/kirill.sidorov

Prof David Marshall


[email protected]

School of Computer Science and Informatics


Cardiff University, UK
Audio compression (MPEG and others)

As with video a number of compression techniques have been applied to


audio.

RECAP (Already Studied)

Traditional lossless compression methods (Huffman, LZW, etc.) usually


don’t work well on audio compression.
• For the same reason as in image and video compression:
Too much variation in data over a short time.
Simple but limited practical methods

• Silence compression — detect the “silence”, or, more generally


run-length encoding (seen examples before).
• Differential Pulse Code Modulation (DPCM).
Relies on the fact that the difference in amplitude between successive
samples is small, so we can use fewer bits to store the
difference (seen examples before).
• Adaptive Differential Pulse Code Modulation (ADPCM)
e.g., in CCITT G.721 – 16 or 32 Kbits/sec. Encodes the difference
between two consecutive samples but uses adaptive quantisation.
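A toy Python sketch of the DPCM idea (store the first sample, then only the small differences); real DPCM/ADPCM additionally quantises those differences, which is what the adaptive part of ADPCM varies:

import numpy as np

def dpcm_encode(samples):
    """Keep the first sample, then only the difference to the previous sample."""
    samples = np.asarray(samples, dtype=np.int32)
    return samples[0], np.diff(samples)

def dpcm_decode(first, diffs):
    return np.concatenate(([first], first + np.cumsum(diffs)))

x = np.array([1000, 1004, 1007, 1006, 1002, 999])
first, diffs = dpcm_encode(x)
print(diffs)                                          # [ 4  3 -1 -4 -3]: small values, fewer bits
print(np.array_equal(dpcm_decode(first, diffs), x))   # True (this toy version is lossless)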
Simple but limited practical methods

• Adaptive Predictive Coding (APC) typically used on speech.


• Input signal is divided into fixed segments (windows)
• For each segment, some sample characteristics are
computed, e.g. pitch, period, loudness.
• These characteristics are used to predict the signal.
• Computerised talking (speech synthesisers use such
methods) but low bandwidth:

Acceptable quality at 8 kbits/sec


Simple but limited practical methods

• Linear Predictive Coding (LPC) fits signal to speech model and then
transmits parameters of model as in APC.
• Speech Model:

Pitch, period, loudness, vocal tract


parameters (voiced and unvoiced sounds).

• Synthesised speech
• More prediction coefficients than APC – lower sampling rate
• Still sounds like a computer talking,
• Bandwidth as low as 2.4 kbits/sec.
Psychoacoustics and perceptual coding

Basic Idea: Exploit areas where the human


ear is less sensitive to sound to achieve
compression.
E.g. MPEG audio, Dolby AC.
How do we hear sound?
External link: Perceptual Audio Demos
Sound revisited

• Sound is produced by a vibrating source.


• The vibrations disturb air molecules.
• Produce variations in air pressure: lower than average
pressure, rarefactions, and higher than average,
compressions. This produces sound waves.
• When a sound wave impinges on a surface ( e.g. eardrum or
microphone) it causes the surface to vibrate in sympathy:

• In this way acoustic energy is transferred from a source to a


receptor.
Human hearing

• Upon receiving the waveform, the eardrum vibrates in sympathy


• Through a variety of mechanisms the acoustic energy is transferred
to nerve impulses that the brain interprets as sound.

The ear can be regarded as being made up of 3 parts:


• The outer ear,
• The middle ear,
• The inner ear.
We consider:
• The function of the main parts of the ear
• How the transmission of sound is processed.
Click Here to run flash ear demo over the web
(Shockwave Required)
The outer ear

• Ear canal: Focuses the incoming audio.


• Eardrum (tympanic membrane):
• Interface between the external and middle ear.
• Sound is converted into mechanical vibrations via the
middle ear.
• Sympathetic vibrations on the membrane of the eardrum.
The middle ear

• 3 small bones, the ossicles:


malleus, incus, and stapes.
• Form a system of levers which are linked together and driven by the
eardrum
• Bones amplify the force of sound vibrations.
The inner ear

Semicircular canals
• Body’s balance mechanism.
• Thought that it plays no part in
hearing.

The cochlea:

• Transforms mechanical ossicle forces into hydraulic pressure,


• The cochlea is filled with fluid.
• Hydraulic pressure imparts movement to the cochlear duct and to the organ of
Corti.
• Cochlea which is no bigger than the tip of a little finger!
How the cochlea works

• Pressure waves in the cochlea exert energy along a route that begins at the oval
window and ends abruptly at the membrane-covered round window.
• Pressure applied to the oval window is transmitted to all parts of the cochlea.
• Inner surface of the cochlea ( the basilar membrane) is lined with over 20,000
hair-like nerve cells — stereocilia:
Hearing different frequencies

• Basilar membrane is tight at one end, looser at the other


• High tones create their greatest crests where the membrane is tight,
• Low tones where the wall is slack.
• Causes resonant frequencies much like what happens in a tight
string.
• Stereocilia differ in length by minuscule amounts
• they also have different degrees of resiliency to the fluid which
passes over them.
Finally to nerve signals

• Compressional wave moves in middle ear through to the cochlea.


• Stereocilia will be set in motion.
• Each stereocilia sensitive to a particular frequency.
• Stereocilia cell will resonate with a larger amplitude of
vibration.
• Increased vibrational amplitude induces the cell to release an
electrical impulse which passes along the auditory nerve towards
the brain.

In a process which is not clearly understood, the brain is capable of


interpreting the qualities of the sound upon reception of these electric
nerve impulses.
Sensitivity of the ear

• Range is about 20 Hz to 20 kHz, most sensitive at


2 to 4 KHz.
• Dynamic range (quietest to loudest) is about 96 dB.
Recall: dB = 10 log10(P1 / P2) = 20 log10(A1 / A2).

• Approximate threshold of pain: 130 dB.


• Hearing damage: > 90 dB (prolonged exposure).
• Normal conversation: 60–70 dB.
• Typical classroom background noise: 20–30 dB.
• Normal voice range is about 500 Hz to 2 kHz.
• Low frequencies are vowels and bass.
• High frequencies are consonants.
Question: how sensitive is human hearing?
The sensitivity of the human ear with respect to frequency is given by
the following graph:
Frequency dependence

Illustration: Equal loudness curves or Fletcher-Munson curves (pure tone


stimuli producing the same perceived loudness, “Phons”, in dB).
What do the curves mean?

• Curves indicate perceived loudness as a function of both the


frequency and the level (sinusoidal sound signal)
• Equal loudness curves. Each contour:
• Equal loudness.
• Express how much a sound level must be changed as the frequency
varies, to maintain a certain perceived loudness.
Physiological implications

Why are the curves accentuated where they are?


• Accentuates frequency range to coincide with speech.
• Sounds like p and t have very important parts of their spectral
energy within the accentuated range.
• Makes them easier to discriminate between.
The ability to hear sounds of the accentuated range (around a few kHz)
is thus vital for speech communication.
Frequency masking

• A lower tone can effectively mask (make us unable to hear) a higher


tone played simultaneously.
• The reverse is not true — a higher tone does not mask a lower tone
that well.
• The greater the power in the masking tone, the wider is its
influence — the broader the range of frequencies it can mask.
• If two tones are widely separated in frequency then little masking
occurs.
Frequency masking

• Multiple frequency audio changes the sensitivity with the relative


amplitude of the signals.
• If the frequencies are close and the amplitude of one is less than
the other close frequency then the second frequency may not be
heard (masked).
Frequency masking

Frequency masking due to 1 kHz signal:


Frequency masking

Frequency masking due to 1, 4, 8 kHz signals:


Critical bands

• Range of closeness for frequency masking depends on the


frequencies and relative amplitudes.
• Each band where frequencies are masked is called the Critical
Band
• Critical bandwidth for average human hearing varies with
frequency:
• Constant 100 Hz for frequencies less than 500 Hz
• Increases (approximately) linearly by 100 Hz for each
additional 500 Hz.
• Width of critical band is called a bark.
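Coding the approximate rule above literally (constant 100 Hz below 500 Hz, growing by about 100 Hz per additional 500 Hz); this is the slide's rough rule, not a precise Bark-scale formula:

def critical_bandwidth_hz(frequency_hz):
    """Approximate critical bandwidth around a given centre frequency."""
    if frequency_hz < 500:
        return 100.0
    return 100.0 + 100.0 * (frequency_hz - 500) / 500

for f in (200, 500, 1_000, 2_000, 5_000):
    print(f"{f} Hz -> band about {critical_bandwidth_hz(f):.0f} Hz wide")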
Critical bands

First 12 of 25 critical bands:


What is the cause of frequency masking?

• The stereocilia are excited by air pressure variations,


transmitted via outer and middle ear.
• Different stereocilia respond to different ranges of
frequencies — the critical bands.

Frequency Masking occurs because after excitation by one frequency


further excitation by a less strong similar frequency of the same group of
cells is not possible.

Click here to hear example of Frequency Masking.


See/Hear also: Click here (in the Masking section).
Temporal masking

After the ear hears a loud sound: It takes a further short while before it
can hear a quieter sound.
Why is this so?
• Stereocilia vibrate with corresponding force of input sound stimuli.
• Temporal masking occurs because any loud tone will cause the hearing receptors
in the inner ear to become saturated and require time to recover.
• If the stimuli is strong then stereocilia will be in a high state of excitation and get
fatigued.
• Hearing Damage: After extended listening to loud music or headphones this
sometimes manifests itself with ringing in the ears and even temporary deafness
(prolonged exposure permanently damages the stereocilia).
Example of temporal masking

• Play 1 kHz masking tone at 60 dB, plus a test tone at 1.1 kHz at 40
dB. Test tone can’t be heard (it’s masked).
Stop masking tone, then stop test tone after a short delay.
Adjust delay time to the shortest time that test tone can be heard
(e.g., 5 ms).
Repeat with different level of the test tone and plot:
Example of temporal masking

Try other frequencies for test tone (masking tone duration constant).
Total effect of masking:
Example of temporal masking

The longer the masking tone is played, the longer it takes for the test
tone to be heard. Solid curve: 200 ms masking tone, dashed curve: 100
ms masking tone.
Compression idea: how to exploit?

• Masking: occurs whenever the presence of a strong audio signal


makes a temporal or spectral neighborhood of weaker audio
signals imperceptible.
• MPEG audio compresses by removing acoustically irrelevant parts
of audio signals
• Takes advantage of the human auditory system's inability to hear
quantization noise under auditory masking (frequency or temporal).
• Frequency masking is always utilised in MPEG.
• More complex forms of MPEG also employ temporal masking.
How to compute?

We have met basic tools:


• Bank filtering with IIR/FIR filters.
• Fourier and Discrete Cosine Transforms.
• Work in frequency space.
• (Critical) Band Pass Filtering — imagine a graphic equaliser.
Basic bandpass frequency filtering

MPEG audio compression basically works by:


• Dividing the audio signal up into a set of frequency subbands.
• Use filter banks to achieve this.
• Subbands approximate critical bands.
• Each band quantised according to the audibility of
quantisation noise.

Quantisation is the key to MPEG audio compression and is the reason


why it is lossy.
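A toy analysis filter bank in Python, splitting a signal into equal-width subbands with band-pass filters; this only illustrates the idea (MPEG-1 actually uses a 32-band polyphase filter bank, and SciPy is an assumption here):

import numpy as np
from scipy.signal import butter, sosfiltfilt

def split_into_subbands(x, sample_rate, n_bands=8, order=4):
    """Split x into n_bands equal-width frequency subbands."""
    nyquist = sample_rate / 2
    width = nyquist / n_bands
    subbands = []
    for k in range(n_bands):
        low = max(k * width, 1.0)                  # band edges must be > 0 for butter()
        high = min((k + 1) * width, nyquist - 1.0)
        sos = butter(order, [low, high], btype="bandpass", fs=sample_rate, output="sos")
        subbands.append(sosfiltfilt(sos, x))
    return subbands

fs = 32_000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 700 * t) + 0.3 * np.sin(2 * np.pi * 9_000 * t)
bands = split_into_subbands(x, fs)
print([round(float(np.sqrt(np.mean(b ** 2))), 3) for b in bands])  # energy sits in bands 0 and 4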
How good is MPEG compression?

Although (data) lossy


MPEG claims to be perceptually lossless:
• Human tests (part of standard development), Expert
listeners.
• 6:1 compression ratio, stereo 16 bit samples at 48 Khz
compressed to 256 kbits/sec.
• Difficult, real world examples used.
• Under optimal listening conditions no statistically
distinguishable difference between original and MPEG.
MPEG audio coders

• Set of standards for the use of video with sound.


• Compression methods or coders associated with audio
compression are called MPEG audio coders.
• MPEG allows for a variety of different coders to be employed.
• Difference in level of sophistication in applying perceptual
compression.
• Different layers for levels of sophistication.
Advantage of MPEG approach

Complex psychoacoustic modelling only in coding phase


• Desirable for real time (hardware or software)
decompression.
• Essential for broadcast purposes.
• Decompression is independent of the psychoacoustic
models used.
• Different models can be used.
• If there is enough bandwidth no models at all.
Basic MPEG: MPEG standards

Evolving standards for MPEG audio compression:


• MPEG-1 is by far the most prevalent.
• So called mp3 files we get off Internet are members of
MPEG-1 family.
• Standards now extends to MPEG-4 (structured audio) —
Earlier Lecture.

For now we concentrate on MPEG-1


Basic MPEG: MPEG facts

• MPEG-1: 1.5 Mbits/sec for audio and video


About 1.2 Mbits/sec for video, 0.3 Mbits/sec for audio
(Uncompressed CD audio is 44,100 samples/sec * 16 bits/sample * 2
channels > 1.4 Mbits/sec)
• Compression factor ranging from 2.7 to 24.
• MPEG audio supports sampling frequencies of 32, 44.1 and 48 KHz.
• Supports one or two audio channels in one of the four modes:
1 Monophonic — single audio channel.
2 Dual-monophonic — two independent channels
(functionally identical to stereo).
3 Stereo — for stereo channels that share bits, but not using
joint-stereo coding.
4 Joint-stereo — takes advantage of the correlations between stereo
channels.
MPEG Audio Compression
Basic MPEG-1 encoding/decoding algorithm
Algorithm
Basic MPEG-1 encoding/decoding may be summarised as:
Basic MPEG-1 compression algorithm

The main stages of the algorithm are:


• The audio signal is first sampled and quantised using PCM
• Application dependent: sample rate and number of bits
• The PCM samples are then divided up into a number of frequency
subbands, and subband scaling factors are computed:
Basic MPEG-1 compression algorithm

Analysis filters
• Also called critical-band filters
• Break signal up into equal width subbands
• Use filter banks (modified with discrete cosine
transform (DCT) Level 3)
• Filters divide audio signal into frequency subbands that
approximate the 32 critical bands
• Each band is known as a sub-band sample.
• Example: 16 kHz signal frequency, Sampling rate 32 kHz gives each
subband a bandwidth of 500 Hz.
• Time duration of each sampled segment of input signal is time to
accumulate 12 successive sets of 32 PCM (subband) samples, i.e.
32*12 = 384 samples.
Basic MPEG-1 Compression Algorithm

analysis filters
• In addition to filtering the input, analysis banks determine
• Maximum amplitude of 12 subband samples in each
subband.
• Each known as the scaling factor of the subband.
• Passed to psychoacoustic model and quantiser blocks
Basic MPEG-1 compression algorithm

Psychoacoustic modeller:
• Frequency Masking and may employ temporal masking.
• Performed concurrently with filtering and analysis operations.
• Uses Fourier Transform (FFT) to perform analysis.
• Determine amount of masking for each band caused by nearby
bands.
• Input: set hearing thresholds and subband masking
properties (model dependent) and scaling factors (above).
Basic MPEG-1 compression algorithm

Psychoacoustic modeller (cont):


• Output: a set of signal-to-mask ratios:
• Indicate those frequencies components whose amplitude is below
the audio threshold.
• If the power in a band is below the masking threshold, don’t encode
it.
• Otherwise, determine number of bits (from scaling
factors) needed to represent the coefficient such that noise
introduced by quantisation is below the masking effect (Recall that 1
bit of quantisation introduces about 6 dB of noise).
Basic MPEG-1 compression algorithm

Example of quantisation:
• Assume that after analysis, the levels of first 16 of the 32 bands are:
----------------------------------------------------------------------
Band 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Level (dB) 0 8 12 10 6 2 10 60 35 20 15 2 3 5 3 1
----------------------------------------------------------------------

• If the level of the 8th band is 60 dB,


then assume (according to model adopted) it gives a masking of 12
dB in the 7th band, 15 dB in the 9th.
Level in 7th band is 10 dB ( < 12 dB ), so ignore it.
Level in 9th band is 35 dB ( > 15 dB ), so send it.
–> Can encode with up to 2 bits (= 12 dB) of quantisation error.
• More on Bit Allocation soon.
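A toy Python rendering of the decision rule in the example above; the masking figures (12 dB and 15 dB) are the slide's model-dependent numbers, and the roughly 6 dB per bit rule is the one quoted earlier:

levels = [0, 8, 12, 10, 6, 2, 10, 60, 35, 20, 15, 2, 3, 5, 3, 1]   # bands 1..16, in dB

# Masking caused by the 60 dB tone in band 8 (model-dependent figures)
masking = {7: 12, 9: 15}

for band, mask_db in masking.items():
    level = levels[band - 1]
    if level < mask_db:
        print(f"band {band}: {level} dB < {mask_db} dB mask -> do not encode")
    else:
        tolerable_bits = mask_db // 6      # ~6 dB of quantisation noise per bit
        print(f"band {band}: {level} dB > {mask_db} dB mask -> send, "
              f"up to {tolerable_bits} bits (= {tolerable_bits * 6} dB) of quantisation error")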
MPEG-1 Output Bitstream

The basic output stream for a basic MPEG encoder is as follows:

• Header: contains information such as the sample frequency and


quantisation.
• Subband sample (SBS) format: Quantised scaling factors and 12
frequency components in each subband.
• Peak amplitude level in each subband quantised using 6 bits (64
levels)
• 12 frequency values quantised to 4 bits
• Ancillary data: Optional. Used, for example, to carry
additional coded samples associated with special broadcast format
( e.g surround sound).
Decoding the bitstream

• Dequantise the subband samples after demultiplexing the coded


bitstream into subbands.
• Synthesis bank decodes the dequantised subband samples to
produce PCM stream.
• This essentially involves applying the inverse fourier
transform ( IFFT) on each substream and multiplexing the channels
to give the PCM bit stream.
MPEG layers

MPEG defines 3 levels of processing layers for audio:


• Level 1 is the basic mode,
• Levels 2 and 3 are more advanced (use temporal masking).
• Level 3 is the most common form for audio files on the Web
• Our beloved MP3 files that record companies claim are bankrupting
their industry.
• Strictly speaking these files should be called
MPEG-1 level 3 files.

Each level:
• Increasing levels of sophistication
• Greater compression ratios.
• Greater computation expense (but mainly at the coder side)
Level 1

• Best suited for bit rates greater than 128 kbits/sec per channel.
• Example: Philips Digital Compact Cassette uses Layer 1 at 192
kbits/sec compression
• Divides data into frames,
• Each of them contains 384 samples,
• 12 samples from each of the 32 filtered subbands as shown above.
• Psychoacoustic model only uses frequency masking.
• Optional Cyclic Redundancy Code (CRC) error checking.
Level 1 (and Level 2) audio layers

Mask calculations are performed in parallel with subband filtering, as in Fig. 4.13.
Note:
• Accurate frequency decomposition via Fourier Transform.
Layer 2

• Targeted at bit rates of around 128 kbits/sec per channel.


• Examples: Coding of Digital Audio Broadcasting (DAB) on
CD-ROM, CD-I and Video CD.
• Enhancement of level 1.
• Codes audio data in larger groups:
• Use three frames in filter:
before, current, next, a total of 1152 samples.
• This models a little bit of the temporal masking.
• Imposes some restrictions on bit allocation in middle and high
subbands.
• More compact coding of scale factors and quantised
samples.
• Better audio quality due to saving bits here so more bits can be
used in quantised subband values.
Layer 3

• Targeted at bit rates of 64 kbits/sec per channel. Example: audio


transmission of ISDN or suitable bandwidth network.
• Psychoacoustic model includes temporal masking effects, Takes
into account stereo redundancy.
• Better critical band filter is used (non-equal frequencies)
• Uses a modified DCT (MDCT) for lossless subband transformation.
• Two different block lengths: 18 (long) or 6 (short)
• 50% overlap between successive transform windows gives window
sizes of 36 or 12 — accounts for temporal masking
• Greater frequency resolution accounts for poorer time resolution.
• Uses Huffman coding on quantised samples.
Level 3 audio layers
Bit allocation

• Process determines the number of code bits for each subband


• Based on information from the psychoacoustic model.
Bit allocation for layer 1 and 2

• Aim: ensure that all of the quantisation noise is below the masking
thresholds
• Compute the mask-to-noise ratio (MNR) for all subbands:
MNR_dB = SNR_dB − SMR_dB
where
MNR_dB is the mask-to-noise ratio,
SNR_dB is the signal-to-noise ratio (SNR), and
SMR_dB is the signal-to-mask ratio from the psychoacoustic
model.

• Standard MPEG lookup tables estimate SNR for given quantiser


levels.
• Designers are free to try other methods of SNR estimation.
Bit allocation for layer 1 and 2

Once MNR computed for all the subbands:


• Search for the subband with the lowest MNR
• Increment code bits to that subband.
• When a subband gets allocated more code bits, the bit
allocation unit:
• Looks up the new estimate for SNR
• Recomputes that subband’s MNR.
• The process repeats until no more code bits can be
allocated.
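A sketch of this greedy loop in Python; the 6 dB-per-bit SNR estimate stands in for the MPEG lookup tables, and a fixed bit budget stands in for the real stopping condition:

def snr_estimate_db(bits):
    """Stand-in for the MPEG SNR lookup table: roughly 6 dB per allocated bit."""
    return 6.0 * bits

def allocate_bits(smr_db, total_bits):
    """Repeatedly give one more bit to the subband with the lowest MNR = SNR - SMR."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        mnr = [snr_estimate_db(b) - smr for b, smr in zip(bits, smr_db)]
        neediest = mnr.index(min(mnr))
        bits[neediest] += 1
    return bits

# Four subbands with signal-to-mask ratios from a (hypothetical) psychoacoustic model
print(allocate_bits([20, 5, -3, 12], total_bits=8))    # e.g. [4, 1, 0, 3]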
Bit allocation for layer 1 and 2
Bit allocation for layer 3

• Uses noise allocation, which employs Huffman coding.


• Iteratively varies the quantisers in an orderly way
• Quantises the spectral values,
• Counts the number of Huffman code bits required to code the audio
data
• Calculates the resulting noise in Huffman coding.
If there exist scale factor bands with more than the
allowed distortion:
• Encoder amplifies values in bands
• To effectively decreases the quantiser step size for those bands.
Bit allocation for layer 3

After this the process repeats. The process stops if any of these three
conditions is true:
• None of the scale factor bands have more than the allowed
distortion.
• The next iteration would cause the amplification for any of the
bands to exceed the maximum allowed value.
• The next iteration would require all the scale factor bands to be
amplified.

Real-time encoders include a time-limit exit condition for this process.


Stereo redundancy coding

Exploit redundancy in two coupled stereo channels?


• Another perceptual property of the human auditory system
• Simply stated at low frequencies, the human auditory system can’t
detect where the sound is coming from.
• So save bits and encode it mono.
• Used in MPEG-1 Layer 3.

Two types of stereo redundancy coding:


• Intensity stereo coding — all layers
• Middle/Side (MS) stereo coding — Layer 3 only stereo
coding.
Intensity stereo coding

Encoding:
• Code some upper-frequency subband outputs:
• A single summed signal instead of sending independent left and
right channels codes
• Codes for each of the 32 subband outputs.

Decoding:
• Reconstruct left and right channels
• Based only on a single summed signal
• Independent left and right channel scale factors.

With intensity stereo coding,


• The spectral shape of the left and right channels is the same within
each intensity-coded subband
• But the magnitude is different.
Middle/side (MS) stereo coding

• Encodes the left and right channel signals in certain


frequency ranges:
• Middle — sum of left and right channels
• Side — difference of left and right channels.
• Encoder uses specially tuned threshold values to compress the side
channel signal further.
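A minimal sketch of the middle/side idea; the division by 2 is a common scaling choice and an assumption here, since the slide only says sum and difference:

import numpy as np

def ms_encode(left, right):
    middle = (left + right) / 2          # sum of the channels (scaled)
    side = (left - right) / 2            # difference of the channels (scaled)
    return middle, side

def ms_decode(middle, side):
    return middle + side, middle - side  # recovers left and right exactly

left = np.array([0.50, 0.52, 0.49])
right = np.array([0.48, 0.50, 0.47])
middle, side = ms_encode(left, right)
print(side)                              # near zero for highly correlated channels
l, r = ms_decode(middle, side)
print(np.allclose(l, left) and np.allclose(r, right))   # True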

MPEGAudio (DIRECTORY)
MPEGAudio.zip (All Files Zipped)
Dolby audio compression

Application areas:
• FM radio Satellite transmission and broadcast TV audio
(DOLBY AC-1)
• Common compression format in PC sound cards
(DOLBY AC-2)
• High Definition TV standard advanced television (ATV)
(DOLBY AC-3). MPEG a competitor in this area.
Differences with MPEG

• MPEG perceptual coders control quantisation accuracy of each


subband by computing bit numbers for each sample.
• MPEG needs to store the quantiser value used with each sample.
• The MPEG decoder uses this information to dequantise:
forward adaptive bit allocation
• Advantage of MPEG?: no need for psychoacoustic
modelling in the decoder, because every quantiser value is stored.
• DOLBY: Use fixed bit rate allocation for each subband based on
characteristics of the ear.
• No need to send with each frame — as in MPEG.
• DOLBY encoders and decoder need this information.
Different Dolby Standards

DOLBY AC-1

Low complexity psychoacoustic model


• 40 subbands at a sampling rate of 32 kHz, or
• (proportionally more) subbands at 44.1 or 48 kHz
• Typical compressed bit rate of 512 kbits per second for stereo.
• Example: FM radio Satellite transmission and broadcast TV audio
Different Dolby standards

DOLBY AC-2

Variation to allow subband bit allocations to vary


• NOW Decoder needs copy of psychoacoustic model.
• Minimised encoder bit stream overheads at expense of transmitting
encoded frequency coefficients of sampled waveform segment —
known as the encoded spectral envelope.
• Mode of operation known as
backward adaptive bit allocation mode.
• High (hi-fi) quality audio at 256 kbits/sec.
• Not suited for broadcast applications:
• encoder cannot change model without changing
(remote/distributed) decoders.
• Example: Common compression format in PC sound cards.
Different Dolby standards

DOLBY AC-3

Development of AC-2 to overcome broadcast challenge


• Use hybrid backward/forward adaptive bit allocation mode.
• Any model modification information is encoded in a frame.
• Sample rates of 32, 44.1, 48 kHz supported, depending on the bandwidth of the
source signal.
• Each encoded block contains 512 subband samples, with 50% (256) overlap
between successive samples.
• For a 32 kHz sample rate, each block of samples is of 8 ms duration; the
duration of each encoded block is 16 ms.
• Audio bandwidth (at a 32 kHz sample rate) is 15 KHz, so each subband has 62.5 Hz
bandwidth.
• Typical stereo bit rate is 192 kbits/sec.
• Example: High Definition TV standard advanced television (ATV). MPEG competitor
in this area.
Further Reading

A tutorial on MPEG audio compression

AC-3: flexible perceptual coding for audio trans. & storage
