Unit 2: Sound/Audio System

• Sound is the term used for the analogue form; the digitized form of sound is
called audio. A sound is a waveform. It is produced when waves of varying
pressure travel through a medium, usually air. It is inherently an analogue
phenomenon, meaning that the changes in air pressure can vary continuously
over a range of values.
Multimedia System Sounds
• All computers are equipped with basic sounds such as beeps, dings and random
startup sounds. However, in order to hear more sophisticated sounds (music and
speeches), a user will need a sound card and either speakers or headphones.
Digital Audio
• The sound recorded on an audio tape through a microphone or from other
sources is in an analogue (continuous) form.
• The analogue format must be converted to a digital format for storage in a
computer. This process is called digitizing.
• The method used for digitizing sound is called sampling. Digital audio represents
a sound stored in thousands of numbers or samples. The quality of a digital
recording depends upon how often the samples are taken.
• Digital data represents the loudness at discrete slices of time. It is not device
dependent and should sound the same each time it is played. It is used for
music CDs.
Preparing Digital Audio Files
The steps given below are followed to prepare digital audio files:
• Balancing the need for sound quality against available RAM and hard disk
resources.
• Setting appropriate recording levels to get a high-quality and clean recording.
• Digitizing the analogue material by recording it onto computer-readable digital
media.
The sampling rate determines the frequency at which samples will be drawn for the
recording.
The number of times the analogue sound is sampled during each period and
transformed into digital information is called sampling rate.
Sampling rates are measured in Hertz (Hz) or kilohertz (kHz). The most common sampling
rates used in multimedia applications are 44.1 kHz, 22.05 kHz and 11.025 kHz.
Sampling at higher rates more accurately captures the high frequency content of
the sound. Higher sampling rate means higher quality of sound. However, a higher
sampling rate occupies greater storage capacity. Conversion from a higher sampling
rate to a lower rate is possible.
Sampling rate and sound bit depth are the audio equivalent of the resolution and
colour depth of a graphic image. Bit depth is the number of bits used for storing
each piece of audio information (each sample).
• The higher the bit depth, the higher the quality of the sound. Multimedia sound
comes in 8-bit, 16-bit, 32-bit and 64-bit formats. An 8-bit sample has 2^8 or 256
possible values. A single bit rate and a single sampling rate are recommended
throughout the work. An audio file size can be calculated with the simple formula:
File Size in Disk = (Length in seconds) × (sample rate) × (bit depth/8 bits per byte).
• Bit Rate refers to the amount of data, specifically bits, transmitted or received
per second. It is comparable to the sample rate but refers to the digital
encoding of the sound.
• It refers specifically to how many digital 1s and 0s are used each second to
represent the sound signal.
• This means the higher the bit rate, the higher the quality and size of your
recording. For instance, an MP3 file might be described as having a bit rate of
320 kb/s or 320000 b/s. This indicates the amount of compressed data needed
to store one second of music.
• Bit Rate = (Sample Rate) × (Bit Depth) × (Number of Channels)
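As a quick check of this formula, the short sketch below compares uncompressed CD-quality stereo with the 320 kb/s MP3 figure mentioned above; the code is only illustrative arithmetic, not taken from any particular tool.

```python
# Bit rate of uncompressed PCM = sample rate x bit depth x number of channels.
sample_rate = 44_100      # Hz, CD standard
bit_depth = 16            # bits per sample
channels = 2              # stereo

pcm_bit_rate = sample_rate * bit_depth * channels   # 1,411,200 bits per second
mp3_bit_rate = 320_000                              # the 320 kb/s MP3 example above

print(pcm_bit_rate, mp3_bit_rate)
print(round(pcm_bit_rate / mp3_bit_rate, 1))        # the MP3 needs ~4.4x less data
```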
Mono or Stereo
Mono sounds are flat and unrealistic compared to stereo sounds, which are much
more dynamic and lifelike. However, stereo sound files require twice the storage
capacity of mono sound files. Therefore, if storage and transfer are concerns, mono
sound files may be the more appropriate choice.
The formulae for determining the size of a digital audio recording are given below:
Monophonic = Sampling rate × duration of recording in seconds × (bit resolution/8)
×1
Stereo = Sampling rate × duration of recording in seconds × (bit resolution/8) × 2
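A minimal sketch of the two formulas; the one-minute, 16-bit, 44.1 kHz recording is an assumed example.

```python
sampling_rate = 44_100      # Hz
duration_seconds = 60       # one minute of recording (assumed example)
bit_resolution = 16         # bits per sample

mono_bytes = sampling_rate * duration_seconds * (bit_resolution / 8) * 1
stereo_bytes = sampling_rate * duration_seconds * (bit_resolution / 8) * 2

# Stereo needs exactly twice the storage of mono: ~5.29 MB vs ~10.58 MB here.
print(mono_bytes / 1_000_000, stereo_bytes / 1_000_000)
```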
Analogue versus Digital
• There are two types of sound – analogue and digital.
• Analogue sound is a continuous stream of sound waves. To be understood by
the computer, these sound waves must be converted to numbers.
• The process of converting analogue sounds into numbers is called digitizing or
sound sampling.
• Analogue sounds that have been converted to numbers are digital sounds.
When we are working with digital sound, we call it audio.
• Therefore, sound that has been converted from analogue to digital is often
called digital audio sounds. Non-destructive sound processing methods maintain
the original file. A copy of the original file can be manipulated by playing it
louder or softer, combining it with other sounds on other tracks, or modifying it
in other ways.
• Once a sound has been recorded, digitized, processed, and incorporated into a
multimedia application, it is ready to be delivered. So that you can hear it
through your speakers, the digital sound is sent through a digital-to-analogue
converter (DAC).
Digitizing Audio
• Natural sound occurs as continuous, and hence analogue, pressure waves. In
order to convert these pressure waves into a representation a computer can
manipulate, it is necessary to digitize them.
• An Analog-to-Digital Converter (ADC) measures the amplitude of the pressure
waves at regular time intervals; each measurement is called a sample. Together
the samples form a digital representation of the sound.
• The reverse conversion, to play digital sound through an analog device (such as
speakers) is performed by a Digital-to-Analog Converter (DAC).
• The number of samples taken per second is called the sampling rate.
• CD quality sound is sampled at 44,100 Hz, which means that it is sampled
44,100 times per second. This appears to be well above the frequency range of
the human ear.
• However, the Nyquist sampling theorem states that "For lossless digitization, the
sampling rate should be at least twice the maximum frequency response." The
human ear can hear sound in the range 20Hz to 20KHz, and the bandwidth
(19980Hz) is slightly less than half the CD standard sampling rate.
• Following the Nyquist theorem, this means that CD quality sound can represent
frequencies only up to 22,050Hz, which is much closer to that of human
hearing.
• Just as the waveform is sampled at discrete times, the value of the sample taken
is also represented as a discrete value. The resolution or quantization of a
sample value is dependent on the number of bits used to represent the
amplitude.
• The greater the number of bits used, the better the resolution, but the more
storage space is required. Typically, amplitude is sampled as either 8-bit
(resulting in 256 possible sample values) or 16-bit (yielding 65536 values).
Computer Representation of Audio
• The smooth, continuous curve of a sound waveform isn't directly represented in
a computer.
• A computer measures the amplitude of the waveform at regular time intervals
to produce a series of numbers. Each of these measurements is called a sample.
• Figure illustrates one period of a digitally sampled waveform.

• Each vertical bar in Figure represents a single sample. The height of a bar
indicates the value of that sample.
• The mechanism that converts an audio signal into digital samples is called an
analog-to-digital converter, or ADC. To convert a digital signal back to analog,
you need a digital-to-analog converter, or DAC.
• A transducer converts pressure to voltage levels
• Convert analog signal into a digital stream by discrete sampling.
• Discretization both in time and amplitude (quantization).
• In a computer, we sample these values at intervals to get a vector of values
• A computer measures the amplitude of the waveform at regular time intervals
to produce a series of numbers (samples)
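The idea of measuring the waveform at regular time intervals can be sketched in a few lines; the 440 Hz tone and the 8000 Hz sampling rate below are assumed example values.

```python
import math

frequency = 440           # Hz, the tone being "recorded" (assumed example)
sample_rate = 8000        # samples taken per second (assumed example)
num_samples = sample_rate // frequency   # roughly one period of the waveform

# Each entry is one sample: the amplitude of the waveform at one time instant.
samples = [math.sin(2 * math.pi * frequency * n / sample_rate)
           for n in range(num_samples)]
print(samples)
```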
Sampling Rate
The rate at which a waveform is sampled is called the sampling rate.
Like frequencies, sampling rates are measured in hertz. The CD standard sampling
rate of 44100 Hz means that the waveform is sampled 44100 times per second. This
may seem a bit excessive, considering that we can't hear frequencies above 20 kHz;
however, the highest frequency that a digitally sampled signal can represent is
equal to half the sampling rate. So a sampling rate of 44100 Hz can only represent
frequencies up to 22050 Hz, a boundary much closer to that of human hearing.
• Rate at which a continuous wave is sampled (measured in Hertz): CD standard -
44100 Hz, telephone quality - 8000 Hz.
• Direct relationship between sampling rate, sound quality (fidelity) and storage
space.
Quantization
• Just as a waveform is sampled at discrete times, the value of the sample is also
discrete.
• The quantization of a sample value depends on the number of bits used in
measuring the height of the waveform.
• An 8-bit quantization yields 256 possible values;
• 16-bit CD quality quantization results in over 65000 values.
• As an extreme example, Figure 2-3 shows the waveform used in the previous
example sampled with a 3-bit quantization. This results in only eight possible
values: .75, .5, .25, 0, -.25, -.5, -.75, and -1.
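Quantization can be sketched as mapping each sample (in the range -1 to 1) onto one of 2^bits levels; with bits = 3 the function below produces only the eight values listed above.

```python
def quantize(sample, bits):
    """Map a sample in [-1.0, 1.0] onto the nearest of 2**bits discrete levels."""
    levels = 2 ** bits
    step = 2.0 / levels                    # spacing between adjacent levels
    index = round((sample + 1.0) / step)   # index of the nearest level
    index = min(index, levels - 1)         # keep +1.0 inside the top level
    return -1.0 + index * step

# 3-bit quantization: every sample collapses onto one of only eight values.
print([quantize(x, 3) for x in (0.9, 0.6, 0.1, -0.4)])   # [0.75, 0.5, 0.0, -0.5]
```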
Nyquist Sampling Theorem
If a signal f(t) is sampled at regular intervals of time and at a rate higher than twice
the highest significant signal frequency, then the samples contain all the
information of the original signal.
Example:
• The highest frequency to be reproduced for CD-quality audio is 22050 Hz.
• Because of the Nyquist theorem, we need to sample the signal at twice that
frequency; therefore, the sampling frequency is 44100 Hz.
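The limit can be illustrated numerically: a tone above half the sampling rate produces exactly the same samples as a lower, aliased tone, so it cannot be distinguished after sampling. The 8000 Hz rate and the two tones below are assumed example values.

```python
import math

sample_rate = 8000              # assumed example sampling rate (Hz)
# Highest representable frequency is sample_rate / 2 = 4000 Hz.

def sample(freq_hz, n):
    return round(math.cos(2 * math.pi * freq_hz * n / sample_rate), 6)

# A 5500 Hz tone (above the 4000 Hz limit) gives the same samples as 2500 Hz.
print([sample(5500, n) for n in range(4)])
print([sample(2500, n) for n in range(4)])
```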
Data Rate of a Channel
Noiseless Channel
Nyquist proved that if any arbitrary signal has been run through a low-pass filter of
bandwidth H, the filtered signal can be completely reconstructed by making only 2H
(exact) samples per second.
If the signal consists of V discrete levels, Nyquist's theorem states:
Max data rate = 2 × H × log2(V) bits/sec
• A noiseless 3 kHz channel carrying a binary (two-level) signal cannot transmit at a
rate exceeding 6000 bits per second.
Noisy Channel
Thermal noise present in a channel is measured by the ratio of the signal power S to
the noise power N (the signal-to-noise ratio S/N). For a noisy channel of bandwidth H:
Max data rate = H × log2(1 + S/N) bits/sec
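Both limits are simple arithmetic; the sketch below reproduces the 3 kHz binary-signal example and adds an assumed 30 dB (S/N = 1000) figure for the noisy case.

```python
import math

H = 3000                  # channel bandwidth in Hz (the 3 kHz example above)

# Nyquist limit for a noiseless channel with V discrete signal levels.
V = 2                     # binary signal
noiseless_limit = 2 * H * math.log2(V)             # 6000 bits per second

# Limit for a noisy channel; S/N = 1000 (30 dB) is an assumed example value.
signal_to_noise = 1000
noisy_limit = H * math.log2(1 + signal_to_noise)   # about 29,900 bits per second

print(noiseless_limit, round(noisy_limit))
```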
Audio Formats
Audio formats are characterized by four parameters:
Sample rate: Sampling frequency
Encoding: audio data representation
• µ-law encoding corresponds to CCITT G.711, the standard for voice data in
telephone companies in the USA, Canada and Japan (a small sketch of the
companding formula follows this list).
• A-law encoding is used for telephony elsewhere.
• A-law and µ-law audio is sampled at 8000 samples/second with a precision of
12 bits, compressed to 8-bit samples.
• Linear Pulse Code Modulation (PCM) - uncompressed audio where samples are
proportional to the audio signal voltage.
Precision: number of bits used to store an audio sample
• µ-law and A-law have 8-bit precision; PCM can be stored at various precisions,
and 16-bit PCM is common.
Channel: Multiple channels of audio may be interleaved at sample boundaries.
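As an illustration of the µ-law encoding mentioned above, the sketch below applies the standard µ-law companding formula with µ = 255; it is a simplified sketch of the companding curve only, not the full G.711 8-bit encoder.

```python
import math

MU = 255   # mu-law parameter used for telephony in the USA, Canada and Japan

def mu_law_compress(x):
    """Compand a sample x in [-1.0, 1.0]: small amplitudes get more resolution."""
    return math.copysign(math.log(1 + MU * abs(x)) / math.log(1 + MU), x)

def mu_law_expand(y):
    """Invert the companding to recover an approximation of the original sample."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

# Quiet samples are boosted far more than loud ones before 8-bit quantization.
for x in (0.01, 0.1, 0.5, 1.0):
    print(x, round(mu_law_compress(x), 3))
```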
Available on UNIX:
au (SUN file format), wav (Microsoft RIFF/waveform format), al (raw a-law), u (raw u-
law)…
Available on Windows-based systems (RIFF formats):
wav, midi (file format for standard MIDI files), avi
RIFF (Resource Interchange File Format) is a tagged file format (similar to TIFF) that
allows multiple applications to read files in RIFF format.
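A small sketch using Python's standard-library wave module to write a RIFF/WAVE (.wav) file; the 440 Hz test tone, the one-second duration and the file name are assumed example values.

```python
import math
import struct
import wave

SAMPLE_RATE = 44_100    # CD-standard sampling rate
DURATION = 1            # seconds (assumed example)
FREQUENCY = 440         # Hz test tone (assumed example)

# Generate one second of 16-bit PCM samples for a sine tone.
frames = b"".join(
    struct.pack("<h", int(32767 * math.sin(2 * math.pi * FREQUENCY * n / SAMPLE_RATE)))
    for n in range(SAMPLE_RATE * DURATION)
)

with wave.open("tone.wav", "wb") as wav_file:
    wav_file.setnchannels(1)         # mono
    wav_file.setsampwidth(2)         # 2 bytes per sample = 16-bit precision
    wav_file.setframerate(SAMPLE_RATE)
    wav_file.writeframes(frames)
```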
RealAudio, MP3 (MPEG Audio Layer 3)

Digital Audio
Digital audio is created when a sound wave is converted into numbers – a process
referred to as digitizing. It is possible to digitize sound from a microphone, a
synthesizer, existing tape recordings, live radio and television broadcasts, and
popular CDs. You can digitize sounds from a natural source or from prerecorded
material. Digitized sound is sampled sound.
Basic Music (MIDI) Concepts
The relationship between music and computer has become more and more
important, especially considering the development of MIDI (Musical Instrument
Digital Interface) and its important contributions in the music industry today. MIDI
(Musical Instrument Digital Interface) is a communication standard developed for
electronic musical instruments and computers. MIDI files allow music and sound
synthesizers from different manufacturers to communicate with each other by
sending messages along cables connected to the devices.
• MIDI doesn’t directly describe musical sound
• MIDI is not a language
• It is a data communications protocol
Creating your own original score can be one of the most creative and rewarding
aspects of building a multimedia project, and MIDI (Musical Instrument Digital
Interface) is the quickest, easiest and most flexible tool for this task.
The process of creating MIDI music is quite different from digitizing existing audio.
To make MIDI scores, however, you will need sequencer software and a sound
synthesizer.
The MIDI keyboard is also useful to simplify the creation of musical scores. An
advantage of structured data such as MIDI is the ease with which the music director
can edit the data. MIDI is preferable to digital audio in the following cases:
• When digital audio will not work due to memory constraints and higher
processing power requirements.
• When there is a high-quality MIDI sound source.
• When there is no requirement for dialogue.
Computer Representation of Music
MIDI (Musical Instrument Digital Interface) is a standard that manufacturers of
musical instruments use so that instruments can communicate musical information
via computers. The MIDI interface consists of:
Hardware - the physical connection between instruments; it specifies a MIDI port
(which plugs into the computer's serial port) and a MIDI cable.
Data format - has an instrument specification, the notion of the beginning and end of
a note, frequency and sound volume. Data is grouped into MIDI messages that specify
a musical event.
An instrument that satisfies both is a MIDI device (e.g. a synthesizer).
MIDI software applications include music recording and performance
applications, musical notation and printing applications, music education applications, etc.
MIDI Reception Modes:
 Mode 1: Omni On/Poly - usually for testing devices
 Mode 2: Omni On/Mono - has little purpose
 Mode 3: Omni Off/Poly - for general purpose
 Mode 4: Omni Off/Mono - for general purpose
Omni On/Off: respond to all messages regardless of their channel
Poly/Mono: respond to multiple/single notes per channel
The first half of the mode name specifies how the MIDI device monitors the
incoming MIDI channels.
If Omni is turned on, the MIDI device monitors all the MIDI channels and responds
to all channel messages, no matter which channel they are transmitted on.
If Omni is turned off, the MIDI device responds only to channel messages sent on
the channel(s) the device is set to receive.
The second half of the mode name tells the MIDI device how to play notes coming
in over the MIDI cable
If the option Poly is set, the device can play several notes at a time.
If the mode is set to Mono, the device plays notes like a monophonic synthesizer -
one note at a time.
MIDI Devices
The heart of any MIDI system is the MIDI synthesizer device. Most synthesizers have
the following common components:
• Sound Generators: Sound generators do the actual work of synthesizing sound;
the purpose of the rest of the synthesizer is to control the sound generators.
• Microprocessors: The microprocessor communicates with the keyboard to know
what notes the musician is playing, and with the control panel to know what
commands the musician wants to send to the microprocessor. The microprocessor
sends and receives MIDI messages.
• Keyboard: The keyboard affords the musician direct control of the synthesizer.
• Control Panel: The control panel controls those functions that are not directly
concerned with notes and durations (controlled by the keyboard).
• Auxiliary Controllers: Auxiliary controllers are available to give more control over
the notes played on the keyboard. Two very common variables on a synthesizer
are pitch bend and modulation.
• Memory: Synthesizer memory is used to store patches for the sound generators
and settings on the control panel.
MIDI Messages
MIDI messages transmit information between MIDI devices and determine what
kinds of musical events can be passed from device to device.
MIDI messages are divided into two different types:
1. Channel Messages: Channel messages go only to specified devices. There are
two types of channel messages:
I. Channel Voice Messages: Send actual performance data between MIDI
devices. Example: Note On, Note Off, Channel Pressure, Control Change
etc.
II. Channel Mode Messages: Determine the way that a receiving MIDI device
responds to channel voice messages. Example: Local Control, All Notes Off,
Omni Mode Off etc.
2. System Messages: System messages go to all devices in a MIDI system because
no channel numbers are specified. There are three types of system messages:
I. System Real-time Messages: System real-time messages are very short
and simple, consisting of only one byte. They carry no extra data with them.
Example: System Reset, Timing Clock etc.
II. System Common Messages: System common messages are commands
that prepare sequencers and synthesizers to play a song. Example: Song
Select, Tune Request etc.
III. System Exclusive Messages: System exclusive messages allow MIDI
manufacturers to create customized MIDI messages to send between their
MIDI devices.
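To make the byte-level structure of a channel voice message concrete, the sketch below builds a Note On and the matching Note Off message; the note number, velocity and channel are assumed example values, and the resulting bytes could be sent to any MIDI output port.

```python
def note_on(channel, note, velocity):
    """Build a 3-byte MIDI Note On channel voice message."""
    return bytes([0x90 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])

def note_off(channel, note, velocity=0):
    """Build a 3-byte MIDI Note Off channel voice message."""
    return bytes([0x80 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])

# Middle C (note 60) on channel 0 at velocity 100 -- assumed example values.
print(note_on(0, 60, 100).hex())   # 903c64
print(note_off(0, 60).hex())       # 803c00
```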
MIDI Software
The software applications generally fall into four major categories:
Music recording and performance applications:
• Provides functions such as recording of MIDI messages
• Editing and playing back the messages in performance.
Musical notation and printing applications:
• Allows writing music using traditional musical notation
• Print the music on paper for live performance or publication.
Synthesizer patch editors and librarians:
• Allow storage of different synthesizer patches in the computer's
memory and on disk drives.
• Allow editing of patches on the computer.
Music education applications:
• Teach different aspects of music using the computer monitor, keyboard and
other controllers of attached MIDI instruments.
The main issue in current MIDI-based computer music systems is interactivity. The
processing chain of interactive computer music systems can be conceptualized in
three stages:
• The sensing stage, when data are collected from controllers reading gesture
information from human performers on stage.
• The processing stage, when the computer reads and interprets information
coming from the sensors and prepares data for the response stage.
• The response stage, when the computer and some collection of sound-
producing devices share in realizing a musical output.
Speech
• Speech can be "perceived", "understood" and "generated" by humans and also
by machines.
• A human adjusts himself/herself very efficiently to different speakers and their
speech habits.
• The brain can recognize the very fine line between speech and noise.
• The human speech signal comprises a subjective lowest spectral component
known as the pitch, which is not proportional to frequency.
• The human ear is most sensitive in the range from 600 Hz to 6000 Hz
Speech signals have two properties which can be used in speech processing:
• Voiced speech signals show almost periodic behaviour during certain time
intervals.
• The spectrum of audio signals shows characteristic maxima, which are mostly
3-5 frequency bands.
Speech Generation
Generated speech must be understandable and must sound natural. The
requirement of understandable speech is a fundamental assumption, and the
natural sound of speech increases user acceptance.
Basic Notions
• The lowest periodic spectral component of the speech signal is called the
fundamental frequency. It is present in a voiced sound.
• A phone is the smallest speech unit, such as the m of mat and the b of bat in
English, that distinguishes one utterance or word from another in a given
language.
• Allophones mark the variants of a phone. For example, the aspirated p of pit
and the unaspirated p of spit are allophones of the English phoneme p.
• The morph marks the smallest speech unit which carries a meaning itself.
Therefore, consider is a morph, but reconsideration is not.
• A voiced sound is generated through the vocal cords. m,v and l are examples of
voiced sounds. The pronunciation of a voiced sound depends strongly on each
speaker.
• During the generation of an unvoiced sound, the vocal cords are opened. f and
s are unvoiced sounds. Unvoiced sounds are relatively independent from the
speaker.
More precisely, there are:
• Vowels - a speech sound created by the relatively free passage of breath
through the larynx and oral cavity, usually forming the most prominent and
central sound of a syllable (e.g., u from hunt);
• Consonants - a speech sound produced by a partial or complete obstruction of
the air stream by any of the various constrictions of the speech organs (e.g.,
voiced consonants, such as m from mother, fricative voiced consonants, such as
v from voice, fricative voiceless consonants, such as s from nurse, plosive
consonants, such as d from daily and affricate consonants, such as dg from
knowledge, or ch from chew).
Reproduced Speech Output
There are two ways of performing speech generation/output: time-dependent sound
concatenation and frequency-dependent sound concatenation.
Time-dependent Sound Concatenation
Individual speech units are composed like building blocks, where the composition
can occur at different levels. The individual phones are understood as speech units;
for example, the word crumb is built from its individual phones. With just a few
phones it is possible to create an unlimited vocabulary.
Frequency-dependent Sound Concatenation
Speech generation/output can also be based on a frequency-dependent sound
concatenation, e.g. through formant synthesis. Formants are frequency maxima in
the spectrum of the speech signal. Formant synthesis simulates the vocal tract
through a filter.
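One simple way to simulate a single formant is a two-pole resonator, a recursive digital filter tuned to the formant's centre frequency. The sketch below is only an illustration of that idea; the 700 Hz centre frequency, 100 Hz bandwidth and noise excitation are assumed example values, not a complete formant synthesizer.

```python
import math
import random

def formant_filter(signal, centre_hz, bandwidth_hz, sample_rate):
    """Apply a two-pole resonator that emphasizes frequencies near centre_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius
    theta = 2 * math.pi * centre_hz / sample_rate         # pole angle
    a1, a2 = 2 * r * math.cos(theta), -r * r              # feedback coefficients
    y1 = y2 = 0.0
    output = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        output.append(y)
        y1, y2 = y, y1
    return output

# Noise excitation shaped by one formant at 700 Hz (assumed example values).
excitation = [random.uniform(-1.0, 1.0) for _ in range(8000)]
shaped = formant_filter(excitation, centre_hz=700, bandwidth_hz=100, sample_rate=8000)
```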
• In the first step, transcription is performed, in which text is translated into a
sound script. Most transcription methods work with letter-to-phone rules
and a dictionary of exceptions stored in a library. The generation of such a
library is work-intensive, but it can be improved continuously through
interactive control by the user. The user recognizes deficiencies in the
transcription and improves the pronunciation manually.
• In the second step, the sound script is translated into a speech signal. Time- or
frequency-dependent concatenation can follow. While the first step is always a
software solution, the second step is most often implemented with signal
processors or even dedicated processors.
Speech Analysis
Speech analysis/input deals with the following research areas:
• Human speech has certain characteristics determined by a speaker. Hence,
speech analysis can serve to analyze who is speaking, i.e. to recognize a speaker
for his/her identification and verification.
– The computer identifies and verifies the speaker using an acoustic fingerprint.
– An acoustic fingerprint is a digitally stored speech probe (e.g., certain statement) of a
person.
• Another task of speech analysis is to analyze what has been said, i.e., to
recognize and understand the speech signal itself. Based on speech sequence,
the corresponding text is generated. This can lead to a speech-controlled
typewriter, a translation system or part of a workplace for the handicapped.
• Another area of speech analysis researches speech patterns with respect
to how a certain statement was said. For example, a spoken sentence sounds
different if a person is angry or calm.
Speech Recognition
Speech recognition is the process of converting an acoustic signal, captured by a
microphone or a telephone, to a set of words.
Speech recognition is the process of converting spoken language to written text or
some similar form. Speech recognition is the foundation of human computer
interaction using speech. Speech recognition in different contexts:
• Dependent or independent on the speaker.
• Discrete words or continuous speech.
• Small vocabulary or large vocabulary.
• In quiet environment or noisy environment.
Natural Language Understanding (NLU) is the process of analysing recognized words
and transforming them into data meaningful to a computer. In other words, NLU is a
computer-based system that understands human language. NLU is used in
combination with speech recognition.
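As a hedged illustration of how speech recognition is typically invoked from a program, the sketch below assumes the third-party SpeechRecognition package and an existing recording named speech.wav; both the package choice and the file are assumptions, not something prescribed by the text.

```python
# Sketch only: assumes the third-party "SpeechRecognition" package is installed
# (pip install SpeechRecognition) and that a file named speech.wav exists.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("speech.wav") as source:
    audio = recognizer.record(source)           # read the whole file into memory

try:
    text = recognizer.recognize_google(audio)   # send the audio to a web recognizer
    print("Recognized text:", text)
except sr.UnknownValueError:
    print("Speech was not understood")
```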
Types of Speech Recognition
Speech recognizers are divided into several different classes according to the type
of utterance that they can recognize:
• Connected words,
• Continuous speech (computer dictation)
• Spontaneous (natural) speech
• Voice Verification
• Voice Identification
