Computer Graphics and Multimedia Notes 3

The document discusses several common file formats used for multimedia including RTF, TIFF, RIFF, MIDI, JPEG, AVI, and MPEG. It provides details on the structure and specifications of these formats.

4.9 DATA AND FILE FORMATS STANDARDS


There are a large number of formats and standards available for multimedia systems.
Let us discuss the following file formats:
1. Rich-Text Format (RTF)
2. Tagged Image File Format (TIFF)
3. Resource Interchange File Format (RIFF)
4. Musical Instrument Digital Interface (MIDI)
5. Joint Photographic Experts Group (JPEG)
6. Audio Video Interleaved (AVI)
7. TWAIN
8. Motion Picture Experts Group (MPEG)
4.9.1 Rich Text Format
The rich-text format extends the range of information carried from one word-processor
application or desktop publishing system to another.
The key format information carried across in RTF documents is given below:
Character Set: It determines the characters supported in a particular implementation.
Font Table: This lists all fonts used. They are then mapped to the fonts available in the receiving
application for displaying text.
Color Table: It lists the colors used in the document. The receiving application then maps the
color table to the nearest set of colors available to it for display.
Document Formatting: Document margins and paragraph indents are specified here.
Section Formatting: Section breaks are specified to define separation of groups of paragraphs.
Paragraph Formatting: It specifies style sheets. It specifies control characters for paragraph
justification, tab positions, left, right and first indents relative to document margins, and the
spacing between paragraphs.
General Formatting: It includes footnotes, annotations, bookmarks and pictures.
Character Formatting: It includes bold, italic, underline (continuous, dotted or word), strike
through, shadow text, outline text, and hidden text.
Special Characters: It includes hyphens, spaces, backslashes, underscore and so on.
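To make these control words concrete, the following minimal sketch in C writes a tiny RTF document that exercises the font table, color table and character formatting described above. The file name and text are illustrative; the control words \rtf1, \fonttbl, \colortbl, \b, \i and \cf are standard RTF.

#include <stdio.h>

/* Minimal sketch: emit a tiny RTF document with a one-font font table,
   a two-entry color table, and some character formatting. */
int main(void)
{
    FILE *f = fopen("example.rtf", "w");   /* arbitrary example file name */
    if (!f) return 1;
    fputs("{\\rtf1\\ansi"                          /* RTF version and character set */
          "{\\fonttbl{\\f0 Times New Roman;}}"     /* font table: one font, id 0 */
          "{\\colortbl;\\red255\\green0\\blue0;}"  /* color table: default + red */
          "\\f0 Plain, {\\b bold}, {\\i italic} and "
          "{\\cf1 red} text.\\par"                 /* \cf1 selects color entry 1 */
          "}", f);
    fclose(f);
    return 0;
}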
4.9.2 TIFF File Format
TIFF is an industry-standard file format designed to represent raster image data generated by
scanners, frame grabbers, and paint/photo retouching applications. The TIFF file extension is
.tiff or .tif.
TIFF Version 6.0
It offers the following formats:
i. Grayscale, palette color, RGB full-color and black-and-white images.
ii. Run-length encoding, uncompressed images and modified Huffman data compression
schemes.
The additional formats are:
i. Tiled images, further compression schemes, and images using the CMYK and YCbCr color
models.
TIFF Structure
TIFF files consist of a header. The header consists of a byte-ordering flag, the TIFF file format
version number, and a pointer to a table called the Image File Directory (IFD). This directory
contains a table of entries of various tags and their information.
TIFF file format header:

Location   Field                                        Number of Bytes
0          Intel or Motorola byte order                 2
2          Version number                               2
4          Pointer to the first Image File Directory    4

Image File Directory

The IFD (Image File Directory) is a variable-length table containing directory entries. The
length of the table depends on the number of directory entries in the table. The first two bytes
contain the total number of entries in the table, followed by the directory entries. Each directory
entry consists of twelve bytes. The last item in the IFD is a four-byte pointer that points to the
next IFD. The byte content of each directory entry is as follows:
 The first two bytes contain the tag number (Tag ID).
 The second two bytes represent the type of the data.
 The next four bytes contain the length of the data.
 The final four bytes contain the data or a pointer to it.
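A minimal sketch of these on-disk structures in C (field names are my own; a real reader must also honor the byte-order flag when interpreting multi-byte values):

#include <stdint.h>

#pragma pack(push, 1)
/* TIFF header: 8 bytes at the start of the file. */
typedef struct {
    uint16_t byte_order;   /* 0x4949 ("II", Intel) or 0x4D4D ("MM", Motorola) */
    uint16_t version;      /* TIFF version number (42) */
    uint32_t first_ifd;    /* byte offset of the first Image File Directory */
} TiffHeader;

/* One 12-byte IFD directory entry. */
typedef struct {
    uint16_t tag;          /* Tag ID identifying the field */
    uint16_t type;         /* data type (byte, ASCII, short, long, rational, ...) */
    uint32_t count;        /* length: number of values of that type */
    uint32_t value_offset; /* the data itself if it fits in 4 bytes, else a pointer */
} TiffIfdEntry;
#pragma pack(pop)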
TIFF Tags
The first two bytes of each directory entry contain a field called the Tag ID. Tag IDs are
grouped into several categories: basic, informational, facsimile, and document storage and
retrieval.
TIFF Classes: (Version 5.0)
It has five classes
1. Class B for binary images
2. Class F for Fax
3. Class G for gray-scale images
4. Class P for palette color images
5. Class R for RGB full-color images.
4.9.3 Resource Interchange File Format (RIFF)
The RIFF file format consists of blocks of data called chunks. They are:
 RIFF Chunk - defines the content of the RIFF file.
 List Chunk - allows embedding additional file information such as archival location,
copyright information, creation date, and so on.
 Subchunk - allows adding information to a primary chunk when the primary chunk is
not sufficient.

The first chunk in a RIFF file must be a RIFF chunk, and it may contain one or more subchunks.
The first four bytes of the RIFF chunk data field are allocated for the form type field containing
four characters to identify the format of the data stored in the file: WAVE, AVI, RMID, PAL
and so on. The table below shows the filename extensions used for Microsoft Windows
multimedia RIFF file types.
File Type                        Form Type   File Extension
Waveform Audio File              WAVE        .WAV
Audio Video Interleaved File     AVI         .AVI
MIDI File                        RMID        .RMI
Device Independent Bitmap File   RDIB        .RDI
Palette File                     PAL         .PAL

The subchunk contains a four-character ASCII string ID to identify the type of data, four bytes
of size containing the count of data values, and the data itself. The data structure of a subchunk
is the same as that of all other chunks.
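A sketch of this common chunk layout in C (illustrative; the data bytes follow the 8-byte header directly in the file):

#include <stdint.h>

#pragma pack(push, 1)
/* Every RIFF chunk starts with the same 8-byte header:
   a 4-character ID followed by the size of the data that follows. */
typedef struct {
    char     id[4];   /* chunk type, e.g. "RIFF", "LIST", "fmt ", "data" */
    uint32_t size;    /* number of data bytes following this header */
    /* for a "RIFF" or "LIST" chunk, the first 4 data bytes hold the
       form type, e.g. "WAVE", "AVI ", "RMID", "PAL " */
} RiffChunkHeader;
#pragma pack(pop)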
RIFF chunk with two subchunks:
The first 4 characters of the RIFF chunk are reserved for the "RIFF" ASCII string. The next
four bytes define the total size of the data that follows, covering the form type and all
subchunks. The first four characters of the data field are reserved for the form type. The rest
of the data field contains two subchunks:
(i) fmt - defines the recording characteristics of the waveform.
(ii) data - contains the data for the waveform.
LIST Chunk
A RIFF chunk may contain one or more LIST chunks. LIST chunks allow embedding additional
file information such as archival location, copyright information, creation date, and a
description of the content of the file.
RIFF MIDI FILE FORMAT
RIFF MIDI contains a RIFF chunk with the form type "RMID" and a subchunk called "data"
for MIDI data. The layout is: 4 bytes for the ID of the RIFF chunk, 4 bytes for the size, 4 bytes
for the form type, 4 bytes for the ID of the "data" subchunk, and 4 bytes for the size of the
MIDI data.
RIFF DIBs (Device-Independent Bitmaps)
DIB is a Microsoft Windows standard format. It defines bitmaps and color attributes for
bitmaps independent of devices. DIBs are normally embedded in .BMP files, .WMF metafiles,
and .CLP files.
DIB Structure
BITMAPINFOHEADER | RGBQUAD | PIXELS

BITMAPINFOHEADER is the bitmap information header.
RGBQUAD is the color table structure.
PIXELS is the array of bytes for the pixel bitmap.
The following shows the DIB file format, also known as the .BMP format:

BITMAPFILEHEADER | BITMAPINFO (= BITMAPINFOHEADER + RGBQUAD) | PIXELS

A RIFF DIB file contains a RIFF chunk with the form type "RDIB" and a subchunk called
"data" for DIB data. The layout is: 4 bytes for the ID of the RIFF chunk, 4 bytes for the size,
4 bytes for the form type, 4 bytes for the ID of the "data" subchunk, and 4 bytes for the size
of the DIB data.
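A sketch of the standard Windows DIB structures in C, mirroring the definitions in the Windows SDK (packing matters because the file header is 14 bytes):

#include <stdint.h>

#pragma pack(push, 1)
typedef struct {              /* 14-byte .BMP file header */
    uint16_t bfType;          /* "BM" (0x4D42) */
    uint32_t bfSize;          /* total file size in bytes */
    uint16_t bfReserved1;
    uint16_t bfReserved2;
    uint32_t bfOffBits;       /* byte offset of the pixel array */
} BITMAPFILEHEADER;

typedef struct {              /* bitmap information header */
    uint32_t biSize;          /* size of this structure (40) */
    int32_t  biWidth;         /* image width in pixels */
    int32_t  biHeight;        /* image height in pixels */
    uint16_t biPlanes;        /* must be 1 */
    uint16_t biBitCount;      /* bits per pixel: 1, 4, 8, 24, ... */
    uint32_t biCompression;   /* 0 = uncompressed (BI_RGB) */
    uint32_t biSizeImage;     /* image size in bytes (may be 0 for BI_RGB) */
    int32_t  biXPelsPerMeter;
    int32_t  biYPelsPerMeter;
    uint32_t biClrUsed;       /* number of color table entries used */
    uint32_t biClrImportant;
} BITMAPINFOHEADER;

typedef struct {              /* one color table entry */
    uint8_t rgbBlue, rgbGreen, rgbRed, rgbReserved;
} RGBQUAD;
#pragma pack(pop)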
RIFF Palette File Format
The RIFF palette file format contains a RIFF chunk with the form type "PAL" and a subchunk
called "data" for palette data. The Microsoft Windows logical palette structure is enveloped in
the RIFF data subchunk. The palette structure contains the palette version number, the number
of palette entries, the intensity of red, green and blue colors, and flags for the palette usage.
The palette structure is described by the following code segment:
typedef struct tagLOGPALETTE {
    WORD palVersion;              // Windows version number for the structure
    WORD palNumEntries;           // number of palette entries
    PALETTEENTRY palPalEntry[];   // array of PALETTEENTRY data
} LOGPALETTE;

RIFF Audio Video Interleaved (AVI) File Format:

AVI files can be enveloped within the RIFF format to create the RIFF AVI file. A RIFF AVI
file contains a RIFF chunk with the form type "AVI" and two mandatory LIST chunks, "hdrl"
and "movi". The "hdrl" list defines the format of the data, and "movi" contains the data for
the audio-video streams. A third LIST chunk called "idx1" is an optional index chunk.
Boundary condition Handling for AVI files
Each audio and video stream is grouped together to form a rec chunk. If the size of a rec chunk
is not a multiple of 2048 bytes, the rec chunk is padded to make the size of each rec chunk a
multiple of 2048 bytes. To align data on a 2048-byte boundary, dummy data is added by a
"JUNK" data chunk. The JUNK chunk is a standard RIFF chunk with the 4-character identifier
"JUNK", followed by the dummy data.

[Figure: Interleaved audio and video for AVI files - a header followed by successive frames,
each frame carrying interleaved audio and video data.]
4.9.4 MIDI File Format
The MIDI file format follows the music recording metaphor: it provides the means of storing
separate tracks of music for each instrument so that they can be read and synchronized when
they are played.
The MIDI file format also contains chunks (i.e., blocks) of data. There are two types of chunks:
(i) header chunks (ii) track chunks.
Header Chunk
It is made up of 14 bytes.
The first four-character string is the identifier string, "MThd" .
The second four bytes contain the data size for the header chunk. It is set to a fixed value of six
bytes.
The last six bytes contain data for header chunk.
The table below shows an example of a header chunk.
Header Field        Byte #    Value
Identifier String   1 - 4     4D 54 68 64
Data Size           5 - 8     00 00 00 06
Data                9 - 14    00 00 00 01 01 E0

Track chunk
The Track chunk is organized as follows:
 The first 4-character string is the identifier.
 The second 4 bytes contain track length.
 The rest of the chunk contains MIDI messages.
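A sketch of the two chunk headers in C (MIDI files store multi-byte values big-endian, so a real reader must byte-swap on little-endian machines; the field names are my own):

#include <stdint.h>

#pragma pack(push, 1)
typedef struct {          /* header chunk: 14 bytes total */
    char     id[4];       /* identifier string "MThd" */
    uint32_t size;        /* data size, always 6 (big-endian in the file) */
    uint16_t format;      /* file format: 0, 1 or 2 */
    uint16_t ntracks;     /* number of track chunks */
    uint16_t division;    /* timing resolution; the table's 01 E0 = 480 ticks */
} MidiHeaderChunk;

typedef struct {          /* track chunk header; MIDI messages follow it */
    char     id[4];       /* identifier string "MTrk" */
    uint32_t length;      /* track length in bytes (big-endian) */
} MidiTrackChunkHeader;
#pragma pack(pop)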
MIDI Communication Protocol
This protocol uses messages of two or more bytes. The number of bytes depends on the type
of message. There are two types of messages: (i) channel messages and (ii) system messages.
Channel Messages
A channel message can have up to three bytes in a message. The first byte is called the status
byte, and the other two bytes are called data bytes. The channel number, which addresses one
of the 16 channels, is encoded by the lower nibble of the status byte. Each MIDI voice has a
channel number, and messages are sent to the channel whose channel number matches the
channel number encoded in the lower nibble of the status byte. There are two types of channel
messages: voice messages and mode messages.
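A minimal sketch in C of how the status byte splits into the message type (upper nibble) and the channel (lower nibble); the 0x90 note-on value is standard MIDI:

#include <stdint.h>

/* Decode the two nibbles of a MIDI channel-message status byte. */
static uint8_t midi_channel(uint8_t status)      { return status & 0x0F; }  /* channel 0-15 */
static uint8_t midi_message_type(uint8_t status) { return status & 0xF0; }  /* e.g. 0x90 = note on */

/* Example: status byte 0x93 is a note-on message addressed to
   channel 3 (the fourth channel when counting from 1). */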
Voice messages
Voice messages are used to control the voice of the instrument (or device); that is, to switch
notes on or off, to send key pressure messages indicating that a key is depressed, and to send
control messages to control effects like vibrato, sustain, and tremolo. Pitch wheel messages
are used to change the pitch of all notes.
Mode messages
Mode messages are used for assigning voice relationships for up to 16 channels; that is, to set
the device to MONO mode or POLY mode. Omni Mode on enables the device to receive voice
messages on all channels.
System Messages
System messages apply to the complete system rather than specific channels and do not contain
any channel numbers. There are three types of system messages: common messages, real-time
messages, and exclusive messages. In the following, we will see how these messages are used.
Common Messages
These messages are common to the complete system. They provide for functions such as
selecting a song, setting the song position pointer in beats, and sending a tune request to an
analog synthesizer.
System Real Time Messages
These messages are used for setting the system's real-time parameters. These parameters
include the timing clock, starting and stopping the sequencer, resuming the sequencer from a
stopped position, and resetting the system.
System Exclusive messages
These messages contain manufacturer-specific data such as identification, serial number,
model number, and other information. Here, a standard file format is generated which can be
moved across platforms and applications.
JPEG Motion Image:
JPEG motion images can be embedded in the AVI RIFF file format.
There are two standards available:
 MPEG - This carries patent and copyright issues.
 MPEG-2 - It provides better resolution and picture quality.

4.9.5 TWAIN
A standard interface was designed to allow applications to interface with different types of
input devices such as scanners, digital still cameras, and so on, using a generic TWAIN
interface without creating device-specific drivers. The benefits of this approach are as follows:
1. Application developers can code to a single TWAIN specification that allows
applications to interface with all TWAIN-compliant input devices.
2. Device manufacturers can write device drivers for their proprietary devices and, by
complying with the TWAIN specification, allow the devices to be used by all TWAIN-
compliant applications.
TWAIN Specification Objectives
The TWAIN specification was started with a number of objectives:
 Supports multiple platforms: including Microsoft Windows, Apple Macintosh
System 6.x or 7.x, UNIX, and IBM OS/2.
 Supports multiple devices: including scanners, digital cameras, frame grabbers, and so on.
 Standard extendibility and backward compatibility: The TWAIN architecture is
extensible for new types of devices and new device functionality. New versions of the
specification are backward compatible.
 Easy to use: The standard is well documented and easy to use.
The TWAIN architecture defines a set of application programming interfaces (APIs) and a
protocol to acquire data from input devices. It is a layered architecture consisting of a protocol
layer and an acquisition layer sandwiched between the application and device layers. The
protocol layer is responsible for communication between the application and acquisition layers.
The acquisition layer contains the virtual device driver to control the device. This virtual layer
is also called the source.
TWAIN ARCHITECTURE:

The TWAIN architecture defines a set of application programming interfaces (APIs) and a
protocol to acquire data from input devices.
It is a layered architecture.
It has the application layer, the protocol layer, the acquisition layer and the device layer.
Application Layer:
A TWAIN application sets up a logical connection with a device. TWAIN does not impose any
rules on the design of an application. However, it sets guidelines for the user interface to select
sources (logical devices) from a given list of logical devices, and also specifies user interface
guidelines for acquiring data from the selected sources.
The Protocol Layer:
The application layer interfaces with the protocol layer. The protocol layer is responsible for
communications between the application and acquisition layers. The protocol layer does not
specify the method of implementation of sources, physical connection to devices, control of
devices, or other device-related functionality. This clearly highlights that applications are
independent of sources. The heart of the protocol layer is the Source Manager. It manages all
sessions between an application and the sources, and monitors data acquisition transactions.
The functionality of the Source Manager is as follows:
 Provide a standard API for all TWAIN-compliant sources
 Provide selection of sources for a user from within an application
 Establish logical sessions between applications and sources, and also manage sessions
between multiple applications and multiple sources
 Act as a traffic cop to make sure that transactions and communication are routed to
appropriate sources, and also validate all transactions
 Keep track of sessions and unique session identities
 Load or unload sources as demanded by an application
 Pass all return codes from the source to the application
 Maintain a default source
The Acquisition Layer:
The acquisition layer contains the virtual device driver; it interacts directly with the device
driver. This virtual layer is also called the source. The source can be local and logically
connected to a local device, or remote and logically connected to a remote device (i.e., a device
over the network).
The source performs the following functions:
 Control of the device.
 Acquisition of data from the device.
 Transfer of data in agreed (negotiated) format. This can be transferred in native format
or another filtered format.
 Provision of a user interface to control the device.
The Device Layer:
The purpose of the device driver is to receive software commands and control the device
hardware accordingly. This is generally developed by the device manufacturer and shipped
with the device.
NEW WAVE RIFF File Format: This format contains two subchunks:
1. fmt
2. data
It may also contain the following optional subchunks:
1. fact
2. cue points
3. playlist
4. associated data chunk
5. inst (instrument) chunk
Fact Chunk: It stores file-dependent information about the contents of the WAVE file.
Cue Points Chunk: It identifies a series of positions in the waveform data stream.
Playlist Chunk: It specifies a play order for a series of cue points.
Associated Data Chunk: It provides the ability to attach information, such as labels, to sections
of the waveform data stream.
Inst Chunk: It stores a sampled sound synthesizer's samples.

4.10 MULTIMEDIA INPUT/OUTPUT TECHNOLOGIES


Multimedia Input and Output Devices
A wide range of input and output devices is available for multimedia.
Image Scanners: Image scanners are devices by which documents or manufactured parts are
scanned. The scanner acts as the camera eye and takes a photograph of the document, creating
an unaltered electronic pixel representation of the original.
Sound and Voice: When voice or music is captured by a microphone, it generates an electrical
signal. This electrical signal has an analog sinusoidal waveform. To digitize it, the signal is
converted into digital form using an analog-to-digital converter.
Full-Motion Video: It is the most important and most complex component of Multimedia
System. Video Cameras are the primary source of input for full-motion video.
Pen Driver: It is a pen device driver that interacts with the digitizer to receive all digitized
information about the pen location and builds pen packets for the recognition context manager.
Recognition Context Manager: It is the main part of the pen system. It is responsible for
coordinating Windows pen applications with the pen. It works with the recognizer, dictionary,
and display driver to recognize and display pen-drawn objects.
Recognizer: It recognizes handwritten characters and converts them to ASCII.
Dictionary: A dictionary is a dynamic link library (DLL); the Windows for Pen Computing
system uses this dictionary to validate the recognition results.
Display Driver: It interacts with the graphics device interface and display hardware. When a
user starts writing or drawing, the display driver paints the ink trace on the screen.
Video and Image Display Systems: Display System Technologies
There are a variety of display system technologies employed for decoding compressed data
for display. Mixing and scaling technologies are used for VGA screens:
VGA mixing: Images from multiple sources are mixed in the image acquisition memory.
VGA mixing with scaling: Scaler ICs are used for sizing and positioning of images in
predefined windows.
Dual-buffered VGA mixing/scaling: Dual buffering prevents loss of the original image. In this
technology, a separate buffer is used to maintain the original image.
Visual Display Technology Standards
MDA: Monochrome Display Adapter.
 It was introduced by IBM in 1981.
 It displays 80 x 25 columns and rows of text.
 It could not display bitmap graphics.
CGA: Color Graphics Adapter.
 It was introduced in 1981.
 It was designed to display both text and bitmap graphics; it supported RGB color
display.
 It could display text at a resolution of 640 x 200 pixels.
 It displays both 40 x 25 and 80 x 25 rows and columns of text characters.
MGA: Monochrome Graphics Adapter.
 It was introduced in 1982.
 It could display both text and graphics.
 It could display at a resolution of 720 x 350 for text and 720 x 338 for graphics. MDA
is the compatible mode for this standard.
EGA: Enhanced Graphics Adapter.
 It was introduced in 1984.
 It emulated both the MDA and CGA standards.
 It allowed the display of both text and graphics in 16 colors at a resolution of 640 x 350
pixels.
PGA: Professional Graphics Adapter.
 It was introduced in 1985.
 It could display bit map graphics at 640 x 480 resolution and 256 colors.
 Compatible mode of this standard is CGA.
VGA: Video Graphics Array.
 It was introduced by IBM in 1988.
 It offers CGA and EGA compatibility.
 It displays both text and graphics.
 It generates analog RGB signals to display 256 colors.
 It remains the basic standard for most video display systems.
SVGA: Super Video Graphics Adapter.
It was developed by VESA (Video Electronics Standards Association). Its goal is to display at
higher resolutions than VGA, with higher refresh rates that minimize flicker.
XGA: Extended Graphics Array
It was developed by IBM. It offers a VGA-compatible mode. It provides a resolution of
1024 x 768 pixels in 256 colors. XGA utilizes an interlace scheme for refresh rates.
Flat Panel Display system
Flat panel displays use a fluorescent tube for backlighting to give the display a sufficient level
of brightness. The four basic technologies used for flat panel display are:
1. Passive-matrix monochrome
2. Active-matrix monochrome
3. Passive-matrix color
4. Active-matrix color.
LCD (Liquid Crystal Display)
Construction: Two glass plates, each containing a light polarizer at right angles to the other
plate, sandwich the nematic (thread-like) liquid crystal material.
A liquid crystal is a compound having a crystalline arrangement of molecules that nevertheless
flows like a liquid. Nematic liquid crystal compounds tend to keep the long axes of their rod-
shaped molecules aligned. Rows of horizontal transparent conductors are built into one glass
plate, and columns of vertical conductors are put into the other plate. The intersection of two
conductors defines a pixel position.
Passive Matrix LCD
Working: Normally, the molecules are aligned in the 'ON' state.
Polarized light passing through the materials is twisted so that it will pass through the opposite
polarizer. The light is then reflected back to the viewer. To turn off the pixel, we have to apply
a voltage to the two intersecting conductors to align molecules so that the light is not twisted.
ACTIVE Matrix LCD
In this device, a transistor is placed at each pixel position, using thin-film transistor technology.
The transistors are used to control the voltage at pixel locations and to prevent charge from
gradually leaking out of the liquid crystal cells.
PRINT OUTPUT TECHNOLOGIES
There are various printing technologies available, namely dot matrix, inkjet, laser print server,
and color inkjet. But laser printing technology is the most common for multimedia systems.
To explain this technology, let us take the Hewlett-Packard LaserJet III laser printer as an
example.
The basic components of the laser printer are
 Paper feed mechanism.
 Paper guide.
 Laser assembly.
 Fuser.
 Toner cartridge.
Working: The paper feed mechanism moves the paper from a paper tray through the paper path
in the printer. The paper passes over a set of corona wires that induce a charge in the paper.
The charged paper passes over a drum coated with fine-grain carbon (toner), and the toner
attaches itself to the paper as a thin film of carbon. The paper is then struck by a scanning laser
beam that follows the pattern of the text or graphics to be printed. The carbon particles attach
themselves to the pixels traced by the laser beam. The fuser assembly then binds the carbon
particles to the paper.
Role of Software in the printing mechanism:
The software package sends information to the printer to select and control printing features.
Printer drivers (files) control the actual operation of the printer and allow the application
software to access the features of the printer.
IMAGE SCANNERS
In a document imaging system, documents are scanned using a scanner. The document being
scanned is placed on the scanner bed or fed into the sheet feeder of the scanner. The scanner
acts as the camera eye and takes a photograph of the document, creating an image of the
original. The pixel representation (image) is recreated by the display software to render the
image of the original document on screen or to print a copy of it.
Types of Scanners
A- and B-size scanners, large form factor scanners, flatbed scanners, rotary drum scanners
and handheld scanners are examples of scanners.
Charge-Coupled Devices
All scanners use charge-coupled devices as their photosensors. CCDs consist of cells arranged
in a fixed array on a small square or rectangular solid-state surface. A light source moves
across the document, and the light reflected by the mirror charges the cells. The amount of
charge depends upon the intensity of the reflected light, which in turn depends on the pixel
shade in the document.
Image Enhancement Techniques
Half-tones
In a half-tone process, patterns of dots used to build a scanned or printed image create the
illusion of continuous shades of gray or continuous shades of color. Hence only a limited
number of shades is created. This process is used in newspaper printing. In a black-and-white
or color photograph, by contrast, almost infinite levels of tone are used.
Dithering
Dithering is a process in which groups of pixels in different patterns are used by the scanner
to approximate halftone patterns. It is used when scanning original black-and-white
photographs.
Image enhancement techniques include software controls for brightness, deskew (automatic
correction of page alignment), contrast, sharpening, emphasis, and cleaning up of black noise
dots.
Image Manipulation
It includes scaling, cropping and rotation.
Scaling: Scaling can be up or down; scaling software is available to reduce or enlarge an image.
This software uses resampling algorithms; a simple sketch follows this list.
Cropping: Removing some parts of the image and keeping the rest of the image as a subset of
the old image.
Rotation: An image can be rotated by any degree to display it at different angles.
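As an illustration of the simplest such resampling algorithm, the following C sketch scales an 8-bit grayscale image by nearest-neighbor sampling (my own example; the notes do not name a specific algorithm):

#include <stdint.h>

/* Nearest-neighbor scaling of an 8-bit grayscale image:
   for each destination pixel, pick the closest source pixel. */
static void scale_nearest(const uint8_t *src, int sw, int sh,
                          uint8_t *dst, int dw, int dh)
{
    for (int y = 0; y < dh; y++) {
        int sy = y * sh / dh;              /* nearest source row */
        for (int x = 0; x < dw; x++) {
            int sx = x * sw / dw;          /* nearest source column */
            dst[y * dw + x] = src[sy * sw + sx];
        }
    }
}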

4.11 DIGITAL VOICE AND AUDIO


4.11.1 Digital Audio
Sound is made up of continuous analog sine waves that tend to repeat, depending on the music
or voice. The analog waveforms are converted into digital format by an analog-to-digital
converter (ADC) using a sampling process.
[Figure: Speech recognition pipeline - speech input passes through an analog-to-digital
converter, amplitude and noise normalization, and parametric analysis; in train mode the new
reference pattern is added to the list, while in recognize mode dynamic time warping is applied,
the unknown pattern is compared with the reference patterns, and the matched reference
pattern is output.]
Sampling process
Sampling is a process where the analog signal is sampled over time at regular intervals to obtain
the amplitude of the analog signal at the sampling time.
Sampling rate
The regular interval at which the sampling occurs is called the sampling rate.
Digital Voice
Speech is analog in nature and is converted to digital form by an analog-to-digital converter
(ADC). An ADC takes an input signal from a microphone and converts the amplitude of the
sampled analog signal to an 8-, 16- or 32-bit digital value.
The four important factors governing the ADC process are
 Sampling Rate
 Resolution
 Linearity
 Conversion Speed.
Sampling Rate: The rate at which the ADC takes a sample of an analog signal.
Resolution: The number of bits utilized for conversion determines the resolution of ADC.
Linearity: Linearity implies that the sampling is linear at all frequencies and that the amplitude
truly represents the signal.
Conversion Speed: It is the speed at which the ADC converts the analog signal into digital
values. It must be fast enough to keep up with the sampling rate.
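A minimal sketch in C of the sampling and quantization steps described above (the 8 kHz rate, 440 Hz tone and 8-bit resolution are illustrative choices, not values from the notes):

#include <math.h>
#include <stdint.h>

#define SAMPLE_RATE 8000            /* samples per second (illustrative) */
#define N_SAMPLES   8000            /* one second of audio */

/* Sample a 440 Hz sine wave at regular intervals and quantize each
   amplitude to an 8-bit value, as an ADC with 8-bit resolution would. */
static void sample_tone(uint8_t out[N_SAMPLES])
{
    for (int n = 0; n < N_SAMPLES; n++) {
        double t = (double)n / SAMPLE_RATE;               /* sampling time */
        double amplitude = sin(2.0 * M_PI * 440.0 * t);   /* range -1.0 .. 1.0 */
        out[n] = (uint8_t)((amplitude + 1.0) * 127.5);    /* quantize to 0..255 */
    }
}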
VOICE Recognition System
Voice Recognition Systems can be classified into three types.
1. Isolated-word Speech Recognition.
2. Connected-word Speech Recognition.
3. Continuous Speech Recognition.
1. Isolated-word Speech Recognition.
It provides recognition of a single word at a time. The user must separate every word by a
pause. The pause marks the end of one word and the beginning of the next word.
Stage 1: Normalization
The recognizer's first task is to carry out amplitude and noise normalization to minimize the
variation in speech due to ambient noise, the speaker's voice, the speaker's distance from and
position relative to the microphone, and the speaker's breath noise.
Stage 2: Parametric Analysis
It is a pre-processing stage that extracts relevant time-varying sequences of speech parameters.
This stage serves two purposes: (i) it extracts time-varying speech parameters, and (ii) it
reduces the amount of data by extracting only the relevant speech parameters.
Training mode: In the training mode of the recognizer, the new frames are added to the
reference list.
Recognizer mode: If the recognizer is in recognizer mode, dynamic time warping is applied to
the unknown pattern to average out the phoneme (the smallest distinguishable sound, spoken
words being constructed by concatenating basic phonemes) time duration. The unknown
pattern is then compared with the reference patterns.
A speaker-independent isolated-word recognizer can be achieved by grouping a large number
of samples corresponding to a word into a single cluster.
2. Connected-Word Speech Recognition
Connected-word speech consists of a spoken phrase containing a sequence of words; it may
not contain long pauses between words. One method uses the word-spotting technique, which
recognizes words in a connected-word phrase. In this technique, recognition is carried out by
compensating for rate-of-speech variations through a process called dynamic time warping
(used to expand or compress the time duration of a word), and by sliding the adjusted
connected-word phrase representation in time past a stored word template for a likely match.
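A minimal sketch of dynamic time warping in C, computing the warped distance between two 1-D feature sequences. Real recognizers warp multi-dimensional speech parameter vectors; this simplification to scalar features is my own.

#include <math.h>
#include <stdlib.h>

#define BIG 1e30

/* Dynamic time warping distance between sequences a[0..n-1] and b[0..m-1].
   dp[i][j] = cost of the best alignment of a[0..i-1] with b[0..j-1]. */
static double dtw(const double *a, int n, const double *b, int m)
{
    double *dp = malloc((size_t)(n + 1) * (m + 1) * sizeof *dp);
    if (!dp) return -1.0;
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= m; j++)
            dp[i * (m + 1) + j] = (i == 0 && j == 0) ? 0.0 : BIG;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            double cost = fabs(a[i - 1] - b[j - 1]);      /* local distance */
            double best = dp[(i - 1) * (m + 1) + j];      /* expand in time */
            double d2   = dp[i * (m + 1) + (j - 1)];      /* compress in time */
            double d3   = dp[(i - 1) * (m + 1) + (j - 1)];/* one-to-one match */
            if (d2 < best) best = d2;
            if (d3 < best) best = d3;
            dp[i * (m + 1) + j] = cost + best;
        }
    }
    double result = dp[n * (m + 1) + m];
    free(dp);
    return result;
}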
3. Continuous Speech Recognition
This system can be divided into three sections:
i. A section consisting of digitization, amplitude normalization, time normalization and
parametric representation.
ii. A second section consisting of segmentation and labeling of the speech segments into a
symbolic string based on knowledge-based or rule-based systems.
iii. A final section that matches speech segments to recognize word sequences.
Voice Recognition System Performance
It is categorized into two measures: Voice recognition performance and system performance.
Voice Recognition Performance
Voice Recognition Performance is based on the accuracy with which voice segments are
identified. The following four measures are used to determine voice recognition performance.
1. Voice Recognition Accuracy:
   Voice recognition accuracy = (Number of correctly recognized words / Number of test words) x 100
2. Substitution Error:
   Substitution error = (Number of substituted words / Number of test words) x 100
3. No Response Error:
   No response error = (Number of no responses / Number of test words) x 100
4. Insertion Error:
   Insertion error = (Number of inserted words / Number of test words) x 100
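As a worked sketch, the following C fragment computes these four measures from raw counts (the structure, names and example numbers are illustrative, not taken from the notes):

/* Compute the four voice-recognition performance measures as percentages. */
struct recognition_counts {
    int test_words;    /* total number of test words */
    int correct;       /* correctly recognized words */
    int substituted;   /* words recognized as some other word */
    int no_response;   /* words producing no response */
    int inserted;      /* spurious words inserted by the recognizer */
};

static double pct(int count, int total) { return 100.0 * count / total; }

/* Example usage:
   struct recognition_counts c = { 100, 92, 5, 3, 2 };
   pct(c.correct, c.test_words)      -> 92.0 (accuracy)
   pct(c.substituted, c.test_words)  ->  5.0 (substitution error)  */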

System Performance:
System performance is dependent on the size of the vocabulary, speaker independence,
response time, user interface, and system throughput.
Voice Recognition Applications
Voice mail integration: The voice-mail message can be integrated with e-mail messages to
create an integrated message.
Database Input and Query Applications
A number of applications have been developed around voice recognition and voice synthesis
functions. The following lists a few applications which use voice recognition:
 Applications such as order entry and tracking. This is a centralized server function;
remote users can dial into the system to enter an order or to track it by making a voice
query.
 Voice-activated rolodex or address book: when a user speaks the name of a person, the
rolodex application searches for the name and voice-synthesizes the name, address,
telephone number and fax number of the selected person. In a medical emergency,
ambulance technicians can dial in and register patients by speaking into the hospital's
centralized system.
 Police can make a voice query through a central database to take follow-up action if
they catch a suspect.
 Language-teaching systems are an obvious use for this technology. The system can ask
the student to spell or speak a word. When the student speaks or spells the word, the
system performs voice recognition and measures the student's ability to spell. Based
on the student's ability, the system can adjust the level of the course. This creates a
self-adjusting learning system that follows the individual's pace.
 Foreign language learning is another good application, where an individual student can
input words and sentences into the system. The system can then correct the student's
pronunciation or grammar.
