
Speech Processing in Multimedia

BIT/DIT 2024/25
Sem II

Ndejje University – FoSC


Introduction
Speech processing is the field of study that deals with the
analysis, synthesis, and recognition of spoken language.
It plays a crucial role in multimedia applications, including
voice recognition, speech synthesis, and audio
compression.
Importance of Speech Processing in Multimedia
 Used in voice assistants (Siri, Alexa, Google Assistant).
 Enhances human-computer interaction (HCI).
 Supports automatic transcription and subtitling.
 Enables speech-based authentication and security.
 Essential for accessibility features (e.g., screen readers for
visually impaired users).
Key Components of Speech Processing
Speech Signal Processing
 Speech Representation: Digital signals stored as waveforms.
 Sampling & Quantization: Converts analog speech signals into digital form.
 Fourier Analysis: Extracts frequency components from speech.
 Feature Extraction: Identifies key speech characteristics (e.g., pitch, energy,
formants).
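The steps above can be illustrated with a short, self-contained sketch (a minimal example assuming NumPy/SciPy and a hypothetical mono recording speech.wav; the pitch estimate is deliberately crude):

```python
import numpy as np
from scipy.io import wavfile

# Read a digitized speech signal (sampling/quantization already done by the sound card).
sr, signal = wavfile.read("speech.wav")              # hypothetical mono recording
signal = signal.astype(np.float64)
signal = signal / np.max(np.abs(signal))             # normalize amplitude

# One 25 ms analysis frame with a Hamming window.
frame_len = int(0.025 * sr)
frame = signal[:frame_len] * np.hamming(frame_len)

# Fourier analysis: magnitude spectrum of the frame.
spectrum = np.abs(np.fft.rfft(frame))
freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)

# Simple features: short-time energy and a crude pitch estimate
# (strongest spectral peak in the typical voice range 50-400 Hz).
energy = np.sum(frame ** 2)
voice_band = (freqs > 50) & (freqs < 400)
pitch_hz = freqs[voice_band][np.argmax(spectrum[voice_band])]
print(f"energy={energy:.3f}  pitch~{pitch_hz:.1f} Hz")
```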
Speech Recognition (Automatic Speech Recognition – ASR)
 Converts spoken words into text.
 Uses Hidden Markov Models (HMM) and Deep Learning (Neural
Networks).
 Applications: Voice commands, dictation software, call center automation.
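For a quick feel of ASR in practice, here is a minimal sketch using the third-party SpeechRecognition package (the file name command.wav is hypothetical, and the Google Web Speech backend needs network access):

```python
import speech_recognition as sr   # third-party "SpeechRecognition" package (assumed installed)

recognizer = sr.Recognizer()
with sr.AudioFile("command.wav") as source:       # hypothetical recorded voice command
    audio = recognizer.record(source)             # capture the whole file

try:
    text = recognizer.recognize_google(audio)     # cloud-based recognizer (needs internet)
    print("Recognized text:", text)
except sr.UnknownValueError:
    print("Speech was not intelligible")
```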



Key Components of Speech Processing…
Speech Synthesis (Text-to-Speech – TTS)
 Converts text into human-like speech.
 Uses techniques like concatenative synthesis and deep learning-
based synthesis.
 Applications: Audiobooks, screen readers, virtual assistants.
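A minimal TTS sketch, assuming the third-party pyttsx3 package (a simple offline engine; commercial systems rely on the concatenative or neural methods described above):

```python
import pyttsx3   # offline text-to-speech engine (assumed installed)

engine = pyttsx3.init()
engine.setProperty("rate", 150)    # approximate speaking rate in words per minute
engine.say("Welcome to the speech processing lecture.")
engine.runAndWait()                # block until the utterance has been spoken
```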
Speaker Recognition & Speech Authentication
 Identifies individuals based on voice characteristics.
 Used in biometric security and forensic analysis.
 Types:
• Speaker Verification: Confirms identity (1:1 matching).
• Speaker Identification: Recognizes a speaker from a group (1:N matching).
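A toy sketch of the 1:1 and 1:N matching logic (NumPy only; the 128-dimensional "voice embeddings" and the 0.8 threshold are hypothetical stand-ins for a real voice feature extractor):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Hypothetical fixed-length voice embeddings for enrolled speakers.
enrolled = {"alice": rng.random(128), "bob": rng.random(128)}
probe = enrolled["alice"] + 0.05 * rng.random(128)    # a new utterance, here from alice

# Speaker verification (1:1): compare the probe against one claimed identity.
claimed = "alice"
accepted = cosine_similarity(probe, enrolled[claimed]) > 0.8   # illustrative threshold
print("Verification accepted:", accepted)

# Speaker identification (1:N): pick the best-matching enrolled speaker.
best = max(enrolled, key=lambda name: cosine_similarity(probe, enrolled[name]))
print("Identified speaker:", best)
```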
Speech Processing Techniques
• Feature Extraction Methods
 Mel-Frequency Cepstral Coefficients (MFCCs): Compact spectral features computed on the perceptually motivated mel frequency scale (sketched below).
 Linear Predictive Coding (LPC): Models each speech sample as a linear combination of past samples (a compact vocal-tract model).
 Spectrogram Analysis: Visualizes how speech energy is distributed over frequency and time.
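A minimal sketch of MFCC extraction and spectrogram analysis, assuming the third-party librosa library and a hypothetical file speech.wav:

```python
import librosa   # third-party audio analysis library (assumed installed)

# Load a hypothetical recording at 16 kHz and compute 13 MFCCs per frame.
y, sr = librosa.load("speech.wav", sr=16000)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# A log-magnitude spectrogram of the same signal, as used in spectrogram analysis.
spectrogram = librosa.amplitude_to_db(abs(librosa.stft(y)))
print(mfccs.shape, spectrogram.shape)    # (13, num_frames), (freq_bins, num_frames)
```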
• Noise Reduction & Speech Enhancement
 Spectral Subtraction: Removes background noise.
 Adaptive Filtering: Adjusts to changing noise conditions.
 Deep Learning Models: AI-driven noise cancellation (e.g., NVIDIA RTX Voice).
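A minimal spectral-subtraction sketch (NumPy/SciPy only; the leading 0.2 s of the toy signal is assumed to contain noise only, so it can serve as the noise estimate):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10):
    """Basic spectral subtraction: estimate the noise spectrum from the first few
    frames (assumed speech-free) and subtract it from every frame's magnitude."""
    f, t, Z = stft(noisy, fs=fs, nperseg=512)
    magnitude, phase = np.abs(Z), np.angle(Z)
    noise_estimate = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    cleaned = np.maximum(magnitude - noise_estimate, 0.0)     # floor at zero
    _, enhanced = istft(cleaned * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced

fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 300 * t) * (t > 0.2)          # toy "speech" starting after 0.2 s
noisy = speech + 0.3 * np.random.randn(len(speech))        # additive background noise
enhanced = spectral_subtraction(noisy, fs)
```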
• Speech Compression
 Lossy Compression: Reduces file size with slight quality loss (MP3, AAC).
 Lossless Compression: Maintains quality (FLAC, ALAC).
 Codec Examples: G.711 (VoIP), AMR (mobile speech), Opus (real-time streaming).
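For a flavour of how a speech codec trades quality for size, here is a simplified sketch of the µ-law companding idea behind G.711 (an illustration of the principle, not a bit-exact implementation of the standard):

```python
import numpy as np

MU = 255   # µ-law parameter used by G.711

def mu_law_encode(x, mu=MU):
    """Compress samples in [-1, 1] non-uniformly, then quantize to 8 bits."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((compressed + 1) / 2 * 255).astype(np.uint8)

def mu_law_decode(codes, mu=MU):
    """Expand the 8-bit codes back to approximately the original samples."""
    compressed = codes.astype(np.float64) / 255 * 2 - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

samples = np.linspace(-1, 1, 5)
print(mu_law_decode(mu_law_encode(samples)))    # close to the original values
```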
Applications of Speech Processing in
Multimedia
 Virtual Assistants: Siri, Alexa, Google Assistant.
 Automatic Transcription & Subtitling: YouTube, Zoom AI captions.
 Speech-Based Search: Google Voice Search, Shazam (music
identification).
 Interactive Voice Response (IVR) Systems: Call center automation.
 Dubbing & Voice Cloning: AI-generated speech in movies and video
games.



Challenges & Emerging Trends in Speech
Processing
Challenges:
 Background noise and speech variability.
 Accents and multilingual speech recognition.
 Privacy concerns in voice data collection.
Emerging Trends:
 AI-powered speech synthesis with emotional expression.
 Real-time speech translation systems.
 Enhanced speech compression for 5G and cloud applications.



Speech Recognition
Speech recognition, also known as Automatic Speech Recognition (ASR),
is a technology that converts spoken language into text.
It enables human-computer interaction through voice commands,
dictation, and automated transcription.



Working of Speech Recognition Systems
Basic Process
 Audio Input: The system captures speech using a microphone.
 Pre-Processing: The signal is cleaned by removing background noise.
 Feature Extraction: Key speech features (e.g., pitch, tone, frequency) are
extracted.
 Pattern Matching & Recognition: The system compares the input with stored
speech patterns using Machine Learning (ML) models.
 Text Output: The recognized speech is converted into text.
Technologies Used
 Acoustic Modeling: Relates acoustic features of the audio signal to phonemes (basic speech sounds).
 Language Modeling: Predicts word sequences to improve accuracy.
 Deep Learning (Neural Networks): Used in modern speech recognition systems.
 Hidden Markov Models (HMMs): Used for probabilistic speech recognition.
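A toy illustration of the HMM idea (NumPy only; the states, symbols, and probabilities are made up for demonstration, whereas real ASR systems use phoneme-level models with continuous acoustic features):

```python
import numpy as np

# Toy HMM: 2 hidden states (e.g. two phoneme-like units), 3 discrete observation symbols.
start = np.array([0.6, 0.4])                    # initial state probabilities
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])                  # state transition probabilities
emit = np.array([[0.5, 0.4, 0.1],
                 [0.1, 0.3, 0.6]])              # P(observation | state)

def forward_probability(obs):
    """Forward algorithm: P(observation sequence | model)."""
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()

# Score two candidate observation sequences; the more probable one "matches" the model better.
print(forward_probability([0, 1, 2]), forward_probability([2, 2, 2]))
```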
Types of Speech Recognition Systems
Speaker-Dependent vs. Speaker-Independent:
 Speaker-Dependent: Trained on a specific user’s voice.
 Speaker-Independent: Works for any user without prior training.
Continuous vs. Discrete Speech Recognition:
 Continuous Speech Recognition: Allows natural speech (e.g., virtual
assistants).
 Discrete Speech Recognition: Requires pauses between words for
processing.
Keyword Spotting vs. Large Vocabulary Recognition:
 Keyword Spotting: Detects specific words (e.g., “Hey Google”).
 Large Vocabulary Recognition: Recognizes full sentences with complex
grammar.
Speech Coding
Speech coding is the process of compressing speech signals to
reduce data size while maintaining intelligibility and quality.
It is used in various applications, including mobile communications,
VoIP, and audio streaming.
Importance of Speech Coding
 Reduces bandwidth requirements in communication systems.
 Enables efficient storage and transmission of speech data.
 Improves call quality in limited bandwidth networks (e.g., mobile
networks).
 Supports applications like VoIP, video conferencing, and AI voice assistants.
Basics of Speech Coding
Speech Signal Characteristics
 Human speech is non-stationary: Varies over time in amplitude and frequency.
 Frequency range: Typically between 300 Hz – 3400 Hz for telecommunication.
 Speech has redundancy: Compression techniques remove unnecessary data to
reduce size.
Process of Speech Coding
 Speech Acquisition: Capturing speech using a microphone.
 Digitization: Converting analog speech into digital form using sampling and
quantization.
 Compression: Applying coding techniques to reduce data size while preserving
speech quality.
 Transmission/Storage: Sending or storing the compressed speech data.
 Decoding: Reconstructing the speech signal at the receiver’s end.
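The digitization step can be sketched numerically (NumPy only; a 220 Hz tone stands in for the analog speech signal):

```python
import numpy as np

fs = 8000          # telephone-band sampling rate (covers roughly 300-3400 Hz speech)
bits = 8           # bits per sample after quantization
duration = 1.0     # seconds

# Digitization: sample a synthetic "speech-like" tone and quantize it uniformly.
t = np.arange(int(fs * duration)) / fs
analog = 0.5 * np.sin(2 * np.pi * 220 * t)                 # stand-in for an analog signal
levels = 2 ** bits
digital = np.round((analog + 1) / 2 * (levels - 1)).astype(np.uint8)

# Resulting (uncompressed) bitrate before any speech coding is applied.
print("bitrate:", fs * bits, "bit/s")    # 64000 bit/s, the classic G.711 PCM rate
```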
Types of Speech Coding Techniques
Waveform Coding
 Directly encodes the speech waveform.
 Maintains high quality but has lower compression efficiency.
Examples:
• Pulse Code Modulation (PCM): Standard in telephone networks.
• Adaptive Differential PCM (ADPCM): Reduces bitrate while maintaining quality.
Parametric (Vocoder-Based) Coding
 Models human speech production and transmits only essential parameters.
 Highly efficient but may reduce speech quality.
Examples:
• Linear Predictive Coding (LPC): Used in low-bitrate speech applications.
• Code-Excited Linear Prediction (CELP): Used in VoIP and mobile communications.
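To see why parametric coding is so efficient, here is a rough linear-prediction sketch (NumPy only; least squares is used instead of the Levinson-Durbin recursion found in real codecs): a handful of predictor coefficients plus a small residual carry almost all of the information in the frame.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Least-squares linear prediction: predict each sample from the previous `order` samples."""
    X = np.array([frame[n - order:n][::-1] for n in range(order, len(frame))])
    y = frame[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

order = 8
rng = np.random.default_rng(0)
n = np.arange(400)
# Synthetic voiced-like frame: a decaying resonance plus a little noise.
frame = np.sin(2 * np.pi * 0.05 * n) * np.exp(-0.002 * n) + 0.01 * rng.standard_normal(len(n))

a = lpc_coefficients(frame, order)
predicted = np.array([a @ frame[k - order:k][::-1] for k in range(order, len(frame))])
residual = frame[order:] - predicted
print("residual energy / signal energy:", np.sum(residual**2) / np.sum(frame[order:]**2))
```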
Types of Speech Coding Techniques…
Hybrid Coding
 Combines waveform and parametric coding for better efficiency.
Examples:
• G.729: Used in VoIP and telephony.
• AMR (Adaptive Multi-Rate): Used in mobile networks.



Speech Coding Applications &
Challenges
Applications of Speech Coding
 Mobile Communications: Used in GSM, LTE, and VoIP calls.
 VoIP (Voice over IP): Reduces bandwidth usage while maintaining clarity.
 Speech Storage: Used in voicemail and audio recording systems.
 AI Voice Assistants: Efficient coding helps process voice commands faster.
 Hearing Aids & Cochlear Implants: Improves speech intelligibility for users.
Challenges in Speech Coding
 Balancing Compression & Quality: High compression may lead to poor speech
intelligibility.
 Real-Time Processing: Coding algorithms must operate with minimal delay.
 Noise & Distortion: Background noise affects speech coding performance.
 Computational Complexity: Advanced algorithms require high processing power.
Trends in Speech Coding
 AI-Based Speech Coding: Machine learning models improving
compression efficiency.
 5G & Beyond: Advanced speech codecs for ultra-low latency
communication.
 Neural Vocoders: Deep learning-driven speech synthesis with high
naturalness.
 Ultra-Low Bitrate Coding: Further reducing bandwidth usage with minimal perceptible
quality loss.



Speech Synthesis
Speech synthesis, also known as Text-to-Speech (TTS), is the technology
that converts written text into spoken language.
It is widely used in assistive technologies, virtual assistants, and
automated customer service systems.



Types of Speech Synthesis
Concatenative Synthesis
 Uses pre-recorded speech segments.
 Produces natural-sounding speech but requires large storage.
Types:
• Unit Selection Synthesis: Uses large speech databases for smooth speech.
• Diphone Synthesis: Concatenates diphones (units spanning the transition between two adjacent phonemes) for smoother joins.
Formant Synthesis
 Generates speech using mathematical models of the vocal tract.
 Requires less storage but sounds robotic.
Example: The classic DECtalk system used in early computers.
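A minimal formant-synthesis sketch (NumPy/SciPy only; the formant frequencies and bandwidths are textbook approximations for the vowel /a/, and the output is intentionally robotic-sounding):

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 120                                              # pitch of the voice source (Hz)
formants = [(730, 90), (1090, 110), (2440, 170)]      # approx. F1-F3 for the vowel /a/

# Voice source: an impulse train at the pitch period (glottal pulses).
excitation = np.zeros(int(0.5 * fs))
excitation[::int(fs / f0)] = 1.0

# Vocal tract model: a cascade of second-order digital resonators, one per formant.
signal = excitation
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    b, a = [1.0 - r], [1.0, -2 * r * np.cos(theta), r ** 2]
    signal = lfilter(b, a, signal)

signal = signal / np.max(np.abs(signal))              # normalize; save as WAV to listen
```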
Parametric & Deep Learning-Based Synthesis
 Uses machine learning models to generate realistic speech.
 Allows customization of voice characteristics (tone, pitch, speed).
Examples:
• WaveNet (by DeepMind): AI-driven, highly natural speech synthesis.
• Tacotron (by Google): Neural network-based TTS system.
Challenges & Emerging Technologies in Speech Synthesis

• Challenges in Speech Synthesis


 Naturalness of Speech: Maintaining human-like emotion and intonation.
 Pronunciation & Prosody Errors: Handling homophones and sentence stress.
 Multilingual Speech Synthesis: Supporting different languages and accents.
 Computational Complexity: AI-based synthesis requires high processing
power.
 Data Privacy: Risks associated with cloning human voices.
• Emerging technologies in Speech Synthesis
 AI-Powered Emotional Speech Synthesis – Mimicking human emotions.
 Real-Time Speech Translation – AI-driven multilingual TTS systems.
 Personalized Voice Cloning – Users can create their digital voice avatars.
 Ultra-Low Latency TTS – Faster, real-time speech generation.
Properties of Speech
Acoustic Properties of Speech
 Speech is a sound wave with specific characteristics:
 Pitch (Fundamental Frequency - F0): Determines the perceived
highness or lowness of speech.
 Intensity (Loudness): Measured in decibels (dB), affects clarity.
 Duration & Tempo: Speech rate and rhythm affect intelligibility.
 Formants: Resonant frequencies that shape vowel sounds.
 Harmonics: Integer multiples of the fundamental frequency that
define voice quality.
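Pitch (F0) can be estimated directly from the waveform; a crude autocorrelation-based sketch (NumPy only, with a synthetic frame in place of real speech):

```python
import numpy as np

def estimate_f0(frame, fs, fmin=75, fmax=400):
    """Crude autocorrelation-based pitch (F0) estimate for one voiced frame."""
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(corr[lo:hi])          # strongest repetition period in range
    return fs / lag

fs = 16000
t = np.arange(int(0.03 * fs)) / fs
frame = np.sin(2 * np.pi * 150 * t)            # synthetic 150 Hz "voiced" frame
print(round(estimate_f0(frame, fs), 1))        # approximately 150 Hz
```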



Properties of Speech
Linguistic Properties of Speech
 Speech consists of different linguistic components:
 Phonemes: Smallest units of sound that distinguish words.
 Morphemes: Smallest meaningful units in language.
 Syntax & Grammar: Rules governing sentence structure.
 Prosody: Rhythm, stress, and intonation patterns.
Physiological Properties of Speech
 Speech production involves the lungs (airflow), vocal cords (sound
generation), and articulators (lips, tongue, and palate) working
together.
 Voiced vs. Unvoiced Sounds: Voiced sounds (e.g., "b," "d") are produced with vibrating vocal cords, while unvoiced sounds (e.g., "s," "t") are produced without vocal-cord vibration (a simple classifier is sketched below).
 Speech Organs: Include the lungs, larynx, tongue, and lips.
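As noted above, a rough voiced/unvoiced decision can be made from short-time energy and zero-crossing rate (a NumPy sketch with synthetic frames; the thresholds are illustrative):

```python
import numpy as np

def is_voiced(frame, energy_threshold=0.01, zcr_threshold=0.15):
    """Very rough voiced/unvoiced decision: voiced frames tend to have higher energy
    and a lower zero-crossing rate than unvoiced (noise-like) frames."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    return energy > energy_threshold and zcr < zcr_threshold

fs = 16000
t = np.arange(int(0.03 * fs)) / fs
voiced_like = 0.5 * np.sin(2 * np.pi * 150 * t)           # periodic, vowel-like frame
unvoiced_like = 0.05 * np.random.randn(len(t))            # noise-like, fricative-like frame
print(is_voiced(voiced_like), is_voiced(unvoiced_like))   # expected: True False
```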
Effects of Speech
Psychological Effects of Speech
 Emotional Impact: Tone, pitch, and speed influence emotions (e.g., calm vs. aggressive
speech).
 Persuasion & Influence: Speech is used in public speaking, marketing, and leadership.
 Memory & Learning: Clear and structured speech enhances comprehension and retention.
Social & Cultural Effects of Speech
 Dialect & Accent Differences: Speech reflects social and regional identity.
 Speech Adaptation: People adjust their speech in different social contexts (e.g., formal vs.
informal).
 Language Evolution: Speech changes over time due to cultural influences.
Technological Effects of Speech
 Speech Recognition & AI: Used in virtual assistants (Siri, Alexa).
 Speech Synthesis: Converts text into human-like speech (Text-to-Speech, TTS).
 Forensic Linguistics: Speech analysis helps in legal investigations.
Activity
1. Explain the basic process of speech recognition.
2. Describe the role of "Formants" in speech and how they contribute to vowel sound production.
3. What is the difference between waveform coding and parametric coding in speech coding?
4. Discuss the challenges faced by speech synthesis systems when generating natural-sounding speech.
5. What are the physiological properties involved in speech production? Describe their role.
6. Discuss the basic principles of speech synthesis and the difference between concatenative and formant synthesis.
Provide examples of when each method would be used.
7. Explain how speech recognition systems use acoustic and language modeling to improve accuracy. Include the role
of Hidden Markov Models (HMMs) in the process.
8. Describe the effects of speech in social and psychological contexts. How can tone, pitch, and speed influence
communication? Provide examples from real-world scenarios.
9. Discuss the challenges in speech coding, particularly in noisy environments. What are some methods used to
address these challenges?
10. You are developing a virtual assistant using speech recognition. What challenges would you face in recognizing
speech from users with different accents? How would you address these challenges in your system design?
