
Speech Processing in Multimedia

BIT/DIT 2024/25
Sem II

Ndejje University – FoSC


Introduction
Speech processing is the field of study that deals with the
analysis, synthesis, and recognition of spoken language.
It plays a crucial role in multimedia applications, including
voice recognition, speech synthesis, and audio
compression.
Importance of Speech Processing in Multimedia
 Used in voice assistants (Siri, Alexa, Google Assistant).
 Enhances human-computer interaction (HCI).
 Supports automatic transcription and subtitling.
 Enables speech-based authentication and security.
 Essential for accessibility features (e.g., screen readers for
visually impaired users).
Key Components of Speech Processing
Speech Signal Processing
 Speech Representation: Digital signals stored as waveforms.
 Sampling & Quantization: Converts analog speech signals into digital form.
 Fourier Analysis: Extracts frequency components from speech.
 Feature Extraction: Identifies key speech characteristics (e.g., pitch, energy,
formants).
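The steps above can be illustrated with a short, self-contained sketch (a minimal example assuming NumPy/SciPy and a hypothetical mono recording speech.wav; the pitch estimate is deliberately crude):

```python
import numpy as np
from scipy.io import wavfile

# Read a digitized speech signal (sampling/quantization already done by the sound card).
sr, signal = wavfile.read("speech.wav")              # hypothetical mono recording
signal = signal.astype(np.float64)
signal = signal / np.max(np.abs(signal))             # normalize amplitude

# One 25 ms analysis frame with a Hamming window.
frame_len = int(0.025 * sr)
frame = signal[:frame_len] * np.hamming(frame_len)

# Fourier analysis: magnitude spectrum of the frame.
spectrum = np.abs(np.fft.rfft(frame))
freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)

# Simple features: short-time energy and a crude pitch estimate
# (strongest spectral peak in the typical voice range 50-400 Hz).
energy = np.sum(frame ** 2)
voice_band = (freqs > 50) & (freqs < 400)
pitch_hz = freqs[voice_band][np.argmax(spectrum[voice_band])]
print(f"energy={energy:.3f}  pitch~{pitch_hz:.1f} Hz")
```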
Speech Recognition (Automatic Speech Recognition – ASR)
 Converts spoken words into text.
 Uses Hidden Markov Models (HMM) and Deep Learning (Neural
Networks).
 Applications: Voice commands, dictation software, call center automation.
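For a quick feel of ASR in practice, here is a minimal sketch using the third-party SpeechRecognition package (the file name command.wav is hypothetical, and the Google Web Speech backend needs network access):

```python
import speech_recognition as sr   # third-party "SpeechRecognition" package (assumed installed)

recognizer = sr.Recognizer()
with sr.AudioFile("command.wav") as source:       # hypothetical recorded voice command
    audio = recognizer.record(source)             # capture the whole file

try:
    text = recognizer.recognize_google(audio)     # cloud-based recognizer (needs internet)
    print("Recognized text:", text)
except sr.UnknownValueError:
    print("Speech was not intelligible")
```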



Key Components of Speech Processing…
Speech Synthesis (Text-to-Speech – TTS)
 Converts text into human-like speech.
 Uses techniques like concatenative synthesis and deep learning-
based synthesis.
 Applications: Audiobooks, screen readers, virtual assistants.
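A minimal TTS sketch, assuming the third-party pyttsx3 package (a simple offline engine; commercial systems rely on the concatenative or neural methods described above):

```python
import pyttsx3   # offline text-to-speech engine (assumed installed)

engine = pyttsx3.init()
engine.setProperty("rate", 150)    # approximate speaking rate in words per minute
engine.say("Welcome to the speech processing lecture.")
engine.runAndWait()                # block until the utterance has been spoken
```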
Speaker Recognition & Speech Authentication
 Identifies individuals based on voice characteristics.
 Used in biometric security and forensic analysis.
 Types:
• Speaker Verification: Confirms identity (1:1 matching).
• Speaker Identification: Recognizes a speaker from a group (1:N matching).
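A toy sketch of the 1:1 and 1:N matching logic (NumPy only; the 128-dimensional "voice embeddings" and the 0.8 threshold are hypothetical stand-ins for a real voice feature extractor):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Hypothetical fixed-length voice embeddings for enrolled speakers.
enrolled = {"alice": rng.random(128), "bob": rng.random(128)}
probe = enrolled["alice"] + 0.05 * rng.random(128)    # a new utterance, here from alice

# Speaker verification (1:1): compare the probe against one claimed identity.
claimed = "alice"
accepted = cosine_similarity(probe, enrolled[claimed]) > 0.8   # illustrative threshold
print("Verification accepted:", accepted)

# Speaker identification (1:N): pick the best-matching enrolled speaker.
best = max(enrolled, key=lambda name: cosine_similarity(probe, enrolled[name]))
print("Identified speaker:", best)
```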
Speech Processing Techniques
• Feature Extraction Methods
 Mel-Frequency Cepstral Coefficients (MFCCs): Compact spectral features computed on the perceptually motivated mel frequency scale (sketched below).
 Linear Predictive Coding (LPC): Models each speech sample as a linear combination of past samples (a compact vocal-tract model).
 Spectrogram Analysis: Visualizes how speech energy is distributed over frequency and time.
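A minimal sketch of MFCC extraction and spectrogram analysis, assuming the third-party librosa library and a hypothetical file speech.wav:

```python
import librosa   # third-party audio analysis library (assumed installed)

# Load a hypothetical recording at 16 kHz and compute 13 MFCCs per frame.
y, sr = librosa.load("speech.wav", sr=16000)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# A log-magnitude spectrogram of the same signal, as used in spectrogram analysis.
spectrogram = librosa.amplitude_to_db(abs(librosa.stft(y)))
print(mfccs.shape, spectrogram.shape)    # (13, num_frames), (freq_bins, num_frames)
```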
• Noise Reduction & Speech Enhancement
 Spectral Subtraction: Removes background noise.
 Adaptive Filtering: Adjusts to changing noise conditions.
 Deep Learning Models: AI-driven noise cancellation (e.g., NVIDIA RTX Voice).
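A minimal spectral-subtraction sketch (NumPy/SciPy only; the leading 0.2 s of the toy signal is assumed to contain noise only, so it can serve as the noise estimate):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10):
    """Basic spectral subtraction: estimate the noise spectrum from the first few
    frames (assumed speech-free) and subtract it from every frame's magnitude."""
    f, t, Z = stft(noisy, fs=fs, nperseg=512)
    magnitude, phase = np.abs(Z), np.angle(Z)
    noise_estimate = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    cleaned = np.maximum(magnitude - noise_estimate, 0.0)     # floor at zero
    _, enhanced = istft(cleaned * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced

fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 300 * t) * (t > 0.2)          # toy "speech" starting after 0.2 s
noisy = speech + 0.3 * np.random.randn(len(speech))        # additive background noise
enhanced = spectral_subtraction(noisy, fs)
```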
• Speech Compression
 Lossy Compression: Reduces file size with slight quality loss (MP3, AAC).
 Lossless Compression: Maintains quality (FLAC, ALAC).
 Codec Examples: G.711 (VoIP), AMR (mobile speech), Opus (real-time streaming).
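For a flavour of how a speech codec trades quality for size, here is a simplified sketch of the µ-law companding idea behind G.711 (an illustration of the principle, not a bit-exact implementation of the standard):

```python
import numpy as np

MU = 255   # µ-law parameter used by G.711

def mu_law_encode(x, mu=MU):
    """Compress samples in [-1, 1] non-uniformly, then quantize to 8 bits."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((compressed + 1) / 2 * 255).astype(np.uint8)

def mu_law_decode(codes, mu=MU):
    """Expand the 8-bit codes back to approximately the original samples."""
    compressed = codes.astype(np.float64) / 255 * 2 - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

samples = np.linspace(-1, 1, 5)
print(mu_law_decode(mu_law_encode(samples)))    # close to the original values
```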
Applications of Speech Processing in
Multimedia
 Virtual Assistants: Siri, Alexa, Google Assistant.
 Automatic Transcription & Subtitling: YouTube, Zoom AI captions.
 Speech-Based Search: Google Voice Search, Shazam (music
identification).
 Interactive Voice Response (IVR) Systems: Call center automation.
 Dubbing & Voice Cloning: AI-generated speech in movies and video
games.



Challenges & Emerging Trends in Speech
Processing
Challenges:
 Background noise and speech variability.
 Accents and multilingual speech recognition.
 Privacy concerns in voice data collection.
Emerging Trends:
 AI-powered speech synthesis with emotional expression.
 Real-time speech translation systems.
 Enhanced speech compression for 5G and cloud applications.



Speech Recognition
Speech recognition, also known as Automatic Speech Recognition (ASR),
is a technology that converts spoken language into text.
It enables human-computer interaction through voice commands,
dictation, and automated transcription.



Working of Speech Recognition Systems
Basic Process
 Audio Input: The system captures speech using a microphone.
 Pre-Processing: The signal is cleaned by removing background noise.
 Feature Extraction: Key speech features (e.g., pitch, tone, frequency) are
extracted.
 Pattern Matching & Recognition: The system compares the input with stored
speech patterns using Machine Learning (ML) models.
 Text Output: The recognized speech is converted into text.
Technologies Used
 Acoustic Modeling: Relates acoustic features of the audio signal to phonemes (basic speech sounds).
 Language Modeling: Predicts word sequences to improve accuracy.
 Deep Learning (Neural Networks): Used in modern speech recognition systems.
 Hidden Markov Models (HMMs): Used for probabilistic speech recognition.
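A toy illustration of the HMM idea (NumPy only; the states, symbols, and probabilities are made up for demonstration, whereas real ASR systems use phoneme-level models with continuous acoustic features):

```python
import numpy as np

# Toy HMM: 2 hidden states (e.g. two phoneme-like units), 3 discrete observation symbols.
start = np.array([0.6, 0.4])                    # initial state probabilities
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])                  # state transition probabilities
emit = np.array([[0.5, 0.4, 0.1],
                 [0.1, 0.3, 0.6]])              # P(observation | state)

def forward_probability(obs):
    """Forward algorithm: P(observation sequence | model)."""
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()

# Score two candidate observation sequences; the more probable one "matches" the model better.
print(forward_probability([0, 1, 2]), forward_probability([2, 2, 2]))
```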
Types of Speech Recognition Systems
Speaker-Dependent vs. Speaker-Independent:
 Speaker-Dependent: Trained on a specific user’s voice.
 Speaker-Independent: Works for any user without prior training.
Continuous vs. Discrete Speech Recognition:
 Continuous Speech Recognition: Allows natural speech (e.g., virtual
assistants).
 Discrete Speech Recognition: Requires pauses between words for
processing.
Keyword Spotting vs. Large Vocabulary Recognition:
 Keyword Spotting: Detects specific words (e.g., “Hey Google”).
 Large Vocabulary Recognition: Recognizes full sentences with complex
grammar.
Speech Coding
Speech coding is the process of compressing speech signals to
reduce data size while maintaining intelligibility and quality.
It is used in various applications, including mobile communications,
VoIP, and audio streaming.
Importance of Speech Coding
 Reduces bandwidth requirements in communication systems.
 Enables efficient storage and transmission of speech data.
 Improves call quality in limited bandwidth networks (e.g., mobile
networks).
 Supports applications like VoIP, video conferencing, and AI voice assistants.
Basics of Speech Coding
Speech Signal Characteristics
 Human speech is non-stationary: Varies over time in amplitude and frequency.
 Frequency range: Typically between 300 Hz – 3400 Hz for telecommunication.
 Speech has redundancy: Compression techniques remove unnecessary data to
reduce size.
Process of Speech Coding
 Speech Acquisition: Capturing speech using a microphone.
 Digitization: Converting analog speech into digital form using sampling and
quantization.
 Compression: Applying coding techniques to reduce data size while preserving
speech quality.
 Transmission/Storage: Sending or storing the compressed speech data.
 Decoding: Reconstructing the speech signal at the receiver’s end.
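The digitization step can be sketched numerically (NumPy only; a 220 Hz tone stands in for the analog speech signal):

```python
import numpy as np

fs = 8000          # telephone-band sampling rate (covers roughly 300-3400 Hz speech)
bits = 8           # bits per sample after quantization
duration = 1.0     # seconds

# Digitization: sample a synthetic "speech-like" tone and quantize it uniformly.
t = np.arange(int(fs * duration)) / fs
analog = 0.5 * np.sin(2 * np.pi * 220 * t)                 # stand-in for an analog signal
levels = 2 ** bits
digital = np.round((analog + 1) / 2 * (levels - 1)).astype(np.uint8)

# Resulting (uncompressed) bitrate before any speech coding is applied.
print("bitrate:", fs * bits, "bit/s")    # 64000 bit/s, the classic G.711 PCM rate
```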
Types of Speech Coding Techniques
Waveform Coding
 Directly encodes the speech waveform.
 Maintains high quality but has lower compression efficiency.
Examples:
• Pulse Code Modulation (PCM): Standard in telephone networks.
• Adaptive Differential PCM (ADPCM): Reduces bitrate while maintaining quality.
Parametric (Vocoder-Based) Coding
 Models human speech production and transmits only essential parameters.
 Highly efficient but may reduce speech quality.
Examples:
• Linear Predictive Coding (LPC): Used in low-bitrate speech applications.
• Code-Excited Linear Prediction (CELP): Used in VoIP and mobile communications.
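To see why parametric coding is so efficient, here is a rough linear-prediction sketch (NumPy only; least squares is used instead of the Levinson-Durbin recursion found in real codecs): a handful of predictor coefficients plus a small residual carry almost all of the information in the frame.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Least-squares linear prediction: predict each sample from the previous `order` samples."""
    X = np.array([frame[n - order:n][::-1] for n in range(order, len(frame))])
    y = frame[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

order = 8
rng = np.random.default_rng(0)
n = np.arange(400)
# Synthetic voiced-like frame: a decaying resonance plus a little noise.
frame = np.sin(2 * np.pi * 0.05 * n) * np.exp(-0.002 * n) + 0.01 * rng.standard_normal(len(n))

a = lpc_coefficients(frame, order)
predicted = np.array([a @ frame[k - order:k][::-1] for k in range(order, len(frame))])
residual = frame[order:] - predicted
print("residual energy / signal energy:", np.sum(residual**2) / np.sum(frame[order:]**2))
```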
Types of Speech Coding Techniques…
Hybrid Coding
 Combines waveform and parametric coding for better efficiency.
Examples:
• G.729: Used in VoIP and telephony.
• AMR (Adaptive Multi-Rate): Used in mobile networks.



Speech Coding Applications &
Challenges
Applications of Speech Coding
 Mobile Communications: Used in GSM, LTE, and VoIP calls.
 VoIP (Voice over IP): Reduces bandwidth usage while maintaining clarity.
 Speech Storage: Used in voicemail and audio recording systems.
 AI Voice Assistants: Efficient coding helps process voice commands faster.
 Hearing Aids & Cochlear Implants: Improves speech intelligibility for users.
Challenges in Speech Coding
 Balancing Compression & Quality: High compression may lead to poor speech
intelligibility.
 Real-Time Processing: Coding algorithms must operate with minimal delay.
 Noise & Distortion: Background noise affects speech coding performance.
 Computational Complexity: Advanced algorithms require high processing power.
Trends in Speech Coding
 AI-Based Speech Coding: Machine learning models improving
compression efficiency.
 5G & Beyond: Advanced speech codecs for ultra-low latency
communication.
 Neural Vocoders: Deep learning-driven speech synthesis with high
naturalness.
 Ultra-Low Bitrate Coding: Further reducing bandwidth usage with minimal perceptible
quality loss.



Speech Synthesis
Speech synthesis, also known as Text-to-Speech (TTS), is the technology
that converts written text into spoken language.
It is widely used in assistive technologies, virtual assistants, and
automated customer service systems.



Types of Speech Synthesis
Concatenative Synthesis
 Uses pre-recorded speech segments.
 Produces natural-sounding speech but requires large storage.
Types:
• Unit Selection Synthesis: Uses large speech databases for smooth speech.
• Diphone Synthesis: Concatenates diphones (units spanning the transition between two adjacent phonemes) for smoother joins.
Formant Synthesis
 Generates speech using mathematical models of the vocal tract.
 Requires less storage but sounds robotic.
Example: The classic DECtalk system used in early computers.
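A minimal formant-synthesis sketch (NumPy/SciPy only; the formant frequencies and bandwidths are textbook approximations for the vowel /a/, and the output is intentionally robotic-sounding):

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 120                                              # pitch of the voice source (Hz)
formants = [(730, 90), (1090, 110), (2440, 170)]      # approx. F1-F3 for the vowel /a/

# Voice source: an impulse train at the pitch period (glottal pulses).
excitation = np.zeros(int(0.5 * fs))
excitation[::int(fs / f0)] = 1.0

# Vocal tract model: a cascade of second-order digital resonators, one per formant.
signal = excitation
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    b, a = [1.0 - r], [1.0, -2 * r * np.cos(theta), r ** 2]
    signal = lfilter(b, a, signal)

signal = signal / np.max(np.abs(signal))              # normalize; save as WAV to listen
```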
Parametric & Deep Learning-Based Synthesis
 Uses machine learning models to generate realistic speech.
 Allows customization of voice characteristics (tone, pitch, speed).
Examples:
• WaveNet (by DeepMind): AI-driven, highly natural speech synthesis.
• Tacotron (by Google): Neural network-based TTS system.
Challenges & Emerging Technologies in Speech Synthesis

• Challenges in Speech Synthesis


 Naturalness of Speech: Maintaining human-like emotion and intonation.
 Pronunciation & Prosody Errors: Handling homophones and sentence stress.
 Multilingual Speech Synthesis: Supporting different languages and accents.
 Computational Complexity: AI-based synthesis requires high processing
power.
 Data Privacy: Risks associated with cloning human voices.
• Emerging technologies in Speech Synthesis
 AI-Powered Emotional Speech Synthesis – Mimicking human emotions.
 Real-Time Speech Translation – AI-driven multilingual TTS systems.
 Personalized Voice Cloning – Users can create their digital voice avatars.
 Ultra-Low Latency TTS – Faster, real-time speech generation.
Properties of Speech
Acoustic Properties of Speech
 Speech is a sound wave with specific characteristics:
 Pitch (Fundamental Frequency - F0): Determines the perceived
highness or lowness of speech.
 Intensity (Loudness): Measured in decibels (dB), affects clarity.
 Duration & Tempo: Speech rate and rhythm affect intelligibility.
 Formants: Resonant frequencies that shape vowel sounds.
 Harmonics: Integer multiples of the fundamental frequency that
define voice quality.
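Pitch (F0) can be estimated directly from the waveform; a crude autocorrelation-based sketch (NumPy only, with a synthetic frame in place of real speech):

```python
import numpy as np

def estimate_f0(frame, fs, fmin=75, fmax=400):
    """Crude autocorrelation-based pitch (F0) estimate for one voiced frame."""
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(corr[lo:hi])          # strongest repetition period in range
    return fs / lag

fs = 16000
t = np.arange(int(0.03 * fs)) / fs
frame = np.sin(2 * np.pi * 150 * t)            # synthetic 150 Hz "voiced" frame
print(round(estimate_f0(frame, fs), 1))        # approximately 150 Hz
```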



Properties of Speech
Linguistic Properties of Speech
 Speech consists of different linguistic components:
 Phonemes: Smallest units of sound that distinguish words.
 Morphemes: Smallest meaningful units in language.
 Syntax & Grammar: Rules governing sentence structure.
 Prosody: Rhythm, stress, and intonation patterns.
Physiological Properties of Speech
 Speech production involves the lungs (airflow), vocal cords (sound
generation), and articulators (lips, tongue, and palate) working
together.
 Voiced vs. Unvoiced Sounds: Voiced sounds (e.g., "b," "d") are produced with vibrating vocal cords, while unvoiced sounds (e.g., "s," "t") are produced without vocal-cord vibration (a simple classifier is sketched below).
 Speech Organs: Include the lungs, larynx, tongue, and lips.
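As noted above, a rough voiced/unvoiced decision can be made from short-time energy and zero-crossing rate (a NumPy sketch with synthetic frames; the thresholds are illustrative):

```python
import numpy as np

def is_voiced(frame, energy_threshold=0.01, zcr_threshold=0.15):
    """Very rough voiced/unvoiced decision: voiced frames tend to have higher energy
    and a lower zero-crossing rate than unvoiced (noise-like) frames."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    return energy > energy_threshold and zcr < zcr_threshold

fs = 16000
t = np.arange(int(0.03 * fs)) / fs
voiced_like = 0.5 * np.sin(2 * np.pi * 150 * t)           # periodic, vowel-like frame
unvoiced_like = 0.05 * np.random.randn(len(t))            # noise-like, fricative-like frame
print(is_voiced(voiced_like), is_voiced(unvoiced_like))   # expected: True False
```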
Effects of Speech
Psychological Effects of Speech
 Emotional Impact: Tone, pitch, and speed influence emotions (e.g., calm vs. aggressive
speech).
 Persuasion & Influence: Speech is used in public speaking, marketing, and leadership.
 Memory & Learning: Clear and structured speech enhances comprehension and retention.
Social & Cultural Effects of Speech
 Dialect & Accent Differences: Speech reflects social and regional identity.
 Speech Adaptation: People adjust their speech in different social contexts (e.g., formal vs.
informal).
 Language Evolution: Speech changes over time due to cultural influences.
Technological Effects of Speech
 Speech Recognition & AI: Used in virtual assistants (Siri, Alexa).
 Speech Synthesis: Converts text into human-like speech (Text-to-Speech, TTS).
 Forensic Linguistics: Speech analysis helps in legal investigations.
Activity
1. Explain the basic process of speech recognition.
2. Describe the role of "Formants" in speech and how they contribute to vowel sound production.
3. What is the difference between waveform coding and parametric coding in speech coding?
4. Discuss the challenges faced by speech synthesis systems when generating natural-sounding speech.
5. What are the physiological properties involved in speech production? Describe their role.
6. Discuss the basic principles of speech synthesis and the difference between concatenative and formant synthesis.
Provide examples of when each method would be used.
7. Explain how speech recognition systems use acoustic and language modeling to improve accuracy. Include the role
of Hidden Markov Models (HMMs) in the process.
8. Describe the effects of speech in social and psychological contexts. How can tone, pitch, and speed influence
communication? Provide examples from real-world scenarios.
9. Discuss the challenges in speech coding, particularly in noisy environments. What are some methods used to
address these challenges?
10. You are developing a virtual assistant using speech recognition. What challenges would you face in recognizing
speech from users with different accents? How would you address these challenges in your system design?
