Natural Language Processing: Task4

The document discusses Speech Recognition and Synthesis as key components of AI systems that accept vocal commands and provide spoken responses. It outlines the processes involved in speech recognition, including the use of acoustic and language models, as well as the functions of speech synthesis. Additionally, it describes the APIs available for these services, specifically the Speech-to-Text and Text-to-Speech APIs, and provides examples of their applications.

Uploaded by

wongho.alex0310
Copyright © All Rights Reserved

Natural Language Processing

Task4
Speech Recognition and Synthesis
• Some AI solutions need to accept vocal commands and provide spoken responses
• Example:
➢ Asking Siri, “Will it rain today?”

• These AI systems must support two capabilities:
➢ Speech recognition - the ability to detect and interpret spoken input
➢ Speech synthesis - the ability to generate spoken output
Speech Recognition
• Taking the spoken word and converting it into data
• Speech patterns are analyzed with two types of model:
1. Acoustic model
➢ Converts the audio signal into phonemes (representations of specific sounds)
2. Language model
➢ Maps phonemes to words
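The two-model pipeline above can be sketched with a toy example. This is only an illustration under loud assumptions: the frame labels (`f1`, `f2`, ...), the lookup tables, and the function names are all hypothetical, and real acoustic and language models are statistical or neural models, not dictionaries.

```python
# Toy illustration only: real acoustic and language models are trained
# statistical or neural models, not lookup tables.

# Hypothetical "acoustic model": maps a made-up audio frame label to a phoneme.
ACOUSTIC_MODEL = {
    "f1": "HH", "f2": "AH", "f3": "L", "f4": "OW",
    "f5": "W", "f6": "ER", "f7": "D",
}

# Hypothetical "language model": maps phoneme sequences to words.
LANGUAGE_MODEL = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}


def audio_to_phonemes(frames):
    """Acoustic step: convert each audio frame label into a phoneme."""
    return [ACOUSTIC_MODEL[frame] for frame in frames]


def phonemes_to_words(phonemes):
    """Language step: greedily match the longest known phoneme sequence."""
    words = []
    i = 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):
            candidate = tuple(phonemes[i:i + length])
            if candidate in LANGUAGE_MODEL:
                words.append(LANGUAGE_MODEL[candidate])
                i += length
                break
        else:
            raise ValueError(f"no word matches phonemes starting at index {i}")
    return words


def recognize(frames):
    """Run both steps: audio frames -> phonemes -> words -> text."""
    return " ".join(phonemes_to_words(audio_to_phonemes(frames)))
```

Calling `recognize` with the frame labels `f1` to `f4` followed by `f5`, `f6`, `f3`, `f7` runs the acoustic step first and then the language step, which is the order the two models are applied in the description above.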
Speech Recognition examples
• Providing closed captions for recorded or live videos
• Creating a transcript of a phone call or meeting
• Automated note dictation
• Determining intended user input for further processing
Speech Synthesis
• Converting text to speech
• A speech synthesis solution requires:
1. The text to be spoken
2. The voice to be used to vocalize the speech
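One common way to bundle these two inputs is Speech Synthesis Markup Language (SSML), a W3C standard. The slides do not mention SSML, so treat the following as an assumption-laden sketch: `build_ssml` is a hypothetical helper, and the voice name used in the usage note is only an example value.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice_name, language="en-US"):
    """Combine the text to speak and the chosen voice into one SSML string."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(language)}>"
        # quoteattr adds the surrounding quotes; escape protects &, <, >.
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )
```

For instance, `build_ssml("Please stand back from the train door", "en-US-JennyNeural")` would produce a single `<speak>` element whose `<voice>` child carries both required inputs: the voice in the `name` attribute and the text as the element content.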
Speech Synthesis examples
• Generating spoken responses to user input
• Creating voice menus for telephone systems
• Reading email or text messages aloud in hands-free scenarios
• Broadcasting announcements in public locations, such as railway stations or airports
➢ Example: “Please stand back from the train door”
Services for Speech Recognition and Synthesis
• The Azure AI Speech service includes two APIs for speech recognition and synthesis:
1. Speech-to-text API
2. Text-to-speech API
Speech-to-text API
• Performs real-time or batch transcription of audio into a text format
• Its pre-built Universal Language Model is optimized for two scenarios: conversational and dictation
• Lets you create custom models, including acoustic, language, and pronunciation models, if the pre-built models do not provide what you need
Speech-to-text API
• Real-time transcription
➢ Runs in real time
➢ Transcribes speech in audio streams
• Batch transcription
➢ Runs asynchronously (you need to wait for the results)
➢ Transcribes multiple audio files
➢ Scheduled on a best-effort basis
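The difference between the two modes can be sketched in a few lines. This is a simulation, not the Speech service API: `fake_transcribe`, `realtime_transcription`, `BatchTranscriptionJob`, and the status strings are all hypothetical names chosen for illustration.

```python
def fake_transcribe(chunk):
    """Hypothetical stand-in for a recognizer: it just uppercases a label."""
    return chunk.upper()


def realtime_transcription(audio_stream):
    """Real-time: yield text for each chunk as soon as it arrives."""
    for chunk in audio_stream:
        yield fake_transcribe(chunk)


class BatchTranscriptionJob:
    """Batch: accept many files up front and produce results later."""

    def __init__(self, audio_files):
        self.audio_files = list(audio_files)
        self.status = "NotStarted"  # hypothetical status value
        self.results = {}

    def run(self):
        """Called whenever the simulated service gets around to the job."""
        self.status = "Running"
        for name in self.audio_files:
            self.results[name] = fake_transcribe(name)
        self.status = "Succeeded"
```

With the generator, a caller sees each piece of text as soon as its chunk is processed. With the job object, the caller submits all files, gets nothing back immediately, and has to check `status` until the results are ready.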
Text-to-speech API
• Supports multiple languages and regional pronunciations
• Includes standard voices and neural voices, which sound more natural
• Lets you develop custom voices with the text-to-speech API
Question 1
For which two scenarios is the Universal Language Model used by the speech-to-text API optimized?

• Acoustic
• Conversational
• Dictation
• Language
• Pronunciation
Question 2
What is the role of an acoustic model in speech recognition?

• It converts the audio signal into phonemes
• It maps phonemes to words
• It synthesizes speech
• It vocalizes data
