Bhavika Voice XML
A Seminar Report
Seminar Report submitted in partial fulfillment of the requirements for the award of
the degree of B.Tech. in Computer Science & Engineering under
Bikaner Technical University
by
Bhavika
University Roll No.: 20EMCCS026
This is to certify that the Seminar entitled VoiceXML, presented by Bhavika, bearing University
Roll No. 20EMCCS026, of Computer Science & Engineering at MITRC, has been completed
successfully.
This is in partial fulfillment of the requirements of the Bachelor's Degree in Computer Science &
Engineering under Bikaner Technical University, Bikaner, Rajasthan.
I express my sincere gratitude to Dr. J. R. ArunKumar, HOD of Computer Science & Engineering, and
to the faculty and staff for their support and guidance.
Bhavika
Department of CSE
University Roll no. - 20EMCCS026
ABSTRACT
VoiceXML is the standard scripting language for rendering web pages over the telephone.
The cost of developing an interactive phone application has changed dramatically with this new
markup language. VoiceXML builds on the basic concepts and rules set by XML. Interactive
applications contain synthesized speech, pre-recorded audio, grammars defining the words that can be
recognised, and DTMF key input. By saying something or pressing keys on the phone keypad, the user
transitions between different pages. VoiceXML can be used in many different ways. You can
integrate it with your web page, letting people access it through a phone. It is simple to create
services such as booking a ticket or looking up when the bus leaves. It can be used to create
voicemail from your phone or to have your regular e-mail read to you. Some vendors of voice
gateways even offer SMS add-ons. Notably, VoiceXML works both with the traditional
PSTN system and with the newer technology of Voice over IP (VoIP).
This report describes how VoiceXML is defined and how it works. It covers the concepts of
VoiceXML, its architectural model, its implementation, and some of its
applications in our daily life.
1. Introduction to VoiceXML.........................................................................................................1
1.1 Basic Overview.......................................................................................................................1
1.2 History....................................................................................................................................2
1.3 Goals of VoiceXML................................................................................................................3
2. Creating a basic VoiceXML document......................................................................................4
2.1 VoiceXML elements................................................................................................................4
3. Architectural model of VoiceXML............................................................................................6
3.1 Principles of design.................................................................................................................7
4. Concepts of VoiceXML...............................................................................................................8
4.1 Concepts in VoiceXML...........................................................................................................9
4.2 Supported audio formats........................................................................................................10
4.3 Applications of VoiceXML.....................................................................................................13
5. Voice User Interface using VoiceXML....................................................................................15
5.1 Interaction..............................................................................................................................15
5.2 Dialogue Initiative.................................................................................................................16
CONCLUSIONS............................................................................................................................17
REFERENCES ..............................................................................................................................18
CHAPTER - 1
INTRODUCTION TO VOICEXML
1.1 BASIC OVERVIEW
VoiceXML is developed as a standard markup language for delivering and processing voice
dialogs. VoiceXML applications include automated driving assistance, voice access to email, voice
directory access, and other services. VoiceXML pages are transported online via the HTTP protocol.
VoiceXML is a language for creating voice-user interfaces, particularly for the telephone. It uses
speech recognition and touchtone (DTMF keypad) for input, and pre-recorded audio and text-to-
speech synthesis (TTS) for output. It is based on the World Wide Web Consortium's (W3C's)
Extensible Markup Language (XML), and leverages the web paradigm for application development
and deployment. By having a common language, application developers, platform vendors, and tool
providers all can benefit from code portability and reuse.
With VoiceXML, speech recognition application development is greatly simplified by using familiar
web infrastructure, including tools and Web servers. Instead of using a PC with a Web browser, any
telephone can access VoiceXML applications via a VoiceXML "interpreter" (also known as a
"browser") running on a telephony server. Whereas HTML is commonly used for creating graphical
Web applications, VoiceXML can be used for voice-enabled Web applications.
One popular type of application is the voice portal, a telephone service where callers dial a phone
number to retrieve information such as stock quotes, sports scores, and weather reports. Voice
portals have received considerable attention lately, and demonstrate the power of speech
recognition-based telephone services. These, however, are certainly not the only applications
for VoiceXML. Other application areas, including voice-enabled intranets and contact centers,
notification services, and innovative telephony services, can all be built with VoiceXML.
By separating application logic (running on a standard Web server) from the voice dialogs (running
on a telephony server), VoiceXML and the voice-enabled Web allow for a new business model for
telephony applications known as the Voice Service Provider. This permits developers to build phone
services without having to buy or run equipment.
While originally designed for building telephone services, other applications of VoiceXML, such as
speech-controlled home appliances, are starting to be developed.
1.2 HISTORY
VoiceXML has its roots in a research project called Phone Web at AT&T Bell Laboratories. After
the AT&T/Lucent split, both companies pursued development of independent versions of a phone
markup language.
Lucent's Bell Labs continued work on the project, now known as TelePortal. The recent research
focus has been on service creation and natural language applications.
AT&T Labs has built a mature phone markup language and platform that have been used to
construct many different types of applications, ranging from call center-style services to consumer
telephone services that use a visual Web site for customers to configure and administer their
telephone features. AT&T's intent has been twofold. First, it wanted to forge a new way for its
business clients to construct call center applications with AT&T-provided network call handling.
Second, AT&T wanted a new way to build and quickly deploy advanced consumer telephone
services, and in particular define new ways in which third parties could participate in the creation of
new consumer services.
Motorola embraced the markup approach as a way to provide mobile users with up-to-the-minute
information and interactions. Given the corporate focus on mobile productivity, Motorola's efforts
focused on hands-free access. This led to an emphasis on speech recognition rather than touch-tones
as an input mechanism. Also, by starting later, Motorola was able to base its language on the
recently-developed XML framework. These efforts led to the October 1998 announcement of the
VoxML™ technology. Since the announcement, thousands of developers have downloaded the
VoxML language specification and software development kit.
There has been growing interest in this general concept of using a markup language to define voice
access to Web-based applications. For several years Netphonic has had a product known as Web-on-
Call that used an extended HTML and software server to provide telephone access to Web services;
in 1998, General Magic acquired Netphonic to support Web access for phone customers. In October
1998, the World Wide Web Consortium (W3C) sponsored a workshop on Voice Browsers. A
number of leading companies, including AT&T, IBM, Lucent, Microsoft, Motorola, and Sun,
participated.
Most recently, IBM has announced SpeechML, which provides a markup language for speech
interfaces to Web pages; the current version provides a speech interface for desktop PC browsers.
1.3 GOALS OF VOICEXML
VoiceXML’s main goal is to bring the full power of web development and content delivery to voice
response applications, and to free the authors of such applications from low-level programming and
resource management. It enables integration of voice services with data services using the familiar
client-server paradigm. A voice service is viewed as a sequence of interaction dialogs between a
user and an implementation platform. The dialogs are provided by document servers, which may be
external to the implementation platform. Document servers maintain overall service logic, perform
database and legacy system operations, and produce dialogs. A VoiceXML document specifies each
interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog
interpretation and is collected into requests submitted to a document server. The document server
may reply with another VoiceXML document to continue the user’s session with other dialogs.
CHAPTER - 2
CREATING A BASIC VOICEXML DOCUMENT
VoiceXML is an extensible markup language (XML) for the creation of automated speech
recognition (ASR) and interactive voice response (IVR) applications. Based on the XML
tag/attribute format, the VoiceXML syntax involves enclosing instructions (items) within a tag
structure in the following manner:
<element_name>
......contained items......
</element_name>
A VoiceXML application consists of one or more text files called documents. These document files
are denoted by a ".vxml" file extension and contain the various VoiceXML instructions for the
application. It is recommended that the first instruction in any document to be seen by the interpreter
be the XML version tag:
<?xml version="1.0"?>
The remainder of the document's instructions should be enclosed by the vxml tag, with the version
attribute set equal to the version of VoiceXML being used ("1.0" in the present case), as follows:
<vxml version="1.0">
......document instructions......
</vxml>
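Putting these pieces together, a minimal but complete document might look like the following sketch; the greeting text is purely illustrative:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- A single form with one block of synthesized output -->
  <form>
    <block>
      <prompt>Hello from VoiceXML.</prompt>
    </block>
  </form>
</vxml>
```

When the interpreter loads this document it enters the first (and only) form, speaks the prompt, and, since no successor dialog is specified, the session ends.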
2.1 VOICEXML ELEMENTS
Element Purpose
<div> JSML element to classify a region of text as a particular type.
<dtmf> Specify a touch-tone key grammar.
<else> Used in <if> elements.
<elseif> Used in <if> elements.
<emp> JSML element to change the emphasis of speech output.
<enumerate> Shorthand for enumerating the choices in a menu.
<error> Catch an error event.
<exit> Exit a session.
<field> Declares an input field in a form.
<filled> An action executed when fields are filled.
<form> A dialog for presenting information and collecting data.
<goto> Go to another dialog in the same or different document.
<grammar> Specify a speech recognition grammar.
<help> Catch a help event.
<if> Simple conditional logic.
<initial> Declares initial logic upon entry into a (mixed-initiative) form.
<link> Specify a transition common to all dialogs in the link’s scope.
<menu> A dialog for choosing amongst alternative destinations.
<meta> Define a meta data item as a name/value pair.
<noinput> Catch a noinput event.
<object> Interact with a custom extension.
<option> Specify an option in a <field>.
<param> Parameter in <object> or <subdialog>.
<prompt> Queue TTS and audio output to the user.
<property> Control implementation platform settings.
<pros> JSML element to change the prosody of speech output.
<record> Record an audio sample.
<reprompt> Play a field prompt when a field is re-visited after an event.
<return> Return from a subdialog.
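Several of the elements listed above are typically used together. A sketch of <if>, <else>, <prompt>, and <filled> inside a form (the field name and wording are illustrative):

```xml
<form>
  <field name="age" type="number">
    <prompt>Please say your age.</prompt>
    <filled>
      <!-- Simple conditional logic on the collected field value -->
      <if cond="age &gt;= 18">
        <prompt>You are eligible to register.</prompt>
      <else/>
        <prompt>Sorry, you must be eighteen or older.</prompt>
      </if>
    </filled>
  </field>
</form>
```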
CHAPTER - 3
ARCHITECTURAL MODEL
The architectural model assumed by this document has the following components:
A document server (e.g. a web server) processes requests from a client application, the VoiceXML
Interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents
in reply, which are processed by the VoiceXML Interpreter. The VoiceXML interpreter context may
monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML
interpreter context may always listen for a special escape phrase that takes the user to a high-level
personal assistant, and another may listen for escape phrases that alter user preferences like volume
or text-to-speech characteristics.
The implementation platform is controlled by the VoiceXML interpreter context and by the
VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML
interpreter context may be responsible for detecting an incoming call, acquiring the initial
VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog
after answer. The implementation platform generates events in response to user actions (e.g. spoken
or character input received, disconnect) and system events (e.g. timer expiration). Some of these
events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document,
while others are acted upon by the VoiceXML interpreter context.
VoiceXML is an XML application. For details about XML, refer to the Annotated XML Reference
Manual.
The language does not require document authors to explicitly allocate and deallocate dialog
resources, or deal with concurrency. Resource allocation and concurrent threads of control are to be
handled by the implementation platform.
CHAPTER - 4
CONCEPTS OF VOICEXML
A VoiceXML document (or a set of documents called an application) forms a conversational finite
state machine. The user is always in one conversational state, or dialog, at a time. Each dialog
determines the next dialog to transition to. Transitions are specified using URIs, which define the
next document and dialog to use. If a URI does not refer to a document, the current document is
assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is
terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the
conversation.
1. DIALOGS AND SUBDIALOGS : There are two kinds of dialogs: forms and menus. Forms define an
interaction that collects values for a set of field item variables. Each field may specify a grammar
that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to
fill several fields from one utterance. A menu presents the user with a choice of options and then
transitions to another dialog based on that choice.
A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction,
and returning to the original form. Local data, grammars, and state information are saved and are
available upon returning to the calling document. Subdialogs can be used, for example, to create a
confirmation sequence that may require a database query; to create a set of components that may be
shared among documents in a single application; or to create a reusable library of dialogs shared
among many applications.
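As a sketch of the function-call analogy (document, dialog, and variable names are illustrative), a calling form invokes a confirmation subdialog, which hands its result back to the caller with <return>:

```xml
<!-- Calling document -->
<form>
  <subdialog name="result" src="confirm.vxml#confirm">
    <filled>
      <if cond="result.answer">
        <prompt>Thank you, your order is confirmed.</prompt>
      </if>
    </filled>
  </subdialog>
</form>

<!-- confirm.vxml: the called dialog returns its field to the caller -->
<form id="confirm">
  <field name="answer" type="boolean">
    <prompt>Do you confirm your order? Please say yes or no.</prompt>
    <filled>
      <return namelist="answer"/>
    </filled>
  </field>
</form>
```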
2. SESSIONS : A session begins when the user starts to interact with a VoiceXML interpreter
context, continues as documents are loaded and processed, and ends when requested by the user, a
document, or the interpreter context.
3. APPLICATIONS : An application is a set of documents sharing the same application root
document. Whenever the user interacts with a document in an application, its application root
document is also loaded. The application root document’s variables are available to the other
documents as application variables, and its grammars can also be set to remain active for the
duration of the application.
4. GRAMMARS : Each dialog has one or more speech and/or DTMF grammars associated with it.
In machine directed applications, each dialog’s grammars are active only when the user is in that
dialog. In mixed initiative applications, where the user and the machine alternate in determining
what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for)
even when the user is in another dialog in the same document, or on another loaded document in the
same application. In this situation, if the user says something matching another dialog’s active
grammars, execution transitions to that other dialog, with the user’s utterance treated as if it were
said in that dialog. Mixed initiative adds flexibility and power to voice applications.
5. EVENTS : VoiceXML provides a form-filling mechanism for handling "normal" user input. In
addition, VoiceXML defines a mechanism for handling events not covered by the form
mechanism. Events are thrown by the platform under a variety of circumstances, such as when the
user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws
events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or
their syntactic shorthand. Each element in which an event can occur may specify catch elements.
Catch elements are also inherited from enclosing elements "as if by copy". In this way, common
event handling behavior can be specified at any level, and it applies to all lower levels.
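The inheritance "as if by copy" can be sketched as follows (the grammar file name is illustrative): the handlers declared at the document level apply inside the form as well, unless overridden there:

```xml
<vxml version="1.0">
  <!-- Document-level handlers, inherited by all enclosed dialogs -->
  <noinput>Sorry, I did not hear anything.</noinput>
  <nomatch>Sorry, I did not understand that.</nomatch>
  <catch event="help">Say the name of a department, or press zero for an operator.</catch>
  <form>
    <field name="department">
      <prompt>Which department do you want?</prompt>
      <grammar src="departments.gram" type="application/x-jsgf"/>
    </field>
  </form>
</vxml>
```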
6. LINKS : A link supports mixed initiative. It specifies a grammar that is active whenever the user
is in the scope of the link. If user input matches the link’s grammar, control transfers to the link’s
destination URI. A <link> can be used to throw an event to go to a destination URI.
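A <link> sketch (the URI and phrases are illustrative): while the user is anywhere within the link's scope, saying a matching phrase transfers control to the destination:

```xml
<link next="http://www.example.com/main_menu.vxml">
  <!-- Inline JSGF grammar: either phrase activates the link -->
  <grammar type="application/x-jsgf">
    main menu | operator
  </grammar>
</link>
```

A link may equally carry an event attribute instead of next, so that a match throws an event rather than transferring to a URI.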
4.2 SUPPORTED AUDIO FILE FORMATS
VoiceXML recommends that a platform support playing and recording of the audio formats specified
below. Note: a platform need not support both A-law and μ-law simultaneously.
Audio Format                                               MIME Type
WAV (RIFF header) 8kHz 8-bit mu-law [PCM] single channel   audio/wav
WAV (RIFF header) 8kHz 8-bit A-law [PCM] single channel    audio/wav
EXAMPLES
A form in a VoiceXML document presents information and gathers input from the user. A form is
represented by the <form> tag and has an id attribute associated with it. The id attribute is the
name of the form. Following is an example of the use of a form element:
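A minimal sketch of such a form (the id value and the prompt text are illustrative):

```xml
<form id="welcome">
  <block>
    <prompt>Welcome to the sports news service.</prompt>
  </block>
</form>
```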
Form items
Two types of form items exist: field items and control items. A field item prompts the user on what
to say or key in and then collects the information from the user that is then filled into the field item
variable. A field item also has grammars that define the allowed inputs, event handlers to process
the resulting events, and a <filled> element that defines an action to be taken after the field item
variable has been filled. Following is a list of types of field items:
<field>: value of the field item is obtained from the user via speech or DTMF grammars
<record>: value of the field item is an audio clip recorded by the user, such as a voice mail message,
which can be collected by the <record> element
<subdialog>: like a function call, invokes a call to another dialog on the current page or another
VoiceXML document.
A control item's task is to help control the gathering of the form's fields. Following is a type of
control item:
<initial>: useful in mixed initiative dialogs that prompt the user for information
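A <record> field item might be sketched like this (the attribute values and the submit URI are illustrative):

```xml
<form id="voicemail">
  <record name="message" beep="true" maxtime="60s" dtmfterm="true">
    <prompt>Please leave your message after the beep.</prompt>
  </record>
  <block>
    <!-- Send the recorded audio to the server for storage -->
    <submit next="http://www.example.com/save_message" namelist="message"/>
  </block>
</form>
```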
Form item variables and conditions
A form item variable is associated with each form item. By default, the form item variable is
initially set to 'undefined' and contains a result (collected from the user) once the form item has
been interpreted. You can define the name of a form item variable by using the name attribute. A
guard condition exists for each form item. The guard condition tests whether the item's variable
currently has a value. If a value exists, then the form item is skipped.
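The name attribute and the guard conditions can be seen in this sketch (field names and the grammar file are illustrative): on each pass through the form, fields whose variables already hold a value are skipped, so the interpreter only prompts for what is still missing:

```xml
<form id="trip">
  <field name="city">
    <prompt>Which city?</prompt>
    <grammar src="cities.gram" type="application/x-jsgf"/>
  </field>
  <field name="date" type="date">
    <prompt>On which date?</prompt>
  </field>
  <!-- Runs only once both field item variables are filled -->
  <block>
    <submit next="http://www.example.com/book" namelist="city date"/>
  </block>
</form>
```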
Menus
A menu gives the user a list of choices to select from and transitions to a different dialog or
document based on the user's choice. Following is an example of a menu:
<menu>
<prompt>Say what sports news you are interested in:
<enumerate/></prompt>
<choice next="http://www.news.com/hockey.vxml">
Hockey
</choice>
<choice next="http://www.news.com/baseball.vxml">
Baseball
</choice>
<choice next="http://www.news.com/football.vxml">
Football
</choice>
<noinput>Please say what sports news you are interested in
<enumerate/>
</noinput>
</menu>
Our second example asks the user for a choice of drink and then submits it to a server script:
<?xml version="1.0"?>
<vxml version="1.0">
<form>
<field name="drink">
<prompt>Would you like coffee, tea, milk, or nothing?</prompt>
<grammar src="drink.gram" type="application/x-jsgf"/>
</field>
<block> <submit next="http://www.drink.example/drink2.asp"/> </block>
</form>
</vxml>
A field is an input field. The user must provide a value for the field before proceeding to the next
element in the form. A sample interaction is:
C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said.
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
Creating the welcome message
In this section, you will create the main greeting message of the application. In the code below,
the user hears a “welcome” message and is then given a list of choices from the main menu. The
<goto> element is used to skip to the menu section.
<?xml version="1.0"?>
<vxml version = "2.0" xmlns="http://www.w3.org/2001/vxml">
<!-- user hears welcome the first time -->
<form id="intro">
<block>
<audio>Welcome</audio>
<goto next="#make_choice"/>
</block>
</form>
</vxml>
Creating the weather document
In this section you will create the document that gives the user weather information. Once the user
says or selects weather from the menu, the following code is executed:
<?xml version="1.0"?>
<vxml version = "2.0" xmlns="http://www.w3.org/2001/vxml">
<form id="weather_report">
<block>
<audio>
It will be partly cloudy today.</audio>
<goto next="http://www.hostname.com/main.vxml"/>
</block>
</form>
</vxml>
4.3 APPLICATIONS OF VOICEXML
2. Localized information: applications can determine the location you are dialing from. Such
applications use the telephone number you are dialing from.
3. Voice alerts (such as for advertising): VoiceXML can be used to send targeted alerts to a user.
The user would sign up to receive special alerts informing him of upcoming events.
4. Commerce: VoiceXML can be used to implement applications that allow users to order
products; products that don't need a lot of description (such as tickets, CDs, office supplies, etc.)
work well.
5. Healthcare: VoiceXML can be employed to create applications that remind patients to take their
medication, as well as automated systems for scheduling and confirming healthcare
appointments.
CHAPTER - 5
VOICE USER INTERFACE USING VOICEXML
Construction of voice interface applications is a challenge, and the reason for this is that language
is deeply related to human behaviour (Schinelle, 2005). As a consequence, the expectations placed
on the interface become very high. This kind of interface tries to give the user the sensation that
he could speak as if he were talking with a human; however, this has not been perfectly achieved.
The main objective of a voice user interface project is to support the user's navigation through the
options, commands and information available in a system to carry out a specific task. Unfortunately,
accessing information through navigation is more complex in the audio domain. For this reason, some
factors must be considered in the voice interface design: the application requirements, the
potentialities and limitations of the technology, and the characteristics of the user population
(Kamm, 1995). Once those factors are understood, the voice interface designer can anticipate some
difficulties and incompatibilities that will affect the success of the application, minimizing their
impact.
5.1 INTERACTION
Voice interfaces provide information systems with an interesting alternative for data input and
output, whether as a voice-only interface (phone) or as a component of a multimodal and/or
multimedia system.
A voice-only interface in an information system can become desirable for two reasons. First, the
application may require hands-free interaction. Second, the telephone network is a truly robust and
universal technology, so it makes sense to extend information services from the computer to the
phone (Dey, 1997). Multimodal interfaces support human-machine interaction through sequential or
parallel input/output channels. Speech recognition, keyboard, mouse, mimics, and gestures can be
used as input modalities, with a synthesized voice reply, graphics, or text messages as output.
These ways of interaction can be combined dynamically to provide greater mobility to the user
(Englert, 2006).
5.2 DIALOGUE INITIATIVE
One of the fundamental aspects of the development of applications with a voice interface is the way
the dialogue initiative is taken. The dialogue management strategy can be system, user, or mixed
initiative (SPI Group, 2006). In a system-initiative dialogue, the computer questions the user, and
when the necessary information has been received, the solution is processed and the answer is given.
Dialogues with user initiative assume that the user knows what to do and how to interact with the
system. Generally, the system waits for the user's input and answers it through operations.
Applications with mixed initiative assume that the initiative in the dialogue can be taken by either
the system or the user.
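A mixed-initiative strategy is typically expressed in VoiceXML with a form-level grammar and an <initial> element (the grammar file and field names are illustrative): the opening prompt lets the user supply either or both values in one utterance, and the system falls back to directed prompts for whatever remains unfilled:

```xml
<form id="travel">
  <!-- Form-level grammar can fill several fields from one utterance -->
  <grammar src="travel.gram" type="application/x-jsgf"/>
  <initial name="start">
    <prompt>Where would you like to travel from and to?</prompt>
  </initial>
  <field name="origin">
    <prompt>Which city are you leaving from?</prompt>
  </field>
  <field name="destination">
    <prompt>Which city are you traveling to?</prompt>
  </field>
</form>
```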
CONCLUSIONS
VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio,
recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-
initiative conversations. Its major goal is to bring the advantages of web-based development and
content delivery to interactive voice response applications. With this we can conclude that speech
recognition application development in VoiceXML is greatly simplified by using familiar web
infrastructure, including tools and Web servers. Instead of using a PC with a Web browser, any
telephone can access VoiceXML applications via a VoiceXML "interpreter" (also known as a
"browser") running on a telephony server.
REFERENCES
[1] Dave Raggett’s Introduction to VoiceXML
http://www.w3.org/Voice/Guide/