Guillaume Belrose, who helped to devise TalkML and develop the software
TalkML is an experimental XML language for voice browsers, and is being developed by HP Labs for use in the following markets:
Call centers (IVR++) -- sales and support services accessed via 800 numbers, adding speech recognition to today's DTMF (touch tone) systems
Smart phones with displays
Access to email, appointments, news and travel services etc. while you are on the road (in-car systems)
Mobile devices too small for decent displays or keyboards, e.g. WCDMA palmtop organizers/pagers cheap enough to be a must-have item (like cell phones)
TalkML supports more natural conversations than dialog systems based on keywords, while remaining simple to author. Other work is underway to investigate how to author "dual-access" applications, where the same application can be accessed by both conventional visual browsers and voice browsers, perhaps via transforming HTML into TalkML.
The outermost element is talkml:
<talkml [first="start"]> ... dialog definitions ... </talkml>
An application is defined as a set of dialog blocks.
A typical dialog block involves:
Saying a prompt
Listening for the user's response
Acting on the response
On timeouts, offering help and listening again
Barge-ins allow users to break in when the browser is speaking, e.g. to change to a new task or to get help
Each application defines variables representing information about its state. These are set by the grammar rules according to the user's responses and determine the way the dialog proceeds.
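As a rough sketch of how these pieces fit together (the elements used here are described in detail below; the dialog and variable names, and the quoting of string values in conditions, are illustrative assumptions):
<talkml first="main-menu">
  <dialog name="main-menu">
    <say> Welcome. Would you like flights or hotels? </say>
    <listen response="service"> "flights"|"hotels" </listen>
    <act on="service='flights'" next="flights"/>
    <act on="service='hotels'" next="hotels"/>
    <error next="retry"> Sorry, please say flights or hotels. </error>
  </dialog>
  ... further dialog definitions ...
</talkml>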
The dialog element defines a dialog block. It consists of one or more steps which are executed in sequence, looping back to the beginning after the last step is executed.
<dialog name=name> one or more steps </dialog>
Each step is one of the following:
<say [on=condition]> text used as a prompt </say>
Within the "say" element you can use other elements, e.g.
<say> <audio src="dooropen.wav"/> <speaker name="Eddie"> Welcome to the TalkML Travel Center </speaker> </say>
We propose to support the SABLE tag set for richer control of text to speech, as well as style sheets and ACSS. You can use the var element to insert the value of an application variable into the text, e.g.
<say> Which day do you want to travel to <var name="destination"/>? </say>
The user's expected response is represented by:
<listen [grammar=grammar-name] [timeout=seconds] [response=var-name]> ... grammar rules ... </listen>
You can include the grammar rules in place or refer to an external definition given by a grammar element, e.g.
<listen grammar="main-menu"/>
The response attribute names the variable in which to place the response, and is convenient when the grammar consists of a simple list of choices, e.g.
<listen response="airport"> "Bristol"|"Heathrow"|"Gatwick"|"Stansted" </listen>
You can define grammars separately using the grammar element and refer to them as non-terminals in other grammar rules using the rule element, e.g.
<grammar name="flight" timeout="4">
  intro ((to|dest) from?)|(from (to|dest)?) polite?;
  intro = ("I want"|"I would like") ("to fly"|"a flight");
  polite = "please"|"thanks"|"thank you";
  to = ("going"|"flying")? "to" <rule name="airport"/>
       (leave? departure)? {to.airport=airport};
  dest = "arriving" (arrival atdest?)|(atdest arrival?) ("and" from)?;
  arrival = "on" <rule name="date"/> {to.time=date};
  atdest = "at" <rule name="airport"/> {to.airport=airport};
  from = leave? (origin departure?)|(departure origin) ("and"? (dest|to))?;
  origin = "from" <rule name="airport"/> {from.airport=airport};
  departure = "on" <rule name="date"/> {from.time=date};
  leave = "departing" | "leaving";
</grammar>
Which permits the user to say things like: "I want to fly to Heathrow from Bristol, please" or "I would like a flight going to Gatwick, leaving on Friday".
The grammar uses | for alternatives, ( and ) for grouping, the ? suffix for options, * for zero or more, and + for one or more. Variable assignments for matching rules are given within { and }.
The rule tag could easily be extended to support a macro-like capability.
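For instance, the repetition suffixes might be combined with rule references as in the following sketch (purely illustrative; the airport rule is assumed to be defined elsewhere, as in the flight grammar above):
<grammar name="airport-list">
  <rule name="airport"/> ("and" <rule name="airport"/>)* polite?;
  polite = "please"|"thanks"|"thank you";
</grammar>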
The act element is used to determine what to do next based upon the values of variables set by the user's response or by earlier dialog steps. The "set" attribute is used to set one or more variables to given values. The "next" attribute specifies the dialog to do next.
<act [on=condition] [set="var=value[;var=value]*"] [next=dialog-name]/>
Conditions are boolean expressions over variables, using ( and ) for grouping, together with logical operators such as and, or, not, =, and !=. Additional operators are available for numbers and strings, and for finding out whether a variable is currently undefined.
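For example, an application might branch like this (a sketch; the variable names and the quoting of string values inside conditions are assumptions, since the exact expression syntax isn't spelled out above):
<act on="airport='Heathrow' and fare!='economy'" set="lounge=yes" next="offer-lounge"/>
<act on="airport!='Heathrow'" next="confirm"/>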
An error element is triggered when a timeout has occurred or when the user's response hasn't been understood. Each time this happens an error counter is incremented, allowing you to tailor the prompt accordingly. Note that the dialog name "retry" is reserved to mean retry the current dialog block without clearing the error counter.
<error [on=condition] [set="var=value[;var=value]*"] [next=dialog-name]> ... error prompt ... </error>
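As a sketch of how the prompts might escalate, assuming the error counter is visible to conditions as a variable (called errors here purely as an assumption, as is the operator dialog name):
<error on="errors=1" next="retry"> Sorry, I didn't catch that. Which airport do you want? </error>
<error on="errors=2" next="operator"> I'm still having trouble understanding. Let me put you through to an operator. </error>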
You can say "help" at any time. The help element defines the text you will get and can contain the same elements as "say". The "next" attribute allows you to branch off to another dialog; without it, the default behavior is to retry the current dialog. The "on" attribute allows you to offer different help text depending on the current values of the application's variables.
<help [name=dialog-name] [on=condition] [next=dialog-name]> ... help dialog steps ... </help>
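For instance (a sketch; the wording and dialog name are illustrative):
<help next="main-menu">
  You can say the name of an airport, such as Bristol, Heathrow, Gatwick or Stansted,
  or say "main menu" to start again.
</help>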
The Voice Browser defines a number of barge-in commands in addition to "help", e.g. to move back to the previous dialog step, to stop the current activity, to change user preferences, to switch to a new activity, etc.
Barge-in commands are grouped as follows:
Context sensitive help
Hooking into built-in commands
Application specific commands
The commands trigger execution of the associated dialog. This allows the application developer to intercept a built-in command to end the task, for example to thank the user or offer a warning. Authors can define application specific commands in addition to help and the commands built into the browser.
To allow for re-use and effective modularization, you can group dialogs into tasks. The task element scopes application variables and allows you to specify import and export sets of variables.
<task name="hotel" import="..." export="..." start="..."> ... dialog definitions ... </task>
You can execute a task with the act element, e.g.
<act task=name/>
The task ends with a return statement, e.g.
<act next="return"/>
Where "return" is a reserved name like "retry".
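Putting this together, a sketch of a reusable task might look as follows (the separator used in the import and export lists, and the variable and dialog names, are assumptions):
<task name="hotel" import="city; date" export="hotel" start="choose">
  <dialog name="choose">
    <say> Which hotel in <var name="city"/> would you like? </say>
    <listen response="hotel"> "Hilton"|"Marriott"|"Novotel" </listen>
    <act next="confirm"/>
  </dialog>
  <dialog name="confirm">
    <say> Booking the <var name="hotel"/> for <var name="date"/>. </say>
    <act next="return"/>
  </dialog>
</task>
Another dialog would then start the task with <act task="hotel"/>, and control returns to the calling dialog when the task reaches <act next="return"/>.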
In some situations, speech recognition can't be used, e.g. you need to be quiet, it would be embarrassing to talk to your handheld, it's really noisy, or you have a bad cold.
You can bind keys to actions with the "key" attribute, e.g. to bind the "1" key:
<act key="1" next="dialog-name"/>
For small displays, you may want to show material on the cell phone's display to complement the spoken output.
This can be handled via the CSS display property, e.g.
p.summary {display: block; speak: none}
Which renders any paragraph with class="summary" on the display without speaking it.
Another idea is to add a show element. There is a great opportunity to control time-based synchronization for multimodal applications using ideas being developed in W3C's Synchronized Multimedia Activity.
As a simple example, consider a flight booking application.
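A rough sketch of such an application, built from the elements described above (the dialog names, prompts, and condition syntax are illustrative assumptions), might look like this:
<talkml first="welcome">
  <dialog name="welcome">
    <say> <speaker name="Eddie"> Welcome to the TalkML Travel Center. </speaker>
          Where would you like to fly? </say>
    <listen grammar="flight" timeout="6"/>
    <act next="confirm"/>
    <error next="retry"> Sorry, I didn't catch that. Where would you like to fly? </error>
    <help> You can say things like: I want to fly to Heathrow from Bristol, please. </help>
  </dialog>

  <dialog name="confirm">
    <say> You want to fly to <var name="to.airport"/>
          from <var name="from.airport"/>. Is that right? </say>
    <listen response="ok"> "yes"|"no" </listen>
    <act on="ok='yes'" next="book"/>
    <act on="ok='no'" next="welcome"/>
  </dialog>

  ... the "flight" grammar and a "book" dialog would complete the application ...
</talkml>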