Guillaume Belrose, who helped to devise TalkML and develop the software
TalkML is an experimental XML language for voice browsers, and is being developed by HP Labs for use in the following markets:
Call centers (IVR++) -- sales and support services accessed via 800 numbers, adding speech recognition to today's DTMF (touch tone) systems
Smart phones with displays
Access to email, appointments, news and travel services etc. while you are on the road (in-car systems)
Mobile devices too small for decent displays or keyboards, e.g. WCDMA palmtop organizers/pagers cheap enough to be a must-have item (like cell phones)
TalkML supports more natural conversations than dialog systems based on keywords, while remaining simple to author. Other work is underway to investigate how to author "dual-access" applications, where the same application can be accessed by both conventional visual browsers and voice browsers, perhaps via transforming HTML into TalkML.
The outermost element is talkml:
<talkml [first="start"]> ... dialog definitions ... </talkml>
An application is defined as a set of dialog blocks.
A typical dialog block involves:
Saying a prompt
Listening for the user's response
Acting on the response
On timeouts, offering help and listening again
Barge-ins allow users to break in when the browser is speaking, e.g. to change to a new task or to get help
Each application defines variables representing information about its state. These are set by the grammar rules according to the user's responses and determine the way the dialog proceeds.
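As a rough sketch of how these pieces fit together (the elements used here are described in detail below; the dialog and variable names, and the quoting of string values in conditions, are illustrative assumptions):
<talkml first="main-menu">
  <dialog name="main-menu">
    <say> Welcome. Would you like flights or hotels? </say>
    <listen response="service"> "flights"|"hotels" </listen>
    <act on="service='flights'" next="flights"/>
    <act on="service='hotels'" next="hotels"/>
    <error next="retry"> Sorry, please say flights or hotels. </error>
  </dialog>
  ... further dialog definitions ...
</talkml>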
The dialog element defines a dialog block. It consists of one or more steps which are executed in sequence, looping back to the beginning after the last step is executed.
<dialog name=name> one or more steps </dialog>
Each step is one of the following:
<say [on=condition]> text used as a prompt </say>
Within the "say" element you can use other elements, e.g.
<say> <audio src="dooropen.wav"/> <speaker name="Eddie"> Welcome to the TalkML Travel Center </speaker> </say>
We propose to support the SABLE tag set for richer control of text to speech, as well as style sheets and ACSS. You can use the var element to insert the value of an application variable into the text, e.g.
<say> Which day do you want to travel to <var name="destination"/>? </say>
The user's expected response is represented by:
<listen [grammar=grammar-name] [timeout=seconds] [response=var-name]> ... grammar rules ... </listen>
You can include the grammar rules in place or refer to an external definition given by a grammar element, e.g.
<listen grammar="main-menu"/>
The response attribute names the variable in which to place the response, and is convenient when the grammar consists of a simple list of choices, e.g.
<listen response="airport"> "Bristol"|"Heathrow"|"Gatwick"|"Stansted" </listen>
You can define grammars separately using the grammar element and refer to them as non-terminals in other grammar rules using the rule element, e.g.
<grammar name="flight" timeout="4">
  intro ((to|dest) from?)|(from (to|dest)?) polite?;
  intro = ("I want"|"I would like") ("to fly"|"a flight");
  polite = "please"|"thanks"|"thank you";
  to = ("going"|"flying")? "to" <rule name="airport"/>
       (leave? departure)? {to.airport=airport};
  dest = "arriving" (arrival atdest?)|(atdest arrival?) ("and" from)?;
  arrival = "on" <rule name="date"/> {to.time=date};
  atdest = "at" <rule name="airport"/> {to.airport=airport};
  from = leave? (origin departure?)|(departure origin) ("and"? (dest|to))?;
  origin = "from" <rule name="airport"/> {from.airport=airport};
  departure = "on" <rule name="date"/> {from.time=date};
  leave = "departing" | "leaving";
</grammar>
Which permits the user to say things like: "I want to fly to Heathrow from Bristol, please" or "I would like a flight going to Gatwick, leaving on Friday".
The grammar uses | for alternatives, ( and ) for grouping, the ? suffix for options, * for zero or more, and + for one or more. Variable assignments for matching rules are given within { and }.
The rule tag could easily be extended to support a macro-like capability.
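For instance, the repetition suffixes might be combined with rule references as in the following sketch (purely illustrative; the airport rule is assumed to be defined elsewhere, as in the flight grammar above):
<grammar name="airport-list">
  <rule name="airport"/> ("and" <rule name="airport"/>)* polite?;
  polite = "please"|"thanks"|"thank you";
</grammar>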
The act element is used to determine what to do next based upon the values of variables set by the user's response or by earlier dialog steps. The "set" attribute is used to set one or more variables to given values. The "next" attribute specifies the dialog to do next.
<act [on=condition] [set="var=value[;var=value]*"] [next=dialog-name]/>
Conditions are boolean expressions over variables, using ( and ) for grouping, together with logical operators such as and, or, not, =, and !=. Additional operators are available for numbers and strings, and for finding out whether a variable is currently undefined.
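For example, an application might branch like this (a sketch; the variable names and the quoting of string values inside conditions are assumptions, since the exact expression syntax isn't spelled out above):
<act on="airport='Heathrow' and fare!='economy'" set="lounge=yes" next="offer-lounge"/>
<act on="airport!='Heathrow'" next="confirm"/>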
An error element is triggered when a timeout has occurred or when the user's response hasn't been understood. Each time this happens an error counter is incremented, allowing you to tailor the prompt accordingly. Note that the dialog name "retry" is reserved to mean retry the current dialog block without clearing the error counter.
<error [on=condition] [set="var=value[;var=value]*"] [next=dialog-name]> ... error prompt ... </error>
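As a sketch of how the prompts might escalate, assuming the error counter is visible to conditions as a variable (called errors here purely as an assumption, as is the operator dialog name):
<error on="errors=1" next="retry"> Sorry, I didn't catch that. Which airport do you want? </error>
<error on="errors=2" next="operator"> I'm still having trouble understanding. Let me put you through to an operator. </error>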
You can say "help" at any time. The help element defines the text you will get and can contain the same elements as "say". The "next" attribute allows you to branch off to another dialog; without it, the default behavior is to retry the current dialog. The "on" attribute allows you to offer different help text depending on the current values of the application's variables.
<help [name=dialog-name] [on=condition] [next=dialog-name]> ... help dialog steps ... </help>
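For instance (a sketch; the wording and dialog name are illustrative):
<help next="main-menu">
  You can say the name of an airport, such as Bristol, Heathrow, Gatwick or Stansted,
  or say "main menu" to start again.
</help>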
The Voice Browser defines a number of barge-in commands in addition to "help", e.g. to move back to the previous dialog step, to stop the current activity, to change user preferences, to switch to a new activity, etc.
Barge-in commands are grouped as follows:
Context sensitive help
Hooking into built-in commands
Application specific commands
The commands trigger execution of the associated dialog. This allows the application developer to intercept a built-in command to end the task, for example to thank the user or offer a warning. Authors can define application specific commands in addition to help and the commands built into the browser.
To allow for re-use and effective modularization, you can group dialogs into tasks. The task element scopes application variables and allows you to specify import and export sets of variables.
<task name="hotel" import="..." export="..." start="..."> ... dialog definitions ... </task>
You can execute a task with the act element, e.g.
<act task=name/>
The task ends with a return statement, e.g.
<act next="return"/>
Where "return" is a reserved name like "retry".
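Putting this together, a sketch of a reusable task might look as follows (the separator used in the import and export lists, and the variable and dialog names, are assumptions):
<task name="hotel" import="city; date" export="hotel" start="choose">
  <dialog name="choose">
    <say> Which hotel in <var name="city"/> would you like? </say>
    <listen response="hotel"> "Hilton"|"Marriott"|"Novotel" </listen>
    <act next="confirm"/>
  </dialog>
  <dialog name="confirm">
    <say> Booking the <var name="hotel"/> for <var name="date"/>. </say>
    <act next="return"/>
  </dialog>
</task>
Another dialog would then start the task with <act task="hotel"/>, and control returns to the calling dialog when the task reaches <act next="return"/>.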
In some situations, speech recognition can't be used, e.g. you need to be quiet, it would be embarrassing to talk to your handheld, it's really noisy, or you have a bad cold.
You can bind keys to actions with the "key" attribute, e.g. to bind the "1" key:
<act key="1" next="dialog-name"/>
For small displays, you may want to show material on the cell phone's display to complement the spoken output.
This can be handled via the CSS display property, e.g.
p.summary {display: block; speak: none}
Which renders any paragraph with class="summary" on the display without speaking it.
Another idea is to add a show element. There is a great opportunity to control time-based synchronization for multimodal applications using ideas being developed in W3C's Synchronized Multimedia Activity.
As a simple example, consider a flight booking application.
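A rough sketch of such an application, built from the elements described above (the dialog names, prompts, and condition syntax are illustrative assumptions), might look like this:
<talkml first="welcome">
  <dialog name="welcome">
    <say> <speaker name="Eddie"> Welcome to the TalkML Travel Center. </speaker>
          Where would you like to fly? </say>
    <listen grammar="flight" timeout="6"/>
    <act next="confirm"/>
    <error next="retry"> Sorry, I didn't catch that. Where would you like to fly? </error>
    <help> You can say things like: I want to fly to Heathrow from Bristol, please. </help>
  </dialog>

  <dialog name="confirm">
    <say> You want to fly to <var name="to.airport"/>
          from <var name="from.airport"/>. Is that right? </say>
    <listen response="ok"> "yes"|"no" </listen>
    <act on="ok='yes'" next="book"/>
    <act on="ok='no'" next="welcome"/>
  </dialog>

  ... the "flight" grammar and a "book" dialog would complete the application ...
</talkml>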