CFDJ: Article

Building VoiceXML Applications


You've used ColdFusion to build forms that send e-mail and build database-driven applications, and maybe even played with some WDDX. Now you're probably looking for a new and different way to use the skills you've gained. VoiceXML could be just the ticket.

As a Web developer, I'm always interested in new technologies and finding different ways to apply the knowledge I've gained. A few months ago I discovered VoiceXML and my curiosity was piqued.

This article is a basic introduction to VoiceXML. It'll get you started - and make you dangerous. Excellent resources on the Internet, included at the end of the article, will take you further. I won't teach you anything you don't already know about ColdFusion; in fact, the ColdFusion examples are simple. Instead, my intent is to expose you to VoiceXML development and show you how you can use ColdFusion to implement dynamic voice applications.

What Is VoiceXML?
If you're not already familiar with the XML in VoiceXML, there's an excellent primer on XML at the TellMe Web site. (See the Resources section for the address.)

So where does the voice part come from? VoiceXML allows you to build applications with a voice interface. We've all called a technical support line and been asked to browse through common menu options using our TouchTone phone. You can build applications like that with VoiceXML, but it's nothing new. That technology is as old as TouchTone phones!

The real advantage to VoiceXML is that it brings the technology to people who are traditionally more familiar with building Web applications. Since it's based on XML, VoiceXML will look familiar to you as a ColdFusion developer. Using VoiceXML, you can build interactive voice applications using voice recognition and synthesized speech. You can even build applications that use voiceprint identification. Remember the movie Sneakers? "My voice is my passport, verify me."

VoiceXML Platforms
To run your voice application, you need to have it hosted by a VoiceXML platform. Two of the most common are the TellMe Studio (http://studio.tellme.com) and the BeVocal Café (http://cafe.bevocal.com). Both provide free access to develop and host one application. When you get to the point of wanting to deploy your application, you can discuss pricing with them based on your needs. The examples in this article are based on the TellMe platform.

Hello World VoiceXML Application
As always, the first thing you have to write is the obligatory Hello World application. I've taken the liberty of including the sample code from the examples at TellMe Networks to show you this (see Listing 1). More about TellMe later.

The <vxml> tag is the VoiceXML equivalent of the <html> tag. You begin and end your VoiceXML document with it. After stating some metainformation, the code begins with the <form> tag.

You're already familiar with forms from HTML, but they're used a little differently here. In VoiceXML a form is a container for fields that handle interactions with your users. The difference is simple: in HTML you use forms as a way to input data to your application. In VoiceXML forms are used both to prompt for data input and to present the output.

A VoiceXML form is actually closer to being a Windows development form than one used for Web development. Here you'll simply use a form as a container to output the "Hello World" phrase to the user.

The <block> tag sets up an executable context, telling the interpreter to present information to the user. If the code read <block>Hello, world!</block>, the interpreter would use text to speech to output this to the user. But that would be cheesy. For our high-quality application you don't want some scratchy computer-generated voice.

Instead, you're going to use the <audio> tag, which either plays a .wav file to the user or outputs text to speech. In this case you're specifying a .wav file to be played (http://resources.tellme.com/audio/misc/hello.wav), but if the specified file doesn't exist, the interpreter falls back to text to speech and outputs the string "hello world". You do this by enclosing that text between the opening and closing <audio> tags.

After this the application pauses, and then uses the <audio> tag to say "goodbye" to the user. By not specifying the path to a .wav file here, you're telling the interpreter to use text to speech. After pausing again, the <goto> tag returns the user to the main menu of the TellMe service.
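Listing 1 is not reproduced here, so the following is only a sketch of the flow just described, not the TellMe sample itself. It is written in standard W3C VoiceXML 2.0 rather than the TellMe dialect, which changes a few details: the pauses are assumed to be <break> elements of 500 ms, the spoken-only "goodbye" is plain prompt text because VoiceXML 2.0 requires a source on <audio>, and <exit/> stands in for the <goto> back to the TellMe main menu, whose target is platform-specific. The form id "hello" is made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="hello">
    <block>
      <prompt>
        <!-- Play the recording; if it cannot be fetched, speak the enclosed text -->
        <audio src="http://resources.tellme.com/audio/misc/hello.wav">hello world</audio>
        <!-- Assumed pause length -->
        <break time="500ms"/>
        <!-- No recording here, so this is rendered with text to speech -->
        goodbye
        <break time="500ms"/>
      </prompt>
      <!-- Stand-in for the listing's <goto> back to the TellMe main menu -->
      <exit/>
    </block>
  </form>
</vxml>
```

The fallback text inside <audio> costs nothing to include, so it is worth keeping even when you expect the recording to be available.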

How Does ColdFusion Fit?
The Hello World application is a basic one that simply outputs some data to the user. To provide real value, you need to tie a VoiceXML application into your database. You can use ASP, Perl, or any number of Web development technologies to do this, but I believe ColdFusion is best suited to the task. It has a proven track record for rapidly building database-enabled Web applications, and its strengths apply here as well.

A VoiceXML application is hosted on a platform like those of TellMe or BeVocal. However, these platforms are able to call out to your Web server to get code dynamically. As long as your ColdFusion application outputs valid VoiceXML, everything works great. Your application can call a ColdFusion page that looks up some data based on user input and returns the results to the platform in VoiceXML.

Building a Sample Application
To see how this works, we'll build a sample application that will prompt a user for the name of a U.S. state. Once that's done, we'll return some specific information about that state. This will let us show a few different ways in which ColdFusion can interact with the VoiceXML platform.

VoiceXML Basics
You need to know a few common terms to start building voice applications. First, you should know what a "grammar" is.

Have you ever used a voice recognition application? When you first start with these applications, you have to go through a training period in which the application learns how you say certain words so it can best recognize what you're saying. In most cases you won't have that luxury when you build a voice application. So how can you use speech recognition with any degree of accuracy?

In VoiceXML a grammar is a list of options that the application will listen for and accept as spoken input. For example, if you were to ask a user to say a favorite month, you'd build a grammar that accepts the values January, February, March, and so on. The system would accept only the values you provide in your grammar. Anything else spoken by the user would be ignored. This way, your application can use voice recognition technology without having to go through the training phase, yet still have a high degree of accuracy.

Both the TellMe and BeVocal platforms use the Nuance Grammar Specification Language (GSL). You can learn the details from their sites or on the Nuance site (www.nuance.com). Nuance also provides a free developer's program you can apply for.

Let's look at a sample grammar to get started (see Listing 2). Grammars are placed inside the <grammar> tag. The CDATA section protects the following section from being recognized as markup by the parser. Your grammar description is placed inside that block.

You're telling the application that the words contained inside the brackets of the grammar are possible options for the user to speak. If you look at the first line of the sample grammar, these options are "dtmf-1" and "massachusetts". The state names are intentionally not capitalized because your grammars should be lowercase. You know that "massachusetts" is a state name, but what is "dtmf-1"?

DTMF stands for dual-tone multi-frequency and is basically the TouchTone that you normally associate with traditional phone applications.

You're saying that the user can either say the number "one" or press it on the phone. They could also say "Massachusetts" and it would return the same value. In the curly braces we can see the value that will be returned from the grammar. Here it's the number one, which happens to be the ID of the state "Massachusetts" in the database. This way, you can do a lookup in the database using the primary key to find the detailed information on the state.

Notice the difference in the entries for New Hampshire and Rhode Island. When you're expecting a phrase from the user rather than a single word, you need to put it in parentheses. By replacing the space with an underscore in the grammar, you allow the user to slur the state name together as if it were one word. You still need to give users the option in parentheses of saying it correctly as two words as well.
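Since Listing 2 is not reproduced here, the following sketch shows a grammar of the shape just described. Several details are assumptions rather than the article's own code: the {<option "...">} form for the returned value follows TellMe's GSL conventions and may differ on other platforms, the "name" attribute mirrors the external-grammar syntax used elsewhere in this article, and the IDs 2 and 3 and their DTMF keys are hypothetical. Only Massachusetts having the ID 1 comes from the text.

```xml
<grammar name="StateList">
  <!-- IDs 2 and 3 are hypothetical; only Massachusetts = 1 comes from the article -->
  <![CDATA[
    [
      [dtmf-1 massachusetts]                  {<option "1">}
      [dtmf-2 new_hampshire (new hampshire)]  {<option "2">}
      [dtmf-3 rhode_island (rhode island)]    {<option "3">}
    ]
  ]]>
</grammar>
```

Each inner bracket lists alternatives that all return the same value, so pressing a key, slurring "new_hampshire", and saying "new hampshire" as two words lead to the same database ID.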

Dynamic Grammars
Now let's get into some of the good stuff. There's another way to reference a grammar without including it between <grammar> tags. You can link to an external grammar, much like referencing stylesheets in HTML. The syntax for this is:

<grammar name="StateList" src="http://www.yoursite.com/grammarfile.gsl"/>

It's beneficial to include grammars like this because the file you include could be any type of file as long as it outputs a valid grammar. Here's where you get to start using ColdFusion. You need to duplicate the sample grammar in Listing 2 dynamically using the database of states. You can download the sample database and source code from CFDJ's Web site (www.sys-con.com/coldfusion/sourcec.cfm).

As in XML, every tag in VoiceXML must have both an opening and a closing tag, as in <grammar></grammar>. However, some tags don't take information between the tags, or the information between tags is sometimes optional. In these cases you can use shorthand to open and close the tag. In the preceding example the grammar tag closes itself with a forward slash so that <grammar></grammar> is the same as <grammar/>.

Listing 3 is an example of a ColdFusion page that will output your dynamic grammar. You start with a basic database query, selecting the data from the states table. The next section of code would be a normal CFOUTPUT with one exception: you have to handle the case of state names that have more than one word.

This example checks for a space in the state name. If one is found, it outputs the state name as one word, replacing the spaces with underscores for those of us who slur. Then the state name is still output as a phrase for those of us who enunciate our words.

Dynamic grammars can get complex and can include thousands of entries - the more you have, the longer it will take your application to recognize commands from the user. However, GSL is extensive, and you can tune your grammars to handle these situations.

Imagine writing a grammar for an application that allowed users to say, "What appointments do I have on Tuesday?" or "Read me the subjects of my unread e-mail from yesterday." While complex, grammars capable of handling these situations can be built. The developer's main decision is how much freedom to allow in user input.

To make your grammar easier to write but less user-friendly, the "What appointments do I have on Tuesday?" grammar could be written to accept the word appointments and then prompt users with a second grammar to choose the day that interests them. If you'd like more details on grammars, read "VoiceXML: Grammars for VoiceXML Applications" by H. Seth (XML-J, Vol. 2, issue 5). See the resources at the end of this article for more information.

Building the Form
Now that you've built your dynamic grammar, you need to build the form that uses the grammar to prompt the user and get input. You'll do this by using a VoiceXML form with one field (see Listing 4). In the example, just after the <meta> tag, a variable named "StateID" is declared that persists across both forms in the application. After that, the form is created and given the name "StateLookup".

Next, a form field called "StateID" is created that will prompt users for the state they're interested in. The link to the dynamic grammar is inserted. The <audio> prompt is now used to ask users for a specific state.

Remember that in a production application you should always have professionally recorded audio for any static messages you present to the user. Even some dynamic content can be prerecorded for a more professional touch. We're going to go with the text to speech, so our <audio> tag doesn't include an "src" reference.

After the <prompt> tag, there are a few other options that the parser will go to depending on what happened in the prompt. These tags are event handlers that are relatively self-explanatory, but we'll go over them anyway. The code in the <nomatch> section is run when the user's input doesn't match any of the entries in the grammar. The user is told that the application didn't understand what was said, and is reprompted for the same information.

The code in the <noinput> section is run if the user didn't say anything or if the application couldn't hear what the user said because of too much background noise. It will also reprompt the user if this occurs.

The <help> tag is special in that the user can say "help" whenever prompted for information. I say "special" because it's something the application will recognize even though it wasn't explicitly allowed in the grammar. This is supported as long as you put code in the event handler for it.

The code under the <filled> section runs when the application recognizes the user's input and returns the value specified in the grammar. Here it will be the ID of the state, and the "StateID" variable will be set to the value returned here. The <goto> tag is then used to jump to another form named "StateDetailLookup" that will make an external call to another ColdFusion page for a database lookup.

This form begins by declaring an executable block of code. After doing this, a local variable named "StateID" is declared and set to the "StateID" value returned from the grammar. The <submit> tag is then used to make the call out to our ColdFusion code to obtain a new VoiceXML document via HTTP. Similar to including a grammar, this could be either a static file hosted somewhere else or it could be a dynamic ColdFusion page.

You can pass variables to your pages by using the GET or POST methods of HTTP. Here the parameter is passed in the query string as a GET request. The method is specified, and the "namelist" parameter is set to "StateID". Here only one value is being passed in the "namelist" field. If there were more values, they would be included as a space-separated list. The application closes with the </vxml> tag to complete this portion.
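Listing 4 is not reproduced here, so the following is a rough sketch of the two forms described in this section, not the original code. It uses standard W3C VoiceXML 2.0 syntax, so attribute names may differ from the TellMe dialect (for example, forms are identified with "id" and the TellMe "anchor" attribute is left out). The prompt wording and the statedetail.cfm URL are hypothetical, and the sketch assumes the grammar's return value fills the "StateID" field.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document-scope variable shared by both forms -->
  <var name="StateID"/>

  <form id="StateLookup">
    <field name="StateID">
      <!-- External grammar, which could be generated by ColdFusion -->
      <grammar src="http://www.yoursite.com/grammarfile.gsl"/>
      <prompt>Which state would you like to hear about?</prompt>

      <!-- Input was heard but matched nothing in the grammar -->
      <nomatch>
        Sorry, I did not understand that.
        <reprompt/>
      </nomatch>
      <!-- Nothing usable was heard -->
      <noinput>
        Sorry, I did not hear anything.
        <reprompt/>
      </noinput>
      <!-- The caller said "help" -->
      <help>
        Say the name of a state, such as Massachusetts.
        <reprompt/>
      </help>

      <filled>
        <!-- Copy the recognized value from the field to the document variable -->
        <assign name="document.StateID" expr="StateID"/>
        <goto next="#StateDetailLookup"/>
      </filled>
    </field>
  </form>

  <form id="StateDetailLookup">
    <block>
      <!-- Local copy of the value, as described in the text -->
      <var name="StateID" expr="document.StateID"/>
      <!-- Hypothetical URL; sends ?StateID=... as a GET request -->
      <submit next="http://www.yoursite.com/statedetail.cfm" method="get" namelist="StateID"/>
    </block>
  </form>
</vxml>
```

To pass more than one value, the "namelist" attribute would list each variable name separated by a space.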

Dynamic VoiceXML
Let's go into some more ColdFusion development by looking at Listing 5, a dynamic result to the external VXML call. The code begins with a simple CFQUERY, looking up the information from the "StateCaps" table based on the "StateID" field passed in the URL.

The VoiceXML document is then built just as if it were built on the platform itself. It starts with the <vxml> tag and goes on from there. A new form called "result" is created that will output the results to the user. There is some error-checking code to make sure that data is actually returned before the results of the query are output to the user.
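Listing 5 is not reproduced here. As a rough illustration, this is the kind of document the ColdFusion page might send back to the platform for a "StateID" of 1. It assumes the "StateCaps" table stores each state's capital, which is a guess based on the table name, and the wording of both messages is hypothetical. The syntax is standard W3C VoiceXML 2.0 and may differ from the TellMe dialect.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="result">
    <block>
      <!-- Emitted when the query returns a row; the values come from the database -->
      <prompt>The capital of Massachusetts is Boston.</prompt>
      <!-- When the query returns no rows, the page would emit an apology here instead,
           for example: Sorry, no information was found for that state. -->
    </block>
  </form>
</vxml>
```

The platform only sees the finished markup, so it cannot tell whether this document came from a static file or from a ColdFusion template.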

At the end of this is a <goto> tag that sends the user back to the last anchor. If you recall, in the "StateLookup" form created in Listing 4, the "anchor" attribute was set to "true". This is the last place an anchor was set, so this is where the application returns.

We could mix ColdFusion code and VXML throughout our document, just as we would with HTML or WML. The only difference between these technologies is the format of the output. Most of us have developed different versions of the same code for various reasons, whether to support multiple platforms (WAP phones, or Palm or Windows CE devices) or even just to support different Web browsers. The same techniques we used then could be used to build a ColdFusion application to support both Web and voice browsers.

What Could I Build with VoiceXML?
By now you're probably thinking about which of your ColdFusion applications you could port over to VoiceXML. There's probably a good place for you to use your skills and enhance applications as well. Let's go through some samples.

One thing you could build is an employee list. You could have a dynamic grammar that lists all employees' names as options and gives their detailed information as a result. Or how about building an order- and shipment-tracking application that could be implemented as the hold dialog when customers call in? You could then look up their most recent orders and shipments using their ANI information and read them back the status. (ANI stands for Automatic Number Identification and while different from Caller ID, it offers similar functionality.)

Another possibility would be to build a VoiceXML front end to your e-mail package, allowing your users to access their e-mail from any phone. You could get really fancy and tie it into their calendars as well. If you really want a fun project, sign up with the BeVocal Café and start playing with the Voiceprint Identification. Use that to authenticate your regular users.

Sharing the Inspiration
This article's primary purpose is to give you a basic understanding of VoiceXML and spark your interest enough to build an application. If you'd like more information and development resources, see the links at the end of the article. Also, if you're going to Macromedia DevCon this year, they'll have a session entitled "Delivering Dynamic, Voice-Driven Applications Using VXML." If you do feel inspired to build a voice application, let me know how it's going by e-mailing me. I'd love to hear about what you've done.

Resources

  1. TellMe Studio: http://studio.tellme.com/
  2. BeVocal Cafe: http://cafe.bevocal.com/
  3. Nuance: www.nuance.com
  4. VoiceXML Forum: www.voicexml.org/
  5. From the W3C: www.w3.org/Voice/#implementations
  6. Seth, H. (2001). "Tools for Developing VoiceXML Apps." XML-Journal, Vol. 2, issue 3.

More Stories By Ben Parson

Ben Parson is a Senior Web Developer for PowerQuest Corporation. He focuses mainly on E-commerce, web development, and database design. He has been building Internet database applications for the past seven years, and enjoys working with new technologies.
