Disaggregate

Consulting

home
schedule
archives
resume
Contact Us

The Tricky Challenge of Multimodal Interfaces

Combining text and speech in the same application is an example of a "multimodal" interface. Multimodal interfaces combine different input modes -- e.g., voice, text, mouse clicks, gestures -- and output modes -- voice, text and graphics, and perhaps others. The user can switch back and forth between modes during the application, at one point speaking a choice, at another point gesturing with a stylus.

Multimodal applications present fascinating challenges to user interface designers. With various modes all active simultaneously, the user interface (UI) must present a consistent conceptual model to the user and provide consistent information. If speech recognition is one of the input modes, the other modes can provide enormous help by prompting the user to speak an easily recognized utterance.

Some multimodal systems are on a single device. One example is a PDA with a form to fill out: A gesture with a stylus chooses which input box is the target, but the actual text input is made by speaking. A more difficult problem for system architects is how to unite different devices into a coherent system. (In the airport example, I speak into the telephone, but the output is sent to my PDA or laptop.) UI designers and application developers must handle a wide range of modes that vary from user to user and use to use. The system I describe in this article is designed to explore multidevice systems.
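The single-device PDA form mentioned above can be sketched in a few lines of Python. This is only an illustration, not the system described in this article: the class `MultimodalForm`, its methods `on_stylus_tap` and `on_speech`, and the field names are all hypothetical, and the speech recognizer is assumed to hand over already-recognized text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MultimodalForm:
    """Hypothetical form: a stylus tap picks the target box, speech supplies the text."""
    fields: Dict[str, str] = field(default_factory=dict)
    focused: Optional[str] = None

    def on_stylus_tap(self, field_name: str) -> None:
        # Gesture mode: choose which input box receives the next utterance.
        if field_name not in self.fields:
            raise KeyError(f"unknown field: {field_name}")
        self.focused = field_name

    def on_speech(self, recognized_text: str) -> bool:
        # Speech mode: fill the focused box; report False if no box is selected.
        if self.focused is None:
            return False
        self.fields[self.focused] = recognized_text
        return True


# Assumed field names, for illustration only.
form = MultimodalForm(fields={"name": "", "destination": ""})
form.on_stylus_tap("destination")
form.on_speech("Chicago")
print(form.fields)
```

The two modes share one piece of state, the focused field, which is what keeps the conceptual model consistent: whichever mode the user picks next, both agree on where the input goes.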

Hints on Speech User Interfaces

The present trend in speech user interfaces is to create not just an application but an entire personality behind the application: The designers of the application envision the age, height, personality, hobbies, and marital status of the virtual agent that answers the call, and then build an application/dialogue around the personality of the virtual agent. Perhaps the most extreme advocates of this style of interface are my colleagues at Jellyvision, who believe that writers — who excel in creating movies, books, and plays — should design applications.

Personally, I find Bruce Balentine's work highly persuasive. Bruce maintains that users value a brisk, task-oriented user interface, not a chatty one; after all, we value individuals who can efficiently service our requests when we're in a hurry, and most people who are on the telephone are not attempting to bond with the application at the other end. Furthermore, a brisk user interface avoids many problems.

As an example, a brisk user interface does not argue with the user when there's a mistake. There's no "Sorry, your response was not understood," which implies that the user was to blame for a recognition error. There's no "Sorry, I didn't get that," which implies that the application is a fallible human. Instead, if there's a recognition failure, the system follows the example set by efficient human beings: it simply offers valid choices to the user.

Bruce shows a wonderful video with the following approximate dialogue to get this point across. This table shows the differences between a "chatty" interface and the brisk interface:

Personality-Based "Chatty" Interface:

Machine: What would you like to do next?

User: What are my choices?

Machine: I'm sorry, I didn't get that. Your choices are news, weather or sports.

User: Weather, please.

Brisk, Task-Oriented Interface:

Machine: What would you like to do next?

User: What are my choices?

Machine: You can ask for news, weather, or sports scores.

User: Weather, please.

Note that the user's first utterance was entirely out of vocabulary (industry jargon for "the user said words we were not prepared to recognize"). In the chatty personality-based interface, the application designers feel compelled to discuss the fact that something went wrong with speech recognition; and the user will be irritated because either the user made a mistake or the application did. But in the brisk, task-oriented interface, the application simply moves on and tells the user what to say. Errors are taken in stride — and the user is moved towards the goal of completing the transaction. The difference between the task-oriented interface and the chatty interface is quite striking.
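The brisk error handling described above can be sketched as a small Python function. This is a minimal sketch under stated assumptions: `recognize` is a stand-in for a real speech recognizer (it does simple keyword matching on text), the names `CHOICES`, `recognize`, and `brisk_reply` are hypothetical, and the confirmation wording "Here is the ..." is invented for illustration.

```python
# Vocabulary taken from the example dialogue above.
CHOICES = ("news", "weather", "sports scores")


def recognize(utterance):
    """Stand-in for a speech recognizer (an assumption): return the matched
    choice, or None when the utterance is out of vocabulary."""
    words = utterance.lower()
    for choice in CHOICES:
        # Match on the first word of each choice, so "sports" finds "sports scores".
        if choice.split()[0] in words:
            return choice
    return None


def brisk_reply(utterance):
    """Return (choice, prompt). On a recognition failure there is no apology
    and no assignment of blame: the prompt simply lists the valid choices."""
    choice = recognize(utterance)
    if choice is None:
        prompt = "You can ask for " + ", ".join(CHOICES[:-1]) + ", or " + CHOICES[-1] + "."
        return None, prompt
    return choice, "Here is the " + choice + "."


print(brisk_reply("What are my choices?")[1])
print(brisk_reply("Weather, please.")[1])
```

The failure branch carries no special error state. An out-of-vocabulary utterance is handled by the same code path that a request for help would use, which is what lets the dialogue take errors in stride.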

Site and contents © 2001, 2002 Moshe Yudkowsky

Last updated 2002-09-23