XHTML+Voice
XHTML+Voice (commonly X+V) is an XML language for describing multimodal user interfaces. Its two essential modalities are visual and auditory. As in most current web pages, visual interaction is defined with XHTML. Auditory components are defined with a subset of VoiceXML. The voice and visual components of an X+V document are connected through a combination of ECMAScript (JavaScript) and XML Events.
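The following is a minimal sketch of how the three parts fit together, written as a PHP function that returns the markup of a small X+V page. It is modeled on the common "hello world" pattern: XHTML provides the visual page, a VoiceXML form (vxml: prefix) in the head provides the speech, and XML Events attributes (ev: prefix) on the body run that form when the page loads. The DOCTYPE declaration is omitted, and the function name and the form id are illustrative, not part of any specification.

```php
<?php
// Illustrative sketch: returns a minimal X+V document as a string.
// xmlns      -> XHTML (the visual modality)
// xmlns:vxml -> VoiceXML (the auditory modality)
// xmlns:ev   -> XML Events (wires the two together)
function xvHelloDocument(): string
{
    // Nowdoc (quoted identifier): no variable interpolation inside the markup.
    return <<<'XV'
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Hello</title>
    <!-- Voice dialog: speaks its text when activated -->
    <vxml:form id="sayHello">
      <vxml:block>Hello world!</vxml:block>
    </vxml:form>
  </head>
  <!-- XML Events: on the body's load event, run the form with id sayHello -->
  <body ev:event="load" ev:handler="#sayHello">
    <p>Hello world!</p>
  </body>
</html>
XV;
}
```

In a browser with X+V support, the visible paragraph and the spoken greeting come from the same document; the ev:handler attribute is what links the visual page to the voice dialog.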
Voice input
Voice input, or speech recognition, is based on grammars that define the set of possible input phrases. In contrast to the probabilistic approach employed by popular dictation software such as Dragon NaturallySpeaking, the grammar-based approach gives the recognizer important contextual information that can significantly improve recognition accuracy. Supported grammar formats include JSGF.
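As a sketch of what "the set of possible input" means in practice, the PHP function below builds a one-rule JSGF grammar that accepts exactly one of the listed phrases. With such a grammar the recognizer only has to distinguish among a few known phrases rather than an open vocabulary. The function name, grammar name and rule name are hypothetical, and the phrases are assumed to be plain words containing no JSGF operators (such as |, * or angle brackets).

```php
<?php
// Hypothetical helper: builds a JSGF grammar with a single public rule whose
// alternatives are the given phrases. Phrases are assumed to be plain words.
function jsgfChoiceGrammar(string $grammar, string $rule, array $phrases): string
{
    return "#JSGF V1.0;\n"                                         // JSGF header
        . "grammar {$grammar};\n"                                  // grammar name
        . "public <{$rule}> = " . implode(' | ', $phrases) . ";\n"; // alternatives
}

echo jsgfChoiceGrammar('cities', 'city', ['Boston', 'New York', 'San Francisco']);
```

The call above prints a grammar whose only rule is public <city> = Boston | New York | San Francisco; a multi-word alternative such as New York is simply a sequence of tokens.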
Voice output
Voice output, or speech synthesis, can read any string at virtually any time. Pitch, volume, and other characteristics can be customized using CSS and the Speech Synthesis Markup Language (SSML); however, the Opera browser does not currently support all of these features.
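For illustration, the sketch below wraps a string in an SSML prosody element, which is one way SSML expresses pitch and volume. It accepts only the predefined labels that SSML 1.0 defines for these two attributes and escapes the text so arbitrary strings stay well-formed. The function name is hypothetical, and the fragment is emitted without a namespace prefix; when embedded in an X+V document it would presumably need the prefix used by the surrounding VoiceXML, and, as noted above, a given browser may ignore some of these settings.

```php
<?php
// Predefined SSML 1.0 labels for the prosody element's pitch and volume.
const SSML_PITCH = ['x-low', 'low', 'medium', 'high', 'x-high', 'default'];
const SSML_VOLUME = ['silent', 'x-soft', 'soft', 'medium', 'loud', 'x-loud', 'default'];

// Hypothetical helper: returns "<prosody ...>text</prosody>" with escaped text.
function ssmlProsody(string $text, string $pitch = 'default', string $volume = 'default'): string
{
    if (!in_array($pitch, SSML_PITCH, true) || !in_array($volume, SSML_VOLUME, true)) {
        throw new InvalidArgumentException('unsupported pitch or volume label');
    }
    // Escape &, <, > and quotes so the result is always well-formed XML.
    $safe = htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
    return sprintf('<prosody pitch="%s" volume="%s">%s</prosody>', $pitch, $volume, $safe);
}
```

Restricting the attributes to named labels keeps the sketch simple; SSML also allows numeric and relative values, which this helper deliberately rejects.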
MIME types
The recommended MIME type for X+V documents is application/xhtml+voice+xml. The Opera browser will also interpret X+V documents served as text/xml. Since most web servers associate the .xml extension with text/xml, giving static X+V files the .xml extension is a fairly safe way to make them browsable.
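Where a server's extension mapping cannot be changed, a script can set the type explicitly. The sketch below serves an existing X+V file with the recommended MIME type, or with text/xml if requested. The function name and its parameters are hypothetical; the file path is whatever static document you want to serve.

```php
<?php
// Hypothetical helper: sends a Content-Type header and then the file's bytes.
// Returns false (and sends nothing) if the file cannot be read.
function serveXvFile(string $path, bool $useTextXml = false): bool
{
    if (!is_readable($path)) {
        return false;
    }
    // application/xhtml+voice+xml is the recommended type; text/xml is the
    // alternative that the Opera browser also interprets as X+V.
    $type = $useTextXml ? 'text/xml' : 'application/xhtml+voice+xml';
    if (!headers_sent()) {
        header('Content-Type: ' . $type);
    }
    readfile($path);
    return true;
}
```

Setting the header in code makes the document independent of how the web server maps file extensions to MIME types.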
X+V-enabled browsers
The most commonly used X+V browser is the Opera browser. Users of the Opera browser can enable X+V support through steps described at http://www.opera.com/voice/. Voice is not yet supported in Opera Mini or on platforms other than Windows.
Detecting support for X+V is best done on the server by checking the HTTP "Accept" header for the MIME type application/xhtml+voice+xml. The following PHP code prints "true" if the requesting browser lists XHTML+Voice in its "Accept" header and "false" otherwise:
<?php
/* Echoes "true" if the requesting browser lists XHTML+Voice in its "Accept" header. */
// Read the header safely: it may be absent, and indexing a missing key raises a notice.
$accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';
if ($accept) // the "Accept" header was sent by the browser
{
    // strpos() returns false when the MIME type does not appear in the header
    if (strpos($accept, "application/xhtml+voice+xml") === false)
        echo "false";
    else
        echo "true";
}
else
    echo "false"; // assume a missing "Accept" header implies no X+V support
?>
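The check above is a plain substring match, so it also reports support for a header such as application/xhtml+voice+xml;q=0, which under HTTP content negotiation means the type is explicitly not acceptable. A minimal sketch of a stricter test, assuming standard Accept syntax (comma-separated media ranges with optional ;q= parameters), might look like this. The function name is hypothetical. Wildcards such as */* are deliberately not treated as a match, since nearly every browser sends them whether or not it supports X+V.

```php
<?php
// Hypothetical helper: true only if the Accept header value explicitly lists
// $type with a quality value (q) greater than zero. Wildcards never match.
function acceptsMimeType(string $accept, string $type): bool
{
    foreach (explode(',', $accept) as $mediaRange) {
        // A media range looks like "type/subtype;param=value;q=0.5".
        $parts = array_map('trim', explode(';', $mediaRange));
        if (strcasecmp(array_shift($parts), $type) !== 0) {
            continue; // a different type (or a wildcard); keep looking
        }
        $q = 1.0; // q defaults to 1 when the parameter is absent
        foreach ($parts as $param) {
            $kv = explode('=', $param, 2);
            if (count($kv) === 2 && strcasecmp(trim($kv[0]), 'q') === 0) {
                $q = (float) trim($kv[1]);
            }
        }
        return $q > 0.0;
    }
    return false;
}

$accept = $_SERVER['HTTP_ACCEPT'] ?? '';
echo acceptsMimeType($accept, 'application/xhtml+voice+xml') ? 'true' : 'false';
```

MIME type comparison is case-insensitive, which is why strcasecmp() is used rather than a plain string comparison.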
Related technology
Speech Application Language Tags (SALT) is a similar format developed by Microsoft in 2001 to compete with VoiceXML and XHTML+Voice. Like X+V, SALT provides multimodal support, including grammar-based recognition and synthesized speech output. SALT's main advantage over X+V is the server-side support provided by Microsoft products such as the Microsoft Speech Application SDK and Microsoft Speech Server.