XHTML+Voice
XHTML+Voice (commonly X+V) is an XML language for describing multimodal user interfaces. The two essential modalities are visual and auditory. Visual interaction is defined with XHTML, as in most current web pages. Auditory components are defined by a subset of VoiceXML. The voice and visual components of an X+V document are connected through a combination of ECMAScript, JavaScript, and XML Events.
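As a rough illustration of how the three pieces fit together, the sketch below holds a minimal X+V-style page in a Python string and checks that the XML Events attributes on the visual element point at a voice form. The page content, the id say_hello, and the helper linked_handlers are hypothetical names chosen for this example. Only the three namespace URIs are taken from the XHTML, VoiceXML, and XML Events specifications.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the XHTML, VoiceXML and XML Events specifications.
XHTML = "http://www.w3.org/1999/xhtml"
VXML = "http://www.w3.org/2001/vxml"
EV = "http://www.w3.org/2001/xml-events"

# Hypothetical page: the id "say_hello" and all text are made up for illustration.
PAGE = f"""<html xmlns="{XHTML}" xmlns:vxml="{VXML}" xmlns:ev="{EV}">
  <head>
    <title>Greeting</title>
    <!-- Voice part: a VoiceXML form that speaks a prompt -->
    <vxml:form id="say_hello">
      <vxml:block>Hello, world.</vxml:block>
    </vxml:form>
  </head>
  <body>
    <!-- Visual part: ordinary XHTML; XML Events ties the click to the form -->
    <p ev:event="click" ev:handler="#say_hello">Click to hear a greeting.</p>
  </body>
</html>"""


def linked_handlers(page: str) -> dict[str, str]:
    """Map each event name to the id of the voice form that handles it."""
    root = ET.fromstring(page)
    form_ids = {form.get("id") for form in root.iter(f"{{{VXML}}}form")}
    links = {}
    for element in root.iter():
        event = element.get(f"{{{EV}}}event")
        handler = element.get(f"{{{EV}}}handler", "").lstrip("#")
        if event and handler in form_ids:
            links[event] = handler
    return links


print(linked_handlers(PAGE))
```

The script only verifies that the markup is well-formed and that the handler reference resolves. Rendering the page and speaking the prompt would require an X+V-capable browser.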
Voice input, or speech recognition, is based on grammars that define the set of possible input text. In contrast to the probabilistic approach employed by popular software packages such as Dragon NaturallySpeaking, the grammar-based approach provides the recognizer with important contextual information about what the user may say, which significantly boosts recognition accuracy. The grammar formats include JSGF.
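The effect of constraining input with a grammar can be sketched with a toy example. This is not how a real recognizer works: a real engine scores audio against the grammar, whereas this sketch uses string similarity from Python's difflib as a stand-in for acoustic scoring. The grammar, the names expand and recognize, and the cutoff value are all assumptions made for illustration.

```python
from difflib import get_close_matches
from itertools import product

# Toy grammar: each slot lists its alternatives. In JSGF this would be roughly
#   public <order> = (small | large) (cheese | pepperoni) pizza;
GRAMMAR = [["small", "large"], ["cheese", "pepperoni"], ["pizza"]]


def expand(grammar):
    """Enumerate every phrase the grammar accepts."""
    return [" ".join(words) for words in product(*grammar)]


def recognize(hypothesis, grammar, cutoff=0.6):
    """Return the closest in-grammar phrase, or None if nothing is close enough."""
    matches = get_close_matches(hypothesis, expand(grammar), n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(expand(GRAMMAR))
print(recognize("large peperoni pizza", GRAMMAR))  # noisy but in-domain input
print(recognize("what time is it", GRAMMAR))       # out-of-grammar input
```

Because only four phrases are possible, a slightly garbled hypothesis can still be mapped to a valid phrase, while input outside the grammar is rejected instead of being transcribed incorrectly.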
Voice output, or speech synthesis, can read any string at virtually any time. Pitch, volume, and other characteristics can be customized using CSS and Speech Synthesis Markup Language (SSML).
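To make the SSML side concrete, the following sketch builds a small SSML fragment that sets pitch and volume through the prosody element. The helper name to_ssml and the sample sentence are hypothetical. The element names, the attribute values, and the namespace URI come from the SSML 1.0 specification. How a given X+V browser accepts such a fragment is not covered here.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"      # SSML 1.0 namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def to_ssml(text, pitch="medium", volume="medium", lang="en-US"):
    """Wrap text in an SSML <prosody> element with the given pitch and volume."""
    # Serialize SSML elements without a prefix (default namespace).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(f"{{{SSML_NS}}}speak", {"version": "1.0"})
    speak.set(XML_LANG, lang)
    prosody = ET.SubElement(
        speak, f"{{{SSML_NS}}}prosody", {"pitch": pitch, "volume": volume}
    )
    prosody.text = text  # special characters such as & are escaped on output
    return ET.tostring(speak, encoding="unicode")


print(to_ssml("Your order is ready.", pitch="high", volume="loud"))
```

SSML accepts labels such as low, medium, and high for pitch, and soft, medium, and loud for volume, so the same text can be rendered differently without changing the string itself.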