Glossary of Spoken Dialogue related Terms

A
Acronym
A word formed from the initial letters of a name, such as PC for Personal Computer, or by combining initial letters or parts of a series of words, such as radar for radio detection and ranging, or variations, such as W3C for World Wide Web Consortium.
Acronym expansion
The action of replacing an acronym by the sequence of words it represents. Acronym expansion is typically performed to help TTS engines read acronyms and ASR to recognize them.
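As a rough illustration of acronym expansion, the sketch below replaces whole-word acronyms using a lookup table. The table contents and the function name `expand_acronyms` are assumptions made for this example; a production TTS or ASR front end would use a far larger table and context-sensitive rules.

```python
import re

# Hypothetical expansion table, for illustration only.
EXPANSIONS = {
    "PC": "Personal Computer",
    "W3C": "World Wide Web Consortium",
}

def expand_acronyms(text, table=EXPANSIONS):
    """Replace each whole-word acronym found in the table by its expansion."""
    if not table:
        return text
    # Try longer acronyms first so a short one never shadows a longer one.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)
```

Because the pattern matches on word boundaries, a string such as "PCs" is left unchanged in this sketch.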
active grammar
A speech or DTMF grammar that is currently active. This is based on the currently executing element, and the scope elements of the currently defined grammars.
Anchor-point
When referencing an input interval with emma:time-ref-uri, emma:time-ref-anchor allows you to specify whether the referenced anchor is the start or end of the interval.
Annotation
Information about the interpreted input, for example, timestamps, confidence scores, links to raw input, etc.
ASR (Automatic Speech Recognition)
The process of using an automatic computation algorithm to analyze spoken utterances to determine what words and phrases or semantic information were present.
Augmented BNF syntax (ABNF)
A plain-text (non-XML) representation of speech recognition grammars which is similar to traditional BNF grammar and to many existing BNF-like representations commonly used in the field of speech recognition, including the JSpeech Grammar Format from which this specification is derived. Augmented BNF should not be confused with Extended BNF, which is used in DTDs for XML and SGML. ABNF is defined in [SRGS].
C
catch element
A <catch> block or one of its abbreviated forms. Certain default catch elements are defined by the VoiceXML interpreter.
CCXML
Call Control eXtensible Markup Language. A markup language for specifying telephony call control applications. Developed by the Voice Browser Working Group at W3C, the initial Working Draft is at http://www.w3.org/TR/ccxml/.
CFG
Context-free grammar, such as W3C SRGS (see below).
cHTML
Compact HTML for Small Information Appliances, a W3C Note at http://www.w3.org/TR/1998/NOTE-compactHTML-19980209/.
Composite input
An input formed from several pieces, often in different modes, for example, a combination of speech and pen gesture, such as saying "zoom in here" and circling a region on a map.
Confidence
A numerical score describing the degree of certainty in a particular interpretation of user input.
Control item
A form item whose purpose is either to contain a block of procedural logic (<block>) or to allow initial prompts for a mixed initiative dialog (<initial>).
D
Data model
For EMMA, a data model defines a set of constraints on possible interpretations of user input.
Derivation
Interpretations of user input are said to be derived from that input, and higher-level interpretations may be derived from lower-level ones. EMMA allows you to reference the user input or interpretation from which a given interpretation was derived; see semantic interpretation.
Dialogue
A sequence of interactions between the users and the application. Dialogues can be specified using VoiceXML.
DOM
Document Object Model, a standard interface to the contents of a web page, as described at http://www.w3.org/DOM/.
DTMF (Dual Tone Multi-Frequency)
Touch-tone or push-button dialing. Pushing a button on a telephone keypad generates a sound that is a combination of two tones, one high frequency and the other low frequency.
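The pairing of one low and one high tone per key can be sketched as a table lookup. The frequencies below are the standard DTMF values (each keypad row shares a low tone, each column a high tone); the names `LOW_TONES`, `HIGH_TONES`, `KEYPAD`, and `dtmf_tones` are assumptions made for this example.

```python
# Standard DTMF layout: each row shares a low tone, each column a high tone (Hz).
LOW_TONES = (697, 770, 852, 941)
HIGH_TONES = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")

def dtmf_tones(key):
    """Return the (low, high) frequency pair in Hz for one keypad character."""
    if len(key) == 1:
        for row, keys in enumerate(KEYPAD):
            col = keys.find(key)
            if col != -1:
                return LOW_TONES[row], HIGH_TONES[col]
    raise ValueError(f"not a DTMF key: {key!r}")
```

For example, `dtmf_tones("5")` yields the pair for the second row and second column of the keypad.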
E
ECMA Compact Profile
ECMAScript Compact Profile (see [ECMA-327]) is a subset of ECMAScript 3rd Edition tailored to resource-constrained devices such as battery-powered embedded devices.
ECMA-323
XML Protocol for Computer Supported Telecommunications Applications (CSTA) Phase III, as specified at http://www.ecma.ch/ecma1/STAND/ecma-323.htm by ECMA, the European Computer Manufacturers Association. This specifies an XML protocol for the CSTA services described in ECMA 269.
ECMAScript
A standard version of JavaScript backed by the European Computer Manufacturers Association.
EMMA
The Extensible MultiModal Annotation markup language is a W3C specification for containing and annotating the interpretation of user input. Examples of interpretation of user input are a transcription into words of a raw signal, for instance derived from speech, pen or keystroke input, a set of attribute/value pairs describing their meaning, or a set of attribute/value pairs describing a gesture. The interpretation of the user's input is expected to be generated by signal interpretation processes, such as speech and ink recognition, semantic interpreters, and other types of processors for use by components that act on the user's inputs such as interaction managers.
End point
In EMMA, this refers to a network location which is the source or recipient of an EMMA document. Note that this usage of the term "endpoint" differs from its usage in speech processing, where it refers to the end of a speech input.
Event bubbling / Event propagation
This is the idea that an event can affect one object and a set of related objects. Any of the potentially affected objects can block the event or substitute a different one (upward event propagation). The event is broadcast from the node at which it originates to every parent node.
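Upward propagation can be pictured as a walk from the originating node through its parents until some node handles (blocks) the event. The sketch below is a simplified model under that assumption; the `Node` class, the `handles` attribute, and the `bubble` function are hypothetical names, not part of any specification.

```python
class Node:
    """One element in a document tree; `handles` lists the events it catches."""
    def __init__(self, name, parent=None, handles=()):
        self.name = name
        self.parent = parent
        self.handles = set(handles)

def bubble(event, origin):
    """Return the names of the nodes visited, from origin upward,
    stopping at the first node that handles the event."""
    visited = []
    node = origin
    while node is not None:
        visited.append(node.name)
        if event in node.handles:
            break  # this node blocks further propagation
        node = node.parent
    return visited

# Hypothetical three-level tree: document > form > field.
document = Node("document", handles=["help"])
form = Node("form", parent=document, handles=["noinput"])
field = Node("field", parent=form)
```

In this tree, an event raised at `field` that no ancestor handles is offered to every node up to `document`.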
Event
A notification "thrown" by the implementation platform, VoiceXML interpreter context, VoiceXML interpreter, or VoiceXML code. Events include exceptional conditions (semantic errors), normal errors (user did not say something recognizable), normal events (user wants to exit), and user defined events.
Executable content
Procedural logic that occurs in <block>, <filled>, and event handlers.
F
FIA (Form Interpretation Algorithm)
An algorithm implemented in a VoiceXML interpreter which drives the interaction between the user and a VoiceXML form or menu. See Section 2.1.6 and Appendix C.
Form
A dialog that interacts with the user in a highly flexible fashion with the computer and the user sharing the initiative.
Form item variable
A variable, either implicitly or explicitly defined, associated with each form item in a form. If the form item variable is undefined, the form interpretation algorithm will visit the form item and use it to interact with the user.
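The role of the form item variable in selecting the next item to visit can be sketched as follows. This is a deliberately simplified model of the selection step only: it ignores guard conditions and explicit transitions, and it uses Python's `None` to stand for "undefined". The function name `next_form_item` is an assumption made for this example.

```python
def next_form_item(form_items, variables):
    """Return the first form item whose variable is still undefined, or None
    when every item has been filled."""
    for name in form_items:
        if variables.get(name) is None:
            return name
    return None

# Hypothetical form with three items, one of which is already filled.
items = ["origin", "destination", "date"]
filled = {"origin": "Boston"}
```

With the data above, the sketch would visit `destination` next, since `origin` already has a value.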
G
Gestures
In multimodal applications gestures are communicative acts made by the user or application. An example is circling an area on a map to indicate a region of interest. Users may be able to gesture with a pen, keystrokes, hand movements or sound. Gestures often form part of composite input. Application gestures are typically animations and/or sound effects.
Grammar
A set of rules that describe a sequence of tokens expected in a given input. These can be used by speech and handwriting recognizers to increase recognition accuracy.
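A toy example may help make "a set of rules that describe a sequence of tokens" concrete. The grammar format below (rule names prefixed with `$`, alternatives as token lists) is invented for this sketch and is not SRGS syntax; the matcher is a naive recursive one that would not terminate on left-recursive rules.

```python
# Hypothetical toy grammar: each rule maps to a list of alternatives, and each
# alternative is a sequence of tokens or rule references (names starting with "$").
GRAMMAR = {
    "$order": [["i", "want", "$size", "pizza"], ["$size", "pizza"]],
    "$size": [["a", "small"], ["a", "large"]],
}

def match(grammar, symbols, tokens):
    """Return True if the token list can be derived from the symbol sequence."""
    if not symbols:
        return not tokens  # success only if all tokens were consumed
    head, rest = symbols[0], symbols[1:]
    if head.startswith("$"):
        # Expand the rule reference with each alternative in turn.
        return any(match(grammar, alt + rest, tokens) for alt in grammar[head])
    return bool(tokens) and tokens[0] == head and match(grammar, rest, tokens[1:])
```

A call such as `match(GRAMMAR, ["$order"], "a large pizza".split())` checks whether the utterance is covered by the `$order` rule.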
Grammar Document
An XML or ABNF grammar document as defined in sections 5.2 and 5.5 of [SRGS].
Grammar Fragment
An XML Fragment as defined in section 5.1 of [SRGS].
Grapheme
One of the set of the smallest units of a written language, such as letters, ideograms, or symbols, that distinguish one word from another; a representation of a single orthographic element.
H
Handwriting recognition
The process of converting pen strokes into text.
Homograph
'a word of the same written form as another but of different meaning and usually origin, whether pronounced the same way or not, as bear "to carry; support" and bear "animal" or lead "to conduct" and lead "metal."' [DICT] An example from French is fils (son) and fils (threads).
Homophone
One of a set of words that are pronounced the same way but differ in meaning, origin, and sometimes spelling. E.g. night and knight in English. Note that color and colour are considered multiple orthographies for the same word and not homophones, because they are variations of spelling with the same pronunciation and meaning.
Hosting environment
The Grammar processor or VoiceXML processor or other computer program that contains a processor for Semantic Interpretation.
I
Implementation platform
A computer with the requisite software and/or hardware to support the types of interaction defined by VoiceXML.
Ink recognition
This includes the recognition of handwriting and pen gestures.
Input cost
In EMMA, this refers to a numerical measure indicating the weight or processing cost associated with a user's input or part of their input.
Input device
The device providing a particular input, for example, a microphone, a pen, a mouse, a camera, or a keyboard.
Input function
In EMMA, this refers to the use a particular input is serving, for example, as part of a recording or transcription, as part of a dialogue, or as a means to verify the user's identity.
Input item
A form item whose purpose is to input an input item variable. Input items include <field>, <record>, <object>, <subdialog>, and <transfer>.
Input medium
Whether the input is acoustic, visual, or tactile. For instance, a spoken utterance is an acoustic input, a hand gesture as seen by a camera is a visual input, and pointing with a mouse or pen is a tactile input.
Input mode
This distinguishes a particular means of providing an input within a general input medium, for example, speech, DTMF, ink, key strokes, video, photograph, etc.
Input source
This is the device that provided the input, for example a particular microphone or camera. EMMA allows you to identify these with a URI.
Input tokens
In EMMA, this refers to a sequence of characters, words or other discrete units of input.
Instance data
A representation in XML of an interpretation of user input.
Interaction manager
A processor that determines how an application interacts with a user. This can be at multiple levels of abstraction, for example, at a detailed level, determining what prompts to present to the user and what actions to take in response to user input, versus a higher level treatment in terms of goals and tasks for achieving those goals. Interaction managers are typically event driven.
International Phonetic Alphabet (IPA)
The International Phonetic Alphabet [IPA] is a phonetic alphabet used by linguists to accurately and uniquely represent each of the wide variety of sounds (phones or phonemes) the human vocal apparatus can produce. It is intended as a notational standard for the phonetic representation of all languages.
Interpretation
In EMMA, an interpretation of user input refers to information derived from the user input that is meaningful to the application.
J
JSGF (Java Speech Grammar Format)
A platform-independent, vendor-independent textual representation of grammars for use in speech recognition.
JTAPI
Java Telephony API. See http://java.sun.com.
K
Keystroke input
Input provided by the user pressing on a sequence of keys (buttons), such as a computer keyboard or keypad.
L
Language identifier
A language identifier labels information content as being of a particular human language variant. Following the XML specification for language identification [XML], a legal language identifier is identified by an RFC 3066 [RFC3066] code. A language code is required by RFC 3066. A country code or other subtag identifier is optional by RFC 3066.
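The required language code and optional subtags can be separated with a simple split on hyphens. The sketch below is only that: it does not validate codes against any registry, and the function name `split_language_id` is an assumption made for this example.

```python
def split_language_id(tag):
    """Split an RFC 3066 style identifier into (language_code, subtags).

    The first subtag is the required language code; any remaining subtags
    (for example a country code) are optional.
    """
    parts = tag.split("-")
    language, subtags = parts[0], parts[1:]
    if not language.isalpha():
        raise ValueError(f"missing or malformed language code in {tag!r}")
    return language.lower(), subtags
```

For instance, `split_language_id("en-US")` separates the language code `en` from the country subtag `US`.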
Lattice
A set of nodes interconnected with directed arcs such that by following an arc, you can never find yourself back at a node you have already visited (i.e. a directed acyclic graph). Lattices provide a flexible means to represent the results of speech and handwriting recognition, in terms of arcs representing words or character sequences. Different arcs from the same node represent different local hypotheses as to what the user said or wrote.
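The idea that different arcs from the same node are competing local hypotheses can be illustrated by enumerating every path through a small lattice. The lattice contents and the names `LATTICE` and `paths` are hypothetical; real recognizers also attach scores to arcs, which this sketch omits.

```python
# Hypothetical word lattice: node -> list of (word, next_node) arcs.
LATTICE = {
    0: [("flights", 1)],
    1: [("to", 2), ("two", 2)],
    2: [("Boston", 3), ("Austin", 3)],
    3: [],  # final node
}

def paths(lattice, node, final):
    """Yield every word sequence from node to final.

    Recursion terminates because the graph is acyclic.
    """
    if node == final:
        yield []
        return
    for word, next_node in lattice[node]:
        for rest in paths(lattice, next_node, final):
            yield [word] + rest
```

Here two alternatives at node 1 and two at node 2 give four complete hypotheses from node 0 to node 3.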
Lexeme
An atomic unit in a language, like a word or a stem. In this specification, "lexeme" designates a collection of graphemic and phonetic representations of words or phrases.
Lexicon
In its most general sense, a lexicon is merely a list of words or phrases, possibly containing information associated with and related to the items in the list. This document uses the term "lexicon" in only one specific way, as a "pronunciation lexicon", which means a mapping between words (or short phrases), their written representations, and their pronunciations suitable for use by an ASR engine or a TTS engine. However, the word "lexicon" can mean other things in other contexts.
Link
A set of grammars that when matched by something the user says or keys in, either transitions to a new dialog or document or throws an event in the current form item.
M
Metadata
Information describing another set of data, for instance, a library catalog card with information on the author, title and location of a book. EMMA is designed to support input processors in providing metadata for interpretations of user input.
Mixed Initiative
A form of dialog interaction model, whereby the user is permitted to share the dialog initiative with the system, e.g. by providing more answers than requested by a prompt, or by switching task when not prompted to do so (see also System Initiative.)
Multimodal
Describing applications or interactions where more than a single mode of input or output is available to the end-user. For SALT this is used in the case where a speech interface is available in addition to a visual interface.
Multimodal Integration
The process of combining inputs from different modes to create an interpretation of composite input. This is also sometimes referred to as multimodal fusion.
Multimodal Interaction
The means for a user to interact with an application using more than one mode of interaction, for instance, offering the user the choice of speaking or typing, or in some cases, allowing the user to provide a composite input involving multiple modes.
N
N-best list
An N-best list is a list of the most likely hypotheses for what the user actually said or wrote, where N stands for an integral number such as 5 for the 5 most likely hypotheses.
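An N-best list can be sketched as the top N entries of a set of hypotheses ranked by confidence. The hypothesis strings, the scores, and the function name `n_best` are invented for this example.

```python
def n_best(hypotheses, n):
    """Return the n highest-confidence (text, confidence) pairs, best first."""
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    return ranked[:n]

# Hypothetical recognizer output with made-up confidence scores.
hyps = [
    ("recognize speech", 0.82),
    ("wreck a nice beach", 0.41),
    ("recognized speech", 0.63),
]
```

Calling `n_best(hyps, 2)` keeps the two most likely hypotheses and drops the rest.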
N-Gram Model
A stochastic language model, such as W3C Stochastic Language Models (N-Gram) Specification (http://www.w3.org/TR/ngram-spec/).
Natural language understanding
The process of interpreting text in terms that are useful for an application.
NLSML
Natural Language Semantic Markup Language. An early W3C specification for representing the meaning of a natural language utterance and associated information, now superseded by EMMA.
O
Object
A platform-specific capability with an interface available via VoiceXML.
Orthography
A notation for writing and displaying words. Orthography includes character sets, whitespace, case sensitivity, diacritics within languages such as Arabic or Persian, and accents within languages such as French.
P
Parse List (Flat Parse List)
A representation of a parse as a linear sequence of applied rules. See section 6.2.
Parse
A structured representation of the (possible) application of Grammar Rules to the sequence of Tokens in an utterance.
Phoneme
One of the set of the smallest units of speech that can distinguish words: for example, the English language has over 40 phonemes (19 vowels and 24 consonants). In American English, /t/ and /p/ are phonemes that can distinguish the word tin from pin.
Phonetic alphabet
A set of symbols that represent the sounds in spoken languages, such as English, Chinese, or German, see also International Phonetic Alphabet.
Pronunciation lexicon
The term pronunciation lexicon means a mapping between words (or short phrases), their written representations, and their pronunciations suitable for use by an ASR engine or a TTS engine.
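A pronunciation lexicon can be pictured as a mapping from written forms to one or more pronunciations. The sketch below uses SAMPA-style strings purely for illustration and also shows how homophones fall out of such a mapping; the entries and the names `LEXICON` and `homophones` are assumptions made for this example, not a defined lexicon format.

```python
from collections import defaultdict

# Hypothetical lexicon: written form -> list of pronunciations (SAMPA-style, illustrative).
LEXICON = {
    "night": ["naIt"],
    "knight": ["naIt"],
    "tin": ["tIn"],
    "pin": ["pIn"],
}

def homophones(lexicon):
    """Group written forms that share at least one pronunciation."""
    by_pronunciation = defaultdict(set)
    for word, pronunciations in lexicon.items():
        for pron in pronunciations:
            by_pronunciation[pron].add(word)
    # Keep only pronunciations shared by two or more written forms.
    return sorted(sorted(words) for words in by_pronunciation.values() if len(words) > 1)
```

With these entries, night and knight are grouped together, while tin and pin remain distinct because their pronunciations differ in the first phoneme.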
R
Raw signal
An uninterpreted input, such as an audio waveform captured from a microphone.
Request
A collection of data including: a URI specifying a document server for the data, a set of name-value pairs of data to be processed (optional), and a method of submission for processing (optional).
Rule (Grammar Rule)
A Rule Definition describes the composition of a possible utterance in terms of other Rule Definitions and Tokens. See details in section 3.1 of [SRGS].
S
SALT (Speech Application Language Tags)
One recently established standard (see www.saltform.org) which extends existing Web markup languages to enable multimodal (speech plus other modalities) and telephony (speech only) access to the Web.
SAMPA
The Speech Assessment Methods Phonetic Alphabet [SAMPA]: a phonetic alphabet using only ASCII characters, rather than the extended character set used by the International Phonetic Alphabet.
SAPI
SAPI, short for Speech Application Programming Interface, is an API developed by Microsoft to allow the use of speech recognizers and speech synthesis within Windows applications. The API is supported by several third party engine vendors.
Semantic Interpretation
A process to produce a Semantic Result representing the meaning of a natural language utterance.
Semantic Interpretation for Speech Recognition [SISR]
A W3C specification defining a process to produce a semantic result representing the meaning of a natural language utterance.
Semantic processor
In EMMA, this refers to systems that can derive interpretations of user input, for instance, mapping the speech for "San Francisco" into the airport code "SFO".
Semantic Result or Semantic Value
A computer-processable representation of the information (the meaning, or "semantics") contained in a user input. In the context of this specification, the user input is a natural language utterance. A Semantic Result is used here in the relatively narrow sense of representing the information that is relevant to the application that is intended to process it, typically using ad-hoc conventions for the representation. See section 1.1.
Session
A connection between a user and an implementation platform, e.g. a telephone call to a voice response system. One session may involve the interpretation of more than one VoiceXML document.
Signal interpretation
The process of mapping a discrete or continuous signal into a symbolic representation that can be used by an application, for instance, transforming the audio waveform corresponding to someone saying "2005" into the number 2005.
SMIL
Synchronized Multimedia Integration Language. A W3C Recommendation, SMIL 2.0 (pronounced "smile") enables simple authoring of interactive audiovisual applications. See http://www.w3.org/TR/smil20.
speech recognition
The process of determining the textual transcription of a piece of speech.
Speech Recognition Grammar
A description of the candidate words and phrases for use by a Speech Recognizer. Speech Recognition Grammars for use with this specification are defined in [SRGS], a standardized format for context-free grammars.
Speech Recognition Grammar Specification [SRGS]
A W3C specification defining a language to describe grammars (words or phrases) that an ASR engine can recognize.
Speech Recognizer
A program or device that performs Automatic Speech Recognition.
Speech Synthesis Markup Language [SSML]
A W3C XML language for specifying the rendering of text by a TTS engine.
Speech Synthesis
The process of automatic generation of speech output from data input which may include plain text, marked up text or binary objects.
SRGS (Speech Recognition Grammar Specification)
A standard format for context-free speech recognition grammars being developed by the W3C Voice Browser group. Both ABNF and XML formats are defined [SRGS].
SSML (Speech Synthesis Markup Language)
A standard format for speech synthesis being developed by the W3C Voice Browser group [SSML].
Subdialog
A VoiceXML dialog (or document) invoked from the current dialog in a manner analogous to function calls.
Synthesis Processor
A Text-To-Speech system that accepts SSML documents as input and renders them as spoken output.
System Initiative
A form of dialog interaction model, whereby the system holds the initiative, and typically drives the dialog with simple questions to which only a single answer is possible. (see also Mixed Initiative.)
T
Tapered prompts
A set of prompts used to vary a message given to the human. Prompts may be tapered to be more terse with use (field prompting), or more explicit (help prompts).
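Tapering by use can be sketched as choosing a prompt according to how many times the user has already been asked. The prompt texts, the thresholds, and the function name `tapered_prompt` are hypothetical; the selection rule (highest threshold not exceeding the visit count) is one reasonable choice, not the only one.

```python
# Hypothetical prompts, keyed by the minimum visit count at which each applies.
PROMPTS = {
    1: "Please say the name of the city you are flying to.",
    2: "Which destination city?",
    3: "Destination?",
}

def tapered_prompt(prompts, visit_count):
    """Pick the prompt with the highest threshold that does not exceed visit_count."""
    eligible = [count for count in prompts if count <= visit_count]
    if not eligible:
        raise ValueError("no prompt applies to this visit count")
    return prompts[max(eligible)]
```

The first visit gets the full wording, and every visit from the third onward gets the tersest form.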
Text to speech
The process of rendering a piece of text into the corresponding speech.
Throw
An element that fires an event.
Time stamp
The time that a particular input or part of an input began or ended.
Token
A token (a.k.a. a terminal symbol) is the part of a Grammar that defines words or other entities that may be spoken (see section 2 of [SRGS]).
TTS, Text-To-Speech, Speech Synthesis
Converting text into sounds using speech synthesis techniques.
U
URI: Uniform Resource Identifier
A URI is a unifying syntax for the expression of names and addresses of objects on the network as used in the World Wide Web. A URI is defined as any legal anyURI primitive as defined in XML Schema Part 2: Datatypes Second Edition Section 3.2.17 [SCHEMA2]. In this specification URIs are provided as attributes to elements, for example in the emma:time-ref-uri attribute.
User input
An input provided by a user as opposed to something generated automatically.
V
Voice-only
Describing applications or interactions where the speech modality is the only interface available to the end-user, such as typical telephony scenarios.
VoiceXML
VoiceXML is a markup language designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. VoiceXML is part of the W3C Speech Interface Framework. See [VOICEXML20].
VoiceXML document
An XML document conforming to the VoiceXML specification.
VoiceXML interpreter
A computer program that interprets a VoiceXML document to control an implementation platform for the purpose of conducting an interaction with a user.
VoiceXML interpreter context
A computer program that uses a VoiceXML interpreter to interpret a VoiceXML Document and that may also interact with the implementation platform independently of the VoiceXML interpreter.
X
XPath
XML Path language, a W3C Recommendation for addressing parts of an XML document. See http://www.w3.org/TR/xpath.
XSLT
Extensible Stylesheet Language Transformations, a W3C Recommendation for transforming XML documents. See http://www.w3.org/TR/xslt.