Author Archives: Kars Wijnhoven
ISO 24617-2 annotation
Overview
The DialogBank contains dialogues from the following corpora, either re-annotated according to ISO 24617-2 or directly annotated in that way:
- HCRC Map Task (English)
- Switchboard (English)
- DBOX (English)
- TRAINS (English)
- Other (English)
- DIAMOND (Dutch)
- OVIS (Dutch)
- Schiphol (Dutch)
- Dutch Map Task (Dutch)
Annotation Schemes and Representation Formats
Annotation Schemes
The dialogues in the DialogBank corpus are all annotated according to the ISO 24617-2 standard. In many cases these dialogues were originally annotated according to some other scheme, such as the SWBD-DAMSL annotation scheme, the HCRC Map Task annotation scheme, or the DIT++ annotation scheme, and in some cases the original annotations are also available in the DialogBank in order to facilitate comparative studies.
The ISO 24617-2 annotation scheme supports the annotation of spoken, written, and multimodal dialogue with information about its semantic and pragmatic units. Its main distinguishing features are:
- A taxonomy of 56 communicative functions is defined, selected from the 86 functions in the DIT++ taxonomy;
- Multidimensional annotation in 9 dimensions is supported, each dimension corresponding to a type of semantic content;
- Qualifiers are defined for expressing that a dialogue act is performed conditionally, with (un)certainty, or with a particular sentiment;
- Dependence relations are defined that link a dialogue act to one or more other units in the dialogue that it semantically depends on;
- Rhetorical relations may be annotated to indicate semantic or pragmatic relations between dialogue acts.
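As a rough illustration of how these features fit together, the following Python sketch models a dialogue act as a record with a dimension, a communicative function, optional qualifiers, and optional dependence links. The class and field names, the camel-case dimension labels, and the three example acts are assumptions made for illustration; they are not taken from the DialogBank data or prescribed by the standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Dimension(Enum):
    """The nine ISO 24617-2 dimensions (string labels are illustrative)."""
    TASK = "task"
    AUTO_FEEDBACK = "autoFeedback"
    ALLO_FEEDBACK = "alloFeedback"
    TURN_MANAGEMENT = "turnManagement"
    TIME_MANAGEMENT = "timeManagement"
    DISCOURSE_STRUCTURING = "discourseStructuring"
    OWN_COMMUNICATION_MANAGEMENT = "ownCommunicationManagement"
    PARTNER_COMMUNICATION_MANAGEMENT = "partnerCommunicationManagement"
    SOCIAL_OBLIGATIONS_MANAGEMENT = "socialObligationsManagement"


@dataclass(frozen=True)
class DialogueAct:
    """Hypothetical in-memory record for one annotated dialogue act."""
    act_id: str
    sender: str
    addressee: str
    segment: str                       # the stretch of dialogue the act is anchored to
    dimension: Dimension
    function: str                      # communicative function, e.g. "setQuestion"
    certainty: Optional[str] = None    # qualifier
    conditionality: Optional[str] = None
    sentiment: Optional[str] = None
    functional_dependence: Optional[str] = None  # id of the act this one responds to
    feedback_dependence: Optional[str] = None    # id of the unit the feedback is about
    rhetorical_links: Tuple[Tuple[str, str], ...] = ()  # (relation, target act id)


# Invented example: one speaker turn carries acts in two dimensions at once.
acts = [
    DialogueAct("da1", "p1", "p2", "uh", Dimension.TIME_MANAGEMENT, "stalling"),
    DialogueAct("da2", "p1", "p2", "what time does it leave",
                Dimension.TASK, "setQuestion"),
    DialogueAct("da3", "p2", "p1", "at five I think", Dimension.TASK, "answer",
                certainty="uncertain", functional_dependence="da2"),
]


def dimensions_used(acts, sender):
    """Return the set of dimensions in which `sender` performed an act."""
    return {a.dimension for a in acts if a.sender == sender}
```

Here `dimensions_used(acts, "p1")` contains both the Time Management and the Task dimension, which is what multidimensional annotation amounts to in practice, and `da3` records its dependence on the question it answers.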
Annotation and representation
The ISO 24617-2 standard includes the definition of the Dialogue Act Markup Language (DiAML). This language has been designed according to the ISO Linguistic Annotation Framework (ISO 24612; see also Ide & Romary 2004) and the ISO Principles for Semantic Annotation (ISO 24617-6; see also Bunt 2014). This means that the markup language has both an abstract syntax and a concrete syntax, as well as a semantics that is associated with the abstract syntax. The abstract syntax is a specification of the concepts that make up annotations and the possible ways of combining them in set-theoretical constructs, called ‘annotation structures’ – typically sets of pairs, triples, and other n-tuples of concepts. A concrete syntax defines a representation of the annotation structures defined by the abstract syntax. For an example of an abstract annotation structure, see the annotation structure for dialogue “Schiphol 261”.
The ISO 24617-2 standard includes an XML-based reference representation format (DiAML-XML), but equally allows other formats provided that they are ‘complete’, which means that every annotation structure can be represented by an expression defined by the concrete syntax, and ‘unambiguous’, i.e. every representation defined by the concrete syntax is the encoding of exactly one annotation structure. Any two representation formats with these properties (so-called ‘ideal’ representation formats) are semantically equivalent, and allow a meaning-preserving conversion between them.
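The idea that two ‘ideal’ formats can be converted into each other without loss can be sketched in a few lines of Python. In the sketch below, a much-simplified annotation structure (a list of tuples) is written out once as XML and once as tab-separated text, and each encoding is decoded back to the same tuples. The element and attribute names, the column layout, and the example acts are assumptions for illustration only; they are not the actual DiAML-XML, DiAML-MultiTab, or DiAML-TabSW definitions.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple

# Qualified name under which ElementTree stores the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


class Act(NamedTuple):
    """Simplified stand-in for one tuple of an abstract annotation structure."""
    act_id: str
    sender: str
    addressee: str
    dimension: str
    function: str


def to_xml(acts: List[Act]) -> str:
    """Encode the acts in an XML form (illustrative element/attribute names)."""
    root = ET.Element("diaml")
    for a in acts:
        ET.SubElement(root, "dialogueAct", {
            XML_ID: a.act_id,
            "sender": "#" + a.sender,
            "addressee": "#" + a.addressee,
            "dimension": a.dimension,
            "communicativeFunction": a.function,
        })
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> List[Act]:
    """Decode the XML form back into the abstract tuples."""
    root = ET.fromstring(text)
    return [
        Act(e.get(XML_ID), e.get("sender")[1:], e.get("addressee")[1:],
            e.get("dimension"), e.get("communicativeFunction"))
        for e in root.iter("dialogueAct")
    ]


def to_tab(acts: List[Act]) -> str:
    """Encode the same acts as tab-separated rows under a header line."""
    rows = ["\t".join(Act._fields)]
    rows += ["\t".join(a) for a in acts]
    return "\n".join(rows)


def from_tab(text: str) -> List[Act]:
    """Decode the tabular form, skipping the header line."""
    lines = text.split("\n")[1:]
    return [Act(*line.split("\t")) for line in lines]


# Invented example structure.
structure = [
    Act("da1", "p1", "p2", "task", "setQuestion"),
    Act("da2", "p2", "p1", "task", "answer"),
]

# Converting XML -> abstract structure -> table loses nothing.
via_xml = from_xml(to_xml(structure))
via_tab = from_tab(to_tab(via_xml))
```

Because every structure has an encoding in both formats (completeness) and every encoding decodes to exactly one structure (unambiguity), `via_tab` equals the original `structure`. The conversion always goes through the abstract structure, never directly from one surface format to the other.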
Alternative representation formats
The annotations in the DialogBank are available in three representation formats that have been defined for the DiAML abstract syntax (and semantics): the machine-friendly DiAML-XML and two user-friendly tabular formats, “DiAML-TabSW” for Switchboard dialogues and “DiAML-MultiTab” for other dialogues.
The definition of the tabular formats DiAML-MultiTab and DiAML-TabSW, in addition to the DiAML-XML reference representation format, is motivated by two considerations:
- DiAML-XML representations are hard for human readers to follow, because the extensive and repetitive use of XML element and attribute names leads to encodings that are extremely lengthy and hard to browse; tabular representations are more convenient for human readers (and can be seen as a user-friendly interface to XML representations).
- Comparison of ISO 24617-2 annotations with those of other annotation schemes is facilitated by using formats that are more like the text-based formats that have been used in some corpora. In particular, the DiAML-TabSW representation format has been designed to facilitate comparison with plain text SWBD-DAMSL annotations.
For more information about the DiAML-XML, DiAML-TabSW and DiAML-MultiTab representation formats, their advantages and disadvantages, and conversions between them, see the representation formats page. ‘Annotation Representations and the Construction of the DialogBank’ and ‘The DialogBank’ also discuss these representation formats, among many other topics.
TRAINS dialogues
The TRAINS corpus of problem solving dialogues was collected at the University of Rochester as part of the TRAINS project. The dialogues involve two participants: one who plays the role of a user and has a certain task to accomplish, and another who plays the role of the system by acting as a planning assistant. The task in these dialogues concerns the shipping of goods in a railroad freight system.
Part of the TRAINS corpus was annotated using the DAMSL annotation scheme. Two examples of transcribed, segmented and DAMSL-annotated dialogues are the following:
Dialogue d92a-2.2 transcription and DAMSL annotation
Dialogue d92a-3.1 transcription and DAMSL annotation
ISO-annotated dialogues (gold standard):