Dialogue Corpora

As the name suggests, dialogue corpora usually contain spoken interactions between two speakers, although sometimes more than two interlocutors are involved. These corpora are especially interesting for research into pragmatics & discourse.

Coconut Corpus

A collection of human-human computer-mediated dialogues in which two subjects collaborate on a simple task: buying furniture for the living & dining rooms of a house.

Dialogue Diversity ‘Corpus’ (DDC)

Not, technically speaking, a ‘corpus’ as such, but a collection of links to different dialogue texts (transcriptions and/or sound files), covering a very diverse collection of interactive situations – a data resource for studies of the breadth of coverage of particular dialogue models, and for studies that compare dialogue from different situations. Taken as a whole, this ‘corpus’ is irregular & not homogeneous in any way. It is generally unsuitable for drawing any conclusions about dialogue taken as a single category.

SPAADIA (Speech Act Annotated Dialogues) Corpus

A small corpus of 35 timetable information and booking interactions between a female call-centre agent and her callers, compiled by Geoff Leech & myself as part of the original, larger, SPAAC project.

Fully annotated with speech-act information, as well as additional information regarding syntactic (C-unit) categories, topic information, surface polarity, and semantico-pragmatic markers (modes). For more information on the specific categories and the overall design of the annotation scheme, please see the SPAADIA Annotation Scheme document.

The SPAADIA corpus exists in two versions: the original release (version 1) and version 2, which uses the updated DART speech-act taxonomy and adds ‘punctuation’ (<punc type="..." />) tags.
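Because version 2 marks punctuation with empty <punc type="..." /> elements, tallying them by type should be straightforward with a standard XML parser. The following is only a minimal sketch in Python: the sample fragment, its <dialogue> and <turn> wrapper elements, and the type values shown are invented for illustration and are not taken from the corpus files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: only the <punc type="..." /> element comes from the
# corpus description; the wrapper elements and attribute values are made up.
SAMPLE = """
<dialogue>
  <turn speaker="A">hello <punc type="stop" /> can I help you <punc type="query" /></turn>
  <turn speaker="B">yes please <punc type="stop" /></turn>
</dialogue>
"""

def count_punc_types(xml_text):
    """Return a Counter mapping each punc 'type' value to its frequency."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so nesting depth does not matter.
    return Counter(p.get("type", "") for p in root.iter("punc"))

punc_counts = count_punc_types(SAMPLE)
```

For real files, the same function could be applied to the file contents once the actual element names have been checked against the SPAADIA Annotation Scheme document.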

SRI American Express travel agent dialogue corpus

A corpus of actual travel agent interactions with client callers, consisting of 21 tapes containing between 2 and 9 calls each.

Switchboard Corpus (SWB)

A corpus of over 240 hours of recorded spontaneous (but topic-prompted) telephone conversations (2,438 conversations, averaging 6 minutes in length) recorded in the early 1990s.

Approx. 3 million words (3,044,734) of text, spoken by 543 unique speakers (302 males & 241 females) from most major dialect groups of American English. Includes information on the speakers’ age, sex, education & dialect region. On average, each speaker participates in about 9 calls (the range is 1 to 32).
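The figures quoted for Switchboard can be cross-checked with a little arithmetic. The sketch below is just a sanity check in Python, and it assumes that every conversation has exactly two participants (the variable speakers_per_call is my own label, not a corpus field).

```python
# Figures quoted in the Switchboard description.
conversations = 2438
avg_minutes = 6
speakers = 543
males, females = 302, 241

# Assumption: each telephone conversation involves exactly two speakers.
speakers_per_call = 2

# Total recording time implied by the conversation count and average length.
total_hours = conversations * avg_minutes / 60

# Average number of calls per speaker implied by the counts.
calls_per_speaker = conversations * speakers_per_call / speakers

def summary():
    """Return (hours, calls per speaker, whether the sex counts add up)."""
    return (round(total_hours, 1),
            round(calls_per_speaker, 1),
            males + females == speakers)
```

Under that assumption, the numbers come out at 243.8 hours and roughly 9 calls per speaker, which agrees with the "over 240 hours" and "about 9 calls" given in the description.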

The corpus is available in different versions:

TRAINS Spoken Dialogue Corpus

Six & a half hours’ worth of human-human transactional dialogues; includes 55,000 words & about 5,500 speaker turns. Audio files for the dialogues are available on a CD-ROM distributed through the LDC.

