Spoken Corpora

Spoken corpora generally consist of orthographic representations/transcriptions of the spoken data only, but some corpora may also be available in multi-media formats, either audio or video.

ANDOSL

(Australian National Database of Spoken Language)

Comprises spoken language as it occurs in a variety of major speaker groups in Australia (both native-born & overseas-born migrants); data was elicited either by written material which was read aloud (the "read speech" data) or by graphical material which was discussed by two speakers thereby generating spontaneous speech (the "map task" or "spontaneous" data). Speakers were rigorously selected within phonologically defined speaker groups, each group balanced for age ranges & gender. Recorded in a high quality environment at the National Acoustic Laboratories. Manual annotation at both word & phonemic levels using highly trained transcribers is being combined with automatic methods.

BASE (British Academic Spoken English)

See separate entry on the Specialised Corpora page

CANCODE

(Cambridge & Nottingham Corpus of Discourse in English)

Part of the Cambridge English Corpus. Not generally available for research except at specific sites (annoying!).

5 m words of spontaneous speech collected between 1995 & 2000.

CANCODE has all the transcripts coded to reflect the relationship between the speakers: whether they are intimates (living together), casual acquaintances, colleagues at work, or unknown to each other.

Speech events were recorded at hundreds of locations across the British Isles, covering a wide variety of situations: casual conversation, people working together, people shopping, people finding out information, discussions, etc. [see also the Centre for Research in Applied Linguistics, University of Nottingham].

CHRISTINE

Spoken version of SUSANNE Corpus; SUSANNE-meets-spoken-English; Geoffrey Sampson’s project

Diachronic Corpus of Present-day Spoken English (DCPSE)

More than 400,000 words from ICE-GB (collected in the early 1990s) and 400,000 words from the London-Lund Corpus (late 1960s-early 1980s). Fully parsed (87,000 trees) to be consistent with ICE-GB & searchable using ICECUP. Housed at the Survey of English Usage, University College London.

EUSTACE (Edinburgh University Speech Timing Archive & Corpus of English)

Free for non-commercial use; esp. useful for phonetics researchers & speech technologists working on synthesis & recognition.

Comprises 4608 sentences spoken by 6 speakers of British English; sentences were designed to examine a number of durational effects in speech & are controlled for length & phonetic content. Subconstituents of key words in each sentence have been identified by labels in xlabel (ESPS) format & notes have been made about the prosodic realisation of the sentences.

Example sentences available for playback. Speech waveform files are available in .wav (RIFF) format & .sd (ESPS) format.
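For anyone working with the .wav (RIFF) versions, here is a minimal sketch of how the duration of a single recording could be checked using Python's standard wave module. It assumes the files are uncompressed PCM, which is not confirmed above, and the file name in the usage note is purely hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM RIFF .wav file in seconds."""
    with wave.open(path, "rb") as wav_file:
        frame_count = wav_file.getnframes()   # total number of audio frames
        sample_rate = wav_file.getframerate() # frames per second
    return frame_count / sample_rate
```

Usage would be along the lines of wav_duration_seconds("sentence_0001.wav"), where the file name is a placeholder rather than an actual EUSTACE file name.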

FRED

(Freiburg English Dialect Corpus)

A specialized corpus of British English dialects covering nine major dialect areas in Britain; 370 texts; c. 2.45 m words; 300 hours of speech, excluding interviewer utterances (recorded between 1968 & 2000; some recordings were taken from oral history interviews); 420 different informants (a majority are non-mobile old rural males who typically grew up before WWI). Recordings will be made available.

Some sample texts and audio available from here.

Hansard

Parliamentary Proceedings from:

Not really ‘corpora’ in the sense of fixed, formatted texts, but collections of transcripts.

HCRC Map Task Corpus

Compiled by the Human Communication Research Centre at Edinburgh University.

A set of 8 CD-ROMs containing linked audio & transcriptions of a total of about 18 hours (roughly 150,000 word tokens) of spontaneous (task-oriented) speech that was recorded from 128 two-person conversations according to a detailed experimental design. Alternatively, download (via ftp) a gzipped tar file of the entire corpus (the compressed tar file is 10MB; the whole corpus is 80MB, comprising 2562 XML files & a dtd directory containing 15 dtd files).
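As a quick sanity check on a downloaded copy of the gzipped tar file, the sketch below counts the XML and dtd files in an archive without unpacking it, using Python's standard tarfile module. The function name and the archive name in the usage note are my own placeholders; nothing is assumed about the internal layout of the archive beyond the file extensions mentioned above.

```python
import tarfile


def count_members_by_suffix(archive_path, suffixes=(".xml", ".dtd")):
    """Count regular files in a gzipped tar archive, grouped by file suffix."""
    counts = {suffix: 0 for suffix in suffixes}
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, etc.
            for suffix in suffixes:
                if member.name.lower().endswith(suffix):
                    counts[suffix] += 1
    return counts
```

Calling count_members_by_suffix("maptask.tar.gz") (a hypothetical file name) returns a dictionary of counts, which can then be compared with the figures quoted in the description.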

IViE corpus
(Intonational Variation in English)

See separate entry on the Specialised Corpora page

Lancaster/IBM Spoken English Corpus (SEC)

52,000 words of mostly prepared (& mostly monologic) southern British English speech (approximating to RP), collected in the period 1984-1987.

Available with orthographic & prosodic transcription, & in two versions with grammatical tagging (like those for the LOB Corpus).

For a detailed description, see the ICAME Corpus Collection’s SEC manual.

A collection of research papers based on the SEC has also been published as Working with Speech (1996), Knowles, Gerald, Anne Wichmann & Peter Alderson (eds.), London: Longman.

LeaP (Learning the Prosody of a foreign language)

The LeaP corpus of non-native speech consists of a total of 359 annotated files and includes 131 different speakers with 32 different native languages as well as 18 recordings with native speakers. The total amount of recording time is more than 12 hours. The corpus is divided into two sub-corpora since two target languages of second language learners were analysed: German and English. The German subcorpus consists of 183 annotated files, 62 different speakers (76 including the word lists) with 21 (24) different native languages. The English subcorpus consists of 176 annotated files with 50 different speakers (61 including the word lists) with 16 (17) different native languages.

Limerick Corpus of Irish English (L-CIE)

A one-million word spoken corpus of Irish English discourse; conversations recorded in a wide variety of mostly informal settings throughout Ireland (excluding Northern Ireland); currently (accessed: Feb 2008) 375 transcripts; mainly casual conversation, but also over 200K words of professional, transactional & pedagogic Irish English; not designed to be geographically representative (does not include data from every county); speakers range in age from 14 to 78; equal representation of both male & female speakers; designed to allow inter-corpus comparisons with CANCODE

London-Lund Corpus (LLC)

See description here

Longman Spoken American Corpus

5 million words, demographically sampled speech from 12 regions (30 states) across the continental US; coordinated by the University of California at Santa Barbara; everyday conversations of more than 1,000 Americans of various age groups, levels of education, & ethnicity. Not generally available. PDF with more information is here.

Machine-Readable Spoken English Corpus (MARSEC)

Some notes on (Aix-)MARSEC version 2 here (latest) or here (outdated).

MICASE (Michigan Corpus of Academic Spoken English)

See separate entry on the Specialised Corpora page

Nationwide Speech Project Corpus

A corpus of spoken language containing recordings of young male and female talkers (60 in total) from six regions of the United States. Speech samples include isolated words, sentences, passages, and interview speech. The purpose of the Nationwide Speech Project was to develop a corpus of spoken language that can be used in acoustic and perceptual studies of regional dialect variation in the United States.

Newcastle Electronic Corpus of Tyneside English (NECTE)

A corpus of dialect speech from Tyneside in North-East England. It is based on two pre-existing corpora: the Tyneside Linguistic Survey (TLS) project (late 1960s), and the Phonological Variation and Change in Contemporary Spoken English (PVC) project (1994). NECTE amalgamates the TLS and PVC materials into a single Text Encoding Initiative (TEI)-conformant XML-encoded corpus and makes them available in a variety of aligned formats: digitized audio, standard orthographic transcription, phonetic transcription, and part-of-speech tagged.

PROSICE Corpus

A collection of re-recorded ICE-GB texts with high technical specifications; syntactically analysed & temporally aligned. See here for more info.

Reading/Leeds Emotion in Speech Corpus

Prosodically & paralinguistically coded speech corpus for investigating suprasegmental & affective information in the speech signal. 4.5-hour database of machine-readable speech, of which 26 mins were transcribed using the extended ToBI system. Unfortunately, this corpus is NOT available for use by others, but you can find out more info from the people listed on the website.

Saarbruecken Corpus of Spoken English (ScoSE)

See separate entry on the Specialised Corpora page

Santa Barbara Corpus of Spoken American English (SBCSAE)

Recordings of people talking – people from all over the United States, in all walks of life, talking about & doing all sorts of things; 249,000 words; 60 discourse segments of between 15 & 30 minutes each.

Transcripts can be downloaded from the project page & audio as MP3 files or as wav files from TalkBank.

Some can be heard & read synchronously (as multimedia presentations) through any browser from the TalkBank browser page (click on "CABank", then on "SBCSAE", then on one of the transcripts, then press the "play" button for Quicktime.)

The SBCSAE now also forms part of the spoken component of the US subcorpus of the International Corpus of English (ICE).

Spoken Corpus of the Survey of English Dialects

See Survey of English Dialects on the pages of the British Library for the recordings.

Talkback Radio Corpus (Australian)

Currently around 200,000 words; is an element of the Australian English Grammar project. Talkback programs from the ABC and commercial radio stations all over Australia are being collected & transcribed to provide examples of spontaneous public speech.

Tyneside Linguistic Survey (TLS)

Not much info available, but some given on the NECTE page. The TLS corpus was compiled in the late 1960s, & consists of 86 loosely-structured 30-min interviews. The informants were drawn from a stratified random sample of Gateshead in North-East England, & were equally divided among various social class groupings of male & female speakers, with young, middle-aged, & old cohorts.

Wellington Corpus of Spoken New Zealand English (WSC)

1 m words of spoken New Zealand English collected from 1988 to 1994 (99% (545 out of 551 extracts) was collected between 1990 & 1994). Of the eight remaining files, four were collected in 1988 (4 oral history interviews) & four in 1989 (4 social dialect interviews). Consists of 2,000-word extracts (where possible) & comprises different proportions of formal, semi-formal & informal speech. Both monologue & dialogue categories are included & there is broadcast as well as private material collected in a range of settings. Access to recordings from the WSC is restricted to use at Victoria University of Wellington. A small number of the recordings which are shared with the ICE-NZ corpus will be made available on CD through ICE.

Wellington Language in the Workplace Project Corpus

Not generally available (?). Project aimed to analyse socio-pragmatic norms of interpersonal communication in a wide variety of NZ workplaces, with recordings done as unobtrusively as possible. Volunteers tape-recorded a range of their everyday work interactions over a period of time, collecting two-party & multiparty meetings, informal work-related conversations, telephone calls, & workplace small talk. Currently (2004) comprises 2000 interactions involving >500 participants, recorded in a number of government departments & commercial white-collar organizations, small businesses, & blue-collar factories. Social talk & business or task-oriented talk, ranging from short telephone calls of <1min to meetings >4 hrs long. Audio recordings are supplemented by detailed on-site ethnographic observations, written agendas & minutes, demographic & organizational info, & video recordings. Contact Janet Holmes at the Victoria University of Wellington, NZ.

British National Corpus (BNC)

Naturally, the spoken component of the British National Corpus is also a rich resource (although for phonetic/prosodic research you’ll need to get the audio tapes from the British Library... these are now generally available, but the matching of tapes & actual BNC files is problematic).

For phonemic/acoustic/articulatory databanks (mainly isolated words, phonemes, or sentences), see separate list of links here (Kiel) or the ELRA/ELDA pages or the LDC. Some people make a distinction between ‘speech corpora’ (suitable for acoustic/phonetic studies) & ‘spoken corpora’ (containing transcriptions of any type of spoken language). I use ‘spoken corpora’ here as an umbrella term for both types.

The LDC also contains various resources which are not ‘corpora’ as such, but may be of interest. Example: the LDC American English Spoken Lexicon, which is a collection of pronunciations captured in individual audio files for more than 50,000 of the most common words in English (words were extracted from newswire & telephone conversation; description & links to audio files here), or the West Point Company G3 American English Speech Data, comprising 185 sentences read out by volunteers.


If you found this web site useful, or found an outdated link, don’t forget to let me know.