Specialised Corpora

This page lists specialised corpora of English (specific dialects, genres, registers), many of which may be suitable for ESP teaching, learning & research.

ACL/DCI CD-ROM disk (no longer accessible)

About 63 m words of plain orthographic English collected by the Association for Computational Linguistics' Data Collection Initiative; consists of: the Collins English Dictionary; selections from the Wall Street Journal (40m words); a database of scientific abstracts from the U.S. Department of Energy (23m words); the 'Penn Treebank' of skeleton-parsed data compiled by Mitch Marcus & his team at the University of Pennsylvania (Marcus & Santorini, 1992).

Air Traffic Control (ATC) Corpus

70 hours of recorded conversation between controllers & aircraft at three major airports in the United States; three subcorpora, one per airport, each consisting of 20-25 hours of data & representing continuous recording without silence elimination. The speech files are fully transcribed, with time marking indicating beginning & end of transmission.

BASE (British Academic Spoken English)

The British analogue to MICASE. A corpus of university lectures & seminars developed at the Universities of Warwick & Reading, under the directorship of Hilary Nesi, with Paul Thompson. Recordings & transcriptions of 160 lectures & 39 seminars in a range of departments, at both undergraduate & postgraduate level (1,644,942 tokens in total). Transcriptions, video & audio recordings have been archived by the Arts & Humanities Data Service.

BAWE (British Academic Written English)

A corpus of good-quality student assignments across disciplines, from first-year undergraduate to master's level, developed at the Universities of Warwick, Reading, Oxford Brookes & Coventry, under the directorship of Hilary Nesi, with Paul Thompson, Sheena Gardner & Paul Wickens. 2,761 assignments from 627 student contributors in 33 university departments, totalling 2,896 independent texts (6,514,776 words). Corpus development was funded by the Economic & Social Research Council (2004-2007). The corpus is available to researchers from the Arts & Humanities Data Service.

Blog Authorship Corpus

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. It incorporates a total of 681,288 posts and over 140 million words – or approximately 35 posts and 7250 words per person.
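
The per-person averages quoted above can be checked with a few lines of Python. This is only a sanity check on the published figures; the word total is taken as exactly 140 million, which is an assumption, since the description says "over 140 million".

```python
# Figures quoted in the corpus description
BLOGGERS = 19_320
POSTS = 681_288
WORDS = 140_000_000  # assumption: lower bound of "over 140 million"

posts_per_blogger = POSTS / BLOGGERS
words_per_blogger = WORDS / BLOGGERS

# Print the rounded averages for comparison with the description
print(round(posts_per_blogger), round(words_per_blogger))
```

The results agree with the "approximately 35 posts and 7250 words per person" given in the description, allowing for rounding.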

Business Letters Corpus

Someya’s corpus of Business Letters (1,020,060 word tokens of U.S. & U.K. samples, as of 1 March 2000). More info & a web concordancer on this page: Online BLC Concordancer. Also searchable through the interface: a non-native English corpus of business letters written by Japanese business people, as well as other data.

Carnegie Mellon Communicator Corpus

A large corpus of speech produced by callers to a Travel Planning system; around 180,605 utterances (90.9 hours) in 2002.

CHAINS corpus (CHaracterizing INdividual Speakers)

A novel speech corpus which may be of interest to those looking at diverse speaking styles, & those seeking to characterize speaker identity; features approximately 36 speakers recorded under a variety of speaking conditions, allowing comparison of the same speaker across different well-defined speech styles. Speakers read a variety of texts alone, in synchrony with a dialect-matched co-speaker, in imitation of a dialect-matched co-speaker, in a whisper, & at a fast rate. There is also an unscripted spontaneous retelling of a read fable. The bulk of the speakers were speakers of Eastern Hiberno-English. Free for research purposes.

COLT

(Bergen “Corpus Of London Teenage Language”)

Spoken language of 13 to 17-year-old teenagers from different boroughs of London; half a million words, orthographically transcribed & word-class tagged; a constituent of the British National Corpus. A pilot version consisting of 151 texts is now available on the Internet. For registered users, the search program can also show the distribution of an item in relation to factors such as age, sex, socioeconomic class, location etc.

PERC Corpus (formerly Corpus of Professional English)

A 17-million-word corpus of copyright-cleared English academic journal texts in science, engineering, technology and other fields. It was compiled as a part of the project of the Professional English Research Consortium (PERC) and is intended to be used for research in the field of Professional English. Accessible, but not free.

Corpus of Spoken Professional American-English (CSPA)

2-m-word part-of-speech tagged corpus consisting of transcripts of American English spoken in professional settings (committee meetings, faculty meetings & White House press conferences); recorded 1994-1998; consists primarily of short interchanges by approximately 400 speakers that are centered on professional activities broadly tied to academics & politics, including academic politics; seventeen files (12 MB). Commercial product by Athelstan.

Corpus of Written British Creole (CWBC)

Mark Sebba’s project. The user guide can be downloaded here.

Enron Email Dataset/Corpus

Collected & prepared by the CALO Project (A Cognitive Assistant that Learns & Organizes). Contains e-mails from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, & posted to the web, by the Federal Energy Regulatory Commission during its investigation. Does not include attachments, & some messages have been deleted “as part of a redaction effort due to requests from affected employees”, & some email addresses were anonymized. Probably the only substantial collection of “real” email that is public (because of privacy concerns). In using this dataset, please be sensitive to the privacy of the people involved (& remember that many of these people were certainly not involved in any of the actions which precipitated the investigation).

IViE corpus
(Intonational Variation in English)

The IViE corpus contains recordings of nine urban dialects of English spoken in the British Isles. Recordings of male and female speakers were made in London, Cambridge, Cardiff, Liverpool, Bradford, Leeds, Newcastle, Belfast in Northern Ireland and Dublin in the Republic of Ireland.

Hyland’s Research Articles Corpus

Ken Hyland’s personal corpus of published research articles, representing written academic English. Not available to the general public, but contact owner directly for more info. Consists of 30 texts each from 8 disciplinary areas (biology, engineering, mechanical engineering, linguistics, marketing, philosophy, sociology, physics), totalling 1.3m words.

Leipzig Corpora Collection (various languages)

Corpora in different languages using the same format & comparable sources (identical in format & similar in size & content). Randomly selected sentences. Available in sizes of 100,000 sentences, 300,000 sentences, 1 million sentences etc. The sources are either newspaper texts or texts randomly collected from the Web. All data (publicly accessible, copyrighted sources) have been processed automatically so that it is not possible to reconstruct the original source texts. Significant L1, R1 & "within sentence" collocates are computed for each word. Available as plain text files, or as MySQL database tables (ready to use with a supplied Corpus Browser).
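
To illustrate what L1, R1 & "within sentence" collocates mean, here is a minimal Python sketch that counts left neighbours, right neighbours & sentence-level co-occurrences on two toy sentences. It uses raw counts only and does not apply any significance measure, so it is not the collection's own method; the function name `neighbour_counts` and the sample sentences are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def neighbour_counts(sentences):
    """Count left (L1), right (R1) and within-sentence co-occurrences."""
    left, right, within = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenisation
        for prev, word in zip(tokens, tokens[1:]):
            left[(word, prev)] += 1   # prev is the L1 neighbour of word
            right[(prev, word)] += 1  # word is the R1 neighbour of prev
        # Each unordered pair of distinct word types counts once per sentence
        for a, b in combinations(sorted(set(tokens)), 2):
            within[(a, b)] += 1
    return left, right, within


# Hypothetical sample data, not taken from the collection
sentences = [
    "the corpus contains newspaper texts",
    "the corpus contains web texts",
]
left, right, within = neighbour_counts(sentences)
print(right[("corpus", "contains")])  # "contains" directly follows "corpus"
print(left[("texts", "web")])         # "web" directly precedes "texts"
print(within[("corpus", "texts")])    # both words occur in the same sentence
```

A real collocation table would rank these raw counts with a statistical association measure before reporting them as significant.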

Louvain Corpus of Modern English Drama

Compiled at the Institute of Applied Linguistics of the Catholic University of Leuven (L. K. Engels & Dirk Geens). 62 British English plays first published during 1966-1972. 1 million words. Available through the Oxford Text Archive.

LOCNESS (Louvain Corpus of Native English Essays)

A corpus of native English essays made up of: British pupils’ A level essays (60,209 words), British university students’ essays (95,695 words), American university students’ essays (168,400 words). Total: 324,304 words.

METER Corpus (MEasuring TExt Reuse)

Collected from British PA (Press Association) archive & 9 British national newspapers; 528,563 words from the two journalistic domains of 'Law & Courts' & 'Show Business'; project aim was to develop techniques for detecting & measuring text reuse (mapping derived texts to their source texts, indicating the probability of derivation). One CD-ROM (free), but link on the web pages not working.

MICASE (Michigan Corpus of Academic Spoken English)
(Compare with the BASE corpus above)

A free & web-accessible spoken (mainly) American English corpus of c. 1.7 m words (190 hours of recordings) focusing on contemporary university speech within the microcosm of the Univ of Michigan. Has a free-to-use accompanying web concordancer/search engine that can search by speaker or speech event attributes.

Speakers include faculty, staff, & all levels of students (mostly native, some non-native speakers) across several speech events (incl. monologic & interactive speech) from all of the major academic divisions (with the exception of the professional schools, i.e., medical, dental, business, & law).

15 different types of speech event: small/large lecture, public interdisciplinary or departmental colloquia, discussion sections, student presentations, seminars, undergraduate lab sessions, lab group & other meetings, one-on-one tutorials, office hours, advising consultations, dissertation defenses, study groups, interviews, campus/museum tours, & service encounters.

The original links off the homepage to transcription info, etc., are broken, but can still be accessed from the Web Archive pages. Access to the data & sound files is available via https://ca.talkbank.org/access/MICASE.html.

MICUSP (Michigan Corpus of Upper-level Student Papers) search interface

1.6 m words; assessed genres of writing by senior undergraduate (4th year) & graduate students in the US (native & non-native speakers of English); length of the texts ranges from 500 to 10,000 words; developed at the University of Michigan’s English Language Institute.

MuchMore Springer Bilingual Corpus

A parallel corpus of English-German scientific medical abstracts; c. 1 million tokens for each language. Abstracts are from 41 medical journals, each of which constitutes a relatively homogeneous medical sub-domain (e.g. Neurology, Radiology, etc.). The corpus of downloaded HTML documents was normalized in various ways in order to produce a clean, plain text version consisting of a title, abstract and keywords. Additionally, the corpus was aligned on the sentence level.

NIE Corpus of Spoken Singapore English (NIECSSE)

Aims to provide high-quality recordings of Singaporean speakers in order to facilitate acoustic/phonetic analysis of Singapore English. To eliminate background noise & thereby facilitate acoustic/phonetic measurement, all recordings were made directly onto the computer in the NIE Phonetics Laboratory. Consists of interviews & a read text.

Nijmegen Corpus & TOSCA Corpus (Tools for Syntactic Corpus Analysis)

Nijmegen Corpus: 132,000-word syntactically analysed corpus of written (120,000 words) & spoken (12,000 words of sports commentaries) modern British English; 20,000-word samples of fiction & non-fiction from 1962-68. TOSCA Corpus: 1.5 m words (75 samples x 20,000 words each) syntactically analysed; texts from 1976-86. Original link no longer accessible!

Oxford Psycholinguistic Database

Comprises 98,538 English words & information on the spelling, syntactic category & number of letters for each of these as well as information on the phonetics, syllabic count, stress patterns & various criteria affecting comprehension.

Reading Academic Text corpus (RAT)

The Reading Academic Text corpus is a collection of academic texts, written by academic staff or students at the University of Reading, and now stored in machine readable form, that has been developed in the Department of Applied Linguistics. The aim of the project, which started in 1995, is to develop a corpus of academic text for linguistic analysis by research students and staff at the University, which will contribute to the understanding of text construction practices in academic settings. The insights derived from such analyses should then feed into the development of teaching materials for English for Academic Purposes courses, and into teacher training courses. Use of the corpus is restricted to staff and researchers at the University of Reading.

Reuters Corpora

See separate entry under D-I-Y corpora.

RST Discourse Treebank

[For Discourse Analysis, message understanding, etc.]

The Rhetorical Structure Theory (RST) Discourse Treebank was developed by researchers at the Information Sciences Institute (University of Southern California), the US Department of Defense and the Linguistic Data Consortium (LDC). It consists of 385 Wall Street Journal articles from the Penn Treebank annotated with discourse structure in the RST framework along with human-generated extracts and abstracts associated with the source documents.

Saarbruecken Corpus of Spoken English (SCoSE)

Freely downloadable corpus, but strangely in the form of PDF files, rather than plain text or XML. Eight parts: Part 1: Complete Conversations; Part 2: Indianapolis Interviews; Part 3: Jokes; Part 4: Drawing Experiment; Part 5: Kassel Classroom Discourse; Part 6: Stories; Part 7: London Teenage Talk; Part 8: Musicians' Talk. Transcripts & audio can be downloaded from the TalkBank site, and some can be heard & read synchronously (as multimedia presentations) through any browser from the TalkBank browser page (click on “CABank”, then on “SCoSE”, then on one of the transcripts, then press the “play” button for Quicktime).

Scottish Corpus of Texts & Speech (SCOTS)

Contains documents in Scottish Standard English, documents in several varieties of Scots, & everything in between. While Scottish Standard English has a standard written form, Scots does not. This means that the corpus contains a wide range of spelling variation (steps are being taken to offer a means of searching for all of the variant spellings automatically at a later stage of the project). SCOTS can be searched or browsed through an online interface.

SLX Corpus of Classic Sociolinguistic Interviews

8 sociolinguistic interviews, 9 speakers. William Labov & one of his students conducted the interviews in the 1960s & 70s. These interviews represent solutions to the problems of achieving cross-cultural contact, reducing the effect of the Observer’s Paradox & approximating the vernacular of everyday life. Complete interview recordings plus time-aligned verbatim transcripts for each speaker. Also included: (i) a sociolinguistic variable survey that represents an overview of the intra- & inter-speaker variation attested in the corpus, highlighting a broad range of phonological, phonetic, grammatical, lexical & stylistic variables. (ii) a number of annotation tools that allow users to listen to each interview while browsing the corresponding transcripts, & to display & hear each token identified in the variable survey. The recordings demonstrate successful interviewing techniques, the sound quality is high, & the digitization, segmentation & transcription of the data represent best practice in these areas. The variable survey highlights over 150 sociolinguistic variables attested in the corpus & suggests avenues for further research. Most importantly, the SLX Corpus provides both an example of a digital speech corpus developed specifically to support sociolinguistic research, & a stable benchmark for training in sociolinguistic data collection, digitization, segmentation, transcription, analysis & publication. 17 speech files (22050Hz, 16 bit, single-channel in the MS WAV (RIFF) format), total of 575 minutes (~ 1.5GB); Web download.
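
The quoted download size can be checked against the stated audio format. The sketch below assumes uncompressed PCM audio and ignores the small WAV (RIFF) headers; neither assumption is stated in the corpus description, and the function name `pcm_bytes` is invented for illustration.

```python
# Audio parameters quoted in the corpus description
SAMPLE_RATE_HZ = 22_050
BYTES_PER_SAMPLE = 2  # 16 bit
CHANNELS = 1          # single-channel
TOTAL_MINUTES = 575


def pcm_bytes(minutes, rate=SAMPLE_RATE_HZ,
              sample_bytes=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Size in bytes of uncompressed PCM audio (assumption: no compression)."""
    return minutes * 60 * rate * sample_bytes * channels


size = pcm_bytes(TOTAL_MINUTES)
print(f"{size / 1e9:.2f} GB")  # compare with the ~1.5GB quoted above
```

Under these assumptions the 575 minutes come to roughly 1.52 GB, consistent with the figure given for the web download.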

Speech, Thought & Writing Presentation Corpus (STWP)

A corpus of around 250,000 words annotated for categories of speech, thought & writing presentation; genres included: fiction, newspaper reports, biographies/autobiographies. Available through the Oxford Text Archive.

TimeBank 1.2

TimeBank 1.2 contains 183 news articles that have been annotated with temporal information, adding events, times and temporal links between events and times. The annotation follows the TimeML 1.2.1 specification.

Translational English Corpus (TEC)

Contemporary translational English: written texts translated into English from a variety of source languages, European & non-European. Supports a broad range of studies in two main areas: the way in which the patterning of translated text might be different from that of non-translated text in the same language, & stylistic variation across individual translators. Set up by Mona Baker.

TIMIT Acoustic-Phonetic Continuous Speech Corpus

“Read speech” designed to provide speech data for the acquisition of acoustic-phonetic knowledge & for the development & evaluation of automatic speech recognition systems; contains broadband recordings of 630 speakers of 8 major dialects of American English, each reading 10 phonetically rich sentences.

T2K-SWAL Corpus (The TOEFL 2000 Spoken & Written Academic Language Corpus)

[Owned by the Educational Testing Service (ETS), USA. NOT publicly available.] 2.8 m words; 490 texts; 8 spoken & written registers (e.g., classroom teaching, study groups, textbooks) taken from 6 academic disciplines at four US universities; designed to represent the range of spoken & written registers that students will regularly encounter in university life. Part-of-speech-tagged. The following article gives more detailed info: Biber, D., S. Conrad, R. Reppen, P. Byrd, & M. Helt. 2002. Speaking & writing in the university: A multi-dimensional comparison. TESOL Quarterly, 36(1):9-48.

A reduced redundancy USENET corpus
(2005-2011)

A free, downloadable full-text corpus of over 7 billion words of internet discussion board messages (raw text, unannotated), delivered as a set of weekly files, enabling diachronic analysis.

Wolverhampton Business English Corpus

[description & purchasing information from ELDA]

10,186,259 words in the general domain of business, collected from 23 different web sites around the world (from six months within the period 1999-2000), covering a wide variety of categories including product descriptions, company press releases, annual financial reports, business journalism, academic research papers, political speeches & government reports. POS-tagged.

