Second-Generation (Mega) Corpora of English

Second-generation corpora are often either mega corpora that have 100 million words or above, or are continuously expanding monitor corpora. They can either be general corpora, which tend to be openly available, or publishers’ corpora, to which access tends to be restricted to collaborators of the relevant publisher.

Open American National Corpus (OANC)

The Open American National Corpus (OANC) is a large electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. All data and annotations are fully open and unrestricted for any use.

Available Data and Annotations

OANC : 15 million words of contemporary American English with automatically-produced annotations for a variety of linguistic phenomena.

MASC : 500,000 words of OANC data equally distributed over 19 genres of American English, with manually produced or validated annotations for several layers of linguistic phenomena.

British National Corpus (BNC)

The original 100-million-word collection of samples of written (90m words) and spoken (10m words) language from a wide range of sources, designed to represent a wide cross-section of current British English. The latest XML edition is now also freely available from the Oxford Text Archive. The corpus is now sometimes referred to as BNC 1994 to distinguish it from the 2014 follow-on version (see below).

Sound recordings of some of the spoken files are lodged with the National Sound Archive at the British Library.

The BNC can also be accessed through a convenient and powerful search interface, BNCweb (CQP-Edition), which also provides access to the audio and now offers advanced options for searching audio data. These include access to phonemic transcriptions and stress patterns. Some basic visualisations of audio extracts are provided (e.g. spectrograms), and data can be downloaded in various formats for further analysis through specialised software (e.g. Praat or Wavesurfer). For more information on how to work with audio data, see the Searching Audio Data guide (also available directly from within BNCweb).

The audio files can also be accessed through the Audio BNC site at Oxford University, though currently in a bit of a roundabout way ;-)

You can also try the BYU-BNC or Phrases in English sites for a bit more functionality and context.

Try the following links for information on (i) the tagging of the BNC (Version 1), (ii) the improved tags in BNC World Edition and (iii) the tagging of the BNC Sampler (see caveats on using the Sampler here).

There is a comprehensive User Reference Guide for the BNC XML Edition.
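As a rough illustration of what working with the XML edition can look like, the sketch below reads word tokens from a small BNC-style fragment. The fragment itself is made up for this example, and the attribute names used (c5 for the CLAWS5 tag, hw for the headword, pos for the simplified word class) are my understanding of the XML edition's markup, so check them against the User Reference Guide before relying on them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sentence in BNC-XML style; NOT taken from the corpus itself.
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def read_tokens(xml_text):
    """Return (word form, headword, C5 tag) for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter("w"):
        form = (w.text or "").strip()  # word forms carry trailing spaces
        tokens.append((form, w.get("hw"), w.get("c5")))
    return tokens


tokens = read_tokens(SAMPLE)
tag_counts = Counter(tag for _, _, tag in tokens)

for form, headword, tag in tokens:
    print(f"{form}\t{headword}\t{tag}")
```

For a full corpus file you would pass the file contents (or use ET.parse on the file path) instead of the inline sample; punctuation sits in separate c elements, which this sketch deliberately skips.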

The BNC Index is an accompaniment to the BNC that allows users to find files based on various types of category and content information. Information on the index itself is available in the BNC Index Notes.

British National Corpus 2014 (BNC 2014)

So far, only the spoken component of the BNC 2014 is available via CQPweb and download. You need to sign up to CQPweb and/or agree to the licence before you can use it.

The spoken version, collected between 2012 and 2016, comprises 11.5 million words from 1,251 conversations involving 672 speakers.

The written component is still under development, and was initially expected to be released in late 2018, but so far this still hasn’t happened...

COBUILD Project
(+ the Bank of English)

The Bank of English, launched in 1991, was originally designed as a 'monitor corpus' (continually 'refreshed', where texts are added and subtracted on a regular basis after comparing current texts against a reference set). However, it now appears to be just a 'dynamic corpus' in the sense that texts only get added to it. The corpus currently contains around 550 million words of spoken and written English.

Sadly, Collins seems to have removed all information about the project/corpus from its web pages :-(, but access to the corpus is available for registered users at the Centre for Corpus Research at Birmingham.

COBUILD = "Collins Birmingham University International Language Database".

International Corpus of English (ICE)

ICE began in 1990 with the primary aim of providing material for comparative studies of varieties of English throughout the world. Twenty centres around the world are preparing corpora of their own national or regional variety of English, following a common corpus design, as well as a common scheme for grammatical annotation.

The composition is unusual in that the corpora are in fact made up of 60% spoken materials and 40% written.

ICE-GB, the British component of ICE, was the first of the ICE corpora to be completed. It consists of a million words (83,394 parse trees, including 59,640 in the spoken part of the corpus) extracted from 200 written and 300 spoken English texts. It is fully grammatically annotated and has been fully checked. ICE-GB is distributed with the retrieval software ICECUP (International Corpus of English Corpus Utility Program; currently version 3.1; version IV in beta stage), an exploration tool designed for parsed corpora.
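The ICE-GB figures can be checked against the 60% spoken / 40% written design with a little arithmetic. The sketch below uses only the numbers quoted above; the average text length it derives is a back-of-the-envelope value that assumes the million words are spread evenly over the texts.

```python
# Figures quoted for ICE-GB above.
written_texts = 200
spoken_texts = 300
total_words = 1_000_000
total_trees = 83_394
spoken_trees = 59_640

total_texts = written_texts + spoken_texts

# Share of spoken material by number of texts (the ICE design target is 60%).
spoken_share = spoken_texts / total_texts

# Parse trees left over for the written part, and the spoken share of trees.
written_trees = total_trees - spoken_trees
spoken_tree_share = spoken_trees / total_trees

# Rough average text length, ASSUMING words are spread evenly over texts.
approx_words_per_text = total_words / total_texts

print(f"texts: {total_texts}, spoken share: {spoken_share:.0%}")
print(f"written parse trees: {written_trees}")
print(f"spoken share of parse trees: {spoken_tree_share:.1%}")
print(f"approx. words per text: {approx_words_per_text:.0f}")
```

Note that the spoken part holds a larger share of the parse trees than of the texts, which suggests that the parsed units in speech are shorter on average.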

The website includes links to various national varieties of ICE, plus downloadable sound files from several ICE teams, including Australia, India, Jamaica, and the Philippines.

Corpus of Contemporary American English (COCA)

c. 450 million words, including 20m for each year from 1990 to the present, collected by Mark Davies at Brigham Young University. Each year (and therefore overall, as well), the corpus is evenly divided between spoken, fiction, popular magazines, newspapers, and academic. In addition, the corpus will be continually updated - 20m words each year.

Same architecture and interface as the other corpora at http://corpus.byu.edu/. This includes options for searching by word, phrase, substring, lemma, and part of speech. Users can also search for collocates (including sorting by Mutual Information score) and can compare the collocates of competing words. They can also compare any features across different sections of the corpus (genres and/or years) to examine variation and change. As with the other corpora at the same site, the corpus is free for all users, but the number of possible queries is restricted, and results cannot be downloaded for copyright reasons.
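To give an idea of what sorting collocates by Mutual Information involves, here is a minimal sketch of the standard pointwise MI formula, MI = log2(observed / expected), where the expected co-occurrence frequency is f(node) × f(collocate) / N. This is the textbook version, not necessarily the exact calculation the site performs (which may, for instance, also adjust for the size of the collocation window), and all frequencies in the example are invented.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise MI (base 2) of a node word and one collocate."""
    if min(pair_freq, node_freq, collocate_freq, corpus_size) <= 0:
        raise ValueError("all frequencies must be positive")
    # Co-occurrences expected if the two words were independent.
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)


def rank_collocates(node_freq, corpus_size, candidates):
    """Sort collocates by MI, highest first.

    candidates maps each collocate to (pair_freq, collocate_freq).
    """
    scored = [
        (word, mutual_information(pair, node_freq, freq, corpus_size))
        for word, (pair, freq) in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented counts for illustration only.
ranking = rank_collocates(
    node_freq=1_000,
    corpus_size=1_000_000,
    candidates={
        "the": (600, 60_000),        # frequent pair, but very common word
        "rare_partner": (50, 100),   # less frequent pair, strongly attached
    },
)

for word, score in ranking:
    print(f"{word}\t{score:.2f}")
```

The example shows why MI is useful for sorting: the rarer but more strongly associated word outranks the high-frequency function word, even though the latter co-occurs with the node far more often in absolute terms.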

Global Web-based English (GloWbE)

A 1.9-billion word corpus collected from the websites of 20 different countries where English is used as a major language, making comparisons across these varieties possible. The name is pronounced /gləʊb/.
For usage options, see COCA above.

iWeb

A 14-billion word web corpus compiled from 22 million web pages.
For general usage options, see COCA above. In addition, iWeb also makes it possible to create virtual corpora that can be compared to each other. For more detailed information, there’s also an introductory PDF on the website.

TenTen Corpus Family

A group of 10+ billion-word corpora for more than 30 languages, such as English, Spanish, Japanese, Chinese, Greek, etc., available through Sketch Engine.


Publishers’ Corpora

Cambridge International Corpus (CIC)

[Not generally available]

At the moment, the CIC can only be used by authors and writers working on books for Cambridge University Press. It has been built up over the last ten years (from circa 1992) to help in writing books for learners of English. It currently contains 600 million words, and new data is added continuously. Together with other corpus resources, this gives the CUP lexicographers & writers access to:

  • sources: newspapers, novels, non-fiction books, recordings of spoken English, websites, magazines, TV & radio programmes, and many others.
  • 400 million words of written British English
  • 17 million words of spoken British English (including the CANCODE corpus [restricted availability], collected jointly by Cambridge University Press and the University of Nottingham)
  • 20 million words of written British academic English
  • 30 million words of written British business English
  • 175 million words of written American English
  • 22 million words of spoken American English including the Cambridge-Cornell Corpus of Spoken North American English collected jointly by Cambridge University Press and Cornell University in the US
  • 7 million words of written American academic English
  • 25 million words of written American business English
  • 15 million words of written learners' English (the Cambridge Learners' Corpus)
  • 5 million words of error-coded written learner English

[Figures correct at Oct 30th 2002]

Longman Corpus Network

Several corpora form the nucleus of the Network: the Longman/Lancaster Corpus, with over 30 million words, covers an extensive range of written texts from literature to bus timetables; the Longman Learners' Corpus is the only corpus to record and monitor the written output of students of English and enables us to pinpoint their specific needs; the Longman Written American Corpus comprises 100 million words of American newspaper and book text; the Longman Spoken American Corpus is a unique resource of 5 million words of everyday American speech (a PDF here gives more details).
