Mega & National Corpora

Second-generation corpora are often either mega corpora of 100 million words or more, or continuously expanding monitor corpora. They can be general corpora, which tend to be openly available or at least accessible, or publishers’ corpora, to which access tends to be restricted to collaborators of the relevant publisher. Many such mega corpora are nowadays also national corpora. The table below lists corpora of English first, followed by those of other languages. Only corpora that have a website with further information are listed.

Open American National Corpus (OANC)

The Open American National Corpus (OANC) is a large electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. All data and annotations are fully open and unrestricted for any use.

Available Data and Annotations

OANC: 15 million words of contemporary American English with automatically produced annotations for a variety of linguistic phenomena.

MASC: 500,000 words of OANC data equally distributed over 19 genres of American English, with manually produced or validated annotations for several layers of linguistic phenomena.

Australian National Corpus (AusNC)

The Australian National Corpus is a national meta-collection that includes assorted examples of Australian English text (published and unpublished), transcriptions, audio and audio-visual materials from individual collections provided by collaborative institutions. Several retrospective corpora and content represented in the Australian National Corpus include linguistic occurrences that can be analysed for both academic research and teaching purposes.

British National Corpus (BNC)

The original 100-million-word collection of samples of written (90m words) and spoken (10m words) language from a wide range of sources, designed to represent a wide cross-section of current British English. The latest XML edition is now also freely available from the Oxford Text Archive. It is now sometimes referred to as BNC 1994 to distinguish it from the 2014 follow-on version (see below).

Sound recordings of some of the spoken files are lodged with the National Sound Archive at the British Library.

The BNC can also be accessed through a convenient and powerful search interface, BNCweb (CQP-Edition), which also provides access to the audio and now offers advanced options for searching the audio data. These include access to phonemic transcriptions and stress patterns. Some basic visualisations of audio extracts (e.g. spectrograms) are provided, and data can be downloaded in various formats for further analysis in specialised software (e.g. Praat or Wavesurfer). For more information on how to work with audio data, see the Searching Audio Data guide (also available directly from within BNCweb).

The audio files can also be accessed through the Audio BNC site at Oxford University, though currently in a bit of a roundabout way ;-)

You can also try the BYU-BNC or Phrases in English sites for a bit more functionality and context.

Try the following links for information on (i) the tagging of the BNC (Version 1), (iii) the improved tags in BNC World Edition, and (iv) the tagging of the BNC Sampler (see caveats on using the Sampler here).

There is a comprehensive User Reference Guide for the BNC XML Edition.

The BNC Index is an accompaniment to the BNC that allows users to find files based on various types of category and content information. Information on the index itself is available in the BNC Index Notes.

British National Corpus 2014 (BNC 2014)

So far, only the spoken component of the BNC 2014 is available, via CQPweb and as a download. You need to sign up to CQPweb and/or agree to the license before you can use it.

The spoken component, collected between 2012 and 2016, comprises 11.5 million words from 1,251 conversations involving 672 speakers.

The written component is still under development. It was initially expected to be released in late 2018, but so far this has not happened.

COBUILD Project (+ the Bank of English)

The Bank of English, launched in 1991, was originally designed as a 'monitor corpus' (continually 'refreshed', with texts added and removed on a regular basis after comparing current texts against a reference set). However, it now appears to be just a 'dynamic corpus', in the sense that texts only get added to it. The corpus currently contains around 550 million words of spoken and written English.

Sadly, Collins seems to have removed all information about the project/corpus from its web pages :-(, but access to the corpus is available for registered users at the Centre for Corpus Research at Birmingham.

COBUILD = "Collins Birmingham University International Language Database".

International Corpus of English (ICE)

ICE began in 1990 with the primary aim of providing material for comparative studies of varieties of English throughout the world. Twenty centres around the world are preparing corpora of their own national or regional variety of English, following a common corpus design, as well as a common scheme for grammatical annotation.

The composition is unusual in that the corpora are in fact made up of 60% spoken materials and 40% written.

ICE-GB, the British component of ICE, was the first of the ICE corpora to be completed. It consists of a million words (83,394 parse trees, including 59,640 in the spoken part of the corpus) extracted from 200 written and 300 spoken English texts. It is fully grammatically annotated and has been fully checked. ICE-GB is distributed with the retrieval software ICECUP (International Corpus of English Corpus Utility Program; currently version 3.1, with version IV in beta stage), an exploration tool designed for parsed corpora.

The website includes links to various national varieties of ICE, plus downloadable sound files from several ICE teams, including Australia, India, Jamaica, and the Philippines.

Corpus of Contemporary American English (COCA)

~600 million words, including 20 million for each year from 1990 to the present, collected by Mark Davies at Brigham Young University. Each year (and therefore the corpus overall) is evenly divided between spoken, fiction, popular magazines, newspapers, and academic texts. As a monitor corpus, it will continue to be updated with the same number of words every year.
Same architecture and interface as the other corpora at http://corpus.byu.edu/. This includes options for searching by word, phrase, substring, lemma, and part of speech. Users can also search for collocates (including sorting by Mutual Information score) and can compare the collocates of competing words. They can also compare any features across different sections of the corpus (genres and/or years) to examine variation and change. As with the other corpora at the same site, the corpus is free for all users, but there are some access restrictions on the number of possible queries, and results cannot be downloaded for copyright reasons.
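
The collocate search described above can sort results by Mutual Information (MI) score. As a rough illustration of what such a score measures, the Python sketch below uses one common formulation: the base-2 logarithm of the observed co-occurrence count divided by the count expected by chance. This is an assumption made for illustration and not necessarily the exact formula the site implements. The function names, the `span` parameter, and all counts are hypothetical.

```python
import math


def mutual_information(cooccurrences, node_freq, collocate_freq, corpus_size, span=1):
    """Return log2(observed / expected) for a node word and one collocate.

    expected = node_freq * collocate_freq * span / corpus_size
    (an assumed, commonly used formulation; all arguments are raw counts).
    """
    if min(cooccurrences, node_freq, collocate_freq, corpus_size, span) <= 0:
        raise ValueError("all counts must be positive")
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(cooccurrences / expected)


def rank_collocates(node_freq, collocates, corpus_size, span=1):
    """Sort collocates by MI score, highest first.

    `collocates` maps each word to (co-occurrence count, corpus frequency).
    """
    scored = [
        (word, mutual_information(observed, node_freq, freq, corpus_size, span))
        for word, (observed, freq) in collocates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Invented counts for the node word "strong" in a 1,000,000-word corpus.
    invented_counts = {
        "the": (500, 125_000),
        "coffee": (64, 8_000),
        "tea": (32, 1_000),
    }
    for word, score in rank_collocates(1_000, invented_counts, 1_000_000):
        print(f"{word}\t{score:.2f}")
```

With these invented counts, "the" co-occurs with the node word most often in absolute terms but ranks last, because a very frequent word is expected to turn up next to almost anything. That is why sorting by MI tends to surface more distinctive collocates than sorting by raw frequency.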

Global Web-based English (GloWbE)

A 1.9-billion word corpus collected from the websites of 20 different countries where English is used as a major language, making comparisons across these varieties possible. The name is pronounced /gləʊb/.
For usage options, see COCA above.

iWeb

A 14-billion word web corpus compiled from 22 million web pages.
For general usage options, see COCA above. In addition, iWeb also makes it possible to create virtual corpora that can be compared to each other. For more detailed information, there’s also an introductory PDF on the website.

Pakistan National Corpus of English (PNCE)

The Pakistan National Corpus of English (PNCE) is the first English corpus of Pakistan, developed at the Corpus Research Centre, Air University. The design of the 10-million-word PNCE is based on the major genres practised in the Pakistani context. The genres are Academic (research articles, theses, student essays, book reviews), Media (newspapers & magazines), Legal (judgements, affidavits, business deeds, divorce deeds, research articles), Workplace (circulars, notifications, reports, minutes of meetings) and Literary (short stories, novels, autobiographies, literary essays).

TenTen Corpus Family

A group of 10+ billion-word corpora for more than 30 languages, such as English, Spanish, Japanese, Chinese, and Greek, available through Sketch Engine.

Deutsches Referenzkorpus (DeReKo)

Monitor & reference corpus, currently containing over 46.9 billion words of written German, housed at the Institute for German Language.

Digitales Wörterbuch der Deutschen Sprache (DWDS)

Large reference corpus of German, starting from 1600. Includes predominantly written, but also spoken data.

Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK)

Monitor & reference corpus of spoken German, housed at the Institute for German Language.

Bulgarian National Corpus

The Bulgarian National Corpus consists of a monolingual (Bulgarian) part and 47 parallel corpora. The Bulgarian part includes about 1.2 billion words in over 240 000 text samples. The materials in the corpus reflect the state of the Bulgarian language (mainly in its written form) from the middle of the 20th century (1945) to the present.

CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh)

CorCenCC is an inter-disciplinary and multi-institutional project that will create a large-scale, open-source corpus of the contemporary Welsh language. It will be the first large-scale corpus of Welsh representative of language use across communication types (circa 4m spoken words, 4m written, 2m e-language), genres, language varieties (regional and social) and contexts, with contributors representative of over half a million Welsh speakers in the UK.

Czech National Corpus

The Czech National Corpus is an academic project founded in 1994 at the CU FA and administered by the Institute of the Czech National Corpus. The aim of the project is the systematic mapping of Czech and of other languages in comparison with Czech. CNC corpora are accessible, after free registration, to everybody interested in studying the language.

Eastern Armenian National Corpus

The Eastern Armenian National Corpus (EANC) is a comprehensive linguistic database of annotated texts in Standard Eastern Armenian (SEA), the language spoken in the Republic of Armenia.

Hellenic National Corpus (HNC)

The HNC of the Institute for Language and Speech Processing has been developed over many years. It now includes over 47,000,000 words and is continuously being enriched. All documents in the HNC are carefully selected so as to reflect the actual usage of Modern Greek. There are limited options for guest users and more extensive ones after registration.

Hungarian National Corpus (HNC)

The HNC currently contains 187.6 million words. It is divided into five subcorpora by regional language variants, and into five subcorpora by text genres.

Sejong 21 Corpora (‘Korean National Corpus’)

Large written & spoken corpora of Korean compiled from 1998 to 2007, including a dedicated search interface. Unfortunately, all information is only available in Korean.

National Corpus of Polish (NKJP)

A reference corpus of the Polish language containing over fifteen hundred million words. The corpus is searchable by means of advanced tools that analyse Polish inflection and Polish sentence structure. The list of sources for the corpus contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts.

Russian National Corpus

The Russian National Corpus covers primarily the period from the middle of the 18th to the early 21st centuries. This period represents the Russian language of both the past and the present in a wide range of sociolinguistic variants: literary, colloquial, vernacular, in part dialectal. The Corpus includes over 300 million words of original (non-translated) works of fiction (prose, drama and poetry) of cultural importance which are interesting from a linguistic point of view. Apart from fiction, the Corpus includes a large volume of other sources of written (and, for the later period, spoken) language: memoirs, essays, journalistic works, scientific and popular scientific literature, public speeches, letters, diaries, documents, etc.

Slovak National Corpus

The Slovak National Corpus is an electronic database containing Slovak language texts from 1955 onward and covering a broad range of language styles, genres, areas, regions, etc. The corpus is offered to the public for research, educational, and other strictly non-commercial purposes. Full, free-of-charge access to the main corpus, subcorpora and other databases is available after registration.

Thai National Corpus

Based on the design of the BNC. Description on the page only in Thai. Search interface here.

Turkish National Corpus (TNC)

The TNC is designed to be a balanced, large-scale (50 million words) and general-purpose corpus for contemporary Turkish. It generally follows the framework of the British National Corpus and is a free resource for non-commercial use.

Publishers’ Corpora

Cambridge International Corpus (CIC)

[Not generally available]

At the moment, the CIC can only be used by authors and writers working on books for Cambridge University Press. It has been built up over the last ten years (since circa 1992) to help in writing books for learners of English, currently contains 600 million words, and has new data added continuously. Together with other corpus resources, this gives the CUP lexicographers & writers access to:

  • sources: newspapers, novels, non-fiction books, recordings of spoken English, websites, magazines, TV & radio programmes, and many others.
  • 400 million words of written British English
  • 17 million words of spoken British English (including the CANCODE corpus [restricted availability], collected jointly by Cambridge University Press and the University of Nottingham)
  • 20 million words of written British academic English
  • 30 million words of written British business English
  • 175 million words of written American English
  • 22 million words of spoken American English including the Cambridge-Cornell Corpus of Spoken North American English collected jointly by Cambridge University Press and Cornell University in the US
  • 7 million words of written American academic English
  • 25 million words of written American business English
  • 15 million words of written learners' English (the Cambridge Learners' Corpus)
  • 5 million words of error-coded written learner English

[Figures correct on Oct 30th 2002]

Longman Corpus Network

Several corpora form the nucleus of the Network: the Longman/Lancaster Corpus, with over 30 million words, covers an extensive range of written texts from literature to bus timetables; the Longman Learners' Corpus is the only corpus to record and monitor the written output of students of English and enables us to pinpoint their specific needs; the Longman Written American Corpus comprises 100 million words of American newspaper and book text; the Longman Spoken American Corpus is a unique resource of 5 million words of everyday American speech (a PDF here gives more details).
