On-line Corpora of English

Many language teachers & learners just want to know one simple thing: where are the free, web-accessible corpora that we can search directly, without any fuss? There aren’t many, but here are the major ones. I have left out literary works, newspaper collections & blogs because you can easily find these yourselves & there are millions of them out there.

British National Corpus (BNC)

[100m words; 1990s British English, spoken & written]: There are many different web sites giving free (but limited) access to the corpus. Access is limited due to copyright: i.e., you cannot expand the concordance context to read more of the surrounding text, & you cannot read the entire source texts (only snippets).

  • BNCweb : User-friendly, free interface.
  • JustTheWord: The most accessible site for non-English-speaking background students (& most pedagogically useful) because it straightaway gives you a list of collocations for your search word/phrase, instead of concordances; results are categorized by POS-based patterns & by approximate sense clusters, & graph bars give an indication of how common each combination is. Results are based on an 80K-word subset of the BNC.
  • BYU-BNC (formerly called "VIEW"): allows word-, phrase- or part-of-speech-based searches of the BNC with genre restrictions; allows wildcards & "fuzzy matches"; can list collocations. Requires registration (free) after about 20 searches.
  • Phrases in English (PIE; make sure you’re on the "N-grams" page): allows word/phrase searches of the BNC, returning a maximum of 50 random hits. Enter words or phrases of up to 8 words, one word per box.
  • BNC online: There is almost no good reason to use this site because it has so many limitations: a maximum of 50 random hits; limited left/right contexts; sentence view only (the search term is not highlighted & not in the center of the screen); no searching by POS alone; no restricting to specific genres; etc. [Hint for EFL users: Be aware of which genres your concordance examples come from (e.g., teen magazines & informal speech may not always provide the best models of language for writing academic essays).]

Various online corpora at Corpus.byu.edu (Mark Davies’ site)

Corpus of Contemporary American English (COCA): [450 m words; 20 m words of American Eng. each year from 1990-2012.] For each year (& therefore overall, as well), the corpus is evenly divided between the five genres of spoken, fiction, popular magazines, newspapers, & academic journals. Searchable on-line only; the texts themselves are not available for download.

Corpus of Historical American English (COHA): [400 m words; American Eng; each year from 1810-2009.]

Corpus of American Soap Operas : [100 m words; American Eng; 2001-2012.]

TIME Magazine Corpus: [100 m words; American English, 1923-2006; more than 275,000 articles from TIME Magazine. Wide range of topics: news, sports, business, culture, health, entertainment, etc.] Nice search interface (essentially the same as that of the BYU-BNC and COCA).

Open version of the Sketch Engine: No registration required. Currently 15 corpora available, mostly in English, such as BAWE, BASE, the ACL Anthology, or EcoLexicon.

MICASE

[1.7m words of current, spoken academic American English, as produced by faculty (lecturers), students & staff in formal & informal settings around the university]: fully searchable & browseable via a custom web interface (no limits), & now has selected playable sound files to accompany some transcripts. Homepage is here.

Word Neighbors

[by John Milton. Corpora = a mix of spoken & written English genres (user-selectable); some texts are from the BNC]: Quite similar to JustTheWord in terms of giving lists of collocational patterns first (which are then linked to actual corpus examples), but the text database is bigger (not limited to BNC texts) and you can restrict by medium (spoken/written) and by specific genres. It’s a fairly comprehensive learning environment: the collocational/colligational patterns and corpus samples are integrated with on-line dictionaries, thesauri, encyclopaedias, Chinese translations, "Answers.com", JustTheWord, and even audio/video examples containing the phrase/pattern.

Business Letters Corpus

[U.S. & U.K. letters, 1m words as of 1 March 2000; alternative site here]

LOB & Brown [1m words each; 1960s written British English (LOB) & American English (Brown)]

The Brown Corpus of American English is available through the Lextutor site. The Brown & LOB used to be searchable via the Virtual Language Centre (NOT working?), or the alternative edict site (NOT working?) (used to be limited to 2001 hits). The VLC/edict sites also have other collections of text; see here for a description & breakdown of these more specialized corpora.

Hong Kong Financial Services Corpus (HKFSC)

[7.6 m words; spoken & written texts collected with the help of professional associations & private organisations from across the financial services sector in Hong Kong: e.g., insurance/investment product descriptions; agreements; media releases; ordinances; procedures; prospectuses; rules; standards; speeches]

CorpusEye

Search various corpora (for many languages). The English corpora include texts culled from Wikipedia and the Enron e-mails.

OPUS

[Computer manuals, European parliament speeches, Subtitles corpus, etc.] An open-source collection of freely searchable/downloadable monolingual and parallel (translation) corpora or collections.

VOA’s Special English Program Scripts (by Charles Kelly)

[c.14K words; sentence-view concordances of scripts from Voice of America’s "Special English" broadcasts, which use a limited vocab of 1,500 words (not necessarily the "easiest" English words, but most are simple)] The scripts represent a kind of "written-to-be-spoken" English; useful for less-proficient English learners.

CorTec

A bilingual (English & Portuguese) comparable corpus of technical language (linked to the COMET project) in 5 areas: Cooking, Contracts, Computing, Environment & Hypertension. The texts themselves cannot be downloaded, but can be searched via the web tools provided: concordancer, wordlist & N-gram extractor.

SACODEYL

Includes a small corpus of English-language teenage talk. Contains structured video interviews with students 13-18 yrs old (seven European languages in total). Annotated and enriched for language learning purposes. Free multimedia access (videos).

TCSE (TED Corpus Search Engine)

TCSE is a search engine specializing in exploring transcripts of TED Talks. It has been created for educational and scientific purposes.

BACKBONE (European languages, incl. English)

BACKBONE is a European project offering web-based pedagogic corpora of video-recorded spoken interviews with native speakers of English, French, German, Polish, Spanish and Turkish, as well as non-native speakers of English as a Lingua Franca (ELF).


There are many other corpora which are free, but not on-line, including most of the ICE corpora (just sign a licence & download the files). If you’re interested in non-native English, the PICLE Corpus (argumentative essays & literature exam scripts by Polish learners of English) is searchable on-line.

See also the section on Using the Web as a corpus (many of these web concordancing search engines allow you to restrict searches to particular countries, institutions, URLs/web sites, etc., thus reducing the amount of junk/unwanted hits), & look through the above section on D-I-Y Corpora for newspapers, out-of-copyright literary texts, & Bibles in various languages. Most of these are not 'corpora' in the strict sense of being structured & formatted according to contemporary corpus standards, but are starting points if you want to have your own free texts to run concordances on.
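If you do collect your own free texts, a simple concordance needs very little code. The following is only a minimal sketch of a keyword-in-context (KWIC) search in Python; it is not how any of the sites listed here work, and the function name `kwic` and the `width` parameter are made up for illustration. It assumes plain, untagged text held in a single string.

```python
import re

def kwic(text, keyword, width=30):
    """Return (left context, match, right context) for each whole-word hit.

    Hypothetical helper for illustration: `width` is the number of
    characters of context kept on each side of the search term.
    """
    # Collapse line breaks & repeated spaces so every hit fits on one line.
    text = " ".join(text.split())
    # \b restricts matches to whole words; the search ignores case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()].strip()
        right = text[m.end():m.end() + width].strip()
        hits.append((left, m.group(0), right))
    return hits

def show(hits, width=30):
    """Print hits with the search term lined up in a central column."""
    for left, word, right in hits:
        print(f"{left:>{width}}  {word}  {right}")

if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog saw the cat."
    show(kwic(sample, "cat", width=10), width=10)
```

Because the search term is lined up in one column, patterns to its left & right are easy to scan by eye. To use the sketch on a downloaded text, read the file into a string first & pass that string to `kwic`.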

XX WordbanksOnline (from the Bank of English) [NO LONGER FREE?]: search a 56-million-word subset of the Bank of English (sub-dividable into 3 broad categories); also usefully allows you to specify a following word by part of speech, & gives collocations (limited to 40 hits, & the total number of hits is not reported :-( )

** Have I left something out? If you know of other web-searchable corpora, do let me know.


Did you find this web site/page useful? Do let me know if you want to encourage me to keep updating the site. If you have a new corpus or resource (or something I’ve missed) for me to link to, please drop me a line.