DIY Corpus Collection

This web page provides pointers to online sources of material for building your own corpora.

For information on how to clean up and turn data obtained from these resources into a proper corpus, you can consult my textbook Practical Corpus Linguistics: an Introduction to Corpus-Based Language Analysis.

Text Archives & Corpus Distribution Sites (various languages)

Alex Catalogue of Electronic Texts

A collection of digital documents in the subject areas of English literature, American literature, & Western philosophy. Offers basic concordancing & browsable, downloadable full texts.

American Memory

A gateway to rich primary source materials relating to the history & culture of the United States. The site offers more than 7 million digital items from more than 100 historical collections (some as images of documents, some in text format).

Bavarian Archive for Speech Signals (BAS)

Makes databases of spoken German accessible in a well-structured form to the speech science community as well as to speech engineering.

Electronic Text Center (University of Virginia)

Combines an on-line archive of tens of thousands of SGML & XML-encoded electronic texts & images with a library service that offers hardware & software suitable for the creation & analysis of text. SGML texts are converted to HTML when you select them in your web browser. Has texts in English (Middle & Modern), German, French, Latin, Apache, Japanese, Chinese, etc.

Oxford Text Archive (OTA)

"holdings include electronic editions of works by individual authors, standard reference works such as the Bible & mono-/bilingual dictionaries, & a range of language corpora"; "electronic texts & corpora of interest not only to literary textual scholars, but also those working in linguistics, history, law, modern & ancient languages, indeed almost any humanities discipline which relies upon a close reading of texts."

Project Gutenberg

Books published before 1923 & anything else out of copyright, e.g. Shakespeare, Poe, Dante, Sherlock Holmes stories by Sir Arthur Conan Doyle, the Tarzan & Mars books of Edgar Rice Burroughs, Alice’s Adventures in Wonderland as told by Lewis Carroll, & thousands of others.

String frequency reports for 5400+ books (400M words) from Project Gutenberg are available at Ronald Reck’s site (but read this Corpora List message for details).
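If you would rather compute frequency counts yourself from a downloaded Project Gutenberg plain-text file, the following is a minimal sketch in Python. It assumes the file carries the usual "*** START OF ..." and "*** END OF ..." marker lines around the book text (the exact wording varies between releases), and the function names and the simple tokenisation rule are illustrative choices, not part of any Project Gutenberg tool.

```python
import re
from collections import Counter

# Marker lines as they commonly appear in Project Gutenberg plain-text files
# (an assumption: the wording varies between releases).
START_RE = re.compile(r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.M)
END_RE = re.compile(r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.M)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers, if both are found."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()].strip()
    return text  # markers missing: leave the text untouched


def word_frequencies(text):
    """Count lower-cased word tokens (letters plus internal apostrophes)."""
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(tokens)


# Tiny hypothetical sample standing in for a downloaded file.
sample = """Some licence header
*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***
The cat sat. The cat didn't move.
*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***
Some licence footer
"""

body = strip_gutenberg_boilerplate(sample)
freqs = word_frequencies(body)
print(freqs.most_common(3))
```

The tokenisation here only handles unaccented Latin letters; for other languages or for hyphenated forms you would need to adjust the pattern.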

ELDA (European Language Resources Distribution Agency)

The distribution arm of ELRA (European Language Resources Association). Has a searchable catalogue covering their speech resources, written corpora & terminological resources.

ICAME (International Computer Archive of Modern & Medieval English)

Collects & distributes information on English language material available for computer processing & on linguistic research completed or in progress on the material. The ICAME CD-ROM (20 different corpora, totalling more than 17 million words) contains most of the important English language corpora used in research.

TRACTOR

TELRI Research Archive of Computational Tools & Resources (TELRI = Trans-European Language Resources Infrastructure). Offers corpora in 20 languages; parallel corpora in a variety of pairings; software for processing corpus evidence; lexicons & other language-information resources.

Linguistic Data Consortium (LDC)

Supports language-related education, research & technology development by creating & sharing linguistic resources: data, tools & standards. Has lots of specialised corpora for many languages (most of them, however, intended for NLP).

OLAC (Open Language Archives Community)

Has a search facility covering the resource catalogs of LDC, ELRA & the ACL/DFKI Natural Language Software Registry, & permits single searches to be applied to all catalogs simultaneously. The OLAC cross-archive search engine now harvests 11,000+ records from 12 OLAC archives. Try it out using the query box in the top right corner of the web page OR the more advanced search facility hosted by the Linguist List.

OLAC is an international partnership of institutions & individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, & (ii) developing a network of interoperating repositories & services for housing & accessing such resources. OLAC was founded at the Workshop on Web-Based Language Documentation & Description, held in Philadelphia in December 2000.

RELATOR (European Linguistic Resources Repository Network)

A CEC-funded initiative which addresses the vital area of linguistic resources for spoken & written language processing.

Other sources of data for building your own corpus

ABU: la Bibliothèque Universelle (French)

Free access to the full text of public-domain French-language works on the Internet since 1993. To access the texts, consult the catalogue of authors or the catalogue of texts. You can also run word searches across the whole corpus. The site also offers several dictionaries.

Bartleby.com: Great Books Online

Enormously useful site covering much of the same ground as the OTA (but, refreshingly, without the considerable bother of endless copyright restrictions & legal threats). Besides plain texts of prose fiction & non-fiction, poetry & drama, the site includes: an encyclopaedia, gazetteer, world factbook, dictionary, thesaurus, style guides, books of quotations.

Bibliomania

More than 2000 free texts (mostly classics), study guides & reference resources (more for the literary/humanities scholar, but worth a look).

Cyber Classics

More than 200 titles available

EServer

42 collections on such diverse topics as contemporary art, race, Internet studies, sexuality, drama, design, multimedia, accessible publishing & current political & social issues. Also includes hypertexts, audio & video recordings.

Essays.se

A digital resource which enables you to search and download thousands of English-language university essays and theses from Sweden.

The English Server’s Fiction collection

Works of fiction & works about fiction. Collection of texts in the public domain, classified into: Late Antique & Medieval Texts, Renaissance & Early Modern Texts, Modern Fiction, Modern Poetry, Historical Documents, Religious Texts & Other Texts.

Great Books

Searchable (with basic concordancing) & browsable texts of English classics (More for the literary/humanities scholar, but worth a look. Whole texts not downloadable in one go.)

Hansard

Parliamentary proceedings from the United Kingdom (UK), Canada, Australia & New Zealand.

(Not really 'corpora' in the sense of fixed, formatted texts, but collections of transcripts)

The sites also have minutes of meetings, bills, reports, bulletins, & other official publications.

Internet Classics Archive

441 works of classical literature by 59 different authors, including user-driven commentary & "reader’s choice" Web sites. Mainly Greco-Roman works (some Chinese & Persian), all in English translation.

Movie Script sites

Drew’s Scripts-O-Rama, The Movie Script Compendium, Script Central

Movie Subtitles

Watch out for typos and mistranslations.
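Before subtitle files can go into a corpus, the cue numbers and timing lines usually need to be stripped out. The following Python sketch assumes the files are in the common SubRip (.srt) layout (cue number, timing line, one or more text lines, blank line); the function name and the sample are illustrative only, and other subtitle formats would need a different parser.

```python
import re

# Timing lines in SubRip (.srt) files look like "00:00:01,000 --> 00:00:04,000".
TIMING_RE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")
# Simple inline markup such as <i>...</i>.
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")


def srt_to_lines(srt_text):
    """Return the spoken lines of an .srt file, without cue numbers or timings."""
    lines = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        rows = [r.strip() for r in block.splitlines() if r.strip()]
        if rows and rows[0].isdigit():
            rows = rows[1:]  # drop the cue number
        if rows and TIMING_RE.match(rows[0]):
            rows = rows[1:]  # drop the timing line
        lines.extend(TAG_RE.sub("", r) for r in rows)
    return lines


# Tiny hypothetical sample standing in for a downloaded subtitle file.
sample_srt = """1
00:00:01,000 --> 00:00:03,500
<i>Where are you going?</i>

2
00:00:04,000 --> 00:00:06,000
To the station.
I'm late.
"""

for line in srt_to_lines(sample_srt):
    print(line)
```

This only removes the formatting; typos and mistranslations in the subtitle text itself still have to be checked by hand or against a reference.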

Newspaper sites for English (Sampler)

(broadsheets & tabloids)

(You will, of course, have your own links to hundreds of other newspapers, other varieties of English & other languages.)

British Broadsheets:

The Guardian, The Independent, The Telegraph, The Times, The Evening Standard, The Observer, The Sunday Times, The Scotsman, The Herald, The Irish Times

British Tabloids:

The Mirror, The Sun, Daily/Sunday Express, News of the World, The Daily Star, The Sunday Mirror

[* More newspaper & magazine links may be found here.]

American Newspapers:

The Washington Post, USA Today, The New York Times

Newspaper sites for Other Languages

Try this searchable database of newspapers, magazines & other media (radio, TV) on the Internet (Kidon Media Link, a meta-site with listings by language & country), this site (maintained by IMS Stuttgart), or the selection below:

French: Le Monde

German: Die Zeit, Die Welt, Süddeutsche Zeitung

Russian: Nezavisimaya Gazeta

Spanish: ABC, El País, El Mundo

Renascence Editions (Oregon)

An online repository of works printed in English between 1477 & 1799; includes Shakespeare, Wordsworth, Bacon, Bunyan, Donne, Hume, Hobbes, Milton, Spenser

SketchEngine

A fee-based Corpus Query System incorporating word sketches, grammatical relations, & a distributional thesaurus. A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical & collocational behaviour. A 30-day free trial account is available. Web-based service using standard browsers: no software installation required.

Available Resources: (1) Pre-loaded corpora (60M-1.5B words) for Chinese, English, French, German, Italian, Japanese, Portuguese, Spanish, & Slovene; (2) WebBootCaT (for building your own instant corpus from web pages, then extracting keywords, specialist terminology, etc.); (3) CorpusBuilder (upload & install your own corpora).

TV Transcripts Database

Transcripts of popular US television shows, plus some movie scripts.

Transcripts of Spoken News reportage, Debates, Interviews

CNN Transcripts

On-Line Books Page

A directory of books that can be freely read on the Internet. The On-Line Books Page is now hosted by the University of Pennsylvania Library.

Reuters Corpora

In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems. This corpus, known as "Reuters Corpus, Volume 1" or RCV1, is significantly larger than the older, well-known Reuters-21578 collection heavily used in the text classification community. In the fall of 2004, NIST took over distribution of RCV1 and any future Reuters Corpora. You can now get these datasets by sending a request to NIST.

UK Parliament home page

On-line minutes of meetings (Lords & Commons); bills, reports, bulletins; Hansard & other publications.

United Nations (UN) web site

Good source for getting parallel texts (for a limited range of topics & genres) in Arabic, English, Chinese, French, Russian & Spanish.

EUR-LEX

Parallel texts concerning European law in several EU languages (Spanish, Danish, German, Greek, English, French, Italian, Dutch, Portuguese, Finnish & Swedish).


Did you find this web site/page useful? Do let me know if you want to encourage me to keep updating the site. If you have a new corpus or resource (or something I’ve missed) for me to link to, please drop me a line.