Historical & Diachronic Corpora


Historical Corpora or Collections (English)

ARCHER Corpus
(A Representative Corpus of Historical English Registers)

1.8 m words (so far – May 2009) of British & American English from written & "speech-based" genres sampled from 7 historical periods covering Early Modern English to the present (range: 1650-1990); 1,037 texts; 10 registers (e.g., drama, letters, science prose) representing speech-based, popular, & specialist/academic written registers. Complements the Helsinki corpus. Collaborative work involving the Universities of Uppsala, Helsinki, Freiburg, Heidelberg, Lancaster, Manchester & Michigan is under way to extend the coverage of the corpus. The corpus is not publicly available, but the universities involved in the project are willing to host visits by interested scholars.

Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English

a selection of texts from the Old English Section of the Helsinki Corpus of English Texts; contains 106,210 words of Old Eng text; the samples from the longer texts are 5,000 to 10,000 words in length; texts represent a range of dates of composition, authors, & genres. For a list of the texts included in the Brooklyn Corpus, click here. The texts are syntactically & morphologically annotated, & each word is glossed. Size of the corpus: c. 12 megabytes.

Century of Prose Corpus

half a m words of literary & non-literary English; 1680-1780; 120 authors. (Not sure where the Web site is…)

Complete Corpus of Old English

3,022 texts representing all extant Old English texts, compiled at the University of Toronto.

Corpus of English Dialogues, 1560-1760 (CED)

1.2 m words of Early Modern English speech-related texts (177 text files). The CED contains texts representative of five text types (plus a mixed bag of dialogues labelled 'Miscellaneous'), which fall into two categories: 'authentic dialogue', i.e. written records of real speech events (Trial Proceedings & Witness Depositions), & 'constructed dialogue', in which the dialogue is constructed by an author (Drama Comedy, Didactic Works, & Prose Fiction).

Corpus of Newsbooks

approximately 800,000 words of running text drawn from all the newsbooks present in the Thomason Tracts that were published from December 1653 to May 1654.

Corpus of Middle English Prose & Verse (CME)

(or visit the parent site, the Middle English Compendium)

collection of Middle Eng texts assembled from works contributed by Univ of Michigan faculty & from texts provided by the Oxford Text Archive, as well as works created specifically for the Corpus (archive last updated in October 2000). All 61 texts in the archive are valid SGML documents, tagged in conformance with the TEI Guidelines, & converted to the TEI Lite DTD for wider use. Web-searchable.

Corpus of Early English Correspondence (CEEC) & the Parsed Corpus of Early English Correspondence (PCEEC)

2.7 m words; 1410 to 1681 (CEECS = 450,000 words); a supplement, the "Corpus of Early English Correspondence Supplement" (CEECSu; 0.44 m words), extends the time range to 1402-1663, while the "Corpus of Early English Correspondence Extension" (CEECE; 2.2 m words) covers the period 1681-1800. The project home page & the manual at ICAME give more details.

Corpus of Early English Medical Writing & Corpus of Middle English Medical Texts (MEMT)

a corpus of medical treatises from 1375-1800. Shorter texts are included in toto & longer treatises are represented by extracts of approximately 10-12 K words. The medieval section contains about 500,000 words.

Corpus of Late 18c Prose

c.300,000 words of local English letters on practical subjects, dated 1761-89, as a sample of the English language of the north-west of England in the late Modern English period. These letters, written to Richard Orford, a steward at Lyme Hall in Cheshire, are unselfconscious practical letters, often by uneducated people, on matters of business, farming, mining, & social relations. Available free for ftp download as a single text file or as three linked HTML files for maximum readability.

Corpus of Late Modern English Prose

A 100K-word corpus of informal private letters by British writers, covering the period 1861 to 1919. (Range of dates by birth-date of writer is narrower: 1837-67.) Available from the Oxford Text Archive & through the owner (David Denison).

Corpus of Late Modern English Texts (CLMET)

c. 10 m words of running text; a principled collection of texts drawn from Project Gutenberg & the Oxford Text Archive, divided over three 70-year sub-periods from 1710 to 1920.

Corpus of Early American English

English in America from the beginning of the 17th century; compiled in Helsinki.

The Coruña Corpus of English Scientific Writing

The Coruña Corpus of English Scientific Writing is one of the projects currently being carried out at the University of A Coruña (Spain) by the Research Group for Multidimensional Corpus-based Studies in English (MuStE). The team is in the process of creating a corpus that can be used for the diachronic study of scientific discourse at most linguistic levels and can thereby contribute to the study of the historical development of English for specific purposes. At the same time, we believe that the Coruña Corpus is an excellent tool for the study of the scientific register/style at particular moments in history: it offers the researcher the chance to analyse how this “Specific English” behaves from a synchronic point of view.

The compilation of the Coruña Corpus has been and is still governed by some of the most common parameters used in Corpus Linguistics, namely, external criteria for the delimitation of dates, sampling techniques, number of words per sample, etc.

Helsinki Corpus of Older Scots

830,000 words; 1450-1700, from fifteen genres.

ETED

(& accompanying book)

Transcriptions of 905 depositions drawn from manuscripts collected from the North, South, East and West of England, and the London area; c. 267,000 words. Testimonies by men and women of different ages and walks of life. Five electronic formats (XML, resolved XML, HTML, TXT and PDF) & ETED Presenter, a data retrieval program.

Early English Books On-line (EEBO)

(subscription required)

(images of original print documents, with some now searchable as texts) "From the first book published in English through the age of Spenser & Shakespeare, this incomparable collection now contains about 100,000 of over 125,000 titles listed in Pollard & Redgrave’s Short-Title Catalogue (1475-1640) & Wing’s Short-Title Catalogue (1641-1700) & their revised editions, as well as the Thomason Tracts (1640-1661) collection & the Early English Books Tract Supplement."

Early English Books Online (EEBO) corpus

Not to be confused with the above, a 755 million-word corpus from more than 25,000 historical texts, ranging from the 1470s-1690s. The corpus is freely accessible through Mark Davies’ BYU interface.

Helsinki Corpus of English Texts: Diachronic Part

c. 1.5 m words; 242 files; covers the period from c. 750 to c. 1700 (Old English to Early Modern)

Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET)

(1) The Prose Corpus of ICAMET: compilation of 129 texts (March 1999) of Middle Eng prose (1100-1500), digitized from extant editions & constantly enlarged by further files. Since it is a full-text database, it is aimed particularly at users who, unlike those of the Helsinki Corpus, are interested less in extracts of texts than in their complete versions. It thus allows literary, historical & topical analyses of various kinds, esp. studies of cultural history. It also invites linguists to raise questions of style, rhetoric or narrative technique, for which one would want a lengthier piece of text or even the complete text.

(2) The Letter Corpus of ICAMET contains 254 complete letters, arranged diachronically, from different sources (written between 1386 & 1688). It particularly encourages pragmatic & sociolinguistic studies, & analyses concerning cultural life & lifestyle.

NEET (Network of Early Eighteenth-century English Texts)

c. 3 million words of 18th-century English registers. No more information is available here; contact Douglas Biber for details.

Newdigate Newsletters

750,000 words; manuscript newsletters from 1674-92.

Old Bailey Corpus

Contains the proceedings of the Old Bailey, London’s central criminal court, 1674 to 1913. This constitutes a large body of texts from the beginning of Present Day English. The Proceedings contain about 200,000 trials, totalling c.134 million words, of which about 113 million is direct speech. Sociolinguistic mark-up based on sociobiographical speaker data found in the context for about half of the material identified as direct speech is under way (target: 57 million words).

Lampeter Corpus of Early Modern English Tracts

1m words of English pamphlet literature covering 1640-1740. Text samples are taken from each decade within this century & several genres are represented. Contains the whole text of pamphlets, rather than fragments.

Leverhulme Corpus Project

(Under construction: 15 months from October 2003)

1-million-word corpus which matches as closely as possible the LOB & FLOB corpora of written British English, except that the year of data collection is 1931, or near to that date (+/- 3 years). The immediate purpose of building this corpus is to make it possible to compare these three temporally equidistant corpora (1931, 1961, 1991): "Pre-LOB", LOB, & FLOB. This will enable tracking of grammatical change through a period of 60 years of the 20th century. Under construction & as yet unnamed (?)

Penn-Helsinki Parsed Corpus of Middle English

prose text samples of Middle Eng, annotated for syntactic structure. Designed for the use of students & scholars of the history of English, especially the historical syntax of the language.

TIME Magazine Corpus

100 m words from TIME magazine, 1923-2006. Allows you to see how words & phrases have increased or decreased in usage and/or changed meaning over time.

Women Writers Online

The Brown University Women Writers Project’s main undertaking is an SGML-encoded full-text database of pre-Victorian women’s writing in English (at present, it covers 1400 to 1850). This collection currently includes nearly 200 texts representing a broad cross-section of the literate culture of pre-Victorian Britain.

York-Toronto-Helsinki Corpus of Old English Prose (YCOE)

1.5-million-word syntactically annotated corpus of Old English prose texts; sister corpus to the Penn-Helsinki Parsed Corpus of Middle English (uses the same form of annotation & is accessed by the same search engine, CorpusSearch). The corpus itself (the annotated text files) is distributed by the Oxford Text Archive. Free for non-commercial use.

York-Helsinki Parsed Corpus of Old English Poetry

a selection of poetic texts from the Old English Section of the Helsinki Corpus of English Texts; 71,490 words of Old English text; the samples from the longer texts are 4,000 to 17,000 words in length. The texts represent a range of dates of composition & authors. For a list of the texts included in the York Poetry Corpus, click here. The texts are syntactically & morphologically annotated.

Zürich Corpus of English Newspapers (ZEN)

London newspapers from 1660s to the beginning of the 20th century. Contact: Udo Fries

* See also the Early Modern English Dictionaries Database (EMEDD description here)


Diachronic Comparisons (recent changes in English)

Since the first major English corpora were collected in the 1960s, it is now possible to compare these earlier corpora with more contemporary (1990s) corpora. For written British English, LOB can now be compared with FLOB, while for American English, it’s Brown v. Frown. For spoken British English, the Diachronic Corpus of Present-Day Spoken English (DCPSE) allows comparisons of the London-Lund Corpus (LLC, 1960s) with the British component of the International Corpus of English (ICE-GB, 1990s).

More recently, Mark Davies has compiled the Corpus of Contemporary American English (COCA), which is updated every 6-9 months and is probably the only corpus of English that is suitable for looking at current, ongoing changes in the language.
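
A comparison of matched corpora such as LOB v. FLOB usually comes down to two steps: normalising raw counts by corpus size, and checking whether the difference is larger than chance would suggest. The sketch below is only an illustration. The function names are made up for this example, the counts are hypothetical (not real LOB or FLOB figures), and the corpus sizes are the nominal one million words each. The significance measure is the two-term log-likelihood (G2) score that is widely used for corpus comparison; it is one option among several, not a method prescribed by any of the corpora listed here.

```python
import math


def per_million(count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return count / corpus_size * 1_000_000


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-term log-likelihood (G2) for one item in two corpora.

    Expected counts assume the item is equally frequent in both corpora;
    larger scores mean the observed difference is less likely to be chance.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Hypothetical counts for one word in a 1960s and a 1990s corpus.
# The sizes are nominal (assumed), not exact token counts.
EARLIER_SIZE = 1_000_000
LATER_SIZE = 1_000_000
earlier_count = 1_200
later_count = 900

earlier_pm = per_million(earlier_count, EARLIER_SIZE)
later_pm = per_million(later_count, LATER_SIZE)
percent_change = (later_pm - earlier_pm) / earlier_pm * 100
g2 = log_likelihood(earlier_count, EARLIER_SIZE, later_count, LATER_SIZE)

print(f"earlier: {earlier_pm:.1f} per million")
print(f"later:   {later_pm:.1f} per million")
print(f"change:  {percent_change:+.1f}%")
print(f"G2:      {g2:.2f}")
```

Normalising matters most when the corpora differ in size, as with the two components of DCPSE or with successive updates of COCA; with equal-sized corpora the per-million figures simply mirror the raw counts. A G2 score above 3.84 is conventionally read as significant at p < 0.05 (one degree of freedom).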