Recent Written Corpora

BE06 Corpus (British English 2006) 1 m words of published general written British English; same sampling frame as the LOB and FLOB corpora; consists of 500 files of 2,000-word samples taken from 15 genres of writing published between 2005 and 2008. Copyright restricted. Texts not available; can only be searched online here (registration required).
FLOB (Freiburg-LOB Corpus of British English) 1990s analogue to the LOB corpus (1 m words, written British English); the 2006 analogue to LOB/FLOB is the BE06 Corpus.
FROWN (Freiburg-Brown Corpus of American English) 1990s analogue to the Brown corpus (1 m words, written American English).
GUM (Georgetown University Multilayer corpus) GUM is an open-source multilayer corpus of richly annotated web texts from four text types. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (Creative Commons licenses), so that new texts can be annotated and published with ease. Version 3.2.0 contains 64K tokens annotated for the following (a short sketch of reading the dependency layer follows the list):
  • Multiple POS tags (100% manual gold PTB, extended PTB, CLAWS5 and Universal POS), and corrected lemmatization
  • Sentence segmentation and rough speech act (manual)
  • Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)
  • Constituent and dependency syntax (manually corrected Stanford Dependencies, automatic conversion to Universal Dependencies, as well as automatic PTB parses from gold tags)
  • Information status (given, accessible, new)
  • Entity and coreference annotation (including non-named entities, singletons, appositions, cataphora and bridging)
  • Discourse parses according to Rhetorical Structure Theory
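[If you want to work with GUM programmatically, here is a minimal sketch of reading its dependency layer, assuming you have a local copy of one of its CoNLL-U-style files; the file name below is only illustrative.]

```python
# Minimal sketch: reading dependency annotations from a GUM-style CoNLL-U file.
# The file path is hypothetical; point it at wherever your copy of the corpus lives.

def read_conllu(path):
    """Yield sentences as lists of (id, form, lemma, upos, head, deprel) tuples."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):          # metadata lines, e.g. sent_id
                continue
            if not line:                      # a blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            cols = line.split("\t")           # CoNLL-U uses 10 tab-separated columns
            if "-" in cols[0] or "." in cols[0]:
                continue                      # skip multiword-token / empty-node rows
            sentence.append((cols[0], cols[1], cols[2], cols[3], cols[6], cols[7]))
    if sentence:
        yield sentence

if __name__ == "__main__":
    for sent in read_conllu("GUM_academic_art.conllu"):   # hypothetical file name
        for tok_id, form, lemma, upos, head, deprel in sent:
            print(tok_id, form, lemma, upos, head, deprel, sep="\t")
        break                                 # show only the first sentence
```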
LUCY (documentation is here) structurally analysed written British English (drawn from the British National Corpus); a treebank sampling modern written British English across three genres: edited published prose, the writing of young adults (e.g. A-level exam scripts, first-year undergraduate essays), and spontaneous writing by 9- to 12-year-old children.
SUSANNE (Surface & Underlying Structural Analyses of Naturalistic English) 130,000-word cross-section of written American English (based on a subset of the million-word Brown Corpus; 64 texts x 2,000 words each from four Brown genre categories) syntactically analysed (treebanked).
Longman Written American Corpus [This blurb is from their web site. Availability is unknown, as with all proprietary corpora... no comment on the use of 'corpuses'...]
A dynamic corpus of 100 m words from newspapers, journals, magazines, best-selling novels, technical & scientific writing, & coffee-table books... composition constantly being refined & new material added... based on the general design principles of the Longman Lancaster English Language Corpus & the written component of the British National Corpus. Like other corpuses [sic] in the Longman Corpus Network, words can be concordanced, wordlists created, & statistical features analysed, allowing lexicographers to compare & contrast usage in British & American English.
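[The Longman data itself is not obtainable, but the operations the blurb describes (concordancing, wordlists, frequency analysis) can be sketched on a freely available corpus, e.g. the Brown corpus bundled with NLTK:]

```python
# A rough sketch of concordancing, wordlists and frequency counts, run on the
# freely available Brown corpus via NLTK rather than the proprietary Longman data.
# Requires: pip install nltk, then nltk.download('brown')
import nltk
from nltk.corpus import brown
from nltk.text import Text

words = brown.words()                 # ~1 m tokens of written American English
text = Text(words)

text.concordance("corpus", width=80, lines=5)    # KWIC-style concordance lines

freq = nltk.FreqDist(w.lower() for w in words if w.isalpha())
print(freq.most_common(10))           # a simple frequency wordlist
```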
Reuters Corpora (registration required to get the CDs, or get the older Reuters-21578 here.) Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19 [810,000 news stories]
Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 [over 487,000 Reuters News stories in 13 languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, & Swedish). These stories are contemporaneous with RCV1, but some languages do not cover the entire time period.]
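[Before registering for the CDs, you can get a feel for the older Reuters-21578 material via the ApteMod subset bundled with NLTK; a rough sketch:]

```python
# Quick look at the older Reuters-21578 collection using NLTK's bundled ApteMod
# subset (the full RCV1/RCV2 corpora require registration for the CDs).
# Requires: pip install nltk, then nltk.download('reuters')
from nltk.corpus import reuters

print(len(reuters.fileids()), "documents")   # 'training/...' and 'test/...' files
print(reuters.categories()[:10])             # topic codes, e.g. 'acq', 'grain'

doc_id = reuters.fileids("grain")[0]         # first document tagged 'grain'
print(reuters.raw(doc_id)[:300])             # first few hundred characters of it
```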

PLUS hundreds of others available from the Linguistic Data Consortium (LDC) & the ELRA/ELDA catalogue. However, the 'corpora' in these catalogues that are not listed on this site are mostly specialised collections or small corpora of isolated sentences (hence not really text corpora but collections of sentences). You could also try querying the OLAC archives.


If you found this web site useful, or found an outdated link, don’t forget to let me know.