Parsed Corpora/Treebanks

This list excludes parsed historical corpora. For parsed corpora in languages other than English, please see this page.

American Printing House for the Blind Treebank (APHB) A skeleton-parsed corpus of a wide range of English texts. 200,000 words. See description at the UCREL website.
Anaphoric Treebank A subsample of the AP corpus (English), annotated to show the reference of pronouns & lexical cohesion. Approximately 100,000 words. See description at the UCREL website.
Associated Press Treebank (AP) A skeleton-parsed corpus of American newswire reports. 1m words. See description at the UCREL website.
Canadian Hansard Treebank A skeleton-parsed corpus of proceedings in the Canadian Parliament. 750,000 words. See description at the UCREL website.
GUM (Georgetown University Multilayer corpus) GUM is an open source multilayer corpus of richly annotated web texts from four text types. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (Creative Commons licenses), so that new texts can be annotated and published with ease. Version 3.2.0 contains 64K tokens annotated for:
  • Multiple POS tags (100% manual gold PTB, extended PTB, CLAWS5 and Universal POS), and corrected lemmatization
  • Sentence segmentation and rough speech act (manual)
  • Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)
  • Constituent and dependency syntax (manually corrected Stanford Dependencies, automatic conversion to Universal Dependencies, as well as automatic PTB parses from gold tags)
  • Information status (given, accessible, new)
  • Entity and coreference annotation (including non-named entities, singletons, appositions, cataphora and bridging)
  • Discourse parses according to Rhetorical Structure Theory
Diachronic Corpus of Present-day Spoken English (DCPSE) 800,000 words (87,188 parse trees) of fully parsed & annotated spoken British English from the 1950s to 1990s; composed of two 400,000-word samples of spoken English from the London-Lund Corpus (late 1960s-early 80s) & ICE-GB (early 1990s); fully parsed to be consistent with ICE-GB & searchable using ICECUP (Survey of English Usage, University College London).
International Corpus of English (ICE) ICE-GB, the British component of the International Corpus of English (ICE) Project, was the first of the ICE corpora to be completed. It consists of a million words (83,394 parse trees, including 59,640 in the spoken part of the corpus) extracted from 200 written & 300 spoken English texts. It is fully grammatically annotated & has been fully checked. ICE-GB is distributed with the retrieval software ICECUP (International Corpus of English Corpus Utility Program), exploration software designed for parsed corpora.
IBM Manuals Treebank A skeleton-parsed corpus of computer manuals. 800,000 words. See description at the UCREL website.
Lancaster-Leeds Treebank A manually parsed subsample of the LOB corpus of English showing the surface phrase structure of each sentence, prepared by Professor Geoffrey Sampson. Approximately 45,000 words taken from all the genre categories of the LOB corpus. See description at the UCREL website.
Lancaster Parsed Corpus (LPC) A parsed subcorpus of the LOB Corpus of English, parsed by computer & manually corrected by researchers (Roger Garside, Geoffrey Leech & Tamas Varadi). Available through ICAME. It is a treebank of over 133,000 words, sampled from each of the 15 categories of the LOB Corpus. Each sentence is annotated with a phrase-structure parse in the form of labelled bracketing. The labels mark the boundaries of sentence, clause, phrase & coordinated word constituents. The word tags used in the tagged version of the LOB Corpus are also part of the annotation of the Lancaster Parsed Corpus. Manual is here.
Penn Treebank (III) The Penn Treebank Project annotates naturally occurring text for linguistic structure: skeletal parses showing rough syntactic information & argument structure (a bank of linguistic trees) in addition to part-of-speech tags, &, for the Switchboard corpus of telephone conversations, also dysfluency annotation. The original CD-ROM contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional 1 million words tagged for part of speech; the first fully parsed version of the Brown Corpus, completely retagged using the Penn Treebank tag set; tagged & parsed data from Dept of Energy abstracts, IBM computer manuals, MUC-3 & ATIS. The Release 2 CD-ROM features the new Penn Treebank II bracketing style, & contains, among other files, 1 million words of 1989 Wall Street Journal material annotated in Treebank II style. CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations. It pairs syntactic derivations with sets of word-word dependencies which approximate the underlying predicate-argument structure. It contains 99.44% of the sentences in the Penn Treebank, for which it corrects a number of inconsistencies & errors in the original annotation. Can also be searched with Douglas Rohde’s TGrep2, version 1.15 or higher.
Polytechnic of Wales Corpus (POW) Consists of approximately 65,000 words in 11,396 (sometimes very long) lines, each containing a parse tree.
LUCY (documentation is here) Structurally analysed written British English (drawn from the British National Corpus); a treebank sampling modern written British English of three genres: edited published prose, the writing of young adults (e.g. A-level exam scripts, 1st-year undergraduate essays), & spontaneous writing by 9- to 12-year-old children.
SUSANNE (Surface & Underlying Structural Analyses of Naturalistic English) 130,000-word cross-section of written American English (based on a subset of the million-word Brown Corpus; 64 texts x 2,000 words each from four Brown genre categories) syntactically analysed (treebanked).
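
Several of the treebanks listed above (for example the Lancaster Parsed Corpus & the Penn Treebank) store each parse as labelled bracketing. The following is a minimal Python sketch of how such a bracketed string might be read into a tree. It assumes a generic Penn-style notation of the form (LABEL child child ...); the sample sentence is invented for illustration & is not taken from any corpus above, & the names Node, parse & leaves are hypothetical. Individual corpora use their own bracketing conventions, so their manuals should be consulted before adapting this.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """One labelled constituent; children are sub-constituents or words."""
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)


def tokenize(text: str) -> List[str]:
    # Pad brackets with spaces so they split off as separate tokens.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(text: str) -> Node:
    """Parse one bracketed tree of the assumed form (LABEL child ...)."""
    tokens = tokenize(text)
    pos = 0

    def read() -> Node:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        if pos >= len(tokens) or tokens[pos] in ("(", ")"):
            raise ValueError("expected a constituent label")
        node = Node(tokens[pos])
        pos += 1
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read())      # nested constituent
            else:
                node.children.append(tokens[pos])  # a word (leaf)
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume the closing bracket
        return node

    tree = read()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree


def leaves(node: Node) -> List[str]:
    """Return the words of the tree in left-to-right order."""
    words: List[str] = []
    for child in node.children:
        if isinstance(child, Node):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Invented sample sentence in the assumed notation.
SAMPLE = "(S (NP (DT the) (NN corpus)) (VP (VBZ contains) (NP (JJ parsed) (NNS sentences))))"
tree = parse(SAMPLE)
print(tree.label, leaves(tree))
```

A recursive-descent reader is used here because bracketed trees nest to arbitrary depth; for real work, a dedicated tool such as TGrep2 or ICECUP, as mentioned in the entries above, is likely to be a better fit for the corpus it was built for.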