Non-English, Parallel & Multilingual Corpora
(a selection)


Monolingual corpora for languages other than English form the fastest-growing group of corpora. This growth has been propelled by the interests of both language engineers and linguists. The former need corpora in various languages as training data for statistical natural language processing applications such as machine translation or cross-lingual information retrieval. Linguists, on the other hand, are interested in both intra-linguistic and cross-linguistic comparisons and analyses. Whatever the motivation, much corpus research is now firmly multilingual in nature, and a new conference series (Corpus Linguistics) now complements the English-only ICAME conference series.

HINT: Use the ‘Find’ facility of your browser to search this page for the language you’re interested in (because this list is alphabetised by the name of the corpus, not consistently grouped by language except for Chinese, which has all the relevant corpora put together). I have not included non-naturalistic or elicited data, i.e., "corpora" (better called "databases") of read-aloud phonemes, words, sentences, prompted speech, etc., for which you are referred to the ELDA or LDC catalogues. For "less common" languages (i.e., anything other than the boring, over-researched major languages), see Manuel Barbera’s page.

Non-English Corpora

ABU: la Bibliothèque Universelle (French)

Free access to the full text of French-language public-domain works on the Internet since 1993. The texts are produced and distributed by the volunteer members of the Association des Bibliophiles Universels (ABU).

ALPINO Treebank (Dutch)

Syntactically annotated Dutch sentences (more than 150,000 words). Includes the full cdbl (newspaper) part of the Eindhoven corpus. A number of tools are provided for browsing and searching the corpus.

Arabic Corpora, & the Corpus of Contemporary Arabic (CCA),
(Latifa al-Sulaiti)

Lists available Arabic corpora, plus a Corpus of Contemporary Arabic (CCA), still under construction (to date, May 2006, 842K words and 415 texts).

Arabic English Parallel News

Contains Arabic news stories and their English translations that the LDC collected via Ummah Press Service from January 2001 to September 2004; 8,439 story pairs, 68,685 sentence pairs, 2M Arabic words and 2.5M English words. Aligned at sentence level.

ARTFL Project (American and French Research on the Treasury of the French Language)

(subscription required) c. 2000 texts, ranging from classic works of French literature to various kinds of non-fiction prose and technical writing; 18th, 19th and 20th centuries are about equally represented, with a smaller selection of seventeenth century texts as well as some medieval and Renaissance texts. Includes a Provençal database with 38 texts in their original spellings. Genres include novels, verse, theater, journalism, essays, correspondence, and treatises. Subjects include literary criticism, biology, history, economics, and philosophy.

ARCADE/ROMANSEVAL corpus
(ELDA catalog entry here)

The ARCADE/ROMANSEVAL corpus was used as a reference corpus in two international competitions: (i) ARCADE, an exercise on multilingual text alignment financed by AUPELF-UREF; (ii) ROMANSEVAL, part of the SENSEVAL exercise sponsored by ACL-SIGLEX and EURALEX, on word sense disambiguation. The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Commission). The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3,700 contexts altogether, and comprises: (a) semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian; (b) word-level alignment of all the occurrences of the test words between French and English. Additional information: Arcade site and Romanseval site.

Archivio di Varietà di Italiano Parlato (AVIP)

semi-spontaneous spoken Italian materials (plus lists of read words), collected in the form of so-called "map task dialogues" in three localities: Bari, Napoli and Pisa. Includes a speech sample produced by hearing-impaired and normal children. Specifically, the corpus includes 39 semi-spontaneous dialogues produced by young adult speakers, and 5 dialogues produced by children, for a total of about 14 hours. 15 adult speakers' dialogues (plus all the 5 produced by children) are orthographically transcribed (about 350 minutes). In particular, 75 minutes of speech are phonetically segmented and labelled (and a smaller subset is also prosodically labelled). Finally, 4 dialogues are annotated at textual level.

Available as a CD via the technical staff at Laboratorio di Linguistica. The materials are also accessible via anonymous ftp at ftp.cirass.unina.it (username "anonymous", password "your e-mail address") and are contained in the folder cirass/pub/avip.

ASEDA (Aboriginal Studies Electronic Data Archive)
(Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS))

ASEDA holds materials on Australian Indigenous languages, including dictionaries, grammars and teaching materials, hosted by the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS); about 300 languages are represented. ASEDA offers a free service of secure storage, maintenance, and distribution of electronic texts relating to these languages. The Archive is available to language community members and to researchers in the field of Aboriginal and Torres Strait Islander Studies. Availability of items is subject to depositors' access conditions.

ASL (American Sign Language)

Resources from the National Center for Sign Language and Gesture Resources (Boston University). A substantial corpus of American Sign Language (ASL) video data from native signers is being collected and made available. Data collection began in December 1999. Multiple synchronized high-quality video files (available in a variety of formats) show the signing from different angles as well as a close-up view of the face. Includes SignStream software.

BACKBONE (European languages, incl. English) On-line Search here.

BACKBONE is a European project; web-based pedagogic corpora of video-recorded spoken interviews with native speakers of English, French, German, Polish, Spanish and Turkish as well as non-native speakers of English as a Lingua Franca (ELF).

BangorTalk

Bilingual conversational corpora assembled by the ESRC Centre for Research on Bilingualism in Bangor, Wales, all offered under a free (GPL) license. All include glosses and translations into English.

Siarad (Welsh-English) - 460,000 tokens

Miami (Spanish-English) - 265,000 tokens

Patagonia (Welsh-Spanish) - 192,000 tokens

Le corpus BAF

(English-French parallel corpus)

BAF is an English-French bitext corpus, i.e. a set of pairs of English and French documents that are translations of each other and whose sentences have been "aligned". The corpus was assembled by the computer-assisted translation (TAO) team at CITI, as part of the Action de recherche concertée (ARC) A2. Most of the corpus consists of institutional texts (Canadian Hansard, UN reports, etc.), but a few scientific articles and one literary work are also included. The whole amounts to about 400,000 words in each language.

French-English bitext of about 400,000 words per language. It contains four sub-sets of texts: (1) INST: four institutional texts (including a representative excerpt of the Hansard corpus, which consists of transcriptions of parliamentary debates), for a total size close to 300,000 words per language; (2) SCIENCE: five scientific articles of about 50,000 words per language each; (3) TECH: a piece of technical documentation with 39,328 English words for 46,828 French ones, containing a large glossary sorted in a different order in each language; (4) VERNE: Jules Verne’s novel De la terre à la lune (40,161 English words vs 53,181 French words).
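
The word counts quoted for the TECH and VERNE sub-sets show how much longer the French side of the bitext is than the English side. The following is a minimal Python sketch of that arithmetic, using only the figures given above; the dictionary and function names are illustrative and are not part of the BAF distribution.

```python
# Word counts as quoted in the BAF description above: (English, French).
BAF_COUNTS = {
    "TECH": (39328, 46828),
    "VERNE": (40161, 53181),
}


def french_to_english_ratio(english_words, french_words):
    """Return the number of French words per English word."""
    return french_words / english_words


# Ratio for each sub-set for which both counts are quoted.
RATIOS = {
    name: french_to_english_ratio(en, fr)
    for name, (en, fr) in BAF_COUNTS.items()
}

for name, ratio in RATIOS.items():
    print(f"{name}: {ratio:.2f} French words per English word")
```

A ratio of this kind is a common sanity check when aligning parallel texts, since sentence pairs whose lengths diverge sharply from the corpus-wide ratio are candidates for misalignment.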

Banca dati dell’italiano parlato (BADIP)

spoken Italian database containing an online edition of the 500,000 word LIP-Corpus. The edition is being enriched with POS-tags and lemmata, and more data are being added continuously. Other corpora of spoken Italian will be included in the database as well. The database is part of the Language Server of the University of Graz (Austria). Access to BADIP is free.

Basque Spoken Corpus

(distributed by ELDA/ELRA)

a collection of 42 narratives by native Basque Euskara speakers, who relate a silent movie they have just watched to someone else. It includes sound files (MP3 format) and full detailed transcripts, + 53 additional sound tracks of extemporaneous speech and description of still images.

Base de Datos Sintácticos del español actual (Syntactic Database for modern Spanish) (BDS)

contains about 160,000 clauses (1.5 m words) of Spanish with syntactic analysis (manually added), from the corpus ARTHUS (Archivo de Textos Hispánicos de la Universidad de Santiago). Composition: 66.5% written (narratives, essays and journalistic texts), 14.7% drama and 18.9% oral transcriptions. The syntactically annotated corpus is available for free (for research purposes only) via the web, through a user-friendly interface (in Spanish only). Prior subscription (at http://www.bds.usc.es/usuarios) is required.

BulTreeBank (HPSG-based Syntactic Treebank of Bulgarian)

Under construction. Objective: to create a high-quality set of syntactic structures of Bulgarian sentences within the framework of HPSG. Aims to contain samples of all the syntactic structures of the language. These sentences should serve as templates for future corpora development, become the basis for the development of a more comprehensive test suite for NLP applications, and provide a source for grammar extraction and for linguistic research.

CoMET project
(Corpus Multilíngue para Ensino e Tradução)

Comprises three corpora:

  • CorTec: a Technical corpus with comparable texts (originals in English and Portuguese) in 20 technical domains
  • CorTrad: a parallel corpus (originals in either Portuguese or English and their translations, usually with more than one translation) with technical-scientific texts, journalistic texts and literary texts (Canadian short stories, Australian short stories, Alice in Wonderland, Alice through the Looking Glass - Dubliners in the works)
  • CoMAprend: a multilingual learner corpus of students learning German, English, Spanish, French or Italian.

‘Brown’ corpus of Bulgarian

compiled in conformity with the design used in the compilation of the well-known Brown Corpus of Standard American English. It consists of 500 text samples (2,000 words each) distributed in 15 categories from two types of texts - fiction and informative prose. 1,001,286 words. The samples are excerpts from texts created or published for the first time in the period 1990-2005, the main part dated after 2000.

CATE

Corpus of Taiwanese Learners of Spanish (Corpus de Aprendices Taiwaneses de Español)

CETEMPúblico

from the AC/DC project. 180 million words annotated morphosyntactically with the PALAVRAS tagger; can be searched online.

CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo)

24 million words in Brazilian Portuguese, built by Linguateca from the texts of Folha de S. Paulo belonging to the corpus NILC/São Carlos, compiled by Núcleo Interinstitucional de Lingüística computacional (NILC). This is the little brother of CETEMPúblico (also free and on the Web).

For a comprehensive list of Portuguese resources, go to the Linguateca page here.

CEXI
(Italian-English Translational Corpus)

an English-Italian Translational Corpus being developed at the SSLMIT, University of Bologna at Forlì; translation-driven, bilingual, bidirectional and parallel, i.e. it contains translations from English into Italian and translations from Italian into English, together with the respective source texts; contains printed books, further subdivided into two equally-sized sub corpora: adult fiction (Eng-Italian, Italian-Eng), adult non-fiction (Eng-Italian, Italian-Eng); overall size of the core corpus will be about four million words, or one million words per component.

Chemnitz German-English translation corpus

The link is through the Chemnitz Internet Grammar (http://www.tu-chemnitz.de/phil/InternetGrammar) for which you’ll need to get a username and password (instantaneous). The corpus contains recent (in the last 15 years) texts in the areas of politics, tourism and academia. Approx. 1 million words each in German and English. Some (undocumented and buggy) advanced boolean searches possible.

CHINESE CORPORA
(Readily-accessible and freely downloadable corpora are listed first) See also the multilingual Opus Corpus

Academia Sinica Balanced Corpus

(Chinese)

5-m-word corpus of Chinese at the Academia Sinica (Taiwan); Traditional script/Big5-encoded, free online access.

Lancaster Corpus of Mandarin Chinese (LCMC)

Free on-line concordancing interfaces are: via NIE here ; via Leeds here

a 1-m-word (incl. punctuation) mainland-Chinese corpus comparable with the Freiburg-LOB Corpus of British English (FLOB); allows for contrastive studies between English and Chinese as well as monolingual investigations of Chinese. Most texts from 1991 (some within +/- 2 yrs).

UCLA Chinese Corpus (UCLACC)

1-m words (incl. punctuation); texts from 2000-2005; can be used vis-à-vis LCMC to track language change over a decade and to examine the potential influence of the Web on (written) Chinese.

Lancaster-Los Angeles Spoken Chinese Corpus (LLSCC)

1-m transcribed words (incl. punctuation); can be used vis-à-vis LCMC to compare spoken with written Mandarin; texts from around 2004-2007 (?).

ZJU Corpus of Translational Chinese (ZCTC)

1-m words (incl. punctuation); texts from 1991-2001; translations into Chinese, mostly from Eng; can be used vis-à-vis LCMC to study the validity of so-called 'translation universals'.

Peking University’s corpora collection

ancient & modern Chinese as well as a parallel corpus; some searchable on-line.

LIVAC Synchronous Corpus (Chinese)

(City University of Hong Kong) (Alternative URL here)

On-going project which collects texts from representative Chinese newspapers and electronic media of Hong Kong, Taiwan, Beijing, Shanghai, Macau and Singapore. Started: 1995, Ends: 2005. The collection of materials from the diverse communities is synchronized, and so offers an innovative "Window" approach for a whole variety of comparative studies and useful applications in IT. All corpus texts have undergone automatic segmentation and manual verification.

Data available for online search are those from July 1995 to June 1997, amounting to about 16 million characters (190,000 words) encoded in Big-5.

English-Chinese Parallel Corpus (by Wang Lixun)

free on-line access; a variety of parallel texts in both directions. English originals: 1.4 m words; Chinese originals: 413,823 words (576,724 characters)

PKU Babel Chinese-English Parallel Corpus

20m Chinese characters & 10m English words of written & spoken bilingual texts sampled from a variety of text categories incl. government documents, news, academic prose, fiction, play scripts, & speech; covers three styles (literature, practical writing and news), six fields (arts, business/economics, science, etc.)

Lancaster’s Babel English-Chinese Parallel Corpus
(On-line search here)

half-million words, sources from World of English & Time; translated into Chinese.

Hong Kong Bilingual Corpus of Legal & Documentary Texts (by Xu Xunfeng, HK PolyU)

bilingual (parallel) English-Chinese corpus of mainly legal & documentary texts from Hong Kong; texts dated 1997 or 1998, and roughly cover a two-year time span from before to after the establishment of Hong Kong SAR; "Legal texts" = Bills, Ordinances and Laws, including the Basic Law of Hong Kong; "Government documents" = reports, papers, discussions & notes. In addition, the corpus contains transcripts of public speeches (mainly delivered by the Chief Executive), minutes of Legislative Council meetings, Hospital Authority annual reports and press releases, and also statements and corporate profiles from the business and commerce sector.

English: 300,000 words; Chinese: 500,000 characters; limited searches; concordances not aligned in the output (though sentences in the raw texts are).

Sheffield Corpus of Chinese (SCC)

(free for academic use) a limited body of representative texts from Medieval (MedC) & Modern Chinese (ModC) periods; two text types: literary and non-literary. Contains: The MedC text "Zhuzi Yulei" (ZZYL, Classified conversations of Master Zhu) by Zhu Xi (12th century; the most 'spoken-like' registers available from earlier historical periods); ModC novels = "Shuihu Zhuan" (SHZ, Tales of the Water Margin; in a colloquial style compounded with oral conventions) and "Rulin Waishi" (RLWS, The Scholars; conscious use of "Guoyu" or the national vernacular).

Hong Kong Cantonese Adult Corpus (HKCAC) & Hong Kong Corpus of Primary School Chinese (HKCPSC)

The HKCAC contains orthographic and phonetic transcriptions of 8 hours of spoken Cantonese, totalling 170,000 Chinese characters (phone-in programs and forums on the radio, recorded Nov 1998 - Feb 2000; 69 speakers + program hosts).

The HKCPSC contains linguistic analysis of 186,022 characters used by grade 1 to grade 5 students in Hong Kong.

Chinese Treebank 8.0 (Penn/LDC)

Over 1.5 million words of Chinese with syntactic bracketing (3,007 files, 71,369 sentences, 1,620,561 words, 2,589,848 characters), articles from Xinhua newswire, magazines & government documents. UTF-8 encoded and formatted similarly to the UPenn English Treebank.

CALLHOME Mandarin Chinese Transcripts (XML version)

300,767 words; 120 unscripted telephone conversations (collected around 1996) between native speakers of Mandarin Chinese. Calls, which lasted up to thirty minutes, originated in North America and were placed to locations overseas; most participants called family members or close friends. XML Version = UTF-8 encoding, retokenization and part-of-speech (POS) tagging. Available through the LDC (Non-member price: US$1,500)

— End of Section on Chinese corpora —

CLIPS

Corpus of spoken Italian

100 hours of Italian speech (female and male); a section has been transcribed orthographically; a smaller section has been phonetically labelled. Recordings were made in 15 Italian cities, selected on the basis of linguistic and socio-economic principles of representativeness: Bari, Bergamo, Bologna, Cagliari, Catanzaro, Firenze, Genova, Lecce, Milano, Napoli, Palermo, Parma, Perugia, Roma, Venezia. Speech genres: radio and television broadcasts, dialogue, read speech from non-professional speakers, speech over the telephone, read speech from 20 professional speakers.

COMIC (Commercial Italian Corpus)

over 200 articles in Italian from 1996-2001. Plain text & SGML. Available through the OTA.

C-ORAL-ROM (Integrated reference corpora for spoken Romance languages)

Project (Jan 2001-2004) aiming to create a comparable set of corpora of spontaneous spoken language for the main Romance languages (French, Italian, Portuguese and Spanish), in which textual information and audio are associated and stored on DVD. The resulting multilingual corpus will be tagged with respect to prosodic parsing and integrated with tools for acoustic and textual analysis.

CORIS/CODIS

a 100-million-word corpus of contemporary written Italian. Current version is available on-line for research purposes, for people employed in academic and research institutions, and will continue to be available free of charge, on an experimental basis, until the release of the final version. Before signing an agreement to obtain personal access to the corpus, the demo version corpus may be consulted, using the data retrieval software on the web.

Corpus del Español

a 100 million word corpus of Spanish texts funded by the NEH and created by Mark Davies (Illinois State University): 20m from the 1200s-1400s, 40m from the 1500s-1700s, 40m from the 1800s-1900s. The 20m words from the 1900s are divided equally among literature, oral texts, and newspapers/encyclopedias.

Corpus lexicaux québécois

The network of Quebec lexical corpora (Canadian French)

Corpus of Italian Newspapers

issues of three different Italian newspapers from 20-21 Oct 1989. Plain text format. Available through the OTA.

Corpus of Contemporary Lithuanian

currently under construction; there is also a parallel/translation corpus.

Corpus of Greek Texts

The Corpus of Greek Texts (CGT) is the first electronic corpus of Greek that was created with the aim of providing a resource for linguistic research in a wide range of both written and spoken Modern Greek genres.

Corpus of Spoken Israeli Hebrew (CoSIH)

currently under construction

CORGA (Reference Corpus of Present-day Galician Language)

Current version (at Dec 2005) is composed of 13.3m orthographical forms of Galician (target: 25 million). Includes texts published or produced between 1975 and 2004 (with priority given to recent periods). Texts are grouped together in five-year periods to facilitate period-based research. Direct access to CORGA can be made via the Internet, through the server of the Centro Ramón Piñeiro para a Investigación en Humanidades (CRPIH): http://www.cirp.es or http://corpus.cirp.es/corga.

Mannheimer Corpus Collection
(Institut für Deutsche Sprache, Mannheim, Germany) via

COSMAS (Corpus Storage, Maintenance and Access System)

world’s largest, growing, collection of German online corpora for linguistic research. Launched in the mid-1960s, it reached 1.85 billion words in 2002. Since 1993, the copyright-free part (currently 1.1 billion words) of the collection has been publicly available for searching via the COSMAS online toolbox (a powerful online corpus search and analysis toolbox: complex query language, concordancing, on-line collocation analysis, clustering, virtual corpus composition, etc.; includes a German lemmatizer and compound analyzer). Invited guests have access to the whole Mannheimer Corpora collection. Wide variety of sources, e.g. classic literary texts, national and regional newspapers, spoken language in transcribed form, morphosyntactically annotated texts and several unique corpora. Commercial use is not permitted. No downloads.

CELT: Corpus of Electronic Texts

online resource for Irish history, literature and politics. Includes texts in Irish, Hiberno-Norman French, Latin, etc.

COMPARA

See also: Corpografo

an extensible bidirectional Portuguese-English Parallel Translation Corpus; an open-ended collection of Portuguese-English and English-Portuguese translations (c.1.5m words as of Feb 2004); 37 source texts and 40 translations, with more being added all the time. Web-searchable interface. Only fiction texts at present (more genres to be added later). Alignment is based on the source-text sentence and allows users to search for sentences that have been joined, split, added to, deleted from, and reordered in translation. Other searchable features are translators' notes, foreign words, titles, emphasis and named entities.

Corpus Berbère

A Corpus for Berber Languages.

Croatian National Corpus

Under construction (at time of writing: Mar 22 2003). Goal: a 30-million-word corpus of contemporary Croatian. Currently available texts are searchable. Has a frequency listing. Includes the Croatian Electronic Text Archive (Hrvatski elektronski tekstovni arhiv).

Czech National Corpus (CNC)
or Český národní korpus (ČNK)

A sub-section of the written corpus, PUBLIC, contains about 20 million words and is immediately internet-accessible. The full corpus, SYN2000 (100 million words), with the same genre constitution, is also accessible by arrangement.

CRATER Multilingual Aligned Annotated Corpus

The Corpus Resources And Terminology Extraction project (MLAP-93 20) has extended the bilingual annotated English-French International Telecommunications Union corpus to include Spanish, and has also debugged the existing corpus. In addition, a Spanish tagger has been developed, along with a set of retrieval tools for browsing the trilingual aligned corpus, and examining the proposed term or word alignments. The offer consists of the 3 x 1,000,000 token corpora of English, French and Spanish, morphosyntactic annotations (human-edited), lemmatisation and term extraction routines for English, French and Spanish.

Corpora of Vietnamese Texts (CVT)

> 1 million Vietnamese words; two sources: (1) children’s literature (204K words); (2) online newspapers (851K words).

DOBES Project

A project to document and archive endangered languages (multimodally and with appropriate metadata) before they become extinct. The data archive (at the Max Planck Institute, the Netherlands) will cover video and audio recordings, photos, drawings, annotations to the recordings at various layers, lexica of various sorts, grammar notes, field notes, notes about the sound system of a language and many others. This material is available in a restricted number of formats such as WAVE, MPEG1/2 for audio and video data and XML and UNICODE for text data. All material that is well documented by the various research teams is integrated into a metadata domain which is open to the public.

Dutch Corpora at the Instituut voor Nederlandse Lexicologie (INL)

(1) the 5 Million Words Corpus 1994 (books, magazines, newspapers and TV broadcasts, covering several topics such as journalism, politics, environment, linguistics, leisure and business & employment); (2) the 27 Million Words Newspaper Corpus 1995; (3) the 38 Million Words Corpus 1996 [three main components: a component with varied composition (1970-1989), a newspaper component (Meppeler Courant, 1992-1995) and a legal component (1814-1989)]

EMILLE (Enabling Minority Language Engineering)

a 3-year EPSRC project at Lancaster University and Sheffield University, designed to build a 63-million-word electronic corpus of South Asian languages, especially those spoken in the UK. Beta version of the corpus consists of: 30 million words of monolingual written data (Gujarati, Tamil, Hindi, Punjabi); 600,000 words of monolingual spoken data (Hindi, Urdu, Punjabi, Bengali, Gujarati); 120,000 words of parallel data in each of (English, Hindi, Urdu, Punjabi, Bengali, Gujarati).

English-Russian Parallel Corpus

said to be the biggest EN-RU parallel corpus on the net.

European Corpus Initiative Multilingual Corpus I (ECI/MCI)

contains over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more. The primary focus in this effort is on textual material of all kinds, including transcriptions of spoken material. Also available from ELDA here.

ELC for ALL (Electronic Corpora for African-Language Linguistics) (Various African languages)

a site for projects still in their infancy. Aim: corpora of African languages for linguistic and lexicographic purposes. To date, corpus holdings include: South African languages (Afrikaans, isiNdebele, isiXhosa, isiZulu, Sepedi, Sesotho, Setswana, siSwati, Tshivenda, Xitsonga & South African English) and others (Cilubà, Kiswahili). Site includes some on-line papers & publication abstracts relating to the work.

Europarl
(Philipp Koehn)

free parallel corpus extracted from the proceedings of the European Parliament; includes versions in 11 European languages: Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish. The goal of the processing was to generate sentence-aligned text for statistical MT systems. For this purpose matching items were extracted and labeled with corresponding document IDs. Punctuation was separated out and sentence boundaries identified. Contains c.20 million words in 740K sentences per language.
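
To illustrate what sentence-aligned text means for an MT pipeline, here is a minimal Python sketch that pairs up two line-aligned streams. It assumes one sentence per line, with line i on one side corresponding to line i on the other; that layout is an assumption made for illustration and is not described in the entry above. The sample sentences are made up and are not taken from Europarl, and the function name is hypothetical.

```python
from io import StringIO


def read_aligned_pairs(source_lines, target_lines):
    """Pair two line-aligned streams, skipping pairs where either side is empty."""
    pairs = []
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Made-up sample data standing in for two aligned files (assumed layout).
english = StringIO("The sitting is open.\n\nThe debate is closed.\n")
french = StringIO("La séance est ouverte.\n\nLe débat est clos.\n")

PAIRS = read_aligned_pairs(english, french)
for en, fr in PAIRS:
    print(f"{en}\t{fr}")
```

Dropping pairs with an empty side is a design choice of this sketch: it keeps only sentence pairs that can actually be used as training examples.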

FIDA Corpus of Slovene Language

100 million words of (mostly) written Slovene (predominantly from the 1990s). Web access possible with registration.

Floresta Sintá(c)tica (Portuguese treebank project)

A sampler of c.1,000 running text sentences (European Portuguese) is available for download and searching, or for graphical tree inspection/manipulation visit http://visl.sdu.dk. The sampler is a manually revised part of a larger tree corpus (1 million words), which was automatically annotated with the Constraint Grammar based PALAVRAS parser and then converted into constituent trees. This full version can also be searched. The project is a joint venture of the VISL project (Southern Denmark University) and the project "Computational Processing of Portuguese".

French Learner Language Oral Corpora (FLLOC)

an electronic database of French Learner Language Corpora, freely available to the research community, in the form of linked digital soundfiles and transcripts formatted using the CHILDES software.

German Political Speeches Corpus

political speeches from the German Presidency and Chancellery

Hamburg Corpus of Argentinean Spanish (HaCASpa)

141,321 transcribed words, 259 recordings (Total: 18h 24 mins, 63 speakers, 259 communications, 261 transcriptions). Audio & video recordings of experimental/read and spontaneous speech from adult speakers of Porteño Spanish in Argentina. Speakers: 18-69 years old from two geographic areas. For the intonational experiments, there are audio recordings only, whereas some of the free interviews and map tasks feature video recordings. Info: http://www.corpora.uni-hamburg.de/sfb538/en_h9_hacaspa.html

Hansard (various versions)
(English and Canadian French)

USC Information Sciences Institute has the freely downloadable Aligned Hansards of the 36th Parliament of Canada: 1.3 million pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament (1997-2000).

LDC’s version (covers a time span from the mid-1970s through 1988); free to LDC members but US$5,000 for nonmembers

Hebrew Newspaper Corpora (by Shlomo Yona)

newspaper articles from three major Hebrew on-line editions of Israeli newspapers (Maariv, Yediot, and Haaretz); two tagged versions

Hellenic National Corpus (HNC; Greek)

(Under construction) The Greek equivalent to the British National Corpus; 20m words of written Modern Greek; has a web interface; full access requires subscription.

Hungarian National Corpus

(in English, in Hungarian)

Magyar Nemzeti Szövegtár: a 100-million-word balanced reference corpus of present-day Hungarian

Hungarian Historical Dictionary Corpus

searchable

IFA Spoken Language Corpus

a free (GPL) database of hand-segmented Dutch speech. It was constructed with off-the-shelf software using speech from 8 speakers in a variety of speaking styles. For a total of 50,000 words (41 minutes/speaker), speech acquisition and preparation took around 3 person-weeks per speaker. Compiled data are available in relational database format for querying with SQL.
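
Because the compiled data are stored relationally, questions about the corpus can be phrased as SQL queries. The Python sketch below shows the general idea with an in-memory SQLite database. The table name, column names and rows are entirely hypothetical, since the actual IFA schema is not described here, and SQLite is used only for convenience.

```python
import sqlite3

# Hypothetical schema and toy rows: the real IFA table and column names
# are not given in the entry above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments (speaker TEXT, style TEXT, word TEXT, duration_s REAL)"
)
conn.executemany(
    "INSERT INTO segments VALUES (?, ?, ?, ?)",
    [
        ("S1", "read", "huis", 0.42),
        ("S1", "informal", "huis", 0.31),
        ("S2", "read", "boom", 0.39),
    ],
)

# Count and mean duration of segments per speaking style.
ROWS = conn.execute(
    "SELECT style, COUNT(*), AVG(duration_s) "
    "FROM segments GROUP BY style ORDER BY style"
).fetchall()

for style, n_segments, mean_duration in ROWS:
    print(style, n_segments, round(mean_duration, 3))

conn.close()
```

Grouping by speaking style is only one example; the same pattern applies to any attribute stored as a column, such as speaker.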

Irish Corpora

1. Tobar Na Gaedhilge: free searchable textbase of quality 20th-century Gaelic texts (mostly Irish, with some Scottish), containing (at version 1.10) 5.3 million words of literary continuity Gaelic from the first half of the 20th century. The texts are accessible only through the supplied retrieval program, which runs under MS Windows. Parallel texts are included, pos-tagged and lemmatized, in English (1.9M words), French (0.8M), German (0.5M) and Russian (0.06M). (Contact: Ciarán Ó Duibhín)

2. large web-crawled corpora from the Crubadan project (Contact: Kevin Scannell)

3. New Corpus for Ireland (NCI): 30-million-word corpus of Irish (Connacht, Munster, and Ulster), including 5m words drawn from the Web; 25-million-word corpus of Hiberno-English (including 5m from the Web); original site no longer works, but access is now provided through the Sketch Engine.

IJS-ELAN corpus (Slovene-English Parallel Corpus)

1 million words of parallel/aligned Slovene-English/English-Slovene texts (15 source texts); has a web-based parallel concordancing service

Ingrian Finnish

a corpus of spoken Ingrian Finnish available via WWW

INTERSECT (International Sample of English Contrastive Texts)

a sentence-aligned, parallel bilingual corpus of French-English (1.5m words of each language) and German-English (800K words of each language) written texts.

IPI PAN Corpus (Polish)

a large (currently over 250 million segments), morphosyntactically annotated, publicly available corpus of Polish

JOC (the Official Journal of the European Community)

composed of records of questions and answers regarding European Community matters. The data is regularly published as one section of the C Series of the Official Journal of the EC in all official languages (previously nine); contains written questions asked by members of the European Parliament on a wide variety of topics and corresponding answers from the European Commission in 9 parallel versions; c. 10.2 million words (ca. 1.1 million words per language) corresponding to the year 1993, which was collected and prepared within the MLCC-MULTEXT projects. The part used for JOC was composed of one fifth of the French and English parts (ca. 200,000 words per language). For more info, see the LPL site or the ARCADE pages.

KACENKA (Korpus anglicko-cesky - elektronicky nastroj Katedry anglistiky; Czech)

Parallel Corpus of English and Czech texts (mainly literary); currently 3,297,283 words.

Korean National Corpus

(21st Century Sejong Project)

The goal of this project (Korean Language Information-Oriented Project) is to establish a national corpus of the Korean language comparable in both quality and size with other national corpora such as the BNC (British National Corpus). As of 2007, 57 million words.

Korean Treebank

(see also LDC catalog entry)

consists of 33 texts originally written in Korean and translated into English for purposes of language training in a military setting (information about various aspects of the military, such as troop movement, intelligence gathering, equipment supplies, etc.). 54,366 words and 5078 sentences.

Korpus 2000 (Danish)

the first major written Danish corpus which has been made publicly available on the internet. It consists of approximately 28 million words from texts written from 1998 to 2002, with parts-of-speech and morphological (i.e. inflectional) information. At the site it is also possible to search the Korpus 90 (1988-1992) which is similar to the Korpus 2000 in its composition and size and hence serves as an older comparative corpus for the Korpus 2000.

Lacio-Web (Portuguese)

(Universidade de São Paulo)

Project comprises six corpora: 1) a reference corpus called Lacio-Ref; 2) Mac-Morpho, a gold standard portion from Lacio-Ref, comprising 1.1 million words, which was manually validated for morpho-syntactic tags; 3) a portion of Lacio-Ref automatically annotated with lemmas, POS and syntactic tags, which are used by the parser Curupira developed at NILC; 4) a deviation corpus composed of non-revised texts (Lacio-Dev); and 5) parallel and 6) comparable Portuguese-English corpora called, respectively, Par-C and Comp_C.

Leipzig Corpora Collection

(various languages: Catalan, Danish, Dutch, English, Estonian, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Sorbian, Swedish, Turkish)

corpora in different languages using the same format and comparable sources (identical in format and similar in size and content). They contain randomly selected sentences in the language of the corpus and are available in sizes of 100,000 sentences, 300,000 sentences, 1 million sentences, etc. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. All data (publicly accessible, copyrighted sources) have been processed automatically so that it is not possible to reconstruct the original source texts. Significant L1, R1 and "within sentence" collocates are computed for each word. Available as plain text files, or as MySQL database tables (ready to use with a supplied Corpus Browser).
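To make the three collocate types concrete, here is a minimal sketch that counts left neighbours (L1), right neighbours (R1), and same-sentence co-occurrences over a list of sentences. It produces raw counts only; the significance measure the Leipzig project applies is not described here, so none is attempted. The function name, the whitespace tokenisation, and the toy sentences are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def neighbour_counts(sentences):
    """Count L1 (left), R1 (right) and within-sentence co-occurrences per word."""
    left = defaultdict(Counter)    # left[w][x]: x occurs immediately before w
    right = defaultdict(Counter)   # right[w][x]: x occurs immediately after w
    within = defaultdict(Counter)  # within[w][x]: sentences containing both w and x
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenisation
        for i, word in enumerate(tokens):
            if i > 0:
                left[word][tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[word][tokens[i + 1]] += 1
        # Count each co-occurring pair of word types once per sentence.
        types = set(tokens)
        for word in types:
            for other in types:
                if other != word:
                    within[word][other] += 1
    return left, right, within


# Toy input, invented for this example.
sentences = [
    "the corpus contains newspaper text",
    "the corpus contains web text",
    "the browser reads the corpus",
]
left, right, within = neighbour_counts(sentences)

print(left["corpus"].most_common(1))
print(right["corpus"].most_common(1))
print(within["corpus"]["text"])
```

In practice the raw counts would be fed into a significance test so that frequent function words do not dominate the collocate lists.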

Malay Concordance Project

classical Malay texts (c. 4 million words, including over 50,000 verses) which can be searched on-line. Host: Australian National University (ANU).

METU Turkish Corpus (& METU-Sabanci Turkish Treebank)

METU Turkish Corpus is a collection of 2 million words of post-1990 written Turkish samples. A subset of the corpus is used in METU-Sabanci Turkish Treebank. METU Turkish Corpus is XCES tagged at the typographical level. The distribution of the corpus also includes a workbench and related publications.

METU Spoken Turkish Corpus Project (ODT-STD)

Under construction (2008-Oct 2010). Aim: 1m words; face-to-face or mediated interactions in present-day Turkish.

Turkish National Corpus (TNC)

a balanced and representative reference corpus of contemporary Turkish. 4438 different text samples. TNC-Demo Version represents 9 domains and 34 genres with a size of 48 million words.

MLCC Multilingual and Parallel Corpora

The MLCC text corpus has two main components: one set to allow comparable studies to be carried out in different languages and one set as the basis for translation studies. The first set is referred to as the Polylingual Document Collection, a collection of newspaper articles from financial newspapers in 6 languages (Dutch, English, French, German, Italian and Spanish). The second set is a Multilingual Parallel Corpus consisting of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish. The parallel data, provided by the European Commission, comprises two sub-corpora from the Official Journal of the European Communities.

MULTEXT-East

Multilingual Text Tools and Corpora for Central and Eastern European Languages: corpora, lexica and tools

NEGRA Corpus version 2 (German)

355,096 tokens (20,602 sentences) of German newspaper text, taken from the Frankfurter Rundschau as contained in the CD "Multilingual Corpus 1" of the European Corpus Initiative; tagged with part-of-speech and completely annotated with syntactic structures; created as part of the projects NEGRA and LINC (Universität des Saarlandes) in Saarbrücken.

NEXING Corpus (Portuguese)

includes: (i) a collection of written transcriptions of verbal data elicited during an experiment on syllogistic reasoning; and (ii) performance data concerning that experiment, such as latencies, confidence levels and accuracy of answers provided. Freely available for download

Norwegian Spoken Language Corpus

pilot project intended to develop and test methods to enable the compilation of a digital, searchable spoken language corpus. The project has digitalised and transcribed some 18 hours of the speech of informants from Bergen, Voss and Tromøya. The material can be searched by means of a web browser, and following a search one can play back the relevant sound of the respective concordance lines.

Old French Corpus

a small collection of texts by C.R. Sneddon. Plain text format. Available through the OTA.

OPUS (an open source parallel corpus)

OPUS is an attempt to collect translated texts from the web, to convert and align the entire collection, to add linguistic data, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and will be delivered as an open source package as well. OPUS consists so far of the documentation of the office package OpenOffice.org with its original collection of 2014 files in English and 5 collections of translated texts: French, Spanish, Swedish, German, and Japanese. The English original comprises about 500,000 words. Not all files have been translated yet. The translations contain between 400,000 and 500,000 words. All documents have been tokenised and aligned on the sentence level (1830 language pairs). Version 0.2 of the corpus contains roughly 30 million tokens in 60 languages.

Oslo Corpus of Bosnian Texts

written corpus of 1.5 million words from several different genres: fiction (novels and short stories), essays, children’s stories, folklore, Islamic texts, legal texts, and newspapers and journals. Authors from Bosnia and Herzegovina, published in the 1990s. Free, but a password is needed for access.

Oslo Multilingual Corpus (OMC)

the Oslo Multilingual Corpus is an extension of the English-Norwegian Parallel Corpus (ENPC); 200 texts, or 2.6 million words.

PAROLE corpora (various languages)

Detailed descriptions of PAROLE text corpora and lexica may be found in the ELDA catalogue here. The languages involved in PAROLE corpora are: Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish. Not all of these have distributable corpora (the ELDA catalogue lists the ones in bold).

Polish National Corpus

Not completed yet. See other Polish corpus projects at the PELCRA web site.

Polish Learner English Corpus

Tagged with CLAWS C7 tag set. See the PELCRA web site.

Polish Spoken Conversational Multimedia Corpus

See the PELCRA web site.

English-Polish Parallel and Comparable Corpora

See the PELCRA web site.

Portuguese language resources

compiled in connection with the project Computational Processing of Portuguese (Processamento computacional do português). Has links to CETEMPúblico, CETENFolha, COMPARA, and Floresta sintá(c)tica.

Prague Dependency Treebank

The Prague Dependency Treebank (PDT) is a morphologically and syntactically annotated corpus of Czech as a representative of inflectionally rich free-word-order languages

Reuters Corpora

(registration required to get the CDs)

Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 [over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish). These stories are contemporaneous with RCV1, but some languages do not cover the entire time period.]

Russian Corpora

- BOKR: The Russian Reference Corpus (under construction): the Russian equivalent of the British National Corpus; an XMLised 100-million-word corpus covering the wide range of text types and registers in modern Russian.

- HANCO (Helsinki Annotated Corpus of Russian Texts): 100,000 running words, extracted from a modern Russian magazine and representing the modern Russian language.

- Links at Tübingen on Russian texts, including the Uppsala Corpus (Upsal’skij korpus russkix tekstov): 1 million words, 600 texts (informative texts from 1985-89; literary texts from 1960-88). Via the web query interfaces here, you can access the Uppsala Corpus of Modern Russian, a growing corpus of Russian interview texts, and other corpora.

- a subset of the Reuters news corpus (Leeds)

- Other links by Vladimir Rykov

- Computer Fund of the Russian Language (see the TRACTOR page)

- For more links, see the summary of Russian/Slavic corpora/NLP links posted to the LINGUIST list here

Russian Newspaper Corpus

samples of Russian newspapers from Moscow State University.

Corpus Of Serbian Language (CSL)

11-million-word Serbian written language corpus, divided into five samples. The first four samples (c. 4,000,000 words) cover the 12th to 20th centuries. The fifth sample includes contemporary language (about 7,000,000 words). Manually tagged for inflectional morphology, number of graphemes and syllables, and phonological structure; marked up for sentence and paragraph boundaries. The system of tagging consists of about 2000 grammatical (inflected) forms.

Scots Gaelic and Welsh

very small corpus of transcribed Scottish Gaelic and Welsh speech events (LER-BIML website)

Spanish Corpora

See Corpus del Español (Mark Davies), Corpus de Referencia del Español Actual (CREA) from Real Academia Española, Spanish On-line (Gothenburg).

Links to Spanish corpora from the Laboratorio De Lingüística Informática at Universidad Autónoma Madrid, including the UAM Spanish Treebank, the Corpus Oral de Referencia del Español Contemporáneo, Corpus de Referencia de la Lengua Española en Chile, Corpus de Referencia de la Lengua Española en la Argentina, etc.

Spoken Dutch Corpus (CGN)

Corpus Gesproken Nederlands; "a database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders. Upon completion, the corpus will contain approximately ten million words"

Swedish Language Resources:

  • Swedish Spoken Language Corpus
  • SynTag Tree Bank
  • Swedish PAROLE corpus
  • English-Swedish Parallel Corpus
  • (see also: entry for a Swedish dictionary (LEXIN) in the dictionaries section here)

    (1) Göteborg Spoken Language Corpus: about 1.3 million words of adult 1st-language spoken Swedish (Göteborg University)

    (2) SynTag Tree Bank: a Swedish tree bank, containing 158 newspaper articles (about 100K running words) from the Press-65 corpus. The corpus can only be used for research purposes and for higher education. Instructions are required as the format doesn’t follow modern markup standards. Contact Jerker Järborg (Jerker.Jaerborg@svenska.gu.se) for more information. Address: ftp://ftp.spraakbanken.gu.se/pub/reskit/syntag.zip.

    (3) Swedish PAROLE corpus. A morphosyntactically annotated corpus comprising about 19 million running words. The corpus can only be used for research purposes and for higher education. Address: ftp://ftp.spraakbanken.gu.se/pub/reskit/parole.zip
    There is also a web version of the Swedish PAROLE corpus (unrestricted access): http://spraakbanken.gu.se/lb/parole/
    (The Language Bank plans to release a new lemmatized and morphosyntactically annotated corpus of about 100 million running words at the end of 2002. The annotation is based on the information in the SAOL (The Swedish Academy Glossary).)

    (4) English-Swedish Parallel Corpus (ESPC) (Lund and Göteborg Universities): Parallel corpora can consist of comparable original texts in two or more languages ('comparable corpora') or of original texts and their translations into another language ('translation corpora'). The ESPC combines the advantages of these two types: 2.8 million words; 64 English text samples and their translations into Swedish and 72 Swedish text samples and their translations into English; two main text categories (fiction and non-fiction); samples are of 10,000-15,000 words, but in the non-fiction component there are also some shorter complete texts as well as some composite texts consisting of several shorter complete texts; size and proportion of the two text categories (in terms of running words) are roughly the same in the two languages.

    (5) Stockholm Umeå Corpus: 1 million words of written Swedish from the 1990s, with each word annotated with its part of speech, inflectional form and lemma; balanced according to genre, following the principles used in the Brown and LOB corpora. SUC was developed in a joint project between the universities of Stockholm and Umeå, and it is freely distributed for research purposes.

    Tanaka Corpus

    "parallel Japanese-English sentences; cannot be regarded as containing natural or representative examples of text in either language because of the way it was originally compiled and the artificial nature of the sources. Also it still contains a large number of errors and repetitions. It certainly should not be used for any statistical analyses of the text."

    Thai National Corpus

    The Thai National Corpus is designed as a general corpus of standard Thai. Only written texts are being collected at this moment. The aim is to include at least 80 million words. Texts are word-segmented and tagged following the Text Encoding Initiative (TEI) guidelines on text encoding. The TNC is designed to be comparable to the (written part of the) British National Corpus, so a comparative study between the two languages will be possible.

    Thai Bitext Corpus

    (Doug Cooper)

    On-line search of a collection of Thai and (mostly) English parallel translations or bitexts. The complete library can be searched for usage examples, or individual texts can be read in a variety of layouts. Not much more information (e.g. number of words), but there seem to be six texts at the moment. Bitext searches allow either Thai or any available second language.

    Thai Corpus & Concordancer (Wirote Aroonmanakun)

    On-line search of a 55-million-word collection of Thai texts (mainly newspaper texts (c. 49m words), journals (1.2m), academic writings and talks (2.2m), short novels (0.7m), laws (6.1m), and Prime Ministers' speeches (2.2m)). [1MB is about 170,000 Thai words]

    TIGER Corpus/Treebank (Versions 2.1 & 2.2)

    A German newspaper corpus, taken from the Frankfurter Rundschau. 900,000 tokens, 50,000 syntactically annotated sentences. Characteristics of the TIGER corpus: the labelling of edges to express relations between nodes and their children, and the use of crossing edges to describe long-distance dependencies like extraposed relative clauses. Includes a TIGERSearch query program designed to process data based on the rather complex data model of the TIGER corpus (it is, in theory, therefore able to process other available treebanks for other languages too).

    TransSearch (English-French translations)

    (not free). A tool that enables translators to submit queries to a translation memory, in order to locate ready-made solutions to all sorts of translation problems. A translation memory is a textual database made up of groups of documents that are mutual translations and in which the various links between translated segments are explicitly recorded. Includes a bilingual concordancer. Two databases are currently available: (i) The Hansard: debates of the Canadian House of Commons (April 1986 to May 2001), with English-French translations totalling many millions of words. (ii) The Canadian courts: documents drawn from the collected decisions of the Supreme Court of Canada, the Federal Court of Canada and the Tax Court of Canada (from 1986 to the present). Both these databases are periodically updated.

    Tübingen Treebank of Written German (TüBa-D/Z)

    A manually annotated German newspaper corpus taken from the daily issues of 'die tageszeitung' (taz). The annotation scheme distinguishes four levels of syntactic constituency: lexical, phrasal, topological fields, and clausal. In addition to constituent structure, annotated trees contain edge labels between node labels which encode grammatical functions. Words are annotated with inflectional morphology at the lexical level. The treebank currently comprises 104,787 sentences (1,959,474 words). 3 formats: NEGRA export, XML and Penn Treebank formats. Enriched with anaphoric and coreference relations referring to nominal and pronominal antecedents. Free of charge for scientific use.

    Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z)

    a collection of German articles from the taz newspaper which have been automatically annotated with clause structure, topological fields, and chunks, in addition to more low level annotation including parts of speech and morphological ambiguity classes. All texts were processed automatically, starting from paragraph, sentence and token segmentation. Tokens include information about some regular types of named entities, including dates, telephone numbers, and number/unit combinations. The TüPP-D/Z data are based on taz newspaper articles from September 2, 1986 up to May 7, 1999, consisting of more than 200 million word tokens. License will be granted for a nominal fee for scientific use.

    Turin University Treebank (TUT)

    A dependency-grammar-based treebank of Italian. 500 sentences.

    Turkish News Text

    Turkish texts which have been morphologically analyzed and disambiguated.

    UAM Spanish Treebank

    1,500 (in Sept 1999) syntactically annotated sentences of Spanish from the Laboratorio de Lingüística Informática at the Universidad Autónoma de Madrid. The goal is 5,000 sentences.

    UAGT-PNAW Parallel corpus of Welsh -- English

    a bilingual, sentence-aligned corpus of 510,813 sentence pairs in Welsh and English, taken from the proceedings of the National Assembly for Wales.

    Web-based text collections of minority languages (not systematically designed corpora) by Kevin P. Scannell.

    Texts collected from web pages using a web crawler: Welsh, Irish Gaelic, Catalan, Swahili, Maori, Faroese, Scots Gaelic, Walloon, Breton, Cebuano, Manx Gaelic. Particularly useful for a range of NLP and lexicographical projects.

    * Can’t find what you want? Try querying the OLAC archives
