L1 & L2 Acquisition Corpora

Corpora for research on 1st language acquisition

Child Language Data Exchange System (CHILDES)

~20 million words (180 million characters), 20 languages. The CHILDES system provides tools for studying conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, & systems for linking transcripts to digitized audio & video. Includes language acquisition bibliographies for multiple languages.
Lancaster Corpus of Children’s Project Writing (LCPW) A digitized collection of project work produced by children aged between 8 & 11; part of a larger research program (a longitudinal study of children’s writing-for-learning, based on the writing of 8-12 year old children)
Polytechnic of Wales (POW) Corpus 100,000 words of spoken English by 120 children, aged 6-12; parsed according to Hallidayan Systemic-Functional Grammar. See the manual here. Distributed from two places: The Oxford Text Archive & ICAME. The AMALGAM tagger emulates the POW tagset.

Learner & Lingua Franca Corpora

Language produced by non-native speakers/writers, for various languages/2nd Language Acquisition research. For a more structured and comprehensive overview of learner corpora, see the Learner corpora around the world page of the Centre for English Corpus Linguistics in Louvain.

International Corpus of Learner English (ICLE Ver. 2) As of May 2009, over 3.7 million words of writing by advanced/university learners of English (EFL, not ESL) from 16 different mother tongue backgrounds.
Two types of essay writing: (1) Argumentative essays (untimed); using language reference tools (dictionaries, grammars, etc.) but entirely the students' own work, i.e. no quoting, no native speaker help; (2) Literature examination papers (no more than 25% of each national corpus). Each essay: between 500 and 1,000 words long. In May 2009, there were 5,554 argumentative essays & 531 literary or 'other' essays.
LINDSEI (Louvain International Database of Spoken English Interlanguage) A corpus of spoken learner English from learners from 11 different language backgrounds (Bulgarian, Chinese, Dutch, French, German, Greek, Italian, Japanese, Polish, Spanish, & Swedish to date). Two types of speech: informal interviews (free talk on a given topic) and picture-prompted speech (based on a standard set of pictures). There is a comparable corpus of speech from English native speakers called LOCNEC (Louvain Corpus of Native English Conversation).
International Corpus Network of Asian Learners of English (ICNALE) The ICNALE includes more than 10,000 topic-controlled monologues, dialogues, and essays produced by college students in ten countries/regions across Asia (China, Hong Kong, Indonesia, Japan, Korea, Pakistan, the Philippines, Singapore/Malaysia, Taiwan, and Thailand), as well as by English native speakers. It currently comprises four modules: Spoken Monologue, Spoken Dialogue, Written Essays, and Edited Essays.
Fehlerannotiertes Lernerkorpus des Deutschen als Fremdsprache (FALKO; Ver 2.01) 381,447 words of essays and summaries, produced by learners and native speakers (for reference) of German. Annotated for POS and lemmata, as well as target hypotheses for learner errors.
International Corpus of Crosslinguistic Interlanguage (ICCI) An international effort headed by Yukio Tono at Tokyo University of Foreign Studies (TUFS). Two aims: (1) to compile corpora of the writing of young learners of English across different proficiency levels (from primary up to pre-university) and first language backgrounds (different mother tongues). There are currently 10 scholars from 8 different countries/regions (Hong Kong, Germany, Israel, Japan, Poland, Singapore, Spain, and Taiwan) actively contributing to this project. (2) to compile TUFS students’ L2 writing in the partner country’s languages (e.g., essays in Spanish written by Japanese learners of Spanish).
Chinese Learner English Corpus (CLEC)
(link no longer working)
1 million words of English compositions collected from 5 different levels of Chinese learners of English, tagged according to an error tagging scheme of 61 types of error (excludes stylistic errors & error sources, which are difficult to tag objectively & consistently); consists of a book & CD-ROM. The book has an introduction (in Chinese) which gives an account of the corpus design, the methodology used in the statistical analysis of the corpus, and the major findings, + an Alphabetical List, a Lemmatized List, a Word-Frequency Distribution, a Summary Table of Errors, & a List of Spelling Mistakes. The CD-ROM consists of the error-tagged corpus with a simple concordancer, & all the lists & tables of the book.
Available by mail: Shanghai Foreign Language Education Press, 295 Zhong Shan Bei Yi Road, Shanghai 200083, PRC. Contact Mrs. Fan Jianying, email sflep@sflep.com.cn Fax (+86)021-55512177. List price in PRC ¥76.00 plus 15% postage; Overseas: US$60.80 (including postage)
SWECCL (The Spoken & Written English Corpus of Chinese Learners) Developed under the leadership of Qiufang Wen, Lifei Wang, Maocheng Liang and Xiaoqin Yan, and published on CDs by Foreign Language Teaching and Research Press.
Spoken Learner Corpus (SLC) The Spoken Learner Corpus (SLC) Project is a collaboration between Trinity and the Centre for Corpus Approaches to Social Science (CASS) at Lancaster University.
The aim of the project is to create a large corpus of learner (and examiner) speech to be used in a wide range of research contexts, including Second Language Acquisition, language testing, L2 pedagogy and materials development.
The corpus, which is also known as the Trinity Lancaster Corpus, currently comprises 3.5 million words and has been created from recordings of Trinity’s Graded Exams in Spoken English (GESE) across a range of grades from B1–C2 on the CEFR scale. It represents language used in a variety of speaking tasks which reflect speech events in the world outside the test and covers multiple different language backgrounds.
Lancaster Corpus of Academic Written English (LANCAWE)
(link no longer working)
Academic writing samples from non-native speakers of English taking Study Skills/EAP pre-sessional & undergrad courses. There is also a small native speaker subcorpus that can be used for comparison. Some sub-corpora are organised according to writing task & topic, writer’s L1, writing conditions & time at which the piece was produced. The corpus contains more than one piece of writing from each learner: similar essays written by the same learner at different points in time (e.g., before, during & after the pre-sessional course), as well as different types of essays (e.g., descriptive, argumentative, etc.) written by the same learner at the same or different times. A longitudinal sub-corpus of LANCAWE is called the Hinestroza-Kim Corpus (HKC).
MELD (Montclair Electronic Language Learners' Database)
(link no longer working)
English (ESL) text written by all levels of learners in North America; publicly available; timed & untimed writing of undergrad ESL students, dated so that progress can be tracked over time. Demographic data is also collected for each student, including age, sex, L1 background, & prior experience with English. The essays are continuously being tagged for errors in grammar & academic writing as determined by a group of annotators. The database currently (May 2009) consists of 44,477 words of tagged text & another 53,826 words of text. Allows various analyses of student writing, from assessment of progress over time to relation of error type & L1 background. Errors are annotated independently by two trained annotators without reference to a pre-determined list of error types. The error annotation is then adjudicated by the two annotators in consultation with one of the project directors.
ELFA
(English as a Lingua Franca in Academic Settings)
(link no longer working)
Recordings & transcripts of spoken English used as a lingua franca in academic settings (Tampere University & Tampere Technological University in Finland). Sessions with speakers who all share an L1 are not included, nor are English language courses. Coded for speech event type/genre, discipline/domain, interaction type (dialogic/monologic), age group, gender, nationality & mother tongue.
VOICE (Vienna-Oxford International Corpus of English) A corpus of English as a Lingua Franca (i.e., English as the means of communication regarded as the most convenient one by speakers from different first-language backgrounds). The focus is on unscripted, largely face-to-face communication among competent speakers from a wide range of L1 backgrounds whose primary & secondary education & socialization did not take place in English. Speech events include private & public dialogues, private & public group discussions & casual conversations, & one-to-one interviews. An on-line search for VOICE 1.0 Online is available (pre-registration required).
FRIDA (French Interlanguage Database) A corpus of French as a foreign language, with a target size of 450,000 words.
EnglishTLC (English Taiwan Learner Corpus)
(link no longer working)
~2 million words of unrestricted running text written by learners of English in Taiwan (majority by senior high school & university students). Essentially a self-propagating corpus: EnglishTLC is integrated with the writing component of a web-based English learning platform called IWiLL. Partially annotated for errors; the annotation consists of comments made by teachers in their everyday process of correcting essays online using the IWiLL essay correction interface (the comments provide a window onto actual teacher feedback & teaching practice). The research interface provides a search function for extracting every error token marked by teachers on essays in the corpus. This function lists all comments in descending order of the number of instances marked as tokens of that error type. Each comment in this list then links to a listing of all of the sentences in EnglishTLC that have been marked as that error type. Since teachers are selective in the errors which they mark in student writing, this sort of annotation in EnglishTLC should be regarded as partial annotation. Heuristics have been devised for bootstrapping from these partially annotated texts to the extraction of further error tokens that the teachers left unmarked (see Wible et al. 2003 for details). Feedback effects are traceable: the errors that teachers have marked as feedback to the students are also indexed to any revisions the learner may have made to their essay after reading that teacher feedback. This makes it possible to uncover learners’ attentiveness to or grasp of comments given.
Hong Kong University of Science & Technology (HKUST) Corpus The biggest corpus of Chinese (Cantonese) learners of English (or, indeed, of any single group of learners of English). 25 million words, with grammatical & discourse-feature tags. Texts consist of written undergrad assignments & 'A-level' scripts. Contact: Gregory James, Language Centre, Hong Kong University of Science & Technology, Clear Water Bay, HK. See Milton, John & K.S.T. Tong (eds.) (1991). Text Analysis in Computer-Assisted Language Learning. Hong Kong: Hong Kong University of Science & Technology.
Learner Business Letters Corpus 209,461 word tokens in 1,464 letters written by Japanese business people. Searchable through a web concordancer here.
European Science Foundation Second Language Database (ESF) A computerized archive of the spontaneous second language acquisition of forty adult immigrant workers living in Western Europe, & their communication with native speakers in the respective host countries (France, Germany, Great Britain, The Netherlands & Sweden). For each target language, two source languages were selected.
Hungarian Learner English (JPU Corpus) Hungarian university students’ English
PICLE
(link no longer working)
The Polish component of ICLE. This corpus, along with some comparable English (undocumented) & Polish corpora, can be searched on-line using various tools provided.
SILS Learner Corpus of English (Waseda Univ) Essays by students at SILS, the School of International Liberal Studies at Waseda Univ, Japan; wide variety of backgrounds (majority Japanese); can be used to look at the effects of native language and educational background on writing skills in English; will be collecting many essays from each individual student (longitudinal), and both 1st and 2nd drafts, with teachers' comments.
Thai English Learner Corpus (TELC) Written corpus of 1.3 million words (on 23/5/2002), tagged for part of speech & lemma. Comprises writing samples of Thai EFL university students, collected since 1997 & still growing. 700,000 words of written English taken from university entrance exams at the Institute for English Language Education (IELE, Assumption University, Thailand) & 600,000 words from essays written by fourth year Thai EFL learners at the Institute. Searchable on-line, but limited to 100 concordance lines. For full access, contact the owners.
Tswana Learner English Corpus (TLEC)
(link no longer working)
Modelled on ICLE; corpus of 200K words of argumentative essays from advanced learners of English in institutions of higher learning in South Africa.
Longman Learners’ Corpus Not generally available (except by arrangement with publishers). Students & teachers throughout the world sent in essays & exam scripts to help create the Longman Learners' Corpus, a 10-million word computerised database made up entirely of language written by students of English. Every nationality, every language level is represented in the corpus & this provides a unique insight into learner English.
VALICO (Varietà di Apprendimento della Lingua Italiana: Corpus Online)
(link no longer working)
(Online Corpus of the Learning Varieties of the Italian Language); texts encode a set of sociolinguistic data to determine the learners' profiles (learners' age, gender, proficiency in Italian, knowledge of other languages, mother tongue…). Learners were given a common stimulus to elicit the texts, to allow comparisons across countries.
International Corpus of Learner Finnish Timeline: 2008-2011. Spontaneous texts produced by learners of Finnish around the world.
Cambridge Learner Corpus (CLC) Not generally available. A large collection of examples of English writing from learners of English all over the world; over 15 million words & expanding all the time; part of the Cambridge International Corpus (CIC); comes from anonymised exam scripts written by students taking Cambridge ESOL English exams around the world; each script is coded with information about the student’s first language, nationality, level of English, age, etc. Currently, it can only be used by authors & writers working for Cambridge University Press & by members of staff at Cambridge ESOL.
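
A note on the EnglishTLC entry above: its research interface is described as listing teacher comments in descending order of how many tokens were marked with each, with every comment linking to the sentences marked that way. The Python sketch below illustrates that kind of ranking and indexing only. It is not the IWiLL implementation; the sample comments, the sentences, and the function names index_by_comment and rank_comments are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: (teacher comment, sentence the comment was attached to).
# Real EnglishTLC annotations are not reproduced here.
marked = [
    ("subject-verb agreement", "He go to school every day."),
    ("article missing", "I bought book yesterday."),
    ("subject-verb agreement", "She like music."),
    ("wrong preposition", "I am good in math."),
    ("subject-verb agreement", "My friends was late."),
    ("article missing", "She is teacher."),
]


def index_by_comment(pairs):
    """Map each teacher comment to the list of sentences marked with it."""
    index = defaultdict(list)
    for comment, sentence in pairs:
        index[comment].append(sentence)
    return dict(index)


def rank_comments(index):
    """Order comments by number of marked sentences, most frequent first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    return sorted(index.items(), key=lambda item: (-len(item[1]), item[0]))


index = index_by_comment(marked)
ranking = rank_comments(index)

# Print the ranked comment list, then the sentences behind the top comment.
for comment, sentences in ranking:
    print(f"{len(sentences):>3}  {comment}")

top_comment, top_sentences = ranking[0]
for sentence in top_sentences:
    print(f"  [{top_comment}] {sentence}")
```

Because teachers mark errors selectively, counts produced this way reflect what was marked, not how often an error type actually occurs in the corpus.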

If you found this web site useful, or found an outdated link, don’t forget to let me know.