Bookmarks for Corpus-based Linguists

References, Papers, Journals

Preamble: This is a select bibliography of books only, rather than an exhaustive list of all useful articles/references in the field of CBL, which is probably impossible to keep (or keep current). Recently, more and more links to publishers' pages have been disappearing due to the restructuring of websites, so at least some of these may be outdated, and I’ll try to update them whenever I find the time. However, also have a look below at the on-line bibliographies kept on other sites. The section "On-line Papers & Dissertations in CBL" lists only web-accessible journal articles/papers/squibs which came to David’s attention while he was maintaining the site, leaving out paper-based journal articles. Please share with the community of scholars and contact me if you have a paper/squib or any resource available for downloading which you would like me to list and link to.

General Introductory Textbooks on Corpus-based Linguistics (CBL)

(The most recent titles are listed first)

Weisser, Martin. (2016). Practical Corpus Linguistics: an Introduction to Corpus-Based Language Analysis. Oxford: Wiley-Blackwell. (Shameless plug:) The first really practical introduction to Corpus Linguistics. Covers everything from collecting and cleaning up corpus data, via different types of analysis, to annotation.
Don’t just trust me on this statement, though; read the review on the Linguist List instead.
Cheng, Winnie. (2011). Exploring corpus linguistics: Language in action. Abingdon: Routledge. An introductory textbook. Publisher’s site is here
Hoffmann, Sebastian, Evert, Stefan, Smith, Nicholas, Lee, David, & Berglund Prytz, Ylva. (2008). Corpus Linguistics with BNCweb: A Practical Guide. Frankfurt am Main: Peter Lang. More than just a manual that explains how to use and exploit BNCweb, this book can be used as an introductory textbook on corpus-based linguistics. The BNCweb information page is here. Publisher’s site is here.
Adolphs, Svenja. (2006). Introducing Electronic Text Analysis: A practical guide for language and literary studies. Abingdon: Routledge. Publisher’s blurb: "…guide to ways in which the use of computers can complement more traditional types of text & discourse analysis along with a range of sample analyses of contemporary English language... exploration of literary texts, the study of ideology in text and discourse, and the use of electronic text analysis in the English Language Teaching context." Unfortunately, the companion site has disappeared.
Scott, Mike, & Tribble, Chris. (2006). Textual patterns: Keyword and corpus analysis in language education. Amsterdam: Benjamins. "Shows how key word analysis, combined with the systematic study of vocabulary and genre, can form the basis for a corpus informed approach to language teaching." Publisher’s web page is here.
McEnery, Tony, Xiao, Richard & Tono, Yukio. (2006). Corpus-based language studies: An advanced resource book. Abingdon: Routledge. "adopts a ‘how to’ approach with exercises and cases, affording students with the knowledge and tools to undertake their own corpus-based research." Unfortunately, the companion site has disappeared.
Meyer, Charles. (2002). English corpus linguistics: An introduction. Cambridge: Cambridge University Press. A review of this book on the LINGUIST list is available here. Read publisher’s description here.
McEnery, Tony, & Wilson, Andrew. (2001). Corpus linguistics (2nd Ed.). Edinburgh: Edinburgh University Press. First published textbook on corpus linguistics (in 1996). Has a companion site here which serves as a very basic introduction to the field. A review of the first edition may be found here.
Kennedy, Graeme D. (1998). An introduction to corpus linguistics. London: Longman. Valuable historical survey of the rise of corpora and corpus-based research, informative review of available resources and an intelligent, critical appraisal of some key research in CBL. Particularly geared towards language teachers/ELT.
Biber, Douglas, Conrad, Susan, & Reppen, Randi. (1998). Corpus Linguistics: Investigating language structure and use. Cambridge: CUP. A rather programmatic introduction to corpus-based linguistics and methodological primer, presenting one particular orientation (aims to be scientific and rigorous). Some bits are repetitive and some bits (on factor analysis) I take issue with, but, overall, cogently presented and lucidly persuasive. Lots of help with basic corpus-related statistics.
Lawler, John, & Aristar Dry, Helen. (Eds.). (1998). Using Computers in Linguistics: A Practical Guide. London: Routledge. Not so much a CBL textbook as a primer on linguistic computing (i.e. for beginners starting from scratch). An informative but much neglected book, in my opinion. Has a very readable history of the Internet & computing, and surveys software & web resources for linguistics; practical, basic discussion of the nature of linguistic data and how this affects how we use computers in language study; plus an introduction to language analysis on the UNIX platform. Has a companion site (includes an on-line version of the introduction). Lawler’s page has links to Chapter 5 (on Unix) of the book, the bibliography, the glossary, and the index.
Barnbrook, Geoff. (1996). Language and computers: A practical introduction to the computer analysis of language. Edinburgh: Edinburgh University Press. A very basic & practical introduction to using computers in language research in the humanities & linguistics, especially suited to real beginners (who are at the same time not afraid of technical details). Also usefully introduces basic programming techniques for linguists (AWK, in particular). Simply written and handily illustrated with lots of examples. * A review of this book is available here.

Other relevant titles (alphabetical, by last name)

(N.B. Conference proceedings are only selectively included. For detailed listings of these, please see the publishers' websites linked to below)

Aijmer, Karin, & Bengt Altenberg. (Eds). (1991). English corpus linguistics: Studies in honour of Jan Svartvik. London: Longman. perhaps a little dated, but contains some 'classic' papers which people still refer to.
Baker, Mona, Francis, Gill, & Tognini-Bonelli, Elena (Eds.). (1993). Text and technology: In honour of John Sinclair. Amsterdam: John Benjamins. focuses on three major areas of modern linguistics: discourse analysis, corpus-driven analysis of language, and computational linguistics.
Baker, Paul, Hardie, Andrew, & McEnery, Tony (2006). A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press. Publisher’s page is here
Baker, Paul. (2006). Using Corpora in Discourse Analysis. London: Continuum. Never goes beyond basics and stops when it would actually start to get interesting. Samples used for exemplification generally far too small to represent genuine corpus linguistics work. Publisher’s page is here
Barnbrook, Geoff, Danielsson, Pernilla, & Mahlberg, Michaela. (Eds.). (2005). Meaningful texts: The extraction of semantic information from monolingual and multilingual Corpora. London: Continuum. The common focus of all the papers is meaning, studied not only in monolingual environments, but also contrastively in multilingual contexts; largely derived from work being carried out by the partners of the TELRI projects; a survey of the current extent and depth of semantic investigation using corpora. Publisher’s page with contents here (abstracts of the papers here)
Biber, Douglas, Johansson, Stig, Leech, Geoffrey, Conrad, Susan, & Finegan, Edward. (1999). Longman grammar of spoken and written English. London: Longman. latest reference grammar based on a variety of corpora. Includes comparison of British & American English, differences across four genres (spoken and written). Publisher’s page here. A review in TESL-EJ may be found here.
Biber, Douglas. (2006). University language: A corpus-based study of spoken and written registers. Amsterdam: John Benjamins. "Based on analysis of the T2K-SWAL Corpus, the book describes university registers from several different perspectives, including: vocabulary patterns; the use of lexico-grammatical & syntactic features; the expression of stance; the use of extended collocations ('lexical bundles'); and a Multi-Dimensional analysis of the overall patterns of register variation… in university registers: academic and non-academic; spoken and written." Publisher’s page with contents here
Bowker, Lynne & Pearson, Jennifer. (2002). Working with specialized language: A practical guide to using corpora. London: Routledge. Targeted at translators, technical writers and subject specialists who are interested in a corpus-based approach to LSP (Language for Special Purposes). Publisher’s blurb here and companion web site (with exercises, weblinks, etc.) here.
Coffin, Caroline, Hewings, Ann, & O’Halloran, Kieran (Eds.). (2004). Applying English grammar: Corpus and functional approaches. London: Hodder & Stoughton. illustrates "how researchers can fruitfully bring together corpus and functional approaches to reveal how grammar and lexis create and transmit values, identities and ideologies.... [presents] work in CDA which brings together the methodologies of corpus linguistics & functional grammar, demonstrating their combined potential for illuminating ideological perspectives, particularly in media texts." Publisher’s blurb here.
Conrad, Susan, & Biber, Douglas. (Eds.). (2001). Variation in English: Multi-dimensional studies. Harlow: Longman. various studies based on Biber’s multidimensional (MD) approach (based on factor analysis). I am not a big fan of the MD methodology as a whole, so cannot find much to recommend here.
Deignan, Alice. (2005). Metaphor and Corpus Linguistics. Amsterdam: John Benjamins. Critiques, using corpus data, different ways of researching metaphor; "demonstrates the need for naturally-occurring language data to be used in the development of metaphor theory, and shows the value of corpus data and techniques in this work". Table of Contents on publisher’s site here.
Garside, Roger, Leech, Geoffrey, & Sampson, Geoffrey. (Eds.). (1987). The computational analysis of English: A corpus-based approach. London: Longman. one of the seminal classics in the field (though obviously outdated now in terms of details). See UCREL web site.
Garside, Roger, Leech, Geoffrey, & McEnery, A.M. (Eds.). (1997). Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman. informative book on the Lancaster approach to corpus annotations of all kinds. See UCREL web site for the blurb. Or see Longman page here for Contents & description
Sampson, Geoffrey, & McCarthy, Diana (Eds.). (2004). Corpus linguistics: Readings in a Widening Discipline. London: Continuum. a collection of articles previously published in journals & books, with an intro and running commentaries. Read the publisher’s blurb here.
Gries, Stefan. (2010). Quantitative Corpus Linguistics with R. London: Routledge. Demonstrates how to use the open source programming language R for corpus linguistic analyses – searching & processing corpora, arranging & outputting the results of corpus searches, statistical evaluation, & graphing. Publisher’s page with contents here. A review can be found here.
Halliday, M. A. K., Teubert, Wolfgang, Yallop, Colin, & Cermáková, Anna. (2004). Perspectives in lexicology and corpus linguistics: An introduction. London: Continuum. Publisher’s page with contents here. Reviews can be found here and here.
Hockey, Susan. (2000). Electronic texts in the humanities: Principles and practice. Oxford: Oxford University Press. "an introduction to the ways in which humanities scholars and students can use electronic texts for research and teaching in literature, linguistics, and history. The book goes beyond current Internet technology to show how computers can be used not only to show electronic texts, but to manipulate and analyse them." Contents page on publisher’s web site here.
Hoey, Michael, Mahlberg, Michaela, Stubbs, Michael, Teubert, Wolfgang. (2007). Text, discourse and corpora: Theory and analysis. London: Continuum. A collection of articles discussing theoretical positions and models in corpus linguistics.
Hundt, Marianne. (2007). English mediopassive Constructions: A cognitive, corpus-based study of their origin, spread, and current status. Amsterdam: Rodopi. "first empirical study of the history & spread of mediopassive constructions… and looks into text type-specific preferences for the construction. On a more abstract level, it combines the corpus-based description of mediopassive constructions with cognitive linguistic models, drawing largely on notions such as 'prototype', 'family resemblances', 'patch' and 'construction.'"
Hundt, Marianne, Nadja Nesselhauf & Carolin Biewer (Eds.). (2007). Corpus linguistics and the Web. Amsterdam/New York: Rodopi. Publisher’s description & contents page is here.
Kenny, Dorothy. (2001). Lexis and creativity in translation: A corpus-based study. Manchester: St. Jerome. Not reviewed yet. (If you’ve read this and have a review I could link to, please contact me)
de Klerk, Vivian. (2001). Corpus linguistics and World Englishes: An analysis of Xhosa English. London: Continuum. Publisher’s description here. A Linguist List review is here.
Leech, Geoffrey, Rayson, Paul, & Wilson, Andrew. (2001). Word frequencies in written and spoken English: Based on the British National Corpus. London: Longman. aims to be the successor to the ubiquitous (but outdated) West (1953) word list. Has a companion web site giving downloadable word frequency lists.
Lindquist, Hans, & Mair, Christian. (Eds.). (2004). Corpus approaches to grammaticalization in English. Amsterdam: Benjamins. Publisher’s description, contents page, and abstracts of all articles are here
Mair, Christian. (2006). Twentieth-century English: History, variation and standardization. Cambridge: Cambridge University Press. Publisher’s description here
Mukherjee, Joybrato. (2002). Korpuslinguistik und Englischunterricht: Eine Einführung. [Corpus Linguistics and English Language Teaching: An Introduction.] Frankfurt am Main: Peter Lang. an introductory book on CBL (in German), targeted at English language teachers in Germany.
Oakes, Michael P. (1998). Statistics for corpus linguistics. Edinburgh: Edinburgh University Press. all the statistics you’ll ever need for language processing. More for NLP researchers/computational linguists, and as a reference for linguists venturing into the darker corners of quantitative data analysis, and therefore not really for beginners. Written in a rather bland, no-frills, slightly formulaic style -- turgid, but concise and informative.
Pearson, Jennifer. (1998). Terms in context. Amsterdam/Philadelphia: John Benjamins. Could be of interest to terminologists, lexicographers, and people in ESP/LSP. Publisher’s description here.
Renouf, Antoinette, & Kehoe, Andrew (Eds.). (2005). The changing face of corpus linguistics. Amsterdam: Rodopi. highlights the growing emphasis on language as a changing phenomenon, both in terms of established historical study and the newer short-range diachronic study of 20th C & current English; and the growing area of overlap between these two. [includes papers on] recent changes in the definition of 'corpus' … due to the emergence of new technologies and… the World Wide Web; + a discussion panel on the relationship between corpus linguistics & grammatical theory. Publisher’s page here.
Reppen, Randi, Fitzmaurice, Susan M., & Biber, Douglas (Eds.). (2002). Using corpora to explore linguistic variation. Amsterdam/Philadelphia: John Benjamins. "illustrates the ways in which linguistic variation can be explored through corpus-based investigation. Two major kinds of research questions are considered: variation in the use of a particular linguistic feature, and variation across dialects or registers." Publisher’s web page and Table of Contents here.
Roach, Peter. (1992). Computing in linguistics and phonetics: Introductory readings. London: Academic Press. a collection of very basic and very introductory articles on how to exploit computers to analyse language. Rather dated now.
Rudanko, Martti Juhani. (2000). Corpora and complementation: tracing sentential complementation patterns of nouns, adjectives and verbs over the last three centuries. Lanham, Maryland/Oxford: University Press of America. Not reviewed yet. (If you’ve read this and have a review I could link to, please contact me)
Rudanko, Martti Juhani. (2002). Complements and constructions: corpus-based studies on sentential complements in English in recent centuries. Lanham, Maryland/Oxford: University Press of America. Not reviewed yet. (If you’ve read this and have a review I could link to, please contact me)
Rühlemann, Christoph. (2007). Conversation in context: A corpus-driven approach. London: Continuum. Table of Contents and more publisher’s blurb here.
Saito, Toshio, Nakamura, Junsaku, & Yamazaki, Shunji (Eds.). (2002). English corpus linguistics in Japan. Amsterdam: Rodopi. a collection of 20 papers reflecting the state of art in English corpus linguistics in Japan. Four sections: 1) Corpus-based studies of contemporary English, 2) Historical and diachronic studies of English, 3) English corpora and English language teaching, 4) Software for analyzing corpora. Table of Contents and more blurb here.
Sampson, Geoffrey. (2001). Empirical linguistics. London: Continuum. a collection of articles previously published elsewhere, plus two new ones. Many chapters relevant to corpus-based linguists. Read the publisher’s blurb here.
Santos, Diana M. (2004). Translation-based corpus studies: Contrasting English and Portuguese tense and aspect systems. Amsterdam: Rodopi. Not reviewed yet. (If you’ve read this and have a review I could link to, please contact me)
Sinclair, John. (1991). Corpus, concordance, collocation. Oxford: OUP. a foundational classic, but hard to get these days
Sinclair, John. (2004). Trust the text: Language, corpus and discourse. London: Routledge. Not reviewed yet.
Stubbs, Michael. (1996). Text and corpus analysis: Computer assisted studies of language and culture. Oxford: Blackwell. valuable survey of the nature of linguistic data and the place of CBL in the history of linguistics and linguistic theorising. Focuses on using corpora to do social linguistics (i.e. uncovering 'semantic prosodies' that reveal aspects of culture)
Stubbs, Michael. (2001). Words and phrases: Corpus studies of lexical semantics. Oxford: Blackwell. "fills a gap in studies of meaning by providing detailed case studies of attested corpus data on the meanings of words and phrases. It places lexis and phraseology at the centre of semantics & pragmatics...starts from traditional concepts of lexical semantics, including meaning as use, denotation and connotation, lexical field, sense relations, phraseology and collocation... The main chapters are detailed case studies of words in collocations, words in texts and words in culture. Concluding chapters discuss the implications of corpus analysis for linguistic theory, especially lexico-grammar and theories of competence and performance." Table of Contents and more blurb here
Studer, Patrick. (2008). Historical Corpus Stylistics: Media, Technology and Change. London: Continuum. "Using data from a newspaper corpus, this book offers the first empirical study into the development of style in early mass media. The book analyses how news discourse was shaped over time by external factors, such as the historical context, news production, technological innovation and current affairs, and as such both conformed to and deviated from generic conventions." Publisher’s page with contents page is here
Svartvik, Jan. (Ed). (1992). Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991. Berlin: Mouton de Gruyter. a little dated, but a useful anthology containing some interesting papers.
Teubert, Wolfgang & Cermáková, Anna. (2007). Corpus linguistics: A short introduction. London: Continuum. Publisher’s page with contents page is here. A review of this book on the Linguist List is here. (The book seems to be linked with a previously published book, Halliday, Teubert, Yallop, & Cermáková (2004), Perspectives in lexicology and corpus linguistics: An introduction.)
Thomas, Jenny, & Short, Mick. (Eds.). (1996). Using corpora for language research: Studies in the honour of Geoffrey Leech. London: Longman. Festschrift containing some interesting papers which show a range of linguistic issues related to corpora. Somewhat hastily thrown together and edited. See UCREL web site for the blurb. Or Longman page here for Contents & description.
Tognini-Bonelli, Elena. (2001). Corpus linguistics at work. Amsterdam: John Benjamins.
* See LINGUIST List review by Anna-Maria De Cesare here.
"The two main approaches to corpus work are discussed as the “corpus-based” & the “corpus-driven” approach and the theoretical positions underlying them explored in detail. The book adopts and exemplifies the parameters of the corpus-driven approach and posits a new unit of linguistic description defined systematically in the light of corpus evidence. The applications where the corpus-driven approach is exemplified are language teaching & contrastive linguistics."
Wray, Alison. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press. See Cobb’s review here

* In addition to the above, there are, of course, a number of volumes showcasing a wide variety of linguistic research based on corpora. Most notable are the volumes which proceed from the ICAME conferences. These volumes, unfortunately, come with variable titles and do not begin with 'Proceedings of ICAME...'. Fortunately, however, they are almost all published by the one publisher, Rodopi. Click on the link below to see their list of titles in corpus-based linguistics (mixed together with titles in computational linguistics).

Click to go to Rodopi Publisher’s site


John Benjamins: A list of current and upcoming titles in the series Studies in Corpus Linguistics can be found here. For Benjamins books in the fields of Computational and Corpus Linguistics, click here.

Routledge: A new research monographs series, Routledge advances in corpus linguistics, has been started. Search the current catalogue using the series title here.

Continuum: A series called Corpus and Discourse can be found here.

Reviews: Pieter de Haan has done some reviews of CBL books, available on-line here

CBL Books on Spoken Discourse & Prosody

Sergio, Francesco Straniero, & Falbo, Caterina. (2012). Breaking ground in corpus-based interpreting studies. Frankfurt am Main: Peter Lang. focuses on interpretation corpora (spoken language + its interpretation into another language). Publisher’s book URL: here
Aijmer, Karin. (2002). English discourse particles: Evidence from a corpus. Amsterdam: John Benjamins. Research on discourse particles (focussing on now, oh, just, sort of, and that sort of thing, actually) based on the London-Lund Corpus.
Breivik, Leiv Egil, & Hasselgren, Angela. (2002). From the COLT’s mouth - and others': Language corpora studies in honour of Anna-Brita Stenström. Amsterdam: Rodopi. See publisher’s blurb and table of contents here and Linguist List review here.
Mukherjee, Joybrato. (2001). Form and function of parasyntactic presentation structures: A corpus-based study of talk units in spoken English. Amsterdam: Rodopi. "investigates prosody-syntax interactions from a functional perspective and based on authentic corpus data. [London-Lund Corpus...] The focus of both the quantitative and the functional analysis is on the interplay between prosodic status and syntactic status at tone unit boundaries by means of which talk units as parasyntactic units are established [...] to structure information effectively and to allow for or facilitate turn taking." More of the publisher’s description here.
McCarthy, Michael. (1998). Spoken language and applied Linguistics. Cambridge: Cambridge University Press. Implications of corpus work (based on the CANCODE corpus) for language teaching and linguistics. Very interesting research. My beef with CANCODE-based research is that it cannot be verified easily because the corpus is restricted to in-house use at Nottingham and the publishing house.
Knowles, Gerry, Wichmann, Anne, & Alderson, Peter. (1996). Working with speech: Perspectives on research into the Lancaster/IBM Spoken English Corpus. London: Longman. Individual essays analyse the preparation of the corpus, the issues underlying transcription, grammatical analysis and the application of the corpus in speech research. See Longman page here for Contents & Description
Aijmer, Karin. (1996). Conversational routines in English: Convention and creativity. London: Addison Wesley Longman. Conversational routines = pre-fabricated phrases, fixed expressions, idioms, phraseology, lexicalised sentence stems (take your pick of the terminology); based on the London-Lund Corpus; provides a discoursal and pragmatic account of the more common expressions found in conversational routines, such as apologising, thanking, requesting and offering. See Longman page here for Contents & description
Leech, Geoffrey, Myers, Greg, & Thomas, Jenny. (Eds.). (1995). Spoken English on computer: Transcription, mark-up and application. London: Longman. A little dated, but still the best collection of papers on the nitty gritty of recording, compiling, transcribing and marking up spoken data. See Longman page here for Contents & Description
Stenström, Anna-Brita. (1994). An introduction to spoken interaction. London: Longman. Corpus-based [London-Lund Corpus] insights into patterns and structures in spoken discourse. See Longman page here for Contents & Description
Svartvik, Jan. (Ed.) (1990). The London-Lund corpus of spoken English: Description and research. Lund: Lund University Press. Detailed manual for the London-Lund corpus + some research based on it.

Corpus-based Books with a Pedagogical Focus

(the most recent books are listed first)

Kübler, Natalie. (Ed). (2011). Corpora, language, teaching, and resources: From theory to practice. Bern: Peter Lang. a selection of papers originally presented at the 7th Teaching and Language Corpora (TALC) Conference held in Paris in 2006. Publisher’s site here.
Braun, Sabine, Kohn, Kurt, & Mukherjee, Joybrato. (Eds). (2006). Corpus technology and language pedagogy: New resources, new tools, new methods. (English Corpus Linguistics, Volume 3) Frankfurt/Main: Peter Lang. "intended to take stock of some major developments in corpus-informed language pedagogy... present new resources, new tools and new methods for corpus-informed language pedagogy. In general, the papers demonstrate a noticeable shift from the more 'traditional' uses of corpora and corpus technology in linguistic research towards uses with specific pedagogical goals in mind."
Kettemann, Bernhard & Marko, Georg. (Eds). (2006). Planing, Gluing and Painting Corpora: Inside the applied corpus linguist’s workshop. Frankfurt am Main: Peter Lang. Table of Contents and Description at publisher’s site here
Gavioli, Laura. (2005). Exploring corpora for ESP learning. Amsterdam: John Benjamins. Investigates the effects of corpus work on the process of foreign language learning in ESP settings; suggests that observing learners at work with corpus data can stimulate discussion and re-thinking of the pedagogical implications of both the theoretical and empirical aspects of corpus linguistics. The ideas presented here are developed from the Data-Driven Learning approach introduced by Tim Johns in the early 90s. The experience of watching students perform corpus analysis provides the basis for the two main observations in the book: a) corpus work provides students with a useful source of information about ESP language features, b) the process of "search-and-discovery" implied in the method of corpus analysis may facilitate language learning and promote autonomy in learning language use. View Table of Contents and Description here.
Nesselhauf, Nadja. (2005). Collocations in a learner corpus. Amsterdam: Benjamins. On the basis of the German subcorpus of ICLE (the International Corpus of Learner English), advanced learners' performance in the area of verb-noun collocations (such as make a decision) is investigated. Idiosyncratic collocation use by learners is uncovered, the building material of learner collocations examined, and the factors that contribute to the difficulty of certain groups of collocations identified. An extensive discussion of the implications of the results for the foreign language classroom is also presented, and the contentious issue of the relation of corpus linguistic research and language teaching is thus extended to learner corpus analysis. Web site blurb here.
Römer, Ute. (2005). Progressives, patterns, pedagogy: A corpus-driven approach to English progressive forms, functions, contexts and didactics. Amsterdam: John Benjamins. a large-scale corpus-driven study of progressives in 'real' English and 'school' English; comparative analysis of more than 10,000 progressive forms taken from the largest existing corpora of spoken British English and from a small corpus of EFL textbook texts highlights numerous differences between actual language use and textbook language concerning the distribution of progressives, their preferred contexts, favoured functions, and typical lexical-grammatical patterns; pedagogical implications are derived, the integration of which then leads to a first draft of an innovative concept of teaching progressives - a concept which responds to three key criteria in pedagogical description: typicality, authenticity, and communicative utility.
Aston, G., Bernardini, S., & Stewart, D. (Eds). (2004). Corpora and language learners. Amsterdam: John Benjamins. from selected presentations at the 5th Teaching and Language Corpora conference in Bertinoro, Italy. Table of Contents and Description at publisher’s site here.
Granger, Sylviane, & Petch-Tyson, Stephanie. (Eds.). (2003). Extending the scope of corpus-based research: New applications, new challenges. Amsterdam: Rodopi. Papers from the 22nd ICAME conference in Louvain-la-Neuve, Belgium. View Table of Contents and Description here. A review from the LINGUIST List is available here
Granger, Sylviane, Hung, Joseph, & Petch-Tyson, Stephanie. (Eds.). (2002). Computer learner corpora, second language acquisition, and foreign language teaching. Amsterdam: John Benjamins. "takes stock of current research into computer learner corpora conducted both by ELT and SLA specialists; assesses relevance of corpora to SLA theory and ELT practice. Throughout the volume, emphasis is also placed on practical, methodological aspects of computer learner corpus research, in particular the contribution of technology to the research process. The advantages and disadvantages of automated and semi-automated approaches are analyzed, the capabilities of linguistic software tools investigated, the corpora (and compilation processes) described in practical insight to researchers who may be considering compiling a corpus of learner data or embarking on learner corpus research" View Table of Contents and Description here. A review from the LINGUIST List is available here
Hunston, Susan. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press. An interesting and wide-ranging survey of the uses of corpora in applied linguistics.
Tan, Melinda. (Ed.). (2002). Corpus studies in language Education. Bangkok: IELE Press. a collection of papers from scholars in Europe and Asia who describe the applications of corpora in various areas of language education, using authentic data taken from different English reference and learner corpora around the world. Each section of the book introduces the reader to a specific area in language education which could benefit from the findings presented by the corpus-based investigations in this collection. Two appendices have also been provided with useful information to assist readers who might be interested in conducting their own corpus-based investigations: a) a list of on-line corpus resources in English and b) a background description of the first on-line Thai English Learner Corpus (TELC) in South-East Asia. View Table of Contents and Description here.
Kettemann, Bernhard, & Marko, Georg. (Eds.). (2002). Teaching and learning by doing corpus analysis. Amsterdam: Rodopi. Proceedings of the Fourth International Conference on Teaching and Language Corpora [TALC 2000], 19-24 July, 2000, Graz.
Ghadessy, Mohsen, Henry, Alex, & Roseberry, Robert L. (Eds.). (2001). Small corpus studies and ELT: Theory and practice. Amsterdam: John Benjamins. "ultimate aim of this book is to encourage the exploitation of small corpora by the ELT profession to make language learning more effective. In addition to descriptions of the basic corpus analysis tools, chapters in the collection cover syllabus and materials design, comparisons of different genres, descriptions of local and functional grammars, compilation and use of learner corpora, and making cross-linguistic comparisons. The message of this collection is that language use is purposeful and culture specific and that small corpus analysis is an effective method of linguistic investigation."
Aston, Guy. (Ed.). (2001). Learning with corpora. Houston TX: Athelstan; Bologna: Cooperativa Libraria Universitaria Editrice. Not reviewed yet. Table of contents may be viewed here.
Burnard, Lou, & McEnery, Tony. (Eds.). (2000). Rethinking language pedagogy from a corpus perspective. Frankfurt am Main: Peter Lang. Papers from the Third International Conference on Teaching and Language Corpora [TALC 98], 24-27th July 1998, Oxford University.
Hunston, Susan, & Francis, Gill. (2000). Pattern grammar: A corpus-driven approach to the lexical grammar of English. Amsterdam: John Benjamins. describes an approach to lexis and grammar based on the concept of phraseology and of language patterning arising from work on large corpora. View blurb here and a review in System here (this link requires subscription to ScienceDirect)
Lewis, Michael. (Ed.). (2000). Teaching collocation: Further developments in the lexical approach. Hove: Language Teaching Publications. Not reviewed yet. (If you’ve read this and have a review I could link to, please contact me)
Botley, Simon P., McEnery, Tony, & Wilson, Andrew. (Eds.). (2000). Multilingual corpora in teaching and research. Amsterdam: Rodopi. see UCREL web site
Aston, Guy, & Burnard, Lou. (1998). The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press. main title rather misleading; could be of interest if you have to use SARA with BNC version 1; somewhat outdated since the release of BNC World Edition (for which you should use the updated on-line tutorial here instead)
Partington, Alan. (1998). Patterns and meaning: Using corpora for English language research and teaching. Amsterdam: John Benjamins. suggests ways of exploiting corpus data to shed light on language phenomena such as word sense, phraseology and syntax, metaphor and creative use, text reference, idiom, and translation. Emphasis is given to information that usually cannot be found in dictionaries, grammars, language textbooks or other resources, but which the study of corpus data makes available. Publisher’s page here.
McCarthy, Michael. (1998). Spoken language and applied linguistics. Cambridge: Cambridge University Press. Implications of corpus work (based on the CANCODE corpus) for language teaching and linguistics.
Granger, Sylviane. (Ed.). (1998). Learner English on computer. London: Longman. The main text on learner (non-native) language corpora. View Table of Contents and Description here. A review of the book is available here.
Lewis, Michael. (1997). Implementing the lexical approach: putting theory into practice. Hove: Language Teaching Publications. Not reviewed yet. (If you’ve read this and have a review I could link to, please contact me)
Tribble, Chris, & Jones, Glyn. (1997). Concordances in the classroom: A resource book for teachers. Houston, Texas: Athelstan Press. Second edition, and still going strong. One for the language teachers.
Wichmann, Anne, Fligelstone, Steve, McEnery, Tony, & Knowles, Gerry. (Eds.). (1997). Teaching and language corpora. London: Longman. Papers from the First International Conference on Teaching and Language Corpora [TALC 94], 10-13th April, 1994, Lancaster University, UK. See UCREL web site. Or Longman page here for Contents & description.
Botley, Simon, et al. (Eds.). (1996). Proceedings of Teaching and Language Corpora. UCREL Technical Paper. Lancaster: Lancaster University. Papers from the Second International Conference on Teaching and Language Corpora [TALC 96], 9-12th August, 1996, Lancaster University, UK
Lewis, Michael. (1993). The lexical approach: The state of ELT and a way forward. Hove: Language Teaching Publications. main focus is on collocation as a key feature in ELT syllabus design
Willis, David. (2003). Rules, patterns and words: Grammar and lexis in English language teaching. Cambridge: CUP. "illustrates a new way of describing the grammar of spoken and written English...demonstrates how lexical phrases, frames and patterns provide a link between grammar and vocabulary...discusses how the different aspects of the language require different learning processes and different teaching techniques. These processes and techniques are contextualised within a task-based approach to teaching and learning. Numerous interactive tasks are provided to guide readers and over forty examples of teaching exercises are included to illustrate techniques which can be applied in the classroom immediately."
Willis, David. (1990). The lexical syllabus: A new approach to language teaching. London: Collins ELT. emphasizes the role of lexis in language teaching, based on the Collins COBUILD corpus. Out of print; now available on-line at:

* See also: Tognini-Bonelli (2001) listed earlier.

Books with a Computational Focus

Hammond, Michael. (2002). Programming for linguists: Java (tm) technology for language researchers. Oxford: Blackwell. Not reviewed yet. (If you’ve read this and have a review I could link to, please contact me)
Baayen, R.H. (2001). Word frequency distributions. Dordrecht: Kluwer Academic Publishers (Text, Speech and Language Technology, 18) technical treatise on word frequency distributions for NLP people.
Mason, Oliver. (2000). Programming for corpus linguistics: How to do text analysis in Java. Edinburgh: Edinburgh University Press. for those who want to go beyond the available standard packages and write their own programs for text analysis. Details here (with a list of errata)
Jurafsky, Daniel, & Martin, James. (2000). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. London: Prentice Hall. more for NLP/computational linguistics people. Has companion web site here, with errata and LaTeX bib file.
Manning, Chris, & Schütze, Hinrich. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press. an excellent, largely readable (but, unsurprisingly, heavily mathematical/technical) introduction to NLP containing a number of chapters which corpus linguists would find useful. Has a companion website here.

On-line Bibliographies on CBL (corpus-based linguistics) & related areas

For general bibliographical searches in other areas of linguistics, see below.

* Bibliography on Corpus-based Linguistics (maintained by Przemyslaw Kaszubski) Up-to-date (and regularly updated) bibliography which aims to be comprehensive (includes Altenberg’s listings and goes beyond them, i.e. 1998 onwards). Available in two formats: Microsoft Access 2000 database or Plain text tab-delimited file (Windows, Central European)
* On-line, searchable ICAME bibliographical database
(relates only to English language corpora)
Search corpus-based journal articles, books, edited collections, etc. by Author, Title, Year, etc. At the time of writing (22/5/2002), it only includes entries up to 1999. It is updateable (you may submit new entries for inclusion).
LLT’s Corpora Bibliography Part of the Language Learning and Technology journal’s special issue on 'Using Corpora in Language Teaching and Learning'
Bibliography at links and references for the use of corpora, corpus linguistics and corpus analysis in the context of language learning and teaching.
Bengt Altenberg’s ICAME bibliography
(relates only to English language corpora) The mother of bibliographies on corpus-based research on English up until 1998, but now superseded by the on-line ICAME bibliographical database (see above link) and Kaszubski’s database.
Go to the ICAME web page here and click on the links near the bottom of the page, or use the direct links below:
Part 2 (Bibliography for CBL research up to 1989): in TXT format only (59 KB)
Part 3 (1990-98): in HTML (115 KB), RTF (145 KB) or Word 6 DOC format (127 KB)
(Part 1 is in print form only, as an article in: Johansson, Stig & Anna-Brita Stenström (eds). 1991. English computer corpora. Selected papers and research guide. Berlin: Mouton de Gruyter.)

Learner Corpora Bibliographies

CECL bibliography on Learner Corpora a select bibliography of works related to learner (non-native language) corpora
Yukio Tono’s bibliography on Learner Corpora - ditto -

Other Relevant Bibliographies

* BL Online (Bibliographie Linguistique) a searchable bibliographical database of linguistics. The BLonline database provides bibliographical references to scholarly publications on all branches of linguistics and all the languages of the world, irrespective of language or place of publication. It contains all entries of the printed volumes of Bibliographie Linguistique/Linguistic Bibliography for the years 1993-1998 and an increasing number of more recent references.
A Bibliography of Phraseology with links to phraseological dictionaries and other related biblios.
Bibliography on Multiword Expressions follow link on the home page of the MWE Project
Yukio Tono’s lexicography links useful sites on lexicography
John Higgins’s CALL bibliography on computer-assisted language learning
Bernhard Kettemann’s bibliography esp. relevant for language teachers
Bibliography on Parallel Corpora prepared by Jean Véronis & Marie-Dominique Mahimon
Michael Barlow's Corpus Linguistics Bibliography incomplete and out of date, but useful as a starting point
Joaquim Llisterri’s Bibliography on corpus-related work not comprehensive, but suggestive and categorised.
W3C’s Bibliography on Corpus Linguistics rather dated, but a starter.
SEU’s ICE-GB Bibliography books and articles using material from the original Survey Corpus (‘the Quirk corpus’) or from the British Component of the International Corpus of English (ICE-GB) up to January 2001.
Select bibliography for humanities computing detailed (but outdated) bibliography (1996) for beginners in the humanities
Human-Computer Interaction (HCI) bibliography includes stuff on assistive technologies, user interfaces in info retrieval, etc.; not directly relevant for corpus-based linguists, but some HCI research is corpus-based

On-line Papers & Dissertations related to CBL, use of corpora in the classroom, etc.

(for non-on-line publications, use the bibliographies above)

Please click the link below:
View links to On-line Papers and Dissertations - click here

Some EAGLES (Expert Advisory Group on Language Engineering Standards) Papers relevant to CBL

Preliminary recommendations on corpus typology aims to offer a sound and reasonably replicable way of classifying corpora, with clearly delimited categories wherever possible, and informed suggestions elsewhere. The paper has been reviewed by many experts in the field, who are in broad agreement that to present a more rigorous classification would be intellectually unsound and would be ignored by the majority of workers in the field. The present paper has a chance of acceptance because it raises the relevant issues and offers usable classifications.
Preliminary recommendations on text typology some views on external and internal criteria for the classification of texts in corpora
Recommendations on corpus encoding on the Corpus Encoding Standard (CES), which specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora.
Browse the Table of Contents at the EAGLES site a whole lot of other guidelines on annotation, speech corpora, linguistic software and lexicons

Journals relevant to CBL (some on-line)

International Journal of Corpus Linguistics (IJCL) language research, lexicography and natural language processing (NLP); TOC and Abstracts on-line
ICAME Journal (from the International Computer Archive of Modern and Medieval English) articles and information about English computer corpora; Full, free on-line access.
Corpus Linguistics and Linguistic Theory new, peer-reviewed journal publishing high-quality original corpus-based research focusing on theoretically relevant issues in all core areas of linguistic research. It will feature papers that: develop new corpus-linguistic methods or extensions of existing methods of interest in the context of linguistic theorizing; test or evaluate theoretical claims using corpus data and corpus-linguistic methods; offer systematic and detailed analyses of individual linguistic phenomena within a theoretical framework; compare corpus data to other kinds of empirical data, such as experimental or questionnaire data. Also will contain: critical surveys of relevant areas of research; squibs; reviews of new books, corpora, or software packages.
Language Learning & Technology a refereed on-line journal for second & foreign language educators; Free & full on-line access
Corpora a new journal; research findings based on the exploitation of corpora as well as accounts of corpus building, corpus tool construction and corpus annotation schemes; 3 key features: (1) Theoretical inclusiveness (2) Interdisciplinarity and disciplines (e.g. cultural studies, historical studies, literary studies) (3) Multilinguality (not just English or major European languages)
English for Specific Purposes Not specifically for corpus-based work, but very open to it. "international peer-reviewed journal; topics relevant to the teaching and learning of discourse for specific communities: academic, occupational, or otherwise specialized."
System an international journal of educational technology and applied linguistics (subscription-based)
Literary and Linguistic Computing all aspects of computing and information technology applied to literature and language research and teaching. Papers include results of research projects, description and evaluation of techniques and methodologies, and reports on work in progress; TOC and Abstracts on-line
Journal of Quantitative Linguistics "all aspects of language and text phenomena, including the areas of psycholinguistics, sociolinguistics, dialectology, pragmatics, etc., as far as they use quantitative mathematical methods (probability theory, stochastic processes, differential and difference equations, fuzzy logics and set theory, function theory etc.), on all levels of linguistic analysis." (Not for the mathematically challenged amongst us.)

Journals geared towards humanities subjects

Computers & Texts (ceased publication -- back issues only) journal/newsletter of the CTI Centre for Textual Studies (which no longer exists)
Computers and the Humanities (as it says: computing applications in the humanities)
CHWP (Computing in the Humanities Working Papers) interdisciplinary series of refereed publications on computer-assisted research; has articles, preprints, postprints, essays, experimental papers and mutanda.

Others of interest (CALL and pedagogy)

Computers & Education papers on cognition, educational or training systems development using techniques from and applications in any technical knowledge domain.
CALL - An International Journal On Computer-Assisted Language Learning.
ReCALL fully-refereed journal of EUROCALL (European Association for Computer Assisted Language Learning).
CALL-EJ On-line Free on-line journal on CALL.
Humanising Language Teaching Free on-line magazine for language teachers that has a section on corpora in each issue.

On-line Papers, Archives and Bibliographies relevant to NLP

The Computation and Language E-Print Archive (or UK mirror here) has on-line proceedings
Bibliography of Computational Linguistics from the Collection of Computer Science Bibliographies
Survey of the State of the Art in Human Language Technology (1997) some stuff here is relevant to corpus-based linguists, but bear in mind the date (1997). PDF version here.
Multilingual Corpora - Current Practice and Future Trends by Tony McEnery; a survey of multilingual corpus building

Library/Citation Searches

Bath Information & Data Services (BIDS) bibliographic search service for the UK
Bibliography of Computational Linguistics from the Collection of Computer Science Bibliographies
Bibliographic Databases at Essex several searchable databases, mainly on computational linguistics & machine translation, but also on LFG, HPSG, syntax and semantics
Web of Science (subscribers only) journal citations search service for the UK, covering all disciplines (find out who has been quoting whom where). Password needed.
COPAC search the merged online catalogues of 20 of the largest university research libraries in the UK and Ireland. Free access. Useful for finding that last bit of information you’re always missing from your bibliography (year, publisher, etc)!
getCITED getCITED is a free, online, member-controlled academic database, directory and discussion forum. Its contents are entered and edited by members of the academic community. By putting its content in the hands of its members, getCITED makes it possible to enter in and search for publications of all types. This means that, in addition to the books and articles accessible with other databases, book chapters, conference papers, working papers, reports, papers in conference proceedings, and other such research outlets can all be entered and then searched for within getCITED. In addition, getCITED makes it possible to link publications with all the publications in their bibliographies, thereby making possible a wide variety of publication and citation reports.
ResearchIndex (formerly CiteSeer) a scientific literature digital library and citation search site that indexes Postscript and PDF research articles on the Web. Only indexes 'scientific' (i.e. computational linguistics) papers.

On-line Document Archives

Cogprints an electronic archive for papers in any area of Psychology, Neuroscience, and Linguistics, and many areas of Computer Science, Philosophy and other disciplines pertinent to the study of cognition. It is a free, electronic self-archiving service, where you can archive your own papers, whether published or not, refereed or not, and where you can read or download the papers of others. It provides a way to make scholars' pre-refereeing preprints and refereed, published reprints available to the world scholarly and scientific community. Suggests ways to get around the copyright issue.

Did you find this web site useful? Most people, sadly, don’t bother to let me know, but if you want to encourage me to keep updating the site, drop me a line.
