Corpus-based Linguistics

Note: This site was originally created by David Lee, who has recently done me the great honour of handing it over to me for maintenance & administration. I hope I’ll prove myself worthy of his trust! Taking over such a site from someone else and continuing to do the original ideas justice is a difficult task, but one that I hope has been made easier by the fact that I seem to share many of David’s original ideas. This is why I’m also retaining the next few paragraphs, which explain the general philosophy of the site, largely as they were originally written, with only slight modifications.

In terms of content & structure, I started by cleaning up the original HTML (written, sadly, in Word) and reformatting the site. This has now been followed by a certain degree of re-structuring and housekeeping, starting with breaking down the software & (main) corpora pages into multiple sections, which has hopefully made these more manageable. I will continue re-structuring, updating, and pruning the remaining pages over the next few months. As all this is bound to involve a few mishaps, please feel free to alert me to any strange 'features' you might observe, as well as to provide comments on my restructuring efforts.

Why this site?

Most similar sites tend to be quite outdated, or haven’t been spring-cleaned for years. Also, they’re either not as comprehensive or wide-ranging in scope, or don’t cater to applied linguists. Having this bookmarks site on the web means all you corpus-based linguists out there won’t have to each create your own, so if you find a really good web site or resource not already listed here, don’t keep it to yourself! Let me know, and I’ll put it up here to share with others. Let’s make this web site a collaborative resource.

Having said this, please avoid sending me information that is relevant only to computational linguistics and of little relevance to corpus linguistics.

What’s here?

The annotated links on this site are mainly meant for linguists and language teachers who work with corpora, not computational linguists/NLP people, so even if the language-engineering-type links here may be fairly extensive, they are certainly not exhaustive, and do not aim to be. For such info, you’ll have to look elsewhere. The material here also represents my/our personal interests and biases (which will be obvious in some of my descriptive notes), and consequently there may be gaps, errors and omissions which you are welcome to tell me about. The English language bias on these pages will, I hope, be forgiven. It simply reflects my/our research interests and what’s available out there.

Why ‘CBL’?

I use this acronym for three reasons:

to put the focus on linguistics: i.e. what we primarily do is ‘linguistics’ – it just happens to be corpus-based (or ‘corpus-driven’, ‘corpus-informed’, whatever you want to call it)
‘CL’ would be confusing since it is already widely used to mean ‘computational linguistics’
the term corpus linguistics, while shorter and more popular, tends to give the impression that it is a branch of linguistics rather than just a methodology which can be applied to any existing branch of linguistics. Our interest should be in language, not corpora per se. I therefore personally avoid using the term ‘corpus linguistics’.

How to use this site?

Here, you’ll find links to any- & everything to do with the use of language corpora. The links are categorised and annotated to facilitate browsing/searching. Just click on a category in the left frame to see a list of links in this main window.

Before you delve deep into the links, though, it may be useful to clarify some conceptual or terminological issues, so that you’ll be able to understand the content of the site better.

How to categorise corpora?

Kennedy (1998) suggests a three-way categorisation of corpora:

Pre-electronic corpora, i.e. biblical & literary studies, early dictionaries, etc.,
First-generation Corpora, generally based on the ‘BROWN model’,
Second-generation (Mega) Corpora, such as the BNC & COCA.

We’ll initially follow this distinction, but will then expand our categories, based on the way in which corpora have developed further in recent years and for various purposes.

What’s the difference between ‘Corpora’, ‘Collections’, & ‘Data Archives’?

Throughout the site, you may encounter frequent references to these terms, so we’d best clarify them first.

Some definitions (from Atkins, Clear, & Ostler. (1992). Corpus design criteria. Literary & Linguistic Computing, 7(1), pp.1-16):

Archive: a repository of readable electronic texts not linked in any coordinated way, e.g. the Oxford Text Archive
Electronic Text Library (or ETL, Fr. 'textothèque'): a collection of electronic texts in standardized format with certain conventions relating to content, etc., but without rigorous selectional constraints.
Corpus: a subset of an ETL, built according to explicit design criteria for a specific purpose, e.g. the Corpus Révolutionnaire (Bibliothèque Beaubourg, Paris), the Cobuild Corpus, the Longman/Lancaster corpus, the Oxford Pilot corpus.

What are some of the file formats I can download materials in?

If you’ve downloaded a file that you don’t know what to do with, here are some pointers:

Zipped/compressed files (files ending in .zip or .gz or .tar)	Use 7zip (freeware) to extract them.
Postscript files (.ps)	may be previewed or read on screen before printing (or instead of printing) using Ghostview (+ the Ghostscript interpreter) [Get latest version of Ghostview + Ghostscript].
Adobe Acrobat (.PDF) format	Get Adobe Acrobat Reader (freeware) to view the files.
Microsoft Word format (files ending in .doc(x)) or PowerPoint files (.ppt(x))	if you don’t have a particular version of Word or PowerPoint you can install the freeware package OpenOffice, which will display the content and also allow you to extract the textual data.
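
If you’d rather unpack a downloaded file with a short script than with a graphical tool, the following is a minimal sketch that uses only Python’s standard library. The function name extract_download and the default folder name extracted are my own illustrative choices, not part of any of the tools mentioned above, and the sketch assumes that the archive comes from a source you trust.

```python
import gzip
import shutil
import tarfile
import zipfile
from pathlib import Path


def extract_download(path, dest="extracted"):
    """Unpack a .zip, .tar(.gz) or plain .gz file into dest and return dest.

    Hypothetical helper for illustration; only use it on trusted archives.
    """
    path = Path(path)
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            archive.extractall(dest)
    elif tarfile.is_tarfile(path):
        # Covers plain .tar files as well as compressed ones such as .tar.gz
        with tarfile.open(path) as archive:
            archive.extractall(dest)
    elif path.suffix == ".gz":
        # A lone .gz file wraps a single file; drop the suffix for the output name
        target = dest / path.stem
        with gzip.open(path, "rb") as src, open(target, "wb") as out:
            shutil.copyfileobj(src, out)
    else:
        raise ValueError(f"Unrecognised archive type: {path.name}")
    return dest
```

Calling extract_download("corpus_sample.zip") (a made-up file name) would place the contents in a folder called extracted next to your script.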

What do some of these terms mean? A mini-glossary.

Corpus (plural=corpora)	“a collection of pieces of language [texts] that are selected and ordered according to explicit linguistic criteria in order to be used as a [representative] sample of the language” (taken from EAGLES 1996:4). A corpus can be synchronic (closed), presenting a snapshot of the language of a particular period, or it can be a monitor corpus (e.g., the Bank of English), where new material is added on a continual basis.
Concordance	A formatted/sorted listing of all the instances of a search term (word/phrase) in a corpus, usually in a ‘KWIC’ format with the search term in the centre of the page or screen. Hence we talk about concordance lines (individual search ‘hits’) produced by the concordancer (the software).
KWIC/KIIC	Keyword/Key Item in Context: a display format showing the search item (word/phrase) plus the surrounding words/characters to the left and right of it. Useful for examining how a word/phrase is used in real samples of language (embedded in real-life co-texts, genres, and social relationships/contexts); helps analysts discover regularities or patterns governing the usage of the item/word/phrase.
Collocation	The phenomenon, tendency or specific instance of words/lexical items habitually co-occurring close to one another (i.e. the greater-than-chance co-selection of words revealing the language habits of native speakers). For example, if you look up the word jubilee in a large collection of (British) English texts you will tend to find the following words (the collocates) nearby: silver, diamond, golden, Queen’s, and line (the ‘Jubilee Line’ is one of the London Underground subway routes). In language teaching, learners are shown or taught collocations in order to help them speak and write natural-sounding ('idiomatic') language. Some nouns, for example, have very strong verb collocates: e.g. conclusions are drawn/reached, but not made [ = noun-verb collocation]. The term ‘collocation’ is very broad and allows for varying degrees of collocability (or collocational strength), which is measured via several statistical formulae (e.g. log-likelihood, mutual information). At one extreme of the scale, collocations which are totally predictable are usually analysed as ‘idioms’, ‘cliches’, ‘fixed expressions’, ‘lexical bundles’, etc. At the other extreme, items which co-occur significantly in statistical terms may not be recognised as predictable collocations by native speakers, i.e. the collocational regularity or statistical cooccurrence is there, but it may not have any psycholinguistic reality for native speakers. Some abstract patterns of meaning resulting from collocations (whether intuitive or not) form a system referred to as ‘semantic prosody’ (a systematic connotational ‘colouring’ of a word or phrase that arises from its collocational patterning into one or more semantic sets).
ASCII text format	“American Standard Code for Information Interchange” = printable text format = Plain vanilla Anglocentric text format, based on Roman/English orthography, essentially consisting of everything you can see on an ordinary US computer keyboard: letters (A-Z, a-z), digits (0-9), punctuation marks, plus a few miscellaneous symbols ($ % @ # ~ * & _ + - ( ) < > { } | \ ^ etc.). No ‘exotic/foreign’ (non-English) characters are included: even the accented characters used in French and Spanish (e.g., è é ê ñ) fall outside the 7-bit ASCII range and are only available in 8-bit extensions such as ISO 8859-1 (Latin-1) or in Unicode.
Mark-up (or markup) versus Annotation	Some people don’t make a distinction between the terms, but sometimes this may be a useful one. Mark-up = tags (added character strings) used to code the structural or surface format/renditional attributes of a text (e.g., headings, sections, page breaks, sentences, bold/italics, speaker ID, speaker turns, pauses), OR non-interpreted aspects of the situated context of the discourse (e.g. bibliographical or demographic details about the author or speaker, location of speech event, genre, etc., and also gestures, laughter, voice quality, and events such as “writes on blackboard”). In the mark-up languages HTML/SGML/XML, mark-up is always contained within angled brackets. Annotation = a subset of mark-up; tags (added character strings) used to code ‘value-added’ or interpreted information, derived through analysis by humans or machines; usually added for research purposes. The most common annotations are part-of-speech (PoS) tags, lemmas, semantic tags, discourse-level/pragmatic tags. Marked-up/annotated texts are often designed for computational tractability, and not meant to be read ‘raw’. They can, however, be rendered for human consumption with the right software/user interface.
lemma (plural=lemmas or, less commonly, lemmata)	An abstract lexical category (usually represented in all-capitals, e.g. BLOW) consisting of a lexeme base plus its inflected forms (regular, irregular & suppletive inflections) which share the same part of speech. For example, the verbal lemma BLOW contains the word forms blow, blows, blew, blown and blowing, while the lemma GO encompasses go, goes, went, gone, going. Lemmas for nouns (or ‘substantives’) group together singular and plural forms (e.g. wolf/wolves); adjectival lemmas group together positive, comparative and superlative forms (e.g. happy, happier, happiest; good, better, best); pronominal lemmas consist of the different ‘cases’ of the same pronoun (e.g. I, me, my, mine).
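
To make the glossary entries for concordance/KWIC and collocation a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (tokenize, kwic, mutual_information), the very crude tokeniser, the window of three words either side, and the invented sample sentence. The mutual information score follows one common window-based formulation, log2 of observed over expected co-occurrence; real concordancers may calculate it differently, and scores from such a tiny sample are not meaningful in themselves.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens (toy tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, node, width=3):
    """Return one concordance line per hit: left context, [node], right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the node words line up in a column
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines


def mutual_information(tokens, node, span=3):
    """Score every word found within +/- span tokens of node by MI."""
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            co.update(tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span])
    n = len(tokens)
    # MI = log2(observed / expected), expected = f(node) * f(word) * 2 * span / N
    return {
        word: math.log2(obs * n / (freq[node] * freq[word] * 2 * span))
        for word, obs in co.items()
    }


# Invented sample sentence, purely for demonstration
sample = ("The Queen's silver jubilee was followed by the golden jubilee "
          "and the diamond jubilee")
tokens = tokenize(sample)
for line in kwic(tokens, "jubilee"):
    print(line)
scores = mutual_information(tokens, "jubilee")
```

The printed lines show each hit of jubilee with its neighbouring words aligned around it, while scores maps each neighbouring word to its MI value. On a real corpus you would normally also apply a minimum frequency threshold, since MI tends to overrate very rare words.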

If you’ve found this web site useful, or found an outdated link, don’t forget to let me know.
