Practical Corpus Linguistics – Online Materials & Resources
This page contains links to the online materials/exercises accompanying my textbook Practical Corpus Linguistics. In future, I’m also planning to add links to some of the relevant resources, such as concordance programs, web-interfaces to generally accessible corpora, etc.
In addition, to keep the textbook up-to-date even if some of the resources originally described there may change, revised information containing the most recent changes to program interfaces, latest program versions, etc., will be posted here.
- Understanding Encoding: Character Sets
- Understanding File Formats & Their Properties
- Cleaning Written Data
- Regular Expressions
- Understanding Units in Texts
Luckily, in quite a few years, nobody had reported any errata in the book to me until Wed 06-Jul-2022, when one of my students pointed out that on p. 176, just prior to Exercise 66, I wrote “[...], where we investigate potential differences in the use of positions in economics texts.”, where it should of course read ‘prepositions’.
Use of Editors on Different Operating Systems
Fri 13-Mar-2020 11:35:08: I recently discovered that Komodo Edit is now available for all Operating Systems discussed in the book. While I would still recommend using Notepad++ on Windows, if you work on Mac and/or Linux systems, I would recommend using this editor there in preference to any of the other options I discuss.
The New BYU Interface
In May 2016, just about 3 months after the publication of the book, the BYU corpora interface underwent a rather drastic change, partly to make it more user-friendly for mobile phones. In the following, I shall try to summarise these changes inasmuch as they affect the content of the descriptions in the book, so as to allow readers to work through the exercises in the book using the new interface, rather than having to resort to switching to the old one.
The BYU interface is used in various places throughout the book, mainly as an interface to COCA, but also to carry out comparisons between COCA and the BNC, starting from section 8.2 (p. 132). Instead of the original frame-based display depicted in Fig. 8.4 (p. 133), the interface now has one basic window, as shown below.
Once you run a query by clicking on the Find matching strings button, you are taken to the FREQUENCY ‘tab’, which essentially looks like the top right-hand frame in the original figure in the book. Selecting the desired results from the frequency list and clicking on the CONTEXT button will then produce the output from the bottom right-hand frame, only this time on the CONTEXT ‘tab’.
The search syntax has also been changed extensively, most notably removing most of the square bracket options, and introducing some abbreviations.
- Lemma queries, as in our example, are now carried out by capitalising the word, so MOVIE – instead of the previous [movie] – finds both singular and plural of the word.
- Instead of using square brackets + equals sign, e.g. [=lazy], for finding synonyms, these can now be written without the brackets, i.e. =lazy, and it is even possible to combine lemma + synonym as =LAZY.
- When looking for whole words with a particular PoS, it is now possible to use more explicit or abbreviated forms, similar to the abbreviated forms in BNCweb.
The basic full, or explicit, forms are NOUN (for common nouns), NAME (for proper nouns), VERB (lexical verbs, but not auxiliaries), ADJ (adjectives), ADV (adverbs), PRON (pronouns), PREP (prepositions), ART (articles), and DET (determiners), while NOUN+ and VERB+ find all nouns or verbs, respectively.
In addition, there are abbreviated forms for common nouns (N), proper nouns (NP), all nouns (N+), lexical verbs (V), all verbs (V+), adjectives (J), and adverbs (R).
Thus, either go ADV or go R will now find go followed by any adverb. Similarly, -NOUN WATER or -N WATER will now retrieve all instances of water or waters not preceded by a noun.
- In word (or lemma) + tag combinations, the dot linking the two parts has now been replaced by an underscore, and abbreviated lowercase tag forms can now be used. These are _nn (nouns), _np (proper nouns), _n (all nouns), _vv (all lexical verbs), _v (all verbs), _j (adjectives), _r (adverbs), _p (pronouns), _i (prepositions), _a (articles), and _d (determiners). Thus, mind_n will now find mind only as a noun, and MIND_n, both singular & plural thereof.
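To illustrate the changes listed above, here is a minimal Python sketch that rewrites old-style lemma and synonym queries in the new syntax and shortens the full PoS forms to their abbreviated equivalents. The function and variable names (convert_old_query, abbreviate_pos, POS_ABBREVIATIONS) are hypothetical and not part of the interface itself. The sketch assumes that any plain word in square brackets is a lemma query, so it does not attempt to handle bracketed PoS tags or the change from dot to underscore in word + tag combinations.

```python
import re

# Full (explicit) PoS forms and their abbreviated equivalents,
# as listed in the description of the new syntax above.
POS_ABBREVIATIONS = {
    "NOUN": "N",    # common nouns
    "NAME": "NP",   # proper nouns
    "NOUN+": "N+",  # all nouns
    "VERB": "V",    # lexical verbs
    "VERB+": "V+",  # all verbs
    "ADJ": "J",     # adjectives
    "ADV": "R",     # adverbs
}


def convert_old_query(query: str) -> str:
    """Rewrite old bracketed lemma/synonym searches in the new syntax.

    Assumption: a plain word in square brackets is a lemma query;
    bracketed PoS tags are not handled by this sketch.
    """
    # Synonyms lose their brackets: [=lazy] becomes =lazy
    query = re.sub(r"\[=(\w+)\]", r"=\1", query)
    # Lemmas are capitalised instead of bracketed: [movie] becomes MOVIE
    query = re.sub(r"\[(\w+)\]", lambda m: m.group(1).upper(), query)
    return query


def abbreviate_pos(query: str) -> str:
    """Replace full PoS forms with abbreviated ones, token by token."""
    tokens = []
    for token in query.split():
        # Keep a leading minus sign, which marks negation as in -NOUN
        prefix = "-" if token.startswith("-") else ""
        core = token[len(prefix):]
        tokens.append(prefix + POS_ABBREVIATIONS.get(core, core))
    return " ".join(tokens)


if __name__ == "__main__":
    print(convert_old_query("[movie]"))
    print(convert_old_query("[=lazy]"))
    print(abbreviate_pos("go ADV"))
    print(abbreviate_pos("-NOUN WATER"))
```

Tokens that are not in the lookup table, such as go or WATER, are passed through unchanged, so queries that are already in the new syntax are left as they are.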
More to come soon...
Notes (most recent ones first)
- 07-Jul-2016 14:40:52: Mark Davies has recently (May 2016) changed the BYU interfaces to a new design without frames, as well as introducing some other changes. The general idea behind this is to make everything more user-friendly, but, sadly for us, the queries described in the book will now frequently no longer work in exactly the same way. To be able to do all the exercises based on the descriptions in the book, you will currently need to access the old BYU/COCA interface. I will try to put together some information about how to work with the new interface here within the next few months, so please be patient in the meantime...
- Please note that the corpora website referred to on page 21 is no longer maintained by David Lee; I took over its administration and maintenance in January 2016. The short URL http://tiny.cc/corpora still works, but if you have trouble accessing it, you can also use/bookmark the full address.