Bookmarks for Corpus-based Linguists

[N.B. This is an excerpt from Lee (2001). If quoting, go to the source and cite it.]



‘Domains’ versus ‘genres’: the BNC Sampler & why we need genre information

The BNC Users’ Reference Guide states that only three criteria were used to ‘balance’ the corpus: domain, time and medium. In choosing texts for inclusion in the BNC Sampler (the 2-million-word subset of the BNC), ‘domain’ was probably the most important criterion used to ensure sufficiently wide coverage of a variety of texts. On the BNC web page for the Sampler[1], the following comment on its representativeness is made:

In selecting from the BNC, we tried to preserve the variety of text-types represented, so the Sampler includes in its 184 texts many different genres of writing and modes of speech. [my emphasis]

It should be noted that no real claim to representativeness is made, and that what the compilers really meant was that many different texts were chosen on the basis of domain and other criteria.[2] The fact that the Sampler contains many different genres is not in doubt, but the texts were not chosen on this basis, since the BNC texts had no genre classification. Hence the Sampler cannot (and, indeed, does not) claim to be representative in terms of ‘genre’.

It is my belief that it is because ‘domain’ is such a broad classification in the BNC that the Sampler turned out to be rather unrepresentative of the BNC and of the English language. Anyone wishing to use the Sampler should be under no illusion that it is a balanced corpus or that it represents the full range of texts found in the full BNC. The Sampler may be broadly ‘balanced’ in terms of the ‘domains’, but when broken down by genre, a truer picture emerges of exactly how (un)representative it really is. The following lists of missing or unrepresentative genres in the BNC Sampler demonstrate this:


SPOKEN BNC Sampler: Missing or Unrepresentative Genres

Consultations: medical (none)
Consultations: legal (none)
Classroom discourse (only 3 texts)
Public debates (only 3 texts)
Job interviews (none)
Parliamentary debates (none)
News broadcasts (none)
Legal presentations (there are 2 legal cross-examinations, but no presentations, i.e., monologues)
University lectures (none)
Telephone conversations (no pure telephone conversations in the BNC as a whole)
Sermons (only 1 text)
Live sports discussions (none)
TV/radio discussions (only 4 texts)
TV documentaries (only 2 texts)


WRITTEN BNC Sampler: Missing or Unrepresentative Genres

Academic prose: humanities (none)
Academic prose: medicine (none)
Academic prose: politics, law and education (only 2 texts on law, none on politics or education)
Academic prose: natural sciences (nothing on chemistry, only 1 on biology & 3 on physics)
Academic prose: social sciences (nothing on the core subject areas of sociology or social work, nor on linguistics, which is arguably a social science, even though it is often treated as a humanities subject)
Academic prose: technology & engineering (nothing on engineering)
Administrative prose (only 1 text)
Advertisements (none)
Broadsheets: the only broadsheet material included consisted entirely of foreign news, and only from the Guardian.
Broadsheets: sports news (none)
Broadsheets: editorials and letters (none)
Broadsheets: society/cultural news (none)
Broadsheets: business & money news (none)
Broadsheets: reviews (none)
Biographies (none)
E-mail discussions (none)
Essays: university (only 1 text)
Essays: school (none)
Fiction: Drama (only 1 text)
Fiction: Poetry (only 2 texts)
Fiction: Prose (insufficient texts, and only 1 short story)
Parliamentary proceedings/Hansard (none)
Instructional texts (none)
Personal letters (none)
Professional letters (none)
News scripts (only 1 radio sports news script)
Non-academic: humanities (only 2 texts)
Non-academic: medicine (none)
Non-academic: pure sciences (none)
Non-academic: social sciences (2 rather odd texts, and 1 which possibly could be non-academic)
Non-academic pure science material (i.e. popularisations of science texts: there were none of these in the Sampler)
News scripts (classified as 'written-to-be-spoken' in the main BNC. None included in the Sampler)
Official documents (only 1 text)
Tabloid newspapers (only Today and East Anglian Daily Times, the latter of which is not really a tabloid, but a regional newspaper)
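
A breakdown of this kind can be reproduced for any sub-corpus once each text carries a genre label. The following is only a minimal sketch in Python, not the procedure used to compile the lists above. The function name genre_coverage, the threshold min_texts and the toy labels are all hypothetical, chosen purely for illustration, and are not real BNC metadata.

```python
from collections import Counter


def genre_coverage(sample_genres, genre_inventory, min_texts=5):
    """Return (missing, underrepresented) genres for a sub-corpus.

    sample_genres   -- one genre label per text in the sub-corpus
    genre_inventory -- every genre label attested in the full corpus
    min_texts       -- hypothetical cut-off below which a genre is flagged
    """
    counts = Counter(sample_genres)
    # Genres attested in the full corpus but absent from the sub-corpus.
    missing = sorted(g for g in genre_inventory if counts[g] == 0)
    # Genres present in the sub-corpus, but with fewer texts than the cut-off.
    under = {g: counts[g] for g in sorted(genre_inventory)
             if 0 < counts[g] < min_texts}
    return missing, under


# Toy data (illustrative only, not real BNC metadata).
inventory = {"sermon", "classroom discourse", "job interview",
             "news broadcast", "conversation"}
sample = ["sermon"] + ["classroom discourse"] * 3 + ["conversation"] * 10

missing, under = genre_coverage(sample, inventory)
print(missing)  # ['job interview', 'news broadcast']
print(under)    # {'classroom discourse': 3, 'sermon': 1}
```

The cut-off is a judgement call: what counts as ‘unrepresentative’ depends on the size of the sub-corpus and on the research question, so any fixed number should be treated as an assumption rather than a standard.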


I hope the above proves my point that ‘genre’ is perhaps a more insightful classification criterion than ‘domain’, at least as far as getting a ‘representatively balanced corpus’ is concerned.

If the compilers of the BNC Sampler had known the genre membership of each BNC text, they would probably have created a more balanced and representative sub-corpus. As things stand, however, any conclusions about ‘spoken English’ or ‘written English’ made on the basis of the BNC Sampler will have to be evaluated very cautiously indeed, bearing in mind the genres missing from the data.

There is another example of how large, undifferentiated categories similar to domain can unhelpfully lump disparate kinds of text together. Wikberg (1992) criticises the LOB text category E (‘Skills, trades and hobbies’) as being too baggy or eclectic. He demonstrates how, on the evidence of both external and internal criteria, the texts in Category E can actually be better sub-classified into ‘procedural’ versus ‘non-procedural’ discourse. He also notes that it is not just text categories which can be heterogeneous but that some texts are ‘multitype’ or mixed in terms of having different stages with different rhetorical or discourse goals. He thus concludes with the following comment:

An important point that I have been trying to make is that in the future we need to pay more attention to text theory when compiling corpora. For users of the Brown and the LOB corpora, and possibly other machine-readable texts as well, it is also worth noting the multitype character of certain text categories. (p. 260)

This is a piece of advice worth noting.

