Bookmarks for Corpus-based Linguists

If you need help with file formats for some of the downloads, [click here]

On-line Papers, Dissertations & Squibs related to CBL - An annotated list

(Speech Corpora-related topics are in a separate section below)
Abney, Steven (1996) Statistical Methods and Linguistics. In Judith Klavans and Philip Resnik (eds.), The Balancing Act: combining symbolic and statistical approaches to language. The MIT Press, Cambridge, MA. (61 KB, 23+1 pages, A4) A brilliant paper, in my opinion, and much neglected. Brain food. [local PDF version (zipped) here] [Gzipped Postscript version here]
Barlow, Michael. (1998) 'Investigating form and meaning using parallel corpora'. In W. Teubert, E. Tognini Bonelli and N. Voltz (eds), Proceedings of the Third European Seminar. Translation Equivalence, pp. 13-28. issues surrounding the notion of translation equivalents are discussed in terms of a schema-based approach to grammatical knowledge described in Barlow and Kemmer (1994) and Barlow (1996). A brief description of a parallel concordance program is also included.
Berber Sardinha, Tony (2000) Comparing Corpora with WordSmith Tools: How large must the reference corpus be? In Kilgarriff, Adam & Tony Berber Sardinha (eds.), Proceedings of the Workshop on Comparing Corpora (Held in conjunction with The 38th Annual Meeting of the Association for Computational Linguistics), pp. 7-13. investigates the optimal size of the reference corpus when doing "Keywords" analysis. (Hint: Five times the size of the study corpus.)
Berber Sardinha, Tony (1996) A window on lexical density in speech. (Unpublished?) Paper presented at the 8th Euro-International Systemic Functional Workshop, Nottingham Trent University. looks at 'micro lexical densities' (computed for intervals in the texts rather than for the whole text) for two dialogues from the BNC.
Berber Sardinha, Tony (1996) Schoolchildren writing: A corpus-based analysis. Paper presented at TALC 96, Lancaster University. a sample corpus of schoolchildren’s essays was created and compared against newspaper texts, revealing some typical characteristics of student writing.
Berber Sardinha, Tony (1996) EFL writing assessment and corpus linguistics: The 'sound right' factor. Applications of corpus linguistics seminar, Aston University, 19th April 1996. Outline slides. Looks at how corpus-based linguistics could possibly help EFL teachers gain sensitivity to the language.
Berber Sardinha, Tony (1999) Looking at discourse in a corpus: The role of lexical cohesion. Paper presented at the 12th World Congress of Applied Linguistics (AILA), Waseda University, Japan. Outline slides. Studies text segmentation; suggests that we can get the computer to ‘focus on the text’, and that computers can do discourse analysis.
Berber Sardinha, Tony (1999) Using KeyWords in text analysis: Practical aspects. DIRECT Papers 42:1-8. São Paulo and Liverpool. reviews some of the features available for language analysis in the KeyWords tool of WordSmith, presents some arguments related to the use of chi-square in comparing word frequencies, and proposes two techniques for extracting a representative subset of key words for analysis.
Berber Sardinha, Tony (1999) Word sets, keywords, and text contents - an investigation of text topic on the computer. In DELTA - Revista de Documentação de Estudos em Lingüística Teórica e Aplicada, no. 1, 15:141-149. São Paulo. looks at the identification of coherent word sets based on keywords analysis; results suggest that the key words may indeed represent the main contents of the text.
Berglund, Ylva; Mason, Oliver (2001) "But this formula doesn’t mean anything...!?": Some reflections on parameters of texts and their significance. In edited collection to be published (by Peter Lang) in honour of Geoffrey Leech. part of a larger project on the automatic stylistic assessment of students' essays. The overall aim of the project is to identify parameters that affect the naturalness of English text produced by non-native speakers, that is, parameters that make the L2 texts seem un-natural, 'foreign' or un-English.
Bernardini, Silvia (1997) A trainee translator’s perspective on corpora (Paper presented at Bertinoro Workshop). discusses how to develop the 'translating skills' required to produce a 'good' translated text using corpus-based language learning activities. Other papers on translation and corpora from the same workshop are available here
Cobb, Tom (2003) Analyzing late interlanguage with learner corpora: Quebec replications of three European studies. Canadian Modern Language Review, 59(3), 393-423. [PDF format] "When assembled by teachers or researchers into a learner corpus (LC) of suitable size and character, such a corpus provides the empirical means to discover what advanced learners know and do not know about their L2. A strong tradition of LC analysis has emerged in Europe; the present study introduces this work and tests its applicability to a North American context."
Cobb, Tom, Chris Greaves & Marlise Horst (2000) Can the rate of lexical acquisition from reading be increased? An experiment in reading French with a suite of on-line resources. In Raymond, P. & C. Cornaire, Regards sur la didactique des langues secondes. Montréal: Éditions logique. [Also in French.] discussion of concordancing as a learning resource.
Cobb, Tom (1998). Breadth and depth of vocabulary acquisition with hands-on concordancing. Computer Assisted Language Learning 12, pp. 345-360. "describes how students, in effect, become concordancers, using concordance and database software to create their own dictionaries of words to be learned. This method combines the benefits of list coverage with at least some of the benefits of lexical acquisition through natural reading. The method is further enhanced by computerized learning activities based on the principle of moving words through five stacks as they are reviewed and learned."
Cobb, Tom (1997) Is there any measurable learning from hands-on concordancing? System, 25, 301-315. as it says...
Collins, Heloisa & Mike Scott (1996) Lexical Landscaping in Business Meetings. DIRECT Paper 32. "presents an interpretation of lexical aspects of business meetings in Portuguese and in English as native languages, introducing the notion of topical nets... methods used... are from computational lexical analysis tools and include word listing, concordancing, collocation analysis and keyness of words in and across texts (Scott 1996). Results indicate that the variables used and the features generated by the analysis can be powerful analytical instruments for the lexical description of variants within the business meeting genre". Official version should be here soon (?).
Deutschmann, Mats. (2003) Apologising in British English. PhD Thesis. University of Umeå, Sweden. Available on-line here (PDF file). Makes use of BNCweb to explore sociolinguistic parameters of apologising in British English. Abstract: "This sociolinguistic study of apologies in the spoken part of the British National Corpus examines the use of the apology form in dialogues produced by over 1700 speakers, acting in a number of different conversational settings. The forms and functions of the apologies are examined and variations in usage patterns across the social variables gender, age and social class are elucidated. The study also treats aspects of the conversational setting, such as formality, group size and the genre, which affect the use of this politeness formula. Finally, the effects of the speaker-addressee relationship on apologetic behaviour are considered."
Edited selections from the 15th ICAME (1994, May 18 - 22, Århus, Denmark) Hermes, vol. 13, (1994), Lauridsen, Karen M. & Ole Lauridsen (eds.). Århus School of Business, Faculty of Modern Languages. Available on-line at: http://hermes.asb.dk/archive/1994/Hermes13.html. Paper version can be ordered from: The Århus School of Business, Elin Madsen/Faculty of Modern Languages, Fuglesangs alle 4, DK 8210 Århus V. Fax: (+45) 86157727. E-mail: kal@hdc.hha.dk
Hadley, Gregory (2002) "Sensing the Winds of Change: An Introduction to Data-Driven Learning." RELC Journal 33 (2), 99-124. "studies the rationale for allowing Data-driven learning (DDL) more prominence in the EFL classroom. After covering some pertinent issues and recent developments in the field of pedagogic grammar, the case for DDL will be discussed. The last part of this paper features the first-documented use of data-driven learning with Japanese university students, with special consideration given to their reactions to this new form of grammar learning."
Hadley, Gregory. (2001). "Concordancing in Japanese TEFL: Unlocking the Power of Data-Driven Learning." In K. Gray, M.A. Ansell, S. Cardew and M. Leedham (Eds.) The Japanese Learner: Context, Culture and Classroom Practice. Oxford: Oxford University. Pp. 138-144. as the title says
Johns, Tim (1996). 'If our descriptions of language are to be accurate ... A footnote to Kettemann' TELL&CALL 1996/4 pp. 44-6.
Johns, Tim (1997). 'Kibbitzing One-to-Ones'. (Web version of notes for presentation at BALEAP meeting on Academic Writing, University of Reading, 29th November 1997).
Kettemann, Bernhard (1999) On the Role of Context in Syntax and Semantics. In Kettemann, Bernhard and Georg Marko (eds.), Crossing Borders: Interdisciplinary Intercultural Interaction. Tübingen: Narr, pp.105-114. tries to show through concordance examples that context is important for the co-selection of semantic and syntactic structures. Argues that concordancing makes it possible for us to look at performance and offers a window on competence.
Kettemann, Bernhard (1996) Concordancing in English Language Teaching. In Botley, Simon, Julia Glass, Tony McEnery, Andrew Wilson (eds.), Proceedings of Teaching and Language Corpora 1996. UCREL Technical Papers 9, Lancaster University, 1996, 4-16. [URL forthcoming] argues "that the use of concordancing in the teaching of EFL is motivating and rewarding. It describes a possible way of having students approach certain language phenomena (grammar, vocabulary, style) in an inductive and learner centered way."
King, Philip & Woolls, David (1996). 'Creating and Using a multilingual parallel concordancer', Translation and Meaning Part 4, pp. 459-466. describes in detail the Windows version of the software developed at the University of Birmingham by the second author, and outlines its potential use for students and trainers of translation.
Kytö, Merja & Romaine, Suzanne (1997) Competing Forms of Adjective comparison in Modern English: What could be more quicker and easier and more effective? In Nevalainen, Terttu & Leena Kahlas-Tarkka (eds.), To Explain the Present: studies in the changing English language in honour of Matti Rissanen. Helsinki: Société Néophilologique. compares the competing forms of adjective comparison in contemporary spoken English, using the BNC and ARCHER corpus.
Lee, David YW (2001) Genres, Registers, Text Types, Domains, and Styles: clarifying the concepts and navigating a path through the BNC jungle. Language Learning & Technology. Vol. 5, No. 3, September 2001, pp. 37-72. [PDF file] [N.B. The web-based journal’s HTML version is here. The above link is to a non-official PDF version with corrected hyperlinks.] The use of the somewhat confusing terms genre, register, text type, domain, sublanguage, and style by various linguists and literary theorists is examined and illustrated with reference to the disparate categories used to classify texts in various existing computer corpora. With this terminological problem resolved, a personal project which involved giving each of the 4,124 British National Corpus (BNC, version 1) files a descriptive "genre" label is described. The result of this work, a spreadsheet/data bank (the BNC Index) containing genre labels and other types of information about the BNC texts is described and its usefulness shown. This resource will allow linguists, language teachers, and other users to easily navigate through or scan the huge BNC jungle more easily, to quickly ascertain what is there (and how much) and to make informed selections from the mass of texts available. It should also greatly facilitate and encourage genre-based research (e.g., EAP, ESP, discourse analysis) and focus everyday classroom concordancing activities by making it easy for people to restrict their searches to highly specified sub-sets of the BNC using PC-based concordancers such as WordSmith, MonoConc, or the Web-based BNCWeb.
Lee, David YW (forthcoming) Computer corpus-based linguistics & the uninitiated postgraduate. To appear in the proceedings of the BAAL/CUP Seminar: Postgraduate Research in Applied Linguistics: The Insider Perspective, 20-21st March 1999, Department of Linguistics and Modern English Language, Lancaster University. A short squib discussing what corpus-based linguistics is about, what typical CBL courses look like, what kinds of knowledge/skills are necessary, and how to prepare yourself and choose where to go to learn about corpus-based work.
McCarthy, Michael (1998a) Taming the spoken language: genre theory and pedagogy. The Language Teacher (On-line version) Volume 22, Number 9. draws on his work on the CANCODE corpus, reflects on the use of corpora in the language classroom.
McEnery, Tony & Andrew Wilson (1997) Teaching and language corpora. ReCALL, 9(1): 5-14. somewhat outdated survey of the use of corpora in teaching language and linguistics.
Mike Nelson’s PhD dissertation on Business English (or a summary here) asks whether the lexis of Business English is significantly different from that of ‘everyday’ general English, and secondly, if the lexis found in Business English published materials is significantly different from that found in real-life business. Uses WordSmith’s 'keywords' feature, with the BNC Sampler as the reference corpus (which I find rather unfortunate, since the BNC Sampler is by no stretch of the imagination a representative sample of any notional 'general English').
Michael Rundell on "The future of the corpus, and the corpus of the future" (1996) somewhat dated, but interesting transcript of a talk given in 1996
Rayson, Paul, Geoffrey Leech & Mary Hodges (1997) Social differentiation in the use of English vocabulary: some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics. Volume 2(1): 133-152. use the spoken demographically sampled part of the BNC to make a comparison of the vocabulary of speakers, highlighting differences marked by a very high χ² value of difference between different sectors of the corpus according to gender, age and social group. (As background the authors also briefly examine differences between spoken and written material in the BNC.)
Rodriguez, Maria Rosario Caballero (1999) Using a Concordancer in Literary Studies. The European English Messenger, Vol VII/2, pp. 59-62. includes introductions to Wconcord and ConcApp.
Someya, Yasumasa. 1999. A Corpus-based Study of Lexical and Grammatical Features of Written Business English. M.A. thesis. Graduate Department of Language and Information Sciences, University of Tokyo. as it says, this study aims to identify and describe some of the major lexico-grammatical features of (written) English for Business Purposes (EBP).
Stevens, Vance 1995. Concordancing with Language Learners: Why? When? What? CAELL Journal, 6(2): 2-10. one for language teachers
Stubbs, Michael. 2001. Words in use: introductory examples. In Stubbs, Michael, Words and Phrases: Corpus Studies in Lexical Semantics. Oxford: Blackwell. Chapter one of the book (N.B. No page numbers available for this on-line version)
Stubbs, Michael. 2000. Using very large text collections to study semantic schemas: a research note. In Words in Context: a tribute to John Sinclair on his retirement, Heffer, Chris & Helen Sauntson (eds), Birmingham: University of Birmingham. English language research discourse analysis monograph; no. 18. [CD-ROM] as it says.
Stubbs, Michael. 1998. German loanwords and cultural stereotypes. English Today, 53 (14, 1, Jan 1998): 19-26. as it says.
Stubbs, Michael. 1995. Collocations and semantic profiles: on the cause of the trouble with quantitative studies. Functions of Language, 2, 1: 23-55. HTML version. No page numbers.
Thompson, Geoff (forthcoming) Corpus, comparison, culture: doing the same things differently in different cultures. To appear in M. Ghadessy, R. Roseberry & A. Henry (eds.) Small Corpus Studies and ELT: Theory and practice. Amsterdam: John Benjamins. on the comparative analysis in the classroom of small corpora of equivalent genres in different languages. The case is argued for including explicit attention to language forms even within communicative language teaching approaches, and for using cross-linguistic comparison as an awareness-raising resource.
Tribble, Chris (2000) Genres, Keywords, Teaching: towards a pedagogic account of the language of Project Proposals. In Burnard, Lou and McEnery, Tony (eds). Rethinking language pedagogy from a corpus perspective: papers from the third international conference on teaching and language corpora. (Lodz Studies in Language). Hamburg: Peter Lang. shows "how it is possible to use a corpus of instances of a specific genre to provide learners with access to aspects of both language knowledge and, as a result of further analysis, context knowledge. I have also shown that although a POS marked corpus will provide the fullest account of the linguistic characteristics of a genre, an analysis of keywords also offers a powerful means of establishing which words (and phrases) matter in a collection of examples of a genre."
Tribble, Chris (1998) Writing Difficult Texts. Lancaster: Unpublished PhD thesis, Lancaster University. "This thesis uses the concepts and techniques associated with genre analysis, corpus linguistics and discourse analysis to offer some solutions to problems in writing instruction -- in particular the problem of learning to write into a new or unfamiliar genre."
Tribble, Chris (1997) Improvising corpora for ELT: quick-and-dirty ways of developing corpora for language teaching. A paper presented at the First international conference: Practical Applications in Language Corpora (1997) University of Lodz, Poland. Aims to show that it is possible to begin to use a "data-driven" approach (Johns 1991) to language learning and teaching even if you do not have access to established corpus resources.

On-line Speech Corpora-related Papers, Dissertations & Squibs

Arnfield, Simon. Prosody and Syntax in Corpus-Based Analysis of Spoken English. PhD dissertation, Leeds University. concerned with the relationship between syntax and prosody as it existed in a corpus of data, the Machine Readable Spoken English Corpus (MARSEC), in particular, any relationships which could be used to enhance speech synthesis systems by providing a syntactically based prosodic element to the synthesis. In practice this turned out to be a system to predict the location and type of stress and segmentation within an utterance given the parts-of-speech. The relationship between syntax and prosody has applications for speech synthesis and speech recognition in providing a disambiguating element.
Lee, David YW (1992) Givenness and Prosodic Patterns: a preliminary study of some texts from the Spoken English Corpus. MA dissertation, Lancaster University. (NB. Page numbers differ from original printed version) a preliminary examination of the relation between the Given-New information distinction and prosodic features using the SEC. Results show that a case can be made for the existence of a general strategy of downranking Given information: that is, putting such items 'into the background' prosodically. This is done either by deaccentuation or by the assignment of an accent which is perceptually less prominent on a scale of accentuation (particularly with respect to the other accents in the same tone group). Whether a Given item will be deaccented or take an accent (and which low-ranked accent it is likely to take) is subject to a host of other factors: syntactic, semantic, pragmatic, contextual (i.e. the preceding/following tones within the same tone group) and even stylistic. The study also points to a need for finer distinctions to be made in the categorisation of linguistic items into Given and New, as English prosody seems to be sensitive to more distinctions than could be made for the purposes of this study. However, the findings adduced here provide some tentative prosodic-linguistic evidence for the existence of a cline of Givenness, different categories of Givenness having different likelihoods of conforming to the proposed Given Information Downranking Strategy.

Have you found this web site useful? Have you found dead links or want to suggest something? Do let me know, to encourage me to keep updating the site.
