Bookmarks for Corpus-based Linguists

CBL-related Conferences & some Project Sites


Conferences (past & future) (or click here to go to list of CBL-related project sites/centres)

Most of the events here have links to programmes/speakers and abstracts. I try to update this page frequently, but… no guarantees!

* Upcoming conferences are listed first.

ICAME 2013 (34th conf)

22-26 May 2013, Santiago de Compostela, Spain. (Deadline: 5 December 2012, via Web site)

Corpus Linguistics 2013

23-26 July 2013, Lancaster University, UK.

AACL 2013 (American Association for Corpus Linguistics)

18-20 January, 2013, San Diego, USA.

Genre- and Register-related Text and Discourse Features in Multilingual Corpora

11-12 January 2013, Brussels, Belgium (Organizers: The Linguistic Society of Belgium and Institut Libre Marie Haps – Brussels)

ICAME conferences: Annual conference series for corpus-based research on English (ICAME = International Computer Archive of Modern and Medieval English).

For a complete list of ICAME conferences and their proceedings, click HERE.

ICAME 2012 (33rd conf)

Wednesday 30 May - Sunday 3 June 2012, Leuven, Belgium

ICAME 2011 (32nd conf)

1-5 June 2011, Oslo, Norway.

ICAME 2010 (31st conf)

26-30 May 2010, Giessen, Germany.

ICAME 2009 (30th conf)

27-31 May 2009, Lancaster University, Lancaster, UK. Has links to conference pictures.

ICAME 2008 (29th conf)

14-18 May 2008, Ascona, Switzerland.

ICAME 2007 (28th conf)

23-27 May 2007, Stratford-upon-Avon, UK.

ICAME 2006 (27th conf)

24-28 May, 2006, Helsinki, Finland.

- Some pictures from the conference taken by Sebastian Hoffmann can be viewed here.

ICAME 2005 (26th conf)

See the AAACL 2005 entry below: ICAME 2005 was a joint meeting with AAACL (12-15 May 2005, University of Michigan, Ann Arbor, USA).

ICAME 2004 (25th conf)

19-23 May, 2004, University of Verona, Italy. (abstracts & programme available)

ICAME 2003 (24th conf)

23-27 April, 2003, St. Peter Port, Guernsey, British Isles

ICAME 2002 (23rd conf)

22-26 May 2002, Göteborg, Sweden

- Some pictures from the conference taken by Sebastian Hoffmann can be viewed here and some by Antoinette Renouf here.

ICAME 2001 (22nd conf)

16-20 May 2001, Louvain-la-Neuve, Belgium

- Has programme and abstracts of all papers and posters.

- Some pictures from the conference taken by Knut Hofland are here, and Sebastian Hoffmann’s are here.

ICAME 2000 (21st conf)

21-25th April 2000, Sydney, Australia

- Programme here

- And pictures taken by Knut Hofland are here, and pictures by Sebastian Hoffmann are here.

ICAME 1999 (20th conf)

26-30th May 1999, Freiburg, Germany

- No web site that I can find, but here’s a summary of the conference, and here are pictures from the conference taken by Knut Hofland and Sebastian Hoffmann.

TaLC (Teaching & Language Corpora) Conference series held every 2 years; pedagogical focus; not restricted to English.

For a list of TaLC, PALC, and "Corpus Linguistics" conferences and their proceedings, click HERE.

TaLC 2012

11-14 July 2012, Warsaw, Poland. (Deadline: 31 Jan 2012)

TaLC 2010

30 June - 3 July 2010, Brno, Czech Republic.

TaLC 2008

4-6 July 2008, Lisbon, Portugal. [Pre-conference workshop on 3 July 08]

TaLC 2006

2-4 July 2006, Paris, France. [Pre-conference workshop on 1 July 06]

TaLC 2004

6-9 July 2004, Granada, Spain. Some pictures here, and here (b&w).

TaLC 2002

27-31 July 2002, Bertinoro, Italy. Some pictures & video clips by Bill Fletcher (and links to those by others) are here.

TaLC 2000

19-23rd July 2000, Graz, Austria

TaLC 98

24-27th July 1998, Oxford University, UK

TaLC 96

9-12th August, 1996, Lancaster University, UK

TaLC 94

10-13th April, 1994, Lancaster University, UK

"CL" ("Corpus Linguistics"-branded) Conferences. Held every 2 years. Differs from ICAME in that it’s not restricted to English.

CL2011 (Corpus Linguistics 2011)

20-22 July 2011, Birmingham University, UK (Abstracts & papers available).

CL2009 (Corpus Linguistics 2009)

21-23 July 2009, University of Liverpool, UK. A biennial conference series not limited to English.

(Pre-conference workshop: 20 July 2009). Deadline for submission was 23 January 2009.

CL2007 (Corpus Linguistics 2007)

27-30 July 2007, Birmingham University, UK. (Abstracts & papers available).

CL2005 (Corpus Linguistics 2005)

14-17 July 2005, Birmingham University, UK. A biennial conference series not limited to English.
* Proceedings from the Corpus Linguistics Conference Series, Vol. 1, no. 1, ISSN 1747-9398

CL2003 (Corpus Linguistics 2003)

28 March - 1 April 2003, Lancaster University, UK. A biennial conference series not limited to English.

CL2001 (Corpus Linguistics 2001)

30 March - 2 April 2001, Lancaster University, UK. A biennial conference series not limited to English.

- Some pictures from the conference here

AACL Conferences (American Association for Corpus Linguistics) Started off as the "North American Symposium on Corpus Linguistics and Language Teaching", and was once "AAACL" (American Association for Applied Corpus Linguistics), but I guess there were too many "A"s, so one was dropped.

AACL 2011

7-9 October, 2011, Georgia State University, Atlanta, GA, USA.

AACL 2009

8-11 October, 2009, University of Alberta, Edmonton, Alberta, Canada. (Deadline for submission of abstracts was May 1, 2009)

AACL 2008

13-15 March, 2008, Brigham Young University. Provo, Utah, USA.

AAACL 2006

20-22 October, 2006, Flagstaff, AZ, USA. (Old link was here.)

AAACL 2005

12-15 May, 2005, University of Michigan, Ann Arbor, Michigan, USA. [Joint meeting of the AAACL (American Association for Applied Corpus Linguistics) and ICAME (International Computer Archive of Modern and Medieval English)]

AAACL 2004

21-23 May, 2004, Montclair State University, Upper Montclair, NJ, USA.

[Proceedings published as: Fitzpatrick, Eileen (Ed.). (2007). Corpus Linguistics Beyond the Word. Corpus Research from Phrase to Discourse. Amsterdam: Rodopi. Publisher listing here]

AAACL 2002

1-3 November 2002, Indianapolis, Indiana, USA. [Web site used to be here.]

AAACL 2001

23-25 March 2001, Park Plaza Hotel, Boston, MA, USA. [Call for papers archive here]
[Proceedings published as: Granger, Sylviane, Lerot, Jacques & Petch-Tyson, Stephanie (Eds.). 2003. Corpus-based Approaches to Contrastive Linguistics and Translation Studies. Amsterdam: Rodopi. Amazon listing here.]

AAACL 2000

31 March - 2 April, 2000, Northern Arizona University, Flagstaff, AZ, USA. [Web site used to be here]

AAACL 1999

20-22 May 1999, University of Michigan, Ann Arbor, MI, USA.
[Proceedings published as: Simpson, Rita C. & Swales, John M. (Eds.). (2001). Corpus Linguistics in North America: Selections from the 1999 Symposium. Ann Arbor: University of Michigan Press. Publisher listing here.]

IVACS (Inter-Varietal Applied Corpus Studies) Focuses on pedagogical applications of CBL

IVACS 5 ("Applying Corpus Linguistics")

18-19 June, 2010, University of Edinburgh, UK

* Call for Papers & conference details from here (Deadline for abstracts: 20 Dec 2009)

IVACS 4 ("Applying Corpus Linguistics")

13-14 June, 2008, University of Limerick, Ireland

* Call for Papers & conference details from here

IVACS 3 ("Language at the Interface")

23-24 June, 2006, Centre for Research in Applied Linguistics, Nottingham University, UK

* Call for Papers & conference details from here

IVACS 2 ("Analyzing Discourse in Context")

25-26 June, 2004, The Graduate School of Education, Queen’s University, Belfast, Northern Ireland

* Call for Papers & conference details from here

IVACS 1 ("Analyzing Discourse in Context")

June 2002, University of Limerick, Ireland

PALC (Practical Applications in Language Corpora) Conference series held every 2 years in Poland.

PALC 2003

4-6 April 2003, Łódź University Conference Centre, Poland.

Call for Papers & conference details from the Corpora List archive here

Lewandowska-Tomaszczyk, B. (ed.) Practical Applications in Language and Computers (PALC 2003). Frankfurt am Main: Peter Lang.

PALC 2001

7-9 Sept 2001, Łódź, Poland.

PALC 1999

15-18 April 1999, Łódź, Poland.

Lewandowska-Tomaszczyk, Barbara & Patrick James Melia (Eds.) (2000). PALC’99 – Practical Applications in Language Corpora: papers from the international conference at the University of Łódź, 15-18 April 1999. Frankfurt am Main: Peter Lang.

PALC 1997

12-14 April 1997, Łódź, Poland.

Lewandowska-Tomaszczyk, Barbara & Patrick James Melia (Eds.) (1997). International Conference on Practical Applications in Language Corpora (Łódź, Poland, 10-14 April 1997: proceedings). Łódź: Łódź University Press.

Learner Corpora Conferences

SUMMER SCHOOL: Learner Corpus Research: Theory and Applications

13-17 September 2004, University of Louvain, Louvain-la-Neuve, Belgium.

International Symposium on Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching

14-16 December, 1998, The Chinese University of Hong Kong

Exploiting Computer Learner Corpora

4-9 August 1996, part of AILA '96, in Jyväskylä, Finland.

Freiburg Workshop on Romance Corpus Linguistics

September 11th-13th, 2003, Albert-Ludwigs University Freiburg i.Br., Germany.

Other Corpus-related Conferences

Genre- and Register-related Text and Discourse Features in Multilingual Corpora

11-12 January 2013, Brussels, Belgium (Organizers: The Linguistic Society of Belgium and Institut Libre Marie Haps – Brussels)

Corpus Technologies and Applied Linguistics

28-30 June 2012, Xi’an Jiaotong-Liverpool University, Suzhou, China. Submission deadline: 1 March 2012

4th International Conference on Corpus Linguistics - Language, corpora and applications: diversity and change (CILC2012)

22-24 March 2012, University of Jaén, Spain. Submission deadline: 1 Jan 2012

Asia Pacific Corpus Linguistics Conference

15-19 February, 2012, University of Auckland, New Zealand. Deadline: 1 June 2011.

CILC 2012 - 4th International Conference on Corpus Linguistics

22-25 March 2012, University of Jaén, Spain

CILC2011 - 3rd International Corpus Linguistics Conference

7-9 April, 2011, Universidad Politécnica de Valencia, Spain

International Conference Corpus Linguistics - 2011 (CORPORA 2011)

27-29 June 2011, St. Petersburg, Russia.

No web site. Call for papers here

Summer Schools, Short Courses, One-offs

Seminar on "New trends in corpus linguistics for language teaching and translation studies" (In honour of John Sinclair)

Seminar-cum-conference to be held in Granada, 22-24 September 2008

Seminar on "Corpora in Discourse Analysis and in Language Teaching"

Seminar to be held at the ESSE 9 conference (European Society for the Study of English) in Aarhus, Denmark, 22-26 August 2008

Computing, Theorizing, Communicating

120th MLA Annual Convention, Session 658, 27-30 December 2004, Philadelphia, USA.
A program arranged by the Discussion Group on Computer Studies in Language and Literature. Presiding: Donald E. Hardy, Colorado State Univ

One-day Workshop (free):

Corpora (Electronic Texts) Applied to Language Teaching (CALT) Workshop

28 February 2005, Thammasat University, Bangkok, Thailand.
Speakers: Sebastian Hoffmann (University of Zurich), David Lee (Thammasat University), Passapong Sripicharn (Thammasat University), + others to be confirmed.
Participation is FREE, and ALL are welcome

One-day Workshop (free):

IT & Linguistics (Tips and Tricks for Teaching Linguistics with Technology)

17 February 2005, CILT, London, UK.
-- examples of good practice in the use of technology for teaching Linguistics (rather than specific tools which are used for Linguistics). If you would like to present an item at this event, please contact David E Newton at CILT (david.newton "at" cilt.org.uk)

Summer School:

Learner Corpus Research: Theory And Applications

13-17 September 2004, Louvain-la-Neuve, Belgium.

Organised by the Centre for English Corpus Linguistics, University of Louvain

Summer School:

Australasian Language Technology Summer School 2004 (ALTSS 2004)

4-7 December 2004, Macquarie University, Sydney, NSW, Australia.

In conjunction with the Australasian Language Technology Workshop (ALTWS 2004) and the Australian International Conference on Speech Science & Technology (SST 2004).

Short Course at Birmingham Univ:

Using a Corpus in Learning and Teaching English for Academic Purposes

6th – 8th September, 2004, University of Birmingham Centre for Corpus Research

Tuscan Word Centre courses: "How to use corpora in language work"

19th-22nd May, and 26th-29th May 2003

Translation Conferences

Using Corpora in Contrastive and Translation Studies (UCCTS 2010)

27-29th July 2010, Edge Hill University, Ormskirk (near Liverpool), UK.

Using Corpora in Contrastive and Translation Studies (UCCTS 2008)

25-27 September 2008, Zhejiang University, Hangzhou, China. Proceedings on-line.

1st Athens International Conference on Translation and Interpretation

13-14 Oct 2006, Athens, Greece.

Translating and the Computer

16-17 November 2006, Kensington, London

29-30 November 2007, Kensington, London

Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

May 31, 2003, HLT-NAACL 2003 Workshop

Corpus Use and Learning to Translate (CULT-BCN 2004)

Barcelona, January 22-25th 2004. Program available here.

Corpus Use and Learning to Translate (CULT 2K)

3-4 November 2000, Bertinoro, Italy (Scuola Superiore di Lingue Moderne per Interpreti e Traduttori Università di Bologna) Overview: the design and use of corpora in translation-related areas, with special reference to translator and interpreter training

Corpus Use and Learning to Translate (CULT 1997)

Bertinoro, Italy

III International and IX National Conference on Translation

August 30 - September 3, 2004, Fortaleza, Ceará, Brazil.

One of the thematic sessions will be on Translation and Corpora (papers dealing with any aspects - theoretical, practical, pedagogical - involving the use of corpora in translation)

Language Resources for Translation Work and Research

28th May 2002, Las Palmas, Canary Islands, Spain. (A pre-conference workshop at LREC 2002)

Translation: From Theory to Practice and from Practice to Theory

3-5 July 2003, Université de Bretagne Sud, Lorient, France.

Other Possibly Relevant Conferences

Web as Corpus Workshop

7 September 2009, San Sebastian, Basque Country, Spain

The many faces of Phraseology: An interdisciplinary conference

13-15 October 2005, Louvain-la-Neuve, Belgium. Organised by the UCL Centre for English Corpus Linguistics (CECL)

TELRI Seminar: Information in Corpora (7th TELRI Seminar)

TELRI (Trans-European Language Resources Infrastructure)

26-28 September 2002, International Centre of Croatian Universities, Dubrovnik, Croatia.

TELRI Seminar: Multilingual Corpus Research (6th TELRI Seminar)

9-11 November 2001, Bansko, Bulgaria. Abstracts and information available from web site.

XII SUSANNE HÜBNER SEMINAR

19th-21st November, 2003, Zaragoza, Spain. Theme: Corpus Linguistics: Theory and Applications for the Study of English (programme used to be here)

3èmes Journées de la Linguistique de Corpus

11-13 September 2003, Lorient, France

2èmes Journées de la Linguistique de Corpus

12-14 September 2002, Lorient, France

Journées de la Linguistique de Corpus

14 September 2001, Lorient, France


Computational/Technically-oriented Conferences

(more for people in LE, NLP, IR, TTS, CL, HLT, etc. -- if you don’t know what those acronyms stand for, these conferences are probably not for you!)

For Computational Linguistics/NLP Conferences, please visit Joel Tetreault’s web site, which is devoted to this:

http://www.cs.rochester.edu/~tetreaul/conferences.html

Places and Projects

Use this as an alphabetical list to find web links related to places, projects, and people.

AC/DC Project (Brazilian Portuguese & parallel corpora)

AC/DC stands for Acesso a corpora/Disponibilização de corpora ("access and availability of corpora"); it is one of the activities of the resource centre Linguateca

Association for Computers and the Humanities (ACH)

the major professional society for people working in computer-aided research in literature and language studies, history, philosophy, and other humanities disciplines, and especially research involving the manipulation and analysis of textual materials. The ACH is devoted to disseminating information among its members about work in the field of humanities computing, as well as encouraging the development and dissemination of significant textual and linguistic resources and software for scholarly research. Publishes the journal Computers and the Humanities and is linked with the Humanist discussion list.

ACL NLP-CL Universe
(Association for Computational Linguistics)

a Web catalog/search engine that is devoted to Natural Language Processing and Computational Linguistics Web sites. The Association for Computational Linguistics (ACL) is THE international scientific and professional society for people working on problems involving natural language and computation. Publishes the journal Computational Linguistics

Association for Literary and Linguistic Computing (ALLC)

supporting the application of computing in the study of language and literature; remit encompasses not only text analysis and language corpora, but also image processing and electronic editions. Journal: Literary and Linguistic Computing

Birmingham English Department

home of the Centre for Corpus Research (and the Centre for English Language Studies (CELS))

CECL Belgium

Centre for English Corpus Linguistics, Université catholique de Louvain

Centre for Computing in the Humanities (CCH)

at King’s College London

Centre for Corpus Research (Birmingham)

part of the School of Humanities

CETH (Rutgers University)

Center for Electronic Texts in the Humanities

CHILDES

at Carnegie Mellon University

CLG at Oxford

the website of the Computational Linguistics Group (CLG) at the University of Oxford, UK

CLLT Site

E-list "Corpus Linguistics and Language Teaching"

COBUILD Home Page

Home of the Bank of English (Sampler here)

COCOSDA

The International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (COCOSDA) was established to encourage and promote international interaction and cooperation in the foundation areas of Spoken Language Processing, especially for Speech Input/Output.

CORPORA List Hypermail Archive

click on the link to read back issues,

OR try the SIGLEX index of the discussion list (Selected messages to the CORPORA mailing list have been categorized and links to the threads have been provided. The categorization is based on a SIGLEX ontology. The links have been generated automatically based on subject, the date, and the sender. The links include only the years 1997 to the present. Before 2000, the CORPORA archive is not threaded)

Corpus Encoding Standard (see also XCES: XML Version of the CES)

more for the NLP people and larger-corpus builders; a set of encoding standards for corpus-based work and natural language processing applications

CTI Centre for Textual Studies

now superseded by the Learning and Teaching Support Network (LTSN)

EAGLES

Expert Advisory Group on Language Engineering Standards. An initiative of the European Commission which aims to accelerate the provision of standards for: (i) Very large-scale language resources (such as text corpora, computational lexicons and speech corpora); (ii) Means of manipulating such knowledge, via computational linguistic formalisms, mark up languages and various software tools; (iii) Means of assessing and evaluating resources, tools and products.

See also the entry for ISLE (International Standards for Language Engineering), which is the world-wide continuation of the EAGLES Project

ECHO (European Cultural Heritage Online)

or contributor’s web site here

The main goal of ECHO is the establishment of a European infrastructure fostering the transfer of cultural heritage to the Internet under some essential conditions, among them: free access to high-quality documents pertaining to cultural heritage; interoperability between different corpora; co-evolution of corpora, standards, and tools; access to the primary data through an ECHO-portal via scholarly metadata.

ELAN (European Language Resources Activity Network)

ELAN aspires to link resources developed by the members of the two Associations (PAROLE and TELRI) with their potential users throughout Europe. In order to serve the electronic multilingual resource market ELAN plans:

a) to reinforce or, where necessary, create international standards by - conforming a significant part of the data of the members of PAROLE and TELRI to a common format providing standardised resources for the following languages : Belgian French, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Swedish, Turkish and Ukrainian; - designing a common query language;

b) to operate, on the basis of the common query language, a user community network that will make accessible a large stock of electronic resources, with a clear copyright policy, user support, e-mail user groups, etc.

ELI (English Language Institute), University of Michigan

Home of the MICASE (Michigan Corpus of Academic Spoken English) project.

ELDA (European Language Resources Distribution Agency)

the distribution arm of ELRA: provides the organisational infrastructure for identifying, classifying, collecting, validating, marketing, distributing and licensing European language resources (spoken, written and terminological corpora and related resources); also disseminates info on HLT (human language technology). ELDA also participates in some evaluation projects and campaigns, has considerable knowledge and skills in HLT applications and has participated in many French, European and international projects.

ELRA (European Language Resources Association)

non-profit-making organisation founded by the European Commission in 1995, with the mission of providing a clearing house for language resources and promoting Human Language Technologies. A focal point for information related to language resources in Europe. Language Resources (LRs) include such materials as recorded speech databases, lexicons, grammars, text corpora, and terminological data.

ELSNET

ELSNET is the European Network of Excellence in Human Language Technologies. Its main objective is to advance human language technologies in a broad sense by bringing together Europe’s key players in research, development, integration or deployment in the field of language and speech technology and neighbouring areas. The network’s role is to offer an environment that allows for optimal exploitation of the available human and intellectual resources in order to advance the field. This environment comprises a number of structures (committees, special interest groups), actions (summer schools, workshops) and services (web site, email lists, newsletter, information dissemination, knowledge brokerage).

Electronic Text Center (Univ. of Virginia)

Text archive site

Euralex

European Association for Lexicography

Foundations of Statistical NLP

Book Companion Site

FORENSIC-LINGUISTICS Mailing List

Archives of the discussion list for Language and the Law (sometimes involves corpus work)

FrameNet Project

an online lexical resource for English, based on frame semantics and supported by corpus evidence (the BNC). Aims to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through manual annotation of example sentences and automatic capture and organization of the annotation results. Each FrameNet entry will provide links to other lexical resources, including definitions (from the Concise Oxford Dictionary), WordNet synsets and the COMLEX subcategorization frames.

GUTENBERG

Text archive site

Intute: Arts & Humanities Hub

free online service providing access to the best Web resources for education and research, selected and evaluated by a network of subject specialists. There are over 21,000 Web resources listed here that are freely available by keyword searching and browsing

iLoveLanguages

a comprehensive catalog of language-related Internet resources, hand-reviewed. Includes online language lessons, translating dictionaries, native literature, translation services, software, language schools, or just a little information on a language you’ve heard about

ICAME

International Computer Archive of Modern and Medieval English. The aim of the organization is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions.

IDS

(Corpus Technology research group at the Institut für Deutsche Sprache, Mannheim, Germany)

central non-university institution for the study and documentation of current usage and recent history of the German language. The Corpus Technology research group at the IDS hosts the Mannheimer Corpora project (the world’s largest collection of German online corpora) and the COSMAS project, an online corpus search and analysis toolbox. Main goal of the group: development of collocation analysis and clustering methods for corpus-based lexicography.

INL (Instituut voor Nederlandse Lexicologie)

The Institute for Dutch Lexicology (based in Leiden) collects and studies Dutch words (lexicology) and writes dictionaries (lexicography). It compiled the Woordenboek der Nederlandsche Taal (WNT/Dictionary of the Dutch Language on Historical Principles). Also trains foreign lexicographers in building a language database and compiling sizeable national dictionaries

Internet Grammar of English

an online course in English grammar written primarily for university undergraduates. Based at UCL, London.

ISCA (International Speech Communication Association)

main goal of this non-profit organization is "to promote Speech Communication Science and Technology, both in the industrial and Academic areas", covering all aspects of Speech Communication (Acoustics, Phonetics, Phonology, Linguistics, Natural Language Processing, Artificial Intelligence, Cognitive Science, Signal Processing, Pattern Recognition, etc.).

ISLE (International Standards for Language Engineering)

the world-wide continuation of the EAGLES Project

LDC - Linguistic Data Consortium

creating and sharing linguistic resources (mainly corpora): data, tools and standards; an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC’s host institution.

Leverhulme Corpus Project

a 1-million-word corpus which matches as closely as possible the LOB and FLOB corpora of written British English, except that the year of data collection is 1931, or near to that date (+/- 3 years). The immediate purpose of building this corpus is to make it possible to compare these three temporally equidistant corpora (1931, 1961, 1991): "Pre-LOB", LOB, and FLOB. This will enable us to track grammatical change through a period of 60 years of the 20th century. The new corpus under construction is as yet unnamed

LINGUIST List

Searchable web site for the main discussion list for all aspects of linguistics (including CBL/corpus-based linguistics)

Linguistic Annotation

describes tools and formats for creating and managing linguistic annotations; the page is no longer being actively maintained.

Literary and Linguistic Computing

Searchable archive of abstracts from 1986 to date.

Louvain CECL

CECL = Centre for English Corpus Linguistics, home of learner corpora projects

LTG (Language Technology Group, Edinburgh University)

Tools, software, etc.

Language Technology World

a virtual information center on the wide spectrum of technologies for dealing with human languages; maintained by the German Research Center for Artificial Intelligence (DFKI)

MATE (Multilevel Annotation, Tools Engineering Telematics Project)

The MATE project aims to facilitate re-use of language resources by addressing the problems of creating, acquiring, and maintaining language corpora. The problems are addressed along two lines: through the development of a standard for annotating resources; through the provision of tools which will make the processes of knowledge acquisition and extraction more efficient. Specifically, MATE will treat spoken dialogue corpora at multiple levels, focusing on prosody, (morpho-) syntax, co-reference, dialogue acts, and communicative difficulties, as well as inter-level interaction.

Machine Readable Spoken English Corpus (MARSEC)

a corpus of mainly prepared speech, time-aligned with sound files, annotated for prosody, phonemes, part-of-speech.

MICASE

The Michigan Corpus of Academic Spoken English

Natural Language Software Registry (NLSR)

a searchable/browsable catalogue of natural language processing (NLP) software, with summary notes on their features/capabilities, platforms, costs, etc. Includes academic, commercial and proprietary software. The NLSR does not undertake any distribution of the listed software.

Open Language Archives Community (OLAC)

A metadata initiative for language data and NLP tools

an international partnership of institutions and individuals; creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. OLAC works in a simple way: linguistic archives or individual linguists ("Data Providers") describe their resources using the simple OLAC metadata format. This metadata is then electronically "harvested" by "Service Providers" which make the information available to users in the form of a searchable database. Even if the resource itself is not available on the internet (a collection of cassettes, for example), people will still be able to find out what resources exist and where to find them.

* Has a search facility covering the resource catalogs of LDC, ELRA and the ACL/DFKI Natural Language Software Registry, and permits single searches to be applied to all catalogs simultaneously. The OLAC cross-archive search engine now harvests 11,000+ records from 12 OLAC archives.
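The harvesting model described in the OLAC entry above can be sketched roughly as follows. This is only a toy, in-memory illustration in Python: the class names, record fields and sample archives are all hypothetical, and it does not use the real OLAC metadata format or any real harvesting protocol.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """One metadata record describing a language resource (hypothetical fields)."""
    identifier: str
    title: str
    language: str
    online: bool  # False for offline holdings, e.g. a collection of cassettes


@dataclass
class DataProvider:
    """An archive or individual linguist that publishes metadata records."""
    name: str
    records: list[Record] = field(default_factory=list)

    def list_records(self) -> list[Record]:
        # Stand-in for whatever mechanism exposes the metadata for harvesting.
        return list(self.records)


class ServiceProvider:
    """Harvests metadata from many providers into one searchable catalogue."""

    def __init__(self) -> None:
        self.catalogue: list[tuple[str, Record]] = []

    def harvest(self, providers: list[DataProvider]) -> int:
        # Rebuild the catalogue from scratch and report how many records it holds.
        self.catalogue = [(p.name, r) for p in providers for r in p.list_records()]
        return len(self.catalogue)

    def search(self, keyword: str) -> list[tuple[str, Record]]:
        # Match the keyword against the title, or exactly against the language.
        needle = keyword.lower()
        return [(name, r) for name, r in self.catalogue
                if needle in r.title.lower() or needle == r.language.lower()]


# Made-up sample archives, purely for illustration.
archive_a = DataProvider("Archive A", [
    Record("a-1", "Spoken academic English recordings", "English", online=True),
    Record("a-2", "Field cassettes of Polish dialects", "Polish", online=False),
])
archive_b = DataProvider("Archive B", [
    Record("b-1", "Polish-English parallel texts", "Polish", online=True),
])

service = ServiceProvider()
harvested = service.harvest([archive_a, archive_b])
hits = service.search("polish")
for source, record in hits:
    status = "online" if record.online else "offline"
    print(source, record.identifier, record.title, f"({status})")
```

Note that the offline cassette collection still turns up in the search: only the metadata is harvested, so a user can find out that a resource exists and which archive holds it even when the resource itself is not on the internet.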

OTA - Oxford Text Archive

Collections of downloadable texts (mostly literary).

LE-PAROLE (Language Engineering - Preparatory Action for Linguistic Resources Organization for Language Engineering)

The European PAROLE project aimed to create a large-scale harmonised set of "core" corpora and lexica for over a dozen West European languages in compliance with common guidelines. 14 Western European language groups participated in the PAROLE project. Language corpora and lexica were built according to the same design and composition principles, in the period 1996-1998. For each of these languages, the project has resulted in a 20-million-word text corpus composed according to similar design principles and TEI encoded according to the PAROLE DTD. 250,000 words are POS encoded. Another product of the PAROLE project is a set of harmonised lexica containing a minimum of 20,000 entries provided with morphosyntactic and syntactic information.

PELCRA (Polish and English Language Corpora for Research and Applications)

Polish and English modern language corpora for research and other linguistic applications (a co-operation between Łódź University and Lancaster University). Includes: compilation of the Polish National Corpus (the fully annotated corpus of native Polish, mirroring the BNC in size and structure), Polish Learner English Corpus (learner data from a range of learner styles at different proficiency levels), and the Polish-English & English-Polish Parallel and Comparable Corpora covering Polish materials translated from and to English as well as a range of authentic non-translated texts on comparable subject matters.

Professional English Research Consortium (PERC)

an association of scholars, educators, publishers, test developers, and education providers committed to research in Professional English (PE) and the development of high-quality Professional English resources, products, and services to meet growing international demands. Professional English consists of all spoken and written discourse that is used by working professionals and professionals-in-training to engage in the work of their profession. It includes English for science, engineering, technology, law, medicine, finance, and other professions.

Proteus Project

NLP at New York University; focus on the application areas of Information Extraction and Machine Translation; long-term goal is to build systems that automatically find the information you’re looking for, pick out the most useful bits, and present it in your preferred language, at the right level of detail. One of our main challenges is to endow computers with linguistic knowledge. The kinds of knowledge that we have attempted to encode include vocabularies, morphology, syntax, semantics, grounding, genre variation, and translational equivalence. We work on both deterministic and stochastic knowledge models.

REAL Centre (Research in English and Applied Linguistics, at Chemnitz University)

various research projects at Chemnitz (Lampeter Corpus, Chemnitz Internet Grammar, East African ICE, English-German translation corpus, etc.)

SACODEYL

SACODEYL is a web-based system for the assisted compilation and open distribution of European teen talk in the context of language education. It includes the collection and distribution of English, French, German, Italian, Lithuanian, Romanian, and Spanish teen talk. It also distributes software tools HERE. You can search the seven corpora HERE.

SIL Links

Links on corpora, dictionaries, etc. from the Summer Institute of Linguistics.

SIMPLE

The goal of the SIMPLE project is to add semantic information, selected for its relevance for LE applications, to the set of harmonised multifunctional lexica built for 12 European languages by the PAROLE consortium.

SPARKLE (Shallow PARsing and Knowledge extraction for Language Engineering)

The first goal of SPARKLE is to produce generic software able to reliably produce a unique, correct but simple phrasal-level syntactic analysis of naturally-occurring free text. This software will be capable of practical use for processing substantial quantities of such (corpus) material. Such phrasal parsers will be generic in the sense that they aim to be compatible with a variety of extant approaches to lemmatisation, morphological analysis and lexical syntactic tagging, and aim to be straightforwardly parameterisable for different (European) languages. The second goal is to develop a lexical acquisition system capable of learning subcategorisation, argument structure and semantic selection preferences for individual predicates from free text containing instances of such predicates. The lexicon acquisition system will also be developed as a parameterisable multilingual software tool incorporating language-independent and language-dependent linguistic knowledge concerning membership of predicates in broad semantic classes, (diathesis) alternations, and the linking of arguments to thematic relations.

TELRI (Trans-European Language Resources Infrastructure)

pan-European alliance of currently 28 focal national language (technology) institutions with the emphasis on Central and Eastern European and NIS countries. Objectives: to strengthen the pan-European infrastructure for the multilingual language research and development community; and to collect, promote, and make available monolingual and multilingual language resources and tools for the extraction of language data and linguistic knowledge. TELRI maintains the TRACTOR Archive.

TEI Consortium (Text Encoding Initiative)

an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent

TRACTOR (TELRI Research Archive of Computational Tools & Resources)

Corpora in more than twenty languages (including Bulgarian, Croatian, Czech, Dutch, English, Estonian, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Romanian, Russian, Serbian, Slovak, Slovenian, Swedish, Turkish, Ukrainian and Uzbek); parallel corpora in a variety of pairings; software for processing corpus evidence; lexicons and other language-information resources.

The TRACTOR archive, network and user community are a key part of the TELRI agenda to build links between the research communities in Western, Central and Eastern Europe. Resources distributed through TRACTOR are available for non-commercial use only, but TRACTOR aims to promote and foster commercial links between academic and industrial researchers.

Tuscan Word Centre (TWC)

A non-profit association devoted to promoting the scientific study of language. It organises one-week high-level courses for language researchers and workers in the language industries. Concentrates on the use of electronic corpora for different purposes, including: translation, automatic or machine-aided language processing (tagging, parsing, etc.), language teaching support, language learning assistance, lexicography and language reference.

UCREL (Lancaster University)

University Centre for Computer Corpus Research on Language, Lancaster University

VISL project (Visual Interactive Syntax Learning)

a research and development project at the Institute of Language and Communication (ISK), University of Southern Denmark (SDU) - Odense Campus; concerned with designing and implementing Internet-based grammar tools for education (e.g. self-study) and research. Languages involved: Arabic, Bosnian, Danish, Dutch, English, Esperanto, French, German, Greek, Italian, Japanese, Latin, Portuguese, Russian, Spanish (the list is expected to expand).

Building on a complex web of HTML pages, CGI scripts, Java and Perl programs, manually annotated text databases, and Constraint Grammar (CG) tools for automatic analysis, the VISL internet site offers a graphic interface that allows the user, for a wide variety of languages, to analyse corpus examples, textbook material and free running text interactively, choosing between fully automatic parsing and guided manual analysis at various levels of complexity. At the core of the analyses, whether manual or automatic, is a clear distinction between form and function at the word, group, and clause levels. Although the automatic grammatical description is based on CG, it can be transformed into different user-specified notational systems, such as tree structures, tagged running text or in-text color codes.

Vocabulary Acquisition Research Group

specialises in the area of lexical processes in second language learning (at the Univ. of Wales Swansea, directed by Paul Meara and Nuria Lorenzo-Dus with help from Jim Milton and Geoff Hall). Includes a large-scale bibliographical resource & downloadable copies of some of their recent testing tools.


Did you find this web site useful? Spotted dead links or want me to add links? Do let me know.
