On-line Dictionaries & Machine-Readable Lexica

Roget’s Thesaurus as an Electronic Lexical Knowledge Base

Roget’s Thesaurus (1911 edition) in Java, designed for Natural Language Processing; includes four examples of NLP applications: (1) detecting lexical chains in text, (2) determining semantic distance between words and phrases, (3) clustering words based on their meaning, and (4) solving a word quiz.


Comprehensive listing of on-line dictionaries: hundreds of dictionaries for more than 260 languages.

ACL SIGLEX Resource Links

A bookmarks page by the Special Interest Group on the Lexicon of the Association for Computational Linguistics. Pretty good listing of lexicons and electronic dictionaries.

ACL NLP/CL Universe List of Dictionaries

Links (many of them dead) to on-line dictionaries, including parallel and multilingual ones.

American English Spoken Lexicon (LDC)

A collection of pronunciations captured in individual audio files for more than 50,000 of the most common words in English. The words were extracted from newswire and telephone conversations.

CMU Pronouncing Dictionary

A machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their ASCII phonemic transcriptions.
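As a minimal sketch of what "machine-readable" means here, the following Python reads a few lines in the plain-text layout the dictionary is commonly distributed in: one entry per line, a headword followed by space-separated phoneme symbols, stress digits on vowels, and alternative pronunciations marked with a numbered suffix such as `(1)`. The sample lines are illustrative only, and the names `parse_cmudict` and `stress_pattern` are hypothetical helpers, not part of any official tool.

```python
import re
from collections import defaultdict

# Illustrative sample only; the real file has well over 100,000 lines.
SAMPLE = """\
;;; comment lines start with three semicolons
LEXICON  L EH1 K S IH0 K AA2 N
READ  R EH1 D
READ(1)  R IY1 D
"""

# Matches the "(1)", "(2)", ... suffix that marks an alternative pronunciation.
_VARIANT = re.compile(r"\(\d+\)$")


def parse_cmudict(text):
    """Map each headword to a list of pronunciations (lists of phoneme symbols)."""
    prons = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        head, *phones = line.split()
        word = _VARIANT.sub("", head)  # fold READ(1) into READ
        prons[word].append(phones)
    return dict(prons)


def stress_pattern(phones):
    """Collect the stress digits (0, 1 or 2) carried by the vowel symbols."""
    return "".join(p[-1] for p in phones if p[-1].isdigit())


lexicon = parse_cmudict(SAMPLE)
print(lexicon["READ"])
print(stress_pattern(lexicon["LEXICON"][0]))
```

Keeping every pronunciation as a list of symbols, rather than one string, makes it straightforward to count syllables or compare rhymes later.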

Cambridge Dictionary Data (Commercial)

SGML-encoded text files: The text of the Cambridge International Dictionary of English CD-ROM, English Pronouncing Dictionary, the Cambridge Dictionary of American English, the Cambridge International Dictionary of Idioms, the Cambridge International Dictionary of Phrasal Verbs and the Word Routes/Selector series of parallel bilingual mini-thesauri in French, Spanish, Portuguese, Italian, Greek and Catalan, and sound files from the CIDE CD-ROM.

CELEX Database

Lexical data stored in three separate databases for Dutch, English, and German. The Dutch database, version N3.1, was released in March 1990 and contains information on 381,292 present-day Dutch wordforms, corresponding to 124,136 lemmata. The latest release of the English database (E2.5), completed in June 1993, contains 52,446 lemmata representing 160,594 wordforms. The German database (D2.5), made accessible in February 1995, currently holds 51,728 lemmata with 365,530 corresponding wordforms. Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For Dutch and English lemma homographs, frequencies have been disambiguated on the basis of the 42.4-million-word Dutch INL and the 17.9-million-word English Collins/COBUILD text corpora. Furthermore, information has been collected on syntactic and semantic subcategorisations for Dutch.

CLR Catalog

Consortium for Lexical Research – Links to tools and resources.

Early Modern English Dictionaries Database (EMEDD)

An on-line searchable database of entries from sixteen early dictionaries, dating from between 1530 and 1657. The sources include bilingual lexicons as well as specialist and 'hard-word' dictionaries. By combining full texts of early dictionaries written over 160 years by lexicographers with varying purposes, the EMEDD is a reference work for English of the Renaissance period. It is designed to make accessible the English-language content of bilingual (English and other languages) and monolingual (English-only) dictionaries, glossaries, grammars, and encyclopedias published in England from 1500 to 1660.


FrameNet

The Berkeley FrameNet project is creating an online lexical resource for English, based on frame semantics and supported by corpus evidence (the BNC). The project has produced two types of data: a collection of approximately 50,000 hand-annotated sentences, and a database containing information about frames, frame elements, lemmas and lexical entries. All of this data is distributed as ASCII files with markup that is compatible with both SGML and XML, with accompanying DTDs.


English, Medical, Legal, and Computer Dictionaries, Thesaurus, Encyclopaedia, Literature Reference Library, and Search Engine all in one.

Hebrew lexicons

Hebrew WordNet aligned with the English WordNet 1.6 (GNU General Public License).

Morph-it! (Italian)

Free lexicon of Italian inflected forms with their lemma and morphological features. 568,771 entries, 28,500 lemmas.
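A lexicon of this kind (inflected form, lemma, morphological features) is typically used as a lookup table in both directions. The sketch below assumes a tab-separated layout with one `form<TAB>lemma<TAB>features` triple per line; that layout, the feature strings in the sample, and the function name `load_lexicon` are assumptions for illustration and should be checked against the distributed file.

```python
from collections import defaultdict

# Illustrative sample; the column layout and tag strings are assumptions.
SAMPLE = (
    "casa\tcasa\tNOUN-F:s\n"
    "case\tcasa\tNOUN-F:p\n"
    "parlo\tparlare\tVER:ind+pres+1+s\n"
)


def load_lexicon(text):
    """Build two indexes: form -> [(lemma, features)] and lemma -> {forms}."""
    by_form = defaultdict(list)
    by_lemma = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        form, lemma, features = line.split("\t")
        by_form[form].append((lemma, features))  # a form may be ambiguous
        by_lemma[lemma].add(form)
    return by_form, by_lemma


by_form, by_lemma = load_lexicon(SAMPLE)
print(by_form["parlo"])           # analyses of one inflected form
print(sorted(by_lemma["casa"]))   # all forms listed for one lemma
```

The form index keeps a list per entry because a single inflected form can have several analyses; the lemma index uses a set so that repeated forms are counted once.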

Svenska ord (LEXIN)
(from Språkdata and Språkbanken (The Bank of Swedish), Department for Swedish, University of Göteborg)

A Swedish dictionary containing approximately 20,000 lexical units (lexical categories: pronunciation, part-of-speech, inflexion, definition, valency, and linguistic examples). Available in two formats: (1) web version (access only for Swedish universities): http://spraakbanken.gu.se/lb/lexin/ (2) XML version for language technology purposes: ftp://ftp.spraakbanken.gu.se/pub/reskit/LEXIN.zip

(GNU) Collaborative International Dictionary of English (G)CIDE

An electronic dictionary-in-the-making derived from Webster’s Revised Unabridged Dictionary (1913), with some words supplemented with definitions from WordNet. Caveats: it still contains typing errors and is being proofread and supplemented by volunteers from around the world, and its definitions are more than 100 years old.

WordNet (and related databases)

A lexical database for English; an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.

See also EuroWordNet, a multilingual database with wordnets for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian); MultiWordNet, in which the Italian WordNet is strictly aligned with Princeton WordNet 1.6; and BalkaNet (for six Balkan languages: Greek, Turkish, Bulgarian, Romanian, Czech and Serbian).

An alphabetic version of WordNet 2.0 is available at http://www.clres.com/WordNet.html. There are 143,991 entries in this dictionary, with a sense for each occurrence of an entry in a distinct synset. Virtually all information in WordNet has been captured, including the new domain relations, verb groups, and derivational forms.
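The organisation described above, synonym sets linked by relations, with one sense per synset in which a word occurs, can be pictured with a toy in-memory model. The sketch below is not WordNet's file format or API: the `Synset` class, the four hand-made synsets and the helper names are all hypothetical, and only one relation (hypernymy, i.e. "is a kind of") is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    name: str                                       # hypothetical identifier
    words: list                                     # synonyms sharing one concept
    hypernyms: list = field(default_factory=list)   # names of more general synsets


# A tiny hand-made fragment, for illustration only.
SYNSETS = {s.name: s for s in [
    Synset("entity", ["entity"]),
    Synset("animal", ["animal", "beast"], ["entity"]),
    Synset("dog", ["dog", "domestic dog"], ["animal"]),
    Synset("cat", ["cat", "true cat"], ["animal"]),
]}


def senses(word):
    """One sense per synset that lists the word (the 'alphabetic' view)."""
    return [s.name for s in SYNSETS.values() if word in s.words]


def ancestors(name):
    """Breadth-first walk up the hypernym links: {synset name: distance}."""
    dist = {name: 0}
    frontier = [name]
    while frontier:
        cur = frontier.pop(0)
        for h in SYNSETS[cur].hypernyms:
            if h not in dist:
                dist[h] = dist[cur] + 1
                frontier.append(h)
    return dist


def path_length(a, b):
    """Shortest path between two synsets through a shared ancestor, or None."""
    da, db = ancestors(a), ancestors(b)
    common = set(da) & set(db)
    return min(da[c] + db[c] for c in common) if common else None


print(senses("beast"))
print(path_length("dog", "cat"))
```

Counting links through the nearest shared ancestor is only one simple way to turn the hierarchy into a distance; it is used here because it needs nothing beyond the relation links themselves.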

worldlanguage.com (commercial)

An Internet store specialising in language products for practically all the world’s languages.