Word & Frequency Lists

English – Frequency Lists ("word lists" are below)


BNC Frequency lists
(from the companion web site to the book: Leech, Geoff, Paul Rayson & Andrew Wilson. (2001). Word Frequencies in Written and Spoken English: based on the British National Corpus. Longman, London.)

Frequency lists for the whole BNC (version 1), for the spoken versus written components, for the conversational (i.e. demographic) versus task-oriented (i.e. context-governed) parts of the spoken component, and for the imaginative versus informative parts of the written component. Also: ranked frequency word lists according to parts of speech (e.g. all nouns, all conjunctions) based on the whole BNC corpus (version 1), as well as frequencies for individual part-of-speech tags (e.g. NN1, VDG) based on the BNC Sampler.
Although the frequency lists for this book were based on all 4,124 files of the original BNC version 1 corpus, the text classifications and POS tags used were the updated and more accurate ones implemented in the BNC World Edition.

** For those who want a user-friendly word list (i.e. without frequency figures) based on the entire BNC, I am making one available here (all word forms occurring at least 10 times per million words, alphabetically arranged)

BYU-BNC’s Frequency Lists

Select any of 70+ registers/genres, and then get a frequency listing for that genre. Just enter "*" (without quotation marks) for a general frequency listing for the selected genre, "[nn1]" for singular nouns in that genre, etc. You can also easily compare word frequency in one genre (or set of genres) against another, e.g. sermons vs. spoken, tabloids vs. broadsheets, medical vs. academic, etc.

Kilgarriff’s BNC Frequency Lists

Lemmatised frequency list (various formats); unlemmatised, or 'raw', frequency lists (various formats); variances of word frequencies.

Largely superseded by the Leech, Rayson & Wilson (2001) book/web pages, due to text classification & word-tagging errors in the version of the BNC corpus used for these lists. See this note.

Kučera & Francis word list
(from Kučera, Henry & Francis, W. N. (1967). Computational Analysis of Present-Day American English. Providence, RI: Brown University Press.)

Word+frequency lists based on the Brown corpus (not disambiguated by parts of speech) may be found at the Brandeis University Computational Memory Lab or at the Psycholinguistic database at Rutherford Appleton Laboratory.

Various Word Lists and Freq Lists for ESL/TESOL/pedagogical purposes

a set of links collected by the Internet TESL Journal

English – Word Lists/Stop Lists (no frequencies)

The Academic Word List (AWL)

Compiled by Averil Coxhead as a replacement for/update to Xue & Nation’s University Word List (UWL).

570 word families assumed to reflect the shared vocabulary of written academic English as used in a wide variety of disciplines (28 in total, 125K words from each) in an Academic Corpus of 3.5m words.

Selection was based on the principles of range, frequency and dispersion, using a specially compiled academic corpus of journal articles, book chapters, course workbooks, laboratory manuals, and course notes.

Sadly, though, the corpus composition was heavily skewed, a fact that affects its representativeness immensely. However, even these days, many people still appear to not have cottoned on to this, as the list still keeps getting cited as a model ;-)

For Exercise-making tools based on the AWL, see Sandra Haywood’s site or Tom Cobb’s Compleat Lexical Tutor site.

Billuroglu Neufeld List (BNL)

The Billuroglu and Neufeld List of the most commonly used words in English, defining an improved critical lexical mass from the old GSL and AWL lists.

Lemma List for English (by Yasumasa Someya)

40,569 words (tokens) in 14,762 lemma groups (Format: worry -> worries,worrying,worried)
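For anyone wanting to use a lemma list in this format for lemmatising their own texts, here is a minimal Python sketch. It assumes only the line format quoted above (worry -> worries,worrying,worried); the function name and the handling of blank or non-matching lines are illustrative assumptions, not a description of the actual file.

```python
def parse_lemma_lines(lines):
    """Build a form -> lemma lookup from 'lemma -> form1,form2' lines."""
    form_to_lemma = {}
    for line in lines:
        line = line.strip()
        if not line or "->" not in line:
            continue  # skip blank lines and anything not in the expected format
        lemma, forms = line.split("->", 1)
        lemma = lemma.strip()
        form_to_lemma[lemma] = lemma  # the headword always maps to itself
        for form in forms.split(","):
            form = form.strip()
            if form:
                # keep the first lemma seen for a form shared by several groups
                form_to_lemma.setdefault(form, lemma)
    return form_to_lemma


sample = ["worry -> worries,worrying,worried"]
lookup = parse_lemma_lines(sample)
print(lookup["worried"])
```

Keeping the first lemma for an ambiguous form is a simplification; a fuller treatment would store all candidate lemmas for each form.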

Function Words/Stop Lists for English

Stop list for info-retrieval purposes, from Cornell, originally compiled by Salton & Buckley.

Longman Defining Vocabulary (extended by David Lee)

An expanded & slightly extended word list from the back of the Longman Dictionary of Contemporary English (2nd ed., 1987). This represents the controlled vocabulary of 2,000+ words used by the dictionary in its definitions. All the stems (e.g. walk) have been expanded to include inflected forms (walks, walked, walking) and a few uncontroversial derived forms (e.g. awkwardly, from awkward). The list is available as a Microsoft Word document, with notes at the top of the file.

If you use the list, please reference David’s article: Lee, David. (2001). Defining core vocabulary and tracking its distribution across spoken and written genres: Evidence of a gradience of variation from the British National Corpus. Journal of English Linguistics, 29(3), pp. 250-278.

Moby project resources

Grady Ward’s free word lists & texts. Moby Hyphenator: 185,000 entries fully hyphenated; Moby Language: Word lists in five languages; Moby Part-of-Speech: 230,000 entries fully described by part(s) of speech, listed in priority order; Moby Pronunciator: 175,000 entries fully International Phonetic Alphabet coded; Moby Shakespeare, the complete unabridged works of Shakespeare; Moby Thesaurus: 30,000 root words, 2.5 million synonyms and related words; Moby Words (English): 610,000+ words and phrases.

Ogden’s Basic English word list

Everything to do with Charles Kay Ogden’s 1930s classic Basic English vocabulary list, including the electronic version of Basic English: International Second Language. New York: Harcourt, Brace & World Inc./Orthological Institute.

West’s (1953) General Service List (GSL)

in electronic format, as entered by Bauman & Culligan; Michael West’s (in)famous (+ outdated and skewed) set of 2,000+ words selected to be of the greatest "general service" to learners of (written) English. This version ranks the words by their frequency in the Brown Corpus (1960s written American English).

Miscellaneous Word lists

(1) Outpost9: eclectic collection of useful and not-so-useful word lists: surnames, given names, dictionary word lists, etc.
(2) SCOWL (Spell Checker Oriented Word Lists) and Friends: Words+inflections list, Part-of-speech database, jargon word lists, lists for spell checkers, etc.

Other Languages: Frequency & Word lists/Stop lists

If you have lists for other languages which you can share, please let me know.

See also the page on On-line Dictionaries, Machine-readable Lexicons & Related Resources.

** Stop Lists for various languages (e.g. Danish, Dutch, English, Finnish, French, Italian, Norwegian, German, Portuguese, Russian, Spanish)

The Snowball web page has stop lists (and stemmers) for various languages. On the web site, just click on the language of interest and look for the link to the stop list.

Chinese Frequency List

Ranked frequency list of characters, with frequencies and Pinyin.

German Frequency Lists

German lists from About.com, Wortschatz (Uni Leipzig), IDS Mannheim.

Morph-it! (Italian)

Free lexicon of Italian inflected forms with their lemma and morphological features. 568,771 entries, 28,500 lemmas. Sadly, now seems to have disappeared :-(

Russian Frequency List
(by Serge Sharoff)

Based on a corpus of modern Russian fiction and political texts (more than 35 million words). The list includes about 33,000 words whose frequency is greater than 1 ipm (instances per million words). A shorter selection of the 5,000 most frequent words is also available. The list provides word rank, frequency (per million) and part of speech. Some analytical information about the lexical stock is provided, such as coverage of the total language use by word bands, e.g. the first 3,000 lemmas cover 76.6824% of the total number of word forms. The corpus, tools for working with it, as well as an aligned parallel English-Russian corpus are discussed in: Sharoff, Serge. (2002). Meaning as use: exploitation of aligned corpora for the contrastive study of lexical semantics. Proc. of Language Resources and Evaluation Conference (LREC02). May, 2002, Las Palmas, Spain.
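The two measures mentioned here, frequency in instances per million (ipm) and coverage by word band, are simple to compute from raw counts. The Python sketch below is only an illustration under stated assumptions: the function names and the toy counts are invented, and nothing here reproduces the actual Russian data.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to instances per million words (ipm)."""
    return count * 1_000_000 / corpus_size


def coverage(counts, top_n):
    """Share of all tokens accounted for by the top_n most frequent items."""
    total = sum(counts.values())
    top = sorted(counts.values(), reverse=True)[:top_n]
    return sum(top) / total


# Toy counts (hypothetical), standing in for a real lemma frequency table.
toy_counts = {"a": 50, "b": 30, "c": 15, "d": 5}

# A word occurring 35 times in a 35-million-word corpus sits at the 1 ipm cut-off.
print(per_million(35, 35_000_000))
# Fraction of the toy corpus covered by its two most frequent items.
print(coverage(toy_counts, 2))
```

A statement such as "the first 3,000 lemmas cover 76.6824% of word forms" corresponds to calling coverage with top_n set to 3000 on the full lemma table and multiplying by 100.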

Word Frequency generators and Vocabulary Analysis software

For a quick-and-easy frequency listing/index of words in your own texts, try the following programs. For pedagogical software and vocabulary analysis programs, see the Teaching and Miscellaneous Links page.

AntWordProfiler

A freeware word profiling program (for Windows and Macintosh OS X), similar to Paul Nation’s Range program. It compares one or more target texts with vocabulary level lists (e.g. Range baseword lists of the most frequent 1000, 2000, 3000 words of English), and produces tables showing which words in the target file(s) appear in the level lists and which do not. It also generates a set of statistics about the target file(s), including number of types and tokens. AntWordProfiler can also display target files with the words in each level list color coded. These can then be edited, for example, to produce simplified texts used for classroom materials.
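The basic idea behind this kind of word profiling can be sketched in a few lines of Python. This is not AntWordProfiler's own code or algorithm: the tokenisation, the function name and the tiny level lists below are all assumptions made for illustration, and the sketch matches plain word forms, not word families.

```python
import re
from collections import Counter


def profile(text, level_lists):
    """Return token count, type count, and per-level word counts for a text."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokeniser (assumption)
    counts = Counter(tokens)
    bands = {name: Counter() for name in level_lists}
    bands["off-list"] = Counter()
    for word, n in counts.items():
        for name, words in level_lists.items():
            if word in words:
                bands[name][word] = n  # assign to the first list containing it
                break
        else:
            bands["off-list"][word] = n  # word appears in no level list
    return len(tokens), len(counts), bands


# Hypothetical level lists; real ones would hold 1,000 words each.
levels = {"level1": {"the", "cat", "sat", "on"}, "level2": {"mat"}}
tokens, types, bands = profile("The cat sat on the mat. The cat purred.", levels)
print(tokens, types)
print(dict(bands["off-list"]))
```

The off-list band is usually the interesting one for materials writers, since it shows which words in a text fall outside the chosen vocabulary levels.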

Compleat Lexical Tutor

Tools include a concordancer, a phrase (n-gram) extractor, VocabProfile (tells you how many words in the text come from the following four frequency levels: (1) the list of the most frequent 1000 word families, (2) the second 1000, (3) the Academic Word List, and (4) words that do not appear on the other lists), a vocab-level-based cloze passage generator and a traditional nth-word cloze builder.


If you found this web site useful, or found an outdated link, don’t forget to let me know.