Concordancers

This page consists of two sections: one lists offline concordance programs, the other web-based concordance facilities. Most of these programs now do more than just run concordances; they often also include facilities for producing frequency lists, calculating collocations, etc.
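For readers who have not seen one, the core of a concordance (a keyword-in-context, or KWIC, display) can be sketched in a few lines of Python. This is only a minimal illustration, not how any of the programs listed below work; the function name `kwic`, the `width` parameter and the sample text are made up for the example.

```python
import re

def kwic(text, node, width=30):
    """Return keyword-in-context lines for every match of `node` in `text`."""
    lines = []
    # \b restricts matches to whole words; the search ignores case.
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Hypothetical sample text, purely for illustration.
sample = "The cat sat on the mat. A cat is not a dog, but the cat does not care."
for line in kwic(sample, "cat", width=20):
    print(line)
```

Real concordancers add sorting by left or right context, searching across many files and regular-expression queries on top of this basic idea.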


Offline Concordancers

These concordancers can be downloaded and run on your own computer, provided they are designed to run on your operating system or through an emulator.

AntConc
(For discussion/support group, see here)

The best free concordancer for Windows, Mac OS X and Linux that I know of. Some commercial programs may have a couple more features, but this one’s free, so don’t complain! Pros: works with all languages (fully Unicode compliant); allows full regular expressions (for very complex searches); does word lists, n-grams/clusters, collocations, and keywords (by comparing against a reference corpus); does distribution plots of occurrences within each file; can handle lemma lists; can handle XML-type and underscored_tag-type part-of-speech tags; the developer continually improves it and is open to feedback (and may I emphasize that it’s free?).

Cons: (at the moment...) very minimal support for SGML/XML/HTML corpora (it simply ignores rather than intelligently mines structural tags) but that’s a problem common to most concordancers.
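To give an idea of what "keywords by comparing against a reference corpus" means, here is a rough Python sketch that ranks words with the log-likelihood statistic, a measure commonly used for keyness. It is an assumption for illustration only: it is not AntConc's code, and the function names and the toy token lists are invented.

```python
import math
from collections import Counter

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood score for one word, given its frequency in two corpora."""
    total = size_target + size_ref
    combined = freq_target + freq_ref
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

def keywords(target_tokens, ref_tokens):
    """Rank words that are relatively more frequent in the target corpus."""
    target, ref = Counter(target_tokens), Counter(ref_tokens)
    n_target, n_ref = len(target_tokens), len(ref_tokens)
    scored = []
    for word, freq in target.items():
        if freq / n_target > ref[word] / n_ref:
            scored.append((word, log_likelihood(freq, n_target, ref[word], n_ref)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy corpora, far too small for real work.
target_tokens = "corpus corpus corpus the the of".split()
ref_tokens = "the the the of of a corpus and and in".split()
for word, score in keywords(target_tokens, ref_tokens):
    print(word, round(score, 3))
```

Words that are unusually frequent in the target corpus float to the top, while words used at about the same rate in both corpora score close to zero.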

WordSmith Tools (v. 7)

Mike Scott’s impressive Windows-based set of tools, including a concordancer, word list, keyword list, as well as concgrams. Extremely fast, and offers a bewildering set of options for configuration and more advanced work than I can describe here.

Limitations:

  • not free
  • no real SGML/XML awareness (but there are workarounds)
  • only runs on Windows

#LancsBox (v 4.0)

A fairly comprehensive Java-based free tool that runs on all major operating systems. Includes facilities for concordancing, investigating dispersion, analysing collocations and displaying them as networks, word lists & n-grams.
Although the tool provides a wide range of analysis types and measures, the interface is rather gloomy and clunky, and, apart from being able to choose the different measures, there appear to be no proper configurability options.
On import, corpus data are also automatically tagged using the TreeTagger. At first glance this may seem an advantage that makes working with the corpora easier, but it probably encourages users to rely on the tagger’s accuracy rather than first tagging and hand-correcting their data to ensure a suitable level of accuracy.

MonoConc Esy (v.2.2)

A fairly full-featured Windows concordancer. Free for individuals carrying out non-commercial research. One feature not available (compared to MonoConc Pro) is Corpus comparison (keywords). If you need that or if you use MonoConc Pro 2.2 and want a copy for your students, you can download the MonoConc Pro Semester Version, which will expire after around 9 or 10 months.

MonoConc Pro (v.2.2)

Concordancer for Windows with powerful (regular expression) search facilities.

Good points: ability to show/hide tags; colour-codes collocates within the main concordance window itself; handles many languages (including Chinese, Japanese and Korean); the Advanced Collocations feature (similar to WordSmith’s clusters feature, but does other things too) is great.

Not-so-good points: not as flexible/customisable as WordSmith.

Simple Corpus Tool

My own free concordancer, mainly designed to handle data annotated in DART format, but also usable with other types of plain text.

Apart from concordancing, also allows the user to edit/annotate corpus texts in basic XML, do n-gram analyses that allow for ignoring and re-interpolating tags/fillers, as well as do user-defined feature counts based on regular expressions.

Concordance and n-gram analyses are hyperlinked to the original files. In contrast to most other concordancers, these files are then directly editable to include annotations.

Currently, no manual available. Instead, you can refer to my presentation given at CoLTA 2015 for features and explanations.

aConCorde

A free multi-lingual concordance tool that supports different encodings. Originally developed for Arabic, and the interface can be switched between English and Arabic. Only has very basic concordancing and frequency analysis functionality. Java-based, so platform-independent.

Downside: Loading a corpus essentially means loading a single file, as neither multi-select nor folder selection is available. Thus, the only way to select a real corpus appears to be to copy a number of files together into a single file.

CasualConc

KWIC concordance lines, word clusters, collocation analysis, and word counts. Integrated with R. Only runs on Mac OS X.

Multiconcord

Multilingual Parallel Concordancer for Windows. It uses truly parallel texts; that is, texts which relate to the same source. Priced at £40 for an educational licence.

Concordancer for Windows (WConcord)

WConcord is a fast and easy-to-use concordancer for unlimited amounts of text. It allows the user to load multiple plain text files (.txt) and create concordances based on simple or complex search patterns. Searches can be stored in a simple file format and called again for later searches over other corpora. The search facility has some capabilities for handling regular expressions, which are described in the accompanying help file.

WConcord also creates word frequency lists. It provides plain frequency information as well as the cumulative frequency of the tokens in a corpus.
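As a rough illustration of a frequency list with a cumulative column, here is a minimal Python sketch. It is not WConcord's implementation; the tokenisation rule (lower-cased runs of letters and apostrophes) and the function name `frequency_list` are assumptions made for the example.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, frequency, cumulative frequency) rows, most frequent first."""
    # Assumed tokenisation: lower-cased runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    rows = []
    running_total = 0
    for word, freq in Counter(tokens).most_common():
        running_total += freq
        rows.append((word, freq, running_total))
    return rows

# Hypothetical sample text.
rows = frequency_list("The cat and the dog and the bird")
for word, freq, cumulative in rows:
    print(f"{word}\t{freq}\t{cumulative}")
```

The last cumulative value equals the total number of tokens, so dividing any cumulative figure by it shows how much of the text the most frequent words cover.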

A special feature of WConcord is its ability to create collocation statistics. This function calculates the frequencies of co-occurrence of a node word (the search item(s)) with its collocates. The results can be exported in a format that can be imported into a spreadsheet or a database for further processing.
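The general idea of counting collocates around a node word and exporting the result for a spreadsheet might look like the Python sketch below. This is an assumption about the technique in general, not WConcord's actual algorithm or export format; the window size `span`, the function names and the sample sentence are invented.

```python
import csv
import io
import re
from collections import Counter

def collocate_counts(text, node, span=2):
    """Count words occurring within `span` tokens to either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Left window plus right window, leaving out the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

def to_tsv(counts):
    """Serialise the counts as tab-separated text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["collocate", "frequency"])
    writer.writerows(counts.most_common())
    return buffer.getvalue()

# Hypothetical sample text.
sample = "strong tea and strong coffee, then more strong tea, never powerful tea"
counts = collocate_counts(sample, "tea", span=1)
print(to_tsv(counts))
```

Raw co-occurrence counts like these are usually only a first step; association measures such as mutual information or log-likelihood are then computed from them.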

JConcorder

JConcorder is Java software for building and managing word catalogues (created by parsing text documents) and generating concordances from them. It is now available in beta version, either as an application or as an applet.
As a Java program, it’s platform-independent.

corpkit

A tool designed to create, interrogate and visualise parsed corpora. It’s got both a graphical interface (http://interrogator.github.io/corpkit/) and an API (https://github.com/interrogator/corpkit). The user starts with plain text files in corpora/subcorpora (i.e. folders/subfolders). In corpkit, the user can then leave them as plain, have them tokenised, or fully analysed by CoreNLP, which includes POS, lemma, constituency, dependency, etc.
corpkit does a lot of the usual things (parsing, concordancing and keywording) but also extends their potential significantly: you can concordance by searching for combinations of lexical and grammatical features, and can do keywording of lemmas, of subcorpora compared to corpora, or of words in certain positions within clauses. Corpus interrogations can be quickly edited and visualised in complex ways, or saved and loaded within projects, or exported to formats that can be handled by other tools.

SCP

free concordancer for Windows & MacOS X

TextSTAT (Matthias Hüning)

freeware concordancer; reads ASCII/ANSI texts (in different encodings), HTML files (directly from the internet), and MS Word and OpenOffice files (no conversion needed). Produces word frequency lists & concordances (uses regular expressions). Includes a web-spider which reads as many pages as you want from a particular web site and puts them in a TextSTAT-corpus. The news-reader puts news messages in a TextSTAT-readable corpus file.
Multilingual interface and uses Unicode internally: can cope with many different languages & file encodings. Written in Python: should run everywhere where Python runs (Windows XP, Linux, MacOS X).

Multilingual Concordancer (MLCT) (Scott Piao)

free: MLCT (Multilingual Corpus Toolkit) is a Java software package with a GUI (Graphical User Interface). It provides various useful functionalities for building and processing corpora, including sentence boundary detection, concordancing, collocation extraction etc. To run the program, the user needs to install the Java Runtime Environment (JRE).

NoSketch Engine, Manatee, Bonito2

NoSketch Engine is an open-source project combining Manatee and Bonito into a powerful and free corpus management system. It is essentially a limited version of the software powering the Sketch Engine service, a commercial variant offering word sketches, thesaurus, keyword computation, user-friendly corpus creation and other features.
Manatee is a corpus management tool including corpus building and indexing, fast querying and providing basic statistical measures. It utilises a fast indexing library called Finlib. Bonito is a graphical user interface to corpora maintained by Manatee. It is available as a standalone graphical application in Tcl/Tk (version Bonito1, not developed/supported anymore) and as a web interface in Python (version Bonito2, under constant development).

TAPoR 0.2

TAPoR is a gateway to the tools used in sophisticated text analysis and retrieval.

Xaira

A general purpose XML-aware search engine (Windows platform) that will operate on any corpus of well-formed XML documents as well as plain text files (best used with TEI-conformant documents); Unicode-compliant, so works with any language provided the relevant Unicode font is installed on the system. Originally developed at OUCS for use with the British National Corpus.

Emdros

a text database engine for analyzed or annotated text; supports storage and retrieval of any kind of text plus annotations/analyses of that text. Linguistic analyses are its primary target, and here syntactic analyses are in focus (although other linguistic levels are supported, too). It excels in storing and querying structured data, supporting multiple hierarchies of embedding over the same text. Its powerful query language is built around sequence and embedding as the primary structuring operations. It implements the EMdF database model and the MQL query language.

IMS Corpus Workbench (CWB)

Excellent corpus query system (my personal favourite) for SunOS 4.1.x, Solaris 2.x/Linux; powerful (full regular expression searches). Fast (indexed) concordancer with both command-line (including batch mode) & X-windows interface; free for educational use. [Query Syntax & Examples here]
Drawbacks: steep learning curve for beginners and non-UNIX/Linux initiates; corpora need to be pre-indexed (can be complicated for marked-up texts); very limited SGML awareness.

Concordance (R.J.C. Watt)

concordancer for Windows; has facility for publishing concordances to the web; supports non-European character sets (inc. Chinese, Japanese & Korean; currently [18:10 11-Feb-2016] not available).

Web-based Concordancers

BNCweb (CQP edition)
(Free access via Lancaster University’s server)

The most powerful and user-friendly free interface to the British National Corpus (XML World Edition): a browser-based tool for exploring the BNC. Incorporates genre categories as set out in David Lee’s BNC Index and access to the audio recordings for more than 5 million words of spoken data. For more information on how to work with audio data, see the Searching Audio Data guide (also available directly from within BNCweb).

There is a manual/textbook that accompanies this tool: Hoffmann, Sebastian, Evert, Stefan, Smith, Nicholas, Lee, David & Ylva Berglund Prytz. (2008). Corpus Linguistics with BNCweb: A Practical Guide. Frankfurt am Main: Peter Lang. (Publisher’s site is here.)

BYU-BNC
(Mark Davies)

allows word-, phrase- or part-of-speech-based searches of the British National Corpus (BNC) with genre-restrictions; allows wildcards and "fuzzy matches". (Formerly called VIEW: Variation In English Words And Phrases)

Compleat Lexical Tutor

web-based suite of tools for data-driven self-learning (mainly for vocabulary). The online tools allow any reader with an Internet connection to transform any text of interest into a self-teaching text linked to speech, dictionary, concordance, and self-test resources. You paste a text/corpus into one of the tools provided and get results via your browser. Tools include a concordancer, a phrase (n-gram) extractor, VocabProfile (tells you how many words in the text come from the following four frequency levels: (1) the list of the most frequent 1000 word families, (2) the second 1000, (3) the Academic Word List, and (4) words that do not appear on the other lists), a vocab-level-based cloze passage generator and a traditional nth-word cloze builder.
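The four-level profiling idea can be sketched in Python as follows. This is a simplified illustration rather than the Lextutor implementation: the three word sets are tiny made-up stand-ins (not the real list contents), matching is on exact word forms rather than word families, and the band labels and the name `vocab_profile` are invented.

```python
import re
from collections import Counter

# Tiny illustrative stand-ins; the real lists are far longer and are
# organised into word families rather than single forms.
BANDS = [
    ("first 1000", {"the", "of", "we", "and", "new"}),
    ("second 1000", {"weather", "collect"}),
    ("academic", {"data", "analyse"}),
]

def vocab_profile(text, bands=BANDS):
    """Return the share of tokens in each band; unmatched tokens are off-list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for name, words in bands:
            if token in words:
                counts[name] += 1
                break
        else:
            # No band matched this token.
            counts["off-list"] += 1
    names = [name for name, _ in bands] + ["off-list"]
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {name: counts[name] / total for name in names}

# Hypothetical sample sentence.
profile = vocab_profile("We collect the weather data and analyse the new zeugma")
for band, share in profile.items():
    print(f"{band}: {share:.0%}")
```

The bands are checked in order, so a word that appeared on more than one list would be counted at the first level where it is found.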

Just The Word (Sharp)

Simplest and most pedagogically accessible tool for ESL/EFL learners based on the British National Corpus (BNC). Enter a word and get back a bunch of collocations & colligations, sorted into similarity groups. (Based on an 80-million-word subset of the BNC.)

Phrases in English (PIE)

PIE incorporates a database of all 1-6-grams (phrases 1 to 6 "words" long) with part-of-speech (POS) codes occurring three or more times in the 100-million-word British National Corpus (BNC). You can explore English phraseology either through lists of forms and their frequencies or by searching for specific forms or collocations, e.g. 2-grams of the pattern "ADJ work", to find the most frequent adjectives describing work. PIE also offers a phrase pattern discovery tool, "phrase-frames": sets of variants of an n-gram identical except for one word (wildcard symbol *), e.g., "the * of the", with variants such as "the end of the", "the rest of the", "the top of the", "the nature of the". Over the next year PIE will add: (i) click on an n-gram in the query results to see concordances from the BNC; (ii) POS-grams and POS-frames for studying the relative productivity of phrase structures; (iii) filtering by text type (domain, genre, target audience) for contrastive studies; (iv) query by regular expression (currently only wildcards are supported).
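The phrase-frame idea (n-grams that differ in exactly one position) can be illustrated with a short Python sketch. It is only an assumption about how such frames can be derived from a token list, not PIE's own code, and it leaves out the POS codes and the frequency threshold; the function names and the sample sentence are made up.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def phrase_frames(tokens, n, slot):
    """Group n-grams that are identical except for the word at index `slot`."""
    frames = defaultdict(Counter)
    for gram in ngrams(tokens, n):
        # Replace the variable position with the wildcard symbol.
        frame = gram[:slot] + ("*",) + gram[slot + 1:]
        frames[frame][gram[slot]] += 1
    return frames

# Hypothetical sample text.
tokens = "the end of the day and the rest of the week and the top of the hill".split()
frames = phrase_frames(tokens, n=4, slot=1)
for frame, fillers in frames.items():
    if len(fillers) > 1:  # show only frames with more than one variant
        print(" ".join(frame), "->", sorted(fillers))
```

Frames with many different fillers point to productive patterns, while frames with a single filler are effectively fixed phrases.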

SACODEYL Search

A web-based search tool that can be loaded directly with corpora created using SACODEYL Annotator.

SketchEngine

A fee-based Corpus Query System (based on Manatee) incorporating word sketches, grammatical relations, and a distributional thesaurus.
A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour.
A 30-day free trial account is available.
Web-based service using standard browsers: no software installation required.

Available Resources:

  • preloaded corpora 1M to 26B words
  • 90+ languages
  • WebBootCaT (for building your own instant corpus from web pages, then extracting keywords, specialist terminology, etc.)
  • CorpusBuilder (upload and install your own corpora)
  • New feature: diachronic analysis (trending words)

SKELL (Sketch Engine for Language Learning)

Searches more than one billion words of English from news, scientific papers, Wikipedia articles, fiction books, web pages, blogs. Three functions: (1) Examples [concordance]: search for a word or a phrase and get the most presentable sentences for it. (2) Word sketch [collocations & colligations]: a list of words which occur frequently together with the searched word. (3) Similar words: words (not only synonyms) used in similar contexts, visualised as a word cloud.

Also available for Russian and Czech.

Turbo Lingo (Danko Sipka)

free web-browser-based concordancer. You can get concordances and frequency lists of entire Web pages (by entering a URL), or by pasting a text into the input box. Also features "1x1phonotactics" and "1x1 lex. combinatorics".

* The above represent just a personal selection. There are many more out there. Kennedy (1998: 258-267) lists and describes quite a number of them.

* See also: Using the Web as a corpus


If you found this web site useful, or found an outdated link, don’t forget to let me know.