Software for Linguistics

The tools on this page are programs I have developed for various language analysis or annotation purposes over a number of years. They are freely usable for non-commercial purposes under the GPL 3.0 licence.

At the request of some colleagues, I’ve recently also created 32bit versions of most programs. These will run on older Windows versions, can easily be carried on a memory stick to be used on any Windows computer, and can also be run on Mac OS X and Linux using Wine. As 64bit computers are increasingly becoming the norm, though, I’ll only produce 32bit versions on special request from now on (Fri May 03 14:30:15 2019).

This page will be expanded, e.g. by adding more tools and more extensive descriptions, in due course. Please report any bugs you might identify to me, so I can fix them and make the programs more useful/useable. Also feel free to email me suggestions for improvement or additional features. Please note, though, that I’m currently ‘transitioning’ my programs from Perl & Perl/Tk to Python & PyQt, so that I will probably not make any major changes in the foreseeable future.

‘Installation’ Instructions

My tools generally don’t require any installation, but I sometimes provide installers for the convenience of users less experienced with extracting from zip archives. The main advantage of extracting from zip files, though, is that you should also be able to run the programs from a memory stick without installation.

As most programs are designed to allow more experienced users to change/customise the configuration files or to store associated data in a ‘data’ folder within the same folder the tool resides in, you should always extract all files from the zip archive to a location where you have write-access. This should generally not be the ‘Program files’ folder, because that folder restricts write access, but instead a folder like ‘C:\<toolname>’, where <toolname> is the name of the respective tool, ideally without spaces. For frequent use, you may also wish to set up a shortcut to the executable on your desktop. This can normally be achieved by right-clicking the executable in your file manager, holding and dragging it to the desktop, and then, after releasing the right mouse button, selecting ‘Create shortcut here’.

I’ve recently also noticed that installation to a folder containing Chinese or other ‘non-English’ characters appears to cause issues in some programs finding the configuration files, etc. If you encounter such a problem, please move the program files to a folder that only contains basic Latin characters.
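If you are unsure whether a folder path is affected, a quick check along the following lines may help. This is only a minimal Python sketch and not part of any of the tools; the function name non_ascii_parts and the example paths are made up for illustration.

```python
from pathlib import PureWindowsPath


def non_ascii_parts(path: str) -> list[str]:
    """Return the components of a Windows path that contain non-ASCII characters."""
    return [part for part in PureWindowsPath(path).parts if not part.isascii()]


# Hypothetical example paths: an empty list means the path only uses basic
# Latin (ASCII) characters; otherwise the offending components are listed.
print(non_ascii_parts(r"C:\DART"))
print(non_ascii_parts(r"C:\工具\DART"))
```

Any component reported here is a candidate for renaming, or a reason to move the program files elsewhere.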

The Dialogue Annotation and Research Tool (DART; Ver. 3.0)

Thu May 30 11:36:06 2019: Version 3.0.1 released.

The Dialogue Annotation and Research Tool is an annotation tool and linguistic research environment that not only makes it possible to annotate large numbers of dialogues automatically, but also provides facilities for pre- and post-editing dialogue data, as well as conducting different types of analysis on annotated and un-annotated data in order to improve the annotation process. To decide whether DART may be interesting/useful for you, you can first take a look at the DART Manual. The new version now identifies 162 speech acts automatically, and also has a number of additional functions and improvements to both the interface and the output options from within the individual analysis modules, as well as a completely new pattern counting facility.

A PDF that contains the current speech-act taxonomy used in/by DART is available from here.

To ‘install’, just extract all files to a folder where you have write access, ideally ‘C:\DART’ or something similar. You can then start the program by running dart.exe. Along with the program and its resource files, a complete non-annotated version of the SPAADIA corpus (see below) is provided in the ‘spaadia’ folder for practice, so that you can test the annotation feature yourself and see which annotation features may require post-processing.

If you installed version 3 before 30th May 2019, please replace it with version 3.0.1, which contains some minor bug fixes and improvements to some of the resource files.

Current version: 3.0

Older Versions:

The Text Annotation and Research Tool (TART; work in progress)

The Text Annotation and Research Tool (TART) will be DART’s written-language counterpart for identifying the equivalent of speech acts in written language. Although I already began developing initial design ideas and a simple prototype based on the DART model a few years ago, and also presented on some of these at CL 2015 in Lancaster, various other commitments have delayed the development so far.

Apart from including annotation and analysis features very similar to those in DART, TART will also contain features geared more specifically towards written-language analysis. Some of these will be derived from the features implemented in the Text Feature Analyser, such as measures of lexical density and various ratios based on different units of analysis (whole texts, paragraphs, sentences, and possibly other types of textual divisions). Most of these will be calculable with or without stopwords, and will include a variable norming factor to facilitate comparison between texts of unequal length.
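To illustrate the kind of measures meant here, the following Python sketch computes a simple lexical density and a normed frequency. It rests on several assumptions: lexical density is approximated as the share of non-stopword tokens, the stopword list is a tiny made-up sample, and the tokeniser is deliberately crude. None of this reflects the actual formulas used in the Text Feature Analyser or planned for TART.

```python
import re

# Tiny illustrative stopword list (an assumption; a real list would be far longer).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "on", "to", "it"}


def tokenise(text: str) -> list[str]:
    """Lower-case the text and extract simple alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_density(tokens: list[str]) -> float:
    """Share of tokens that are not stopwords (a rough proxy for content words)."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in STOPWORDS]
    return len(content) / len(tokens)


def normed_frequency(count: int, total_tokens: int, norm: int = 1000) -> float:
    """Frequency per `norm` tokens, so that texts of unequal length can be compared."""
    return count * norm / total_tokens if total_tokens else 0.0


tokens = tokenise("The cat sat on the mat and the dog slept.")
print(lexical_density(tokens))                              # 5 of 10 tokens are non-stopwords
print(normed_frequency(tokens.count("the"), len(tokens)))   # occurrences of "the" per 1,000 tokens
```

Changing the norm argument (e.g. to 100 or 10,000) corresponds to the variable norming factor; the same functions could be applied per paragraph or per sentence instead of per text.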

One major part of the ongoing development consists in modelling all the relevant features that will enable TART to annotate texts from different genres/text types, as well as providing different means of analysing or filtering by these specific features.

So far, the main interface has already been ported from DART, and a new XML document category for written-text categories defined (including the levels of text, heading, and paragraph) and integrated into the tool. In addition, a number of routines for some automated pre-processing (paragraph splitting based on major punctuation marks, etc.) have been implemented, and a number of conversion and/or extraction tools for converting data from existing reference corpora have been created to make it possible to test the TART routines using various types of data.

The Tagging Optimiser

The Tagging Optimiser (Ver. 1.0; released 02-Oct-2018) helps corpus users to automatically enhance the tagging accuracy and readability of output from three freeware taggers: the TreeTagger, the Stanford POS Tagger, and the Simple PoS Tagger (see below). It does so by diversifying the original tagset, fixing some of the errors caused by the probabilistic engines underlying the taggers, and making the tags more readable by expanding their names. Details about the tagset can be found in the accompanying manual.

The Tagging Optimiser is available as either a 64- or 32bit program:

To install, simply follow the general ‘installation’ instructions above and run ‘tagOpt64.exe’ or ‘tagOpt32.exe’, respectively.

The Simple PoS Tagger

The Simple PoS Tagger (Ver. 1.0) is an interface to a slightly modified version of the Perl Lingua::EN::Tagger module that allows the user to add morpho-syntactic tags to a text automatically, and then post-edit the colour-coded output. To ‘install’, simply extract the files from the zip archive into a folder you have write-access to and run the executable (‘Tagger.exe’). The interface should be relatively intuitive to use, but some basic usage info is provided under the ‘Help’ menu.

The output in my interface differs slightly from the original version produced by the tagger in that I’ve replaced the original slashes that separate words and tags with the more ‘traditional’ underscore format, which provides better readability.
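The difference between the two formats can be sketched in a few lines of Python. This is an illustration only, not the code used in the tagger interface; the function name and the tag labels in the example are assumptions.

```python
def slash_to_underscore(tagged: str) -> str:
    """Convert word/TAG tokens to word_TAG, splitting on the last slash only."""
    converted = []
    for token in tagged.split():
        # rpartition keeps slashes inside the word itself (e.g. "1/2") intact.
        word, sep, tag = token.rpartition("/")
        converted.append(f"{word}_{tag}" if sep else token)
    return " ".join(converted)


# Hypothetical tagger output; the tag names are purely illustrative.
print(slash_to_underscore("The/DET cat/NN sat/VBD on/IN 1/2/CD"))
```

Splitting on the last slash only is a design choice that avoids mangling tokens such as fractions or dates that contain slashes themselves.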

For future releases, I’m planning to include some more features that will make it possible to explore the tagged text in various ways, e.g. through switching some of the colour-coding on and off to identify structures like NPs, etc., visually.


The Simple Corpus Tool (SCT)

The Simple Corpus Tool (Ver. 2.0; released 19-Jun-2018): A combined annotation & analysis tool for use with either simple XML files (similar to the SPAADIA/DART format) or basic line-based plain-text files. It now includes concordance, pattern counting, and n-gram analysis modules, as well as a tagging option based on Lingua::EN::Tagger.

Files selected for opening are displayed as a list on the ‘Input Files’ workspace tab on the left-hand side of the program, where double-clicking the file name will open the corresponding file in the built-in editor. The ‘Output Files’ workspace tab is only populated if a tagging operation has just been run and then contains the results, which can be opened for editing and manual post-correction in the same way.

The tabbed pane in the right-hand window contains the ‘Concordance’ tab, a tab for the ‘Pattern count’ module, which allows you to define features in the form of a label + regex pair to be counted automatically in all files, as well as a tab for the ‘N-grams’ analysis module.
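For readers unfamiliar with these two types of analysis, the following Python sketch shows the general idea of counting label + regex pairs across several texts and of counting n-grams. It is not how the SCT itself is implemented (the tool is written in Perl), Python’s regex syntax differs from Perl’s in some details, and the file names, labels, and patterns are invented for the example.

```python
import re
from collections import Counter


def count_patterns(texts: dict[str, str], patterns: dict[str, str]) -> dict[str, Counter]:
    """For each named text, count the matches of every labelled regex."""
    compiled = {label: re.compile(rx) for label, rx in patterns.items()}
    return {
        name: Counter(
            {label: sum(1 for _ in rx.finditer(text)) for label, rx in compiled.items()}
        )
        for name, text in texts.items()
    }


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all contiguous sequences of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical in-memory 'files' and label + regex pairs.
texts = {"a.txt": "I think we should go. I think so."}
patterns = {"hedge": r"\bI think\b", "modal": r"\b(?:should|could|would)\b"}

print(count_patterns(texts, patterns)["a.txt"])
print(ngrams(texts["a.txt"].lower().replace(".", "").split(), 2).most_common(1))
```

In a real setting, the texts would be read from the files listed on the ‘Input Files’ tab rather than defined inline.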

The built-in editor allows the user to add tags and attributes from user-configurable toolbars and menus. The editor itself is based on the Perl/Tk ‘TextUndo’ widget, and provides undo/redo functionality through the keyboard shortcuts Ctrl+Z and Ctrl+Y, respectively, as in many standard text editors.

Version 2.0 now has a proper PDF help file, which can be opened via the relevant menu or by pressing F1.


The SPAADIA concordancer

The SPAADIA concordancer (32bit Windows version): a concordancer (mainly) for use with the SPAADIA corpus (see). Theoretically, the concordancer can handle any plain text-based files, such as .txt, (X)HTML and XML files, though, provided that the right extension is set in the box on the top right-hand side of the interface. The assumed input encoding for files is UTF-8, and the concordancing works best for files where tags and text are separated. The concordancer allows for searches of one or two search strings in combination, using the full set of Perl regular expressions. Any whitespace in an expression needs to be ‘quoted’ via \s and possibly quantified if there may be multiple spaces. Now largely superseded by the above.
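The point about whitespace can be illustrated with a small keyword-in-context sketch in Python. This is not the concordancer’s own code, and Python’s regex flavour is similar but not identical to Perl’s; the function name, the context width, and the sample sentence are assumptions made for the example.

```python
import re


def concordance(text: str, pattern: str, width: int = 20) -> list[str]:
    """Return one keyword-in-context line for every regex match in `text`."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that the matches line up vertically.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


# The second 'Could  you' contains two spaces, so a single \s would miss it;
# the quantified form \s+ finds both occurrences.
sample = "Could you tell me the time? Could  you book a ticket, please?"
for line in concordance(sample, r"Could\s+you"):
    print(line)
```

Using \s+ rather than a single \s is what ‘quantifying’ the whitespace refers to above.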

The Text Feature Analyser

The Text Feature Analyser (Ver. 2.1; 64bit Windows only; released 07-May-2014): a tool for investigating textual features that may help in identifying & measuring issues related to text complexity. The basic design and usage are described in my original article on the tool, which can be downloaded from my publications page. Please note that some features discussed in that article, such as the concordancing functionality, have already been added since then.

This tool will soon undergo a (major) re-write. The most recent version already contains a bug fix for the syllable count, which had produced some errors in the earlier version, and now also adds output that lists the number of (estimated) syllables in a document. Future versions will not only contain better documentation (which is, admittedly, very sparse at the moment), but probably also a tabbed interface where the analysis for each new document will be listed on a separate tab, etc.


Version 2

ICEWeb is a small & simple utility for compiling & analysing web corpora. The name was chosen because the main intention behind the tool is to allow researchers to augment existing or create new corpora for the International Corpus of English (ICE).

It is designed to be as user-friendly as possible, yet still allows some fairly sophisticated processing, including boiler-plate removal, of whatever web pages are downloaded. For a slightly more comprehensive overview of its features, you can take a look at my ICAME 39 presentation. For detailed information, please consult the ICEweb2 Manual (included in the distribution).

You can download the 64bit version from here. If anyone should really still require a 32bit version, please send me an email, and I’ll compile one and will add it here. To ‘install’, simply follow the ‘installation’ instructions and then run ‘iceWeb2_64bit.exe’.

Fri 06-Mar-2020 09:48:38: Version 2.2 now also allows you to change the language specified when creating queries via the configuration file using the IANA country code.

Please note that there were some bugs in version 2.0, which may have prevented queries from being opened in search engines if you had more than two seed terms, and may also have prevented directories for URL files from being created. If you downloaded version 2.0, please replace it with 2.2.

Version 1

Please note that this version has now been superseded by Version 2 above, which has greatly enhanced features. I’m still keeping this version online for the moment, though, as it’s the one described in Section 4.2.4 in my Introduction to Corpus Linguistics textbook, and used for the exercise there.

ICEWeb (Ver. 1; 32bit Windows):

The (Phonetic) Transcription Editor

The (Phonetic) Transcription Editor (32bit Windows version): as the name says, mainly an editor for creating phonetic transcriptions, which allows output to be saved to a UTF-8 encoded text file or (double-spaced) HTML page, suitable for submission of assignments. The program also provides an option for grapheme-to-phoneme conversion, which, however, has some serious limitations, as it ‘knows’ nothing about strong and weak syllables or features of connected speech.