Software for Linguistics

The tools on this page are programs I have developed for various language analysis or annotation purposes over a number of years. They are freely usable for non-commercial purposes under the GPL 3.0 licence. Upon request by some colleagues, I’ve recently also created 32bit versions for most programs. These will run on older Windows versions, can easily be carried on a memory stick to be used on any Windows computer, and can also be run on Mac OS X and Linux using Wine.

Although the programs generally don’t require any installation, I sometimes provide installers for the sake of convenience for users less experienced with extracting from zip archives. The main advantage of extracting from zip files is that you should also be able to run the programs from a memory stick without installation. As most programs are designed to allow more experienced users to change/customise the configuration files, you should always copy them somewhere you have write-access to, i.e. for non-administrator users generally not the ‘Program files’ folder, because that folder restricts write access. I’ve recently also noticed that installation to a folder containing Chinese characters appears to cause some programs to have problems finding the configuration files, etc. If you encounter such a problem, please move the program files to a folder whose path only contains basic Latin characters.

This page will be expanded, e.g. by adding more tools and more extensive descriptions, in due course. Please feel free to send me any comments, bug reports, and/or suggestions you might have.

The Simple PoS Tagger

The Simple PoS Tagger (Ver. 1.0) is an interface to a slightly modified version of the Perl Lingua::EN::Tagger module that allows the user to add morpho-syntactic tags to a text automatically, and then post-edit the colour-coded output. To ‘install’, simply extract the files from the zip archive into a folder you have write-access to and run the executable (‘Tagger.exe’). The interface should be relatively intuitive to use, but some basic usage info is provided under the ‘Help’ menu.

The output in my interface differs slightly from the original version produced by the tagger in that I’ve replaced the original slashes that separate words and tags by the more ‘traditional’ underscore format that provides better readability.
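As a rough illustration of the difference between the two formats, the following Python sketch rewrites slash-separated word/tag pairs into the underscore format. It is not the code used inside the tagger interface; the function name and the sample tags are assumptions made purely for the example.

```python
def slashes_to_underscores(tagged: str) -> str:
    """Rewrite whitespace-separated word/TAG tokens as word_TAG."""
    converted = []
    for token in tagged.split():
        # Split on the LAST slash only, so that a word which itself
        # contains a slash (e.g. "and/or") keeps its internal slash.
        word, sep, tag = token.rpartition("/")
        # Tokens without any slash are passed through unchanged.
        converted.append(f"{word}_{tag}" if sep else token)
    return " ".join(converted)


# Hypothetical tagger output, used only for illustration.
sample = "the/DET cat/NN sat/VBD on/IN the/DET mat/NN"
print(slashes_to_underscores(sample))
```

Splitting on the last slash rather than the first is a deliberate choice here, as the tag always comes at the end of the token.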

For future releases, I’m planning to include some more features that will make it possible to explore the tagged text in various ways, e.g. through switching some of the colour-coding on and off to identify structures like NPs, etc., visually.

Versions:

The Dialogue Annotation and Research Tool (DART; Ver. 1.1)

The Dialogue Annotation and Research Tool is an annotation tool and linguistic research environment that not only makes it possible to annotate large numbers of dialogues automatically, but also provides facilities for pre- and post-editing dialogue data, as well as conducting different types of analysis on annotated and un-annotated data in order to improve the annotation process. To decide whether DART may be interesting/useful for you, you can first take a look at the DART Manual. Please note that the earlier version of the manual did not make it clear that the DART format requires all words, apart from proper nouns, to be lowercased, so if you've been getting unexpectedly bad annotation results, they may be due to this feature ;-) and you simply need to adjust your data to fix this.

If you’re planning to download & ‘install’ DART anyway, there’s no need to get the manual first, as it will automatically be copied to the ‘docu’ folder in the DART program folder. Another PDF containing the current speech-act taxonomy used in/by DART is available from here. There’s now also a ‘quick-start guide’ in the form of my recent workshop presentation at the 5th HAAL conference in HK, Pragmatic Annotation & Analysis in DART.

Along with the program and its resource files, some sample files for ‘playing’ will also be installed, one into the ‘test’ folder, and the complete set of original Trainline data, which represents the un-annotated data for the SPAADIA corpus (see below), into the ‘trainline’ folder.

In order to facilitate the distribution, I’ve decided to only distribute 32bit versions in the form of zip archives as of version 1.1. The older installation programs are still available below, but it’s generally advisable to use the latest version if possible.

Current version: DART 1.1 (32bit only zip archive). Released 18-May-2015.

Older Versions:

I’m currently working on version 2.0, but release has been delayed by two factors: a) the article describing version 1 in Corpus Linguistics and Linguistic Theory was only published in October (2016), and it would be confusing for readers to encounter a radically different interface should they try to test it, and b) I haven’t yet been able to include all the features I wanted to. The new version, which I already used for my workshop at CL2015 in a first 'beta' version, contains an optimised tabbed user interface, shamelessly ‘borrowed’ in style from AntConc, a number of improved analysis options, and optimisations in the syntactic analysis and pragmatic inferencing process. Below is a screenshot of the latest beta, which I can make available on request, as a preview.

The Simple Corpus Tool

The Simple Corpus Tool (Ver. 1.; released 14-Dec-2015): A combination of viewer/editor, concordancer, and analysis tool for use with either simple XML files (similar to the SPAADIA format), or basic line-based plain-text files. Files selected for opening are displayed as a list of hyperlinks in a window on the left-hand side of the program, where clicking the hyperlink will open a file in the built-in editor. The tabbed pane in the right-hand window contains a tab for ‘Concordance’ output based on the files that are listed on the left, as well as a tab (‘Feature definitions’) for defining features to be counted automatically in all files in the form of a label + regex pair. The results of the feature count are displayed in the window below the ‘Feature count’ tab and can be copied and pasted straight into a spreadsheet program, such as MS Excel or OpenOffice Calc, for further analysis.
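To give an idea of what counting features via label + regex pairs amounts to, here is a minimal Python sketch. It is not the tool’s own code (the tool is written in Perl), and the function names, feature labels, patterns and sample texts are all hypothetical. Tab-separated output is assumed because that is what pastes cleanly into a spreadsheet.

```python
import re


def count_features(texts, features):
    """Count regex matches per feature label in each named text.

    texts:    dict mapping a file name to its text content
    features: dict mapping a feature label to a regex pattern
    Returns a list of rows: a header row, then one row per text.
    """
    compiled = {label: re.compile(pattern) for label, pattern in features.items()}
    rows = [["file"] + list(compiled)]
    for name, text in texts.items():
        # finditer counts non-overlapping matches of each pattern.
        counts = [sum(1 for _ in rx.finditer(text)) for rx in compiled.values()]
        rows.append([name] + counts)
    return rows


def to_tsv(rows):
    """Join rows into tab-separated lines for pasting into a spreadsheet."""
    return "\n".join("\t".join(str(cell) for cell in row) for row in rows)


# Hypothetical feature definitions and in-memory 'files'.
features = {"well": r"\bwell\b", "i_think": r"\bi\s+think\b"}
texts = {
    "a.txt": "well , i think so . well , maybe",
    "b.txt": "i do not think so",
}
print(to_tsv(count_features(texts, features)))
```

Each row of the result corresponds to one file and each column to one feature label, so the table can be sorted or charted directly in the spreadsheet.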

The built-in editor allows the user to add tags and attributes from toolbars and menus. These can even be edited to customise the editor to some degree by editing the files in the ‘conf’ folder. The editor itself is based on the Perl/Tk TextUndo widget and provides undo/redo functionality through the keyboard shortcuts Ctrl+Z and Ctrl+Y, respectively, as in many standard text editors.

The current version also contains a highly flexible n-gram analysis module, and has some interface enhancements.

Until I can find the time to write a proper manual, please consult the ‘Simple Corpus Tool Help’, triggered via the menu or F1, inside the program.

Versions:

The SPAADIA concordancer

The SPAADIA concordancer (32bit Windows version): a concordancer (mainly) for use with the SPAADIA corpus (see). In principle, though, the concordancer can handle any plain-text-based files, such as .txt, (X)HTML and XML files, provided that the right extension is set in the box on the top right-hand side of the interface. The assumed input encoding for files is UTF-8, and the concordancing works best for files where tags and text are separated. The concordancer allows for searches of one or two search strings in combination, using the full set of Perl regular expressions. Any whitespace in an expression needs to be ‘quoted’ via \s, and quantified (e.g. \s+) if there may be multiple spaces. This concordancer is now largely superseded by the Simple Corpus Tool above.
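The effect of quantifying \s can be illustrated with a small keyword-in-context sketch in Python. This is only an approximation of the idea, not the concordancer’s implementation: Python’s regular expressions are close to, but not identical with, Perl’s, and the way the optional second search string is combined with the first (the line must also match it) is purely an assumption, as are the function name and the sample lines.

```python
import re


def concordance(lines, pattern, second=None, width=20):
    """Return (left, match, right) triples for each hit of `pattern`.

    If `second` is given, only lines that also match it are searched
    (an assumed way of combining two search strings).
    """
    first_rx = re.compile(pattern)
    second_rx = re.compile(second) if second else None
    hits = []
    for line in lines:
        if second_rx and not second_rx.search(line):
            continue
        for m in first_rx.finditer(line):
            # Up to `width` characters of context on either side.
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width]
            hits.append((left, m.group(0), right))
    return hits


# Hypothetical sample lines; note the double space in the first one.
lines = [
    "i would like  to book a ticket to london",
    "would you like a return ticket",
]
# \s+ matches one or more whitespace characters between the two words.
for left, match, right in concordance(lines, r"like\s+to", second=r"ticket"):
    print(f"{left:>20} [{match}] {right}")
```

With an unquantified single space in the pattern, the first line would not be found at all, because it contains two spaces between ‘like’ and ‘to’.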

The Text Feature Analyser

The Text Feature Analyser (Ver. 2.1; 64bit Windows only; released 07-May-2014): a tool for investigating textual features that may help in identifying & measuring issues related to text complexity. The basic design and usage are described in my original article on the tool, which can be downloaded from my publications page. Please note that some features discussed in that article, such as the concordancing functionality, have already been added since then.

This tool is currently undergoing a (major) re-write. The most recent version already contains a bug fix for the syllable count, which had produced some errors in the earlier version, and now also adds output that lists the number of (estimated) syllables in a document. Future versions will not only contain better documentation (which is, admittedly, very sparse at the moment), but probably also a tabbed interface where the analysis for each new document will be listed on a separate tab, etc.

ICEweb

ICEweb (32bit Windows version): a small & simple utility for compiling, downloading & analysing web corpora. The name was chosen because the original intention was to create corpora that are similar in nature to the International Corpus of English (ICE) data.

Versions:

The (Phonetic) Transcription Editor

The (Phonetic) Transcription Editor (32bit Windows version): as the name says, mainly an editor for creating phonetic transcriptions, which allows output to be saved to a UTF-8 encoded text file or (double-spaced) HTML page, suitable for submission of assignments. The program also provides an option for grapheme-to-phoneme conversion, which, however, has some serious limitations, as it ‘knows’ nothing about strong and weak syllables or features of connected speech.