Software for Linguistics

The tools on this page are programs I have developed for various language analysis or annotation purposes over a number of years. They are freely usable for non-commercial purposes under the GPL 3.0 licence. At the request of some colleagues, I’ve recently also created 32bit versions of most programs. These will run on older Windows versions, can easily be carried on a memory stick to be used on any Windows computer, and can also be run on Mac OS X and Linux using Wine.

Although the programs generally don’t require any installation, I sometimes provide installers for the convenience of users less experienced with extracting from zip archives. The main advantage of extracting from zip files is that you should also be able to run the programs from a memory stick without installation. As most programs are designed to allow more experienced users to change/customise the configuration files, you should always copy them somewhere you have write access to, i.e. for non-administrator users generally not the ‘Program files’ folder, because that folder restricts write access. I’ve recently also noticed that installing to a folder whose path contains Chinese characters appears to prevent some programs from finding their configuration files, etc. If you encounter such a problem, please move the program files to a folder whose path contains only basic Latin characters.
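If you want to check a folder path before copying a program there, something along the following lines may help. This is only a minimal sketch in Python, not part of any of the tools: the function name is made up for illustration, and it treats ‘basic Latin’ as plain ASCII, which also rejects accented letters (whether those cause the same problem is not known).

```python
from pathlib import PureWindowsPath

def uses_only_basic_latin(folder):
    """Return True if every character in the folder path is plain ASCII."""
    return str(PureWindowsPath(folder)).isascii()

print(uses_only_basic_latin(r"C:\DART"))       # True
print(uses_only_basic_latin(r"C:\工具\DART"))   # False
```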

This page will be expanded, e.g. by adding more tools and more extensive descriptions, in due course. Please feel free to send me any comments, bug reports, and/or suggestions you might have.


The Dialogue Annotation and Research Tool (DART; Ver. 2.0)

The Dialogue Annotation and Research Tool is an annotation tool and linguistic research environment that not only makes it possible to annotate large numbers of dialogues automatically, but also provides facilities for pre- and post-editing dialogue data, as well as conducting different types of analysis on annotated and un-annotated data in order to improve the annotation process. To decide whether DART may be interesting/useful for you, you can first take a look at the old version of the DART Manual. I plan to write a new version of the manual in autumn, but DART 2.0 now contains a built-in help system which explains most of the relevant features and functions, anyway, so the manual will only provide some additional detail and screenshots.

A PDF containing the current speech-act taxonomy used in/by DART is available from here.

To ‘install’, just extract all files to a folder where you have write access, ideally ‘C:\DART’ or something similar. You can then start the program by running dart.exe. Along with the program and its resource files, some sample files for ‘playing’ are also provided, three in the ‘test’ folder, and the complete annotated data for the SPAADIA (v. 2) corpus (see below), in the ‘spaadia’ folder.

Current version: 2.0

Older Versions:


The Text Annotation and Research Tool (TART; work in progress)

The Text Annotation and Research Tool (TART) will be DART’s written-language counterpart for identifying the equivalent of speech acts in written language. Although I already began developing initial design ideas and a simple prototype based on the DART model a few years ago, and also presented on some of these at CL 2015 in Lancaster, various other commitments have prevented me from completing it so far.

Apart from including annotation and analysis features very similar to DART’s, TART will also contain features geared more specifically towards written-language analysis. Some of these will be derived from the features implemented in the Text Feature Analyser, such as measures of lexical density and various ratios based on different units of analysis (whole texts, paragraphs, sentences, and possibly other types of textual divisions). Most of these measures will be calculable with or without stopwords, and will include a variable norming factor to facilitate comparison between texts of unequal length.
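To give a rough idea of what such measures involve, here is a minimal Python sketch. It is not the code used in TART or the Text Feature Analyser: the tiny stopword list, the function names, and the choice to treat non-stopwords as a proxy for content words are all assumptions made purely for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be far longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lexical_density(tokens, stopwords=STOPWORDS):
    """Share of tokens that are not stopwords (a rough proxy for content words)."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in stopwords]
    return len(content) / len(tokens)

def type_token_ratio(tokens, stopwords=None):
    """Distinct word forms divided by tokens, optionally ignoring stopwords."""
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def normed_frequency(count, total_tokens, norm=1000):
    """Scale a raw count to occurrences per `norm` tokens."""
    return count / total_tokens * norm if total_tokens else 0.0

tokens = tokenize("The cat sat on the mat and the dog sat on the cat.")
print(round(lexical_density(tokens), 3))              # 6 of 13 tokens are not stopwords
print(round(type_token_ratio(tokens), 3))             # with stopwords
print(round(type_token_ratio(tokens, STOPWORDS), 3))  # without stopwords
print(round(normed_frequency(4, len(tokens)), 1))     # 'the' per 1,000 tokens
```

The norming factor is simply a parameter, so the same count can be expressed per 100, per 1,000, or per 10,000 words, depending on how long the texts being compared are.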

One major part of the ongoing development consists in modelling all the features needed to enable TART to annotate texts from various genres/text types, as well as providing different means of analysing or filtering by these specific features.


The Simple PoS Tagger

The Simple PoS Tagger (Ver. 1.0) is an interface to a slightly modified version of the Perl Lingua::EN::Tagger module that allows the user to add morpho-syntactic tags to a text automatically, and then post-edit the colour-coded output. To ‘install’, simply extract the files from the zip archive into a folder you have write-access to and run the executable (‘Tagger.exe’). The interface should be relatively intuitive to use, but some basic usage info is provided under the ‘Help’ menu.

The output in my interface differs slightly from that produced by the original tagger in that I’ve replaced the original slashes that separate words and tags with the more ‘traditional’ underscore format, which provides better readability.
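The difference between the two formats can be illustrated with a small Python sketch. This is not the tagger’s own (Perl) code, and the tag names in the example are placeholders rather than the tagset the program actually uses; the function name is likewise made up.

```python
def slashes_to_underscores(tagged):
    """Rewrite each word/TAG token as word_TAG, splitting on the last slash only."""
    converted = []
    for token in tagged.split():
        word, sep, tag = token.rpartition("/")
        # Tokens without a slash are passed through unchanged.
        converted.append(f"{word}_{tag}" if sep else token)
    return " ".join(converted)

print(slashes_to_underscores("The/DET cat/NN sat/VBD"))  # The_DET cat_NN sat_VBD
print(slashes_to_underscores("and/or/CC"))               # and/or_CC
```

Splitting on the last slash only keeps words that themselves contain a slash intact, which is one reason the underscore format is easier to read.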

For future releases, I’m planning to include some more features that will make it possible to explore the tagged text in various ways, e.g. by switching some of the colour-coding on and off to identify structures like NPs, etc., visually.

Versions:


The Simple Corpus Tool

The Simple Corpus Tool (Ver. 1.; released 14-Dec-2015): A combination of viewer/editor, concordancer, and analysis tool for use with either simple XML files (similar to the SPAADIA format), or basic line-based plain-text files. Files selected for opening are displayed as a list of hyperlinks in a window on the left-hand side of the program, where clicking a hyperlink will open the corresponding file in the built-in editor. The tabbed pane in the right-hand window contains a tab for ‘Concordance’ output based on the files that are listed on the left, as well as a tab (‘Feature definitions’) for defining features to be counted automatically in all files, each in the form of a label + regex pair. The results of the feature count are displayed in the window below the ‘Feature count’ tab and can be copied and pasted straight into a spreadsheet program, such as MS Excel or OpenOffice Calc, for further analysis.
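The idea behind counting features defined as label + regex pairs can be sketched in Python as follows. This is only an illustration under assumptions, not the program’s actual implementation: the function names, the example features, and the tab-separated layout are all hypothetical, and Python’s regex syntax differs in some details from Perl’s.

```python
import re

def count_features(texts, features):
    """Count regex matches per feature in each named text.

    texts: mapping of file name -> file content (already read as strings)
    features: mapping of label -> regular expression
    """
    compiled = {label: re.compile(pattern) for label, pattern in features.items()}
    return {
        name: {label: len(rx.findall(content)) for label, rx in compiled.items()}
        for name, content in texts.items()
    }

def to_tsv(counts, labels):
    """Render the counts as tab-separated lines for pasting into a spreadsheet."""
    lines = ["\t".join(["file", *labels])]
    for name, row in counts.items():
        lines.append("\t".join([name, *(str(row[label]) for label in labels)]))
    return "\n".join(lines)

# Hypothetical feature definitions and file contents.
features = {"questions": r"\?", "first_person": r"\b(?:I|we)\b"}
texts = {"a.txt": "Can I help? We can.", "b.txt": "I think we should. Should we?"}

counts = count_features(texts, features)
print(to_tsv(counts, list(features)))
```

Tab-separated output is used here because spreadsheet programs generally split pasted text into columns at tabs.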

The built-in editor allows the user to add tags and attributes from toolbars and menus. These can even be customised to some degree by editing the files in the ‘conf’ folder. The editor itself is based on the Perl/Tk TextUndo widget and provides undo/redo functionality through the keyboard shortcuts Ctrl+Z and Ctrl+Y, respectively, as in many standard text editors.

The current version also contains a highly flexible n-gram analysis module, and has some interface enhancements.
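For readers unfamiliar with n-gram analysis, the basic principle can be sketched in a few lines of Python. This is a generic illustration, not the module built into the program, and the function names are made up for the purpose.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(tokens, n):
    """Count how often each n-gram occurs in the token list."""
    return Counter(ngrams(tokens, n))

tokens = "to be or not to be".split()
for gram, freq in ngram_counts(tokens, 2).most_common(2):
    print(" ".join(gram), freq)
```

Changing `n` yields bigrams, trigrams, and so on; a text shorter than `n` tokens simply produces no n-grams.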

Until I can find the time to write a proper manual, please consult the ‘Simple Corpus Tool Help’, triggered via the menu or F1, inside the program.

Versions:


The SPAADIA concordancer

The SPAADIA concordancer (32bit Windows version): a concordancer (mainly) for use with the SPAADIA corpus (see). In theory, though, the concordancer can handle any plain text-based files, such as .txt, (X)HTML and XML files, provided that the right extension is set in the box on the top right-hand side of the interface. The assumed input encoding for files is UTF-8, and the concordancing works best for files where tags and text are separated. The concordancer allows searches for one search string, or two in combination, using the full set of Perl regular expressions. Any whitespace in an expression needs to be ‘quoted’ via \s, and quantified (e.g. \s+) if there may be multiple spaces. Now largely superseded by the above.
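The effect of writing whitespace as \s can be illustrated with a minimal keyword-in-context sketch in Python. This is not the concordancer’s own code: the function name and the context width are assumptions, and Python’s regular expressions are similar to, but not identical with, Perl’s.

```python
import re

def concordance(text, pattern, width=30):
    """Return (left context, match, right context) for each regex hit."""
    hits = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

text = "Well, thank  you very much. Thank you!"

# \s+ matches one or more whitespace characters, so both hits are found.
for left, match, right in concordance(text, r"(?i)thank\s+you", width=6):
    print(f"{left!r} [{match}] {right!r}")

# A literal single space misses the hit that contains two spaces.
print(len(concordance(text, r"(?i)thank you")))
```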


The Text Feature Analyser

The Text Feature Analyser (Ver. 2.1; 64bit Windows only; released 07-May-2014): a tool for investigating textual features that may help in identifying & measuring issues related to text complexity. The basic design and usage are described in my original article on the tool, which can be downloaded from my publications page. Please note that some features discussed in that article, such as the concordancing functionality, have already been added since then.

This tool will soon undergo a (major) re-write. The most recent version already contains a bug fix for the syllable count, which had produced some errors in the earlier version, and now also adds output that lists the number of (estimated) syllables in a document. Future versions will not only contain better documentation (which is, admittedly, very sparse at the moment), but probably also a tabbed interface where the analysis for each new document will be listed on a separate tab, etc.
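Why syllable counts can only be estimated, and why errors creep in, may become clearer from a simple heuristic such as the following Python sketch. This is not the algorithm used in the Text Feature Analyser, merely a common rule of thumb (counting vowel groups, with an adjustment for silent final ‘e’); the function names are hypothetical.

```python
import re

def estimate_syllables(word):
    """Estimate syllables by counting vowel groups, with a silent-e adjustment."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A final 'e' after a consonant is usually silent ("make"),
    # but not in words ending in "-le" ("table").
    if count > 1 and re.search(r"[^aeiouy]e$", word) and not word.endswith("le"):
        count -= 1
    return max(count, 1)

def estimate_document_syllables(text):
    """Sum the per-word estimates over all alphabetic words in the text."""
    return sum(estimate_syllables(w) for w in re.findall(r"[A-Za-z]+", text))

for word in ["make", "table", "analysis", "rhythm"]:
    print(word, estimate_syllables(word))
```

A purely spelling-based rule like this gets ‘rhythm’ wrong (one syllable instead of two), which is the kind of error that makes any such count an estimate.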


ICEweb

ICEweb (32bit Windows version): a small & simple utility for compiling, downloading & analysing web corpora. The name was chosen because the original intention was to create corpora that are similar in nature to the International Corpus of English (ICE) data.

Versions:


The (Phonetic) Transcription Editor

The (Phonetic) Transcription Editor (32bit Windows version): as the name says, mainly an editor for creating phonetic transcriptions, which allows output to be saved to a UTF-8 encoded text file or (double-spaced) HTML page, suitable for submission of assignments. The program also provides an option for grapheme-to-phoneme conversion, which, however, has some serious limitations, as it ‘knows’ nothing about strong and weak syllables or features of connected speech.