Annotation & Text-Processing Tools

Text Coding/(Manual) Annotation Programs/Text-analysis Tools & Search Engines

Please note that some of these programs produce XML files in standoff format, which separates the text into different linked levels. The advantage of this type of annotation is that it is possible to link various types of annotation to the same set of data, but the disadvantage is that it’s usually not possible to ‘interact’ directly with that data unless this is done through the interface it’s been created with. In other words, creating standoff annotations usually ties one into specific programs and the functionality they provide for merging views.
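To illustrate the standoff idea in general terms, here is a minimal Python sketch that leaves a text untouched and stores several annotation layers separately, linked to the text only by character offsets. The layer names, labels and data layout are invented for illustration and do not reproduce the file format of any program listed on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # character offset where the annotated stretch begins
    end: int     # offset just past the end of the stretch
    label: str   # annotation value on this layer


TEXT = "Could you open the window?"

# Hypothetical layers, kept apart from TEXT and tied to it only by offsets.
LAYERS = {
    "syntax": [Span(0, 26, "yes-no-question")],
    "speech_act": [Span(0, 26, "request")],
    "pos": [Span(0, 5, "modal"), Span(6, 9, "pronoun")],
}


def resolve(text, span):
    """Return the stretch of text that a standoff span points to."""
    return text[span.start:span.end]


def labels_at(layers, offset):
    """Collect (layer, label) pairs for every span covering one offset."""
    return sorted(
        (name, span.label)
        for name, spans in layers.items()
        for span in spans
        if span.start <= offset < span.end
    )


if __name__ == "__main__":
    for name, spans in LAYERS.items():
        for span in spans:
            print(name, span.label, repr(resolve(TEXT, span)))
```

Because every layer depends on the offsets staying valid, editing the text by hand would silently misalign all annotations, which is one reason why standoff data generally has to be handled through the tool that created it.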

DART (Dialogue Annotation & Research Tool)

An annotation & analysis tool designed for the semi-automatic annotation of spoken (transcribed) dialogues on the levels of syntax, pragmatics (speech acts), (surface) polarity, semantics (topics), & semantico-pragmatics (modes, ‘IFIDs’).

DART produces annotations in what I refer to as ‘Simple XML’, a highly readable format that still allows the corpus user to ‘interact’ with the data easily to perform corrections, add additional annotations, etc.

As a research tool, DART also offers facilities for the creation of dialogue corpora and their associated analysis resources, a built-in concordancer, n-gram analysis, as well as speech-act statistics.
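For readers unfamiliar with these facilities, the following Python sketch shows, in the simplest possible form, what a concordancer and an n-gram analysis compute. It is not DART's implementation; the function names, the crude tokeniser and the sample dialogue are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    sample = "So you want a ticket to Leeds. Do you want a single or a return?"
    toks = tokenize(sample)
    print(ngrams(toks, 2).most_common(3))
    for left, key, right in kwic(toks, "want", width=2):
        print(f"{left:>12} [{key}] {right}")
```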

For a detailed description, see my recent article in Corpus Linguistics and Linguistic Theory about version 1.

Version 2, as well as DART’s ‘big brother’, the Text Analysis & Research Tool (TART), are currently under development.

Dexter

Free suite of software tools that enable you to perform qualitative coding of corpus texts; Java-based, so it is cross-platform (Windows, Mac, Linux). Dexter is written specifically with three things in mind: spoken language data, researcher-collected data, and analysis of discourse-level phenomena. Dexter Coder displays your document in a window and allows you to define and add annotations to the document. You can perform complex searches of the text and codes, and certain quantitative analyses, with the Coder. All annotations are saved in a separate standoff XML file. The input data may be in various formats; it is converted to XML to enable stand-off markup, which in turn enables an unlimited number of analyses without affecting the source data.

Embedding Viewer

Viewer for language modeling based on SketchEngine (no registration).

EXMARaLDA
(Extensible Markup Language for Discourse Annotation)

A system of concepts, data formats and tools for the computer assisted transcription and annotation of spoken language; XML-based data formats; Java-based tools; interoperable with software like Praat, ELAN or the TASX Annotator; based on the annotation graph framework (Bird/Liberman 2001); supports several important transcription systems (HIAT, DIDA, GAT, CHAT) through a number of parameterised functions.

EXMARaLDA includes facilities for concordancing, namely SQUIRREL ("Search and Query Instrument for EXMARaLDA") and ZECKE ("Ziemlich einfaches Konkordanzwerkzeug für EXMARaLDA", i.e. "fairly simple concordancing tool for EXMARaLDA"), for searching transcribed and annotated phenomena in an EXMARaLDA corpus, but I’ve not yet used these myself, so can’t comment.

Grammar Explorer (grexplorer)

OR the legacy link here.

OR the KPML link here

A tool for learning about the coverage of large generation grammars; aimed currently at grammars written in the systemic-functional style. The operation of the tool is essentially as a coder: you, as the user, should select some sentence, or other grammatical unit, and attempt to ‘code’ that unit using the terms of a grammar. The tool leads you through the grammar presenting the options that are available (& you can ask for examples exhibiting the relevant grammatical choices); it also tells you the syntagmatic consequences of those choices (i.e., what structure is generated). If your coding is correct, then it should be possible to relate the structure you have generated to the original target unit. The Explorer differs from coding, or text annotation/markup, in that it provides access to the structural consequences of coding. This provides a natural check on the accuracy of any coding carried out. The Explorer also differs from an annotated corpus of examples, in that the examples it shows are all generated with the grammar that it contains.

Kura

a multilingual, multi-user, multi-project, open-source linguistic database program especially geared towards language description/ linguists working with fieldwork or manuscript data. Supports the entry, analysis and presentation of linguistic data, be it recordings or manuscripts. All linguistic data is stored in parsed form in a relational database, facilitating quick analysis, and the relations between data can also be stored. Kura consists of 3 main parts: the database with a set of relatively sophisticated components that represent linguistic notions, such as text or lexeme, the desktop client that can be used to enter data and analyses, and the special-purpose webserver, that can present the linguistic data to the outside world. Uses Unicode (currently Basic Multilingual Plane only). Platforms: Windows and Unix/X11 (Windows version might have some limitations and while still free software, some runtime components could lose that status in the future).

NooJ
(by Max Silberztein)

A free corpus-processing tool and linguistic engineering development platform/environment. Can be used as: corpus processor, information extraction system, terminological extractor, Machine Translation development tool, tool for teaching linguistics & computational linguistics.

Allows linguists to formalize several levels of linguistic phenomena: orthography and spelling, lexicons for simple words, multiword units and frozen expressions, inflectional, derivational and productive morphology, local, structural syntax and transformational syntax. As a corpus processing tool, NooJ allows users to apply sophisticated linguistic queries to large corpora in order to build indices and concordances, annotate texts automatically, perform statistical analyses, etc. Linguistic modules can already be freely downloaded for many languages.

Characteristics: (1) can process texts in over 100 file formats, including HTML, PDF, MS-OFFICE, all variants of UNICODE; (2) can import information from, and export its annotations back to, XML documents; (3) annotation system that allows all levels of grammars to be applied to texts without modifying them; this allows linguists to formalize various phenomena independently, and to apply the corresponding grammars in cascade. For instance, by combining inflection, derivation and syntactic data, NooJ can perform Harris-type transformations.

OneClick Terms

Simple term extractor interface giving easy access to terminology extraction functionality. Powered by SketchEngine technology (no registration, but some limitations).

RSTTool

or the related Systemic Coder

RSTTool is a graphical interface for marking up the structure of text. While primarily intended to be used for Rhetorical Structure (cf. Rhetorical Structure Theory (RST): Mann & Thompson 1988), the tool also allows the mark-up of constituency-style analysis, as in the Generic Structure Potential (GSP - cf. Hasan 1984; Halliday & Hasan 1985). Runs on Windows, Macintosh, UNIX and LINUX operating systems (requires the pre-installation of Tcl/Tk, a scripting language engine). The Tool consists of four interfaces: Text Segmentation: for marking the boundaries between text segments; Text Structuring: for marking the structural relations between these segments; Relation Editor: for maintaining the set of discourse relations and schemas; Statistics: for deriving simple descriptive statistics based on your analysis.

Systemic Coder is a tool that facilitates the linguistic coding of corpus material, through the prompting of the user for relevant categories. Linguistic features are organised in terms of a systemic network – an inheritance hierarchy – to reduce the amount of coding effort. You first define your feature hierarchy, and are then prompted to code the segments of the text according to the hierarchy. These codings can then be statistically analysed, either using the built-in comparative statistics programs, or by exporting the codings in a form readable by statistical packages.
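The way an inheritance hierarchy reduces coding effort can be sketched in a few lines of Python: the coder assigns only the most specific feature, and all more general features follow automatically. The tiny mood network, the example segments and the function names below are hypothetical and much simpler than a real systemic network (which may contain simultaneous systems); this is not Systemic Coder's own data model.

```python
from collections import Counter

# Hypothetical feature hierarchy: each feature maps to its parent (None = root).
PARENT = {
    "clause": None,
    "indicative": "clause",
    "imperative": "clause",
    "declarative": "indicative",
    "interrogative": "indicative",
}


def inherited_features(feature, parent=PARENT):
    """Return the feature plus all of its ancestors, most specific first."""
    if feature not in parent:
        raise KeyError(f"unknown feature: {feature}")
    chain = []
    while feature is not None:
        chain.append(feature)
        feature = parent[feature]
    return chain


def count_features(codings, parent=PARENT):
    """Count every feature, inherited ones included, over all coded segments."""
    counts = Counter()
    for _segment, feature in codings:
        counts.update(inherited_features(feature, parent))
    return counts


# Each segment is coded once, with its most specific feature only.
CODINGS = [
    ("Close the door.", "imperative"),
    ("Is it raining?", "interrogative"),
    ("It is raining.", "declarative"),
]

if __name__ == "__main__":
    for feature, n in count_features(CODINGS).most_common():
        print(feature, n)
```

Counts of this kind are the sort of figures that could then be passed on to a statistics package for comparison across texts.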

[My comment: Neither of these tools seems to create output or accept input in an exportable format such as XML. If you know otherwise, please let me know.]

Systemics
(Kevin Judd & Kay O’Halloran)

A tool designed to allow efficient and comprehensive discourse analysis of text from the perspective of Systemic Functional Linguistics (SFL); however, as the pre-programmed grammar in Systemics can be modified, this software can incorporate other theoretical perspectives.

SACODEYL Annotator

Free (GNU licence) tool for XML-annotating language corpora in a user-friendly way while complying with TEI guidelines. SACODEYL Annotator can: manage multiple corpora; manage the definition of the tags that can be annotated; extend the annotation tags; annotate at different levels (tree-based); work with oral and written texts; show or hide selected annotations.

SysAm
(Macquarie University)

Computational tools for managing linguistic systems, analysing texts, and extracting linguistic patterns from a large corpus of text

TATOE

Free concordancer and text-analysis/text-markup tool for Windows (TATOE = Text Analysis Tool with Object Encoding). [I can’t seem to get this program to work properly for me, but maybe other people will have more luck.]

Text Feature Analyser

A tool for investigating textual features that may help in identifying & measuring issues related to text complexity. The basic design and usage are described in my original article on the tool, which can be downloaded from my publications page.

Tgrep2 (for searching parsed corpora/treebanks)

A search engine for finding structures in a corpus of trees. Used for extracting data from the Penn Treebank corpora of parsed sentences. (Linux program + source code for other platforms)

TIGERSearch treebank query tool

A specialized search engine for syntactically annotated corpora (treebanks). Features: * linguistically motivated query language (similar to typed feature-based grammar formalisms); * sophisticated graphical user interface (TIGERGraphViewer) for browsing query results; * corpus samplers from PennTreebank, NEGRA, TIGER, DEREKO, Susanne, Christine, Penn-Helsinki Parsed Corpus of Middle English, VerbMobil; * graphical registry tool (TIGERRegistry) for easy corpus administration; * XML-import of corpora; * import filters; * XML- and SVG-animation-export of query results (sample XSLT stylesheets for the creation of other formats are included); * available for all major platforms which support Java 1.3: Microsoft Windows, Solaris, Linux, and Mac OS X.

UAM Corpus Tool

Annotation & corpus analysis tool using stand-off XML markup. Features: (1) Annotation of multiple texts using the same annotation schemes, of your design. (2) Annotation of each text at multiple levels (e.g., NP, Clause, Sentence, whole document). (3) Searching for instances across levels, e.g., finite-clause containing company-np, or future-clause in introduction. (4) Comparative statistics across subsets, e.g., contrasting conversational patterns used by male and female speakers. (5) All annotation is stored in stand-off XML files, meaning that your annotations can more easily be shared with other applications, and allowing for multiple overlapping analyses of the same text.
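A cross-level search such as ‘finite-clause containing company-np’ boils down to a containment test between segments on two layers. The Python sketch below shows that logic under simple assumptions: segments are character-offset ranges with feature sets, and the sample sentence, offsets and names are invented. It does not reflect UAM Corpus Tool's actual file format or query engine.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: int            # character offset of the first character
    end: int              # offset just past the last character
    features: frozenset   # features assigned to this segment


TEXT = "Acme hired staff because sales grew."

# Hypothetical annotation layers over TEXT.
CLAUSES = [
    Segment(0, 16, frozenset({"finite-clause"})),
    Segment(17, 35, frozenset({"finite-clause"})),
]
NPS = [
    Segment(0, 4, frozenset({"company-np"})),
    Segment(11, 16, frozenset({"common-np"})),
    Segment(25, 30, frozenset({"common-np"})),
]


def containing(outer_layer, outer_feature, inner_layer, inner_feature):
    """Return outer segments with outer_feature that enclose an inner match."""
    hits = []
    for outer in outer_layer:
        if outer_feature not in outer.features:
            continue
        if any(
            inner_feature in inner.features
            and outer.start <= inner.start
            and inner.end <= outer.end
            for inner in inner_layer
        ):
            hits.append(outer)
    return hits


if __name__ == "__main__":
    for seg in containing(CLAUSES, "finite-clause", NPS, "company-np"):
        print(TEXT[seg.start:seg.end])
```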

XTrans

A multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings


Tools & Resources for Transcribing, Annotating or Analysing texts (inc. speech or audio-visual)

AGTK (Annotation Graph Toolkit)

work pioneered by Steven Bird; 'annotation graphs' are a formal framework for representing linguistic annotations of time series data. Applications included in this toolkit are: MultiTrans: transcribing multi-party conversation; TableTrans: observational coding of audio; TreeTrans: syntactic annotation; InterTrans: interlinear text transcription.

ATLAS (Architecture and Tools for Linguistic Analysis Systems)

an architecture targeted at facilitating the development of linguistic applications. The principal goal of ATLAS is to provide an abstraction over the diversity of linguistic annotations. The abstraction, which expands on Bird and Liberman’s Annotation Graphs, is able to represent complex annotations on signals of arbitrary dimensionality.

CLaRK

an XML-based system for corpora development. It includes a Unicode XML Editor, XPath language for navigation in XML documents, XSLT engine for transformation of XML documents, Cascaded Regular Grammars, Constraints over XML documents, Tokenizers, Concordance tool, Extract, Remove and other tools. The system is implemented in Java.

ELAN (EUDICO Linguistic Annotator)

an annotation tool that allows you to create, edit, visualize and search annotations for video and audio data. It was developed at the Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands, with the aim to provide a sound technological basis for the annotation and exploitation of multimedia recordings. ELAN is specifically designed for the analysis of languages, sign languages, and gesture, but it can be used by everybody who works with media corpora, i.e., with video and/or audio data, for purposes of annotation, analysis and documentation.

Emu Speech Database System

A system for managing collections of speech data which supports hierarchical labelling of utterances. Emu is freely available and supports a range of file formats.

GATE (General Architecture for Text Engineering)

an architecture, framework and development environment for language engineering which can also be used to annotate texts. GATE is a domain-specific software architecture and development environment (SDK) that supports researchers in Natural Language Processing and Computational Linguistics and developers who are producing and delivering Language Engineering systems. It supports the full lifecycle of language processing components, from corpus collection and annotation through system evaluation.

Guidelines for ToBI Labelling

ToBI (Tones and Break Indices) is a system for transcribing the intonation patterns and other aspects of the prosody of English utterances

MATE Workbench

a Java program designed to aid in the display, editing and querying of annotated speech corpora. It can also be used for arbitrary sets of hyperlinked XML encoded files.

NITE XML Toolkit

aimed at software developers, to allow them to build the more specialized displays, interfaces, and analyses that are required by end users when working with highly structured or cross-annotated XML data and multimedia data.

PRAAT
(alternative URL here)

free, comprehensive speech analysis, synthesis, and manipulation package; includes general numerical and statistical stuff, is built on a general-purpose GUI (graphical user interface) shell for handling objects, and produces publication-quality graphics. Runs on virtually all platforms (Windows, Macintosh, Unix/Linux, etc.). Mirror sites here and here.

See also: SpeCT - The Speech Corpus Toolkit for Praat

SACODEYL Transcriptor

Transcription tool that can: * Manage multiple video formats: DIVX, XVID, AVI, MPEG, QuickTime, RM; * Manage multiple audio formats: MP3, WAV, ASF; * Use multi-language support (Unicode); * Import transcriptions from other formats, such as Transana format; * Support metadata information; * Support transcription of spoken language: cuts, comments, truncated words, foreign words, etc.; * Support timestamp linking between video/audio and text.

The output of SACODEYL Transcriptor is used by SACODEYL Annotator.

SignStream

SignStream is a database program for MacOS that facilitates the annotation and analysis of visual language data. It has been designed for study of signed languages and the gestural component of spoken languages, but may be of use for analysis of any video-based data. SignStream is not currently available for Windows or UNIX platforms, but version 3 is being ported to Java to address this issue.

SIL tools

Lots of software relevant to speech data (& field linguistics), including Speech Analyzer (recording & editing speech, pitch tracking and spectrograms).

SoundScriber (Eric Breck, University of Michigan)

free Windows program (associated with the MICASE corpus) that aids in the transcription of digitized sound files. Includes features specifically for transcription: keystrokes to control the program while working in another window (e.g. word processor, SGML editor, etc.); variable speed playback; and a feature called "walking". Walking plays a small stretch of the file several times, then advances to a new piece, overlapping slightly with the previous one (thus facilitating continuous transcription without having to manually pause or rewind). Opens any file Media Player can, including wave audio files (.WAV), Video for Windows files (.AVI), and MPEG Layer 3 (.MP3). Alternative download link is here.
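The "walking" behaviour described above can be expressed as a simple playback schedule. The Python sketch below only computes which stretches would be played and in what order; the chunk length, overlap and number of repetitions are assumed values chosen for illustration, not SoundScriber's actual settings, and no audio is played.

```python
def walking_schedule(duration, chunk=4.0, overlap=1.0, repeats=3):
    """Return the ordered list of (start, end) playback windows in seconds.

    Each window is played `repeats` times, then playback advances so that
    the next window overlaps the previous one by `overlap` seconds.
    """
    if chunk <= 0 or not 0 <= overlap < chunk or repeats < 1:
        raise ValueError("need chunk > 0, 0 <= overlap < chunk, repeats >= 1")
    step = chunk - overlap          # how far each new window advances
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        windows.extend([(start, end)] * repeats)
        if end >= duration:         # reached the end of the recording
            break
        start += step
    return windows


if __name__ == "__main__":
    for start, end in walking_schedule(10, chunk=4, overlap=1, repeats=2):
        print(f"play {start:.1f}s to {end:.1f}s")
```

The overlap means the end of each stretch is heard again at the start of the next, so the transcriber can pick up a word that was cut off at a window boundary.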

TalkBank software
(Links to various tools supporting different aspects of the process of transcription and analysis)

(i) Transcriber (alternative site here): a tool for assisting the segmenting, labeling and transcribing of speech signals (labeling speech turns, topic changes and acoustic conditions). Requires prior installation of Tcl/Tk
(ii) CLAN (suite of programs aimed at child language analysis)
(iii) AGTK: Annotation Graph Toolkit (toolkit designed to allow programmers to quickly create small applications that conform with the TalkBank Annotation Graph model)
(iv) XML-based Tools (e.g. xCode: a Unicode text editor; able to validate and filter XML through an XSLT sheet and display the editable result as a flat text)

TASX-Annotator
(Bielefeld)

free, cross-platform program (Java-based, released under GNU licence) for the annotation and transcription of video (multi-channel) and audio data. Video and audio playback can be controlled by a foot switch. Different data views are programmed (time-aligned partiture, word-aligned partiture, sequential text view). The system integrates an XSL-T processor (Saxon), making it easy to perform on the fly data transformations. TASX thus takes the function of an interlingua. The import of an XML-file is split into two steps: one simply has to define two XSL-T stylesheets. The first transforms the XML format into TASX, the second transforms TASX back into the XML format.

Transana

Open source, but not free (!) program (Windows & Macintosh) for the transcription and analysis of video data. It provides a way to view video, create a transcript, and link places in the transcript to frames in the video. It provides tools for identifying and organizing analytically interesting portions of videos, as well as for attaching keywords to those video clips. It also features database and file manipulation tools that facilitate the organization and storage of large collections of digitized video. Features: import and view MPEG-1 video and MP3 and WAV format audio files; automatically highlights the relevant portion of the transcript while the video plays; a multi-user version, Transana-MU, allows users to share their data and analyses with other research team members via a LAN.

Transcriber
(or LDC mirror here)

free program. A tool for assisting the manual annotation of speech signals. It provides a user-friendly graphical user interface for segmenting long duration speech recordings, transcribing them, and labeling speech turns, topic changes and acoustic conditions. It is more specifically designed for the annotation of broadcast news recordings, for creating corpora used in the development of automatic broadcast news transcription systems, but its features might be found useful in other areas of speech research.

UCSB Discourse Transcription Software

VoiceWalker (software for stepping through recordings, for easier transcription) and SoundWriter (VoiceWalker + facility for aligning transcripts with sound files via SMPTE time codes). Free downloads.

VOCALE

A tool for the automatic annotation of vocalic and consonantal intervals, based on the probabilistic measurement of relative entropy and a number of phonetic measurements. Vocale takes a wav file as input, then automatically calls up some Praat functions such as creating a spectrogram and gives a Praat label file as output. This can then be used for the calculation of the speech rhythm. The entire programme is open source and can be downloaded.

wavesurfer

Free, Open Source tool for sound visualization and manipulation. Runs on virtually all platforms (Windows, Macintosh, Unix/Linux, etc.).

Winpitch

Windows software. Speech analysis and annotation tool, with fundamental frequency and spectrographic display. Prosodic morphing capability through re-synthesis of natural speech.

* For extensive speech-technology-related links and technical stuff, visit the Speech at CMU Web Page