Taggers (& other tools)

Arabic language tools

A set of Arabic processing tools (for Linux) that use the Yamcha SVM tools to tokenize, POS-tag and base-phrase-chunk Arabic text can be found on Mona Talat Diab’s page here.

Xerox Arabic Morphological Analyser

Buckwalter Morphological Analyser

Sebawai and Al-Stem (for Arabic) – an Arabic Morphological Analyzer and light Arabic stemmer

AMALGAM Tagger by email
(for English)

Free e-mail tagging service offering a choice of several tagsets (e.g. Brown, LOB, LLC, SEC, POW, ICE). It emulates several taggers and their tagsets. The program is effectively a wrapper for Eric Brill’s rule-based tagger, retrained at Leeds with 8 alternative tagging schemes. The tagger works by reading in the lexicon, bigram lists and rules from external files.
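The lexicon-plus-rules design described above can be illustrated with a minimal Python sketch of rule-based (transformation-based) tagging. The lexicon, the two contextual rules and all function names are hypothetical and are not taken from AMALGAM or from Brill's distribution; the bigram lists, which help with unknown words, are left out.

```python
# Toy rule-based tagging: every word first gets its lexicon tag,
# then contextual rules rewrite tags in order. All data is illustrative.

LEXICON = {"the": "DT", "can": "MD", "rusted": "VBD", "to": "TO", "run": "NN"}

# Each rule: (from_tag, to_tag, required_previous_tag)
RULES = [
    ("MD", "NN", "DT"),  # a modal directly after a determiner becomes a noun
    ("NN", "VB", "TO"),  # a noun directly after "to" becomes a verb
]


def initial_tags(words, default="NN"):
    """Look each word up in the lexicon; unknown words get the default tag."""
    return [LEXICON.get(w.lower(), default) for w in words]


def apply_rules(tags, rules):
    """Apply each rule, in order, to every position it matches."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


def tag(words):
    return list(zip(words, apply_rules(initial_tags(words), RULES)))
```

Calling tag("the can rusted".split()) first assigns "can" its lexicon tag MD and then rewrites it to NN, because it follows a determiner. Swapping in a different lexicon and rule list changes the tagging scheme without touching the code.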

AUTASYS
(For English)

A menu-driven automatic tagging and lemmatising system that analyses English texts at word-class level with the Lancaster-Oslo-Bergen (LOB) tagset, the International Corpus of English (ICE) tagset, and the “skeleton” tagset (SKELETON), which is the set of base tags from ICE without features. The tagged text can be subsequently lemmatised (reduced to base forms).

Birmingham’s E-mail Tagging Service

Free e-mail tagging service for short texts. Send text as mail to: tagger@clg.bham.ac.uk. Tagset used (similar to the Brown/LOB/Penn set) is listed here.

Brill Tagger
(trainable for any language)

One of the earliest free taggers. Windows versions here or here. On-line/web implementation for German available from Zurich site here.

ChaSen
(For Japanese)

A free Japanese morphological analyser/POS-tagger from the Nara Institute of Science and Technology (NAIST).

CLAWS
(For English)

‘Constituent Likelihood Automatic Word-tagging System’, developed at UCREL, Lancaster University. Not free, but has a web front end (demo) that allows up to 100,000 words to be tagged for free.

DAT ('dialogue annotation tool' from the University of Rochester)

A free tool for discourse-level annotation in the DAMSL format (requires Perl version 5.002 or higher and the Perl/Tk package).

fnTBL
(trainable for any language; UNIX and Windows(Cygwin) platforms)

- free, public domain software designed for large, dynamic classification tasks, such as part-of-speech tagging, base noun phrase chunking or word sense disambiguation, but can be used to perform any classification task with symbolic features. fnTBL improves the running time dramatically compared with the original TBL algorithm proposed by Eric Brill, obtaining a speed-up of up to 2 orders of magnitude, while maintaining the same performance.

- basic NLP tasks for English (part-of-speech tagging, base noun phrase and text chunking) are already trained and are part of the distribution; others (e.g. Swedish part-of-speech) can be downloaded from the web site.

ICTCLAS (POS tagger) and ICTPROP (parser; for Chinese)

Chinese Lexical Analysis System – a Chinese word segmenter and POS tagger developed by the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. An open source version of ICTCLAS is called freeICTCLAS (source code and a Linux port are at http://www.nlp.org.cn).

ICTPROP is a probabilistic Chinese parser, trained using Penn Chinese Treebank (ver1.0); precision & recall both 77%.

JUMAN

(for Japanese)

User-Extensible Morphological Analyzer for Japanese.

LEMMA3
(for German)

Wordclass tagger and lemmatizer for unrestricted German texts. To obtain a free copy of the system, please send a request to Dr. Gerd Willée, IKP, University of Bonn, Germany (willee "AT" uni-bonn.de).

LT POS
(for English)

LT POS is a part-of-speech tagger which can handle plain ASCII text and SGML/XML marked-up text. LT POS incorporates a tokeniser which determines sentence and word boundaries. The tagger uses a Hidden Markov Model disambiguation strategy and achieves 95 to 97% accuracy. It also indicates noun groups and verb groups. A demo is available here.
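As an illustration of Hidden Markov Model disambiguation in general, the sketch below picks the most probable tag sequence with the Viterbi algorithm. It is not LT POS code: the three-tag tagset, every probability and the FLOOR value for unseen words are invented for the example.

```python
import math

# Invented toy model: start, transition and emission probabilities.
TAGS = ["DT", "NN", "VB"]
START = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
EMIT = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.4, "walks": 0.2},
    "VB": {"walks": 0.5, "dog": 0.05},
}
FLOOR = 1e-6  # assumed probability for a word never seen with a tag


def viterbi(words):
    """Return the most probable tag sequence for the words."""
    if not words:
        return []
    # Each column maps tag -> (log probability of best path, previous tag).
    columns = [{
        t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], FLOOR)), None)
        for t in TAGS
    }]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            prev, score = max(
                ((p, columns[-1][p][0] + math.log(TRANS[p][t])) for p in TAGS),
                key=lambda pair: pair[1],
            )
            column[t] = (score + math.log(EMIT[t].get(word, FLOOR)), prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    path = [max(TAGS, key=lambda t: columns[-1][t][0])]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))
```

In this toy model "walks" can be a noun or a verb; after "the dog" the verb reading wins because the NN to VB transition and the VB emission together outweigh the noun alternative.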

Machinese Phrase Tagger/Machinese Syntax (Connexor)
(for English, French, German, Italian, Spanish, Swedish, Finnish, soon also Dutch)

Commercial product based on FDG (functional dependency grammar). The Machinese Syntax parser enriches text (plain text, xml, sgml, html) with functional dependencies that show sentence-level relations and functions between words and linguistic structures. Machinese Phrase Tagger is for light morphosyntactic markup (base forms, morphology and phrasal tags).

NOTE: Company was formerly called "Conexor" (with one "n" instead of two) and the base product was initially called EngCG-2 Tagger. That evolved and became embedded in the English version of FDG Lite and the full FDG. (FDG Lite = EngCG-2 + shallow phrasal tags (starting with "&"); the full FDG also produced functional tags and functional dependencies between words.) EngCG-2 was an extended version of the original ENGCG tagger, which assigned morphological and part-of-speech tags to words in English text. It was based on the Constraint Grammar framework advocated by a team of computational linguists in Helsinki, Finland.

MBT
(trainable for any language)

A free(?) memory-based part-of-speech tagger-generator and tagger. Memory-based tagging is based on the idea that words occurring in similar contexts will have the same POS tag. The idea is implemented using the memory-based learning software package TiMBL, version 4.3.1. The software consists of two executables: Mbtg to generate a tagger, and Mbt to use a generated tagger on text data. The package contains the code (C++), the Reference Guide, and some demo data. MBT has been applied to Dutch, English, Spanish, Swedish, and German.
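The idea that similar contexts receive the same tag can be sketched as a nearest-neighbour lookup over stored training cases. The code below is an illustration only and does not call TiMBL, Mbtg or Mbt: the feature set (previous tag, word, next word), the unweighted overlap measure and all names are assumptions made for the example.

```python
from collections import Counter


def features(words, i, prev_tag):
    """Context of word i: previous tag, the word itself, the next word."""
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"
    return (prev_tag, words[i], next_word)


def build_memory(tagged_sentences):
    """Store every training token as a (features, tag) case."""
    memory = []
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        prev = "<s>"
        for i, (_, tag_) in enumerate(sentence):
            memory.append((features(words, i, prev), tag_))
            prev = tag_
    return memory


def classify(memory, feats, k=1):
    """Vote among the k stored cases sharing the most feature values."""
    ranked = sorted(
        memory,
        key=lambda case: -sum(a == b for a, b in zip(case[0], feats)),
    )
    votes = Counter(tag_ for _, tag_ in ranked[:k])
    return votes.most_common(1)[0][0]


def tag(memory, words):
    """Tag left to right, feeding each predicted tag into the next context."""
    tags, prev = [], "<s>"
    for i in range(len(words)):
        prev = classify(memory, features(words, i, prev))
        tags.append(prev)
    return tags
```

No model is fitted in advance: "generating a tagger" here amounts to storing cases, and all the work happens at lookup time, which is the defining trait of memory-based learning.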

MORPHY (for German)

Free tool for German tagging and morphological analysis (no longer supported).

Morfette

A tool for supervised learning of inflectional morphology. Given a corpus of sentences annotated with lemmas and morphological labels, and optionally a lexicon, Morfette learns how to morphologically analyse new sentences, assigning morphological tags and lemmas to words.

MTP (Münster Tagging Project) and Xlex/www (for any language)

Xlex is a suite of tools for linguistic data processing, with a web-based, graphical front-end, Xlex/www. Free licence for non-commercial purposes. Xlex/www includes: tokenizer, segmenter, POS-tagger, index tool, concordance tools (regexp) and collocation tools. The Xlex suite is easily portable to any platform with Perl and a web server. Any browser with frames, CSS, and JavaScript capability can be used as Xlex/www client. The tools are written in Perl (except the POS tagger, implemented in C++), are normally started from a command line interface, and are intended for use as filters in Unix-style piped commands. Currently trained for German and English.

MMAX/MMAX2 (or the SourceForge page here)
(Multi-Modal Annotation in XML)

Annotation tools that allow stand-off annotation, an arbitrary number of levels of annotation, etc.

MXPOST
(MaXimum Entropy POS-Tagger)

Downloadable Java Version (compatible with JDK1.3). Also: MXTERMINATOR (Sentence Boundary Detector).

nb
(Nota Bene)

An SGML-based discourse annotation tool written by Giovanni Flammia in Tcl7.0/Tk4.0 (runs under Windows with a Tcl/Tk interpreter).

Persian POS tagger

On-line tagger for Persian (input your own text) based on the Peykare corpus tagset.

PC-KIMMO (for any language)

Designed to generate (produce) and/or recognize (parse) words using a two-level model of word structure in which a word is represented as a correspondence between its lexical level form and its surface level form. PC-Kimmo includes descriptions for English, Finnish, Japanese, Hebrew, Kasem, Tagalog, and Turkish. Several related utilities:

- KGEN. A rule compiler for PC-Kimmo, written by Nathan Miles of Ohio State University.

- KTEXT. A text processor that uses the PC-KIMMO parser to produce a morphological parse of each word in the text.

- Englex. A 20,000 entry morphological parsing lexicon of English intended for use with PC-KIMMO and/or KTEXT.

Pizza Chef

The TEI Guidelines define several hundred elements and associated attributes, which can be combined to make many different DTDs, suitable for many different purposes, either simple or complex. With the aid of the Pizza Chef (free), you can build a DTD that contains just the elements you want, suitable for use with any XML processing system.

The Perl-script version by Sebastian Rahtz, maketeidtd, is available here

POSTAG (for Korean)

Morphological Analyzer/POS tagger for Korean, with generalized unknown morpheme handler.

Qtag
(trainable for any language)

A free, portable (can be used in any operating system), stochastic word-class/POS tagger implemented in Java. Qtag is language-independent, but there is currently only an English version available. To use Qtag with other languages, you will need to create your own resource file (instructions given).

Simple PoS Tagger

The Simple PoS Tagger (Ver. 1.0) is an interface to a slightly modified version of the Perl Lingua::EN::Tagger module that allows the user to add morpho-syntactic tags to a text automatically, and then post-edit the colour-coded output. Uses a slightly modified version of the Penn tagset.

Stanford POS tagger

The Java-based Stanford Log-linear Part-Of-Speech Tagger

TATOE

Free text-analysis/text-markup tool and concordancer for Windows (TATOE = Text Analysis Tool with Object Encoding)

TnT tagger
(by Thorsten Brants; TnT = Trigrams’n’Tags)

Statistical part-of-speech tagger that is optimized for training on a large variety of tagged corpora in different languages and virtually any tagset, and incorporates methods of smoothing and of handling unknown words. Free for non-commercial use.
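One common way to smooth a trigram tag model is linear interpolation of unigram, bigram and trigram estimates, sketched below. This is an illustration of the general technique, not TnT's code: the fixed weights (0.1, 0.3, 0.6) and all function names are assumptions, and unknown-word handling is omitted.

```python
from collections import Counter


def train(tag_sequences):
    """Count tag unigrams, bigrams, trigrams and their contexts."""
    uni, bi, tri = Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["<s>", "<s>"] + list(tags)
        for i in range(2, len(padded)):
            t1, t2, t3 = padded[i - 2], padded[i - 1], padded[i]
            uni[t3] += 1
            bi[(t2, t3)] += 1
            tri[(t1, t2, t3)] += 1
            ctx1[t2] += 1
            ctx2[(t1, t2)] += 1
    return uni, bi, tri, ctx1, ctx2


def interpolated(model, t1, t2, t3, lambdas=(0.1, 0.3, 0.6)):
    """P(t3 | t1, t2) as a weighted mix of the three relative frequencies."""
    uni, bi, tri, ctx1, ctx2 = model
    total = sum(uni.values())
    p_uni = uni[t3] / total if total else 0.0
    p_bi = bi[(t2, t3)] / ctx1[t2] if ctx1[t2] else 0.0
    p_tri = tri[(t1, t2, t3)] / ctx2[(t1, t2)] if ctx2[(t1, t2)] else 0.0
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```

Because the lower-order estimates always contribute, a tag trigram never seen in training still gets a non-zero probability as long as the tag itself occurred. In practice the weights would be estimated from held-out or training data rather than fixed by hand.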

TOSCA-ICLE Tagger
(for English)

Tagger & lemmatiser developed originally for the ICLE and ICE-GB projects. Has 17 major word classes, + features for subclasses and additional semantic, syntactic and morphological information (total number of different tags is 220).

[The older (and not as refined) TOSCA-LOB tagger is an MS-DOS program which produces output in the LOB (London/Oslo/Bergen) Corpus format. For more info, see here.]

Tree Tagger (Stuttgart)
(trainable for any language)

A language-independent tagger and lemmatiser developed at Stuttgart. Free for research, education and evaluation. Parameter files for tagging English, German, Italian and French are available.

For Windows, several graphical interfaces exist, such as Laurence Anthony’s TagAnt, or the one developed by Ciarán Ó Duibhín.

Xerox Tagger (trainable for any language)

Implemented in Common Lisp and tested on UNIX and Macintosh. Source code available from the ftp site.

Web-Based TAGGER

A Web interface to the TAGGER program. Enter some English text in the input area and click the "OK" button. Parsed text is returned either in XML format or in an easy-to-read marked form.

Wmatrix

A web-based environment which allows access to some of UCREL’s corpus annotation and retrieval tools. All processing is done on the remote web server, so users can gain access from any platform that provides a browser. Tools included in Wmatrix are CLAWS (part-of-speech tagger), USAS (semantic field tagger) and a lemmatiser. Wmatrix also produces frequency lists and statistical comparisons of those lists. Wmatrix/Matrix is a new kind of method and tool for advancing the statistical analysis of electronic corpora. By integrating part-of-speech tagging and lexical semantic tagging in a profiling tool, the Matrix technique extends the keywords procedure to produce key grammatical categories and key concepts. It has been shown to be applicable in the comparison of the UK 2001 general election manifestos of the Labour and Liberal Democratic parties, vocabulary studies in sociolinguistics, studies of language learners, information extraction and content analysis. Currently, it has been tested on restricted levels of annotation and only on English language data.
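The statistical comparison of two frequency lists can be sketched with the log-likelihood statistic, a measure widely used for keyword analysis. That this is the statistic computed is an assumption of the example, and the function names and sample counts are hypothetical. The counted items may be words, part-of-speech tags or semantic tags, which is how the same procedure yields key grammatical categories and key concepts.

```python
import math


def log_likelihood(freq1, freq2, size1, size2):
    """Log-likelihood of an item's frequency in two corpora of given sizes."""
    total = freq1 + freq2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / expected2)
    return 2.0 * ll


def key_items(counts1, counts2):
    """Rank all items by how strongly their frequencies differ."""
    size1, size2 = sum(counts1.values()), sum(counts2.values())
    items = set(counts1) | set(counts2)
    scored = [
        (log_likelihood(counts1.get(i, 0), counts2.get(i, 0), size1, size2), i)
        for i in items
    ]
    return [item for _, item in sorted(scored, reverse=True)]
```

An item with the same relative frequency in both corpora scores 0; the larger the score, the more characteristic the item is of one corpus relative to the other.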

Free XML Tools & Software

Lars Marius Garshol’s index of free XML tools