NLP/Computational Linguistics Resources (incl. parsers, SGML/XML stuff)

Most of these descriptions are taken from the respective web sites and do not represent my views. For an introduction to parsing methods and types of parser, click here.

ACL NLP-CL Universe (Association for Computational Linguistics)

bookmark site; pointers to more than 1,500 computational linguistics resources on the Web

Apple Pie Parser

probabilistic syntactic parser (for UNIX and Windows) developed by Satoshi Sekine at NYU.


a toolkit for statistical language modeling, text retrieval, classification and clustering -- a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs.


an XML-based system for corpus development. It includes a Unicode XML editor, the XPath language for navigation in XML documents, an XSLT engine for transformation of XML documents, cascaded regular grammars, constraints over XML documents, tokenizers, a concordance tool, Extract, Remove and other tools. The system is implemented in Java.

N-gram Statistics Package (NSP)

an easy-to-use suite of Perl tools for counting and analyzing word n-grams in text. It provides a number of standard tests of association that can be used to identify word n-grams in large corpora, and also allows users to easily implement other tests without knowing very much about Perl at all. Supports user-defined tokenization using regular expressions, stop lists, and an extensive collection of test/sample scripts.
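
As an illustration of the kind of computation NSP performs, here is a minimal sketch (not NSP itself, and using pointwise mutual information rather than NSP's full battery of tests) of counting word bigrams and scoring them for association:

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Score each adjacent word pair by pointwise mutual information:
    log2 of the observed bigram probability over the product of the
    unigram probabilities of its parts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = len(tokens) - 1
    n_tokens = len(tokens)
    return {
        (w1, w2): math.log2((c / n_bigrams) /
                            ((unigrams[w1] / n_tokens) * (unigrams[w2] / n_tokens)))
        for (w1, w2), c in bigrams.items()
    }

scores = bigram_pmi("the cat sat on the mat the cat ran".split())
```

High scores flag pairs that co-occur more often than chance would predict; NSP provides log-likelihood, Dice, chi-squared and other tests computed from the same underlying counts.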

Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit

a suite of UNIX software tools to facilitate the construction and testing of statistical language models. The tools process general textual data into:

  • word frequency lists and vocabularies
  • word bigram and trigram counts
  • vocabulary-specific word bigram and trigram counts
  • bigram- and trigram-related statistics
  • various backoff bigram and trigram language models
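
The backoff idea in the last item can be sketched as follows (a toy absolute-discounting bigram model, not the toolkit's actual estimator; the discount value and the lack of renormalisation over unseen words only are simplifications):

```python
from collections import Counter

def backoff_bigram(tokens, discount=0.5):
    """Return p(w2|w1): a discounted bigram estimate when the bigram was
    seen in training, otherwise back off to the unigram probability,
    scaled by the probability mass freed up by discounting."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def p(w1, w2):
        c12 = bigrams.get((w1, w2), 0)
        c1 = unigrams[w1]
        if c12 > 0:
            return (c12 - discount) / c1         # discounted ML estimate
        n_seen = sum(1 for b in bigrams if b[0] == w1)
        alpha = discount * n_seen / c1           # freed probability mass
        return alpha * unigrams[w2] / total      # back off to unigrams

    return p

p = backoff_bigram("a b a b a c".split())
```

Backing off means an unseen pair like ("b", "c") still receives non-zero probability instead of breaking the model.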

Dan Melamed’s NLP tools

An assortment of tools, including XTAG morphological-analyser post-processors for English stemming, some 170 general text-processing tools (mostly in Perl 5) and 75 text-statistics tools (mostly in Perl 5)

Downloads at CSTR

Centre for Speech Technology Research, University of Edinburgh


mainly XML/SGML and computational resources, including LT POS, a Part-of-Speech (POS) tagger. See also the Edinburgh XML Workshop

EngCG Parser

Constraint Grammar Parser of English, performs morphosyntactic analysis (tagging) of running English text. The parser employs a morphological ("part-of-speech") disambiguator that makes 93-97% of all running-text words in Written Standard English unambiguous while 99.7% of all words retain the correct analysis. The corresponding figures for the shallow syntactic parser are 75-85% and 97-98%. Available from Lingsoft, Inc.

Euralex 2000 Tutorial

Lots of useful links.

(TALP Research Center, Universitat Politècnica de Catalunya).

an open-source C++ library providing language analysis services such as tokenizing, sentence splitting, morphological analysis, NE detection, date/number/currency recognition and PoS tagging. Future versions will improve performance in existing functionalities and incorporate new features, such as chunking, NE classification and document classification.

Functional Grammar Workbench (by Juan C. Ruiz-Antón)

language generation/grammar-writing software which allows the user to write grammars for different languages, using rules of the type devised in Simon Dik’s Functional Grammar (expression rules, morphological templates and morphological rules), and to test these grammars on predicate-argument formulas introduced by the user. The abstract semantic structure of a sentence is represented by a logico-semantic predication, and the surface form for a particular language is then derived from this representation by applying a set of rules (expression rules, placement rules and morphological rules) to the underlying predication. [Windows program] Not, strictly speaking, directly relevant to corpus-based linguistics, but something nice to have anyway.

Infomap NLP Semantic Learning Software

uses a variant of Latent Semantic Analysis (LSA) on free-text corpora to learn vectors representing the meanings of words in a vector space known as WordSpace. It indexes the documents in the corpora it processes, and can perform information retrieval and word-word semantic similarity computations using the resulting model. Performs two basic functions: building models by learning them from a free-text corpus using learning parameters specified by the user, and searching an existing model to find the words or documents that best match a query according to that model. After a model has been built, it can also be installed to make searching it more convenient, and to allow other users to search it conveniently.
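
The WordSpace idea can be illustrated with a toy sketch: build co-occurrence vectors for each word and compare words by cosine similarity. (Infomap additionally applies SVD to reduce the dimensionality of the co-occurrence matrix; that step is omitted here, and the window size is an arbitrary choice.)

```python
import math
from collections import Counter, defaultdict

def word_vectors(tokens, window=2):
    """Each word's vector = counts of the words seen within `window`
    positions of it anywhere in the corpus."""
    vecs = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vecs[w][tokens[j]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v[w] for w, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

vecs = word_vectors("the cat purrs and the cat sleeps while the dog barks".split())
```

Words occurring in similar contexts end up with similar vectors, which is what makes word-word similarity queries possible.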


a Windows program which can be used to explore the unsupervised learning of natural language, with a primary focus on morphology. Given an input corpus, it figures out where the morpheme breaks are in the words (which parts are stems, which are suffixes, and so forth) based on no knowledge whatsoever of the language from which the words are drawn.
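
A crude flavour of such knowledge-free morphology learning (a naive heuristic, nothing like Linguistica's actual MDL-based algorithm; the minimum stem length is an arbitrary parameter):

```python
from collections import defaultdict

def candidate_suffixes(words, min_stem=3):
    """Propose stem/suffix splits: a split counts only if the stem is
    also a prefix of some other word in the list."""
    wordset = set(words)
    stems = defaultdict(set)
    for w in wordset:
        for k in range(min_stem, len(w)):
            stem, suffix = w[:k], w[k:]
            if any(v != w and v.startswith(stem) for v in wordset):
                stems[stem].add(suffix)
    return stems

splits = candidate_suffixes(["walks", "walked", "walking", "talks", "talked"])
```

Even this naive version recovers plausible paradigms ("walk" + s/ed/ing) from raw word lists alone, with no language-specific knowledge built in.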

Link Grammar Parser

a free syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. Works on a variety of platforms, including Windows.


a syntactic chunk parser from the Language Technology Group at Edinburgh

GATE (General Architecture for Text Engineering)

an architecture, framework and development environment for language engineering, which can also be used to annotate texts. GATE is a domain-specific software architecture and development environment (SDK) that supports researchers in Natural Language Processing and Computational Linguistics and developers who are producing and delivering Language Engineering systems. It supports the full lifecycle of language processing components, from corpus collection and annotation through system evaluation.


a free broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. On a Pentium II 300 with 128MB memory, it parses about 300 words per second.

Morphix-NLP (CD-ROM)

a Live CD Linux distribution with a rich collection of Natural Language Processing (NLP) applications, all on a single CD. Includes: Tokenizers (Qtoken, MXTERMINATOR, Chinese word segmenters); POS Taggers (Brill’s TBL Tagger, MXPOST, fnTBL tagger, QTag, Tree-Tagger, Memory-based Tagger); Parsers (Collins' Parser, Link Parser, LoPar); Language Modeling Tools (CMU SLM toolkit, Trigger Toolkit, Ngram Statistics Package); Speech Software (Festival Speech Synthesis); System Development Tools (SVM-light, Maxent, SNoW, TiMBL, fnTBL); Other software (WordNet Browser 2.0, the AntConc word concordance program, unaccent, and others).

Multext tools

Multext is developing a series of tools for accessing and manipulating corpora, including corpora encoded in SGML, and for accomplishing a series of corpus annotation tasks, including token and sentence boundary recognition, morphosyntactic tagging, parallel text alignment, and prosody markup. Annotation results may also be generated in SGML format. Upon completion, all tools will be publicly available for non-commercial, non-military use.

Natural Language Software Registry

gives a concise summary of the capabilities and sources of a large amount of natural language processing (NLP) software available to the NLP community.


a fast and accurate system for extracting noun phrases from English texts, e.g. for the purposes of information retrieval, translation unit discovery and corpus studies

Paai’s Text Utilities

a collection of programs and Unix scripts for doing things with text files (lists, bigrams, various statistical measures to do with information retrieval).


free Java-based tokeniser ("a piece of software that splits a text into its component elements (tokens). These are typically individual words, but also punctuation marks and other symbols which are not normally considered to be words.... This is usually done by inserting separators, either blank spaces or linebreaks, so that subsequent programs (like a parts-of-speech tagger) can easily read in the tokens and process them further")
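
The behaviour described can be sketched in a few lines (a simplistic regular-expression tokeniser, not the tool itself; real tokenisers handle abbreviations, clitics and multiword units far more carefully):

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")   # a word, or any single non-space symbol

def tokenise(text):
    """Return the tokens of `text`: runs of word characters, plus each
    punctuation mark as a token of its own."""
    return TOKEN.findall(text)
```

Joining the result with blank spaces, e.g. `" ".join(tokenise("Hello, world!"))`, inserts exactly the separators the description mentions, so that a downstream tagger sees one token per whitespace-delimited field.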

RASP (Robust Accurate Statistical Parsing)

Part-of-speech tagging and parsing; XML input and output; free for non-commercial use.


converts files between character sets and usages. It recognises or produces more than 300 different character sets and transliterates files between almost any pair. When an exact transliteration is not possible, it gets rid of offending characters or falls back on approximations.
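
The fallback behaviour can be imitated in a few lines (a toy sketch, not recode itself; the approximation table here covers only a few hand-picked characters):

```python
def to_ascii(text):
    """Transliterate a few known characters to ASCII approximations,
    then silently drop anything still unrepresentable."""
    approx = str.maketrans({"é": "e", "è": "e", "ü": "u", "ß": "ss", "ç": "c"})
    return text.translate(approx).encode("ascii", "ignore").decode("ascii")
```

Characters in the table are approximated ("café" becomes "cafe"); characters outside it are simply removed, which is recode's get-rid-of-offending-characters behaviour.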


Full query and programming language for SGML documents. Command-line tools, no GUI available.
Source code available (tested on Sparc/Solaris and i386/Linux).


sgrep (structured grep) is a tool for searching and indexing text, SGML, XML and HTML files and for filtering text streams using structural criteria

Shalmaneser (SHALlow seMANtic parSER)

a system for automatic sense assignment and semantic role labeling; comes with pre-trained FrameNet classifiers for English and German. It performs word sense disambiguation for predicates plus semantic role labeling, takes plain text as input, and has syntactic processing integrated. The supplied classifiers are trained on FrameNet data for English and German, but the system is applicable to other frameworks as well, and its output can be viewed graphically in the SALTO viewer. Realized as a toolchain of independent modules communicating through a common XML format, it is extensible by further modules, with interfaces for adding or exchanging parsers, learners and features.

SHARES (System of Hypermatrix Analysis, Retrieval, Evaluation and Summarisation)

an intertextual mechanism for the identification and ranking of documents in terms of their relatedness to one or more exemplar texts. The SHARES approach is novel in taking the degree of Lexical Cohesion (Hoey, 1991) between texts as the primary criterion for document similarity. A hypermatrix structure has been created, which identifies links between repeated words, and bonds between two closely linked sentences, in two texts. According to our hypothesis, links and bonds will be strong between texts which are similar in content, and weak or non-existent between dissimilar texts.
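
The link/bond machinery can be sketched directly from that description (a toy version; the bond threshold of three shared words is a hypothetical parameter, not taken from SHARES):

```python
def links(sent_a, sent_b):
    """Links = distinct words repeated across the two sentences."""
    return len(set(sent_a) & set(sent_b))

def bonds(text_a, text_b, threshold=3):
    """Bonds = pairs of sentences (one from each text) joined by at
    least `threshold` links."""
    return [(i, j)
            for i, sa in enumerate(text_a)
            for j, sb in enumerate(text_b)
            if links(sa, sb) >= threshold]
```

Under the SHARES hypothesis, texts on similar topics produce many bonds, while dissimilar texts produce few or none.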

SIGLEX (Special Interest Group on the Lexicon)

an umbrella for a variety of research interests ranging from lexicography and the use of online dictionaries to computational lexical semantics. Part of ACL. Also provides links to lexical resources.

SPARSE (Student PARSing Environment) by Michael Covington

intended audience is syntax or NLP students unfamiliar with Prolog (the language in which SPARSE II is written)

Speech at CMU Web Page

extensive speech-technology-related links and technical material

SRI Language Modeling Toolkit

a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation

Statistical natural language processing and corpus-based computational linguistics

Chris Manning’s annotated list of resources; good collection of bookmarks for tools, corpora, etc.

Software for Systemic-Functional Linguistics

set of tools, including sentence generators, based on Systemic-Functional Grammar.

TASX-environment (Time Aligned Signal data eXchange)

a set of tools forming an XML-based environment which enables scientists to set up multimodal corpora. The technical basis of TASX is an XML-based annotation format. TASX is built on the idea that functionality already available in other speech-processing software need not be re-implemented: established speech software such as Praat or ESPS/waves+ does not need to be duplicated. The TASX-environment therefore focuses on the development of transcoding filters from and into various formats, including Praat/freq, Praat/label, ESPS/waves+, ESPS/F0-analysis, Transcriber, annotation graphs stored in XML, SyncWriter and basic text formats. In addition, filters for data import and export of the Exmaralda system are available. Most of these components are implemented in Java, transformations are defined in XSL-T, and a few additional tools are written in Perl (mainly to transform non-XML data).

Ted Pedersen’s software page

includes N-gram Statistics Package (NSP), perl scripts for identifying and statistically testing/ranking n-grams (recurring phrases/collocations) in texts. Plus Senseval and WordNet-related packages.

TextTiling by Marti Hearst
(Java implementation by Freddy Choi is here; Perl version by David James is here)

"TextTiling is a technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics; a method for partitioning full-length text documents into coherent multi-paragraph units that correspond to a sequence of subtopical passages. The algorithm assumes that a set of words is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well. The approach uses quantitative lexical analyses to determine the segmentation of the documents. The tiles have been found to correspond well to human judgements of the major subtopic boundaries of science magazine articles."

TigerSearch (Treebank search tool)

specialized search engine for syntactically annotated corpora (treebanks), developed for the Tiger Project (German treebank), but in theory can be used on other treebanks. Query language very similar to that for the IMS CorpusWorkBench/Xkwic/CQP. (Windows, Linux, Solaris and Mac OS X)

Unicode web site

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unicode enables a single software product or a single web site to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.
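
The "unique number for every character" idea is easy to demonstrate (a minimal Python illustration):

```python
# One code point per character, regardless of how it is later encoded.
euro = "€"
code_point = ord(euro)               # the Unicode code point, U+20AC
as_utf8 = euro.encode("utf-8")       # three bytes in UTF-8
as_utf16 = euro.encode("utf-16-be")  # two bytes in UTF-16
```

The code point never changes; only its byte representation varies with the chosen encoding form, which is what makes lossless transport between systems possible.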


a mobile translation system for the translation of spontaneous speech (German, English, and Japanese)

XML Corpus Encoding Standard (XCES)

schemas for XML-encoded corpora, covering annotations, aligned data, etc.


tool for XML application developers, schema designers, and XSL style sheet creators; XML Schema driven document and content editing for both developers and end-users. See also XCES (XML Corpus Encoding Standard) for XML-encoded corpora (above).

* Want free XML tools (editors, parsers, browsers, etc.)? Try Free XML tools, maintained by Lars Marius Garshol