Converters & Code Strippers

Format conversion tools

Replace Text (formerly called BK ReplaceEm)

A free text search-and-replace program that operates in batch mode across multiple files at once. It can perform multiple search-and-replace operations per file, supports regular expressions, creates a log file, and lets you specify the output location. Note: Replace Text is no longer supported and has known problems with some Windows 7 installations.
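
For readers who cannot run Replace Text, the same kind of batch job can be approximated in a few lines of Python. The following is only a minimal sketch and not how Replace Text itself works: the function name batch_replace, its parameters, and the tab-separated log format are assumptions made for illustration. It applies an ordered list of regular-expression rules to each file, writes the results to a separate output folder, and logs how many substitutions each rule made.

```python
import re
from pathlib import Path


def batch_replace(paths, rules, out_dir, log_path):
    """Apply each (pattern, replacement) rule, in order, to every file.

    Hypothetical helper for illustration. Originals are left untouched:
    results are written to out_dir under the same file name, and the log
    gets one line per file with the substitution count for each rule.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    compiled = [(re.compile(pattern), repl) for pattern, repl in rules]
    log_lines = []
    for path in map(Path, paths):
        text = path.read_text(encoding="utf-8")
        counts = []
        for regex, repl in compiled:
            # subn returns the new text and the number of replacements made
            text, n = regex.subn(repl, text)
            counts.append(n)
        (out_dir / path.name).write_text(text, encoding="utf-8")
        log_lines.append(path.name + "\t" + "\t".join(map(str, counts)))
    Path(log_path).write_text("\n".join(log_lines) + "\n", encoding="utf-8")
    return log_lines
```

A call such as batch_replace(["a.txt", "b.txt"], [(r"colour", "color"), (r"\s+", " ")], "cleaned", "replace.log") would normalise spelling and whitespace in both files. The sketch assumes UTF-8 input and that no two input files share a name, since outputs are keyed by file name only.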

HTML TIDY

Dave Raggett’s free tool for automatically fixing HTML mistakes and tidying up sloppy editing into nicely laid out markup (performs wonders on HTML saved from Microsoft Word). It also converts to XML and XHTML, and can be used to validate, correct, and pretty-print XML files.

OpenJade/OpenSP

The osx program, part of the OpenSP package (a successor to James Clark’s sp package), can automatically convert SGML files to corresponding XML files. OpenSP is maintained along with OpenJade.


HTML code strippers

These tools remove HTML tags from a saved web page so that the remaining text can be fed into concordancers.

Web2Text

HTML to ASCII text converter. "Unlike most others, however, this one not only has an easy to use graphical interface but it actually produces a nicely laid out text version, and keeps URLs visible. A minimum of post-conversion editing required."

HTMASC

(Shareware)

NOTETAB LIGHT

(Freeware. A 'Pro' version is also available for purchase.)

StripTags

A basic SGML/HTML tag stripper for Windows by William Fletcher. It removes everything between pairs of < >, so it can fail in those rare cases in which a > is embedded within a comment or an attribute. It also does not translate HTML entities (e.g. "&eacute;" --> é).
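
The two limitations just mentioned (a > inside a comment or attribute, and untranslated entities) can be avoided by parsing the markup rather than deleting whatever lies between angle brackets. Below is a minimal sketch using Python's standard html.parser module. It is not the method used by StripTags or any other tool listed here, and the names TagStripper and strip_tags are made up for illustration. Skipping the contents of script and style elements is an added assumption, since that material is rarely wanted in a corpus.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text content; tags and comments are dropped by the parser."""

    SKIPPED = {"script", "style"}  # assumption: their contents are not corpus text

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute; into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(html):
    """Return the text of an HTML string with markup removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

For example, strip_tags('<p title="a > b">caf&eacute;</p><!-- x > y -->') yields the single word café, because the quoted attribute and the comment are parsed as units and the entity is decoded. Whitespace and line breaks are passed through unchanged, so some post-processing may still be needed before concordancing.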

Web Snaggers/Crawlers for corpus-building

These tools can be used for grabbing web pages/entire sites for offline reading/processing.

Corpus Builder (Carnegie Mellon) A system for automatically constructing corpora for a minority language from the web; requires Perl 5.0 or greater and Lynx, running on Unix.
An Crúbadán (Kevin P. Scannell) Similar to Corpus Builder; aims at automatic development of large text corpora for minority languages.
ICEweb A small & simple utility for compiling, downloading & analysing web corpora. The name was chosen because the original intention was to create corpora that are similar in nature to the International Corpus of English (ICE) data.
WGET for Windows (free) Powerful but command-line-based tool for retrieving or mirroring whole websites.
HTTrack Free GUI-based tool for retrieving or mirroring whole websites.
WebWhacker (educational, but not free, version) GUI-based tool for retrieving whole websites.
Webspiders – Tennyson Maxwell Information Systems, Inc. Commercial product.