Format conversion tools
| Replace Text | A free text search-and-replace program that operates in batch mode across multiple files at once. It can perform multiple search-and-replace operations per file, supports regular expressions, creates a log file, and lets you specify the output location. Note: Replace Text is no longer supported and has known problems with some Windows 7 installations. |
| | Dave Raggett’s free tool for fixing HTML mistakes automatically and tidying up sloppy editing into nicely laid out markup (performs wonders on HTML saved from Microsoft Word). It also outputs/converts to XML and XHTML, and can be used to validate, correct, and pretty-print XML files. |
| | The osx program, part of the OpenSP package (a successor to James Clark’s sp package), can automatically convert SGML files to corresponding XML files. OpenSP is maintained along with OpenJade. |
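
The batch search-and-replace workflow described above (several regular-expression rules per file, a log file, a separate output location) can be approximated with a short script. This is only a minimal sketch, not how Replace Text itself works; the function name `batch_replace`, its parameters, and the tab-separated log format are assumptions made for illustration.

```python
import re
from pathlib import Path


def batch_replace(paths, rules, out_dir, log_path):
    """Apply each (pattern, replacement) rule, in order, to every file.

    Results are written to out_dir so the originals stay untouched, and
    the number of substitutions per file and rule is written to a log.
    All names here are hypothetical, chosen for this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    compiled = [(re.compile(pattern), repl) for pattern, repl in rules]
    log_lines = []
    for path in map(Path, paths):
        text = path.read_text(encoding="utf-8")
        for regex, repl in compiled:
            # subn returns the new text and the number of replacements made
            text, count = regex.subn(repl, text)
            log_lines.append(f"{path.name}\t{regex.pattern}\t{count}")
        (out_dir / path.name).write_text(text, encoding="utf-8")
    Path(log_path).write_text("\n".join(log_lines) + "\n", encoding="utf-8")
    return log_lines
```

A call such as `batch_replace(Path("raw").glob("*.txt"), [(r"\s+", " ")], "clean", "replace.log")` would collapse runs of whitespace in every text file under a (hypothetical) `raw` folder and write the results to `clean`.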
HTML code strippers
These tools can be used for removing HTML tags from a saved web page, to feed into concordancers.
| | HTML to ASCII text converter. "Unlike most others, however, this one not only has an easy to use graphical interface but it actually produces a nicely laid out text version, and keeps URLs visible. A minimum of post-conversion editing required." |
| | (Shareware) |
| | (Freeware. A 'Pro' version is also available for purchase.) |
| | A basic SGML/HTML tag stripper for Windows by William Fletcher. It removes everything between pairs of < >, so it can fail in those rare cases in which a > is embedded within a comment or an attribute. It also does not translate HTML entities (e.g. "&eacute;" --> é). |
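
The two limitations mentioned for the simple tag stripper (a > inside a comment or attribute, and untranslated entities) can be avoided by using a real HTML parser instead of deleting everything between < and >. The following is a minimal sketch built on Python's standard `html.parser` module; the class name `TagStripper` and the helper `strip_tags` are made up for this example, and skipping `script` and `style` content is an added assumption about what a concordancer would want.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content, dropping tags, comments, scripts and styles."""

    SKIP = {"script", "style"}  # assumption: their content is not corpus text

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute; into é
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; they go to handle_comment,
        # which does nothing by default.
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html_text):
    """Return the plain text of an HTML string."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Because the parser tracks quoting and comment boundaries, input such as `<a title="x > y">link</a>` yields just `link`, where a naive delete-between-brackets approach would leave ` y">link` behind.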
Web Snaggers/Crawlers for corpus-building
These tools can be used for grabbing web pages/entire sites for offline reading/processing.
| Corpus Builder (Carnegie Mellon) | A system for automatically constructing corpora for a minority language from the web; requires Perl 5.0 or greater and Lynx, running on Unix. |
| An Crúbadán (Kevin P. Scannell) | Similar to Corpus Builder; aims at automatic development of large text corpora for minority languages. |
| ICEweb | A small & simple utility for compiling, downloading & analysing web corpora. The name was chosen because the original intention was to create corpora that are similar in nature to the International Corpus of English (ICE) data. |
| WGET for Windows (free) | Powerful but command-line-based tool for retrieving or mirroring whole websites. |
| HTTrack | Free GUI-based tool for retrieving or mirroring whole websites. |
| WebWhacker (educational, but not free, version) | GUI-based tool for retrieving whole websites. |
| Webspiders – Tennyson Maxwell Information Systems, Inc. | Commercial product. |
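
For readers curious about what such site-mirroring tools do internally, the core step is to extract the links from each downloaded page and keep only those that stay on the same site. The sketch below illustrates that step alone, under stated assumptions: it is not taken from any of the tools listed, the names `LinkCollector` and `same_site_links` are hypothetical, and it leaves out downloading, politeness delays, and robots.txt handling, which a real crawler needs.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Gather the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url, html_text):
    """Return absolute, de-duplicated links that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html_text)
    host = urlparse(page_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment part
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            if absolute not in seen:
                seen.add(absolute)
                links.append(absolute)
    return links
```

A crawler would call this on each fetched page and add the returned URLs to its queue of pages still to download.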