Web as Corpus Resources

This page introduces various resources that let you query or concordance ‘live’ pages on the Web rather than your own local texts. Warning – quality control issues:

  1. not everything on the Web is the kind of language you will want to learn/emulate (many native speakers of English write (& type) English rather badly);
  2. non-native speakers of English put up web pages too (though their pages may be better than those of some native speakers ;-));
  3. different varieties of language suit different genres & purposes, and most search engines are not genre-aware;
  4. search engines such as Google give different results on different days, and have gaps, omissions & inclusions that are hard to explain (because the underlying technology is copyrighted & proprietary).

Therefore, use the tools listed below with caution. The best point about searching the Web? It’s one of the few places where many recently coined words, jargon & slang can be found in print. Nowadays, many innovations in the language, particularly technical or computing-related terms, appear on-line first, or only on-line.

KWiCFinder & WebAsCorpus.org Web Concordancer
(by Bill Fletcher)

KWiCFinder (Key Word in Context Finder) is a free stand-alone Web search concordancer optimized for multilingual searches. It builds on the Yahoo! search engine's support for complex Boolean searches & displays the search words in their textual contexts.

In contrast, Web Concordancer complements Google.com's popular search engine to simplify & accelerate the task of online research. Both programs automate the process of evaluating documents matching your search terms. Each has strengths & weaknesses which reflect characteristics of the search engines they rely on & the reporting technology they implement.

WebCONC
(by Matthias Hüning)

A tool for generating KWIC concordances from webpages (KWIC = Keyword in Context). There are two options for defining your corpus: let Google search for the relevant webpages for you, or specify a set of URLs yourself.

WebCorp

Concordances the Web. You enter a word or phrase, choose options from the menus provided & then press the 'Submit' button. WebCorp works 'on top of' the search engine of your choice, taking the list of URLs returned by that search engine & extracting concordance lines from each of those pages. All of the concordance lines are presented on a single results page, with links to the sites from which they came. It also produces a frequency listing of the words on a web page (from your chosen URL).
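To make the idea of a frequency listing concrete, here is a minimal Python sketch. It is not WebCorp's actual code: the names TextExtractor and word_frequencies are made up for illustration, the tokenizer only recognizes plain a-z words, and the page is assumed to be HTML already held in a string (fetching a live URL is left out).

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_frequencies(html):
    """Return (word, count) pairs for the page text, most frequent first."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Assumption: a "word" is a run of a-z letters, optionally with one apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    return Counter(words).most_common()


sample_html = (
    "<html><body><p>The cat saw the dog.</p>"
    "<script>var the = 1;</script>"
    "<p>The dog ran.</p></body></html>"
)

for word, count in word_frequencies(sample_html):
    print(f"{count:3d}  {word}")
```

A real tool would also need to deal with character encodings, non-English alphabets & boilerplate such as navigation menus, none of which this sketch attempts.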

The Linguist’s Search Engine

Can be used to perform syntactic searches (done graphically via parse trees) on Internet data. Currently available are a three-million-sentence corpus of sentences from the Internet Archive as well as facilities to build & search corpora based around search results from AltaVista queries.

Spaceless.com’s Web Concordancer

Takes the text of a web page you specify & creates a list of sentences that contain the search term. Selecting various options can also produce a concordance of all the words that appear on the page in either alphabetical or frequency order.
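The sentence-listing step can be approximated in a few lines of Python. This is only a sketch under simple assumptions, not the site's own method: the function name sentences_containing is invented here, sentences are split naively after '.', '!' or '?', and the search term is matched as a whole word regardless of case.

```python
import re


def sentences_containing(text, term):
    """Return the sentences of `text` that contain `term` as a whole word."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


page_text = (
    "Corpora are useful. A corpus is a body of text! "
    "Is the Web a corpus? Nobody knows."
)

for sentence in sentences_containing(page_text, "corpus"):
    print(sentence)
```

The naive split will stumble on abbreviations such as 'e.g.' or 'Dr.', so treat the output as a rough first pass.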

GlossaNet

Retrieves words or sequences of words from a pre-selected pool of daily newspapers (French, English, Spanish, Italian, Portuguese). If any match occurs, a concordance is sent to the user by email, in text or HTML format: a list of the retrieved occurrences presented in their context (by default, 40 characters to the left & 40 characters to the right). You can set up GlossaNet so that concordances are sent to you on a weekly basis.
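For readers unfamiliar with this layout, the following Python sketch shows what a concordance with a fixed character window looks like. It is an illustration only, not GlossaNet's implementation: the function name kwic and its width parameter are assumptions, with the default of 40 taken from the description above.

```python
import re


def kwic(text, term, width=40):
    """Return (left, keyword, right) triples with up to `width` characters
    of context on each side of every whole-word match of `term`."""
    # Collapse runs of whitespace so line breaks do not distort the windows.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits


sample = "The corpus is large. Another Corpus is small."

for left, keyword, right in kwic(sample, "corpus", width=10):
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>10}{keyword}{right}")
```

Cutting the context by characters rather than by words means the window can start or end mid-word, which is normal for this style of concordance.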

HighBeam Library Research

Search an archive of more than 35 million documents from over 3,000 sources – a vast collection of articles from leading publications, updated daily & going back as far as 20 years. Searches can be restricted to: (1) Documents (from Newspapers, Magazines, Journals, Transcripts & Books), (2) Images & Maps, (3) Reference books (Encyclopedias, Dictionaries & Almanacs).

Grammar Safari

Tips on using the web as a corpus for lexical/grammatical (or lexicogrammatical) searches

World-Wide Web English Corpus (Leeds)

200,000-word web-text samples of National Englishes, compiled from English-language websites in each WWW national domain


Did you find this web site/page useful? Do let me know if you want to encourage me to keep updating the site. If you have a new corpus or resource (or something I’ve missed) for me to link to, please drop me a line.