Regular Expressions

Below is a short test passage that is used to display the results of all the basic exercises that follow. The exercise sections beneath it are scrollable, so you should always be able to see the results of any exercise immediately. If nothing at all gets highlighted in the search box once you have specified what you think is a valid pattern, then there’s probably a syntax error in your pattern.

This is a short test paragraph. It will allow us to explore and test different regular expression features and concepts, such as character classes, quantification, and grouping/alternation, by displaying them in a separate colour. For good measure, and to be ‘well-rounded’, this also contains some numbers, punctuation, and special characters here (1,2, 3, 10, ! ;), as well as some unusual words like ‘theme’, ‘rheme’, ‘phatic’, ‘thistle’, ‘chisel’, or ‘phishing’, and a bit of fəˈnətɪkˌtɹənˈskɹɪpʃn̩. Can you understand all the results, based on your own intuitions concerning different types of characters and how they make up words?

Character Classes

Examples for testing:

[a-z] (all English lowercase letters)
[A-Z] (all English uppercase letters)
[0-9] (all digits, also often abbreviated \d)
[aeiouy] (all lowercase vowel letters for English)
[Tt] (either <T> or <t>)
[A-E0-3 ] (all uppercase letters between A and E, all digits between 0 and 3, and a space)
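
The classes listed above can also be tried outside the page. The following is only a minimal sketch, assuming Python's re module rather than the JavaScript engine behind this page; the helper name find_all and the shortened sample string (two fragments of the test passage joined together) are choices made for this illustration.

```python
import re

# Two genuine fragments of the test passage, joined to keep the output short.
sample = "This is a short test paragraph. (1,2, 3, 10, ! ;)"

def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text, left to right."""
    return re.findall(pattern, text)

print(find_all(r"[A-Z]", sample))            # uppercase letters
print(find_all(r"[0-9]", sample))            # digits, one character at a time
print(find_all(r"[Tt]", sample))             # either T or t
print(find_all(r"[aeiouy]", sample))         # lowercase vowel letters
print(len(find_all(r"[A-E0-3 ]", sample)))   # how many characters are A-E, 0-3, or a space
```

Each class matches exactly one character per hit, which is why 10 is reported as two separate digits.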

Test the character classes discussed in the book (listed above again for convenience) on the sample paragraph and observe the effects. Feel free to make your own changes to the classes, too.
Character class:

Predefined shorthands/abbreviations:

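\w for all word characters, i.e. letters, digits, and the underscore (but normally not hyphens)
\W for all non-word characters, such as punctuation or whitespace
\s for whitespace characters, usually spaces, tabs, and line breaks (although some tools treat only the space character itself as whitespace)
a single . usually stands for any arbitrary character (in most implementations other than a line break), unless it occurs inside a character class, in which case it simply means a dot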
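
To see what the shorthands cover in one concrete engine, here is a small sketch, again assuming Python's re module. The re.ASCII flag is used so that \w, \W, and \s are limited to ASCII characters; other engines, including the JavaScript one used on this page, may draw the lines slightly differently. The helper find_all and the sample string are made up for this illustration.

```python
import re

# Fragment built from pieces of the test passage.
sample = "well-rounded (1,2, 3)"

def find_all(pattern, text):
    """Return all matches, with \\w, \\W and \\s restricted to ASCII."""
    return re.findall(pattern, text, flags=re.ASCII)

print(find_all(r"\w+", sample))  # runs of word characters
print(find_all(r"\W", sample))   # every single non-word character
print(find_all(r"\s", sample))   # every whitespace character
print(find_all(r".", "a.b"))     # outside a class, the dot matches any character
print(find_all(r"[.]", "a.b"))   # inside a class, the dot is just a dot
```

In this engine the hyphen in well-rounded is reported by \W, not by \w, so the word is split into two runs.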

Test the character class shorthands shown above on the sample paragraph and observe the effects.
Shorthand:

Negative Character Classes

Try to think of some negative character classes and test them on the sample paragraph. If you still have difficulties thinking of any sensible ones yourself, just negate the positive character classes we saw above and try to understand what is happening.
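
As a starting point, negating two of the earlier classes might look like the sketch below. It assumes Python's re module and the usual ^ notation directly after the opening square bracket; the helper find_all and the shortened sample are illustrative only.

```python
import re

# Fragment built from pieces of the test passage.
sample = "It will allow us (1,2, 3)"

def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

# Everything that is neither a lowercase letter nor a space.
print(find_all(r"[^a-z ]", sample))

# Runs of characters that are not digits.
print(find_all(r"[^0-9]+", sample))
```

Note that a negative class still has to match a character: it means "one character that is not in this set", not "nothing".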
Negative class:

Quantification

The basic options for quantification are:

a * following a character (class)/group means it may occur from 0 to an unlimited number of times
a ? following a character (class)/group means it is optional, i.e. it may occur zero times or once
a + following a character (class)/group means it has to occur at least once but up to an unlimited number of times
a pair of curly brackets {} following a character (class)/group specifies a more exact quantification
- {5} matches exactly 5 times
- {5,} matches at least 5 times, with no upper limit (in most implementations, there must be no space after the comma)
- {5,10} matches between 5 and 10 times
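
The quantifiers listed above can be checked one by one. This is a minimal sketch assuming Python's re module; re.fullmatch is used so that the whole test string has to fit the pattern, and the helper name matches_whole is invented for this example.

```python
import re

def matches_whole(pattern, text):
    """True if the complete text fits the pattern, False otherwise."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole(r"a*", ""))             # * allows zero occurrences
print(matches_whole(r"a*", "aaa"))          # ... and any number of them
print(matches_whole(r"a?", "aa"))           # ? allows at most one
print(matches_whole(r"a+", ""))             # + needs at least one
print(matches_whole(r"a{5}", "a" * 5))      # exactly five
print(matches_whole(r"a{5,}", "a" * 7))     # five or more
print(matches_whole(r"a{5,10}", "a" * 11))  # eleven is too many
```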

With these options, we can already look for something like \s\w+\s, i.e. only words (although, of course, we need to bear in mind that not all words are actually delimited by whitespace on both sides), or cater for dialectal differences such as the British and American spellings of the word colour via colou?r.
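
Both patterns can be tried on a short string. The sketch below assumes Python's re module, so the results may differ in detail from what the page highlights; the helper find_all and the second sample sentence are invented for this illustration.

```python
import re

# First sentence of the test passage.
sample = "This is a short test paragraph."

def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

# Whitespace-bounded words: matches cannot overlap, so a space that closes
# one match is no longer available to open the next one.
print(find_all(r"\s\w+\s", sample))

# The same idea, restricted to words of four or five characters.
print(find_all(r"\s\w{4,5}\s", sample))

# Optional u: both spellings are found.
print(find_all(r"colou?r", "a separate colour, or color"))
```

In the first result, This is missing because nothing precedes it, paragraph because it is followed by a full stop, and a and test because the whitespace in front of them has already been used up by the preceding match.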

Test the two quantification examples shown above. For the example of the whitespace-bounded words, also experiment with the curly-bracket type to practise more exact quantification. Can you already detect any practical use in this?
Quantification:

Because of the way regexes are implemented in JavaScript, which I used to achieve the highlighting of the examples, the leading whitespace before a whitespace-delimited word will not be highlighted along with the word form and the trailing space.

Anchoring, Grouping & Alternation

Anchoring

Types of anchoring:

word boundaries: indicated via \b
string/line boundaries: ^ for beginning, $ for end
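
The effect of both kinds of anchor can be sketched as follows, assuming Python's re module. The helper find_all and the second sample, which joins two fragments of the test passage, are choices made for this illustration.

```python
import re

first_sentence = "This is a short test paragraph."
# Two fragments of the test passage, joined for illustration.
and_sample = "explore and test. Can you understand all the results"

def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

print(find_all(r"and", and_sample))            # also hits the end of 'understand'
print(find_all(r"\band\b", and_sample))        # only the free-standing word
print(find_all(r"\b\w{4}\b", first_sentence))  # words of exactly four characters
print(find_all(r"^\w+", first_sentence))       # first word of the string
print(find_all(r"\w+\.$", first_sentence))     # last word plus the final full stop
```

Unlike \s, the boundary marker \b does not consume a character, so neighbouring words can all be matched.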

First, try the example of the word ‘and’ from above, with and without boundary markers. Then, using appropriate quantification, look for words of different lengths.
Words with boundaries:

Grouping & Alternation

Grouping is achieved by enclosing items in round brackets, e.g. (\sapples\s)* to look for the (whitespace-bounded) word apples and to specify that it may occur zero or any number of times in a row.

Specifying alternatives can be done by separating grouped items via the pipe symbol: (apples|bananas|grapes|pears).
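
Grouping and alternation can be sketched together as follows, assuming Python's re module. The sample string, including the word kiwis, and the helper whole_matches are invented for this illustration; the helper returns the complete match rather than just the content of the round brackets.

```python
import re

# Made-up test string; 'kiwis' is deliberately not among the alternatives.
text = "apples apples apples pears kiwis bananas"

def whole_matches(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Alternation: any one of the listed words.
print(whole_matches(r"(apples|bananas|grapes|pears)", text))

# Grouping plus quantification: the word and its trailing space, repeated.
print(whole_matches(r"(apples\s)+", text))

# With * the group may also occur zero times, so even an empty string fits.
print(re.fullmatch(r"(\sapples\s)*", "") is not None)
```

Because * also allows zero repetitions, a pattern such as (\sapples\s)* can match at every position in a text; + is often the more useful choice when at least one occurrence is wanted.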

Try something similar by specifying a regex that will find all the verbs in the sample text above, then do the same thing for all nouns. Are there any unexpected results or difficulties in specifying the patterns?
Alternation: