```r
# Load library and corpus
library(quanteda)
ICE_GB <- readRDS("ICE_GB.RDS")
```

## 3.2 Regular expressions
### Preparation
You can find the full R script associated with this unit here.
**Recommended reading**

- Lange and Leuckert (2020): Chapter 3.7
- Detailed cheatsheet (DataCamp)
### Regular expressions
Regular expressions (or ‘regex’) enable us to identify arbitrarily complex patterns in strings of text. Suppose we are interested in finding all inflectional forms of the lemma PROVIDE in a corpus, i.e., *provide*, *provides*, *providing* and *provided*. Instead of searching for each form individually, we can construct a regular expression of the form
`provid(e(s)?|ing|ed)`

which can be read as

‘Match the sequence of letters <provid> followed by either <e>, <es>, <ing> or <ed>; the <s> in <es> is optional.’
Notice how optionality is signified by the `?` operator and alternatives by `|`.
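To see the pattern in action outside of quanteda, here is a minimal sketch using base R’s `grepl()`; the vector of forms is invented for the demonstration:

```r
# Illustrative check with base R (not part of the corpus workflow)
forms <- c("provide", "provides", "providing", "provided")

# grepl() returns TRUE for every string that contains a match
grepl("provid(e(s)?|ing|ed)", forms)
# -> TRUE TRUE TRUE TRUE
```

Note that the pattern is not anchored, so it also matches *inside* longer words such as *providential* (which contains *provide*). This is exactly where the false positives discussed in this section come from.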
To activate regular expressions in a `kwic()` query, first ensure you have loaded quanteda. Then the `valuetype` argument must be set to `"regex"`:
```r
# Perform query
kwic_provide <- kwic(ICE_GB,
                     "provid(e(s)?|ing|ed)",
                     valuetype = "regex",
                     window = 20)
```

The number of hits has more than doubled. However, upon closer inspection, we’ll notice a few false positives, namely *Provident*, *providential*, *provider* and *providers*:
```r
table(kwic_provide$keyword)
```

```
     provide      provided      Provided     Provident providential      provider 
         165           118             5             1             1             1 
   providers      provides     providing     Providing 
           3            72            52             1 
```
There are two ways to handle the output:
- Refine the search expression further to only match those cases of interest.
- Manually sort out irrelevant cases during qualitative annotation in a spreadsheet software.
As a rule of thumb, you should consider improving your search expression if you obtain hundreds or even thousands of false hits. Should there be only a few false positives, it’s usually easier to simply mark them as “irrelevant” in your spreadsheet. A full workflow is demonstrated in this unit on data annotation.
### A RegEx Cheatsheet

#### Basic functions

| Command | Definition | Example | Finds |
|---|---|---|---|
| `python` | Literal sequence of characters | `python` | python |
| `.` | Any character | `.ython` | aython, bython… |
#### Character classes and alternatives

| Command | Definition | Example | Finds |
|---|---|---|---|
| `[abc]` | Class of characters | `[jp]ython` | jython, python |
| `[^pP]` | Excluded class of characters | `[^pP]ython` | everything but python, Python |
| `(...\|...)` | Alternatives linked by the logical operator *or* | `P(ython\|eter)` | Python, Peter |
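The class and alternation patterns from the table can be tried out directly with base R’s `grepl()`; the test words are made up for illustration:

```r
words <- c("python", "jython", "Python", "Peter")

grepl("[jp]ython", words)      # class of characters -> TRUE  TRUE  FALSE FALSE
grepl("[^pP]ython", words)     # excluded class      -> FALSE TRUE  FALSE FALSE
grepl("P(ython|eter)", words)  # alternatives        -> FALSE FALSE TRUE  TRUE
```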
#### Quantifiers

| Command | Definition | Example | Finds |
|---|---|---|---|
| `?` | One or zero instances of the preceding symbol | `Py?thon` | Python, Pthon |
| `*` | Any number of instances, including zero | `Py*thon` | Python, Pthon, Pyyyython… |
|  |  | `P[Yy]*thon` | Python, Pthon, PyYYython… |
| `+` | Any number of instances, but at least one | `Py+thon` | Python, Pyyython, Pyyyython |
| `{1,3}` | Between min and max instances: `{min,max}` | `Py{1,3}thon` | Python, Pyython, Pyyython |
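The quantifiers behave as follows in a quick base R check (the variant spellings are invented for the demo):

```r
variants <- c("Python", "Pthon", "Pyyython")

grepl("Py?thon", variants)     # zero or one <y>   -> TRUE  TRUE  FALSE
grepl("Py*thon", variants)     # any number of <y> -> TRUE  TRUE  TRUE
grepl("Py+thon", variants)     # at least one <y>  -> TRUE  FALSE TRUE
grepl("Py{1,3}thon", variants) # one to three <y>  -> TRUE  FALSE TRUE
```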
#### Pre-defined character classes

The double backslashes (`\\`) shown here are not part of the regular expression itself: they are required by R string literals in general (not just by the quanteda package), because R treats a single backslash as the start of an escape sequence. The string `"\\w"` in R source code thus denotes the two-character regex `\w`. In most other programming languages, including Python, you only need a single backslash (e.g., `\w`, `\d`, `\s`).
| Command | Definition | Example | Finds |
|---|---|---|---|
| `\\w` | Any word character (A-Z, a-z, 0-9 and underscore) | `\\w+ing` | walking, running, 42ing |
| `\\W` | Any non-word character | `hello\\W+world` | hello world, hello!!!world |
| `\\d` | Any decimal digit (0-9) | `\\d{3}-\\d{4}` | 555-1234, 867-5309 |
| `\\D` | Any character that is not a decimal digit | `\\D+` | Hello!, Python_code |
| `\\s` | Whitespace (space, tab, newline) | `word\\s+word` | word word |
| `\\b` | Word boundary | `\\bpython\\b` | python as a whole word |
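The effect of the double escaping can be inspected in the R console: `writeLines()` reveals the string that the regex engine actually receives. The example words below are invented for illustration:

```r
pattern <- "\\w+ing"
writeLines(pattern)  # prints \w+ing -- the regex engine sees a single backslash

grepl("\\w+ing", c("walking", "42ing", "--ing"))  # -> TRUE TRUE FALSE
grepl("\\bpython\\b", c("python", "pythonic"))    # -> TRUE FALSE
```

The last line shows why `\\b` is useful for lemma searches: it blocks matches inside longer words such as *pythonic*.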
### Querying parsed corpora
The range of linguistic patterns to be matched can be extended further if the corpus contains additional metadata, such as the part of speech (POS) of a token. POS-tagged corpora open up the option of looking for more abstract patterns, such as all instances of the verb eat that are followed by a pronoun or noun. Load the parsed corpus first:
```r
# Load parsed corpus
ICE_GB_POS <- readRDS("ICE_GB_POS.RDS")
```

Proceed by providing a regular expression that captures all delimited inflectional forms of *eat* that are followed by any unit bearing the tag `_PRON` or `_NOUN`.
```r
# Perform query
kwic_provide_POS2 <- kwic(ICE_GB_POS,
                          phrase("\\b(ate|eat(s|ing|en)?)_VERB\\b _(PRON|NOUN)"),
                          valuetype = "regex",
                          window = 5)
head(kwic_provide_POS2)
```

| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| S1A-009.txt | 1198 | 1199 | I_PRON must_AUX < , > | eat_VERB them_PRON | < ICE-GB:S1A-009 #71 : 1 | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
| S1A-010.txt | 958 | 959 | to < , > actually_ADV | eat_VERB it_PRON | for_ADP one_NOUN ' s_PART own_ADJ | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
| S1A-011.txt | 3245 | 3246 | : A > I_PRON have_AUX | eaten_VERB my_PRON | way_NOUN round_ADP the_DET Yorkshire_PROPN Dales_PROPN | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
| S1A-011.txt | 4159 | 4160 | I_PRON ended_VERB up_ADP uhm_NOUN just_ADV | eating_VERB sort_NOUN | of_ADP lumps_NOUN of chicken_NOUN and_CCONJ | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
| S1A-018.txt | 455 | 456 | order_VERB one_NUM first_ADJ and_CCONJ then_ADV | eat_VERB it_PRON | and then sort_ADV of_ADV carry_VERB | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
| S1A-019.txt | 1038 | 1039 | A > and_CCONJ everybody_PRON was_AUX | eating_VERB something_PRON | < ICE-GB:S1A-019 #76 : 1 | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
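The verb part of the pattern can be sketched on a few hand-made tokens in the `word_TAG` format with base R (quanteda’s `phrase()` additionally splits the pattern at the space, so `_(PRON|NOUN)` is matched against the *following* token):

```r
# Invented tagged tokens, mimicking the word_TAG format of ICE_GB_POS
tagged <- c("eat_VERB", "eats_VERB", "ate_VERB", "eating_VERB",
            "eaten_VERB", "beaten_VERB", "meat_NOUN")

grepl("\\b(ate|eat(s|ing|en)?)_VERB\\b", tagged)
# -> TRUE TRUE TRUE TRUE TRUE FALSE FALSE
# The \b boundaries stop "beaten" and "meat" from matching,
# and the _VERB suffix excludes nominal uses of eat.
```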
One R package that supplies functions for tokenisation, POS-tagging and even dependency parsing for dozens of languages is udpipe. Its models all rely on one common set of tags known as the Universal Dependencies POS tags, which are listed here:
Cf. https://universaldependencies.org/u/pos/.
| POS Tag | Description |
|---|---|
| ADJ | Adjective: describes a noun (e.g., big, old, green, first) |
| ADP | Adposition: prepositions and postpositions (e.g., in, to, over) |
| ADV | Adverb: modifies verbs, adjectives, or other adverbs (e.g., quickly, very) |
| AUX | Auxiliary: helps form verb tenses, moods, or voices (e.g., is, have, will) |
| CCONJ | Coordinating conjunction: links words, phrases, or clauses (e.g., and, or, but) |
| DET | Determiner: introduces nouns (e.g., the, a, some, my) |
| INTJ | Interjection: expresses emotion or reaction (e.g., oh, wow, hello) |
| NOUN | Noun: person, place, thing, or concept (e.g., cat, city, idea) |
| NUM | Numeral: expresses a number or ranking (e.g., one, two, second) |
| PART | Particle: adds meaning without being an independent word class (e.g., not, to as in to run) |
| PRON | Pronoun: replaces nouns (e.g., he, she, they, it) |
| PROPN | Proper noun: names specific entities (e.g., London, John) |
| PUNCT | Punctuation: marks boundaries in text (. , ! ?) |
| SCONJ | Subordinating conjunction: links clauses, often indicating dependency (e.g., if, because, although) |
| SYM | Symbol: non-alphanumeric symbol (e.g., %, &, #) |
| VERB | Verb: action or state (e.g., run, be, have) |
| X | Other: used when a word doesn’t fit into other categories |
### Exercises
You can find the solutions to the exercises here.
**Exercise 1** How could you refine the search expression for PROVIDE, `"provid(e(s)?|ing|ed)"`, to get rid of the irrelevant cases?
**Exercise 2** Write elegant regular expressions which find all inflectional forms of the following verbs:

- accept
- attach
- swim
- know
- forget
**Exercise 3** Find all nouns ending in *-er*.

**Exercise 4** Find all four-digit numbers.

**Exercise 5** Find all verbs that are followed by a preposition.