12  Regular expressions

Authors
Affiliation

Vladimir Buskin

Catholic University of Eichstätt-Ingolstadt

Thomas Brunner

Catholic University of Eichstätt-Ingolstadt

12.1 Preparation

Script

You can find the full R script associated with this unit here.

12.2 Suggested reading

Lange and Leuckert (2020): Chapter 3.7

Detailed cheatsheet (DataCamp)

12.3 Regular expressions

Regular expressions (or ‘regex’) help us find more complex patterns in strings of text. Suppose we are interested in finding all inflectional forms of the lemma PROVIDE in a corpus, i.e., provide, provides, providing and provided. Insteading of searching for all forms individually, we can construct a regular expression of the form

\[ \text{provid(e(s)? | ing | ed)} \] which can be read as

‘Match the sequence of letters <provid> as well as when it is followed by the groups of letters <es> or <ing> or <ed>. Also make the <s> in <es> optional.’

Notice how optionality is signified by the ? operator and alternatives by |.

To activate regular expression in a kwic() query, the valuetype argument has to be set to "regex":

# Load library and corpus
library(quanteda)
ICE_GB <- readRDS("ICE_GB.RDS")

# Perform query
kwic_provide <- kwic(ICE_GB,
                     "provid(e(s)?|ing|ed)",
                     valuetype = "regex",
                     window = 20)

The number of hits has more than doubled. However, upon closer inspection, we’ll notice a few false positives, namely providential, provider and providers:

table(kwic_provide$keyword)

     provide     provided     Provided    Provident providential     provider 
         165          118            5            1            1            1 
   providers     provides    providing    Providing 
           3           72           52            1 

There are two ways to handle the output:

  1. Refine the search expression further to only match those cases of interest.
  2. Manually sort out irrelevant cases during qualitative annotation in a spreadsheet software.

As a rule of thumb, you should consider improving your search expression if you obtain hundreds or even thousands of false hits. Should there be only few false positives, it’s usually easier to simply mark them as “irrelevant” in your spreadsheet. A full workflow is demonstrated in unit 13. Data annotation.

12.4 A RegEx Cheatsheet

12.4.1 Basic functions

Command Definition Example Finds
python python
. Any character .ython aython, bython…

12.4.2 Character classes and alternatives

Command Definition Example Finds
[abc] Class of characters [jp]ython jython, python
[^pP] Excluded class of characters [^pP]ython everything but python, Python
(...|...) Alternatives linked by logical operator or P(ython|eter) Python, Peter

12.4.3 Quantifiers

Command Definition Example Finds
? One or zero instances of the preceding symbol Py?thon Python, Pthon
* No matter how many times — also zero Py*thon Python, Pthon, Pyyyython…
P[Yy]*thon Python, Pthon, PyYYython…
+ No matter how many times but at least once Py+thon Python, Pyyython, Pyyyython
{1,3} {min, max} Py{1,3}thon Python, Pyython, Pyyython

12.4.4 Pre-defined character classes

Note

The double backslashes (\\) shown here are specific to the quanteda R package. In most other programming languages including Python, you only need a single backslash (e.g., \w, \d, \s). This double-escaping is an R-specific requirement due to how R handles string literals.

Command Definition Example Finds
\\w All alphanumeric characters (A-Z, a-z, 0-9) \\w+ing walking, running, 42ing
\\W All non-alphanumeric characters hello\\W+world hello world, hello!!!world
\\d All decimal numbers (0-9) \\d{3}-\\d{4} 555-1234, 867-5309
\\D Everything which is not a decimal number \\D+ Hello!, Python_code
\\s Empty space word\\s+word word word
\\b Word boundary \\bpython\\b Matches python as a whole word

12.5 Querying parsed corpora

The range of linguistic patterns to be matched can be extended further if the corpus contains additional metadata, such as the part of speech (POS) of a token. POS-tagged corpora open up the option of looking for more abstract patterns, such as all instances of the verb eat that are followed by a pronoun or noun:

# Load library and corpus
ICE_GB_POS <- readRDS("ICE_GB_POS.RDS")

# Perform query
kwic_provide_POS2 <- kwic(ICE_GB_POS,
                     phrase("\\b(ate|eat(s|ing|en)?)_VERB\\b _(PRON|NOUN)"),
                     valuetype = "regex",
                     window = 5)

head(kwic_provide_POS2)
docname from to pre keyword post pattern
S1A-009.txt 1198 1199 I_PRON must_AUX < , > eat_VERB them_PRON < ICE-GB:S1A-009 #71 : 1 \b(ate|eat(s|ing|en)?)_VERB\b _(PRON|NOUN)
S1A-010.txt 958 959 to < , > actually_ADV eat_VERB it_PRON for_ADP one_NOUN ' s_PART own_ADJ \b(ate|eat(s|ing|en)?)_VERB\b _(PRON|NOUN)
S1A-011.txt 3245 3246 : A > I_PRON have_AUX eaten_VERB my_PRON way_NOUN round_ADP the_DET Yorkshire_PROPN Dales_PROPN \b(ate|eat(s|ing|en)?)_VERB\b _(PRON|NOUN)
S1A-011.txt 4159 4160 I_PRON ended_VERB up_ADP uhm_NOUN just_ADV eating_VERB sort_NOUN of_ADP_ADP lumps_NOUN of chicken_NOUN and_CCONJ \b(ate|eat(s|ing|en)?)_VERB\b _(PRON|NOUN)
S1A-018.txt 455 456 order_VERB on_ADPe_NUM first_ADJ and_CCONJ_CCONJ then_ADV_ADV eat_VERB it_PRON and then sort_ADV of_ADV carry_VERB \b(ate|eat(s|ing|en)?)_VERB\b _(PRON|NOUN)
S1A-019.txt 1038 1039 A > and_CCONJ everybody_PRON was_AUX eating_VERB something_PRON < ICE-GB:S1A-019 #76 : 1 \b(ate|eat(s|ing|en)?)_VERB\b _(PRON|NOUN)

One R package that supplies functions for tokenisation, POS-tagging and even dependency parsing for dozens of languages is udpipe. They all rely on one common set of tags known as Universal Dependencies, which are listed here:

Universial dependencies – Tagset

Cf. https://universaldependencies.org/u/pos/.

POS Tag Description
ADJ Adjective: describes a noun (e.g., big, old, green, first)
ADP Adposition: prepositions and postpositions (e.g., in, to, over)
ADV Adverb: modifies verbs, adjectives, or other adverbs (e.g., quickly, very)
AUX Auxiliary: helps form verb tenses, moods, or voices (e.g., is, have, will)
CCONJ Coordinating conjunction: links words, phrases, or clauses (e.g., and, or, but)
DET Determiner: introduces nouns (e.g., the, a, some, my)
INTJ Interjection: expresses emotion or reaction (e.g., oh, wow, hello)
NOUN Noun: person, place, thing, or concept (e.g., cat, city, idea)
NUM Numeral: expresses a number or ranking (e.g., one, two, second)
PART Particle: adds meaning without being an independent word class (e.g., not, to as in to run)
PRON Pronoun: replaces nouns (e.g., he, she, they, it)
PROPN Proper noun: names specific entities (e.g., London, Vladimir)
PUNCT Punctuation: marks boundaries in text (. , ! ?)
SCONJ Subordinating conjunction: links clauses, often indicating dependency (e.g., if, because, although)
SYM Symbol: non-alphanumeric symbol (e.g., %, &, #)
VERB Verb: action or state (e.g., run, be, have)
X Other: used when a word doesn’t fit into other categories

12.6 Working with other platforms: CQP

Some corpus platforms such as BNCweb, CQPweb or our local KU corpora1 rely on a special syntax known as CQP (Corpus Query Processor). CQP syntax is very powerful in that enables users to query for strings with specific meta-attributes (text category, age, gender etc.). For more details, see the CQP documentation.

  • 1 Note that the corpus platform of the Catholic University of Eichstätt-Ingolstadt, which was set up by Dr. Thomas Brunner, is only accessible to students or staff via the local eduroam network or a VPN client which holds their credentials.

  • CQP query of PROVIDE in the GloWbE corpus

    12.7 Exercises

    Solutions

    You can find the solutions to the exercises here.

    Exercise 12.1 How could you refine the search expression for PROVIDE "provid(e(s)?|ing|ed)" to get rid of the irrelevant cases?

    Exercise 12.2 Write elegant regular expressions which find all inflectional forms of the following verbs:

    • accept

    • attach

    • swim

    • know

    • forget

    Exercise 12.3 Find all nouns ending in -er.

    Exercise 12.4 Find all four-digit numbers.

    Exercise 12.5 Find all verbs that are followed by a preposition.