```r
# Load library and corpus
library(quanteda)
ICE_GB <- readRDS("ICE_GB.RDS")
```

## 3.2 Regular expressions
### Preparation
You can find the full R script associated with this unit here.
**Recommended reading**

- Lange and Leuckert (2020): Chapter 3.7
- Detailed cheatsheet (DataCamp)
### Regular expressions
Regular expressions (or ‘regex’) enable us to identify arbitrarily complex patterns in strings of text. Suppose we are interested in finding all inflectional forms of the lemma PROVIDE in a corpus, i.e., *provide*, *provides*, *providing* and *provided*. Instead of searching for each form individually, we can construct a regular expression of the form
`provid(e(s)?|ing|ed)`

which can be read as

‘Match the sequence of letters <provid> followed by either <e>, <es>, <ing> or <ed>; the <s> in <es> is optional.’
Notice how optionality is signified by the `?` operator and alternatives by `|`.
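To see the pattern in action outside of quanteda, here is a minimal sketch using base R’s `grepl()`; the vector of forms is invented for the demonstration:

```r
# Illustrative check with base R (not part of the corpus workflow)
forms <- c("provide", "provides", "providing", "provided")

# grepl() returns TRUE for every string that contains a match
grepl("provid(e(s)?|ing|ed)", forms)
# -> TRUE TRUE TRUE TRUE
```

Note that the pattern is not anchored, so it also matches *inside* longer words such as *providential* (which contains *provide*). This is exactly where the false positives discussed in this section come from.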
To activate regular expressions in a `kwic()` query, first ensure you have loaded quanteda. Then the `valuetype` argument must be set to `"regex"`:
```r
# Perform query
kwic_provide <- kwic(ICE_GB,
                     "provid(e(s)?|ing|ed)",
                     valuetype = "regex",
                     window = 20)
```

The number of hits has more than doubled. However, upon closer inspection, we’ll notice a few false positives, namely *Provident*, *providential*, *provider* and *providers*:
```r
table(kwic_provide$keyword)
```

```
     provide      provided      Provided     Provident providential      provider 
         165           118             5             1             1             1 
   providers      provides     providing     Providing 
           3            72            52             1 
```
There are two ways to handle the output:
- Refine the search expression further to only match those cases of interest.
- Manually sort out irrelevant cases during qualitative annotation in a spreadsheet software.
As a rule of thumb, you should consider improving your search expression if you obtain hundreds or even thousands of false hits. Should there be only a few false positives, it’s usually easier to simply mark them as “irrelevant” in your spreadsheet. A full workflow is demonstrated in this unit on data annotation.
### A RegEx Cheatsheet

#### Basic functions

| Command | Definition | Example | Finds |
|---|---|---|---|
| `python` | Literal sequence of characters | `python` | python |
| `.` | Any character | `.ython` | aython, bython… |
#### Character classes and alternatives

| Command | Definition | Example | Finds |
|---|---|---|---|
| `[abc]` | Class of characters | `[jp]ython` | jython, python |
| `[^pP]` | Excluded class of characters | `[^pP]ython` | everything but python, Python |
| `(...\|...)` | Alternatives linked by the logical operator *or* | `P(ython\|eter)` | Python, Peter |
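The class and alternation patterns from the table can be tried out directly with base R’s `grepl()`; the test words are made up for illustration:

```r
words <- c("python", "jython", "Python", "Peter")

grepl("[jp]ython", words)      # class of characters -> TRUE  TRUE  FALSE FALSE
grepl("[^pP]ython", words)     # excluded class      -> FALSE TRUE  FALSE FALSE
grepl("P(ython|eter)", words)  # alternatives        -> FALSE FALSE TRUE  TRUE
```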
#### Quantifiers

| Command | Definition | Example | Finds |
|---|---|---|---|
| `?` | One or zero instances of the preceding symbol | `Py?thon` | Python, Pthon |
| `*` | Any number of instances, including zero | `Py*thon` | Python, Pthon, Pyyyython… |
|  |  | `P[Yy]*thon` | Python, Pthon, PyYYython… |
| `+` | Any number of instances, but at least one | `Py+thon` | Python, Pyyython, Pyyyython |
| `{1,3}` | Between min and max instances: `{min,max}` | `Py{1,3}thon` | Python, Pyython, Pyyython |
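The quantifiers behave as follows in a quick base R check (the variant spellings are invented for the demo):

```r
variants <- c("Python", "Pthon", "Pyyython")

grepl("Py?thon", variants)     # zero or one <y>   -> TRUE  TRUE  FALSE
grepl("Py*thon", variants)     # any number of <y> -> TRUE  TRUE  TRUE
grepl("Py+thon", variants)     # at least one <y>  -> TRUE  FALSE TRUE
grepl("Py{1,3}thon", variants) # one to three <y>  -> TRUE  FALSE TRUE
```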
#### Pre-defined character classes

The double backslashes (`\\`) shown here are not part of the regular expression itself: they are required by R string literals in general (not just by the quanteda package), because R treats a single backslash as the start of an escape sequence. The string `"\\w"` in R source code thus denotes the two-character regex `\w`. In most other programming languages, including Python, you only need a single backslash (e.g., `\w`, `\d`, `\s`).
| Command | Definition | Example | Finds |
|---|---|---|---|
| `\\w` | Any word character (A-Z, a-z, 0-9 and underscore) | `\\w+ing` | walking, running, 42ing |
| `\\W` | Any non-word character | `hello\\W+world` | hello world, hello!!!world |
| `\\d` | Any decimal digit (0-9) | `\\d{3}-\\d{4}` | 555-1234, 867-5309 |
| `\\D` | Any character that is not a decimal digit | `\\D+` | Hello!, Python_code |
| `\\s` | Whitespace (space, tab, newline) | `word\\s+word` | word word |
| `\\b` | Word boundary | `\\bpython\\b` | python as a whole word |
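The effect of the double escaping can be inspected in the R console: `writeLines()` reveals the string that the regex engine actually receives. The example words below are invented for illustration:

```r
pattern <- "\\w+ing"
writeLines(pattern)  # prints \w+ing -- the regex engine sees a single backslash

grepl("\\w+ing", c("walking", "42ing", "--ing"))  # -> TRUE TRUE FALSE
grepl("\\bpython\\b", c("python", "pythonic"))    # -> TRUE FALSE
```

The last line shows why `\\b` is useful for lemma searches: it blocks matches inside longer words such as *pythonic*.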
### Querying parsed corpora
The range of linguistic patterns to be matched can be extended further if the corpus contains additional metadata, such as the part of speech (POS) of a token. POS-tagged corpora open up the option of looking for more abstract patterns, such as all instances of the verb eat that are followed by a pronoun or noun. Load the parsed corpus first:
```r
# Load parsed corpus
ICE_GB_POS <- readRDS("ICE_GB_POS.RDS")
```

Proceed by providing a regular expression that captures all delimited inflectional forms of *eat* that are followed by any unit bearing the tag `_PRON` or `_NOUN`.
```r
# Perform query
kwic_provide_POS2 <- kwic(ICE_GB_POS,
                          phrase("\\b(ate|eat(s|ing|en)?)_VERB\\b _(PRON|NOUN)"),
                          valuetype = "regex",
                          window = 5)
head(kwic_provide_POS2)
```

| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| S1A-009.txt | 1198 | 1199 | I_PRON must_AUX < , > | eat_VERB them_PRON | < ICE-GB:S1A-009 #71 : 1 | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
| S1A-010.txt | 958 | 959 | to < , > actually_ADV | eat_VERB it_PRON | for_ADP one_NOUN ' s_PART own_ADJ | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
| S1A-011.txt | 3245 | 3246 | : A > I_PRON have_AUX | eaten_VERB my_PRON | way_NOUN round_ADP the_DET Yorkshire_PROPN Dales_PROPN | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
| S1A-011.txt | 4159 | 4160 | I_PRON ended_VERB up_ADP uhm_NOUN just_ADV | eating_VERB sort_NOUN | of_ADP lumps_NOUN of chicken_NOUN and_CCONJ | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
| S1A-018.txt | 455 | 456 | order_VERB one_NUM first_ADJ and_CCONJ then_ADV | eat_VERB it_PRON | and then sort_ADV of_ADV carry_VERB | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
| S1A-019.txt | 1038 | 1039 | A > and_CCONJ everybody_PRON was_AUX | eating_VERB something_PRON | < ICE-GB:S1A-019 #76 : 1 | \b(ate\|eat(s\|ing\|en)?)_VERB\b _(PRON\|NOUN) |
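The verb part of the pattern can be sketched on a few hand-made tokens in the `word_TAG` format with base R (quanteda’s `phrase()` additionally splits the pattern at the space, so `_(PRON|NOUN)` is matched against the *following* token):

```r
# Invented tagged tokens, mimicking the word_TAG format of ICE_GB_POS
tagged <- c("eat_VERB", "eats_VERB", "ate_VERB", "eating_VERB",
            "eaten_VERB", "beaten_VERB", "meat_NOUN")

grepl("\\b(ate|eat(s|ing|en)?)_VERB\\b", tagged)
# -> TRUE TRUE TRUE TRUE TRUE FALSE FALSE
# The \b boundaries stop "beaten" and "meat" from matching,
# and the _VERB suffix excludes nominal uses of eat.
```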
One R package that supplies functions for tokenisation, POS-tagging and even dependency parsing for dozens of languages is udpipe. Its models all rely on one common set of tags known as the Universal Dependencies POS tags, which are listed here:
Cf. https://universaldependencies.org/u/pos/.
| POS Tag | Description |
|---|---|
| ADJ | Adjective: describes a noun (e.g., big, old, green, first) |
| ADP | Adposition: prepositions and postpositions (e.g., in, to, over) |
| ADV | Adverb: modifies verbs, adjectives, or other adverbs (e.g., quickly, very) |
| AUX | Auxiliary: helps form verb tenses, moods, or voices (e.g., is, have, will) |
| CCONJ | Coordinating conjunction: links words, phrases, or clauses (e.g., and, or, but) |
| DET | Determiner: introduces nouns (e.g., the, a, some, my) |
| INTJ | Interjection: expresses emotion or reaction (e.g., oh, wow, hello) |
| NOUN | Noun: person, place, thing, or concept (e.g., cat, city, idea) |
| NUM | Numeral: expresses a number or ranking (e.g., one, two, second) |
| PART | Particle: adds meaning without being an independent word class (e.g., not, to as in to run) |
| PRON | Pronoun: replaces nouns (e.g., he, she, they, it) |
| PROPN | Proper noun: names specific entities (e.g., London, John) |
| PUNCT | Punctuation: marks boundaries in text (. , ! ?) |
| SCONJ | Subordinating conjunction: links clauses, often indicating dependency (e.g., if, because, although) |
| SYM | Symbol: non-alphanumeric symbol (e.g., %, &, #) |
| VERB | Verb: action or state (e.g., run, be, have) |
| X | Other: used when a word doesn’t fit into other categories |
### Exercises
You can find the solutions to the exercises here.
**Exercise 1** How could you refine the search expression for PROVIDE, `"provid(e(s)?|ing|ed)"`, to get rid of the irrelevant cases?
**Exercise 2** Write elegant regular expressions which find all inflectional forms of the following verbs:

- accept
- attach
- swim
- know
- forget
**Exercise 3** Find all nouns ending in *-er*.

**Exercise 4** Find all four-digit numbers.

**Exercise 5** Find all verbs that are followed by a preposition.