Regular expressions (or ‘regex’) help us find more complex patterns in strings of text. Suppose we are interested in finding all inflectional forms of the lemma PROVIDE in a corpus, i.e., provide, provides, providing and provided. Instead of searching for each form individually, we can construct a regular expression of the form
\[
\text{provid(e(s)?|ing|ed)}
\]
which can be read as:
‘Match the sequence of letters <provid> only when it is followed by one of the groups of letters <es>, <ing> or <ed>, and make the <s> in <es> optional.’
Notice how optionality is signified by the ? operator and alternatives by |.
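To see the pattern at work outside of a corpus query, here is a minimal sketch using Python's built-in re module rather than quanteda. The word list is made up for illustration, and fullmatch() is used so that a form only counts as a hit if the whole string fits the pattern.

```python
import re

# The pattern from the text: <provid> plus one of the endings e, es, ing or ed.
pattern = re.compile(r"provid(e(s)?|ing|ed)")

# Illustrative word list (not taken from a corpus).
forms = ["provide", "provides", "providing", "provided", "provision"]

# fullmatch() succeeds only if the entire string fits the pattern.
matched = [form for form in forms if pattern.fullmatch(form)]
print(matched)
```

The four inflectional forms are kept, while provision is rejected because none of the three alternatives can follow <provi> there.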
To activate regular expressions in a kwic() query, the valuetype argument has to be set to "regex":
The number of hits has more than doubled. However, upon closer inspection, we’ll notice a few false positives, namely providential, provider and providers:
There are two ways to deal with such false positives:
Refine the search expression further so that it only matches the cases of interest.
Manually sort out irrelevant cases during qualitative annotation in spreadsheet software.
As a rule of thumb, you should consider improving your search expression if you obtain hundreds or even thousands of false hits. Should there be only a few false positives, it’s usually easier to simply mark them as “irrelevant” in your spreadsheet. A full workflow is demonstrated in unit 13. Data annotation.
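One common way of refining an expression is to anchor it with ^ (start) and $ (end). The sketch below illustrates this in Python's re module, not in quanteda; it assumes that the false positives arise because an unanchored pattern may match anywhere inside a token. The token list is invented for the example, and whether the anchored pattern behaves identically in a kwic() call has not been tested here.

```python
import re

# Invented token list containing the false positives mentioned in the text.
tokens = ["provide", "provider", "provides", "providers",
          "providing", "providential", "provided"]

# Unanchored: the pattern may match any part of a token.
loose = re.compile(r"provid(e(s)?|ing|ed)")
# Anchored: ^ and $ tie the pattern to the start and end of the token.
strict = re.compile(r"^provid(e(s)?|ing|ed)$")

loose_hits = [t for t in tokens if loose.search(t)]
strict_hits = [t for t in tokens if strict.search(t)]

print(loose_hits)
print(strict_hits)
```

The unanchored pattern accepts every token because each one contains the string provide, whereas the anchored version keeps only the four target forms.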
A RegEx Cheatsheet
Basic functions
Command        Definition                              Example    Finds
(plain text)   Literal characters match themselves     python     python
.              Any character                           .ython     aython, bython…
Character classes and alternatives
Command      Definition                                        Example          Finds
[abc]        Class of characters                               [jp]ython        jython, python
[^pP]        Excluded class of characters                      [^pP]ython       any character other than p or P followed by ython (aython, jython…), but not python or Python
(...|...)    Alternatives linked by the logical operator or    P(ython|eter)    Python, Peter
Quantifiers
Command   Definition                                                      Example        Finds
?         Zero or one instance of the preceding symbol                    Py?thon        Python, Pthon
*         Zero or more instances of the preceding symbol                  Py*thon        Python, Pthon, Pyyyython…
                                                                          P[Yy]*thon     Python, Pthon, PyYYython…
+         One or more instances of the preceding symbol                   Py+thon        Python, Pyyython, Pyyyython…
{1,3}     Between min and max instances of the preceding symbol           Py{1,3}thon    Python, Pyython, Pyyython
          ({min,max})
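The quantifiers can be compared side by side with a small experiment. This is a sketch in Python's re module; the candidate strings are invented, and fullmatch() is used so that each candidate has to fit the pattern from beginning to end.

```python
import re

# Invented candidates with zero to four instances of <y>.
candidates = ["Pthon", "Python", "Pyython", "Pyyython", "Pyyyython"]
patterns = ["Py?thon", "Py*thon", "Py+thon", "Py{1,3}thon"]

# For each pattern, collect the candidates it matches in full.
results = {
    pattern: [word for word in candidates if re.fullmatch(pattern, word)]
    for pattern in patterns
}

for pattern, words in results.items():
    print(f"{pattern:12} {words}")
```

Only * accepts all five candidates; ? stops after a single <y>, + requires at least one, and {1,3} rejects the form with four.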
Pre-defined character classes
Note
The double backslashes (\\) shown here are needed because the patterns are written as R string literals, as is the case in quanteda queries. R itself treats the backslash as an escape character, so \\w in an R string reaches the regex engine as \w. In many other settings, including Python raw strings, a single backslash suffices (e.g., \w, \d, \s).
Command   Definition                                                       Example           Finds
\\w       Any word character: letters, digits and the underscore           \\w+ing           walking, running, 42ing
          (A-Z, a-z, 0-9, _)
\\W       Any character that is not a word character                       hello\\W+world    hello world, hello!!!world
\\d       Any decimal digit (0-9)                                          \\d{3}-\\d{4}     555-1234, 867-5309
\\D       Any character that is not a decimal digit                        \\D+              Hello!, Python_code
\\s       Whitespace (space, tab, line break)                              word\\s+word      word word
\\b       Word boundary                                                    \\bpython\\b      python as a whole word
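For comparison, the sketch below tries some of the pre-defined classes in Python, where raw strings (r"...") allow a single backslash. The example sentence is invented; the last line checks that the underscore counts as a word character.

```python
import re

# Invented example sentence.
text = "Call 555-1234 after walking, or try python3 and python."

# \d{3}-\d{4}: three digits, a hyphen, four digits.
phone = re.findall(r"\d{3}-\d{4}", text)

# \w+ing: one or more word characters followed by "ing".
ing_words = re.findall(r"\w+ing", text)

# \bpython\b: "python" only as a whole word, so "python3" is skipped.
whole_word = re.findall(r"\bpython\b", text)

# \w also covers the underscore, not just letters and digits.
underscore_is_word_char = re.fullmatch(r"\w+", "Python_code") is not None

print(phone, ing_words, whole_word, underscore_is_word_char)
```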
Querying parsed corpora
The range of linguistic patterns to be matched can be extended further if the corpus contains additional metadata, such as the part of speech (POS) of a token. POS-tagged corpora open up the option of looking for more abstract patterns, such as all instances of the verb eat that are followed by a pronoun or noun:
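As a rough illustration of the idea, the following Python sketch searches a hand-tagged toy sentence for forms of eat that are directly followed by a noun or pronoun. No tagger is run here: the tokens, lemmas and tags are typed in manually, and the tag labels are assumed to follow the Universal Dependencies scheme.

```python
# Hand-tagged toy data as (word, lemma, pos) triples; nothing here comes
# from a real tagger, and the tag labels are an assumption.
tagged = [
    ("She", "she", "PRON"),
    ("eats", "eat", "VERB"),
    ("apples", "apple", "NOUN"),
    ("and", "and", "CCONJ"),
    ("he", "he", "PRON"),
    ("ate", "eat", "VERB"),
    ("it", "it", "PRON"),
    ("quickly", "quickly", "ADV"),
    (".", ".", "PUNCT"),
]

hits = []
# Pair every token with its right-hand neighbour.
for (word, lemma, pos), (next_word, _, next_pos) in zip(tagged, tagged[1:]):
    if lemma == "eat" and pos == "VERB" and next_pos in {"NOUN", "PRON"}:
        hits.append((word, next_word))

print(hits)
```

Because the condition refers to the lemma and the tags rather than to the word forms, both eats apples and ate it are found with a single query.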
One R package that supplies functions for tokenisation, POS-tagging and even dependency parsing for dozens of languages is udpipe. Its language models all rely on one common set of tags known as Universal Dependencies, which are listed here:
Other: used when a word doesn’t fit into other categories
Working with other corpora
Some corpus platforms such as BNCweb, CQPweb or our local KU corpora [1] support a specialised query syntax known as CQP (Corpus Query Processor), which enables users to query for strings with specific meta-attributes (text category, age, gender etc.).
[1] Note that the corpus platform of the Catholic University of Eichstätt-Ingolstadt, which was set up by Dr. Thomas Brunner, is only accessible to students or staff via the local eduroam network or a VPN client which holds their credentials.
Basic use
Once signed in on any of these platforms, the user interface will generally follow this layout:
CQP query of PROVIDE in the GloWbE corpus
To select a different corpus, navigate to the CQPweb main menu under About CQPweb in the left menu bar.
To restrict search results to specific parts of the corpus, select Restricted query and choose the relevant text categories, varieties, etc.
You can now enter a search expression into the white box using the following pattern:
[attribute = "property"]
For example, to retrieve all inflectional forms [2] of the verb provide, enter:
[lemma = "provide"]
[2] Some corpora do not support the lemma attribute and require manual listing of all inflectional forms via regular expressions, such as [word = "provid(e(s)?|ed|ing)"] for provide.
To search for a word with a specific part-of-speech (POS) tag – such as like used as a preposition – use:
If the exact word string is not important and you are only interested in part-of-speech sequences, you can supply the regular expression .+ to the word attribute, which means ‘match everything’. For example, to find all common nouns tagged nn that are followed by the preposition of (tagged io), we can write:
[word = ".+" & pos = "nn"][word = ".+" & pos = "io"]
Regular expressions also allow you to match multiple POS-tags, such as all noun categories (nn, nn1, nn2, nna, etc.) or all prepositions (if, ii, io). The search term below generalises the previous example, making it more inclusive:
[word = ".+" & pos = "n.+"][word = ".+" & pos = "i.+"]
However, be cautious with overly broad search patterns, as they might return (nearly) the entire corpus! The last one yields 22,294,282 matches.
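The logic of such token-sequence queries can be imitated in a few lines of Python. This is only a sketch of the principle, not of CQP itself: the helper find_sequences() and the toy corpus are hypothetical, the tags are assigned by hand, and the code assumes that each regular expression has to match the complete attribute value.

```python
import re

# Toy corpus of (word, pos) pairs with hand-assigned, CLAWS-style tags.
corpus = [
    ("sheep", "nn"), ("of", "io"), ("the", "at"),
    ("farmers", "nn2"), ("in", "ii"), ("Wales", "np1"),
]

def find_sequences(tokens, query):
    """Return word sequences that satisfy the query position by position.

    `query` is a list of dicts, one per token position, mapping an attribute
    ("word" or "pos") to a regex that must match the whole attribute value.
    """
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        ok = all(
            re.fullmatch(regex, {"word": word, "pos": pos}[attr])
            for (word, pos), constraints in zip(window, query)
            for attr, regex in constraints.items()
        )
        if ok:
            hits.append(" ".join(word for word, _ in window))
    return hits

narrow = find_sequences(corpus, [{"word": ".+", "pos": "nn"},
                                 {"word": ".+", "pos": "io"}])
broad = find_sequences(corpus, [{"word": ".+", "pos": "n.+"},
                                {"word": ".+", "pos": "i.+"}])
print(narrow)
print(broad)
```

Even in this tiny example the broader tag patterns return more hits than the narrow ones, which is the effect that becomes a problem at corpus scale.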
Exporting and importing your results
To export your KWIC-hits, select Download in the top-right corner and specify your output options (e.g., the size of the search window or the speaker metadata). While you can immediately read the downloaded concordance.txt file into R, it’s advisable to first perform some manual clean-up in spreadsheet software (e.g., MS Excel). This ensures all columns and rows are complete and properly formatted.
To do this in Excel:
Navigate to File > Import > Text file.
Select your file and choose Delimited.
Click Next, select Tab as the delimiter, then click Next > Finish.
Make sure to save your file, ideally with the extension .xlsx.
From here, please refer to the unit Import/export data for further steps.
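The completeness check described above can also be scripted before opening the file in a spreadsheet. The sketch below uses Python's csv module on a small in-memory stand-in for concordance.txt; the column names and rows are hypothetical, since the actual layout depends on the export options chosen.

```python
import csv
import io

# Stand-in for the contents of concordance.txt (tab-delimited);
# the column layout and the rows are hypothetical.
raw = (
    "left\tnode\tright\n"
    "they will\tprovide\tsupport for\n"
    "the school\tprovides\tmeals\n"
    "a broken row\tprovided\n"
)

rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
header, records = rows[0], rows[1:]

# Collect the line numbers of rows whose cell count differs from the header.
incomplete = [
    line_no
    for line_no, row in enumerate(records, start=2)
    if len(row) != len(header)
]
print(incomplete)
```

For a real export, io.StringIO(raw) would be replaced by a file handle such as open("concordance.txt", newline="", encoding="utf-8"), assuming the file is UTF-8 encoded.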