library(quanteda) # Package for Natural Language Processing in R
library(lattice) # for dotplots
11 Concordancing
11.1 Suggested reading
In-depth introduction to concordancing with R:
Schweinberger (2024)
Natural Language Processing (NLP) with quanteda:
Benoit et al. (2018)
On corpus-linguistic theory:
Wulff and Baker (2020)
Lange and Leuckert (2020)
McEnery, Xiao, and Tono (2006)
11.2 Preparation
You can find the full R script associated with this unit here.
In order for R to be able to recognise the data, it is crucial to set up the working directory correctly.
Make sure your R-script and the corpus (e.g., ‘ICE-GB’) are stored in the same folder on your computer.
In RStudio, go to the Files pane (usually in the bottom-right corner) and navigate to the location of your script. Alternatively, you can click on the three dots ... and use the file browser instead. Once you're in the correct folder, click on the blue ⚙️ icon and select Set As Working Directory. This action will update your working directory to the folder where the file is located.
In addition, make sure you have installed quanteda. Load it at the beginning of your script:
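If quanteda is not installed yet, a minimal setup sketch looks like this (run the installation line once, then load the package at the start of every script):
# Install quanteda once (uncomment if necessary)
# install.packages("quanteda")
library(quanteda) # load the package for this session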
To load a corpus object into R, place it in your working directory and read it into your working environment with readRDS().1
1 The ICE-GB.RDS file you’ve been provided with has been pre-processed and saved in this specific format for practical reasons.
# Load corpus from directory
ICE_GB <- readRDS("ICE_GB.RDS")
If you encounter any error messages at this stage, ensure you followed steps 1 and 2 in the callout box above.
11.3 Concordancing
A core task in corpus-linguistic research involves finding occurrences of a single word or multi-word sequence in the corpus. Lange & Leuckert (2020: 55) explain that specialised software typically “provide[s] the surrounding context as well as the name of the file in which the word could be identified.” Inspecting the context is particularly important in comparative research, as it may be indicative of distinct usage patterns.
11.3.1 Simple queries
To obtain such a keyword-in-context (KWIC) display in R, we use the kwic() function. We supply the corpus as well as the keyword we’re interested in:
<- kwic(ICE_GB, "eat") query1
The output in query1 contains concordance lines that list all occurrences of the keyword, including the document, context to the left, the keyword itself, and the context to the right. The final column reiterates our search expression.
head(query1)
Keyword-in-context with 6 matches.
 [ICE_GB/S1A-006.txt, 785]           So I' d rather | eat | beforehand just to avoid uh
 [ICE_GB/S1A-009.txt, 1198]             I must <, > | eat | them < ICE-GB:S1A-009#71:
 [ICE_GB/S1A-010.txt, 958]         to <, > actually | eat | it for one' s
 [ICE_GB/S1A-018.txt, 455] order one first and then | eat | it and then sort of
 [ICE_GB/S1A-018.txt, 498]  A > The bargain hunting | eat | < ICE-GB:S1A-018#29: 1
 [ICE_GB/S1A-023.txt, 1853]      B > Oh name please | eat | something <,, >
For a full screen display of the KWIC data frame, try View():
View(query1)
docname | from | to | pre | keyword | post | pattern |
---|---|---|---|---|---|---|
ICE_GB/S1A-006.txt | 785 | 785 | So I ' d rather | eat | beforehand just to avoid uh | eat |
ICE_GB/S1A-009.txt | 1198 | 1198 | I must < , > | eat | them < ICE-GB:S1A-009 #71 : | eat |
ICE_GB/S1A-010.txt | 958 | 958 | to < , > actually | eat | it for one ' s | eat |
ICE_GB/S1A-018.txt | 455 | 455 | order one first and then | eat | it and then sort of | eat |
ICE_GB/S1A-018.txt | 498 | 498 | A > The bargain hunting | eat | < ICE-GB:S1A-018 #29 : 1 | eat |
ICE_GB/S1A-023.txt | 1853 | 1853 | B > Oh name please | eat | something < , , > | eat |
11.3.2 Multi-word queries
If the search expression exceeds a single word, we need to mark it as a multi-word sequence by means of the phrase() function. For instance, if we were interested in the pattern eat a, we’d have to adjust the code as follows:
query2 <- kwic(ICE_GB, phrase("eat a"))
View(query2)
docname | from | to | pre | keyword | post | pattern |
---|---|---|---|---|---|---|
ICE_GB/S1A-059.txt | 2230 | 2231 | 1 : B > I | eat a | < , > very balanced | eat a |
ICE_GB/W2B-014.txt | 1045 | 1046 | : 1 > We can't | eat a | lot of Welsh or Scottish | eat a |
ICE_GB/W2B-022.txt | 589 | 590 | have few labour-saving devices , | eat a | diet low in protein , | eat a |
11.3.3 Multiple simultaneous queries
A very powerful advantage of quanteda over traditional corpus software is that we can query a corpus for a multitude of keywords at the same time. Say we need our output to contain hits for eat, drink as well as sleep. Instead of a single keyword, we supply a character vector containing the strings of interest.
query3 <- kwic(ICE_GB, c("eat", "drink", "sleep"))
View(query3)
docname | from | to | pre | keyword | post | pattern |
---|---|---|---|---|---|---|
ICE_GB/S1A-006.txt | 785 | 785 | So I ' d rather | eat | beforehand just to avoid uh | eat |
ICE_GB/S1A-009.txt | 869 | 869 | : A > Do you | drink | quite a lot of it | drink |
ICE_GB/S1A-009.txt | 1198 | 1198 | I must < , > | eat | them < ICE-GB:S1A-009 #71 : | eat |
ICE_GB/S1A-010.txt | 958 | 958 | to < , > actually | eat | it for one ' s | eat |
ICE_GB/S1A-014.txt | 3262 | 3262 | you were advised not to | drink | water in Leningrad because they | drink |
ICE_GB/S1A-016.txt | 3290 | 3290 | > I couldn't I couldn't | sleep | if I didn't read < | sleep |
11.3.4 Window size
Some studies require more detailed examination of the preceding or following context of the keyword. We can easily adjust the window size (the number of tokens shown on either side of the keyword; the default is 5) to suit our needs:
<- kwic(ICE_GB, "eat", window = 20) query4
docname | from | to | pre | keyword | post | pattern |
---|---|---|---|---|---|---|
ICE_GB/S1A-006.txt | 785 | 785 | #49 : 1 : A > Yeah < ICE-GB:S1A-006 #50 : 1 : A > So I ' d rather | eat | beforehand just to avoid uh < , , > any problems there < ICE-GB:S1A-006 #51 : 1 : B > | eat |
ICE_GB/S1A-009.txt | 1198 | 1198 | < , > in in the summer < ICE-GB:S1A-009 #70 : 1 : A > I must < , > | eat | them < ICE-GB:S1A-009 #71 : 1 : A > Yes < ICE-GB:S1A-009 #72 : 1 : B > You ought | eat |
ICE_GB/S1A-010.txt | 958 | 958 | 1 : B > You know I mean it would seem to be squandering it to < , > actually | eat | it for one ' s own enjoyment < , , > < ICE-GB:S1A-010 #49 : 1 : A > Mm | eat |
ICE_GB/S1A-018.txt | 455 | 455 | s so < ICE-GB:S1A-018 #27 : 1 : A > What you should do is order one first and then | eat | it and then sort of carry on from there < laughter > < , > by which time you wouldn't | eat |
ICE_GB/S1A-018.txt | 498 | 498 | second anyway so < laugh > < , > < ICE-GB:S1A-018 #28 : 1 : A > The bargain hunting | eat | < ICE-GB:S1A-018 #29 : 1 : B > So all right what did I have < ICE-GB:S1A-018 #30 : 1 | eat |
ICE_GB/S1A-023.txt | 1853 | 1853 | > I can't bear it < , , > < ICE-GB:S1A-023 #121 : 1 : B > Oh name please | eat | something < , , > < ICE-GB:S1A-023 #122 : 1 : A > Oh actually Dad asked me if < | eat |
11.3.5 Saving your output
You can store your results in a spreadsheet file just as described in the unit on importing and exporting data.
- Microsoft Excel (.xlsx)
library(writexl) # required for writing files to MS Excel
write_xlsx(query1, "myresults1.xlsx")
- LibreOffice (.csv)
write.csv(query1, "myresults1.csv")
As soon as you have annotated your data, you can load .xlsx files back into R with read_xlsx() from the readxl package and .csv files using the Base R function read.csv().
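For instance, a minimal sketch for re-importing the files saved above (the object names annotated_xlsx and annotated_csv are just placeholders):
library(readxl) # provides read_xlsx()
annotated_xlsx <- read_xlsx("myresults1.xlsx") # re-import the Excel version
annotated_csv <- read.csv("myresults1.csv") # re-import the CSV version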
11.4 Characterising the output
Recall our initial query for eat, whose output we stored in query1:
docname | from | to | pre | keyword | post | pattern |
---|---|---|---|---|---|---|
ICE_GB/S1A-006.txt | 785 | 785 | So I ' d rather | eat | beforehand just to avoid uh | eat |
ICE_GB/S1A-009.txt | 1198 | 1198 | I must < , > | eat | them < ICE-GB:S1A-009 #71 : | eat |
ICE_GB/S1A-010.txt | 958 | 958 | to < , > actually | eat | it for one ' s | eat |
ICE_GB/S1A-018.txt | 455 | 455 | order one first and then | eat | it and then sort of | eat |
ICE_GB/S1A-018.txt | 498 | 498 | A > The bargain hunting | eat | < ICE-GB:S1A-018 #29 : 1 | eat |
ICE_GB/S1A-023.txt | 1853 | 1853 | B > Oh name please | eat | something < , , > | eat |
First, we may be interested in obtaining some general information on our results, such as …
- … how many tokens (= individual hits) does the query return?
The nrow() function counts the number of rows in a data frame; these always correspond to the number of observations in our sample (here: 53).
nrow(query1)
[1] 53
- … how many types (= distinct hits) does the query return?
Apparently, there are 52 counts of eat in lower case and 1 in upper case. Their sum corresponds to our 53 observations in total.
table(query1$keyword)
eat Eat
52 1
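If the case distinction is irrelevant for your research question, one option is to fold the keywords to lower case before counting, e.g. with base R's tolower() (a small sketch):
# Collapse upper- and lower-case variants into a single type
table(tolower(query1$keyword))
This should then report a single type eat with 53 tokens.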
- … how is the keyword distributed across corpus files?
This question relates to the notion of dispersion: Is a keyword spread relatively evenly across corpus files or does it only occur in specific ones?
# Frequency of keyword by docname
query_distrib <- table(query1$docname, query1$keyword)
# Show first few rows
head(query_distrib)
eat Eat
ICE_GB/S1A-006.txt 1 0
ICE_GB/S1A-009.txt 1 0
ICE_GB/S1A-010.txt 1 0
ICE_GB/S1A-018.txt 2 0
ICE_GB/S1A-023.txt 1 0
ICE_GB/S1A-025.txt 1 0
# Create a simple dot plot
dotplot(query_distrib, auto.key = list(columns = 2, title = "Tokens", cex.title = 1))
# Create a fancy plot (requires the tidyverse)
library(tidyverse)
ggplot(query1, aes(x = keyword)) +
geom_bar() +
facet_wrap(~docname)
It seems that eat occurs at least once in most text categories (both spoken and written), but is much more common in face-to-face conversations (S1A). This is not surprising: It is certainly more common to discuss food in a casual chat with friends than in an academic essay (unless, of course, its main subject matter is food). Dispersion measures can thus be viewed as indicators of contextual preferences associated with lexemes or more grammatical patterns.
The empirical study of dispersion has attracted a lot of attention in recent years (Gries 2020). A reason for this is the necessity of finding a dispersion measure that is minimally correlated with token frequency. One such measure is the Kullback-Leibler divergence \(KLD\), which comes from the field of information theory and is closely related to entropy.
Mathematically, \(KLD\) measures the difference between two probability distributions \(p\) and \(q\).
\[ KLD(p \parallel q) = \sum\limits_{x \in X} p(x) \log \frac{p(x)}{q(x)} \tag{11.1}\]
Let \(f\) denote the overall frequency of a keyword in the corpus, \(v\) its frequency in each corpus part, \(s\) the sizes of each corpus part (as fractions) and \(n\) the total number of corpus parts. We thus compare the posterior (= “actual”) distribution of keywords \(\frac{v_i}{f}\) for \(i = 1, ..., n\) with their prior distribution, which assumes all words are spread evenly across corpus parts (hence the division by \(s_i\)).
\[ KLD = \sum\limits_{i=1}^n \frac{v_i}{f} \times \log_2\left({\frac{v_i}{f} \times \frac{1}{s_i}}\right) \tag{11.2}\]
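As a sketch of Equation 11.2, the computation can also be wrapped in a small helper function (the step-by-step version used below does the same thing):
# v: keyword frequencies per corpus part, f: total keyword frequency,
# s: size proportions of the corpus parts in which the keyword occurs
kld <- function(v, f, s) {
  p <- v/f # posterior distribution of the keyword across parts
  sum(p * log2(p/s)) # deviation from the distribution expected from part sizes
}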
In R, let’s calculate the dispersion of the verbs eat, drink, and sleep from query3.
# Let's filter out the upper-case variants:
query3_reduced <- query3[query3$keyword %in% c("eat", "drink", "sleep"),]
table(query3_reduced$keyword)
drink eat sleep
48 52 41
# Extract text categories
query_registers <- separate_wider_delim(query3_reduced, cols = docname, delim = "-", names = c("Text_category", "File_number"))
# Get separate data frames for each verb
eat <- filter(query_registers, keyword == "eat")
drink <- filter(query_registers, keyword == "drink")
sleep <- filter(query_registers, keyword == "sleep")
## Get frequency distribution across text categories
v_eat <- table(eat$Text_category)
v_drink <- table(drink$Text_category)
v_sleep <- table(sleep$Text_category)
## Get total frequencies
f_eat <- nrow(eat)
f_drink <- nrow(drink)
f_sleep <- nrow(sleep)
# The next step is a little trickier. First we need to find out how many distinct corpus parts there are in the ICE corpus.
## Check ICE-corpus structure and convert to data frame
ICE_GB_str <- as.data.frame(summary(ICE_GB))
## Separate files from text categories
ICE_GB_texts <- separate_wider_delim(ICE_GB_str, cols = Var1, delim = "-", names = c("Text_category", "File"))
## Get number of distinct text categories
n <- length(unique(ICE_GB_texts$Text_category))
## Get proportions of distinct text categories (s)
s <- table(ICE_GB_texts$Text_category)/sum(table(ICE_GB_texts$Text_category))
## Unfortunately not all of these corpus parts are represented in our queries. We need to correct the proportions in s for the missing ones!
## Store unique ICE text categories
ICE_unique_texts <- unique(ICE_GB_texts$Text_category)
## Make sure only those text proportions are included where the keywords actually occur
s_eat <- s[match(names(v_eat), ICE_unique_texts)]
s_drink <- s[match(names(v_drink), ICE_unique_texts)]
s_sleep <- s[match(names(v_sleep), ICE_unique_texts)]
# Compute KLD for each verb
kld_eat <- sum(v_eat/f_eat * log2(v_eat/f_eat * 1/s_eat)); kld_eat
[1] 0.6747268
kld_drink <- sum(v_drink/f_drink * log2(v_drink/f_drink * 1/s_drink)); kld_drink
[1] 0.8463608
kld_sleep <- sum(v_sleep/f_sleep * log2(v_sleep/f_sleep * 1/s_sleep)); kld_sleep
[1] 0.7047421
# Plot
kld_df <- data.frame(kld_eat, kld_drink, kld_sleep)
barplot(as.numeric(kld_df), names.arg = names(kld_df), col = "steelblue",
xlab = "Variable", ylab = "KLD Value (= deviance from even distribution)", main = "Dispersion of 'eat', 'drink', and 'sleep'")
The plot indicates that drink is the most unevenly distributed verb of the three considered (high KLD \(\sim\) low dispersion), whereas eat appears to be slightly more evenly distributed across corpus files. The verb sleep occupies an intermediate position.
11.5 “I need a proper user interface”: Some alternatives
There is a wide variety of concordancing software available, both free and paid. Among the most popular options are AntConc (Anthony 2020) and SketchEngine (Kilgarriff et al. 2004). However, as Schweinberger (2024) notes, the exact processes these tools use to generate output are not always fully transparent, making them something of a “black box.” In contrast, programming languages like R or Python allow researchers to document each step of their analysis clearly, providing full transparency from start to finish.
The following apps attempt to reconcile the need for an intuitive user interface with transparent data handling. The full source code is documented in the respective GitHub repositories.
- QuantedaApp is an interface for the R package quanteda (Benoit et al. 2018).
- PyConc is an interface for the Python package spaCy (Honnibal and Montani 2017).