# Load libraries and corpus
library(quanteda)
library(writexl)
ICE_GB <- readRDS("ICE_GB.RDS")

# Perform query
kwic_eat <- kwic(ICE_GB,
                 phrase("\\b(eat(s|ing|en)?|ate)\\b"),
                 valuetype = "regex",
                 window = 15)
# Store results
write_xlsx(kwic_eat, "kwic_eat.xlsx")
13 Data annotation
13.1 Recommended reading
S. T. Gries (2013): Chapter 1.3.3
13.2 Sample study
13.2.1 Theoretical background
Let’s assume we are interested in the object realisation patterns of the verb eat in the British ICE component. A quick review of the literature tells us that …
… argument realisation may be related to the aspectual structure of a verbal action (cf. Goldberg 2001), but …
… there is a stronger focus on situation aspect (telicity/atelicity; i.e., logical endpoints of actions) than on grammatical aspect (i.e., perfective/progressive).
Since grammatical aspect is also concerned with the temporal construal of actions, the question arises whether it, too, can influence object realisation. To investigate the relationship between aspect and object realisation, we will perform an exemplary analysis of the verb lemma EAT.
13.2.2 Obtaining data
We load all necessary libraries to query the ICE-GB corpus and run a KWIC search using the regular expression \\b(eat(s|ing|en)?|ate)\\b, which finds all inflected forms of EAT (see the code block at the beginning of this chapter). We then store the results in the spreadsheet file kwic_eat.xlsx.
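As a quick sanity check (not part of the corpus workflow itself), you can test what the regular expression does and does not match on a few invented strings with base R's grepl():

# Test the pattern on a few candidate strings
test_forms <- c("eat", "eats", "eating", "eaten", "ate", "create", "beaten")
grepl("\\b(eat(s|ing|en)?|ate)\\b", test_forms)
# TRUE TRUE TRUE TRUE TRUE FALSE FALSE

The word boundaries \\b ensure that embedded sequences such as ate in create or eaten in beaten are not matched.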
When you open kwic_eat.xlsx in spreadsheet software, the file will contain 7 columns by default (docname, from, to, pre, keyword, post, pattern). Each row corresponds to one match of your search expression in the corpus, 113 in total here. This is your raw output.
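Before relying on the spreadsheet, it can be worth inspecting the raw output directly in R; since the kwic object behaves like a data frame, nrow() should confirm the number of matches:

# Confirm the number of matches (113 here) and preview the first hits
nrow(kwic_eat)
head(kwic_eat)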
13.3 Data annotation
Whenever you decide to work on your corpus results, it is good practice to duplicate your file and append the current date to the filename. Re-save it as, for instance, kwic_eat_09_09_2024.xlsx and open it again. This way you're performing basic version control, which will allow you to return to previous stages of your analysis with ease.
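If you prefer, the dated copy can also be created straight from R. This is merely one way to do it, stamping the current date onto the filename with Sys.Date():

# Write a dated copy of the results (dd_mm_yyyy)
write_xlsx(kwic_eat, paste0("kwic_eat_", format(Sys.Date(), "%d_%m_%Y"), ".xlsx"))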
In your spreadsheet software, you can now assign your variables of interest to the empty columns next to your output data. For our specific example, we will need one column that captures object realisation and one for the type of verb aspect. Let's simply call them object_realisation and aspect_verb.
Of course, you could also opt for a different column name, as long as it has no spaces or special characters (e.g., !?%#). You could name it obj_realisation or, even more plainly, object, but not direct object or object realisation with spaces. Otherwise you are bound to encounter a surge of cryptic error messages in your R console.
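If you are ever unsure whether a column name is syntactically valid, base R's make.names() shows how R would repair it:

# make.names() turns invalid names into valid ones (spaces become dots)
make.names(c("object_realisation", "direct object", "object realisation"))
# "object_realisation" "direct.object" "object.realisation"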
Now, you are ready to annotate your data! An easy coding scheme would involve classifying rows where eat occurs with an object as yes. Conversely, rows where the direct object is not realised syntactically are assigned the column value no. In the aspect_verb column, verbal aspect will be coded as either perfective, progressive or neutral, following S. Gries and Deshors (2014): 118.
13.3.1 Dealing with problematic cases
However, things are not always that clear-cut. What if you encounter a false positive, i.e., an erroneous hit, in your dataset? Further down in the spreadsheet, the keyword ate is actually part of the preceding word inappropriate:
|     | docname            | from | to  | pre | keyword | post | pattern |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 113 | ICE_GB/W2F-019.txt | 696 | 696 | : 1 > Too much colour on her face would be inappropri < l > | ate | , she feels , but she wears a light foundation . < ICE-GB:W2F-019 #28 : | \b(eat(s\|ing\|en)?\|ate)\b |
Short answer: Do not delete irrelevant rows or columns. Essentially, from the moment you've obtained your corpus output, you should resist the temptation to delete anything from it. Instead, adopt the practice of marking missing values or irrelevant rows with an NA in a separate column. In later analyses, these can be easily filtered out! This also minimises the risk of accidentally getting rid of data that could prove important at a later point in time.
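Anticipating the import step shown in the next section, here is one sketch of how such rows can be excluded, assuming the false positives were marked with NA in the object_realisation column. Note that read_xlsx() only treats cells as missing if you tell it which string to interpret as NA:

# Read the file, treating the string "NA" as a missing value
library(readxl)
library(tidyverse)
kwic_data <- read_xlsx("kwic_eat_09_09_2024.xlsx", na = "NA")
# Keep only rows that carry a genuine annotation
kwic_clean <- filter(kwic_data, !is.na(object_realisation))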
13.3.2 Getting the data back into R
Import the Excel file via
# Load library
library(readxl)
# Read file contents into the variable "kwic_data"
kwic_data <- read_xlsx("kwic_eat_09_09_2024.xlsx")
# Print the first six lines of "kwic_data"
print(head(kwic_data))
# A tibble: 6 × 9
docname from to pre keyword post pattern object_realisation aspect_verb
<chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 ICE_GB… 458 458 had … eaten anyw… "\\b(e… no perfective
2 ICE_GB… 478 478 : 1 … eating will… "\\b(e… no progressive
3 ICE_GB… 785 785 > Ye… eat befo… "\\b(e… no neutral
4 ICE_GB… 1198 1198 the … eat them… "\\b(e… yes neutral
5 ICE_GB… 4529 4529 > Ye… ate in t… "\\b(e… no neutral
6 ICE_GB… 958 958 know… eat it f… "\\b(e… yes neutral
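Before moving on, it is worth tabulating the annotation columns: any typo or stray value outside the coding scheme (yes/no; perfective/progressive/neutral) will show up immediately:

# Tabulate the annotation columns to catch coding errors and count NAs
table(kwic_data$object_realisation, useNA = "ifany")
table(kwic_data$aspect_verb, useNA = "ifany")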
13.3.3 Adding a case list
S. T. Gries (2013) recommends setting up the first column of the data frame such that it "numbers all n cases from 1 to n so that every row can be uniquely identified and so that you can always restore one particular ordering (e.g., the original one)" (S. T. Gries 2013: 26). This is very easy to do: we specify a numeric vector ranging from 1 to the total number of rows in the data frame.
# Create a new Case column (which, by default, is appended at the very end of the data frame)
kwic_data$Case <- 1:nrow(kwic_data)

# Move the Case column to the front of the data frame
library(tidyverse)
kwic_data <- relocate(kwic_data, Case)
# Print reordered data frame
print(head(kwic_data))
# A tibble: 6 × 10
Case docname from to pre keyword post pattern object_realisation
<int> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 1 ICE_GB/S1A-0… 458 458 had … eaten anyw… "\\b(e… no
2 2 ICE_GB/S1A-0… 478 478 : 1 … eating will… "\\b(e… no
3 3 ICE_GB/S1A-0… 785 785 > Ye… eat befo… "\\b(e… no
4 4 ICE_GB/S1A-0… 1198 1198 the … eat them… "\\b(e… yes
5 5 ICE_GB/S1A-0… 4529 4529 > Ye… ate in t… "\\b(e… no
6 6 ICE_GB/S1A-0… 958 958 know… eat it f… "\\b(e… yes
# ℹ 1 more variable: aspect_verb <chr>
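The Case column now makes it trivial to restore the original ordering at any point, e.g. after the rows have been sorted some other way (arrange() is part of the tidyverse loaded above):

# Restore the original row order via the Case column
kwic_data <- arrange(kwic_data, Case)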
13.4 Where do I go from here?
As soon as you’ve fully annotated your dataset and reviewed it for potential coding errors, the next step involves analysing your data statistically to uncover potential patterns that are not visible to the naked eye. These patterns can include (subtle to major) differences in frequency of occurrence, relationships with other linguistically relevant features (e.g., register/genre) or probabilities (e.g., the probability that an object is omitted), among many others. The following units offer an introduction to statistics, where “statistics” is best understood as “a collection of methods which help us to describe, summarize, interpret, and analyse data” (Heumann, Schomaker, and Shalabh 2022).
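For the sample study, a natural first glimpse of such patterns is a simple cross-tabulation of the two annotated variables:

# Cross-tabulate object realisation and verb aspect
table(kwic_data$object_realisation, kwic_data$aspect_verb)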