31 ICE: Extract register data
You can find the full R script associated with this unit here.
31.1 Preparation
This supplementary unit illustrates how to extract register information from kwic() queries performed on data from the International Corpus of English (cf. Concordancing). We begin by loading the relevant libraries:
library(tidyverse)
library(readxl)
Next we load some sample data to work with, which you can download here. If you have data of your own, you may skip this step.
# Load data
data_eat <- read_xlsx("eat_obj_aspect.xlsx")
The dataset follows the default kwic() structure, with the document name and the immediate context before and after the keyword. The columns object_realisation (the dependent variable) and verb_aspect (the independent variable) are the result of manual annotation in Microsoft Excel.
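If you are assembling such a dataset yourself, the round trip from corpus query to Excel annotation could look roughly as follows. This is a minimal sketch: it assumes the quanteda, readtext and writexl packages, a local folder ICE_GB/ containing the corpus files, and a spelled-out version of the (truncated) regex in the pattern column; none of these details come from this unit's script.
# Sketch: generate kwic() hits and export them for manual annotation
library(quanteda)
library(readtext)
library(writexl)

# Read the corpus files into a corpus object (folder path is an assumption)
ice_corpus <- corpus(readtext("ICE_GB/*.txt"))

# Regex query for forms of EAT (the exact pattern is an assumption)
eat_hits <- kwic(tokens(ice_corpus),
                 pattern = "\\b(eat|eats|ate|eaten|eating)\\b",
                 valuetype = "regex")

# Export to Excel; columns such as object_realisation are then added by hand
write_xlsx(as.data.frame(eat_hits), "eat_hits.xlsx")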
# First few lines
head(data_eat)
# A tibble: 6 × 9
docname from to pre keyword post pattern object_realisation verb_aspect
<chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 ICE_GB… 458 458 had … eaten anyw… "\\b(e… no perfective
2 ICE_GB… 478 478 : 1 … eating will… "\\b(e… no progressive
3 ICE_GB… 785 785 > Ye… eat befo… "\\b(e… no neutral
4 ICE_GB… 1198 1198 the … eat them… "\\b(e… yes neutral
5 ICE_GB… 4529 4529 > Ye… ate in t… "\\b(e… no neutral
6 ICE_GB… 958 958 know… eat it f… "\\b(e… yes neutral
31.2 Fine-grained: Text categories
After querying the ICE corpora, information on the texts (and, therefore, the register) is stored in the docname column.
head(data_eat[,"docname"])
# A tibble: 6 × 1
docname
<chr>
1 ICE_GB/S1A-006.txt
2 ICE_GB/S1A-006.txt
3 ICE_GB/S1A-006.txt
4 ICE_GB/S1A-009.txt
5 ICE_GB/S1A-009.txt
6 ICE_GB/S1A-010.txt
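Before splitting this column, a quick tabulation shows how the hits are distributed across the individual files. This is a small added check, not part of the original workflow:
# How many hits does each file contribute? (sketch)
head(sort(table(data_eat$docname), decreasing = TRUE))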
We want to split this column into two separate ones: one containing the text category labels (S1A, S1B, S2A, etc.) and the other the file numbers (001, 002, 003, etc.). The tidyverse function separate_wider_delim() offers a method to handle that. Let's call the new columns text_category and file_number, respectively.
If you apply the function to your own data, simply exchange the data = argument with your search results and leave the rest as is.
data_eat_reg <- separate_wider_delim(
  # Original data frame
  data = data_eat,
  # Which column to split
  cols = docname,
  # Where to split it (at the hyphen)
  delim = "-",
  # Names of the new columns
  names = c("text_category", "file_number")
)
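A side note: separate_wider_delim() throws an error if a value splits into more pieces than there are names. ICE-GB file names contain exactly one hyphen, so this cannot happen here, but for corpora with less predictable file names a defensive variant (a sketch) would merge any surplus pieces into the last column:
# Defensive variant (sketch): tolerate extra hyphens in file names
data_eat_reg <- separate_wider_delim(
  data = data_eat,
  cols = docname,
  delim = "-",
  names = c("text_category", "file_number"),
  too_many = "merge"
)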
The data frame now has the desired format, as the first two columns indicate:
head(data_eat_reg)
# A tibble: 6 × 10
text_category file_number from to pre keyword post pattern
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 ICE_GB/S1A 006.txt 458 458 had lunch < ICE-G… eaten anyw… "\\b(e…
2 ICE_GB/S1A 006.txt 478 478 : 1 : A > Well I … eating will… "\\b(e…
3 ICE_GB/S1A 006.txt 785 785 > Yeah < ICE-GB:S… eat befo… "\\b(e…
4 ICE_GB/S1A 009.txt 1198 1198 the summer < ICE-… eat them… "\\b(e…
5 ICE_GB/S1A 009.txt 4529 4529 > Yeah < , > < IC… ate in t… "\\b(e…
6 ICE_GB/S1A 010.txt 958 958 know I mean it wo… eat it f… "\\b(e…
# ℹ 2 more variables: object_realisation <chr>, verb_aspect <chr>
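If you prefer the bare labels (S1A rather than ICE_GB/S1A, 006 rather than 006.txt), a small stringr clean-up is possible. Note that the rest of this unit deliberately keeps the prefix, since the spoken/written classification below matches on it; the object name data_eat_clean is purely illustrative:
# Optional clean-up (sketch): strip corpus prefix and file extension
data_eat_clean <- data_eat_reg %>%
  mutate(
    text_category = str_remove(text_category, "^ICE_GB/"),
    file_number   = str_remove(file_number, "\\.txt$")
  )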
31.3 More general: Spoken vs. written
If there is no need for the fine-grained register distinctions illustrated above, we can group the texts into macro-categories instead, such as spoken vs. written data. The table below shows how the ICE file codes map onto these categories.
| Category | Subcategory | Type | Genre | Code Range |
|---|---|---|---|---|
| SPOKEN (S) | DIALOGUE (S1) | PRIVATE (S1A) | Direct Conversations | S1A-001 to S1A-090 |
| | | | Telephone Calls | S1A-091 to S1A-100 |
| | | PUBLIC (S1B) | Class Lessons | S1B-001 to S1B-020 |
| | | | Broadcast Discussions | S1B-021 to S1B-040 |
| | | | Broadcast Interviews | S1B-041 to S1B-050 |
| | | | Parliamentary Debates | S1B-051 to S1B-060 |
| | | | Legal Cross-examinations | S1B-061 to S1B-070 |
| | | | Business Transactions | S1B-071 to S1B-080 |
| | MONOLOGUE (S2) | UNSCRIPTED (S2A) | Spontaneous Commentaries | S2A-001 to S2A-020 |
| | | | Unscripted Speeches | S2A-021 to S2A-050 |
| | | | Demonstrations | S2A-051 to S2A-060 |
| | | | Legal Presentations | S2A-061 to S2A-070 |
| | | SCRIPTED (S2B) | Broadcast News | S2B-001 to S2B-020 |
| | | | Broadcast Talks | S2B-021 to S2B-040 |
| | | | Non-broadcast Talks | S2B-041 to S2B-050 |
| WRITTEN (W) | NON-PRINTED (W1) | NON-PROFESSIONAL WRITING (W1A) | Student Essays | W1A-001 to W1A-010 |
| | | | Examination Scripts | W1A-011 to W1A-020 |
| | | CORRESPONDENCE (W1B) | Social Letters | W1B-001 to W1B-015 |
| | | | Business Letters | W1B-016 to W1B-030 |
| | PRINTED (W2) | ACADEMIC WRITING (W2A) | Humanities | W2A-001 to W2A-010 |
| | | | Social Sciences | W2A-011 to W2A-020 |
| | | | Natural Sciences | W2A-021 to W2A-030 |
| | | | Technology | W2A-031 to W2A-040 |
| | | NON-ACADEMIC WRITING (W2B) | Humanities | W2B-001 to W2B-010 |
| | | | Social Sciences | W2B-011 to W2B-020 |
| | | | Natural Sciences | W2B-021 to W2B-030 |
| | | | Technology | W2B-031 to W2B-040 |
| | | REPORTAGE (W2C) | Press News Reports | W2C-001 to W2C-020 |
| | | INSTRUCTIONAL WRITING (W2D) | Administrative Writing | W2D-001 to W2D-010 |
| | | | Skills & Hobbies | W2D-011 to W2D-020 |
| | | PERSUASIVE WRITING (W2E) | Press Editorials | W2E-001 to W2E-010 |
| | | CREATIVE WRITING (W2F) | Novels & Stories | W2F-001 to W2F-020 |
What all spoken files have in common is that they begin with the upper-case letter "S", and written files with "W".
# Classify text files as either "spoken" or "written"
data_eat_reg %>%                                   # <1>
  mutate(medium = case_when(                       # <2>
    grepl("ICE_GB/S", text_category) ~ "spoken",   # <3>
    grepl("ICE_GB/W", text_category) ~ "written"   # <4>
  )) -> data_sw                                    # <5>
1. Take the data frame data_eat_reg and pass it on to the next function via the pipe operator %>%. Make sure library(tidyverse) is loaded.
2. Create a new column with mutate() and call it medium. The column values are assigned conditionally with case_when():
3. If text_category contains the string "ICE_GB/S", classify medium as "spoken".
4. If text_category contains the string "ICE_GB/W", classify medium as "written".
5. Store the new data frame in the variable data_sw.
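Because case_when() returns NA for rows that match neither condition, a quick sanity check (a small addition to the unit's script) confirms that every text was classified:
# Sanity check (sketch): every row should be "spoken" or "written"
data_sw %>% count(medium)

# Unclassified rows, if any, would surface as NAs
sum(is.na(data_sw$medium))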
31.4 What next? A few sample analyses
It is now possible to investigate associations between register (i.e., text_category or medium) and other variables more comfortably.
object_realisation and text_category
# Contingency table
obj_reg_freq <- xtabs(~ object_realisation + text_category, data_eat_reg)

# Percentage table
object_reg_prop <- prop.table(obj_reg_freq, margin = 2) * 100
print(object_reg_prop)
                  text_category
object_realisation ICE_GB/S1A ICE_GB/S1B ICE_GB/S2A ICE_GB/S2B ICE_GB/W1B
               no    61.22449   60.00000   50.00000   66.66667   90.00000
               yes   38.77551   40.00000   50.00000   33.33333   10.00000
                  text_category
object_realisation ICE_GB/W2B ICE_GB/W2C ICE_GB/W2D ICE_GB/W2E ICE_GB/W2F
               no     0.00000  100.00000   50.00000  100.00000   50.00000
               yes  100.00000    0.00000   50.00000    0.00000   50.00000
# Simple barplot
barplot(object_reg_prop, beside = TRUE, legend = TRUE, cex.names = 0.8)
# Statistical analysis
fisher.test(obj_reg_freq)
	Fisher's Exact Test for Count Data

data:  obj_reg_freq
p-value = 5.15e-05
alternative hypothesis: two.sided
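Fisher's exact test is a sensible choice here because several cells of the fine-grained table contain very few observations, which would make the chi-squared approximation unreliable. As a small addition, the same percentages can also be plotted with ggplot2 instead of the base-R barplot; a sketch:
# ggplot2 alternative to the barplot (sketch)
data_eat_reg %>%
  count(text_category, object_realisation) %>%
  group_by(text_category) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ggplot(aes(x = text_category, y = pct, fill = object_realisation)) +
  geom_col(position = "dodge") +
  labs(y = "Percent") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))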
object_realisation and medium
# Contingency table
obj_sw_freq <- xtabs(~ object_realisation + medium, data_sw)

# Percentage table
obj_sw_prop <- prop.table(obj_sw_freq, margin = 2) * 100
print(obj_sw_prop)
                  medium
object_realisation   spoken  written
               no  60.29412 47.05882
               yes 39.70588 52.94118
# Simple barplot
barplot(obj_sw_prop, beside = TRUE, legend = TRUE, cex.names = 0.8)
# Statistical analysis
chisq.test(obj_sw_freq)
	Pearson's Chi-squared test with Yates' continuity correction

data:  obj_sw_freq
X-squared = 1.1184, df = 1, p-value = 0.2903
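Beyond the p-value, an effect size indicates how strong the (here non-significant) association is. Cramér's V can be computed by hand from the uncorrected chi-squared statistic; for a 2 × 2 table it coincides with the phi coefficient. This is an added illustration, not part of the original unit:
# Effect size (sketch): Cramer's V from the uncorrected chi-squared statistic
chi_val <- chisq.test(obj_sw_freq, correct = FALSE)$statistic
n <- sum(obj_sw_freq)
unname(sqrt(chi_val / (n * (min(dim(obj_sw_freq)) - 1))))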