31 ICE: Extract register data
You can find the full R script associated with this unit here.
31.1 Preparation
This supplementary unit illustrates how to extract register information from kwic() queries performed on data from the International Corpus of English (cf. Concordancing). We begin by loading the relevant libraries:
library(tidyverse)
library(readxl)
Next we load some sample data to work with, which you can download here. If you have data of your own, you may skip this step.
# Load data
data_eat <- read_xlsx("eat_obj_aspect.xlsx")
The dataset follows the default kwic() structure, with the document name and the immediate context before and after the keyword. The columns object_realisation (the dependent variable) and verb_aspect (the independent variable) are the result of manual annotation in Microsoft Excel.
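If you are assembling such a dataset yourself, the round trip from corpus query to Excel annotation could look roughly as follows. This is a minimal sketch: it assumes the quanteda, readtext and writexl packages, a local folder ICE_GB/ containing the corpus files, and a spelled-out version of the (truncated) regex in the pattern column; none of these details come from this unit's script.
# Sketch: generate kwic() hits and export them for manual annotation
library(quanteda)
library(readtext)
library(writexl)

# Read the corpus files into a corpus object (folder path is an assumption)
ice_corpus <- corpus(readtext("ICE_GB/*.txt"))

# Regex query for forms of EAT (the exact pattern is an assumption)
eat_hits <- kwic(tokens(ice_corpus),
                 pattern = "\\b(eat|eats|ate|eaten|eating)\\b",
                 valuetype = "regex")

# Export to Excel; columns such as object_realisation are then added by hand
write_xlsx(as.data.frame(eat_hits), "eat_hits.xlsx")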
# First few lines
head(data_eat)
# A tibble: 6 × 9
docname from to pre keyword post pattern object_realisation verb_aspect
<chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 ICE_GB… 458 458 had … eaten anyw… "\\b(e… no perfective
2 ICE_GB… 478 478 : 1 … eating will… "\\b(e… no progressive
3 ICE_GB… 785 785 > Ye… eat befo… "\\b(e… no neutral
4 ICE_GB… 1198 1198 the … eat them… "\\b(e… yes neutral
5 ICE_GB… 4529 4529 > Ye… ate in t… "\\b(e… no neutral
6 ICE_GB… 958 958 know… eat it f… "\\b(e… yes neutral
31.2 Fine-grained: Text categories
After querying the ICE corpora, information on the texts (and, therefore, the register) is stored in the docname column.
head(data_eat[,"docname"])
# A tibble: 6 × 1
docname
<chr>
1 ICE_GB/S1A-006.txt
2 ICE_GB/S1A-006.txt
3 ICE_GB/S1A-006.txt
4 ICE_GB/S1A-009.txt
5 ICE_GB/S1A-009.txt
6 ICE_GB/S1A-010.txt
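Before splitting this column, a quick tabulation shows how the hits are distributed across the individual files. This is a small added check, not part of the original workflow:
# How many hits does each file contribute? (sketch)
head(sort(table(data_eat$docname), decreasing = TRUE))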
We want to split this column into two separate ones: one containing the text category labels (S1A, S1B, S2A, etc.) and the other the file numbers (001, 002, 003, etc.). The tidyverse function separate_wider_delim() offers a method to handle that. Let's call the new columns text_category and file_number, respectively.
If you apply the function to your own data, simply exchange the data = argument with your search results and leave the rest as is.
data_eat_reg <- separate_wider_delim(
  # Original data frame
  data = data_eat,
  # Which column to split
  cols = docname,
  # Where to split it (at the hyphen)
  delim = "-",
  # Names of the new columns
  names = c("text_category", "file_number")
)
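A side note: separate_wider_delim() throws an error if a value splits into more pieces than there are names. ICE-GB file names contain exactly one hyphen, so this cannot happen here, but for corpora with less predictable file names a defensive variant (a sketch) would merge any surplus pieces into the last column:
# Defensive variant (sketch): tolerate extra hyphens in file names
data_eat_reg <- separate_wider_delim(
  data = data_eat,
  cols = docname,
  delim = "-",
  names = c("text_category", "file_number"),
  too_many = "merge"
)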
The data frame now has the desired format, as the first two columns indicate:
head(data_eat_reg)
# A tibble: 6 × 10
text_category file_number from to pre keyword post pattern
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 ICE_GB/S1A 006.txt 458 458 had lunch < ICE-G… eaten anyw… "\\b(e…
2 ICE_GB/S1A 006.txt 478 478 : 1 : A > Well I … eating will… "\\b(e…
3 ICE_GB/S1A 006.txt 785 785 > Yeah < ICE-GB:S… eat befo… "\\b(e…
4 ICE_GB/S1A 009.txt 1198 1198 the summer < ICE-… eat them… "\\b(e…
5 ICE_GB/S1A 009.txt 4529 4529 > Yeah < , > < IC… ate in t… "\\b(e…
6 ICE_GB/S1A 010.txt 958 958 know I mean it wo… eat it f… "\\b(e…
# ℹ 2 more variables: object_realisation <chr>, verb_aspect <chr>
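If you prefer the bare labels (S1A rather than ICE_GB/S1A, 006 rather than 006.txt), a small stringr clean-up is possible. Note that the rest of this unit deliberately keeps the prefix, since the spoken/written classification below matches on it; the object name data_eat_clean is purely illustrative:
# Optional clean-up (sketch): strip corpus prefix and file extension
data_eat_clean <- data_eat_reg %>%
  mutate(
    text_category = str_remove(text_category, "^ICE_GB/"),
    file_number   = str_remove(file_number, "\\.txt$")
  )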
31.3 More general: Spoken vs. written
If there is no need for the fine-grained register distinctions illustrated above, we can group the texts into macro-categories instead, such as spoken vs. written data. The table below shows how the ICE file codes map onto these categories.
| Category | Subcategory | Type | Genre | Code Range |
|---|---|---|---|---|
| SPOKEN (S) | DIALOGUE (S1) | PRIVATE (S1A) | Direct Conversations | S1A-001 to S1A-090 |
| | | | Telephone Calls | S1A-091 to S1A-100 |
| | | PUBLIC (S1B) | Class Lessons | S1B-001 to S1B-020 |
| | | | Broadcast Discussions | S1B-021 to S1B-040 |
| | | | Broadcast Interviews | S1B-041 to S1B-050 |
| | | | Parliamentary Debates | S1B-051 to S1B-060 |
| | | | Legal Cross-examinations | S1B-061 to S1B-070 |
| | | | Business Transactions | S1B-071 to S1B-080 |
| | MONOLOGUE (S2) | UNSCRIPTED (S2A) | Spontaneous Commentaries | S2A-001 to S2A-020 |
| | | | Unscripted Speeches | S2A-021 to S2A-050 |
| | | | Demonstrations | S2A-051 to S2A-060 |
| | | | Legal Presentations | S2A-061 to S2A-070 |
| | | SCRIPTED (S2B) | Broadcast News | S2B-001 to S2B-020 |
| | | | Broadcast Talks | S2B-021 to S2B-040 |
| | | | Non-broadcast Talks | S2B-041 to S2B-050 |
| WRITTEN (W) | NON-PRINTED (W1) | NON-PROFESSIONAL WRITING (W1A) | Student Essays | W1A-001 to W1A-010 |
| | | | Examination Scripts | W1A-011 to W1A-020 |
| | | CORRESPONDENCE (W1B) | Social Letters | W1B-001 to W1B-015 |
| | | | Business Letters | W1B-016 to W1B-030 |
| | PRINTED (W2) | ACADEMIC WRITING (W2A) | Humanities | W2A-001 to W2A-010 |
| | | | Social Sciences | W2A-011 to W2A-020 |
| | | | Natural Sciences | W2A-021 to W2A-030 |
| | | | Technology | W2A-031 to W2A-040 |
| | | NON-ACADEMIC WRITING (W2B) | Humanities | W2B-001 to W2B-010 |
| | | | Social Sciences | W2B-011 to W2B-020 |
| | | | Natural Sciences | W2B-021 to W2B-030 |
| | | | Technology | W2B-031 to W2B-040 |
| | | REPORTAGE (W2C) | Press News Reports | W2C-001 to W2C-020 |
| | | INSTRUCTIONAL WRITING (W2D) | Administrative Writing | W2D-001 to W2D-010 |
| | | | Skills & Hobbies | W2D-011 to W2D-020 |
| | | PERSUASIVE WRITING (W2E) | Press Editorials | W2E-001 to W2E-010 |
| | | CREATIVE WRITING (W2F) | Novels & Stories | W2F-001 to W2F-020 |
What all spoken files have in common is that they begin with the upper-case letter "S", and written files with "W".
# Classify text files as either "spoken" or "written"
data_eat_reg %>%                                   # <1>
  mutate(medium = case_when(                       # <2>
    grepl("ICE_GB/S", text_category) ~ "spoken",   # <3>
    grepl("ICE_GB/W", text_category) ~ "written"   # <4>
  )) -> data_sw                                    # <5>
1. Take the data frame data_eat_reg and pass it on to the next function via the pipe operator %>%. Make sure library(tidyverse) is loaded.
2. Create a new column with mutate() and call it medium. The column values are assigned conditionally with case_when():
3. If text_category contains the string "ICE_GB/S", classify medium as "spoken".
4. If text_category contains the string "ICE_GB/W", classify medium as "written".
5. Store the new data frame in the variable data_sw.
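Because case_when() returns NA for rows that match neither condition, a quick sanity check (a small addition to the unit's script) confirms that every text was classified:
# Sanity check (sketch): every row should be "spoken" or "written"
data_sw %>% count(medium)

# Unclassified rows, if any, would surface as NAs
sum(is.na(data_sw$medium))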
31.4 What next? A few sample analyses
It is now possible to investigate associations between register (i.e., text_category or medium) and other variables more comfortably.
object_realisation and text_category
# Contingency table
obj_reg_freq <- xtabs(~ object_realisation + text_category, data_eat_reg)

# Percentage table
object_reg_prop <- prop.table(obj_reg_freq, margin = 2) * 100
print(object_reg_prop)
                  text_category
object_realisation ICE_GB/S1A ICE_GB/S1B ICE_GB/S2A ICE_GB/S2B ICE_GB/W1B
               no    61.22449   60.00000   50.00000   66.66667   90.00000
               yes   38.77551   40.00000   50.00000   33.33333   10.00000
                  text_category
object_realisation ICE_GB/W2B ICE_GB/W2C ICE_GB/W2D ICE_GB/W2E ICE_GB/W2F
               no     0.00000  100.00000   50.00000  100.00000   50.00000
               yes  100.00000    0.00000   50.00000    0.00000   50.00000
# Simple barplot
barplot(object_reg_prop, beside = TRUE, legend = TRUE, cex.names = 0.8)
# Statistical analysis
fisher.test(obj_reg_freq)
	Fisher's Exact Test for Count Data

data:  obj_reg_freq
p-value = 5.15e-05
alternative hypothesis: two.sided
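Fisher's exact test is a sensible choice here because several cells of the fine-grained table contain very few observations, which would make the chi-squared approximation unreliable. As a small addition, the same percentages can also be plotted with ggplot2 instead of the base-R barplot; a sketch:
# ggplot2 alternative to the barplot (sketch)
data_eat_reg %>%
  count(text_category, object_realisation) %>%
  group_by(text_category) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ggplot(aes(x = text_category, y = pct, fill = object_realisation)) +
  geom_col(position = "dodge") +
  labs(y = "Percent") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))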
object_realisation and medium
# Contingency table
obj_sw_freq <- xtabs(~ object_realisation + medium, data_sw)

# Percentage table
obj_sw_prop <- prop.table(obj_sw_freq, margin = 2) * 100
print(obj_sw_prop)
                  medium
object_realisation   spoken  written
               no  60.29412 47.05882
               yes 39.70588 52.94118
# Simple barplot
barplot(obj_sw_prop, beside = TRUE, legend = TRUE, cex.names = 0.8)
# Statistical analysis
chisq.test(obj_sw_freq)
	Pearson's Chi-squared test with Yates' continuity correction

data:  obj_sw_freq
X-squared = 1.1184, df = 1, p-value = 0.2903
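Beyond the p-value, an effect size indicates how strong the (here non-significant) association is. Cramér's V can be computed by hand from the uncorrected chi-squared statistic; for a 2 × 2 table it coincides with the phi coefficient. This is an added illustration, not part of the original unit:
# Effect size (sketch): Cramer's V from the uncorrected chi-squared statistic
chi_val <- chisq.test(obj_sw_freq, correct = FALSE)$statistic
n <- sum(obj_sw_freq)
unname(sqrt(chi_val / (n * (min(dim(obj_sw_freq)) - 1))))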