31  ICE: Extract register data

Author: Vladimir Buskin

Affiliation: Catholic University of Eichstätt-Ingolstadt

Script

You can find the full R script associated with this unit here.

31.1 Preparation

This supplementary unit illustrates how to extract register information from kwic() queries performed on data from the International Corpus of English (cf. Concordancing). We begin by loading the relevant libraries:

library(tidyverse)
library(readxl)

Next we load some sample data to work with, which you can download here. If you have data of your own, you may skip this step.

# Load data
data_eat <- read_xlsx("eat_obj_aspect.xlsx")

The dataset follows the default kwic() output structure, with the document name and the immediate context before and after the keyword. The columns object_realisation (the dependent variable) and verb_aspect (the independent variable) are the result of manual annotation in Microsoft Excel.

# First few lines
head(data_eat)
# A tibble: 6 × 9
  docname  from    to pre   keyword post  pattern object_realisation verb_aspect
  <chr>   <dbl> <dbl> <chr> <chr>   <chr> <chr>   <chr>              <chr>      
1 ICE_GB…   458   458 had … eaten   anyw… "\\b(e… no                 perfective 
2 ICE_GB…   478   478 : 1 … eating  will… "\\b(e… no                 progressive
3 ICE_GB…   785   785 > Ye… eat     befo… "\\b(e… no                 neutral    
4 ICE_GB…  1198  1198 the … eat     them… "\\b(e… yes                neutral    
5 ICE_GB…  4529  4529 > Ye… ate     in t… "\\b(e… no                 neutral    
6 ICE_GB…   958   958 know… eat     it f… "\\b(e… yes                neutral    

31.2 Fine-grained: Text categories

After a query of the ICE corpora, information on the texts (and, therefore, the register) is stored in the docname column.

head(data_eat[,"docname"])
# A tibble: 6 × 1
  docname           
  <chr>             
1 ICE_GB/S1A-006.txt
2 ICE_GB/S1A-006.txt
3 ICE_GB/S1A-006.txt
4 ICE_GB/S1A-009.txt
5 ICE_GB/S1A-009.txt
6 ICE_GB/S1A-010.txt

We want to split this column into two separate ones: one containing the text category labels (S1A, S1B, S2A etc.) and one containing the file numbers (001, 002, 003 etc.). The tidyverse function separate_wider_delim() offers a convenient way to handle this. Let’s call the new columns text_category and file_number, respectively.

If you apply the function to your own data, simply replace the data = argument with your search results and leave the rest as is.

data_eat_reg <- separate_wider_delim(
                                # Original data frame
                                data = data_eat,
                                # Which column to split
                                cols = docname,
                                # Where to split it (at the hyphen)
                                delim = "-",
                                # Names of the new columns
                                names = c("text_category", "file_number")
                                ) 
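As an aside, the same split can also be done in base R with sub(), should the tidyverse not be available. The mini data frame below is invented for illustration and simply mirrors the docname format shown above:

```r
# Toy data frame mimicking the kwic() docname column (invented for illustration)
docs <- data.frame(docname = c("ICE_GB/S1A-006.txt", "ICE_GB/W2C-003.txt"))

# Everything before the first hyphen -> text category
docs$text_category <- sub("-.*$", "", docs$docname)

# Everything after the first hyphen -> file number
docs$file_number <- sub("^[^-]*-", "", docs$docname)
```

The regular expressions anchor on the hyphen, just like delim = "-" does in separate_wider_delim(); the only difference is that the original docname column is kept.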

The data frame now has the desired format, as the first two columns indicate:

head(data_eat_reg)
# A tibble: 6 × 10
  text_category file_number  from    to pre                keyword post  pattern
  <chr>         <chr>       <dbl> <dbl> <chr>              <chr>   <chr> <chr>  
1 ICE_GB/S1A    006.txt       458   458 had lunch < ICE-G… eaten   anyw… "\\b(e…
2 ICE_GB/S1A    006.txt       478   478 : 1 : A > Well I … eating  will… "\\b(e…
3 ICE_GB/S1A    006.txt       785   785 > Yeah < ICE-GB:S… eat     befo… "\\b(e…
4 ICE_GB/S1A    009.txt      1198  1198 the summer < ICE-… eat     them… "\\b(e…
5 ICE_GB/S1A    009.txt      4529  4529 > Yeah < , > < IC… ate     in t… "\\b(e…
6 ICE_GB/S1A    010.txt       958   958 know I mean it wo… eat     it f… "\\b(e…
# ℹ 2 more variables: object_realisation <chr>, verb_aspect <chr>

31.3 More general: Spoken vs. written

If there is no need for the fine-grained register distinctions illustrated above, we can group the texts into macro-categories instead, such as spoken vs. written data. The ICE text typology is as follows:

SPOKEN (S)
    DIALOGUE (S1)
        PRIVATE (S1A)
            Direct Conversations         S1A-001 to S1A-090
            Telephone Calls              S1A-091 to S1A-100
        PUBLIC (S1B)
            Class Lessons                S1B-001 to S1B-020
            Broadcast Discussions        S1B-021 to S1B-040
            Broadcast Interviews         S1B-041 to S1B-050
            Parliamentary Debates        S1B-051 to S1B-060
            Legal Cross-examinations     S1B-061 to S1B-070
            Business Transactions        S1B-071 to S1B-080
    MONOLOGUE (S2)
        UNSCRIPTED (S2A)
            Spontaneous Commentaries     S2A-001 to S2A-020
            Unscripted Speeches          S2A-021 to S2A-050
            Demonstrations               S2A-051 to S2A-060
            Legal Presentations          S2A-061 to S2A-070
        SCRIPTED (S2B)
            Broadcast News               S2B-001 to S2B-020
            Broadcast Talks              S2B-021 to S2B-040
            Non-broadcast Talks          S2B-041 to S2B-050
WRITTEN (W)
    NON-PRINTED (W1)
        NON-PROFESSIONAL WRITING (W1A)
            Student Essays               W1A-001 to W1A-010
            Examination Scripts          W1A-011 to W1A-020
        CORRESPONDENCE (W1B)
            Social Letters               W1B-001 to W1B-015
            Business Letters             W1B-016 to W1B-030
    PRINTED (W2)
        ACADEMIC WRITING (W2A)
            Humanities                   W2A-001 to W2A-010
            Social Sciences              W2A-011 to W2A-020
            Natural Sciences             W2A-021 to W2A-030
            Technology                   W2A-031 to W2A-040
        NON-ACADEMIC WRITING (W2B)
            Humanities                   W2B-001 to W2B-010
            Social Sciences              W2B-011 to W2B-020
            Natural Sciences             W2B-021 to W2B-030
            Technology                   W2B-031 to W2B-040
        REPORTAGE (W2C)
            Press News Reports           W2C-001 to W2C-020
        INSTRUCTIONAL WRITING (W2D)
            Administrative Writing       W2D-001 to W2D-010
            Skills & Hobbies             W2D-011 to W2D-020
        PERSUASIVE WRITING (W2E)
            Press Editorials             W2E-001 to W2E-010
        CREATIVE WRITING (W2F)
            Novels & Stories             W2F-001 to W2F-020

What all spoken files have in common is that their names begin with the upper-case letter "S", while written files begin with "W".

# Classify text files as either "spoken" or "written"
data_eat_reg %>%
  mutate(medium = case_when(
    grepl("ICE_GB/S", text_category) ~ "spoken",
    grepl("ICE_GB/W", text_category) ~ "written"
  )) -> data_sw

1. Take the data frame data_eat_reg and pass it on to the next function via the pipe operator %>%. Make sure library(tidyverse) is loaded.

2. Create a new column with mutate() and call it medium. The column values are assigned conditionally with case_when():

3. If text_category contains the string "ICE_GB/S", classify medium as "spoken".

4. If text_category contains the string "ICE_GB/W", classify medium as "written".

5. Store the new data frame in the variable data_sw.
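A quick sanity check is to tabulate the new column; on the full data, table(data_sw$medium) does the job. The snippet below reproduces the idea on a small invented data frame so it can be run on its own:

```r
library(dplyr)

# Invented mini data frame standing in for data_eat_reg
toy <- data.frame(text_category = c("ICE_GB/S1A", "ICE_GB/S2B", "ICE_GB/W2C"))

# Same classification logic as above
toy_sw <- toy %>%
  mutate(medium = case_when(
    grepl("ICE_GB/S", text_category) ~ "spoken",
    grepl("ICE_GB/W", text_category) ~ "written"
  ))

# Count observations per medium
table(toy_sw$medium)
```

If any file name matched neither pattern, case_when() would return NA for it, which table() silently drops, so a row count comparison (nrow() vs. sum(table(...))) is a useful extra check.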

31.4 What next? A few sample analyses

It is now possible to investigate associations between register (i.e., text_category or medium) and other variables more conveniently.

  • object_realisation and text_category

    # Contingency table
    obj_reg_freq <- xtabs(~ object_realisation + text_category, data_eat_reg)
    
    # Percentage table
    object_reg_prop <- prop.table(obj_reg_freq, margin = 2) * 100
    
    print(object_reg_prop)
                      text_category
    object_realisation ICE_GB/S1A ICE_GB/S1B ICE_GB/S2A ICE_GB/S2B ICE_GB/W1B
                   no    61.22449   60.00000   50.00000   66.66667   90.00000
                   yes   38.77551   40.00000   50.00000   33.33333   10.00000
                      text_category
    object_realisation ICE_GB/W2B ICE_GB/W2C ICE_GB/W2D ICE_GB/W2E ICE_GB/W2F
                   no     0.00000  100.00000   50.00000  100.00000   50.00000
                   yes  100.00000    0.00000   50.00000    0.00000   50.00000
    # Simple barplot
    barplot(object_reg_prop, 
            beside = TRUE, 
            legend = TRUE,
            cex.names = 0.8)

    # Statistical analysis
    fisher.test(obj_reg_freq)
    
        Fisher's Exact Test for Count Data
    
    data:  obj_reg_freq
    p-value = 5.15e-05
    alternative hypothesis: two.sided
  • object_realisation and medium

    # Contingency table
    obj_sw_freq <- xtabs(~ object_realisation + medium, data_sw)
    
    # Percentage table
    obj_sw_prop <- prop.table(obj_sw_freq, margin = 2) * 100
    
    print(obj_sw_prop)
                      medium
    object_realisation   spoken  written
                   no  60.29412 47.05882
                   yes 39.70588 52.94118
    # Simple barplot
    barplot(obj_sw_prop, 
            beside = TRUE, 
            legend = TRUE,
            cex.names = 0.8)

    # Statistical analysis
    chisq.test(obj_sw_freq)
    
        Pearson's Chi-squared test with Yates' continuity correction
    
    data:  obj_sw_freq
    X-squared = 1.1184, df = 1, p-value = 0.2903
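Beyond significance testing, one might also report an effect size such as Cramér's V for the association between object realisation and medium. The counts in the sketch below are invented for illustration; in practice you would pass obj_sw_freq from above instead:

```r
# Invented 2x2 frequency table (rows: object realisation, cols: medium)
freq <- matrix(c(41, 27, 8, 9), nrow = 2,
               dimnames = list(object_realisation = c("no", "yes"),
                               medium = c("spoken", "written")))

# Cramér's V = sqrt(X^2 / (N * (min(rows, cols) - 1)))
chi <- chisq.test(freq, correct = FALSE)
cramers_v <- sqrt(unname(chi$statistic) / (sum(freq) * (min(dim(freq)) - 1)))
```

For a 2x2 table this reduces to the phi coefficient; values near 0 indicate a negligible association, values near 1 a very strong one.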