15  Categorical data

Author
Affiliation

Vladimir Buskin

Catholic University of Eichstätt-Ingolstadt

15.1 Preparation

Script

You can find the full R script associated with this unit here.

Please download the file Paquot_Larsson_2020_data.xlsx (Paquot and Larsson 2020)[1] and store it in the same folder as your currently active R script. Then run the code lines below:

[1] The original supplementary materials can be downloaded from the publisher’s website [Last accessed April 28, 2024].

    # Libraries
    library(readxl)
    library(tidyverse)
    # For publication-ready tables
    library(crosstable)
    library(flextable)
    
    # Load data from working directory
    cl.order <- read_xlsx("Paquot_Larsson_2020_data.xlsx")
    
    # Check the structure of the data frame
    head(cl.order)
    # A tibble: 6 × 8
       CASE ORDER SUBORDTYPE LEN_MC LEN_SC LENGTH_DIFF CONJ     MORETHAN2CL
      <dbl> <chr> <chr>       <dbl>  <dbl>       <dbl> <chr>    <chr>      
    1  4777 sc-mc temp            4     10          -6 als/when no         
    2  1698 mc-sc temp            7      6           1 als/when no         
    3   953 sc-mc temp           12      7           5 als/when yes        
    4  1681 mc-sc temp            6     15          -9 als/when no         
    5  4055 sc-mc temp            9      5           4 als/when yes        
    6   967 sc-mc temp            9      5           4 als/when yes        

    15.2 Describing categorical data

    A categorical variable is made up of two or more discrete values. An intuitive way to describe categorical data would be to count how often each category occurs in the sample. These counts are then typically summarised in frequency tables and accompanied by suitable graphs (e.g., barplots).

    15.2.1 Frequency tables (one variable)

    Assume we are interested in how often each clause ordering type ("mc-sc" vs. "sc-mc") is attested in our data. In R, we can obtain these frequencies from the ORDER column of the cl.order dataset. Since manual counting isn’t really an option, we will make use of the convenient functions table() and xtabs().

    The table() function takes a vector as input, here the character vector stored in the ORDER column. We use the notation cl.order$ORDER to extract the column ORDER from the cl.order data frame (cf. data frames). We store the result in the variable order_freq1 (you may choose a different name if you like) and display it with the print() function.

    # Count occurrences of ordering types ("mc-sc" and "sc-mc") in the data frame
    order_freq1 <- table(cl.order$ORDER) 
    
    # Print table
    print(order_freq1)
    
    mc-sc sc-mc 
      275   128 

    Alternatively, you could use xtabs() to achieve the same result. The syntax is a little different, but it returns a slightly more detailed table with explicit variable label(s).

    # Count occurrences of ordering types ("mc-sc" and "sc-mc")
    order_freq2 <- xtabs(~ ORDER, cl.order)
    
    # Print table
    print(order_freq2)
    ORDER
    mc-sc sc-mc 
      275   128 
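Both functions can be tried out without the Excel file. The following minimal sketch uses a small made-up vector to illustrate that table() and xtabs() return the same counts; the names toy_order and toy_df are hypothetical and not part of the dataset.

```r
# Hypothetical toy data (not taken from Paquot and Larsson 2020)
toy_order <- c("mc-sc", "sc-mc", "mc-sc", "mc-sc", "sc-mc")
toy_df <- data.frame(ORDER = toy_order)

# Count with table(): the input is a vector
toy_freq1 <- table(toy_order)

# Count with xtabs(): the input is a formula plus a data frame
toy_freq2 <- xtabs(~ ORDER, data = toy_df)

print(toy_freq1)
print(toy_freq2)

# The counts are identical; only the labelling differs
same_counts <- all(as.vector(toy_freq1) == as.vector(toy_freq2))
print(same_counts)
```

The formula interface of xtabs() refers to columns by name, which is why its output carries the variable label.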

    15.2.2 Frequency tables (\(\geq\) 2 variables)

    If we are interested in the relationship between multiple categorical variables, we can cross-tabulate the frequencies of their categories; the resulting table is also referred to as a contingency table. For example, what is the distribution of clause order depending on the type of subordinate clause?

    # Get frequencies of ordering types ("mc-sc" vs. "sc-mc") depending on the type of subordinate clause ("caus" vs. "temp")
    order_counts1 <- table(cl.order$ORDER, cl.order$SUBORDTYPE)
    
    # Print contingency table
    print(order_counts1)
           
            caus temp
      mc-sc  184   91
      sc-mc   15  113
    # Cross-tabulate ORDER and SUBORDTYPE
    order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
    
    # Print cross-table
    print(order_counts2)
           SUBORDTYPE
    ORDER   caus temp
      mc-sc  184   91
      sc-mc   15  113
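Row and column totals often make a contingency table easier to read. As a minimal sketch, the counts printed above can be entered by hand and passed to addmargins() from Base R. The object name counts_manual is an assumption made for illustration; with the data loaded, you could pass order_counts2 instead.

```r
# Re-create the contingency table from the counts printed above
counts_manual <- as.table(matrix(
  c(184, 15, 91, 113),   # filled column by column
  nrow = 2,
  dimnames = list(ORDER = c("mc-sc", "sc-mc"),
                  SUBORDTYPE = c("caus", "temp"))
))

# Add row and column sums (labelled "Sum")
order_totals <- addmargins(counts_manual)

print(order_totals)
```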

    15.2.3 Percentage tables

    There are several ways to compute percentages for your cross-tables, but by far the simplest is via the prop.table() function. By default, it returns proportions of the grand total (i.e., all cells sum to 1), so multiply the output by 100 to obtain percentages.

    # Convert to % using the prop.table() function
    pct1 <- prop.table(order_counts1) * 100
    
    # Print percentages
    print(pct1)
           
                 caus      temp
      mc-sc 45.657568 22.580645
      sc-mc  3.722084 28.039702
    # Convert to % using the prop.table() function
    pct2 <- prop.table(order_counts2) * 100
    
    # Print percentages
    print(pct2)
           SUBORDTYPE
    ORDER        caus      temp
      mc-sc 45.657568 22.580645
      sc-mc  3.722084 28.039702

    Notice how pct2 still carries the variable labels SUBORDTYPE and ORDER, which is very convenient.
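The percentages above are relative to the grand total, which is what prop.table() computes by default. Its margin argument changes the denominator: margin = 1 yields row percentages and margin = 2 column percentages. The sketch below re-enters the counts from the contingency table by hand so that it runs without the data file; counts_manual, row_pct and col_pct are hypothetical names.

```r
# Hand-entered counts from the contingency table (filled column by column)
counts_manual <- as.table(matrix(
  c(184, 15, 91, 113),
  nrow = 2,
  dimnames = list(ORDER = c("mc-sc", "sc-mc"),
                  SUBORDTYPE = c("caus", "temp"))
))

# Row percentages: each row (clause order) sums to 100
row_pct <- round(prop.table(counts_manual, margin = 1) * 100, 2)

# Column percentages: each column (clause type) sums to 100
col_pct <- round(prop.table(counts_manual, margin = 2) * 100, 2)

print(row_pct)
print(col_pct)
```

Which margin is appropriate depends on the question: row percentages show how each clause order is distributed across clause types, whereas column percentages show how each clause type is distributed across clause orders.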

    15.3 Plotting categorical data

    This section demonstrates both the built-in plotting functions of R (‘Base R’) and the more modern alternatives provided by the ggplot2 package, which is part of the tidyverse.

    A straightforward way to visualise a contingency table is the mosaic plot:

    # Works with raw counts and percentages
    # Using the output of xtabs() as input
    mosaicplot(order_counts2, color = TRUE)

    The workhorse of categorical data analysis is the barplot. Base R functions usually require a table object as input, whereas ggplot2 can operate on the raw dataset.

    15.3.1 One variable

    • Base R barplot with barplot(); requires the counts as computed by table() or xtabs()
    # Generate frequency table
    order_freq1 <- table(cl.order$ORDER)
    
    # Create barplot
    barplot(order_freq1)

    • Barplot with geom_bar() using the raw input data
    # Requirement: library(tidyverse)
    
    # Raw input data
    head(cl.order)
    # A tibble: 6 × 8
       CASE ORDER SUBORDTYPE LEN_MC LEN_SC LENGTH_DIFF CONJ     MORETHAN2CL
      <dbl> <chr> <chr>       <dbl>  <dbl>       <dbl> <chr>    <chr>      
    1  4777 sc-mc temp            4     10          -6 als/when no         
    2  1698 mc-sc temp            7      6           1 als/when no         
    3   953 sc-mc temp           12      7           5 als/when yes        
    4  1681 mc-sc temp            6     15          -9 als/when no         
    5  4055 sc-mc temp            9      5           4 als/when yes        
    6   967 sc-mc temp            9      5           4 als/when yes        
    # Create barplot
    ggplot(cl.order, aes(x = ORDER)) +
      geom_bar()

    15.3.2 Two variables

    Bivariate barplots can be obtained either by supplying a contingency table (Base R) or by mapping the second variable onto the fill aesthetic (ggplot2) using the raw data.

    # Generate cross-table with two variables
    order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
    
    # Create simple barplot
    barplot(order_counts2, 
            beside = TRUE,  # Make bars side-by-side
            legend = TRUE)  # Add a legend

    # Generate cross-table with two variables
    order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
    
    # Customise barplot with axis labels, colours and legend
    # Note: barplot() groups the bars by the table's columns (SUBORDTYPE)
    # and colours them by its rows (ORDER)
    barplot(order_counts2, 
            beside = TRUE,  # Make bars dodged (i.e., side by side)
            main = "Distribution of ORDER by SUBORDTYPE (Base R)", 
            xlab = "SUBORDTYPE", 
            ylab = "Frequency", 
            col = c("lightblue", "lightgreen"), # Customise colours
            legend = TRUE,  # Add a legend
            args.legend = list(title = "ORDER", x = "topright"))
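When barplot() receives a two-dimensional table, it forms one group of bars per column and one bar (and legend entry) per row. If you would rather have ORDER on the x-axis, as in the ggplot2 versions, one option is to transpose the table with t() first. The sketch below re-enters the counts by hand so that it runs on its own; counts_manual, counts_t and bar_mids are hypothetical names.

```r
# Hand-entered counts from the contingency table (filled column by column)
counts_manual <- as.table(matrix(
  c(184, 15, 91, 113),
  nrow = 2,
  dimnames = list(ORDER = c("mc-sc", "sc-mc"),
                  SUBORDTYPE = c("caus", "temp"))
))

# Transpose: rows become SUBORDTYPE, columns become ORDER
counts_t <- t(counts_manual)

# Bars are now grouped by ORDER and coloured by SUBORDTYPE
bar_mids <- barplot(counts_t,
                    beside = TRUE,
                    xlab = "ORDER",
                    ylab = "Frequency",
                    col = c("lightblue", "lightgreen"),
                    legend.text = TRUE,
                    args.legend = list(title = "SUBORDTYPE", x = "topright"))
```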

    # Requirement: library(tidyverse)
    
    # Create simple barplot with the ggplot() function
    ggplot(cl.order, aes(x = ORDER, fill = SUBORDTYPE)) +
      geom_bar(position = "dodge")

    # Requirement: library(tidyverse)
    
    # Fully customised ggplot2 object
    ggplot(cl.order, aes(x = ORDER, fill = SUBORDTYPE)) +
      geom_bar(position = "dodge") +
      labs(
        title = "Clause order by subordinate clause type",
        x = "Clause order",
        y = "Frequency",
        fill = "Type of subordinate clause"
      ) +
      theme_bw()

    Percentage tables can be plotted in Base R in very much the same way as the raw counts:

    # Create simple barplot with a percentage table as input
    barplot(pct1, 
            beside = TRUE,  # Make bars side-by-side
            legend = TRUE)  # Add a legend

    With ggplot2, a few tweaks are necessary. Because the ggplot() function works with data frames rather than cross-tables, we’ll have to coerce the percentage table into a data frame first:

    # Convert a percentage table to a data frame
    # My recommendation: Use the pct2 object, which was generated using xtabs() because it will keep the variable names
    pct2_df <- as.data.frame(pct2)
    
    print(pct2_df)
      ORDER SUBORDTYPE      Freq
    1 mc-sc       caus 45.657568
    2 sc-mc       caus  3.722084
    3 mc-sc       temp 22.580645
    4 sc-mc       temp 28.039702
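The column holding the percentages is called Freq by default, which can be misleading once the values are no longer raw counts. For tables, as.data.frame() accepts a responseName argument, so the column can be given a more telling name. A minimal sketch with hand-entered counts follows; counts_manual, pct_manual, pct_df and the column name Percent are assumptions made for illustration.

```r
# Hand-entered counts from the contingency table (filled column by column)
counts_manual <- as.table(matrix(
  c(184, 15, 91, 113),
  nrow = 2,
  dimnames = list(ORDER = c("mc-sc", "sc-mc"),
                  SUBORDTYPE = c("caus", "temp"))
))

# Percentages of the grand total
pct_manual <- prop.table(counts_manual) * 100

# Convert to a data frame and name the value column "Percent"
pct_df <- as.data.frame(pct_manual, responseName = "Percent")

print(pct_df)
```

If you rename the column in this way, remember to map y = Percent rather than y = Freq in the aes() call.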

    Now we can plot the percentages with geom_col(). This geom (= ‘geometric object’) allows us to manually specify what should be mapped onto the y-axis:

    # Requirement: library(tidyverse)
    
    # Create barplot with user-defined y-axis, which requires geom_col() rather than geom_bar()
    ggplot(pct2_df, aes(x = ORDER, y = Freq, fill = SUBORDTYPE)) +
      geom_col(position = "dodge") +
      labs(y = "Frequency (in %)")

    # Requirement: library(tidyverse)
    
    # Bubble plot
    ggplot(pct2_df, aes(x = ORDER, y = SUBORDTYPE, size = Freq)) +
      geom_point(color = "skyblue", alpha = 0.7) +
      scale_size_continuous(range = c(5, 20)) +  # Adjust bubble size range
      labs(title = "Bubble Plot of ORDER by SUBORDTYPE",
           x = "ORDER",
           y = "SUBORDTYPE",
           size = "Percentage") +
      theme_minimal()

    # Make sure to install this library prior to running the code below 
    library(ggalluvial)
    
    ggplot(pct2_df,
           aes(axis1 = ORDER, axis2 = SUBORDTYPE, y = Freq)) +
      geom_alluvium(aes(fill = ORDER)) +
      geom_stratum(fill = "gray") +
      geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
      labs(title = "Alluvial Plot of ORDER by SUBORDTYPE",
           x = "Categories", y = "Percentage") +
      theme_minimal()

    15.4 Exporting tables to MS Word

    The crosstable and flextable packages make it very easy to export elegant tables to MS Word.

    This is perhaps the most elegant solution. Generate a crosstable() object by supplying at the very least …

    • the original dataset (data = cl.order),
    • the dependent variable (cols = ORDER), and
    • the independent variable (by = SUBORDTYPE).

    You can further specify …

    • whether to include column totals, row totals or both (here: total = "both"), and
    • the number of decimal places for percentages (here: percent_digits = 2).
    # Required libraries:
    # library(crosstable)
    # library(flextable)
    
    # Create the cross table
    output1 <- crosstable(data = cl.order,
                          cols = ORDER, 
                          by = SUBORDTYPE, 
                          total = "both",
                          percent_digits = 2)
    
    # Convert the cross table to a flextable for display and export
    as_flextable(output1)

    label   variable   SUBORDTYPE                     Total
                       caus           temp
    ORDER   mc-sc      184 (66.91%)   91 (33.09%)     275 (68.24%)
            sc-mc      15 (11.72%)    113 (88.28%)    128 (31.76%)
            Total      199 (49.38%)   204 (50.62%)    403 (100.00%)
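Note that as_flextable() only creates a flextable object; it does not write a file by itself. One way to obtain a Word document is the save_as_docx() function from the flextable package. The sketch below is only an assumption about how you might organise this step: it builds a small table from hand-entered counts so that it runs without the data file, and the file name order_table.docx is hypothetical.

```r
library(flextable)

# Hand-entered counts from the contingency table above
counts_df <- data.frame(
  ORDER = c("mc-sc", "sc-mc"),
  caus  = c(184, 15),
  temp  = c(91, 113)
)

# Convert the data frame to a flextable
ft <- flextable(counts_df)

# Write the table to a Word file in the working directory
# (hypothetical file name)
out_path <- "order_table.docx"
save_as_docx(ft, path = out_path)
```

With the data loaded, the result of as_flextable(output1) can be passed to save_as_docx() in the same way.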

    It is also possible to use as_flextable() without pre-processing the data with crosstable(); supplying a table, preferably one created with xtabs(), is sufficient. The output is extremely informative, yet it is anything but reader-friendly.

    For this reason, I recommend relying on the less overwhelming crosstable() option above if a plain and easy result is desired. However, readers who would like to leverage the full capabilities of the flextable package and familiarise themselves with its abundant options for customisation can find the detailed documentation here.

    # Requires the following library:
    # library(flextable)
    
    # Create a table
    tab1 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
    
    # Directly convert a table to a flextable with as_flextable()
    output_1 <- as_flextable(tab1)
    
    # Print output
    print(output_1)

    15.5 Workflow exercises

    Solutions

    You can find the solutions to the exercises here.

    Exercise 15.1 Download the dataset objects.xlsx from https://osf.io/j2mnx. Load it into R and store it in a variable objects. Make sure to load all the necessary libraries.

    Exercise 15.2 Many rows from objects are irrelevant for the analysis. Exclude all rows from objects that are marked as containing passive clauses (see Clause_voice column). Store this reduced subset in a new variable objects_filtered.

    Exercise 15.3 Investigate the relationship between

    • Object_realisation and Register as well as

    • Object_realisation and Lemma

    by computing frequency tables and percentages based on objects_filtered. Plot your results and export your tables and figures to a Microsoft Word document.

    Exercise 15.4 Which verb has the highest \(\frac{\text{null}}{\text{null} + \text{overt}}\) ratio and which one has the lowest?