4.3 Categorical data
Preparation
You can find the full R script associated with this unit here.
Please download the file Paquot_Larsson_2020_data.xlsx (Paquot and Larsson 2020)1 and store it in the same folder as your currently active R script. Then run the code lines below:
# Libraries
library(readxl)
library(tidyverse)
# For publication-ready tables
library(crosstable)
library(flextable)
# Load data from working directory
cl.order <- read_xlsx("../datasets/Paquot_Larsson_2020_data.xlsx")
# Check the structure of the data frame
head(cl.order)
1 The original supplementary materials can be downloaded from the publisher’s website [Last accessed April 28, 2024].
Describing categorical data
A categorical variable is made up of two or more discrete values. An intuitive way to describe categorical data would be to count how often each category occurs in the sample. These counts are then typically summarised in frequency tables and accompanied by suitable graphs (e.g., barplots).
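As a minimal illustration (using a made-up toy vector rather than the dataset loaded above), such a variable is typically stored in R as a character or factor vector whose levels correspond to the categories:
# Toy categorical variable with two discrete values (illustrative only)
clause_order <- factor(c("mc-sc", "sc-mc", "mc-sc", "mc-sc", "sc-mc"))
levels(clause_order)  # the categories the variable can take
table(clause_order)   # how often each category occurs in the sample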
Frequency tables (one variable)
Assume we are interested in how often each clause ordering type ("mc-sc" vs. "sc-mc") is attested in our data. In R, we can obtain their frequencies by inspecting the ORDER column of the cl.order dataset. Since manual counting isn’t really an option, we will make use of the convenient functions table() and xtabs().
table()
This function requires a character vector. We use the notation cl.order$ORDER to subset the cl.order data frame according to the column ORDER (cf. data frames). We store the result in the variable order_freq1 (you may choose a different name if you like) and display the output by passing it to the print() function.
# Count occurrences of ordering types ("mc-sc" and "sc-mc") in the data frame
order_freq1 <- table(cl.order$ORDER)
# Print table
print(order_freq1)
mc-sc sc-mc
275 128
xtabs()
Alternatively, you could use xtabs() to achieve the same result. The syntax is a little different, but it returns a slightly more detailed table with explicit variable label(s).
# Count occurrences of ordering types ("mc-sc" and "sc-mc")
order_freq2 <- xtabs(~ ORDER, cl.order)
# Print table
print(order_freq2)
ORDER
mc-sc sc-mc
275 128
Frequency tables (\(\geq\) 2 variables)
If we are interested in the relationship between multiple categorical variables, we can cross-tabulate the frequencies of their categories. For example, what is the distribution of clause order depending on the type of subordinate clause? The output is also referred to as a contingency table.
table() way
# Get frequencies of ordering types ("mc-sc" vs. "sc-mc") depending on the type of subordinate clause ("caus" vs. "temp")
order_counts1 <- table(cl.order$ORDER, cl.order$SUBORDTYPE)
# Print contingency table
print(order_counts1)
caus temp
mc-sc 184 91
sc-mc 15 113
xtabs() way
# Cross-tabulate ORDER and SUBORDTYPE
order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
# Print cross-table
print(order_counts2)
SUBORDTYPE
ORDER caus temp
mc-sc 184 91
sc-mc 15 113
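As an optional extra that is not part of the walkthrough above, base R’s addmargins() appends row and column totals to such a contingency table, which can be handy for a quick overview:
# Add row and column sums to the contingency table (base R, stats package)
addmargins(order_counts2)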
Percentage tables
There are several ways to compute percentages for your cross-tables, but by far the simplest is via the prop.table() function. As it only provides proportions, you can multiply the output by 100 to obtain real percentages.
table() object
# Convert to % using the prop.table() function
pct1 <- prop.table(order_counts1) * 100
# Print percentages
print(pct1)
caus temp
mc-sc 45.657568 22.580645
sc-mc 3.722084 28.039702
xtabs() object
# Convert to % using the prop.table() function
pct2 <- prop.table(order_counts2) * 100
# Print percentages
print(pct2)
SUBORDTYPE
ORDER caus temp
mc-sc 45.657568 22.580645
sc-mc 3.722084 28.039702
Notice how pct2 still carries the variable labels SUBORDTYPE and ORDER, which is very convenient.
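One brief aside: by default, prop.table() divides each cell by the grand total. If you would rather have percentages within rows or within columns, base R’s margin argument (1 = rows, 2 = columns) takes care of this:
# Percentages within each row (each row sums to 100)
prop.table(order_counts2, margin = 1) * 100
# Percentages within each column (each column sums to 100)
prop.table(order_counts2, margin = 2) * 100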
Plotting categorical data
This section demonstrates both the in-built plotting functions of R (‘Base R’) as well as the more modern versions provided by the tidyverse package.
A straightforward way to visualise a contingency table is the mosaicplot:
# Works with raw counts and percentages
# Using the output of xtabs() as input
mosaicplot(order_counts2, color = TRUE)
The workhorse of categorical data analysis is the barplot. Base R functions usually require a table object as input, whereas ggplot2 can operate on the raw dataset.
One variable
- Base R barplot with barplot(); requires the counts as computed by table() or xtabs()
# Generate frequency table
order_freq1 <- table(cl.order$ORDER)
# Create barplot
barplot(order_freq1)
- Barplot with geom_bar() using the raw input data
# Requirement: library(tidyverse)
# Raw input data
head(cl.order)
# A tibble: 6 × 8
CASE ORDER SUBORDTYPE LEN_MC LEN_SC LENGTH_DIFF CONJ MORETHAN2CL
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 4777 sc-mc temp 4 10 -6 als/when no
2 1698 mc-sc temp 7 6 1 als/when no
3 953 sc-mc temp 12 7 5 als/when yes
4 1681 mc-sc temp 6 15 -9 als/when no
5 4055 sc-mc temp 9 5 4 als/when yes
6 967 sc-mc temp 9 5 4 als/when yes
# Create barplot
ggplot(cl.order, aes(x = ORDER)) +
geom_bar()
Two variables
Bivariate barplots can be obtained by either supplying a contingency table (Base R) or by mapping the second variable onto the fill aesthetic using the raw data.
# Generate cross-table with two variables
order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
# Create simple barplot
barplot(order_counts2,
beside = TRUE, # Make bars side-by-side
legend = TRUE) # Add a legend
# Generate cross-table with two variables
order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
# Customise barplot with axis labels, colours and legend
barplot(order_counts2,
beside = TRUE, # Make bars dodged (i.e., side by side)
main = "Distribution of ORDER by SUBORDTYPE (Base R)",
xlab = "ORDER",
ylab = "Frequency",
col = c("lightblue", "lightgreen"), # Customize colors
legend = TRUE, # Add a legend
args.legend = list(title = "SUBORDTYPE", x = "topright"))
# Requirement: library(tidyverse)
# Create simple barplot with the ggplot() function
ggplot(cl.order, aes(x = ORDER, fill = SUBORDTYPE)) +
geom_bar(position = "dodge")
# Requirement: library(tidyverse)
# Fully customised ggplot2 object
ggplot(cl.order, aes(x = ORDER, fill = SUBORDTYPE)) +
geom_bar(position = "dodge") +
labs(
title = "Clause order by subordinate clause type",
x = "Clause order",
y = "Frequency",
fill = "Type of subordinate clause"
) +
  theme_bw()
Percentages can be plotted in very much the same way as the raw counts:
# Create simple barplot with a percentage table as input
barplot(pct1,
beside = TRUE, # Make bars side-by-side
legend = TRUE) # Add a legend
Here, a few tweaks are necessary. Because the ggplot() function prefers to work with data frames rather than cross-tables, we’ll have to coerce our percentage table into one first:
# Convert a percentage table to a data frame
# My recommendation: Use the pct2 object, which was generated using xtabs() because it will keep the variable names
pct2_df <- as.data.frame(pct2)
print(pct2_df)
ORDER SUBORDTYPE Freq
1 mc-sc caus 45.657568
2 sc-mc caus 3.722084
3 mc-sc temp 22.580645
4 sc-mc temp 28.039702
Now we can plot the percentages with geom_col(). This geom (= ‘geometric object’) allows us to manually specify what should be mapped onto the y-axis:
# Requirement: library(tidyverse)
# Create barplot with user-defined y-axis, which requires geom_col() rather than geom_bar()
ggplot(pct2_df, aes(x = ORDER, y = Freq, fill = SUBORDTYPE)) +
geom_col(position = "dodge") +
labs(y = "Frequency (in %)")
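As an aside rather than part of the original workflow, ggplot2 can also normalise stacked bars directly with position = "fill", which shows the share of each subordinate clause type within every ORDER category without building pct2_df at all:
# Requirement: library(tidyverse)
# Stacked bars normalised to 1 within each ORDER category
ggplot(cl.order, aes(x = ORDER, fill = SUBORDTYPE)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +  # scales is installed alongside ggplot2
  labs(y = "Share of SUBORDTYPE within ORDER")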
# Requirement: library(tidyverse)
# Bubble plot: point size encodes the percentage for each combination of ORDER and SUBORDTYPE
ggplot(pct2_df, aes(x = ORDER, y = SUBORDTYPE, size = Freq)) +
geom_point(color = "skyblue", alpha = 0.7) +
scale_size_continuous(range = c(5, 20)) + # Adjust bubble size range
labs(title = "Bubble Plot of ORDER by SUBORDTYPE",
x = "ORDER",
y = "SUBORDTYPE",
size = "Percentage") +
theme_minimal()
# Make sure to install the ggalluvial package (install.packages("ggalluvial")) prior to running the code below
library(ggalluvial)
ggplot(pct2_df,
aes(axis1 = ORDER, axis2 = SUBORDTYPE, y = Freq)) +
geom_alluvium(aes(fill = ORDER)) +
geom_stratum(fill = "gray") +
geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
labs(title = "Alluvial Plot of ORDER by SUBORDTYPE",
x = "Categories", y = "Percentage") +
theme_minimal()
Exporting tables to MS Word
The crosstable and flextable packages make it very easy to export elegant tables to MS Word.
crosstable()
This is perhaps the most elegant solution. Generate a crosstable() object by supplying at the very least …
- the original dataset (data = cl.order),
- the dependent variable (cols = ORDER), and
- the independent variable (by = SUBORDTYPE).
You can further specify …
- whether to include column totals, row totals or both (here: total = "both"),
- the rounding scheme (here: percent_digits = 2),
- …
# Required libraries:
# library(crosstable)
# library(flextable)
# Create the cross table
output1 <- crosstable(data = cl.order,
                      cols = ORDER,
                      by = SUBORDTYPE,
                      total = "both",
                      percent_digits = 2)
# Convert the crosstable to a flextable
as_flextable(output1)
| label | variable | SUBORDTYPE: caus | SUBORDTYPE: temp | Total |
|---|---|---|---|---|
| ORDER | mc-sc | 184 (66.91%) | 91 (33.09%) | 275 (68.24%) |
| | sc-mc | 15 (11.72%) | 113 (88.28%) | 128 (31.76%) |
| | Total | 199 (49.38%) | 204 (50.62%) | 403 (100.00%) |
It is also possible to use as_flextable() without pre-processing the data with crosstable(); supplying a table, preferably created with xtabs(), is sufficient. Without any doubt, the output is extremely informative, yet it is anything but reader-friendly. For this reason, I recommend relying on the less overwhelming crosstable() option above if a plain and easy result is desired. However, readers who would like to leverage the full capabilities of the flextable package and familiarise themselves with the abundant options for customisation can find the detailed documentation here.
# Requires the following library:
# library(flextable)
# Create a table
tab1 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
# Directly convert a table to a flextable with as_flextable()
output_1 <- as_flextable(tab1)
# Print output
print(output_1)
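Note that the chunks above only create flextable objects within your R session; to obtain an actual Word file you still need an export step. A minimal sketch, assuming you want to save the crosstable-based table from above and using an example file name, relies on flextable’s save_as_docx():
# Requirement: library(flextable)
# Write the crosstable-based flextable to a Word document in the working directory
# (the file name "categorical_tables.docx" is just an example)
save_as_docx(as_flextable(output1), path = "categorical_tables.docx")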
Workflow exercises
You can find the solutions to the exercises here.
Exercise 1 Download the dataset objects.xlsx from https://osf.io/j2mnx. Load it into R and store it in a variable objects. Make sure to load all the necessary libraries.
Exercise 2 Many rows of objects are irrelevant for the analysis. Exclude all rows from objects that are marked as containing passive clauses (see the Clause_voice column). Store this reduced subset in a new variable objects_filtered.
Exercise 3 Investigate the relationship between Object_realisation and Register as well as Object_realisation and Lemma by computing frequency tables and percentages based on objects_filtered. Plot your results and export your tables and figures to a Microsoft Word document.
Exercise 4 Which verb has the highest \(\frac{\text{null}}{\text{null} + \text{overt}}\) ratio and which one has the lowest?