4.3 Categorical data
Preparation
You can find the full R script associated with this unit here.
Please download the file Paquot_Larsson_2020_data.xlsx (Paquot and Larsson 2020)1 and store it in the same folder as your currently active R script. Then run the code lines below:
# Libraries
library(readxl)
library(tidyverse)
# For publication-ready tables
library(crosstable)
library(flextable)
# Load data from working directory
cl.order <- read_xlsx("../datasets/Paquot_Larsson_2020_data.xlsx")
# Check the structure of the data frame
head(cl.order)
1 The original supplementary materials can be downloaded from the publisher’s website [Last accessed April 28, 2024].
Describing categorical data
A categorical variable is made up of two or more discrete values. An intuitive way to describe categorical data would be to count how often each category occurs in the sample. These counts are then typically summarised in frequency tables and accompanied by suitable graphs (e.g., barplots).
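As a minimal illustration (using a made-up toy vector rather than the dataset loaded above), such a variable is typically stored in R as a character or factor vector whose levels correspond to the categories:
# Toy categorical variable with two discrete values (illustrative only)
clause_order <- factor(c("mc-sc", "sc-mc", "mc-sc", "mc-sc", "sc-mc"))
levels(clause_order)  # the categories the variable can take
table(clause_order)   # how often each category occurs in the sample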
Frequency tables (one variable)
Assume we are interested in how often each clause ordering type ("mc-sc" vs. "sc-mc") is attested in our data. In R, we can obtain their frequencies by inspecting the ORDER column of the cl.order dataset. Since manual counting isn’t really an option, we will make use of the convenient functions table() and xtabs().
table()
This function requires a character vector. We use the notation cl.order$ORDER to subset the cl.order data frame according to the column ORDER (cf. data frames). We store the result in the variable order_freq1 (you may choose a different name if you like) and display the output by passing it to the print() function.
# Count occurrences of ordering types ("mc-sc" and "sc-mc") in the data frame
order_freq1 <- table(cl.order$ORDER)
# Print table
print(order_freq1)
mc-sc sc-mc
275 128
xtabs()
Alternatively, you could use xtabs() to achieve the same result. The syntax is a little different, but it returns a slightly more detailed table with explicit variable label(s).
# Count occurrences of ordering types ("mc-sc" and "sc-mc")
order_freq2 <- xtabs(~ ORDER, cl.order)
# Print table
print(order_freq2)
ORDER
mc-sc sc-mc
275 128
Frequency tables (\(\geq\) 2 variables)
If we are interested in the relationship between multiple categorical variables, we can cross-tabulate the frequencies of their categories. For example, what is the distribution of clause order depending on the type of subordinate clause? The output is also referred to as a contingency table.
table() way
# Get frequencies of ordering types ("mc-sc" vs. "sc-mc") depending on the type of subordinate clause ("caus" vs. "temp")
order_counts1 <- table(cl.order$ORDER, cl.order$SUBORDTYPE)
# Print contingency table
print(order_counts1)
caus temp
mc-sc 184 91
sc-mc 15 113
xtabs() way
# Cross-tabulate ORDER and SUBORDTYPE
order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
# Print cross-table
print(order_counts2)
SUBORDTYPE
ORDER caus temp
mc-sc 184 91
sc-mc 15 113
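As an optional extra that is not part of the walkthrough above, base R’s addmargins() appends row and column totals to such a contingency table, which can be handy for a quick overview:
# Add row and column sums to the contingency table (base R, stats package)
addmargins(order_counts2)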
Percentage tables
There are several ways to compute percentages for your cross-tables, but by far the simplest is via the prop.table() function. As it only provides proportions, you can multiply the output by 100 to obtain real percentages.
table() object
# Convert to % using the prop.table() function
pct1 <- prop.table(order_counts1) * 100
# Print percentages
print(pct1)
caus temp
mc-sc 45.657568 22.580645
sc-mc 3.722084 28.039702
xtabs() object
# Convert to % using the prop.table() function
pct2 <- prop.table(order_counts2) * 100
# Print percentages
print(pct2)
SUBORDTYPE
ORDER caus temp
mc-sc 45.657568 22.580645
sc-mc 3.722084 28.039702
Notice how pct2 still carries the variable labels SUBORDTYPE and ORDER, which is very convenient.
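One brief aside: by default, prop.table() divides each cell by the grand total. If you would rather have percentages within rows or within columns, base R’s margin argument (1 = rows, 2 = columns) takes care of this:
# Percentages within each row (each row sums to 100)
prop.table(order_counts2, margin = 1) * 100
# Percentages within each column (each column sums to 100)
prop.table(order_counts2, margin = 2) * 100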
Plotting categorical data
This section demonstrates both the in-built plotting functions of R (‘Base R’) as well as the more modern versions provided by the tidyverse package.
A straightforward way to visualise a contingency table is the mosaicplot:
# Works with raw counts and percentages
# Using the output of xtabs() as input
mosaicplot(order_counts2, color = TRUE)
The workhorse of categorical data analysis is the barplot. Base R functions usually require a table object as input, whereas ggplot2 can operate on the raw dataset.
One variable
- Base R barplot with barplot(); requires the counts as computed by table() or xtabs()
# Generate frequency table
order_freq1 <- table(cl.order$ORDER)
# Create barplot
barplot(order_freq1)
- Barplot with geom_bar() using the raw input data
# Requirement: library(tidyverse)
# Raw input data
head(cl.order)
# A tibble: 6 × 8
CASE ORDER SUBORDTYPE LEN_MC LEN_SC LENGTH_DIFF CONJ MORETHAN2CL
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 4777 sc-mc temp 4 10 -6 als/when no
2 1698 mc-sc temp 7 6 1 als/when no
3 953 sc-mc temp 12 7 5 als/when yes
4 1681 mc-sc temp 6 15 -9 als/when no
5 4055 sc-mc temp 9 5 4 als/when yes
6 967 sc-mc temp 9 5 4 als/when yes
# Create barplot
ggplot(cl.order, aes(x = ORDER)) +
geom_bar()
Two variables
Bivariate barplots can be obtained by either supplying a contingency table (Base R) or by mapping the second variable onto the fill aesthetic using the raw data.
# Generate cross-table with two variables
order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
# Create simple barplot
barplot(order_counts2,
beside = TRUE, # Make bars side-by-side
legend = TRUE) # Add a legend
# Generate cross-table with two variables
order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
# Customise barplot with axis labels, colours and legend
barplot(order_counts2,
beside = TRUE, # Make bars dodged (i.e., side by side)
main = "Distribution of ORDER by SUBORDTYPE (Base R)",
xlab = "ORDER",
ylab = "Frequency",
col = c("lightblue", "lightgreen"), # Customize colors
legend = TRUE, # Add a legend
args.legend = list(title = "SUBORDTYPE", x = "topright"))
# Requirement: library(tidyverse)
# Create simple barplot with the ggplot() function
ggplot(cl.order, aes(x = ORDER, fill = SUBORDTYPE)) +
geom_bar(position = "dodge")
# Requirement: library(tidyverse)
# Fully customised ggplot2 object
ggplot(cl.order, aes(x = ORDER, fill = SUBORDTYPE)) +
geom_bar(position = "dodge") +
labs(
title = "Clause order by subordinate clause type",
x = "Clause order",
y = "Frequency",
fill = "Type of subordinate clause"
) +
  theme_bw()
Percentages can be plotted in very much the same way as the raw counts:
# Create simple barplot with a percentage table as input
barplot(pct1,
beside = TRUE, # Make bars side-by-side
legend = TRUE) # Add a legend
Here, a few tweaks are necessary. Because the ggplot() function prefers to work with data frames rather than cross-tables, we’ll have to coerce our percentage table into one first:
# Convert a percentage table to a data frame
# My recommendation: Use the pct2 object, which was generated using xtabs() because it will keep the variable names
pct2_df <- as.data.frame(pct2)
print(pct2_df)
ORDER SUBORDTYPE Freq
1 mc-sc caus 45.657568
2 sc-mc caus 3.722084
3 mc-sc temp 22.580645
4 sc-mc temp 28.039702
Now we can plot the percentages with geom_col(). This geom (= ‘geometric object’) allows us to manually specify what should be mapped onto the y-axis:
# Requirement: library(tidyverse)
# Create barplot with user-defined y-axis, which requires geom_col() rather than geom_bar()
ggplot(pct2_df, aes(x = ORDER, y = Freq, fill = SUBORDTYPE)) +
geom_col(position = "dodge") +
labs(y = "Frequency (in %)")
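As an aside rather than part of the original workflow, ggplot2 can also normalise stacked bars directly with position = "fill", which shows the share of each subordinate clause type within every ORDER category without building pct2_df at all:
# Requirement: library(tidyverse)
# Stacked bars normalised to 1 within each ORDER category
ggplot(cl.order, aes(x = ORDER, fill = SUBORDTYPE)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +  # scales is installed alongside ggplot2
  labs(y = "Share of SUBORDTYPE within ORDER")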
# Requirement: library(tidyverse)
# Bubble plot: point size encodes the percentage for each combination of ORDER and SUBORDTYPE
ggplot(pct2_df, aes(x = ORDER, y = SUBORDTYPE, size = Freq)) +
geom_point(color = "skyblue", alpha = 0.7) +
scale_size_continuous(range = c(5, 20)) + # Adjust bubble size range
labs(title = "Bubble Plot of ORDER by SUBORDTYPE",
x = "ORDER",
y = "SUBORDTYPE",
size = "Percentage") +
theme_minimal()
# Make sure to install the ggalluvial package (install.packages("ggalluvial")) prior to running the code below
library(ggalluvial)
ggplot(pct2_df,
aes(axis1 = ORDER, axis2 = SUBORDTYPE, y = Freq)) +
geom_alluvium(aes(fill = ORDER)) +
geom_stratum(fill = "gray") +
geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
labs(title = "Alluvial Plot of ORDER by SUBORDTYPE",
x = "Categories", y = "Percentage") +
theme_minimal()
Exporting tables to MS Word
The crosstable and flextable packages make it very easy to export elegant tables to MS Word.
crosstable()
This is perhaps the most elegant solution. Generate a crosstable() object by supplying at the very least …
- the original dataset (data = cl.order),
- the dependent variable (cols = ORDER), and
- the independent variable (by = SUBORDTYPE).
You can further specify …
- whether to include column totals, row totals or both (here: total = "both"),
- the rounding scheme (here: percent_digits = 2),
- …
# Required libraries:
# library(crosstable)
# library(flextable)
# Create the cross table
output1 <- crosstable(data = cl.order,
                      cols = ORDER,
                      by = SUBORDTYPE,
                      total = "both",
                      percent_digits = 2)
# Convert the crosstable to a flextable
as_flextable(output1)
| label | variable | SUBORDTYPE: caus | SUBORDTYPE: temp | Total |
|---|---|---|---|---|
| ORDER | mc-sc | 184 (66.91%) | 91 (33.09%) | 275 (68.24%) |
| | sc-mc | 15 (11.72%) | 113 (88.28%) | 128 (31.76%) |
| | Total | 199 (49.38%) | 204 (50.62%) | 403 (100.00%) |
It is also possible to use as_flextable() without pre-processing the data with crosstable(); supplying a table, preferably created with xtabs(), is sufficient. Without any doubt, the output is extremely informative, yet it is anything but reader-friendly. For this reason, I recommend relying on the less overwhelming crosstable() option above if a plain and easy result is desired. However, readers who would like to leverage the full capabilities of the flextable package and familiarise themselves with the abundant options for customisation can find the detailed documentation here.
# Requires the following library:
# library(flextable)
# Create a table
tab1 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)
# Directly convert a table to a flextable with as_flextable()
output_1 <- as_flextable(tab1)
# Print output
print(output_1)
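Note that the chunks above only create flextable objects within your R session; to obtain an actual Word file you still need an export step. A minimal sketch, assuming you want to save the crosstable-based table from above and using an example file name, relies on flextable’s save_as_docx():
# Requirement: library(flextable)
# Write the crosstable-based flextable to a Word document in the working directory
# (the file name "categorical_tables.docx" is just an example)
save_as_docx(as_flextable(output1), path = "categorical_tables.docx")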
Workflow exercises
You can find the solutions to the exercises here.
Exercise 1 Download the dataset objects.xlsx from https://osf.io/j2mnx. Load it into R and store it in a variable objects. Make sure to load all the necessary libraries.
Exercise 2 Many rows of objects are irrelevant for the analysis. Exclude all rows from objects that are marked as containing passive clauses (see the Clause_voice column). Store this reduced subset in a new variable objects_filtered.
Exercise 3 Investigate the relationship between Object_realisation and Register as well as Object_realisation and Lemma by computing frequency tables and percentages based on objects_filtered. Plot your results and export your tables and figures to a Microsoft Word document.
Exercise 4 Which verb has the highest \(\frac{\text{null}}{\text{null} + \text{overt}}\) ratio and which one has the lowest?