You can find the full R script associated with this unit here.
Please download the file Paquot_Larsson_2020_data.xlsx (Paquot and Larsson 2020) [1] and store it in the same folder as your currently active R script. Then run the code lines below:
[1] The original supplementary materials can be downloaded from the publisher’s website [Last accessed April 28, 2024].
# Libraries
library(readxl)
library(tidyverse)

# For publication-ready tables
library(crosstable)
library(flextable)

# Load data from working directory
cl.order <- read_xlsx("Paquot_Larsson_2020_data.xlsx")

# Check the structure of the data frame
head(cl.order)
A categorical variable is made up of two or more discrete values. An intuitive way to describe categorical data would be to count how often each category occurs in the sample. These counts are then typically summarised in frequency tables and accompanied by suitable graphs (e.g., barplots).
15.2.1 Frequency tables (one variable)
Assume we are interested in how often each clause ordering type ("mc-sc" vs. "sc-mc") is attested in our data. In R, we can obtain their frequencies by inspecting the ORDER column of the cl.order dataset. Since manual counting isn’t really an option, we will make use of the convenient functions table() and xtabs().
The workhorse: table()
This function takes one or more vectors whose values can be treated as categories; here, we supply a character vector. We use the notation cl.order$ORDER to extract the column ORDER from the cl.order data frame (cf. data frames). We store the result in the variable order_freq1 (you may choose a different name if you like) and display the output by applying the print() function to it.
# Count occurrences of ordering types ("mc-sc" and "sc-mc") in the data frame
order_freq1 <- table(cl.order$ORDER)

# Print table
print(order_freq1)
mc-sc sc-mc
275 128
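If the spreadsheet is not at hand, the counts above can be rebuilt from a stand-in vector. The following is only a sketch and not part of the original script; toy_order and toy_freq are hypothetical names. It illustrates how a single count can be extracted by its category name and how sort() ranks the categories by frequency.

```r
# Stand-in vector rebuilt from the counts printed above (hypothetical name,
# not the original dataset)
toy_order <- rep(c("mc-sc", "sc-mc"), times = c(275, 128))

# Count occurrences of each ordering type
toy_freq <- table(toy_order)

# Extract a single count by its category name
toy_freq["mc-sc"]

# Rank the categories from most to least frequent
sort(toy_freq, decreasing = TRUE)

# Total number of observations
sum(toy_freq)
```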
More detailed: xtabs()
Alternatively, you could use xtabs() to achieve the same result. The syntax is a little different, but it returns a slightly more detailed table with explicit variable label(s).
# Count occurrences of ordering types ("mc-sc" and "sc-mc")
order_freq2 <- xtabs(~ ORDER, cl.order)

# Print table
print(order_freq2)
ORDER
mc-sc sc-mc
275 128
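The variable label is the main practical difference between the two functions. As a small sketch on stand-in data rebuilt from the counts above (toy, freq_xtabs and freq_table are hypothetical names, not objects from the original script), the label can be inspected with names(dimnames()). table() can be given the same label by naming its argument.

```r
# Stand-in data rebuilt from the counts printed above (not the original file)
toy <- data.frame(ORDER = rep(c("mc-sc", "sc-mc"), times = c(275, 128)))

# xtabs() takes the variable label from the formula
freq_xtabs <- xtabs(~ ORDER, data = toy)
names(dimnames(freq_xtabs))

# table() carries a label as well if the argument is named
freq_table <- table(ORDER = toy$ORDER)
names(dimnames(freq_table))

# The counts themselves are identical
all(as.vector(freq_xtabs) == as.vector(freq_table))
```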
15.2.2 Frequency tables (\(\geq\) 2 variables)
If we are interested in the relationship between multiple categorical variables, we can cross-tabulate the frequencies of their categories. For example, what is the distribution of clause order depending on the type of subordinate clause? The output is also referred to as a contingency table.
The table() way
# Get frequencies of ordering types ("mc-sc" vs. "sc-mc") depending on
# the type of subordinate clause ("caus" vs. "temp")
order_counts1 <- table(cl.order$ORDER, cl.order$SUBORDTYPE)

# Print contingency table
print(order_counts1)
caus temp
mc-sc 184 91
sc-mc 15 113
The xtabs() way
# Cross-tabulate ORDER and SUBORDTYPE
order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)

# Print cross-table
print(order_counts2)
SUBORDTYPE
ORDER caus temp
mc-sc 184 91
sc-mc 15 113
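Contingency tables are often reported together with their row and column totals. One possible way to add them is the addmargins() function. The sketch below works on stand-in data rebuilt from the four cell counts printed above; toy, toy_counts and toy_totals are hypothetical names and the data frame is not the original file.

```r
# Stand-in data rebuilt from the four cell counts printed above
toy <- data.frame(
  ORDER      = rep(c("mc-sc", "mc-sc", "sc-mc", "sc-mc"), times = c(184, 91, 15, 113)),
  SUBORDTYPE = rep(c("caus", "temp", "caus", "temp"), times = c(184, 91, 15, 113))
)

# Cross-tabulate ORDER and SUBORDTYPE
toy_counts <- xtabs(~ ORDER + SUBORDTYPE, data = toy)

# Append row totals, column totals and the grand total (labelled "Sum")
toy_totals <- addmargins(toy_counts)
print(toy_totals)
```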
15.2.3 Percentage tables
There are several ways to compute percentages for your cross-tables, but by far the simplest is via the prop.table() function. By default, it divides each cell by the grand total of the table. As it only returns proportions, you need to multiply the output by 100 to obtain actual percentages.
Get percentages for a table() object
# Convert to % using the prop.table() function
pct1 <- prop.table(order_counts1) * 100

# Print percentages
print(pct1)
# Convert to % using the prop.table() function (here: for the xtabs() object)
pct2 <- prop.table(order_counts2) * 100

# Print percentages
print(pct2)
SUBORDTYPE
ORDER caus temp
mc-sc 45.657568 22.580645
sc-mc 3.722084 28.039702
Notice how pct2 still carries the variable labels SUBORDTYPE and ORDER, which is very convenient.
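The percentages above are relative to all 403 observations. If row-wise or column-wise percentages are of interest instead, prop.table() accepts a margin argument (1 = rows, 2 = columns). The following sketch uses stand-in data rebuilt from the cell counts reported earlier; toy, toy_counts, row_pct and col_pct are hypothetical names.

```r
# Stand-in data rebuilt from the four cell counts reported earlier
toy <- data.frame(
  ORDER      = rep(c("mc-sc", "mc-sc", "sc-mc", "sc-mc"), times = c(184, 91, 15, 113)),
  SUBORDTYPE = rep(c("caus", "temp", "caus", "temp"), times = c(184, 91, 15, 113))
)
toy_counts <- xtabs(~ ORDER + SUBORDTYPE, data = toy)

# Row percentages: each ORDER row sums to 100
row_pct <- round(prop.table(toy_counts, margin = 1) * 100, 2)
print(row_pct)

# Column percentages: each SUBORDTYPE column sums to 100
col_pct <- round(prop.table(toy_counts, margin = 2) * 100, 2)
print(col_pct)
```

The row percentages obtained this way (e.g., 66.91 for causal clauses in "mc-sc" order) correspond to the percentages that crosstable() reports by default in the export section further below.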
15.3 Plotting categorical data
This section demonstrates both R's built-in plotting functions (‘Base R’) and the more modern alternatives provided by ggplot2, which is part of the tidyverse.
Mosaicplots (raw counts)
A straightforward way to visualise a contingency table is the mosaicplot:
# Works with raw counts and percentages
# Using the output of xtabs() as input
mosaicplot(order_counts2, color = TRUE)
Barplots (raw counts)
The workhorse of categorical data analysis is the barplot. Base R functions usually require a table object as input, whereas ggplot2 can operate on the raw dataset.
Bivariate barplots can be obtained by either supplying a contingency table (Base R) or by mapping the second variable onto the fill argument using the raw data.
# Generate cross-table with two variables
order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)

# Create simple barplot
barplot(order_counts2,
        beside = TRUE, # Make bars side-by-side
        legend = TRUE) # Add a legend
# Generate cross-table with two variables
order_counts2 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)

# Customise barplot with axis labels, colours and legend
# Note: the columns of the table (SUBORDTYPE) form the groups on the x-axis,
# whereas the rows (ORDER) form the bars within each group and the legend entries
barplot(order_counts2,
        beside = TRUE, # Make bars dodged (i.e., side by side)
        main = "Distribution of ORDER by SUBORDTYPE (Base R)",
        xlab = "SUBORDTYPE",
        ylab = "Frequency",
        col = c("lightblue", "lightgreen"), # Customise colours
        legend = TRUE, # Add a legend
        args.legend = list(title = "ORDER", x = "topright"))
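When a contingency table is passed to barplot() with beside = TRUE, the columns of the table become the groups on the x-axis and the rows become the bars within each group. To put ORDER on the x-axis instead, the table can be transposed with t(). The sketch below illustrates this on stand-in data rebuilt from the cell counts reported earlier (toy, toy_counts and the mids_* objects are hypothetical names). The call to pdf(NULL) merely opens a plotting device that writes no file; it can be dropped in an interactive session.

```r
# Stand-in data rebuilt from the four cell counts reported earlier
toy <- data.frame(
  ORDER      = rep(c("mc-sc", "mc-sc", "sc-mc", "sc-mc"), times = c(184, 91, 15, 113)),
  SUBORDTYPE = rep(c("caus", "temp", "caus", "temp"), times = c(184, 91, 15, 113))
)
toy_counts <- xtabs(~ ORDER + SUBORDTYPE, data = toy)

# Open a device that creates no external file (optional in interactive use)
pdf(NULL)

# Groups on the x-axis = columns of the table (SUBORDTYPE)
mids_by_subord <- barplot(toy_counts, beside = TRUE,
                          legend.text = TRUE, xlab = "SUBORDTYPE")

# Transposed table: groups on the x-axis = ORDER
mids_by_order <- barplot(t(toy_counts), beside = TRUE,
                         legend.text = TRUE, xlab = "ORDER")

# Close the device again
invisible(dev.off())
```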
# Requirement: library(tidyverse)

# Create simple barplot with the ggplot() function
ggplot(cl.order, aes(x = ORDER, fill = SUBORDTYPE)) +
  geom_bar(position = "dodge")
# Requirement: library(tidyverse)

# Fully customised ggplot2 object
ggplot(cl.order, aes(x = ORDER, fill = SUBORDTYPE)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Clause order by subordinate clause type",
    x = "Clause order",
    y = "Frequency",
    fill = "Type of subordinate clause"
  ) +
  theme_bw()
Barplots (percentages)
# Create simple barplot with a percentage table as input
barplot(pct1,
        beside = TRUE, # Make bars side-by-side
        legend = TRUE) # Add a legend
For ggplot2, a few tweaks are necessary. Because the ggplot() function works with data frames rather than cross-tables, we’ll have to coerce the percentage table into a data frame first:
# Convert a percentage table to a data frame
# My recommendation: Use the pct2 object, which was generated using xtabs(),
# because it will keep the variable names
pct2_df <- as.data.frame(pct2)

print(pct2_df)
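To see why the xtabs() version is the more convenient input, it helps to compare the column names of the resulting data frames. The sketch below relies on stand-in data rebuilt from the cell counts reported earlier; toy, toy_pct_df and unnamed_df are hypothetical names. In both cases, the result is a data frame in long format with one row per cell and the percentages stored in a column called Freq.

```r
# Stand-in data rebuilt from the four cell counts reported earlier
toy <- data.frame(
  ORDER      = rep(c("mc-sc", "mc-sc", "sc-mc", "sc-mc"), times = c(184, 91, 15, 113)),
  SUBORDTYPE = rep(c("caus", "temp", "caus", "temp"), times = c(184, 91, 15, 113))
)

# Percentage table from xtabs(): variable names are preserved
toy_pct_df <- as.data.frame(prop.table(xtabs(~ ORDER + SUBORDTYPE, data = toy)) * 100)
names(toy_pct_df) # ORDER, SUBORDTYPE, Freq

# Percentage table from an unlabelled table(): generic names are used
unnamed_df <- as.data.frame(prop.table(table(toy$ORDER, toy$SUBORDTYPE)) * 100)
names(unnamed_df) # Var1, Var2, Freq

# One row per combination of categories
nrow(toy_pct_df)
```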
Now we can plot the percentages with geom_col(). This geom (= ‘geometric object’) allows us to manually specify what should be mapped onto the y-axis:
# Requirement: library(tidyverse)

# Create barplot with user-defined y-axis, which requires geom_col() rather than geom_bar()
ggplot(pct2_df, aes(x = ORDER, y = Freq, fill = SUBORDTYPE)) +
  geom_col(position = "dodge") +
  labs(y = "Frequency (in %)")
Bubble plot (percentages)
# Requirement: library(tidyverse)

# Bubble plot
ggplot(pct2_df, aes(x = ORDER, y = SUBORDTYPE, size = Freq)) +
  geom_point(color = "skyblue", alpha = 0.7) +
  scale_size_continuous(range = c(5, 20)) + # Adjust bubble size range
  labs(title = "Bubble Plot of ORDER by SUBORDTYPE",
       x = "ORDER",
       y = "SUBORDTYPE",
       size = "Percentage") +
  theme_minimal()
Alluvial plot (percentages)
# Make sure to install this library prior to running the code below
library(ggalluvial)

ggplot(pct2_df, aes(axis1 = ORDER, axis2 = SUBORDTYPE, y = Freq)) +
  geom_alluvium(aes(fill = ORDER)) +
  geom_stratum(fill = "gray") +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  labs(title = "Alluvial Plot of ORDER by SUBORDTYPE",
       x = "Categories",
       y = "Percentage") +
  theme_minimal()
15.4 Exporting tables to MS Word
The crosstable and flextable packages make it very easy to export elegant tables to MS Word.
Clean and to the point: crosstable()
This is perhaps the most elegant solution. Generate a crosstable() object by supplying at the very least …
the original dataset (data = cl.order),
the dependent variable (cols = ORDER), and
the independent variable (by = SUBORDTYPE).
You can further specify …
whether to include column totals, row totals or both (here: total = "both"),
the rounding scheme (here: percent_digits = 2),
…
# Required libraries:
# library(crosstable)
# library(flextable)

# Create the cross table
output1 <- crosstable(data = cl.order,
                      cols = ORDER,
                      by = SUBORDTYPE,
                      total = "both",
                      percent_digits = 2)

# Convert the cross table to a formatted flextable
as_flextable(output1)
label  variable  SUBORDTYPE                  Total
                 caus          temp
ORDER  mc-sc     184 (66.91%)  91 (33.09%)   275 (68.24%)
       sc-mc     15 (11.72%)   113 (88.28%)  128 (31.76%)
Total            199 (49.38%)  204 (50.62%)  403 (100.00%)
How much info do you need? Yes.
It is also possible to use as_flextable() without pre-processing the data with crosstable(); supplying a table, preferably one created with xtabs(), is sufficient. Without any doubt, the output is extremely informative, yet it is anything but reader-friendly.
For this reason, I recommend relying on the less overwhelming crosstable() option above if a plain and easy result is desired. However, readers who would like to leverage the full capabilities of the flextable package and familiarise themselves with the abundant options for customisation can find the detailed documentation here.
# Requires the following library:
# library(flextable)

# Create a table
tab1 <- xtabs(~ ORDER + SUBORDTYPE, cl.order)

# Directly convert a table to a flextable with as_flextable()
output_1 <- as_flextable(tab1)

# Print output
print(output_1)
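The snippets in this section create flextable objects but do not write a file by themselves. One possible way to complete the export is flextable's save_as_docx() function. The following is only a sketch under several assumptions: it uses stand-in data rebuilt from the cell counts reported earlier, a plain flextable() of the long-format counts rather than the crosstable() output, and a hypothetical file name in a temporary folder. The export step is skipped if the flextable package is not installed.

```r
# Stand-in data rebuilt from the four cell counts reported earlier
toy <- data.frame(
  ORDER      = rep(c("mc-sc", "mc-sc", "sc-mc", "sc-mc"), times = c(184, 91, 15, 113)),
  SUBORDTYPE = rep(c("caus", "temp", "caus", "temp"), times = c(184, 91, 15, 113))
)

# Long-format data frame of counts (columns: ORDER, SUBORDTYPE, Freq)
toy_df <- as.data.frame(xtabs(~ ORDER + SUBORDTYPE, data = toy))

# Hypothetical output path; replace it with a path of your choice
out_file <- file.path(tempdir(), "order_by_subordtype.docx")

# Only attempt the export if flextable is available
if (requireNamespace("flextable", quietly = TRUE)) {
  ft <- flextable::flextable(toy_df)           # Build a flextable from the data frame
  flextable::save_as_docx(ft, path = out_file) # Write it to a Word document
}
```

With the original data, the same call should also accept the objects created above, e.g. save_as_docx(as_flextable(output1), path = "my_table.docx"), where the file name is again only a placeholder.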
Exercise 15.1 Download the dataset objects.xlsx from https://osf.io/j2mnx. Load it into R and store it in a variable objects. Make sure to load all the necessary libraries.
Exercise 15.2 Many rows from objects are irrelevant for the analysis. Exclude all rows from objects that are marked as containing passive clauses (see Clause_voice column). Store this reduced subset in a new variable objects_filtered.
Exercise 15.3 Investigate the relationship between
Object_realisation and Register as well as
Object_realisation and Lemma
by computing frequency tables and percentages based on objects_filtered. Plot your results and export your tables and figures to a Microsoft Word document.
Exercise 15.4 Which verb has the highest \(\frac{\text{null}}{\text{null} + \text{overt}}\) ratio and which one has the lowest?
Paquot, Magali, and Tove Larsson. 2020. “Descriptive Statistics and Visualization with R.” In A Practical Handbook of Corpus Linguistics, edited by Magali Paquot and Stefan Thomas Gries, 375–99. Cham: Springer.