# Load libraries
library(readxl)
library(tidyverse)
library(confintr) # for effect size calculation
# Load data
cl.order <- read_xlsx("Paquot_Larsson_2020_data.xlsx")
20 Chi-squared test
The chi-squared (\(\chi^2\)) test helps determine whether there is a statistically significant association between two categorical variables. It compares the observed frequencies of categories with those expected under the null hypothesis. The \(\chi^2\) score quantifies the difference between observed and expected frequencies for every cell in a contingency table: the greater this difference, the higher the \(\chi^2\) score and the lower the \(p\)-value, given the degrees of freedom. It is recommended to compute effect size measures and inspect the residuals to assess the nature of the association.
20.1 Suggested reading
For linguists:
Gries (2021): Chapter 4.1.2.1
General:
Baguley (2012): Chapter 4
Agresti and Kateri (2022): Chapter 5
20.2 Preparation
You can find the full R script associated with this unit here.
20.3 The Pearson \(\chi^2\)-test of independence
The first step of any significance test involves setting up the null and alternative hypotheses. In this unit, we will perform a sample analysis of the relationship between clause ORDER and the type of subordinate clause (SUBORDTYPE). Specifically, we will focus on the independence of these two discrete variables (i.e., the presence or absence of correlation).
\(H_0:\) The variables ORDER and SUBORDTYPE are independent.
\(H_1:\) The variables ORDER and SUBORDTYPE are not independent.
The core idea is “that the probability distribution of the response variable is the same for each group” (Agresti and Kateri 2022: 177). If clause ORDER is the response variable and SUBORDTYPE the explanatory variable, independence would entail that the outcomes ORDER = "mc-sc" and ORDER = "sc-mc" occur with the same probability in the group SUBORDTYPE = "temp" as in the group SUBORDTYPE = "caus".
The term probability distribution refers to a mathematical function that assigns probabilities to the outcomes of a variable. If we consider two variables at the same time, such as \(X\) and \(Y\), they are said to have marginal probability functions \(f_1(x)\) and \(f_2(y)\). Under independence, conditioning the outcomes of one variable on the other does not change these marginal distributions:
\[ f(x \mid y) = f_1(x) \text{ and } f(y \mid x) = f_2(y). \tag{20.1}\]
Thus, the null hypothesis assumes that the probabilities of each combination of values (such as ORDER and SUBORDTYPE), denoted by \(\pi_{ij}\), have the relationship in Equation 20.1. This can be stated succinctly as
\[ H_0 : \pi_{ij} = P(X = i)P(Y = j). \tag{20.2}\]
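To make Equation 20.2 more tangible, here is a minimal numerical sketch. The marginal probabilities (0.6/0.4 and 0.3/0.7) are purely hypothetical; they merely illustrate that, under independence, every joint probability is the product of the two marginals.
## Hypothetical marginal distributions (illustrative values only)
row_probs <- c(0.6, 0.4)             # P(X = i)
col_probs <- c(0.3, 0.7)             # P(Y = j)
## Joint probabilities under H0: pi_ij = P(X = i) * P(Y = j)
pi_ij <- outer(row_probs, col_probs)
pi_ij
rowSums(pi_ij) # recovers the marginal distribution of X
colSums(pi_ij) # recovers the marginal distribution of Y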
Next, we compute a test statistic that indicates how strongly our data deviate from \(H_0\), such as Pearson’s \(\chi^2\). To this end, we will need two types of values:
- the observed frequencies \(f_{ij}\) present in our data set,
- the expected frequencies \(e_{ij}\), which we would expect to see if \(H_0\) were true,
where the observed counts \(f_{ij}\) are non-negative integers and the expected counts \(e_{ij}\) are non-negative real numbers. The indices \(i\) and \(j\) uniquely identify the cell count in each row-column combination of a contingency table.
The table below represents a generic contingency table where \(X\) and \(Y\) are categorical variables with the values \(X = \{x_1, x_2, \dots, x_i\}\) and \(Y = \{y_1, y_2, \dots, y_j \}\). In the table, each cell contains the count of observations \(f_{ij}\) corresponding to the \(i\)-th row and \(j\)-th column.
|               | \(Y = y_1\) | \(Y = y_2\) | …   | \(Y = y_j\) |
|---------------|-------------|-------------|-----|-------------|
| \(X = x_1\)   | \(f_{11}\)  | \(f_{12}\)  | …   | \(f_{1j}\)  |
| \(X = x_2\)   | \(f_{21}\)  | \(f_{22}\)  | …   | \(f_{2j}\)  |
| …             | …           | …           | …   | …           |
| \(X = x_i\)   | \(f_{i1}\)  | \(f_{i2}\)  | …   | \(f_{ij}\)  |
In the cl.order data, the observed frequencies correspond to how often each ORDER value (i.e., mc-sc and sc-mc) is attested for a given SUBORDTYPE (i.e., temp and caus). They can be obtained in a very straightforward fashion by applying R’s table() function to the variables of interest.
observed_freqs <- table(cl.order$ORDER, cl.order$SUBORDTYPE)
print(observed_freqs)
caus temp
mc-sc 184 91
sc-mc 15 113
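Before moving on to the expected frequencies, it can help to display the row and column totals on which they are based. One way to do this is base R’s addmargins():
addmargins(observed_freqs) # adds row, column and grand totals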
The expected frequencies require a few additional steps. Usually, these steps are performed automatically when conducting the chi-squared test in R, so you don’t have to worry about calculating them by hand. We will do it anyway to drive home the rationale of the test.
The expected frequencies \(e_{ij}\) are given by the formula in Equation 20.3. In concrete terms, we go through each cell in the cross-table, multiply the corresponding row sum with the column sum, and divide the result by the total number of occurrences in the sample. For example, there are \(184\) occurrences of mc-sc clause orders where the subordinate clause is causal. The row sum is \(184 + 91 = 275\) and the column sum is \(184 + 15 = 199\). Next, we take their product \(275 \times 199\) and divide it by the total number of observations, which is \(184 + 91 + 15 + 113 = 403\). Thus we obtain an expected frequency of \(\frac{275 \times 199}{403} = 135.79\) under the null hypothesis.
\[ e_{ij} = \frac{i\textrm{th row sum} \times j \textrm{th column sum}}{\textrm{number of observations}} \tag{20.3}\]
The expected frequencies for our combination of variables are shown below. In which cells can you see the greatest deviations between observed and expected frequencies?
## Calculate row totals
row_totals <- rowSums(observed_freqs)

## Calculate column totals
col_totals <- colSums(observed_freqs)

## Total number of observations
total_obs <- sum(observed_freqs)

## Calculate expected frequencies
expected_freqs <- outer(row_totals, col_totals) / total_obs

print(expected_freqs)
caus temp
mc-sc 135.79404 139.20596
sc-mc 63.20596 64.79404
The \(\chi^2\)-test now offers a convenient way of quantifying the differences between the two tables above. It measures how much the observed frequencies deviate from the expected frequencies for each cell in a contingency table (cf. Heumann, Schomaker, and Shalabh 2022: 249-251). The gist of this procedure is summarised in Equation 20.4.
\[ \text{Chi-squared } \chi^2 =\frac{(\text{observed} - \text{expected})^2}{\text{expected}} \tag{20.4}\]
Given \(n\) observations, the squared deviations between \(f_{ij}\) and \(e_{ij}\) jointly contribute to the final \(\chi^2\)-score, which is defined as
\[ \chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}{\frac{(f_{ij} - e_{ij})^2}{e_{ij}}} \tag{20.5}\]
for \(i = 1, ..., I\) and \(j = 1, ..., J\), with \(df = (\textrm{number of rows} -1) \times (\textrm{number of columns} - 1)\) degrees of freedom.
The implementation in R is a simple one-liner. Keep in mind that we have to supply absolute frequencies to chisq.test() rather than percentages.
freqs_test <- chisq.test(observed_freqs)
print(freqs_test)
Pearson's Chi-squared test with Yates' continuity correction
data: observed_freqs
X-squared = 104.24, df = 1, p-value < 2.2e-16
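For comparison, Equation 20.5 can be applied by hand to the observed and expected tables computed above. The resulting raw score is slightly higher than the X-squared value printed here because chisq.test() applies Yates’ continuity correction to 2×2 tables by default; setting correct = FALSE reproduces the uncorrected formula.
## Chi-squared score computed directly from Equation 20.5 (no correction)
sum((observed_freqs - expected_freqs)^2 / expected_freqs)
## The same value via chisq.test() with the continuity correction switched off
chisq.test(observed_freqs, correct = FALSE)$statistic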
Quite conveniently, the test object freqs_test stores the expected frequencies, which can be easily accessed via subsetting. Luckily, they are identical to what we calculated above!
freqs_test$expected
caus temp
mc-sc 135.79404 139.20596
sc-mc 63.20596 64.79404
This \(\chi^2\)-test comes with certain statistical assumptions. Violations of these assumptions decrease the validity of the result and could, therefore, lead to wrong conclusions about relationships in the data. In such cases, other tests should be considered.
- All observations are independent of each other.
- At least 80% of the expected frequencies are \(\geq\) 5.
- All observed frequencies are \(\geq\) 1.
If assumptions 2 and 3 are violated, it is recommended to use a more robust alternative such as Fisher’s exact test (see ?fisher.test() for details) or the likelihood ratio test (\(G\)-test).
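The two frequency-based assumptions can be checked directly on our tables; the short sketch below also shows how Fisher’s exact test would be called on the same data as a fallback.
## Assumption 2: at least 80% of the expected frequencies >= 5?
mean(freqs_test$expected >= 5) >= 0.8
## Assumption 3: all observed frequencies >= 1?
all(observed_freqs >= 1)
## Fallback in case of violations: Fisher's exact test
fisher.test(observed_freqs)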
In the case of dependent observations (e.g., multiple measurements per participant), the default approach is to fit a multilevel model that can control for grouping factors (see mixed-effects regression in Section 23.3).
20.3.1 How do I make sense of the test results?
The test output has three ‘ingredients’:
- the chi-squared score (X-squared)
- the degrees of freedom (df)
- the p-value.
It is absolutely essential to report all three of these, as they determine each other. Here are a few possible wordings that could be used:
According to a \(\chi^2\)-test, there is a highly significant association between clause ORDER and SUBORDTYPE at \(p < 0.001\) (\(\chi^2 = 104.24, df = 1\)), thus justifying the rejection of \(H_0\).
A \(\chi^2\)-test revealed a highly significant association between clause ORDER and SUBORDTYPE (\(\chi^2(1) = 104.24\), \(p < 0.001\)), supporting the rejection of \(H_0\).
The \(\chi^2\)-test results (\(\chi^2 = 104.24\), \(df = 1\), \(p < 0.001\)) provide strong evidence against the null hypothesis, demonstrating a significant association between clause ORDER and SUBORDTYPE.
The test results suggest that the dependent variable ORDER and the explanatory variable SUBORDTYPE are not independent of each other. If the variables were truly independent, the probability of observing usage patterns such as those found in the cl.order data would be lower than 0.001 (i.e., 0.1%), which is enough to reject the null hypothesis at \(\alpha = 0.05\).
We can infer that a speaker’s choice of clause ORDER is very likely influenced by the semantic type of subordinate clause; in other words, these two variables are correlated. However, there are still several things the test does not tell us:
- Are there certain variable combinations where the \(\chi^2\)-scores are particularly high?
- How strongly do ORDER and SUBORDTYPE influence each other?
- Does a causal subordinate clause make the mc-sc clause order more likely?
20.4 Pearson residuals
If we’re interested in which cells show the greatest difference between observed and expected frequencies, one option is to inspect the Pearson residuals (cf. Equation 20.6).
\[ \text{residuals} = \frac{\text{observed} - \text{expected}}{\sqrt{\text{expected}}} \tag{20.6}\]
These can be accessed via the test results stored in freqs_test.
freqs_test$residuals
caus temp
mc-sc 4.136760 -4.085750
sc-mc -6.063476 5.988708
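As a sanity check, the same residuals can be recomputed by hand from Equation 20.6 using the observed and expected tables from earlier.
## Pearson residuals: (observed - expected) / sqrt(expected)
(observed_freqs - expected_freqs) / sqrt(expected_freqs)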
The function assocplot() can automatically compute the Pearson residuals for any given contingency table and create a plot that highlights their contributions. If a bar rises above the dashed baseline, it is coloured black and indicates that a category is observed more frequently than expected (e.g., causal subordinate clauses in the mc-sc order). Conversely, bars are coloured grey if a category is considerably less frequent than expected, such as caus in sc-mc.
assocplot(t(observed_freqs), col = c("black", "lightgrey"))
The chi-squared test only provides a \(p\)-value for the entire contingency table. But what if we wanted to test the residuals for their significance as well? Configural Frequency Analysis (Krauth and Lienert 1973) allows us to do exactly that: it performs a significance test for all combinations of variable values in a cross-table. Moreover, CFA is not limited to two variables; technically, users can test for associations between arbitrary numbers of variables, but should be aware of the increasing complexity of interpretation.
library(cfa) # install library beforehand
# Get the observed counts and convert them to a data frame
config_df <- as.data.frame(observed_freqs)
# Convert to matrix
configs <- as.matrix(config_df[, 1:2]) # First two columns contain the configurations
counts <- config_df$Freq # The Freq column contains the counts
# Perform CFA
cfa_output <- cfa(configs, counts, bonferroni = TRUE)
# Print output
print(cfa_output)
*** Analysis of configuration frequencies (CFA) ***
label n expected Q chisq p.chisq sig.chisq z
1 sc-mc caus 15 63.20596 0.1418682 36.76575 1.332103e-09 TRUE -6.671872
2 sc-mc temp 113 64.79404 0.1425343 35.86463 2.115143e-09 TRUE 6.469444
3 mc-sc caus 184 135.79404 0.1804075 17.11278 3.522441e-05 TRUE 5.027611
4 mc-sc temp 91 139.20596 0.1827409 16.69335 4.393467e-05 TRUE -5.102385
p.z sig.z
1 1.000000e+00 TRUE
2 4.918210e-11 TRUE
3 2.483137e-07 TRUE
4 9.999998e-01 TRUE
Summary statistics:
Total Chi squared = 106.4365
Total degrees of freedom = 1
p = 0
Sum of counts = 403
Levels:
Var1 Var2
2 2
20.5 Effect size
The \(p\)-value only indicates the presence of correlation, not its strength: regardless of how low it may be, it does not convey how much two variables determine each other. For this reason, it is highly recommended to report an effect size measure alongside the \(p\)-value. One such measure is Cramér’s \(V\), which takes values in the interval \([0, 1]\):
\[ V = \sqrt{\frac{\chi^2}{n \times (\min(\textrm{nrow}, \textrm{ncol}) - 1)}}. \tag{20.7}\]
The package confintr implements this in its cramersv() function:
cramersv(freqs_test)
[1] 0.5085863
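The value can be reproduced by hand from Equation 20.7, using the \(\chi^2\) statistic stored in freqs_test together with the sample size and the table dimensions.
## Cramer's V computed from Equation 20.7
chi_sq <- unname(freqs_test$statistic)
n_obs <- sum(observed_freqs)
k_min <- min(nrow(observed_freqs), ncol(observed_freqs))
sqrt(chi_sq / (n_obs * (k_min - 1)))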
The closer \(V\) approximates 1, the stronger the association between the two categorical variables. Conversely, if \(V = 0\), the variables are completely independent. There are various guidelines in the literature that provide thresholds for “small”, “moderate” and “large” effects, yet these are rarely justified on theoretical grounds and could be viewed as arbitrary.
20.6 Likelihood-based inference
Since causal subordinate clauses typically follow the main clause in the sample, it may seem tempting to draw conclusions about the probability of a certain outcome of clause ORDER. Unfortunately, the \(\chi^2\)-test is not designed for this kind of statistical inference. An approach that would actually allow the estimation of probabilities is logistic regression.
# Convert to factors and define reference levels
cl.order$ORDER <- factor(cl.order$ORDER, levels = c("sc-mc", "mc-sc"))
cl.order$SUBORDTYPE <- factor(cl.order$SUBORDTYPE, levels = c("temp", "caus"))

# Fit logistic regression model
order.glm1 <- glm(ORDER ~ SUBORDTYPE, data = cl.order, family = "binomial")

# Store model parameters
intercept <- order.glm1$coefficients[1]
slope <- order.glm1$coefficients[2]
# Convert to probabilities
unname(plogis(intercept + slope * 0)) # if temporal
[1] 0.4460784
unname(plogis(intercept + slope * 1)) # if causal
[1] 0.9246231
If the subordinate clause is causal, there is a 92.5% chance that speakers will use the mc-sc ordering; if it is temporal, the chance of mc-sc drops to 44.6%.
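The same probabilities can also be obtained with predict(), which applies the inverse-logit transformation internally and is usually more convenient than handling the coefficients manually.
## Predicted probabilities of the mc-sc order for both subordinate clause types
predict(order.glm1,
        newdata = data.frame(SUBORDTYPE = c("temp", "caus")),
        type = "response")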