20  Chi-squared test

Author: Vladimir Buskin

Affiliation: Catholic University of Eichstätt-Ingolstadt

Abstract

The chi-squared (\(\chi^2\)) test helps determine whether there is a statistically significant association between two categorical variables. It compares the observed frequencies of categories with those expected under the null hypothesis. The \(\chi^2\) score quantifies the difference between observed and expected frequencies for every cell in a contingency table. The greater the difference between observed and expected frequencies, the higher the \(\chi^2\) score and the lower the \(p\)-value, given the degrees of freedom. It is recommended to compute effect size measures and inspect the residuals to assess the nature of the association.

20.1 Suggested reading

For linguists:

Gries (2021): Chapter 4.1.2.1

General:

Baguley (2012): Chapter 4

Agresti and Kateri (2022): Chapter 5

20.2 Preparation

Script

You can find the full R script associated with this unit here.

#  Load libraries
library(readxl)
library(tidyverse)
library(confintr) # for effect size calculation

# Load data
cl.order <- read_xlsx("Paquot_Larsson_2020_data.xlsx")

20.3 The Pearson \(\chi^2\)-test of independence

The first step of any significance test involves setting up the null and alternative hypotheses. In this unit, we will perform a sample analysis of the relationship between clause ORDER and the type of subordinate clause (SUBORDTYPE). Specifically, we will focus on the independence of these two discrete variables (i.e., the presence or absence of an association).

  • \(H_0:\) The variables ORDER and SUBORDTYPE are independent.

  • \(H_1:\) The variables ORDER and SUBORDTYPE are not independent.

The core idea is “that the probability distribution of the response variable is the same for each group” (Agresti and Kateri 2022: 177). If clause ORDER is the response variable and SUBORDTYPE the explanatory variable, independence would entail that the outcomes of the response variable ORDER = "mc-sc" and ORDER = "sc-mc" are equally likely to occur in the groups SUBORDTYPE = "temp" and SUBORDTYPE = "caus".

The term probability distribution refers to a mathematical function that assigns probabilities to the outcomes of a variable. If we consider two variables at the same time, such as \(X\) and \(Y\), each has a marginal probability function, \(f_1(x)\) and \(f_2(y)\) respectively. Under independence, conditioning on the other variable does not change these distributions, i.e., the following equivalence holds:

\[ f(x \mid y) = f_1(x) \text{ and } f(y \mid x) = f_2(y). \tag{20.1}\]

Thus, the null hypothesis assumes that the probabilities of all combinations of values of the two variables (here, ORDER and SUBORDTYPE), denoted by \(\pi_{ij}\), satisfy the relationship in Equation 20.1. This can be stated succinctly as

\[ H_0 : \pi_{ij} = P(X = i)P(Y = j). \tag{20.2}\]

Next, we compute a test statistic that indicates how strongly our data conforms to \(H_0\), such as Pearson’s \(\chi^2\). To this end, we will need two types of values:

  • the observed frequencies \(f_{ij}\) present in our data set

  • the expected frequencies \(e_{ij}\), which we would expect to see if \(H_0\) were true,

where \(f_{ij} \in \mathbb{N}\) and \(e_{ij} \geq 0\) (expected frequencies need not be whole numbers). The indices \(i\) and \(j\) uniquely identify the cell counts in all row-column combinations of a contingency table.

The table below represents a generic contingency table where the rows correspond to the values \(X = \{x_1, x_2, \dots, x_i\}\) and the columns to the values \(Y = \{y_1, y_2, \dots, y_j \}\). Each cell contains the observed count \(f_{ij}\) for the \(i\)-th row and \(j\)-th column.

\[
\begin{array}{c|cccc}
       & y_1    & y_2    & \cdots & y_j    \\
\hline
x_1    & f_{11} & f_{12} & \cdots & f_{1j} \\
x_2    & f_{21} & f_{22} & \cdots & f_{2j} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_i    & f_{i1} & f_{i2} & \cdots & f_{ij} \\
\end{array}
\]

In the cl.order data, the observed frequencies correspond to how often each ORDER value (i.e., mc-sc and sc-mc) is attested for a given SUBORDTYPE (i.e., temp and caus). This can be done in a very straightforward fashion using R’s table() function on the variables of interest.

observed_freqs <- table(cl.order$ORDER, cl.order$SUBORDTYPE)

print(observed_freqs)
       
        caus temp
  mc-sc  184   91
  sc-mc   15  113

The expected frequencies require a few additional steps. Usually, these steps are performed automatically when conducting the chi-squared test in R, so you don’t have to worry about calculating them by hand. We will do it anyway to drive home the rationale of the test.

The expected frequencies \(e_{ij}\) are given by the formula in Equation 20.3. In concrete terms, we go through each cell in the cross-table and multiply the corresponding row sums with the column sums, dividing the result by the total number of occurrences in the sample. For example, there are \(184\) occurrences of mc-sc clause orders where the subordinate clause is causal. The row sum is \(184 + 91 = 275\) and the column sum is \(184 + 15 = 199\). Next, we take their product \(275 \times 199\) and divide it by the total number of observations, which is \(184 + 91 + 15 + 113 = 403\). Thus we obtain an expected frequency of \(\frac{275 \times 199}{403} = 135.79\) under the null hypothesis.

\[ e_{ij} = \frac{i\textrm{th row sum} \times j \textrm{th column sum}}{\textrm{number of observations}} \tag{20.3}\]
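The arithmetic from the example above can be verified directly in R using the cell counts from the observed table (a quick sanity check):

# Expected frequency for ORDER = "mc-sc" and SUBORDTYPE = "caus"
(184 + 91) * (184 + 15) / (184 + 91 + 15 + 113)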

The expected frequencies for our combination of variables are shown below. In which cells can you see the greatest deviations between observed and expected frequencies?

## Calculate row totals
row_totals <- rowSums(observed_freqs)

## Calculate column totals
col_totals <- colSums(observed_freqs)

## Total number of observations
total_obs <- sum(observed_freqs)

## Calculate expected frequencies
expected_freqs <- outer(row_totals, col_totals) / total_obs

print(expected_freqs)
           caus      temp
mc-sc 135.79404 139.20596
sc-mc  63.20596  64.79404

The \(\chi^2\)-test now offers a convenient way of quantifying the differences between the two tables above. It measures how much the observed frequencies deviate from the expected frequencies for each cell in a contingency table (cf. Heumann, Schomaker, and Shalabh 2022: 249-251). The gist of this procedure is summarised in Equation 20.4.

\[ \text{contribution of each cell to } \chi^2 =\frac{(\text{observed} - \text{expected})^2}{\text{expected}} \tag{20.4}\]

Given \(n\) observations, the squared deviations between \(f_{ij}\) and \(e_{ij}\) across all cells contribute to the final \(\chi^2\)-score, which is defined as

\[ \chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}{\frac{(f_{ij} - e_{ij})^2}{e_{ij}}} \tag{20.5}\]

for \(i = 1, ..., I\) and \(j = 1, ..., J\), with \(df = (\textrm{number of rows} -1) \times (\textrm{number of columns} - 1)\) degrees of freedom.
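To see how this maps onto the objects computed above, the \(\chi^2\)-score and the degrees of freedom can also be calculated by hand (a minimal sketch; note that chisq.test() additionally applies Yates' continuity correction to \(2 \times 2\) tables by default, so its reported statistic is slightly smaller than this uncorrected sum):

# Per-cell contributions to chi-squared (Equation 20.4)
cell_contributions <- (observed_freqs - expected_freqs)^2 / expected_freqs

# Sum over all cells (Equation 20.5)
sum(cell_contributions)

# Degrees of freedom
(nrow(observed_freqs) - 1) * (ncol(observed_freqs) - 1)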

The implementation in R is a simple one-liner. Keep in mind that we have to supply absolute frequencies to chisq.test() rather than percentages.

freqs_test <- chisq.test(observed_freqs)

print(freqs_test)

    Pearson's Chi-squared test with Yates' continuity correction

data:  observed_freqs
X-squared = 104.24, df = 1, p-value < 2.2e-16

Quite conveniently, the test object freqs_test stores the expected frequencies, which can be easily accessed via subsetting. Luckily, they are identical to what we calculated above!

freqs_test$expected
       
             caus      temp
  mc-sc 135.79404 139.20596
  sc-mc  63.20596  64.79404
Assumptions of the chi-squared test

This \(\chi^2\)-test comes with certain statistical assumptions. Violations of these assumptions decrease the validity of the result and could, therefore, lead to wrong conclusions about relationships in the data. In such cases, alternative tests should be used.

  1. All observations are independent of each other.
  2. At least 80% of the expected frequencies are \(\geq\) 5.
  3. All observed frequencies are \(\geq\) 1.

If assumptions 2 and 3 are violated, it is recommended to use a more robust test such as Fisher’s exact test (see ?fisher.test for details) or the likelihood ratio test (\(G\)-test).
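The two frequency conditions can be checked from the stored test object, and Fisher’s exact test can be run on the same contingency table if they fail (a minimal sketch; output not shown):

# Assumption 2: at least 80% of expected frequencies should be >= 5
mean(freqs_test$expected >= 5) >= 0.8

# Assumption 3: all observed frequencies should be >= 1
all(observed_freqs >= 1)

# Fallback in case of violations: Fisher's exact test
fisher.test(observed_freqs)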

In the case of dependent observations (e.g., multiple measurements per participant), the default approach is to fit a multilevel model that can control for grouping factors (see mixed-effects regression in Section 23.3).

20.3.1 How do I make sense of the test results?

The test output has three ‘ingredients’:

  • the chi-squared score (X-squared)
  • the degrees of freedom (df)
  • the p-value.

It is essential to report all three of these, as they determine each other. Here are a few possible wordings that could be used:

According to a \(\chi^2\)-test, there is a highly significant association between clause ORDER and SUBORDTYPE at \(p < 0.001\) (\(\chi^2 = 104.24, df = 1\)), thus justifying the rejection of \(H_0\).

A \(\chi^2\)-test revealed a highly significant association between clause ORDER and SUBORDTYPE (\(\chi^2(1) = 104.24\), \(p < 0.001\)), supporting the rejection of \(H_0\).

The \(\chi^2\)-test results (\(\chi^2 = 104.24\), \(df = 1\), \(p < 0.001\)) provide strong evidence against the null hypothesis, demonstrating a significant association between clause ORDER and SUBORDTYPE.

The test results suggest that the dependent variable ORDER and the explanatory variable SUBORDTYPE are not independent of each other. If the variables were in fact independent, the probability of observing frequency patterns at least as extreme as those in the cl.order data would be lower than 0.001 (i.e., 0.1%), which justifies rejecting the null hypothesis at \(\alpha = 0.05\).

We can infer that a speaker’s choice of clause ORDER is very likely influenced by the semantic type of subordinate clause; in other words, these two variables are correlated. However, there are still several things the test does not tell us:

  • Are there certain variable combinations where the \(\chi^2\)-scores are particularly high?

  • How strongly do ORDER and SUBORDTYPE influence each other?

  • Does a causal subordinate clause make the mc-sc clause order more likely?

20.4 Pearson residuals

If we’re interested in what cells show the greatest difference between observed and expected frequencies, an option would be to inspect the Pearson residuals (cf. Equation 20.6).

\[ \text{residuals} = \frac{\text{observed} - \text{expected}}{\sqrt{\text{expected}}} \tag{20.6}\]
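Equation 20.6 translates directly into R using the tables computed earlier (a sketch; the values should match the built-in residuals shown below):

# Pearson residuals calculated by hand
(observed_freqs - expected_freqs) / sqrt(expected_freqs)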

These residuals are also stored in the test object freqs_test and can be accessed via subsetting.

freqs_test$residuals
       
             caus      temp
  mc-sc  4.136760 -4.085750
  sc-mc -6.063476  5.988708

The function assocplot() can automatically compute the Pearson residuals for any given contingency table and create a plot that highlights their contributions. If a bar is above the dashed line, it is black and indicates that a category is observed more frequently than expected (e.g., causal subordinate clauses in the mc-sc order). Conversely, bars are coloured grey if a category is considerably less frequent than expected, such as caus in sc-mc.

assocplot(t(observed_freqs), col = c("black", "lightgrey"))

The chi-squared test only provides a \(p\)-value for the entire contingency table. But what if we wanted to test the residuals for their significance as well? Configural Frequency Analysis (Krauth and Lienert 1973) allows us to do exactly that: It performs a significance test for all combinations of variable values in a cross-table. Moreover, CFA is not limited to two variables only. Technically, users can test for associations between arbitrary numbers of variables, but should be aware of the increasing complexity of interpretation.

library(cfa) # install library beforehand

# Get the observed counts and convert them to a data frame
config_df <- as.data.frame(observed_freqs)

# Convert to matrix
configs <- as.matrix(config_df[, 1:2])  # First two columns contain the configurations
counts <- config_df$Freq                # The Freq column contains the counts

# Perform CFA
cfa_output <- cfa(configs, counts, bonferroni = TRUE)

# Print output
print(cfa_output)

*** Analysis of configuration frequencies (CFA) ***

       label   n  expected         Q    chisq      p.chisq sig.chisq         z
1 sc-mc caus  15  63.20596 0.1418682 36.76575 1.332103e-09      TRUE -6.671872
2 sc-mc temp 113  64.79404 0.1425343 35.86463 2.115143e-09      TRUE  6.469444
3 mc-sc caus 184 135.79404 0.1804075 17.11278 3.522441e-05      TRUE  5.027611
4 mc-sc temp  91 139.20596 0.1827409 16.69335 4.393467e-05      TRUE -5.102385
           p.z sig.z
1 1.000000e+00  TRUE
2 4.918210e-11  TRUE
3 2.483137e-07  TRUE
4 9.999998e-01  TRUE


Summary statistics:

Total Chi squared         =  106.4365 
Total degrees of freedom  =  1 
p                         =  0 
Sum of counts             =  403 

Levels:

Var1 Var2 
   2    2 

20.5 Effect size

The \(p\)-value only indicates the presence of an association, not its strength: no matter how low it is, it does not convey how strongly two variables determine each other. For this reason, it is highly recommended to report an effect size measure alongside the \(p\)-value. One such measure is Cramér’s \(V\), which takes values in the interval \([0, 1]\):

\[ V = \sqrt{\frac{\chi^2}{n \times (min(nrow, ncol) - 1)}}. \tag{20.7}\]

The package confintr implements this in its cramersv() function:

cramersv(freqs_test)
[1] 0.5085863
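To see how cramersv() relates to Equation 20.7, the value can also be computed by hand (a sketch; note that freqs_test$statistic stores the Yates-corrected \(\chi^2\) score for this \(2 \times 2\) table):

# Cramér's V from the stored chi-squared statistic
chi_sq <- unname(freqs_test$statistic)
n <- sum(observed_freqs)
k <- min(dim(observed_freqs))   # smaller of the two table dimensions
sqrt(chi_sq / (n * (k - 1)))    # should match cramersv(freqs_test)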

The closer \(V\) is to 1, the stronger the association between the two categorical variables; conversely, if \(V = 0\), the variables are completely independent. There are various guidelines in the literature that provide thresholds for “small”, “moderate” and “large” effects, yet these are rarely justified on theoretical grounds and could be viewed as arbitrary.

20.6 Likelihood-based inference

Since causal subordinate clauses typically follow the main clause in the sample, it may seem tempting to draw conclusions about the probability of a certain outcome of clause ORDER. Unfortunately, the \(\chi^2\)-test is not designed for this kind of statistical inference. An approach that would actually allow the estimation of probabilities is logistic regression.

# Convert to factors and define reference levels
cl.order$ORDER <- factor(cl.order$ORDER, levels = c("sc-mc", "mc-sc"))
cl.order$SUBORDTYPE <- factor(cl.order$SUBORDTYPE, levels = c("temp", "caus")) 

# Fit logistic regression model
order.glm1 <- glm(ORDER ~ SUBORDTYPE, data = cl.order, family = "binomial")

# Store model parameters
intercept <- order.glm1$coefficients[1]
slope <- order.glm1$coefficients[2]

# Convert to probabilities
unname(plogis(intercept + slope * 0)) # if temporal
[1] 0.4460784
unname(plogis(intercept + slope * 1)) # if causal
[1] 0.9246231

If the subordinate clause is causal, there is a 92.5% chance that speakers will use the mc-sc ordering; if it is temporal, that chance drops to 44.6%.
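The same probabilities can also be obtained with predict() rather than assembling them from the coefficients (a sketch based on the model fitted above):

# Predicted probabilities of ORDER = "mc-sc" for both subordinate clause types
new_data <- data.frame(SUBORDTYPE = factor(c("temp", "caus"),
                                           levels = c("temp", "caus")))
predict(order.glm1, newdata = new_data, type = "response")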