20  Chi-squared test

Author: Vladimir Buskin

Affiliation: Catholic University of Eichstätt-Ingolstadt

Abstract

The chi-squared (\(\chi^2\)) test helps determine whether there is a statistically significant association between two categorical variables. It compares the observed frequencies of categories with those expected under the null hypothesis. The \(\chi^2\) score quantifies the difference between observed and expected frequencies for every cell in a contingency table. The greater the difference between observed and expected frequencies, the higher the \(\chi^2\) score and the lower the \(p\)-value, given the degrees of freedom. It is recommended to compute effect size measures and inspect the residuals to assess the nature of the association.

20.1 Suggested reading

For linguists:

Gries (2021): Chapter 4.1.2.1

General:

Baguley (2012): Chapter 4

Agresti and Kateri (2022): Chapter 5

20.2 Preparation

Script

You can find the full R script associated with this unit here.

#  Load libraries
library(readxl)
library(tidyverse)
library(confintr) # for effect size calculation

# Load data
cl.order <- read_xlsx("Paquot_Larsson_2020_data.xlsx")

20.3 The Pearson \(\chi^2\)-test of independence

The first step of any significance test involves setting up the null and alternative hypotheses. In this unit, we will perform a sample analysis of the relationship between clause ORDER and the type of subordinate clause (SUBORDTYPE). Specifically, we will focus on the independence of these two discrete variables (i.e., the presence or absence of an association).

  • \(H_0:\) The variables ORDER and SUBORDTYPE are independent.

  • \(H_1:\) The variables ORDER and SUBORDTYPE are not independent.

The core idea is “that the probability distribution of the response variable is the same for each group” (Agresti and Kateri 2022: 177). If clause ORDER is the response variable and SUBORDTYPE the explanatory variable, independence would entail that the outcomes of the response variable ORDER = "mc-sc" and ORDER = "sc-mc" are equally likely to occur in the groups SUBORDTYPE = "temp" and SUBORDTYPE = "caus".

The term probability distribution refers to a mathematical function that assigns probabilities to the outcomes of a variable. If we consider two variables at the same time, such as \(X\) and \(Y\), each has a marginal probability function, \(f_1(x)\) and \(f_2(y)\) respectively. Under independence, conditioning on the other variable does not change these distributions, i.e., the following equivalence holds:

\[ f(x \mid y) = f_1(x) \text{ and } f(y \mid x) = f_2(y). \tag{20.1}\]

Thus, the null hypothesis assumes that the probabilities of all combinations of values of the two variables (here, ORDER and SUBORDTYPE), denoted by \(\pi_{ij}\), satisfy the relationship in Equation 20.1. This can be stated succinctly as

\[ H_0 : \pi_{ij} = P(X = i)P(Y = j). \tag{20.2}\]

Next, we compute a test statistic that indicates how strongly our data conforms to \(H_0\), such as Pearson’s \(\chi^2\). To this end, we will need two types of values:

  • the observed frequencies \(f_{ij}\) present in our data set

  • the expected frequencies \(e_{ij}\), which we would expect to see if \(H_0\) were true,

where \(f_{ij} \in \mathbb{N}\) and \(e_{ij} \geq 0\) (expected frequencies need not be whole numbers). The indices \(i\) and \(j\) uniquely identify the cell counts in all row-column combinations of a contingency table.

The table below represents a generic contingency table where the rows correspond to the values \(X = \{x_1, x_2, \dots, x_i\}\) and the columns to the values \(Y = \{y_1, y_2, \dots, y_j \}\). Each cell contains the observed count \(f_{ij}\) for the \(i\)-th row and \(j\)-th column.

\[
\begin{array}{c|cccc}
       & y_1    & y_2    & \cdots & y_j    \\
\hline
x_1    & f_{11} & f_{12} & \cdots & f_{1j} \\
x_2    & f_{21} & f_{22} & \cdots & f_{2j} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_i    & f_{i1} & f_{i2} & \cdots & f_{ij} \\
\end{array}
\]

In the cl.order data, the observed frequencies correspond to how often each ORDER value (i.e., mc-sc and sc-mc) is attested for a given SUBORDTYPE (i.e., temp and caus). This can be done in a very straightforward fashion using R’s table() function on the variables of interest.

observed_freqs <- table(cl.order$ORDER, cl.order$SUBORDTYPE)

print(observed_freqs)
       
        caus temp
  mc-sc  184   91
  sc-mc   15  113

The expected frequencies require a few additional steps. Usually, these steps are performed automatically when conducting the chi-squared test in R, so you don’t have to worry about calculating them by hand. We will do it anyway to drive home the rationale of the test.

The expected frequencies \(e_{ij}\) are given by the formula in Equation 20.3. In concrete terms, we go through each cell in the cross-table and multiply the corresponding row sums with the column sums, dividing the result by the total number of occurrences in the sample. For example, there are \(184\) occurrences of mc-sc clause orders where the subordinate clause is causal. The row sum is \(184 + 91 = 275\) and the column sum is \(184 + 15 = 199\). Next, we take their product \(275 \times 199\) and divide it by the total number of observations, which is \(184 + 91 + 15 + 113 = 403\). Thus we obtain an expected frequency of \(\frac{275 \times 199}{403} = 135.79\) under the null hypothesis.

\[ e_{ij} = \frac{i\textrm{th row sum} \times j \textrm{th column sum}}{\textrm{number of observations}} \tag{20.3}\]
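The arithmetic from the example above can be verified directly in R using the cell counts from the observed table (a quick sanity check):

# Expected frequency for ORDER = "mc-sc" and SUBORDTYPE = "caus"
(184 + 91) * (184 + 15) / (184 + 91 + 15 + 113)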

The expected frequencies for our combination of variables are shown below. In which cells can you see the greatest deviations between observed and expected frequencies?

## Calculate row totals
row_totals <- rowSums(observed_freqs)

## Calculate column totals
col_totals <- colSums(observed_freqs)

## Total number of observations
total_obs <- sum(observed_freqs)

## Calculate expected frequencies
expected_freqs <- outer(row_totals, col_totals) / total_obs

print(expected_freqs)
           caus      temp
mc-sc 135.79404 139.20596
sc-mc  63.20596  64.79404

The \(\chi^2\)-test now offers a convenient way of quantifying the differences between the two tables above. It measures how much the observed frequencies deviate from the expected frequencies for each cell in a contingency table (cf. Heumann, Schomaker, and Shalabh 2022: 249-251). The gist of this procedure is summarised in Equation 20.4.

\[ \text{contribution of each cell to } \chi^2 =\frac{(\text{observed} - \text{expected})^2}{\text{expected}} \tag{20.4}\]

Given \(n\) observations, the squared deviations between \(f_{ij}\) and \(e_{ij}\) across all cells contribute to the final \(\chi^2\)-score, which is defined as

\[ \chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}{\frac{(f_{ij} - e_{ij})^2}{e_{ij}}} \tag{20.5}\]

for \(i = 1, ..., I\) and \(j = 1, ..., J\), with \(df = (\textrm{number of rows} -1) \times (\textrm{number of columns} - 1)\) degrees of freedom.
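To see how this maps onto the objects computed above, the \(\chi^2\)-score and the degrees of freedom can also be calculated by hand (a minimal sketch; note that chisq.test() additionally applies Yates' continuity correction to \(2 \times 2\) tables by default, so its reported statistic is slightly smaller than this uncorrected sum):

# Per-cell contributions to chi-squared (Equation 20.4)
cell_contributions <- (observed_freqs - expected_freqs)^2 / expected_freqs

# Sum over all cells (Equation 20.5)
sum(cell_contributions)

# Degrees of freedom
(nrow(observed_freqs) - 1) * (ncol(observed_freqs) - 1)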

The implementation in R is a simple one-liner. Keep in mind that we have to supply absolute frequencies to chisq.test() rather than percentages.

freqs_test <- chisq.test(observed_freqs)

print(freqs_test)

    Pearson's Chi-squared test with Yates' continuity correction

data:  observed_freqs
X-squared = 104.24, df = 1, p-value < 2.2e-16

Quite conveniently, the test object freqs_test stores the expected frequencies, which can be easily accessed via subsetting. Luckily, they are identical to what we calculated above!

freqs_test$expected
       
             caus      temp
  mc-sc 135.79404 139.20596
  sc-mc  63.20596  64.79404
Assumptions of the chi-squared test

This \(\chi^2\)-test comes with certain statistical assumptions. Violations of these assumptions decrease the validity of the result and could, therefore, lead to wrong conclusions about relationships in the data. In such cases, alternative tests should be used.

  1. All observations are independent of each other.
  2. At least 80% of the expected frequencies are \(\geq\) 5.
  3. All observed frequencies are \(\geq\) 1.

If assumptions 2 and 3 are violated, it is recommended to use a more robust test such as Fisher’s exact test (see ?fisher.test for details) or the likelihood ratio test (\(G\)-test).
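The two frequency conditions can be checked from the stored test object, and Fisher’s exact test can be run on the same contingency table if they fail (a minimal sketch; output not shown):

# Assumption 2: at least 80% of expected frequencies should be >= 5
mean(freqs_test$expected >= 5) >= 0.8

# Assumption 3: all observed frequencies should be >= 1
all(observed_freqs >= 1)

# Fallback in case of violations: Fisher's exact test
fisher.test(observed_freqs)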

In the case of dependent observations (e.g., multiple measurements per participant), the default approach is to fit a multilevel model that can control for grouping factors (see mixed-effects regression in Section 23.3).

20.3.1 How do I make sense of the test results?

The test output has three ‘ingredients’:

  • the chi-squared score (X-squared)
  • the degrees of freedom (df)
  • the p-value.

It is essential to report all three of these, as they determine each other. Here are a few possible wordings that could be used:

According to a \(\chi^2\)-test, there is a highly significant association between clause ORDER and SUBORDTYPE at \(p < 0.001\) (\(\chi^2 = 104.24, df = 1\)), thus justifying the rejection of \(H_0\).

A \(\chi^2\)-test revealed a highly significant association between clause ORDER and SUBORDTYPE (\(\chi^2(1) = 104.24\), \(p < 0.001\)), supporting the rejection of \(H_0\).

The \(\chi^2\)-test results (\(\chi^2 = 104.24\), \(df = 1\), \(p < 0.001\)) provide strong evidence against the null hypothesis, demonstrating a significant association between clause ORDER and SUBORDTYPE.

The test results suggest that the dependent variable ORDER and the explanatory variable SUBORDTYPE are not independent of each other. If the variables were in fact independent, the probability of observing frequency patterns at least as extreme as those in the cl.order data would be lower than 0.001 (i.e., 0.1%), which justifies rejecting the null hypothesis at \(\alpha = 0.05\).

We can infer that a speaker’s choice of clause ORDER is very likely influenced by the semantic type of subordinate clause; in other words, these two variables are correlated. However, there are still several things the test does not tell us:

  • Are there certain variable combinations where the \(\chi^2\)-scores are particularly high?

  • How strongly do ORDER and SUBORDTYPE influence each other?

  • Does a causal subordinate clause make the mc-sc clause order more likely?

20.4 Pearson residuals

If we’re interested in what cells show the greatest difference between observed and expected frequencies, an option would be to inspect the Pearson residuals (cf. Equation 20.6).

\[ \text{residuals} = \frac{\text{observed} - \text{expected}}{\sqrt{\text{expected}}} \tag{20.6}\]
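Equation 20.6 translates directly into R using the tables computed earlier (a sketch; the values should match the built-in residuals shown below):

# Pearson residuals calculated by hand
(observed_freqs - expected_freqs) / sqrt(expected_freqs)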

These residuals are also stored in the test object freqs_test and can be accessed via subsetting.

freqs_test$residuals
       
             caus      temp
  mc-sc  4.136760 -4.085750
  sc-mc -6.063476  5.988708

The function assocplot() can automatically compute the Pearson residuals for any given contingency table and create a plot that highlights their contributions. If a bar is above the dashed line, it is black and indicates that a category is observed more frequently than expected (e.g., causal subordinate clauses in the mc-sc order). Conversely, bars are coloured grey if a category is considerably less frequent than expected, such as caus in sc-mc.

assocplot(t(observed_freqs), col = c("black", "lightgrey"))

The chi-squared test only provides a \(p\)-value for the entire contingency table. But what if we wanted to test the residuals for their significance as well? Configural Frequency Analysis (Krauth and Lienert 1973) allows us to do exactly that: It performs a significance test for all combinations of variable values in a cross-table. Moreover, CFA is not limited to two variables only. Technically, users can test for associations between arbitrary numbers of variables, but should be aware of the increasing complexity of interpretation.

library(cfa) # install library beforehand

# Get the observed counts and convert them to a data frame
config_df <- as.data.frame(observed_freqs)

# Convert to matrix
configs <- as.matrix(config_df[, 1:2])  # First two columns contain the configurations
counts <- config_df$Freq                # The Freq column contains the counts

# Perform CFA
cfa_output <- cfa(configs, counts, bonferroni = TRUE)

# Print output
print(cfa_output)

*** Analysis of configuration frequencies (CFA) ***

       label   n  expected         Q    chisq      p.chisq sig.chisq         z
1 sc-mc caus  15  63.20596 0.1418682 36.76575 1.332103e-09      TRUE -6.671872
2 sc-mc temp 113  64.79404 0.1425343 35.86463 2.115143e-09      TRUE  6.469444
3 mc-sc caus 184 135.79404 0.1804075 17.11278 3.522441e-05      TRUE  5.027611
4 mc-sc temp  91 139.20596 0.1827409 16.69335 4.393467e-05      TRUE -5.102385
           p.z sig.z
1 1.000000e+00  TRUE
2 4.918210e-11  TRUE
3 2.483137e-07  TRUE
4 9.999998e-01  TRUE


Summary statistics:

Total Chi squared         =  106.4365 
Total degrees of freedom  =  1 
p                         =  0 
Sum of counts             =  403 

Levels:

Var1 Var2 
   2    2 

20.5 Effect size

The \(p\)-value only indicates the presence of an association, not its strength: no matter how low it is, it does not convey how strongly two variables determine each other. For this reason, it is highly recommended to report an effect size measure alongside the \(p\)-value. One such measure is Cramér’s \(V\), which takes values in the interval \([0, 1]\):

\[ V = \sqrt{\frac{\chi^2}{n \times (min(nrow, ncol) - 1)}}. \tag{20.7}\]

The package confintr implements this in its cramersv() function:

cramersv(freqs_test)
[1] 0.5085863
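To see how cramersv() relates to Equation 20.7, the value can also be computed by hand (a sketch; note that freqs_test$statistic stores the Yates-corrected \(\chi^2\) score for this \(2 \times 2\) table):

# Cramér's V from the stored chi-squared statistic
chi_sq <- unname(freqs_test$statistic)
n <- sum(observed_freqs)
k <- min(dim(observed_freqs))   # smaller of the two table dimensions
sqrt(chi_sq / (n * (k - 1)))    # should match cramersv(freqs_test)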

The closer \(V\) is to 1, the stronger the association between the two categorical variables; conversely, if \(V = 0\), the variables are completely independent. There are various guidelines in the literature that provide thresholds for “small”, “moderate” and “large” effects, yet these are rarely justified on theoretical grounds and could be viewed as arbitrary.

20.6 Likelihood-based inference

Since causal subordinate clauses typically follow the main clause in the sample, it may seem tempting to draw conclusions about the probability of a certain outcome of clause ORDER. Unfortunately, the \(\chi^2\)-test is not designed for this kind of statistical inference. An approach that would actually allow the estimation of probabilities is logistic regression.

# Convert to factors and define reference levels
cl.order$ORDER <- factor(cl.order$ORDER, levels = c("sc-mc", "mc-sc"))
cl.order$SUBORDTYPE <- factor(cl.order$SUBORDTYPE, levels = c("temp", "caus")) 

# Fit logistic regression model
order.glm1 <- glm(ORDER ~ SUBORDTYPE, data = cl.order, family = "binomial")

# Store model parameters
intercept <- order.glm1$coefficients[1]
slope <- order.glm1$coefficients[2]

# Convert to probabilities
unname(plogis(intercept + slope * 0)) # if temporal
[1] 0.4460784
unname(plogis(intercept + slope * 1)) # if causal
[1] 0.9246231

If the subordinate clause is causal, there is a 92.5% chance that speakers will use the mc-sc ordering; if it is temporal, that chance drops to 44.6%.
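The same probabilities can also be obtained with predict() rather than assembling them from the coefficients (a sketch based on the model fitted above):

# Predicted probabilities of ORDER = "mc-sc" for both subordinate clause types
new_data <- data.frame(SUBORDTYPE = factor(c("temp", "caus"),
                                           levels = c("temp", "caus")))
predict(order.glm1, newdata = new_data, type = "response")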