# Load libraries
library(readxl)
library(tidyverse)
library(confintr) # for effect size calculation
# Load data
genitive <- read.csv("Grafmiller_genitive_alternation.csv", sep = "\t")
4.6 Chi-squared test
The chi-squared (\(\chi^2\)) test helps determine if there is a statistically significant association between two categorical variables. It compares the observed frequencies of categories with those expected under the null hypothesis. The \(\chi^2\) (chi-squared) score quantifies the difference between observed and expected frequencies for every cell in a contingency table. The greater the difference between observed and expected, the higher the \(\chi^2\) score and the lower the \(p\)-value, given the degrees of freedom. It is recommended to compute effect size measures and inspect the residuals to assess the nature of the association.
Suggested reading
For linguists:
Gries (2021): Chapter 4.1.2.1
General:
Baguley (2012): Chapter 4
Agresti and Kateri (2022): Chapter 5
Preparation
You can find the full R script associated with this unit here.
The Pearson \(\chi^2\)-test of independence
The first step of any significance test involves setting up the null and alternative hypotheses. In this unit, we will perform a sample analysis of the relationship between genitive Type and Possessor Animacy. Specifically, we will focus on the independence of these two discrete variables (i.e., the presence or absence of correlation).
- \(H_0:\) The variables Type and Possessor Animacy are independent.
- \(H_1:\) The variables Type and Possessor Animacy are not independent.
Next, we compute a test statistic that indicates how strongly our data conform to the proportions expected under \(H_0\). To this end, we will need two types of values:
- the observed frequencies \(f_{ij}\) present in our data set
- the expected frequencies \(e_{ij}\), which we would expect to see if \(H_0\) were true,
where \(f_{ij} \in \mathbb{N}\) and \(e_{ij} \in \mathbb{R}_{\geq 0}\) (note that expected frequencies need not be whole numbers). The indices \(i\) and \(j\) uniquely identify the cell counts in all row-column combinations of a contingency table.
Observed frequencies
The table below represents a generic contingency table where \(Y\) and \(X\) are categorical variables taking the values \(Y = \{y_1, y_2, \dots, y_i\}\) and \(X = \{x_1, x_2, \dots, x_j\}\). Each cell indicates the observed count \(f_{ij}\) corresponding to the \(i\)-th row and \(j\)-th column.
| \(Y \backslash X\) | \(x_1\) | \(x_2\) | … | \(x_j\) |
|---|---|---|---|---|
| \(y_1\) | \(f_{11}\) | \(f_{12}\) | … | \(f_{1j}\) |
| \(y_2\) | \(f_{21}\) | \(f_{22}\) | … | \(f_{2j}\) |
| … | … | … | … | … |
| \(y_i\) | \(f_{i1}\) | \(f_{i2}\) | … | \(f_{ij}\) |
In the genitive data, the observed frequencies correspond to how often each Type value (i.e., s and of) is attested for a given Possessor Animacy group (i.e., animate and inanimate). These can be computed in a very straightforward fashion by applying R’s table() function to the variables of interest.
observed_freqs <- table(genitive$Type, genitive$Possessor_Animacy2)
print(observed_freqs)
animate inanimate
of 592 2511
s 1451 544
Expected frequencies
The expected frequencies require a few additional steps. Usually, these steps are performed automatically when conducting the chi-squared test in R, so you don’t have to worry about calculating them by hand. We will do it anyway to drive home the rationale of the test.
| \(Y \backslash X\) | \(x_1\) | \(x_2\) | … | \(x_j\) |
|---|---|---|---|---|
| \(y_1\) | \(e_{11}\) | \(e_{12}\) | … | \(e_{1j}\) |
| \(y_2\) | \(e_{21}\) | \(e_{22}\) | … | \(e_{2j}\) |
| … | … | … | … | … |
| \(y_i\) | \(e_{i1}\) | \(e_{i2}\) | … | \(e_{ij}\) |
The expected frequencies \(e_{ij}\) are given by the formula in Equation 1. In concrete terms, we go through each cell in the cross-table, multiply the corresponding row sum with the column sum, and divide the result by the total number of occurrences in the sample. For example, there are \(592\) occurrences of of genitives in the animate group. The row sum is \(592 + 2511 = 3103\) and the column sum is \(592 + 1451 = 2043\). Next, we take their product \(3103 \cdot 2043\) and divide it by the total number of observations, which is \(592 + 2511 + 1451 + 544 = 5098\). Thus we obtain an expected frequency of \(\frac{3103 \times 2043}{5098} \approx 1243.5\) under the null hypothesis.
\[ e_{ij} = \frac{i\textrm{th row sum} \times j \textrm{th column sum}}{\textrm{sample size}} \tag{1}\]
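As a quick sanity check, this arithmetic can be reproduced directly in R:
# Expected frequency for the of/animate cell (Equation 1)
(592 + 2511) * (592 + 1451) / (592 + 2511 + 1451 + 544) # ~1243.5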
The expected frequencies for our combination of variables are shown below. In which cells can you see the greatest deviations between observed and expected frequencies?
## Calculate row totals
row_totals <- rowSums(observed_freqs)

## Calculate column totals
col_totals <- colSums(observed_freqs)

## Total number of observations
total_obs <- sum(observed_freqs)

## Calculate expected frequencies
expected_freqs <- outer(row_totals, col_totals) / total_obs
print(expected_freqs)
animate inanimate
of 1243.5129 1859.487
s 799.4871 1195.513
Conducting the test
The \(\chi^2\)-test now offers a convenient way of quantifying the differences between the two tables above. It measures how much the observed frequencies deviate from the expected frequencies for each cell in a contingency table (cf. Heumann, Schomaker, and Shalabh 2022: 249-251). The gist of this procedure is summarised in Equation 2.
\[ \text{Chi-squared } \chi^2 = \sum_{\text{all cells}}{\frac{(\text{observed} - \text{expected})^2}{\text{expected}}} \tag{2}\]
Let \(I\) denote the total number of rows and \(J\) the total number of columns, with degrees of freedom equal to \((I-1) \cdot (J-1)\). We compute the joint squared deviations between \(f_{ij}\) and \(e_{ij}\) for every row-column combination:
\[ \chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}{\frac{(f_{ij} - e_{ij})^2}{e_{ij}}} \tag{3}\]
A visual representation is given in the table below.
| \(Y \backslash X\) | \(x_1\) | \(x_2\) | … | \(x_j\) |
|---|---|---|---|---|
| \(y_1\) | \(\frac{(f_{11} - e_{11})^2}{e_{11}}\) | \(\frac{(f_{12} - e_{12})^2}{e_{12}}\) | … | \(\frac{(f_{1j} - e_{1j})^2}{e_{1j}}\) |
| \(y_2\) | \(\frac{(f_{21} - e_{21})^2}{e_{21}}\) | \(\frac{(f_{22} - e_{22})^2}{e_{22}}\) | … | \(\frac{(f_{2j} - e_{2j})^2}{e_{2j}}\) |
| … | … | … | … | … |
| \(y_i\) | \(\frac{(f_{i1} - e_{i1})^2}{e_{i1}}\) | \(\frac{(f_{i2} - e_{i2})^2}{e_{i2}}\) | … | \(\frac{(f_{ij} - e_{ij})^2}{e_{ij}}\) |
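Before calling the built-in test, Equation 3 can be verified by hand from the two tables above; a minimal sketch (note that this yields the uncorrected statistic, whereas chisq.test() applies Yates’ continuity correction to \(2 \times 2\) tables by default):
# Manual chi-squared statistic (Equation 3), summed over all cells
sum((observed_freqs - expected_freqs)^2 / expected_freqs) # ~1455.6 (uncorrected)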
The implementation in R is a simple one-liner. Keep in mind that we have to supply absolute frequencies to chisq.test() rather than percentages.
freqs_test <- chisq.test(observed_freqs)
print(freqs_test)
Pearson's Chi-squared test with Yates' continuity correction
data: observed_freqs
X-squared = 1453.4, df = 1, p-value < 2.2e-16
Quite conveniently, the test object freqs_test stores the expected frequencies, which can be accessed via subsetting. Reassuringly, they are identical to what we calculated above!
freqs_test$expected
animate inanimate
of 1243.5129 1859.487
s 799.4871 1195.513
Assumptions of the chi-squared test
The \(\chi^2\)-test comes with certain statistical assumptions. Violations of these assumptions decrease the validity of the result and could, therefore, lead to wrong conclusions about relationships in the data. In such cases, alternative tests should be used.
- All observations are independent of each other.
- At least 80% of the expected frequencies are \(\geq\) 5.
- All observed frequencies are \(\geq\) 1.
Dependent observations (e.g., multiple measurements per participant) are a common problem in linguistic data and should always be controlled for. The gold standard is hierarchical (multilevel) models, which respect such grouping structures; a minimal sketch follows below.
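Purely for illustration, here is what such a model might look like with the lme4 package. Note that the grouping variable Speaker is a hypothetical column assumed for this sketch; it is not part of the genitive data set as loaded above.
# Minimal sketch of a mixed-effects logistic regression (assumes lme4 is installed)
library(lme4)
# 'Speaker' is a HYPOTHETICAL grouping column, used purely for illustration
genitive_glmm <- glmer(factor(Type) ~ Possessor_Animacy2 + (1 | Speaker),
                       data = genitive, family = binomial)
summary(genitive_glmm)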
If the (expected) frequencies are low, it is recommended to use a more robust test such as Fisher’s Exact Test or the log-likelihood ratio test (\(G\)-test).
While the \(\chi^2\)-test can only approximate the \(p\)-value, Fisher’s Exact Test provides an exact solution. Note that for anything more complex than a \(2 \times 2\) table, it becomes considerably more computationally expensive; if it takes too long, set simulate.p.value = TRUE. Drawing on the hypergeometric distribution (see ?dhyper), it computes the probability of all frequency tables that are as extreme as or more extreme than the one observed.
fisher.test(observed_freqs)
Fisher's Exact Test for Count Data
data: observed_freqs
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.07722662 0.10118258
sample estimates:
odds ratio
0.08841846
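To illustrate the hypergeometric underpinning, we can compute the probability of the observed table itself, holding all margins fixed; the arguments below follow from the marginal totals calculated earlier:
# Point probability of the observed table under fixed margins:
# 3103 of-genitives ('white balls'), 1995 s-genitives ('black balls'),
# 2043 draws (the animate column total), of which 592 are of-genitives
dhyper(x = 592, m = 3103, n = 1995, k = 2043)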
The \(G^2\)-statistic is analogous to \(\chi^2\), but it tends to be more robust for lower observed counts. It is defined as
\[ G^2 = 2\sum_{i=1}^{I}\sum_{j=1}^{J} f_{ij} \ln\left({\frac{f_{ij}}{e_{ij}}}\right) \tag{4}\]
and implemented in R via the DescTools package.
# Load library (install if necessary)
library(DescTools)
# Perform G-test (preferably for tables with more than 2 rows/columns)
GTest(observed_freqs)
Log likelihood ratio (G-test) test of independence without correction
data: observed_freqs
G = 1502.8, X-squared df = 1, p-value < 2.2e-16
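The value reported by GTest() can be reproduced by applying Equation 4 directly:
# Manual G-squared statistic (Equation 4)
2 * sum(observed_freqs * log(observed_freqs / expected_freqs)) # ~1502.8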
How do I make sense of the test results?
The test output has three ‘ingredients’:
- the chi-squared score (X-squared)
- the degrees of freedom (df)
- the p-value.
It is absolutely essential to report all three of these, as they determine each other. Here are a few possible wordings that could be used:
According to a \(\chi^2\)-test, there is a highly significant association between genitive type and possessor animacy at \(p < 0.001\) (\(\chi^2 = 1453.4, df = 1\)), thus justifying the rejection of \(H_0\).
A \(\chi^2\)-test revealed a highly significant association between genitive type and possessor animacy (\(\chi^2(1) = 1453.4\), \(p < 0.001\)), supporting the rejection of \(H_0\).
The \(\chi^2\)-test results (\(\chi^2 = 1453.4\), \(df = 1\), \(p < 0.001\)) provide strong evidence against the null hypothesis, demonstrating a significant association between genitive type and possessor animacy.
On the whole, the test results suggest that the dependent variable Type and the explanatory variable Possessor Animacy are not independent of each other. The probability of randomly observing such a distribution under the assumption of no effect is lower than 0.05 (in fact, it is \(< 2.2 \cdot 10^{-16}\)), which is sufficient to reject the null hypothesis.
We can infer that a speaker’s choice of genitive Type is very likely influenced by the animacy of the possessor; in other words, these two variables are correlated. However, there are still several things the test does not tell us:
- Are there certain variable combinations where the \(\chi^2\)-scores are particularly high?
- How strongly do Type and Possessor Animacy influence each other?
Pearson residuals
If we’re interested in which cells show the greatest difference between observed and expected frequencies, one option is to inspect the Pearson residuals (cf. Equation 5).
\[ \text{residuals} = \frac{\text{observed} - \text{expected}}{\sqrt{\text{expected}}} \tag{5}\]
These can be accessed via the test results stored in freqs_test.
freqs_test$residuals
animate inanimate
of -18.47557 15.10868
s 23.04185 -18.84282
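For transparency, these residuals can also be reproduced manually from Equation 5:
# Manual Pearson residuals (Equation 5)
(observed_freqs - expected_freqs) / sqrt(expected_freqs)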
The function assocplot() can automatically compute the Pearson residuals for any given contingency table and create a plot that highlights their contributions. If a bar is above the dashed line, it is black and indicates that a category is observed more frequently than expected (e.g., of genitives in the inanimate group). Conversely, bars are coloured grey if a category is considerably less frequent than expected, such as s-genitives with inanimate possessors.
assocplot(t(observed_freqs), col = c("black", "lightgrey"))
The chi-squared test only provides a \(p\)-value for the entire contingency table. But what if we wanted to test the residuals for their significance as well? Configural Frequency Analysis (CFA; Krauth and Lienert 1973) allows us to do exactly that: it performs a significance test for all combinations of variable values in a cross-table. Moreover, CFA is not limited to two variables; technically, users can test for associations between arbitrary numbers of variables, but should be aware of the increasing complexity of interpretation.
library(cfa) # install the package beforehand if necessary

# Get the observed counts and convert them to a data frame
config_df <- as.data.frame(observed_freqs)

# Convert to matrix
configs <- as.matrix(config_df[, 1:2]) # first two columns contain the configurations (= combinations of variable values)
counts <- config_df$Freq # the Freq column contains the corresponding counts

# Perform CFA on configurations and counts; apply Bonferroni correction for multiple testing
cfa_output <- cfa(configs, counts, bonferroni = TRUE)

# Print output
print(cfa_output)
*** Analysis of configuration frequencies (CFA) ***
label n expected Q chisq p.chisq sig.chisq z
1 s animate 1451 799.4871 0.1515671 530.9268 0 TRUE 25.09332
2 s inanimate 544 1195.5129 0.1669481 355.0519 0 TRUE -21.53650
3 of animate 592 1243.5129 0.1690271 341.3468 0 TRUE -21.24783
4 of inanimate 2511 1859.4871 0.2011766 228.2722 0 TRUE 18.95630
p.z sig.z
1 0 TRUE
2 1 TRUE
3 1 TRUE
4 0 TRUE
Summary statistics:
Total Chi squared = 1455.598
Total degrees of freedom = 1
p = 0
Sum of counts = 5098
Levels:
Var1 Var2
2 2
Effect size
The \(p\)-value only indicates the presence of an association, not its strength, regardless of how low it may be. It does not convey how much two variables determine each other. For this reason, it is highly recommended to report an effect size measure alongside the \(p\)-value. One such measure is Cramér’s \(V\), which takes values in the interval \([0, 1]\):
\[ V = \sqrt{\frac{\chi^2}{n \times (\min(\textrm{nrow}, \textrm{ncol}) - 1)}}, \tag{6}\]
where \(n\) is the sample size and nrow and ncol are the numbers of rows and columns in the contingency table.
The package confintr implements this in its cramersv() function:
cramersv(freqs_test)
[1] 0.5339337
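This value can be reproduced manually from Equation 6, drawing on the \(\chi^2\) statistic stored in the test object (here the Yates-corrected statistic):
# Manual Cramér's V (Equation 6)
sqrt(unname(freqs_test$statistic) /
       (sum(observed_freqs) * (min(dim(observed_freqs)) - 1)))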
The closer \(V\) is to 1, the stronger the association between the two categorical variables. Conversely, if \(V = 0\), the variables are completely independent. There are various guidelines in the literature that provide thresholds for “small”, “moderate” and “large” effects, yet these are rarely justified on theoretical grounds and could be viewed as arbitrary.
Exercises
Exercise 1 Schröter & Kortmann (2016) investigate the relationship between subject realisation (overt vs. null) and the grammatical category Person (1.p. vs. 2.p. vs. 3.p.) in three varieties of English (Great Britain vs. Hong Kong vs. Singapore). They report the following test results (2016: 235):
Chi-square test results: \[ \begin{align} \text{Singapore: \quad} & \chi^2 = 3.3245, df = 2, p = 0.1897 \\ \text{Hong Kong: \quad} & \chi^2 = 40.799, df = 2, p < 0.01 \\ \text{Great Britain: \quad} & \chi^2 = 3.6183, df = 2, p = 0.1638 \\ \end{align} \]
- What hypotheses are the authors testing?
- Assuming a significance level \(\alpha = 0.05\), what statistical conclusions can be drawn from the test results?
- What could be the theoretical implications of these results?
Exercise 2 Conduct a small-scale analysis of genitive Type by Genre (~ 1 page). Structure your write-up as follows:
- Research question and hypotheses
- Statistical methods
- Results (frequency tables, plots, test results, effect size)
- Interpretation of standardised residuals
- (Optional: Interpretation of configural frequency analysis)
- Brief theoretical assessment
- Conclusion