
4.6 Chi-squared test

Inferential statistics

Author: Vladimir Buskin
Affiliation: Catholic University of Eichstätt-Ingolstadt

Abstract

The chi-squared (\(\chi^2\)) test helps determine whether there is a statistically significant association between two categorical variables. It compares the observed frequencies of categories with those expected under the null hypothesis. The \(\chi^2\) score quantifies the difference between observed and expected frequencies for every cell in a contingency table: the greater the difference, the higher the \(\chi^2\) score and the lower the \(p\)-value, given the degrees of freedom. It is recommended to compute effect size measures and inspect the residuals to assess the nature of the association.

Suggested reading

For linguists:

Gries (2021): Chapter 4.1.2.1

General:

Baguley (2012): Chapter 4

Agresti and Kateri (2022): Chapter 5

Preparation

Script

You can find the full R script associated with this unit here.

#  Load libraries
library(readxl)
library(tidyverse)
library(confintr) # for effect size calculation

# Load data
genitive <- read.csv("Grafmiller_genitive_alternation.csv", sep = "\t")

The Pearson \(\chi^2\)-test of independence

The first step of any significance test involves setting up the null and alternative hypotheses. In this unit, we will perform a sample analysis of the relationship between genitive Type and Possessor Animacy. Specifically, we will focus on the independence of these two discrete variables (i.e., the presence or absence of correlation).

  • \(H_0:\) The variables Type and Possessor Animacy are independent.

  • \(H_1:\) The variables Type and Possessor Animacy are not independent.

Next, we compute a test statistic that indicates how strongly our data conforms to proportions expected under \(H_0\). To this end, we will need two types of values:

  • the observed frequencies \(f_{ij}\) present in our data set

  • the expected frequencies \(e_{ij}\), which we would expect to see if \(H_0\) were true,

where \(f_{ij} \in \mathbb{N}\) and \(e_{ij} \in \mathbb{R}_{\geq 0}\). The indices \(i\) and \(j\) uniquely identify the cell counts in all row-column combinations of a contingency table.

Observed frequencies

The table below represents a generic contingency table for two categorical variables \(Y\) and \(X\) with the values \(Y = \{y_1, y_2, \dots, y_i \}\) and \(X = \{x_1, x_2, \dots, x_j\}\). Rows correspond to the values of \(Y\) and columns to those of \(X\); each cell contains the observed count \(f_{ij}\) for the \(i\)-th row and \(j\)-th column.

\[
\begin{array}{c|cccc}
 & x_1 & x_2 & \cdots & x_j \\
\hline
y_1 & f_{11} & f_{12} & \cdots & f_{1j} \\
y_2 & f_{21} & f_{22} & \cdots & f_{2j} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
y_i & f_{i1} & f_{i2} & \cdots & f_{ij}
\end{array}
\]

In the genitive data, the observed frequencies correspond to how often each Type value (i.e., s and of) is attested for a given Possessor Animacy group (i.e., animate and inanimate). These can be computed in a very straightforward fashion by applying R’s table() function to the variables of interest.

observed_freqs <- table(genitive$Type, genitive$Possessor_Animacy2)

print(observed_freqs)
    
     animate inanimate
  of     592      2511
  s     1451       544

Expected frequencies

The expected frequencies require a few additional steps. Usually, these steps are performed automatically when conducting the chi-squared test in R, so you don’t have to worry about calculating them by hand. We will do it anyway to drive home the rationale of the test.

\[
\begin{array}{c|cccc}
 & x_1 & x_2 & \cdots & x_j \\
\hline
y_1 & e_{11} & e_{12} & \cdots & e_{1j} \\
y_2 & e_{21} & e_{22} & \cdots & e_{2j} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
y_i & e_{i1} & e_{i2} & \cdots & e_{ij}
\end{array}
\]

The expected frequencies \(e_{ij}\) are given by the formula in Equation 1. In concrete terms, we go through each cell in the cross-table, multiply the corresponding row sum with the column sum, and divide the result by the total number of occurrences in the sample. For example, there are \(592\) occurrences of of-genitives in the animate group. The row sum is \(592 + 2511 = 3103\) and the column sum is \(592 + 1451 = 2043\). Next, we take their product \(3103 \cdot 2043\) and divide it by the total number of observations, which is \(592 + 2511 + 1451 + 544 = 5098\). Thus we obtain an expected frequency of \(\frac{3103 \times 2043}{5098} \approx 1243\) under the null hypothesis.

\[ e_{ij} = \frac{i\textrm{th row sum} \times j \textrm{th column sum}}{\textrm{sample size}} \tag{1}\]
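The arithmetic from the running example can be checked with a one-liner (the counts are taken from the observed frequency table above):

# Expected frequency for the of/animate cell: (row sum * column sum) / sample size
3103 * 2043 / 5098
[1] 1243.513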

The expected frequencies for our combination of variables are shown below. In which cells can you see the greatest deviations between observed and expected frequencies?

## Calculate row totals
row_totals <- rowSums(observed_freqs)

## Calculate column totals
col_totals <- colSums(observed_freqs)

## Total number of observations
total_obs <- sum(observed_freqs)

## Calculate expected frequencies
expected_freqs <- outer(row_totals, col_totals) / total_obs

print(expected_freqs)
     animate inanimate
of 1243.5129  1859.487
s   799.4871  1195.513

Conducting the test

The \(\chi^2\)-test now offers a convenient way of quantifying the differences between the two tables above. It measures how much the observed frequencies deviate from the expected frequencies for each cell in a contingency table (cf. Heumann, Schomaker, and Shalabh 2022: 249-251). The gist of this procedure is summarised in Equation 2.

\[ \text{Chi-squared } \chi^2 =\frac{(\text{observed} - \text{expected})^2}{\text{expected}} \tag{2}\]

Let \(I\) denote the total number of rows and \(J\) the total number of columns, with degrees of freedom equal to \((I-1) \cdot (J-1)\). We then sum the scaled squared deviations between \(f_{ij}\) and \(e_{ij}\) over every row-column combination:

\[ \chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}{\frac{(f_{ij} - e_{ij})^2}{e_{ij}}} \tag{3}\]

A visual representation is given in the table below.

\[
\begin{array}{c|cccc}
 & x_1 & x_2 & \cdots & x_j \\
\hline
y_1 & \frac{(f_{11} - e_{11})^2}{e_{11}} & \frac{(f_{12} - e_{12})^2}{e_{12}} & \cdots & \frac{(f_{1j} - e_{1j})^2}{e_{1j}} \\
y_2 & \frac{(f_{21} - e_{21})^2}{e_{21}} & \frac{(f_{22} - e_{22})^2}{e_{22}} & \cdots & \frac{(f_{2j} - e_{2j})^2}{e_{2j}} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
y_i & \frac{(f_{i1} - e_{i1})^2}{e_{i1}} & \frac{(f_{i2} - e_{i2})^2}{e_{i2}} & \cdots & \frac{(f_{ij} - e_{ij})^2}{e_{ij}}
\end{array}
\]
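To see what the built-in test computes under the hood, Equation 3 can also be transcribed by hand, reusing the observed and expected tables from above (a quick sketch):

## Sum the scaled squared deviations over all cells (Equation 3)
chi_sq <- sum((observed_freqs - expected_freqs)^2 / expected_freqs)

## Degrees of freedom: (I - 1) * (J - 1)
df <- (nrow(observed_freqs) - 1) * (ncol(observed_freqs) - 1)

chi_sq
[1] 1455.598
df
[1] 1

This is the uncorrected statistic; for \(2 \times 2\) tables, chisq.test() applies Yates’ continuity correction by default, which is why its output below is slightly lower.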

The implementation in R is a simple one-liner. Keep in mind that we have to supply absolute frequencies to chisq.test() rather than percentages.

freqs_test <- chisq.test(observed_freqs)

print(freqs_test)

    Pearson's Chi-squared test with Yates' continuity correction

data:  observed_freqs
X-squared = 1453.4, df = 1, p-value < 2.2e-16

Quite conveniently, the test object freqs_test stores the expected frequencies, which can be easily accessed via subsetting. Luckily, they are identical to what we calculated above!

freqs_test$expected
    
       animate inanimate
  of 1243.5129  1859.487
  s   799.4871  1195.513

Assumptions of the chi-squared test

This \(\chi^2\)-test comes with certain statistical assumptions. Violations of these assumptions decrease the validity of the results and could therefore lead to wrong conclusions about relationships in the data. In such cases, other tests should be used.

Important
  1. All observations are independent of each other.
  2. At least 80% of the expected frequencies are \(\geq\) 5.
  3. All observed frequencies are \(\geq\) 1.

Dependent observations (e.g., multiple measurements per participant) are a common problem in linguistic data and should always be controlled for. The gold standard is hierarchical (multilevel) modelling, which respects such grouping structures.

If the (expected) frequencies are low, it is recommended to use a more robust test such as Fisher’s Exact Test or the log-likelihood ratio test (\(G\)-test).
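The two frequency-based assumptions can be verified directly from the tables computed earlier (a minimal sketch):

# Proportion of cells with expected frequencies of at least 5 (should be >= 0.8)
mean(expected_freqs >= 5)
[1] 1

# Check whether all observed frequencies are at least 1
all(observed_freqs >= 1)
[1] TRUE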

Fisher’s Exact Test

While the \(\chi^2\)-test can only approximate the \(p\)-value, Fisher’s Exact Test can provide an exact solution. Note that for anything more complex than a \(2 \times 2\) table, it becomes considerably more computationally expensive; if it takes too long, set simulate.p.value = TRUE.

Drawing on the hypergeometric distribution (see ?dhyper), it computes the probability of all frequency tables that are as extreme as, or more extreme than, the one observed.
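To illustrate, the probability of the observed table alone, with all margins held fixed, can be obtained from dhyper(); in one possible parameterisation, x is the count in the top-left cell, m and n are the two row totals, and k is the first column total:

# Probability of exactly 592 of-genitives among the 2043 animate possessors,
# given 3103 of-genitives and 1995 s-genitives overall
dhyper(x = 592, m = 3103, n = 1995, k = 2043)

Fisher’s Exact Test aggregates such point probabilities over all tables that are at most as probable as the observed one.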

fisher.test(observed_freqs)

    Fisher's Exact Test for Count Data

data:  observed_freqs
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.07722662 0.10118258
sample estimates:
odds ratio 
0.08841846 
\(G\)-test

The \(G^2\)-statistic is analogous to \(\chi^2\), but it tends to be more robust for lower observed counts. It is defined as

\[ G^2 = 2\sum_{i=1}^{I}\sum_{j=1}^{J} f_{ij} \ln\left({\frac{f_{ij}}{e_{ij}}}\right) \tag{4}\]

and implemented in R via the DescTools package.

# Load library (install if necessary)
library(DescTools)

# Perform G-test (preferably for tables with more than 2 rows/columns)
GTest(observed_freqs)

    Log likelihood ratio (G-test) test of independence without correction

data:  observed_freqs
G = 1502.8, X-squared df = 1, p-value < 2.2e-16
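As a sanity check, Equation 4 translates almost verbatim into R and reproduces the \(G\) statistic reported above (a quick sketch using the tables computed earlier):

# G-squared: twice the sum of observed counts weighted by log(observed/expected)
2 * sum(observed_freqs * log(observed_freqs / expected_freqs))  # approx. 1502.8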

How do I make sense of the test results?

The test output has three ‘ingredients’:

  • the chi-squared score (X-squared)
  • the degrees of freedom (df)
  • the p-value.

It is absolutely essential to report all three of these, as they determine each other. Here are a few possible wordings:

According to a \(\chi^2\)-test, there is a highly significant association between genitive type and possessor animacy at \(p < 0.001\) (\(\chi^2 = 1453.4\), \(df = 1\)), thus justifying the rejection of \(H_0\).

A \(\chi^2\)-test revealed a highly significant association between genitive type and possessor animacy (\(\chi^2(1) = 1453.4\), \(p < 0.001\)), supporting the rejection of \(H_0\).

The \(\chi^2\)-test results (\(\chi^2 = 1453.4\), \(df = 1\), \(p < 0.001\)) provide strong evidence against the null hypothesis, demonstrating a significant association between genitive type and possessor animacy.

On the whole, the test results suggest that the dependent variable Type and the explanatory variable Possessor Animacy are not independent of each other. The probability of randomly observing such a distribution under the assumption of no effect is lower than 0.05 (actually, it is \(< 2.2 \cdot 10^{-16}\)), which is sufficient to reject the null hypothesis.

We can infer that a speaker’s choice of genitive Type is very likely influenced by the animacy of the possessor; in other words, these two variables are correlated. However, there are still several things the test does not tell us:

  • Are there certain variable combinations where the \(\chi^2\)-scores are particularly high?

  • How strongly do Type and Possessor Animacy influence each other?

Pearson residuals

If we’re interested in which cells show the greatest differences between observed and expected frequencies, one option is to inspect the Pearson residuals (cf. Equation 5).

\[ \text{residuals} = \frac{\text{observed} - \text{expected}}{\sqrt{\text{expected}}} \tag{5}\]

These can be accessed via the test results stored in freqs_test.

freqs_test$residuals
    
       animate inanimate
  of -18.47557  15.10868
  s   23.04185 -18.84282
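These values are simply a direct transcription of Equation 5 applied to the tables computed earlier:

# Pearson residuals: signed deviations scaled by the square root of the expected counts
(observed_freqs - expected_freqs) / sqrt(expected_freqs)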

The function assocplot() can automatically compute the Pearson residuals for any given contingency table and create a plot that highlights their contributions. If a bar rises above the dashed baseline, it is black and indicates that a category is observed more frequently than expected (e.g., of-genitives in the inanimate group). Conversely, bars are coloured grey if a category is considerably less frequent than expected, such as s-genitives with inanimate possessors.

assocplot(t(observed_freqs), col = c("black", "lightgrey"))

Testing the residuals: Configural Frequency Analysis

The chi-squared test only provides a \(p\)-value for the entire contingency table. But what if we wanted to test the residuals for their significance as well? Configural Frequency Analysis (CFA; Krauth and Lienert 1973) allows us to do exactly that: it performs a significance test for every combination of variable values in a cross-table. Moreover, CFA is not limited to two variables; technically, users can test for associations between arbitrary numbers of variables, but should be aware of the increasing complexity of interpretation.

library(cfa) # install library beforehand

# Get the observed counts and convert them to a data frame
config_df <- as.data.frame(observed_freqs)

# Convert to matrix
configs <- as.matrix(config_df[, 1:2])  # first two columns contain the configurations (= combinations of variable values)
counts <- config_df$Freq # Freq column contains the corresponding counts

# Perform CFA on configurations and counts; apply Bonferroni correction for multiple testing
cfa_output <- cfa(configs, counts, bonferroni = TRUE)

# Print output
print(cfa_output)

*** Analysis of configuration frequencies (CFA) ***

         label    n  expected         Q    chisq p.chisq sig.chisq         z
1    s animate 1451  799.4871 0.1515671 530.9268       0      TRUE  25.09332
2  s inanimate  544 1195.5129 0.1669481 355.0519       0      TRUE -21.53650
3   of animate  592 1243.5129 0.1690271 341.3468       0      TRUE -21.24783
4 of inanimate 2511 1859.4871 0.2011766 228.2722       0      TRUE  18.95630
  p.z sig.z
1   0  TRUE
2   1  TRUE
3   1  TRUE
4   0  TRUE


Summary statistics:

Total Chi squared         =  1455.598 
Total degrees of freedom  =  1 
p                         =  0 
Sum of counts             =  5098 

Levels:

Var1 Var2 
   2    2 

Effect size

The \(p\)-value only indicates the presence of a correlation, not its strength. Regardless of how low it may be, it does not convey how strongly two variables determine each other. For this reason, it is highly recommended to report an effect size measure alongside the \(p\)-value. One such measure is Cramér’s \(V\), which takes values in the interval \([0, 1]\):

\[ V = \sqrt{\frac{\chi^2}{n \times (\min(\text{nrow}, \text{ncol}) - 1)}}. \tag{6}\]

The package confintr implements this in its cramersv() function:

cramersv(freqs_test)
[1] 0.5339337
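As a quick check, the value can be reproduced by hand from Equation 6, using the (here continuity-corrected) statistic stored in freqs_test:

# Cramér's V from the test statistic, the sample size and the table dimensions
sqrt(unname(freqs_test$statistic) / (sum(observed_freqs) * (min(dim(observed_freqs)) - 1)))
[1] 0.5339337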

The closer \(V\) approximates 1, the stronger the association between the two categorical variables. Conversely, if \(V = 0\), the variables are completely independent. There are various guidelines in the literature that provide thresholds for “small”, “moderate” and “large” effects, yet these are rarely justified on theoretical grounds and can be viewed as arbitrary.

Exercises

Exercise 1 Schröter & Kortmann (2016) investigate the relationship between subject realisation (overt vs. null) and the grammatical category Person (1.p. vs. 2.p. vs. 3.p.) in three varieties of English (Great Britain vs. Hong Kong vs. Singapore). They report the following test results (2016: 235):

Chi-square test results: \[ \begin{align} \text{Singapore: \quad} & \chi^2 = 3.3245, df = 2, p = 0.1897 \\ \text{Hong Kong: \quad} & \chi^2 = 40.799, df = 2, p < 0.01 \\ \text{Great Britain: \quad} & \chi^2 = 3.6183, df = 2, p = 0.1638 \\ \end{align} \]

  • What hypotheses are the authors testing?
  • Assuming a significance level \(\alpha = 0.05\), what statistical conclusions can be drawn from the test results?
  • What could be the theoretical implications of these results?

Exercise 2 Conduct a small-scale analysis of genitive Type by Genre (~ 1 page). Structure your write-up as follows:

  • Research question and hypotheses
  • Statistical methods
  • Results (frequency tables, plots, test results, effect size)
  • Interpretation of standardised residuals
  • (Optional: Interpretation of configural frequency analysis)
  • Brief theoretical assessment
  • Conclusion

References

Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With R and Python. Boca Raton: CRC Press.
Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Houndmills, Basingstoke: Palgrave Macmillan.
Gries, Stefan Thomas. 2021. Statistics for Linguistics with R: A Practical Introduction. 3rd rev. ed. Berlin; Boston: De Gruyter Mouton.
Heumann, Christian, Michael Schomaker, and Shalabh. 2022. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer. https://doi.org/10.1007/978-3-031-11833-3.
Krauth, J., and G. A. Lienert. 1973. Die Konfigurationsfrequenzanalyse (KFA) und ihre Anwendung in Psychologie und Medizin. Reprint 1995. Weinheim: Beltz Psychologie Verlagsunion.
Schröter, Verena, and Bernd Kortmann. 2016. “Pronoun Deletion in Hong Kong English and Colloquial Singaporean English.” World Englishes 35 (2): 221–41. https://doi.org/10.1111/weng.12192.