Statistics for Corpus Linguists


4.4 Hypothesis testing

Author: Vladimir Buskin
Affiliation: Catholic University of Eichstätt-Ingolstadt

Abstract
This handout introduces the principles of scientific inference and null hypothesis significance testing (NHST) for both categorical and continuous linguistic data, covering hypothesis formulation, test statistics, sampling distributions, \(p\)-values, Type I/II errors, statistical power, and common misinterpretations of significance, with illustrative examples from linguistic research.

Suggested reading

For linguists:

Gries (2021): Chapter 1.3.2

General:

Baguley (2012): Chapter 4

Agresti and Kateri (2022): Chapter 5

Dienes (2008)

On scientific inference

Science begins and ends with theory, and statistics acts as the “go-between”. Regardless of the discipline, solid research is characterised by a robust theoretical foundation that gives rise to substantive hypotheses, i.e., theory-driven predictions about a population of interest. From this substantive hypothesis, it should be possible to derive a statistical hypothesis that restates the prediction in formal, mathematical terms. After checking it against real-world data, researchers can either retain or reject their hypothesis, after which they may decide to amend (or even abandon) their theory – or keep it as is.

Null hypothesis significance testing (NHST)

The NHST framework offers researchers a convenient way of testing their theoretical assumptions. This chiefly involves setting up a set of (ideally) falsifiable statistical hypotheses, gathering evidence from the observed data and computing the (in)famous ‘\(p\)-value’ to determine “statistical significance” – a notion that is frequently misinterpreted in scientific studies.

Is this the only way of testing hypotheses?

The answer is a resounding no. Despite its immense popularity, NHST is problematic in many respects and hence subject to heavy criticism (cf. Dienes (2008): 76; Baguley (2012): 143-144). There are other statistical schools that can remedy many of its shortcomings and come with distinct advantages, such as those relying on likelihood-based inference and Bayesian principles. Although these are also becoming increasingly common in linguistics, they are still restricted to very few sub-disciplines and journals (mostly in the area of psycholinguistics).

\(H_0\) vs. \(H_1\)

Statistical hypotheses always come in pairs: A null hypothesis is accompanied by an alternative hypothesis. They are set up before (!) seeing the data and justified by previous research. Note that these hypotheses “always make a statement about the population” (Casella and Berger 2002: 373).

  • The null hypothesis \(H_0\) describes the “default state of the world” (James et al. 2021: 555). It claims there is no noteworthy effect to be observed in the data.

  • The alternative hypothesis \(H_1\) (or \(H_a\)) plainly states that \(H_0\) is false, suggesting that there is an effect of some kind.

Categorical data

We are interested in finding out whether the Type of an English genitive (‘s’ vs. ‘of’) depends on Possessor Animacy (‘animate’ vs. ‘inanimate’). Our hypotheses are:

  • \(H_0:\) Genitive Type and Possessor Animacy are independent.

  • \(H_1:\) Genitive Type and Possessor Animacy are not independent.

According to a \(\chi^2\)-test of independence, there is a statistically significant relationship between genitive type and possessor animacy (\(p < 0.001\), \(\chi^2 = 106.44\), \(df = 1\)).
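In R, such a test can be run with chisq.test(). The contingency table below uses invented counts for illustration and will not reproduce the figures reported above:

```r
# Hypothetical 2x2 table of genitive Type by Possessor Animacy
# (counts invented for illustration)
genitives <- matrix(c(300, 80,    # s-genitive:  animate, inanimate
                      120, 250),  # of-genitive: animate, inanimate
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Type = c("s", "of"),
                                    Animacy = c("animate", "inanimate")))

# Chi-squared test of independence (without Yates' continuity correction)
chisq.test(genitives, correct = FALSE)
```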

What does independence really mean?

The core idea is “that the probability distribution of the response variable is the same for each group” (Agresti and Kateri 2022: 177). Assume genitive Type is the dependent variable and Possessor Animacy the independent variable. Then independence would entail that the probabilities of the outcomes of the response variable Type = "s" and Type = "of" are not influenced by whether they occur in the groups Possessor Animacy = "animate" or Possessor Animacy = "inanimate".

If we consider two variables at the same time, such as \(X\) and \(Y\), they are said to have marginal probability functions, which we can call \(f_1(\text{Type})\) and \(f_2(\text{Animacy})\) here. If we condition the outcomes of both variables on each other, the following equivalence will hold if the variables are independent:

\[ f(\text{Type} \mid \text{Animacy}) = f_1(\text{Type}) \text{ and } f(\text{Animacy} \mid \text{Type}) = f_2(\text{Animacy}). \tag{1}\]

Thus, the null hypothesis assumes that the probabilities of each combination of values (such as Type and Possessor Animacy), denoted by \(\pi_{ij}\), have the relationship in Equation 1. This can be stated succinctly as

\[ H_0 : \pi_{ij} = P(X = j)P(Y = i), \tag{2}\]

where \(P(X = j)\) denotes the probability that the random variable \(X\) takes the value \(j\) and \(P(Y = i)\) the probability that the random variable \(Y\) takes the value \(i\).

For illustration, consider the data below:

\[
\begin{array}{ll|cc}
 & & \multicolumn{2}{c}{\text{Possessor Animacy}} \\
 & & \text{animate} & \text{inanimate} \\
\hline
\text{Genitive} & \text{s} & \pi_{11} & \pi_{12} \\
 & \text{of} & \pi_{21} & \pi_{22} \\
\end{array}
\]

Independence holds if the following relationships are satisfied for each cell:

\[ \begin{align} \pi_{11} &= P(X = \text{animate})P(Y = \text{s}) \\ \pi_{12} &= P(X = \text{inanimate})P(Y = \text{s}) \\ \pi_{21} &= P(X = \text{animate})P(Y = \text{of}) \\ \pi_{22} &= P(X = \text{inanimate})P(Y = \text{of}) \\ \end{align} \]
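Numerically, independence means that the table of joint probabilities is simply the outer product of the marginals. A minimal sketch with invented marginal probabilities:

```r
# Marginal probabilities (invented for illustration)
p_animacy <- c(animate = 0.6, inanimate = 0.4)  # P(X = j)
p_type    <- c(s = 0.45, of = 0.55)             # P(Y = i)

# Under H0, each joint probability is the product of its marginals
pi_ij <- outer(p_type, p_animacy)  # rows: Type, columns: Animacy
pi_ij
#    animate inanimate
# s     0.27      0.18
# of    0.33      0.22

# The joint probabilities sum to 1, and summing over rows/columns
# recovers the original marginals
rowSums(pi_ij)  # 0.45, 0.55
colSums(pi_ij)  # 0.60, 0.40
```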

Continuous data

As part of a phonetic study, we compare the F1 formant frequencies of vowels (in Hz) produced by male and female speakers of Apache. We put forward the following hypotheses:

  • \(H_0:\) mean F1 frequency of men \(=\) mean F1 frequency of women.

  • \(H_1:\) mean F1 frequency of men \(\ne\) mean F1 frequency of women.

According to a two-sample \(t\)-test, there is a significant difference between the mean F1 frequencies of male and female speakers of Apache (\(t(112.19) = 2.44\), \(p < 0.05\)).
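The non-integer degrees of freedom (112.19) suggest a Welch test, which is R's default for t.test(). The simulated F1 values below are invented and will not reproduce the reported statistics:

```r
set.seed(42)

# Simulated F1 frequencies in Hz (values invented for illustration;
# F1 tends to be higher for female speakers)
f1_men   <- rnorm(60, mean = 480, sd = 60)
f1_women <- rnorm(60, mean = 550, sd = 80)

# Two-sample Welch t-test (unequal variances assumed by default)
t.test(f1_men, f1_women)
```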

Test statistics

To facilitate the decision-making process, we proceed to gather statistical evidence from the observed data, that is, the sample. Since NHST primarily revolves around \(H_0\) (and not \(H_1\)!), we need to review the evidence the data provides against or in favour of \(H_0\). This is done via a test statistic \(T\) that characterises the sample at hand. Agresti & Kateri (2022: 163) describe test statistics as indicators of how strongly a point estimate (e.g., a mean or a proportion) deviates from its expected value under \(H_0\).

There are many possible test statistics out there:

  • For instance, if the data are discrete and nominal, the \(\chi^2\) measure is used to compute differences between observed and expected frequencies in the entire sample.
  • In the case of continuous data, it is common to rely on \(t\) for quantifying differences between sample means.
  • Other possible test statistics include the correlation coefficient \(r\), \(z\)-scores, the \(F\)-statistic, and many others.

Test statistics have characteristic probability distributions which are known as sampling distributions (e.g., the \(\chi^2\)-distribution or the \(t\)-distribution). Recall that probability distributions are obtained by assigning probabilities to every outcome of a random variable.

Sampling distributions

A fundamental theoretical insight is that sample statistics (specifically means and sums of random variables) converge, after standardisation, to the standard normal distribution with \(\mu = 0\) and \(\sigma = 1\) as the sample size grows. This result is known as the Central Limit Theorem (Heumann, Schomaker, and Shalabh 2022: 547). However, if the population standard deviation is unknown, the Student \(t\) distribution provides a reasonable approximation. It has a single parameter \(v\), the degrees of freedom, which determines its shape.
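The Central Limit Theorem is easy to illustrate by simulation: even for a heavily skewed population, the distribution of sample means is approximately normal. A minimal sketch:

```r
set.seed(1)

# 10,000 samples of size 50 from a skewed (exponential) population
# with mean 1 and standard deviation 1
sample_means <- replicate(10000, mean(rexp(50, rate = 1)))

# The sample means cluster around the population mean, with
# standard deviation close to 1/sqrt(50) (about 0.141)
mean(sample_means)
sd(sample_means)

# Standardised means are approximately standard normal
z <- (sample_means - 1) / (1 / sqrt(50))
hist(z, breaks = 50, freq = FALSE, main = "Standardised sample means")
curve(dnorm(x), add = TRUE, col = "steelblue", lwd = 2)
```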

\(t\) distribution
library(ggplot2)

# Define the degrees of freedom
df_t <- 10

# Create a sequence of x values
x_t <- seq(-4, 4, length.out = 1000)

# Compute the t-distribution density
y_t <- dt(x_t, df = df_t)

# Create a data frame
t_distribution_data <- data.frame(x = x_t, y = y_t)

# Generate the plot
ggplot(t_distribution_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1) + # Line for the density curve
  labs(
    title = "t-Distribution",
    subtitle = "Probability density function with 10 degrees of freedom",
    x = "t value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 0.4), xlim = c(-4, 4)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )

The \(\chi^2\) distribution is closely related to the \(t\) distribution, sharing the degrees of freedom parameter \(v\). In essence, it is the distribution of sums of squared (standard) normally distributed random variables \(Z^2\). Another example is the sum of squares \(\sum_{i=1}^n(x_i - \mu)^2\), which follows a (scaled) \(\chi^2\) distribution.
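This relationship can be verified by simulation: summing \(v\) squared draws from a standard normal distribution yields \(\chi^2\)-distributed values. A minimal sketch:

```r
set.seed(123)

# Sums of squares of v = 2 standard normal variables
v <- 2
chi_sim <- replicate(10000, sum(rnorm(v)^2))

# The theoretical mean and variance of a chi-squared distribution
# are v and 2v, respectively
mean(chi_sim)  # close to 2
var(chi_sim)   # close to 4

# Simulated quantiles match the theoretical ones from qchisq()
quantile(chi_sim, c(0.5, 0.95))
qchisq(c(0.5, 0.95), df = v)
```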

\(\chi^2\) distribution
# Load ggplot2
library(ggplot2)

# Define the degrees of freedom
df <- 2

# Create a sequence of x values
x <- seq(0, 30, length.out = 1000)

# Compute the chi-squared density
y <- dchisq(x, df = df)

# Create a data frame
chi_squared_data <- data.frame(x = x, y = y)

# Generate the plot
ggplot(chi_squared_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1) + # Line for the density curve
  labs(
    title = "Chi-Squared Distribution",
    subtitle = "Probability density function with 2 degrees of freedom",
    x = "Chi-squared value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 0.05), xlim = c(0, 30)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )

Statistical significance

The final rejection of \(H_0\) is determined by the significance probability \(p\). Due to the frequency and ferocity with which statistical significance is misinterpreted in the research literature, we will begin by reviewing its technical definition:

“The \(p\)-value is the probability, presuming that \(H_0\) is true, that the test statistic equals the observed value or a value even more extreme in the direction predicted by \(H_a\)” (Agresti and Kateri 2022: 163).

In compact notation, it is equivalent to the conditional probability

\[ P(T \geq \text{observed value} \mid H_0 \text{ is true}). \]

If \(p\) is lower than a pre-defined threshold (typically \(0.05\)), also known as the significance level \(\alpha\), we can reject \(H_0\). However, if \(p \geq 0.05\), this neither justifies rejecting nor accepting the null hypothesis (Baguley 2012: 121).

What could go wrong? Type I and Type II errors

There is always a chance that we accept or reject the wrong hypothesis; the four possible constellations are summarised in the table below (cf. Heumann, Schomaker, and Shalabh 2022: 223):

\[
\begin{array}{l|cc}
 & H_0 \text{ is true} & H_0 \text{ is not true} \\
\hline
H_0 \text{ is not rejected} & \color{green}{\text{Correct decision}} & \color{red}{\text{Type II } (\beta)\text{-error}} \\
H_0 \text{ is rejected} & \color{red}{\text{Type I } (\alpha)\text{-error}} & \color{green}{\text{Correct decision}} \\
\end{array}
\]

There is a trade-off between Type I and Type II errors (Agresti and Kateri 2022: 182-186):

  • If we try to decrease \(P\)(Type I) by selecting a lower \(\alpha\), \(P\)(Type II) will inevitably increase.
  • Conversely, raising \(\alpha\) to decrease \(P\)(Type II) will increase \(P\)(Type I); the only way to reduce both error probabilities at once is to increase the sample size.
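The meaning of \(\alpha\) can also be checked by simulation: if \(H_0\) is true and we test at \(\alpha = 0.05\), roughly 5% of tests will come out significant purely by chance. A sketch:

```r
set.seed(7)

# 5,000 t-tests on two samples drawn from the SAME population,
# so H0 is true by construction
p_values <- replicate(5000, t.test(rnorm(30), rnorm(30))$p.value)

# The proportion of (false) rejections should be close to alpha = 0.05
mean(p_values < 0.05)
```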
Statistical power

It is recommended to compute the power \(1 - P(\text{Type II})\) of a test in order to estimate the probability of correctly rejecting a false \(H_0\). The pwr library provides a convenient interface:

library(pwr)

# w = effect size (between 0 and 1)
# df = degrees of freedom
# sig.level = significance level alpha
# power = 1 - beta
pwr.chisq.test(w = 0.5, df = 1, sig.level = 0.05, power = 0.95)

     Chi squared power calculation 

              w = 0.5
              N = 51.97884
             df = 1
      sig.level = 0.05
          power = 0.95

NOTE: N is the number of observations
# We would need a sample size of N = 52 to detect an effect of size 0.5
# with alpha = 0.05, beta = 0.05, and 1 degree of freedom
# (e.g., a simple 2x2 contingency table).
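An analogous power calculation for the two-sample \(t\)-test scenario is available via pwr.t.test(), where the effect size d is Cohen's standardised mean difference; the values below are illustrative:

```r
library(pwr)

# Per-group sample size needed to detect a medium effect (d = 0.5)
# at alpha = 0.05 with 80% power (two-sided test)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8)
```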

Where does the \(p\)-value come from?

Let’s say that the statistical analysis of two discrete variables \(X\) and \(Y\) has returned a test statistic of \(\chi^2 = 6.5\) for 2 \(df\). In order to compute the corresponding \(p\)-value we need to consult the sampling distribution of this test statistic.

\(\chi^2\) distribution
# Load ggplot2
library(ggplot2)

# Define the degrees of freedom
df <- 2

# Create a sequence of x values
x <- seq(0, 30, length.out = 1000)

# Compute the chi-squared density
y <- dchisq(x, df = df)

# Create a data frame
chi_squared_data <- data.frame(x = x, y = y)

# Generate the plot, shading the p-value region from x = 6.5 onwards
ggplot(chi_squared_data, aes(x = x, y = y)) +
  geom_area(data = subset(chi_squared_data, x >= 6.5),
            fill = "darkgreen", alpha = 0.5) + # Green area = p-value
  geom_line(color = "steelblue", linewidth = 1) +
  labs(
    title = "Chi-Squared Distribution",
    subtitle = "Probability density function with 2 degrees of freedom",
    x = "Chi-squared value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 0.05), xlim = c(0, 30)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )

Because sampling distributions are continuous probability distributions, they have a potentially infinite number of \(x\)-values. This also implies that the probability of any single \(x\)-value must be 0.¹

¹ The proof is given in Heumann, Schomaker, and Shalabh (2022: 544).

The sampling distribution thus has a probability density function (PDF), which we will call \(f(x)\), and probabilities can only be obtained for an interval \([a,b]\). The probability that a value \(X\) falls into the interval \(a < X < b\) is then equivalent to the area under the curve between \(a\) and \(b\) (cf. Equation 3).

\[ P(a < X < b) = \int_a^b f(x)dx. \tag{3}\]

Recall the PDF \(f_{\chi^2}(x)\) of the \(\chi^2\)-distribution with 2 degrees of freedom. The \(p\)-value corresponds to the green area under the curve ranging from \(x = 6.5\) up to \(\infty\), which can be restated formally in Equation 4. This brings us back to the definition of the \(p\)-value: It is the probability that the \(\chi^2\) score is equal to 6.5 or higher, i.e., \(P(\chi^2 \geq 6.5)\).

\[ P(\chi^2 \geq 6.5) = P(6.5 < \chi^2 < \infty) = \int_{6.5}^\infty f_{\chi^2}(x)dx. \tag{4}\]

In practice, statistical software computes this integral for us:

pchisq(6.5, df = 2, lower.tail = FALSE)
[1] 0.03877421
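The same value can be obtained by integrating the density numerically, which makes the connection to Equation 3 explicit:

```r
# Numerically integrate the chi-squared density from 6.5 to infinity
integrate(dchisq, lower = 6.5, upper = Inf, df = 2)

# This matches the closed-form tail probability
pchisq(6.5, df = 2, lower.tail = FALSE)
```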

Practical considerations

Common pitfalls (cf. Agresti and Kateri 2022: 189-190)
  • Statistical significance is NOT an indication of a causal relationship between the variables of interest (correlation \(\neq\) causation).

  • \(p\)-values do NOT signify the strength of an effect (\(\neq\) effect size); they only indicate whether there is evidence for an effect at all.

  • \(p\)-values are NOT the probability of the null hypothesis being true.

  • Statistical significance is only a starting point for further scientific inquiry, and by no means the end of it.

References

Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With R and Python. Boca Raton: CRC Press.
Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Houndmills, Basingstoke: Palgrave Macmillan.
Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Pacific Grove, Calif.: Duxbury/Thomson Learning.
Dienes, Zoltán. 2008. Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Houndmills: Palgrave Macmillan.
Gries, Stefan Thomas. 2021. Statistics for Linguistics with R: A Practical Introduction. 3rd rev. ed. Berlin; Boston: De Gruyter Mouton.
Heumann, Christian, Michael Schomaker, and Shalabh. 2022. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer. https://doi.org/10.1007/978-3-031-11833-3.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. New York: Springer. https://doi.org/10.1007/978-1-0716-1418-1.