Statistics for Corpus Linguists


4.4 Hypothesis testing

Author: Vladimir Buskin
Affiliation: Catholic University of Eichstätt-Ingolstadt

Abstract
This handout introduces the principles of scientific inference and null hypothesis significance testing (NHST) for both categorical and continuous linguistic data, covering hypothesis formulation, test statistics, sampling distributions, \(p\)-values, Type I/II errors, statistical power, and common misinterpretations of significance, with illustrative examples from linguistic research.

Suggested reading

For linguists:

Gries (2021): Chapter 1.3.2

General:

Baguley (2012): Chapter 4

Agresti and Kateri (2022): Chapter 5

Dienes (2008)

On scientific inference

Science begins and ends with theory, and statistics acts as the “go-between”. Regardless of the discipline, solid research is characterised by a robust theoretical foundation that gives rise to substantive hypotheses, i.e., theory-driven predictions about a population of interest. From this substantive hypothesis, it should be possible to derive a statistical hypothesis that restates the prediction in formal, mathematical terms. After checking it against real-world data, researchers can either retain or reject their hypothesis, after which they may decide to amend (or even abandon) their theory – or keep it as is.

Null hypothesis significance testing (NHST)

The NHST framework offers researchers a convenient way of testing their theoretical assumptions. This chiefly involves setting up a set of (ideally) falsifiable statistical hypotheses, gathering evidence from the observed data and computing the (in)famous ‘\(p\)-value’ to determine “statistical significance” – a notion that is frequently misinterpreted in scientific studies.

Is this the only way of testing hypotheses?

The answer is a resounding no. Despite its immense popularity, NHST is problematic in many respects and hence subject to heavy criticism (cf. Dienes (2008): 76; Baguley (2012): 143-144). There are other statistical schools that can remedy many of its shortcomings and come with distinct advantages, such as those relying on likelihood-based inference and Bayesian principles. Although these are also becoming increasingly common in linguistics, they are still restricted to very few sub-disciplines and journals (mostly in the area of psycholinguistics).

\(H_0\) vs. \(H_1\)

Statistical hypotheses always come in pairs: A null hypothesis is accompanied by an alternative hypothesis. They are set up before (!) seeing the data and justified by previous research. Note that these hypotheses “always make a statement about the population” (Casella and Berger 2002: 373).

  • The null hypothesis \(H_0\) describes the “default state of the world” (James et al. 2021: 555). It claims there is no noteworthy effect to be observed in the data.

  • The alternative hypothesis \(H_1\) (or \(H_a\)) plainly states that \(H_0\) is false, suggesting that there is an effect of some kind.

Categorical data

We are interested in finding out whether the Type of an English genitive (‘s’ vs. ‘of’) depends on Possessor Animacy (‘animate’ vs. ‘inanimate’). Our hypotheses are:

  • \(H_0:\) Genitive Type and Possessor Animacy are independent.

  • \(H_1:\) Genitive Type and Possessor Animacy are not independent.

According to a \(\chi^2\)-test of independence, there is a statistically significant relationship between genitive type and possessor animacy (\(p < 0.001\), \(\chi^2 = 106.44\), \(df = 1\)).
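In R, such a test can be run with chisq.test(). The contingency table below uses invented counts for illustration and will not reproduce the figures reported above:

```r
# Hypothetical 2x2 table of genitive Type by Possessor Animacy
# (counts invented for illustration)
genitives <- matrix(c(300, 80,    # s-genitive:  animate, inanimate
                      120, 250),  # of-genitive: animate, inanimate
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Type = c("s", "of"),
                                    Animacy = c("animate", "inanimate")))

# Chi-squared test of independence (without Yates' continuity correction)
chisq.test(genitives, correct = FALSE)
```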

What does independence really mean?

The core idea is “that the probability distribution of the response variable is the same for each group” (Agresti and Kateri 2022: 177). Assume genitive Type is the dependent variable and Possessor Animacy the independent variable. Then independence would entail that the probabilities of the outcomes of the response variable Type = "s" and Type = "of" are not influenced by whether they occur in the groups Possessor Animacy = "animate" or Possessor Animacy = "inanimate".

If we consider two variables at the same time, such as \(X\) and \(Y\), they are said to have marginal probability functions, which we can call \(f_1(\text{Type})\) and \(f_2(\text{Animacy})\) here. If we condition the outcomes of both variables on each other, the following equivalence will hold if the variables are independent:

\[ f(\text{Type} \mid \text{Animacy}) = f_1(\text{Type}) \text{ and } f(\text{Animacy} \mid \text{Type}) = f_2(\text{Animacy}). \tag{1}\]

Thus, the null hypothesis assumes that the probabilities of each combination of values (such as Type and Possessor Animacy), denoted by \(\pi_{ij}\), have the relationship in Equation 1. This can be stated succinctly as

\[ H_0 : \pi_{ij} = P(X = j)P(Y = i), \tag{2}\]

where \(P(X = j)\) denotes the probability that the random variable \(X\) takes the value \(j\) and \(P(Y = i)\) the probability that the random variable \(Y\) takes the value \(i\).

For illustration, consider the data below:

\[
\begin{array}{ll|cc}
 & & \multicolumn{2}{c}{\text{Possessor Animacy}} \\
 & & \text{animate} & \text{inanimate} \\
\hline
\text{Genitive} & \text{s} & \pi_{11} & \pi_{12} \\
 & \text{of} & \pi_{21} & \pi_{22} \\
\end{array}
\]

Independence holds if the following relationships are satisfied for each cell:

\[ \begin{align} \pi_{11} &= P(X = \text{animate})P(Y = \text{s}) \\ \pi_{12} &= P(X = \text{inanimate})P(Y = \text{s}) \\ \pi_{21} &= P(X = \text{animate})P(Y = \text{of}) \\ \pi_{22} &= P(X = \text{inanimate})P(Y = \text{of}) \\ \end{align} \]
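Numerically, independence means that the table of joint probabilities is simply the outer product of the marginals. A minimal sketch with invented marginal probabilities:

```r
# Marginal probabilities (invented for illustration)
p_animacy <- c(animate = 0.6, inanimate = 0.4)  # P(X = j)
p_type    <- c(s = 0.45, of = 0.55)             # P(Y = i)

# Under H0, each joint probability is the product of its marginals
pi_ij <- outer(p_type, p_animacy)  # rows: Type, columns: Animacy
pi_ij
#    animate inanimate
# s     0.27      0.18
# of    0.33      0.22

# The joint probabilities sum to 1, and summing over rows/columns
# recovers the original marginals
rowSums(pi_ij)  # 0.45, 0.55
colSums(pi_ij)  # 0.60, 0.40
```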

Continuous data

As part of a phonetic study, we compare the F1 formant frequencies of vowels (in Hz) produced by male and female speakers of Apache. We put forward the following hypotheses:

  • \(H_0:\) mean F1 frequency of men \(=\) mean F1 frequency of women.

  • \(H_1:\) mean F1 frequency of men \(\ne\) mean F1 frequency of women.

According to a two-sample \(t\)-test, there is a significant difference between the mean F1 frequencies of male and female speakers of Apache (\(t(112.19) = 2.44\), \(p < 0.05\)).
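The non-integer degrees of freedom (112.19) suggest a Welch test, which is R's default for t.test(). The simulated F1 values below are invented and will not reproduce the reported statistics:

```r
set.seed(42)

# Simulated F1 frequencies in Hz (values invented for illustration;
# F1 tends to be higher for female speakers)
f1_men   <- rnorm(60, mean = 480, sd = 60)
f1_women <- rnorm(60, mean = 550, sd = 80)

# Two-sample Welch t-test (unequal variances assumed by default)
t.test(f1_men, f1_women)
```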

Test statistics

To facilitate the decision-making process, we proceed to gather statistical evidence from the observed data, that is, the sample. Since NHST primarily revolves around \(H_0\) (and not \(H_1\)!), we need to review the evidence the data provides against or in favour of \(H_0\). This is done via a test statistic \(T\) that characterises the sample at hand. Agresti & Kateri (2022: 163) describe test statistics as indicators of how strongly a point estimate (e.g., a mean or a proportion) deviates from its expected value under \(H_0\).

There are many possible test statistics out there:

  • For instance, if the data are discrete and nominal, the \(\chi^2\) measure is used to compute differences between observed and expected frequencies in the entire sample.
  • In the case of continuous data, it is common to rely on \(t\) for quantifying differences between sample means.
  • Other possible test statistics include the correlation coefficient \(r\), \(z\)-scores, the \(F\)-statistic, and many others.

Test statistics have characteristic probability distributions which are known as sampling distributions (e.g., the \(\chi^2\)-distribution or the \(t\)-distribution). Recall that probability distributions are obtained by assigning probabilities to every outcome of a random variable.

Sampling distributions

A fundamental theoretical insight is that sample statistics (specifically means and sums of random variables) converge, after standardisation, to the standard normal distribution with \(\mu = 0\) and \(\sigma = 1\) as the sample size grows. This result is known as the Central Limit Theorem (Heumann, Schomaker, and Shalabh 2022: 547). However, if the population standard deviation is unknown, the Student \(t\) distribution provides a reasonable approximation. It has a single parameter \(v\), the degrees of freedom, which determines its shape.
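The Central Limit Theorem is easy to illustrate by simulation: even for a heavily skewed population, the distribution of sample means is approximately normal. A minimal sketch:

```r
set.seed(1)

# 10,000 samples of size 50 from a skewed (exponential) population
# with mean 1 and standard deviation 1
sample_means <- replicate(10000, mean(rexp(50, rate = 1)))

# The sample means cluster around the population mean, with
# standard deviation close to 1/sqrt(50) (about 0.141)
mean(sample_means)
sd(sample_means)

# Standardised means are approximately standard normal
z <- (sample_means - 1) / (1 / sqrt(50))
hist(z, breaks = 50, freq = FALSE, main = "Standardised sample means")
curve(dnorm(x), add = TRUE, col = "steelblue", lwd = 2)
```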

\(t\) distribution
library(ggplot2)

# Define the degrees of freedom
df_t <- 10

# Create a sequence of x values
x_t <- seq(-4, 4, length.out = 1000)

# Compute the t-distribution density
y_t <- dt(x_t, df = df_t)

# Create a data frame
t_distribution_data <- data.frame(x = x_t, y = y_t)

# Generate the plot
ggplot(t_distribution_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1) + # Line for the density curve
  labs(
    title = "t-Distribution",
    subtitle = "Probability density function with 10 degrees of freedom",
    x = "t value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 0.4), xlim = c(-4, 4)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )

The \(\chi^2\) distribution is closely related to the \(t\) distribution, sharing the degrees of freedom parameter \(v\). In essence, it is the distribution of sums of squared (standard) normally distributed random variables \(Z^2\). Another example is the sum of squares \(\sum_{i=1}^n(x_i - \mu)^2\), which follows a (scaled) \(\chi^2\) distribution.
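This relationship can be verified by simulation: summing \(v\) squared draws from a standard normal distribution yields \(\chi^2\)-distributed values. A minimal sketch:

```r
set.seed(123)

# Sums of squares of v = 2 standard normal variables
v <- 2
chi_sim <- replicate(10000, sum(rnorm(v)^2))

# The theoretical mean and variance of a chi-squared distribution
# are v and 2v, respectively
mean(chi_sim)  # close to 2
var(chi_sim)   # close to 4

# Simulated quantiles match the theoretical ones from qchisq()
quantile(chi_sim, c(0.5, 0.95))
qchisq(c(0.5, 0.95), df = v)
```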

\(\chi^2\) distribution
# Load ggplot2
library(ggplot2)

# Define the degrees of freedom
df <- 2

# Create a sequence of x values
x <- seq(0, 30, length.out = 1000)

# Compute the chi-squared density
y <- dchisq(x, df = df)

# Create a data frame
chi_squared_data <- data.frame(x = x, y = y)

# Generate the plot
ggplot(chi_squared_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1) + # Line for the density curve
  labs(
    title = "Chi-Squared Distribution",
    subtitle = "Probability density function with 2 degrees of freedom",
    x = "Chi-squared value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 0.05), xlim = c(0, 30)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )

Statistical significance

The final rejection of \(H_0\) is determined by the significance probability \(p\). Due to the frequency and ferocity with which statistical significance is misinterpreted in the research literature, we will begin by reviewing its technical definition:

“The \(p\)-value is the probability, presuming that \(H_0\) is true, that the test statistic equals the observed value or a value even more extreme in the direction predicted by \(H_a\)” (Agresti and Kateri 2022: 163).

In compact notation, it is equivalent to the conditional probability

\[ P(T \geq \text{observed value} \mid H_0 \text{ is true}). \]

If \(p\) is lower than a pre-defined threshold (typically \(0.05\)), also known as the significance level \(\alpha\), we can reject \(H_0\). However, if \(p \geq 0.05\), this neither justifies rejecting nor accepting the null hypothesis (Baguley 2012: 121).

What could go wrong? Type I and Type II errors

There is always a chance that we accept or reject the wrong hypothesis; the four possible constellations are summarised in the table below (cf. Heumann, Schomaker, and Shalabh 2022: 223):

\[
\begin{array}{l|cc}
 & H_0 \text{ is true} & H_0 \text{ is not true} \\
\hline
H_0 \text{ is not rejected} & \color{green}{\text{Correct decision}} & \color{red}{\text{Type II } (\beta)\text{-error}} \\
H_0 \text{ is rejected} & \color{red}{\text{Type I } (\alpha)\text{-error}} & \color{green}{\text{Correct decision}} \\
\end{array}
\]

There is a trade-off between Type I and Type II errors (Agresti and Kateri 2022: 182-186):

  • If we try to decrease \(P\)(Type I) by selecting a lower \(\alpha\), \(P\)(Type II) will inevitably increase.
  • Conversely, raising \(\alpha\) to decrease \(P\)(Type II) will increase \(P\)(Type I); the only way to reduce both error probabilities at once is to increase the sample size.
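The meaning of \(\alpha\) can also be checked by simulation: if \(H_0\) is true and we test at \(\alpha = 0.05\), roughly 5% of tests will come out significant purely by chance. A sketch:

```r
set.seed(7)

# 5,000 t-tests on two samples drawn from the SAME population,
# so H0 is true by construction
p_values <- replicate(5000, t.test(rnorm(30), rnorm(30))$p.value)

# The proportion of (false) rejections should be close to alpha = 0.05
mean(p_values < 0.05)
```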
Statistical power

It is recommended to compute the power \(1 - P(\text{Type II})\) of a test in order to estimate the probability of correctly rejecting a false \(H_0\). The pwr library provides a convenient interface:

library(pwr)

# w = effect size (between 0 and 1)
# df = degrees of freedom
# sig.level = significance level alpha
# power = 1 - beta
pwr.chisq.test(w = 0.5, df = 1, sig.level = 0.05, power = 0.95)

     Chi squared power calculation 

              w = 0.5
              N = 51.97884
             df = 1
      sig.level = 0.05
          power = 0.95

NOTE: N is the number of observations
# We would need a sample size of N = 52 to detect an effect of size 0.5
# with alpha = 0.05, beta = 0.05, and 1 degree of freedom
# (e.g., a simple 2x2 contingency table).
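An analogous power calculation for the two-sample \(t\)-test scenario is available via pwr.t.test(), where the effect size d is Cohen's standardised mean difference; the values below are illustrative:

```r
library(pwr)

# Per-group sample size needed to detect a medium effect (d = 0.5)
# at alpha = 0.05 with 80% power (two-sided test)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8)
```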

Where does the \(p\)-value come from?

Let’s say that the statistical analysis of two discrete variables \(X\) and \(Y\) has returned a test statistic of \(\chi^2 = 6.5\) for 2 \(df\). In order to compute the corresponding \(p\)-value we need to consult the sampling distribution of this test statistic.

\(\chi^2\) distribution
# Load ggplot2
library(ggplot2)

# Define the degrees of freedom
df <- 2

# Create a sequence of x values
x <- seq(0, 30, length.out = 1000)

# Compute the chi-squared density
y <- dchisq(x, df = df)

# Create a data frame
chi_squared_data <- data.frame(x = x, y = y)

# Generate the plot, shading the p-value region from x = 6.5 onwards
ggplot(chi_squared_data, aes(x = x, y = y)) +
  geom_area(data = subset(chi_squared_data, x >= 6.5),
            fill = "darkgreen", alpha = 0.5) + # Green area = p-value
  geom_line(color = "steelblue", linewidth = 1) +
  labs(
    title = "Chi-Squared Distribution",
    subtitle = "Probability density function with 2 degrees of freedom",
    x = "Chi-squared value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 0.05), xlim = c(0, 30)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )

Because sampling distributions are continuous probability distributions, they have a potentially infinite number of \(x\)-values. This also implies that the probability of any single \(x\)-value must be 0.¹

¹ The proof is given in Heumann, Schomaker, and Shalabh (2022: 544).

The sampling distribution thus has a probability density function (PDF), which we will call \(f(x)\), and probabilities can only be obtained for an interval \([a,b]\). The probability that a value \(X\) falls into the interval \(a < X < b\) is then equivalent to the area under the curve between \(a\) and \(b\) (cf. Equation 3).

\[ P(a < X < b) = \int_a^b f(x)dx. \tag{3}\]

Recall the PDF \(f_{\chi^2}(x)\) of the \(\chi^2\)-distribution with 2 degrees of freedom. The \(p\)-value corresponds to the green area under the curve ranging from \(x = 6.5\) up to \(\infty\), which can be restated formally in Equation 4. This brings us back to the definition of the \(p\)-value: It is the probability that the \(\chi^2\) score is equal to 6.5 or higher, i.e., \(P(\chi^2 \geq 6.5)\).

\[ P(\chi^2 \geq 6.5) = P(6.5 < \chi^2 < \infty) = \int_{6.5}^\infty f_{\chi^2}(x)dx. \tag{4}\]

In practice, statistical software computes this integral for us:

pchisq(6.5, df = 2, lower.tail = FALSE)
[1] 0.03877421
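The same value can be obtained by integrating the density numerically, which makes the connection to Equation 3 explicit:

```r
# Numerically integrate the chi-squared density from 6.5 to infinity
integrate(dchisq, lower = 6.5, upper = Inf, df = 2)

# This matches the closed-form tail probability
pchisq(6.5, df = 2, lower.tail = FALSE)
```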

Practical considerations

Common pitfalls (cf. Agresti and Kateri 2022: 189-190)
  • Statistical significance is NOT an indication of a causal relationship between the variables of interest (correlation \(\neq\) causation).

  • \(p\)-values do NOT signify the strength of an effect (\(\neq\) effect size); they only indicate whether there is evidence for an effect at all.

  • \(p\)-values are NOT the probability of the null hypothesis being true.

  • Statistical significance is only a starting point for further scientific inquiry, and by no means the end of it.

References

Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With R and Python. Boca Raton: CRC Press.
Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Houndmills, Basingstoke: Palgrave Macmillan.
Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Pacific Grove, Calif.: Duxbury/Thomson Learning.
Dienes, Zoltán. 2008. Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Houndmills: Palgrave Macmillan.
Gries, Stefan Thomas. 2021. Statistics for Linguistics with R: A Practical Introduction. 3rd rev. ed. Berlin; Boston: De Gruyter Mouton.
Heumann, Christian, Michael Schomaker, and Shalabh. 2022. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer. https://doi.org/10.1007/978-3-031-11833-3.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. New York: Springer. https://doi.org/10.1007/978-1-0716-1418-1.