18  Hypothesis testing


Vladimir Buskin

Catholic University of Eichstätt-Ingolstadt

18.1 Suggested reading

For linguists:

Gries (2021): Chapter 1.3.2


Baguley (2012): Chapter 4

Agresti and Kateri (2022): Chapter 5

Dienes (2008)

18.2 On scientific inference

Science begins and ends with theory, and statistics acts as the “go-between”. Regardless of the discipline, solid research is characterised by a robust theoretical foundation that gives rise to substantive hypotheses, i.e., theory-driven predictions about a population of interest. From this rather concrete hypothesis, it should be possible to derive a statistical hypothesis that re-states the prediction in more formal/mathematical terms. After checking it against real-world data, researchers can either confirm or reject their hypothesis, after which they may decide to amend (or even abandon) their theory – or keep it as is.

18.3 Null hypothesis significance testing (NHST)

The NHST framework offers researchers a convenient way of testing their theoretical assumptions. This chiefly involves setting up a set of (ideally) falsifiable statistical hypotheses, gathering evidence from the observed data and computing the (in)famous ‘\(p\)-value’ to determine “statistical significance” – a notion that is frequently misinterpreted in scientific studies.

Is this the only way of testing hypotheses?

The answer is a resounding no. Despite its immense popularity, NHST is problematic in many respects and hence subject to heavy criticism (cf. Dienes (2008): 76; Baguley (2012): 143-144). There are other statistical schools that can remedy many of its shortcomings and come with distinct advantages, such as those relying on likelihood-based inference and Bayesian principles. Although these are also becoming increasingly common in linguistics, they are still restricted to very few sub-disciplines and journals (mostly in the area of psycholinguistics).

18.3.1 \(H_0\) vs. \(H_1\)

Statistical hypotheses always come in pairs: A null hypothesis is accompanied by an alternative hypothesis. They are set up before (!) seeing the data and justified by previous research.

  • The null hypothesis \(H_0\) describes the “default state of the world” (James et al. 2021: 555). It claims there is no noteworthy effect to be observed in the data.

  • The alternative hypothesis \(H_1\) (or \(H_a\)) plainly states that \(H_0\) is false, suggesting that there is an effect of some kind.

We are interested in finding out whether English clause ORDER (‘sc-mc’ or ‘mc-sc’) depends on the type of the subordinate clause (SUBORDTYPE), which can be either temporal (‘temp’) or causal (‘caus’).

Our hypotheses are:

  • \(H_0:\) The variables ORDER and SUBORDTYPE are independent.

  • \(H_1:\) The variables ORDER and SUBORDTYPE are not independent.

As part of a phonetic study, we compare the base frequencies of the F1 formants of vowels (in Hz) for male and female speakers of Apache. We put forward the following hypotheses:

  • \(H_0:\) mean F1 frequency of men \(=\) mean F1 frequency of women.

  • \(H_1:\) mean F1 frequency of men \(\ne\) mean F1 frequency of women.

To be precise, we use the hypotheses to make statements about a population parameter \(\theta\), which can be a mean \(\mu\) for continuous data or a proportion \(\pi\) for categorical data, among other things. Mathematically, the null and alternative hypotheses can be restated as in Equation 18.1.

\[ \begin{align} H_0: \theta = 0 \\ H_1: \theta \neq 0 \end{align} \tag{18.1}\]

In the NHST world, we’re dealing with a “This town ain’t big enough for the both of us” situation: While we have to state both \(H_0\) and \(H_1\), only one of them can remain at the end of the day. But how do we decide between these two?

18.3.2 Test statistics

To facilitate the decision-making process, we proceed to gather statistical evidence from the observed data. Since NHST primarily revolves around \(H_0\) (and not \(H_1\)!), we need to review the evidence the data provide against or in favour of \(H_0\). This is done via a test statistic \(T\) that characterises the sample at hand. Essentially, you can think of \(T\) as a one-value summary of your data.

There are many possible test statistics out there:

  • For instance, if the data are discrete, the \(\chi^2\) measure is used to compute differences between observed and expected frequencies in the entire sample.
  • In the case of continuous data, it is common to rely on \(t\) for quantifying differences between sample means.
  • Other possible test statistics include the correlation coefficient \(r\), \(z\)-scores, the \(F\)-statistic, and many others.
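As a rough illustration of how such one-value summaries are obtained in R, the sketch below computes \(\chi^2\) for a cross-tabulation of ORDER and SUBORDTYPE and \(t\) for two groups of F1 values. All numbers are invented for illustration: the counts in `clause_counts` and the simulated vectors `f1_men` and `f1_women` are assumptions, not data from an actual study.

```r
# Hypothetical counts of clause ORDER by SUBORDTYPE (invented for illustration)
clause_counts <- matrix(
  c(30, 70,
    45, 55),
  nrow = 2, byrow = TRUE,
  dimnames = list(SUBORDTYPE = c("caus", "temp"),
                  ORDER = c("mc-sc", "sc-mc"))
)

# Chi-squared statistic via the built-in test (without continuity correction)
chi_test <- chisq.test(clause_counts, correct = FALSE)
chi_builtin <- unname(chi_test$statistic)

# The same statistic by hand: sum of (observed - expected)^2 / expected
expected <- outer(rowSums(clause_counts), colSums(clause_counts)) / sum(clause_counts)
chi_manual <- sum((clause_counts - expected)^2 / expected)

# Simulated F1 values in Hz (invented means and standard deviations)
set.seed(123)
f1_men <- rnorm(50, mean = 500, sd = 60)
f1_women <- rnorm(50, mean = 540, sd = 60)

# Welch t statistic for the difference between the two sample means
t_value <- unname(t.test(f1_men, f1_women)$statistic)

chi_builtin
chi_manual
t_value
```

Computing \(\chi^2\) twice is deliberate: the manual version makes the comparison of observed and expected frequencies explicit, while `chisq.test()` is what one would normally call in practice.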

18.3.3 Statistical significance

The final rejection of \(H_0\) is determined by the significance probability \(p\). Due to the frequency and ferocity with which statistical significance is misinterpreted in the research literature, we will begin by reviewing its technical definition:

“The \(p\)-value is the probability, presuming that \(H_0\) is true, that the test statistic equals the observed value or a value even more extreme in the direction predicted by \(H_a\)” (Agresti and Kateri 2022: 163).

In compact notation, it is equivalent to the conditional probability

\[ P(T \geq \text{observed value} \mid H_0 \text{ is true}). \]

If \(p\) is lower than a pre-defined threshold (typically \(0.05\)), also known as the significance level \(\alpha\), we can reject \(H_0\). However, if \(p \geq 0.05\), this neither justifies rejecting nor accepting the null hypothesis (Baguley 2012: 121).

For example, a \(p\)-value of \(0.02\) means that, if \(H_0\) were true, we would see a test statistic at least as extreme as the observed value of \(T\) only 2% of the time. Since \(0.02\) lies below our significance level \(\alpha = 0.05\), this would suggest a statistically significant relationship in the data, and we could therefore reject \(H_0\).
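The decision rule can be sketched in a few lines of R. The helper `decide()` is a hypothetical function written for this illustration and not part of any package; it merely compares \(p\) with \(\alpha\).

```r
# Hypothetical helper: compare a p-value with the significance level alpha
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) {
    "reject H0"
  } else {
    "fail to reject H0"  # this is not the same as accepting H0
  }
}

decide(0.02)  # p lies below alpha
decide(0.20)  # p lies above alpha
```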

18.3.4 What could go wrong? Type I and Type II errors

There is always a chance that we accept or reject the wrong hypothesis; the four possible constellations are summarised in the table below (cf. Heumann, Schomaker, and Shalabh 2022: 223):

| | \(H_0\) is true | \(H_0\) is not true |
|---|---|---|
| \(H_0\) is not rejected | \(\color{green}{\text{Correct decision}}\) | \(\color{red}{\text{Type II } (\beta)\text{-error}}\) |
| \(H_0\) is rejected | \(\color{red}{\text{Type I } (\alpha)\text{-error}}\) | \(\color{green}{\text{Correct decision}}\) |
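One way to make the two error types tangible is a small simulation, sketched here under assumed settings: normally distributed data, 30 observations per group, a two-sample \(t\)-test, and an invented true difference of 0.5 standard deviations for the second scenario. If \(H_0\) is true, roughly \(\alpha\) of all tests should reject it by chance alone; if \(H_0\) is false, the share of non-rejections estimates \(\beta\).

```r
set.seed(42)
alpha <- 0.05
n_sim <- 5000   # number of simulated studies
n_obs <- 30     # observations per group (assumed)

# Simulate one study and return the p-value of a two-sample t-test
simulate_p <- function(true_diff) {
  group_a <- rnorm(n_obs, mean = 0, sd = 1)
  group_b <- rnorm(n_obs, mean = true_diff, sd = 1)
  t.test(group_a, group_b)$p.value
}

# Scenario 1: H0 is true (no difference), so every rejection is a Type I error
p_null <- replicate(n_sim, simulate_p(true_diff = 0))
type1_rate <- mean(p_null < alpha)

# Scenario 2: H0 is false (difference of 0.5), so every non-rejection is a Type II error
p_effect <- replicate(n_sim, simulate_p(true_diff = 0.5))
type2_rate <- mean(p_effect >= alpha)

type1_rate  # should be close to alpha
type2_rate  # estimate of beta under these settings
```

The Type II rate depends heavily on the assumed effect and sample size; changing `n_obs` or `true_diff` will shift it considerably, whereas the Type I rate should stay near \(\alpha\).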

18.3.5 The mathematics of the \(p\)-value

Let’s say that the statistical analysis of clause ORDER and SUBORDTYPE has returned a test statistic of \(\chi^2 = 6.5\) for 2 \(df\). In order to compute the corresponding \(p\)-value, we need to consult the sampling distribution of this test statistic.

A sampling distribution is a probability distribution that assigns probabilities to the values of a test statistic. Because most (if not all) of them are continuous, they have characteristic probability density functions (PDFs). Some of them are illustrated below:

```r
# Load ggplot2
library(ggplot2)

# Define the degrees of freedom
df <- 2

# Create a sequence of x values
x <- seq(0, 30, length.out = 1000)

# Compute the chi-squared density
y <- dchisq(x, df = df)

# Create a data frame
chi_squared_data <- data.frame(x = x, y = y)

# Generate the plot
ggplot(chi_squared_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1) + # Line for the density curve
  labs(
    title = "Chi-Squared Distribution",
    subtitle = "Probability density function with 2 degrees of freedom",
    x = "Chi-squared value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 0.05), xlim = c(0, 30)) + # Zoom in on the right tail
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )
```

```r
# Load ggplot2
library(ggplot2)

# Define the degrees of freedom
df_t <- 10

# Create a sequence of x values
x_t <- seq(-4, 4, length.out = 1000)

# Compute the t-distribution density
y_t <- dt(x_t, df = df_t)

# Create a data frame
t_distribution_data <- data.frame(x = x_t, y = y_t)

# Generate the plot
ggplot(t_distribution_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1) + # Line for the density curve
  labs(
    title = "t-Distribution",
    subtitle = "Probability density function with 10 degrees of freedom",
    x = "t value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 0.4), xlim = c(-4, 4)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )
```

```r
# Load ggplot2
library(ggplot2)

# Define the degrees of freedom
df1 <- 5
df2 <- 10

# Create a sequence of x values
x_f <- seq(0, 5, length.out = 1000)

# Compute the F-distribution density
y_f <- df(x_f, df1 = df1, df2 = df2)

# Create a data frame
f_distribution_data <- data.frame(x = x_f, y = y_f)

# Generate the plot
ggplot(f_distribution_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1) + # Line for the density curve
  labs(
    title = "F-Distribution",
    subtitle = "Probability density function with 5 and 10 degrees of freedom",
    x = "F value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 1), xlim = c(0, 5)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )
```

Because continuous functions have an infinite number of \(x\)-values, the probability of any single value must be 0 (the proof for the underlying theorem is given in Heumann et al. 2022: 544). Therefore, if we are interested in obtaining actual probabilities from the PDF, we can only do so for intervals of values. The probability that a value \(X\) falls into the interval \(a < X < b\) is in fact equivalent to the area under the curve between \(a\) and \(b\) (cf. Equation 18.2).

\[ P(a < X < b) = \int_a^b f(x)dx. \tag{18.2}\]
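Equation 18.2 can be checked numerically. The sketch below uses the \(t\)-distribution with 10 degrees of freedom and the arbitrarily chosen interval \(-1 < X < 1\); both choices are assumptions made for illustration. The area is obtained once by numerical integration of the density and once as a difference of cumulative probabilities.

```r
a <- -1
b <- 1
df_t <- 10

# Density of the t-distribution with 10 degrees of freedom
t_density <- function(x) dt(x, df = df_t)

# Area under the curve between a and b via numerical integration
area_integral <- integrate(t_density, lower = a, upper = b)$value

# The same probability from the cumulative distribution function
area_cdf <- pt(b, df = df_t) - pt(a, df = df_t)

# Probability of a single point: an interval of width zero has no area
point_prob <- pt(b, df = df_t) - pt(b, df = df_t)

area_integral
area_cdf
point_prob
```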

Recall the PDF \(f(x)\) of the \(\chi^2\)-distribution with 2 degrees of freedom. The \(p\)-value corresponds to the area under the curve ranging from \(x = 6.5\) up to \(\infty\), as restated formally in Equation 18.3. This brings us back to the definition of the \(p\)-value: It is the probability that the \(\chi^2\) score is equal to 6.5 or higher, i.e., \(P(\chi^2 \geq 6.5)\).

\[ P(6.5 < X < \infty) = \int_{6.5}^\infty f(x)dx. \tag{18.3}\]
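In R, the integral in Equation 18.3 does not have to be solved by hand. A minimal sketch: `pchisq()` with `lower.tail = FALSE` returns the upper-tail area directly, and `integrate()` applied to the density `dchisq()` should give the same value. The Monte Carlo check at the end is an additional illustration, with the number of draws chosen arbitrarily.

```r
chi_observed <- 6.5
df_chi <- 2

# Upper-tail probability P(X >= 6.5) from the chi-squared distribution
p_exact <- pchisq(chi_observed, df = df_chi, lower.tail = FALSE)

# The same area by integrating the density from 6.5 to infinity
p_integral <- integrate(dchisq, lower = chi_observed, upper = Inf, df = df_chi)$value

# Monte Carlo approximation: share of statistics simulated under H0
# that are at least as large as the observed one
set.seed(1)
chi_simulated <- rchisq(100000, df = df_chi)
p_simulated <- mean(chi_simulated >= chi_observed)

p_exact      # approximately 0.039
p_integral
p_simulated
```

With these values, \(p \approx 0.039\) lies below \(\alpha = 0.05\), so \(H_0\) would be rejected for the clause ORDER and SUBORDTYPE example.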

18.4 Practical considerations

Common pitfalls (cf. Agresti and Kateri 2022: 189-190)

  • Statistical significance is NOT an indication of a causal relationship between the variables of interest (correlation \(\neq\) causation).

  • \(p\)-values do NOT signify the strength of an effect (\(\neq\) effect size). They only help identify whether there is an effect to begin with.

  • \(p\)-values are NOT the probability of the null hypothesis being true.

  • Statistical significance is only a starting point for further scientific inquiry, and by no means the end of it.
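The difference between statistical significance and effect size can be illustrated with a simulation. The sketch below assumes two very large groups (100,000 observations each, an arbitrary choice) whose true means differ by a mere 0.05 standard deviations; Cohen's \(d\) is used here as one possible effect size measure.

```r
set.seed(2024)
n_large <- 100000  # observations per group (assumed)

# Two simulated groups whose true means differ by only 0.05 standard deviations
group_a <- rnorm(n_large, mean = 0, sd = 1)
group_b <- rnorm(n_large, mean = 0.05, sd = 1)

# p-value of a two-sample t-test
p_value <- t.test(group_a, group_b)$p.value

# Cohen's d as a simple effect size: mean difference divided by the pooled SD
pooled_sd <- sqrt((var(group_a) + var(group_b)) / 2)
cohens_d <- (mean(group_b) - mean(group_a)) / pooled_sd

p_value   # far below 0.05
cohens_d  # yet the effect itself is tiny
```

With samples this large, even a negligible difference becomes statistically significant, which is why a \(p\)-value on its own says little about how strong an effect is.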

18.5 Exercises

Exercise 18.1 Schröter & Kortmann (2016) investigate the relationship between subject realisation (overt vs. null) and the grammatical category Person (1.p. vs. 2.p. vs. 3.p.) in three varieties of English (Great Britain vs. Hong Kong vs. Singapore). They report the following test results (2016: 235):

Chi-square test scores:

\[ \begin{align} \text{Singapore:} \quad & \chi^2 = 3.3245, df = 2, p = 0.1897 \\ \text{Hong Kong:} \quad & \chi^2 = 40.799, df = 2, p < 0.01 \\ \text{Great Britain:} \quad & \chi^2 = 3.6183, df = 2, p = 0.1638 \end{align} \]

  • What hypotheses are the authors testing?
  • Assuming a significance level \(\alpha = 0.05\), what statistical conclusions can be drawn from the test results?
  • What could be the theoretical implications of these results?

Exercise 18.2 Try to develop statistical hypotheses for a research project you are currently working on!