Science begins and ends with theory, and statistics acts as the “go-between”. Regardless of the discipline, solid research is characterised by a robust theoretical foundation that gives rise to substantive hypotheses, i.e., theory-driven predictions about a population of interest. From such a substantive hypothesis, it should be possible to derive a statistical hypothesis that restates the prediction in more formal, mathematical terms. After checking it against real-world data, researchers can either confirm or reject their hypothesis, after which they may decide to amend (or even abandon) their theory – or keep it as is.
18.3 Null hypothesis significance testing (NHST)
The NHST framework offers researchers a convenient way of testing their theoretical assumptions. This chiefly involves setting up a set of (ideally) falsifiable statistical hypotheses, gathering evidence from the observed data and computing the (in)famous ‘\(p\)-value’ to determine “statistical significance” – a notion that is frequently misinterpreted in scientific studies.
Is this the only way of testing hypotheses?
The answer is a resounding no. Despite its immense popularity, NHST is problematic in many respects and hence subject to heavy criticism (cf. Dienes 2008: 76; Baguley 2012: 143–144). There are other statistical schools that can remedy many of its shortcomings and come with distinct advantages, such as those relying on likelihood-based inference and Bayesian principles. Although these are also becoming increasingly common in linguistics, they are still restricted to very few sub-disciplines and journals (mostly in the area of psycholinguistics).
18.3.1 \(H_0\) vs. \(H_1\)
Statistical hypotheses always come in pairs: A null hypothesis is accompanied by an alternative hypothesis. They are set up before (!) seeing the data and justified by previous research.
The null hypothesis \(H_0\) describes the “default state of the world” (James et al. 2021: 555). It claims that there is no noteworthy effect to be observed in the data.
The alternative hypothesis \(H_1\) (or \(H_a\)) plainly states that \(H_0\) is false, suggesting that there is an effect of some kind.
Example: Hypotheses for categorical data
We are interested in finding out whether English clause ORDER (‘sc-mc’ or ‘mc-sc’) depends on the type of the subordinate clause (SUBORDTYPE), which can be either temporal (‘temp’) or causal (‘caus’).
Our hypotheses are:
\(H_0:\) The variables ORDER and SUBORDTYPE are independent.
\(H_1:\) The variables ORDER and SUBORDTYPE are not independent.
Example: Hypotheses for continuous data
As part of a phonetic study, we compare the F1 formant frequencies of vowels (in Hz) for male and female speakers of Apache. We put forward the following hypotheses:
\(H_0:\) mean F1 frequency of men \(=\) mean F1 frequency of women.
\(H_1:\) mean F1 frequency of men \(\ne\) mean F1 frequency of women.
In formal terms
To be precise, we use the hypotheses to make statements about a population parameter \(\theta\), which can be a mean \(\mu\) for continuous data or a proportion \(\pi\) for categorical data, among other things. Mathematically, the null and alternative hypotheses can be restated as in Equation 18.1.
\[
H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta \neq \theta_0.
\tag{18.1}\]
In the NHST world, we’re dealing with a “This town ain’t big enough for the both of us” situation: While we have to state both \(H_0\) and \(H_1\), only one of them can remain at the end of the day. But how do we decide between these two?
18.3.2 Test statistics
To facilitate the decision-making process, we proceed to gather statistical evidence from the observed data. Since NHST primarily revolves around \(H_0\) (and not \(H_1\)!), we need to review the evidence the data provide against or in favour of \(H_0\). This is done via a test statistic \(T\) that characterises the sample at hand. Essentially, you can think of \(T\) as a one-value summary of your data.
There are many possible test statistics out there (two common ones are illustrated in the code sketch after this list):
For instance, if the data are discrete, the \(\chi^2\) measure is used to compute differences between observed and expected frequencies in the entire sample.
In the case of continuous data, it is common to rely on \(t\) for quantifying differences between sample means.
Other possible test statistics include the correlation coefficient \(r\), \(z\)-scores, the \(F\)-statistic, and many others.
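To make this more tangible, here is a minimal sketch of how two such test statistics can be obtained in R. The frequencies, sample sizes and means below are invented purely for illustration; only the base R functions chisq.test() and t.test() are standard.

# Hypothetical 2x2 table of ORDER by SUBORDTYPE (invented frequencies)
freqs <- matrix(c(120, 80,    # mc-sc: temp, caus
                  60, 140),   # sc-mc: temp, caus
                nrow = 2, byrow = TRUE,
                dimnames = list(ORDER = c("mc-sc", "sc-mc"),
                                SUBORDTYPE = c("temp", "caus")))

# Chi-squared test of independence: reports the chi-squared statistic and df
chisq.test(freqs)

# Two simulated samples of F1 frequencies (Hz), again purely illustrative
set.seed(123)
f1_men   <- rnorm(30, mean = 500, sd = 50)
f1_women <- rnorm(30, mean = 550, sd = 50)

# Welch two-sample t-test: reports the t statistic, df and the p-value
t.test(f1_men, f1_women)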
18.3.3 Statistical significance
Whether \(H_0\) is ultimately rejected is determined by the significance probability \(p\). Due to the frequency and ferocity with which statistical significance is misinterpreted in the research literature, we will begin by reviewing its technical definition:
“The \(p\)-value is the probability, presuming that \(H_0\) is true, that the test statistic equals the observed value or a value even more extreme in the direction predicted by \(H_a\)” (Agresti and Kateri 2022: 163).
In compact notation, it is equivalent to the conditional probability
\[
P(T \geq \text{observed value} \mid H_0 \text{ is true}).
\] If \(p\) is lower than a pre-defined threshold (typically \(0.05\)), also known as the significance level \(\alpha\), we can reject \(H_0\). However, if \(p \geq 0.05\), this neither justifies rejecting nor accepting the null hypothesis (Baguley 2012: 121).
For example, a \(p\)-value of \(0.02\) means that, if \(H_0\) were true, we would obtain a test statistic at least as extreme as the observed one only 2% of the time. Since \(0.02\) lies below our significance level \(\alpha = 0.05\), this would suggest a statistically significant relationship in the data, and we could therefore reject \(H_0\).
18.3.4 What could go wrong? Type I and Type II errors
There is always a chance that we make the wrong decision: we may reject a true null hypothesis or fail to reject a false one. The four possible outcomes are summarised in the table below (cf. Heumann, Schomaker, and Shalabh 2022: 223):
|  | \(H_0\) is true | \(H_0\) is not true |
|---|---|---|
| \(H_0\) is not rejected | \(\color{green}{\text{Correct decision}}\) | \(\color{red}{\text{Type II } (\beta)\text{-error}}\) |
| \(H_0\) is rejected | \(\color{red}{\text{Type I } (\alpha)\text{-error}}\) | \(\color{green}{\text{Correct decision}}\) |
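If \(H_0\) is in fact true and we test at \(\alpha = 0.05\), we will commit a Type I error in roughly 5% of samples. The following small simulation illustrates this; the sample sizes, means and the use of t.test() are arbitrary choices made purely for demonstration.

# Simulate many data sets for which H0 is true (identical population means)
# and record how often a t-test at alpha = 0.05 wrongly rejects H0
set.seed(42)

p_values <- replicate(10000, {
  x <- rnorm(30, mean = 100, sd = 15)   # group 1; H0 is true by construction
  y <- rnorm(30, mean = 100, sd = 15)   # group 2; same population mean
  t.test(x, y)$p.value
})

# Proportion of Type I errors; this should be close to 0.05
mean(p_values < 0.05)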
18.3.5 The mathematics of the \(p\)-value
Let’s say that the statistical analysis of clause ORDER and SUBORDTYPE has returned a test statistic of \(\chi^2 = 6.5\) with 2 degrees of freedom (\(df\)). In order to compute the corresponding \(p\)-value, we need to consult the sampling distribution of this test statistic.
A sampling distribution is a probability distribution that assigns probabilities to the values of a test statistic. Because most (if not all) of them are continuous, they have characteristic probability density functions (PDFs). Some of them are illustrated below:
\(\chi^2\) distribution
# Load ggplot2
library(ggplot2)

# Define the degrees of freedom
df <- 2

# Create a sequence of x values
x <- seq(0, 30, length.out = 1000)

# Compute the chi-squared density
y <- dchisq(x, df = df)

# Create a data frame
chi_squared_data <- data.frame(x = x, y = y)

# Generate the plot
ggplot(chi_squared_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", size = 1) +  # Line for the density curve
  labs(
    title = "Chi-Squared Distribution",
    subtitle = "Probability density function with 2 degrees of freedom",
    x = "Chi-squared value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 0.5), xlim = c(0, 30)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )
\(t\) distribution
# Define the degrees of freedom
df_t <- 10

# Create a sequence of x values
x_t <- seq(-4, 4, length.out = 1000)

# Compute the t-distribution density
y_t <- dt(x_t, df = df_t)

# Create a data frame
t_distribution_data <- data.frame(x = x_t, y = y_t)

# Generate the plot
ggplot(t_distribution_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", size = 1) +  # Line for the density curve
  labs(
    title = "t-Distribution",
    subtitle = "Probability density function with 10 degrees of freedom",
    x = "t value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 0.4), xlim = c(-4, 4)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )
\(F\) distribution
# Define the degrees of freedom
df1 <- 5
df2 <- 10

# Create a sequence of x values
x_f <- seq(0, 5, length.out = 1000)

# Compute the F-distribution density
y_f <- df(x_f, df1 = df1, df2 = df2)

# Create a data frame
f_distribution_data <- data.frame(x = x_f, y = y_f)

# Generate the plot
ggplot(f_distribution_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", size = 1) +  # Line for the density curve
  labs(
    title = "F-Distribution",
    subtitle = "Probability density function with 5 and 10 degrees of freedom",
    x = "F value",
    y = "Probability density"
  ) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 1), xlim = c(0, 5)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )
Because a continuous variable can take on infinitely many values, the probability of any single value must be 0.¹ Therefore, if we are interested in obtaining actual probabilities from the PDF, we can only do so for intervals of values. The probability that a value \(X\) falls into the interval \(a < X < b\) is in fact equivalent to the area under the curve between \(a\) and \(b\) (cf. Equation 18.2).
¹ The proof for the underlying theorem is given in Heumann et al. (2022: 544).
\[
P(a < X < b) = \int_a^b f(x)dx.
\tag{18.2}\]
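In R, such an interval probability can be obtained numerically with integrate() or, more conveniently, from the cumulative distribution function. The bounds in this sketch are arbitrary and serve only to illustrate Equation 18.2.

# P(1 < X < 3) for a chi-squared variable with 2 degrees of freedom,
# i.e. the area under the PDF between 1 and 3
integrate(dchisq, lower = 1, upper = 3, df = 2)
#> roughly 0.38

# The same probability via the cumulative distribution function
pchisq(3, df = 2) - pchisq(1, df = 2)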
Recall the PDF \(f(x)\) of the \(\chi^2\)-distribution with 2 degrees of freedom. The \(p\)-value corresponds to the green area under the curve ranging from \(x = 6.5\) up to \(\infty\), which can be restated formally in Equation 18.3. This brings us back to the definition of the \(p\)-value: It is the probability that the \(\chi^2\) score is equal to 6.5 or higher, i.e., \(P(\chi^2 \geq 6.5)\).
\[
P(6.5 < X < \infty) = \int_{6.5}^\infty f(x)dx.
\tag{18.3}\]
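Rather than evaluating this integral by hand, we can read the tail probability directly off the cumulative distribution function in R:

# P(chi-squared >= 6.5) for 2 degrees of freedom, i.e. the p-value
pchisq(6.5, df = 2, lower.tail = FALSE)
#> approximately 0.0388

Since this value falls below \(\alpha = 0.05\), we would reject the null hypothesis of independence between ORDER and SUBORDTYPE.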
Statistical significance is NOT an indication of a causal relationship between the variables of interest (correlation \(\neq\) causation).
\(p\)-values do NOT signify the strength of an effect (\(\neq\) effect size). They only help determine whether there is an effect to begin with.
\(p\)-values are NOT the probability of the null hypothesis being true.
Statistical significance is only a starting point for further scientific inquiry, and by no means the end of it.
18.5 Exercises
Exercise 18.1 Schröter and Kortmann (2016) investigate the relationship between subject realisation (overt vs. null) and the grammatical category Person (1.p. vs. 2.p. vs. 3.p.) in three varieties of English (Great Britain vs. Hong Kong vs. Singapore). They report the following test results (2016: 235):
Assuming a significance level \(\alpha = 0.05\), what statistical conclusions can be drawn from the test results?
What could be the theoretical implications of these results?
Exercise 18.2 Try to develop statistical hypotheses for a research project you are currently working on!
Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With R and Python. Boca Raton: CRC Press.
Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Houndmills, Basingstoke: Palgrave Macmillan.
Dienes, Zoltán. 2008. Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Houndmills: Palgrave Macmillan.
Gries, Stefan Thomas. 2021. Statistics for Linguistics with R: A Practical Introduction. 3rd rev. ed. Berlin; Boston: De Gruyter Mouton.
Heumann, Christian, Michael Schomaker, and Shalabh. 2022. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer. https://doi.org/10.1007/978-3-031-11833-3.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. New York: Springer. https://doi.org/10.1007/978-1-0716-1418-1.
Schröter, Verena, and Bernd Kortmann. 2016. “Pronoun Deletion in Hong Kong English and Colloquial Singaporean English.” World Englishes 35 (2): 221–41. https://doi.org/10.1111/weng.12192.