4.6 t-test

Author

Affiliation

Vladimir Buskin

Catholic University of Eichstätt-Ingolstadt

Abstract

This handout covers the application of \(t\)-tests for independent and paired samples, effect size calculation, and ANOVA for comparing multiple groups, with practical examples from phonetic and psycholinguistic data, including visualization, assumption checks, post-hoc testing, and exercises on statistical interpretation and reporting.

Preparation

Load packages and data:

library("readxl")
library("tidyverse")

data_vowels <- read.csv("../datasets/Vowels_Apache.csv", sep = "\t")

Overview and hypotheses

Since the \(\chi^2\) measure exclusively works with categorical variables, a separate test statistic is required if one of them is a continuous variable. The \(t\)-statistic is often used for research questions involving differences between sample means.

For instance, we will compare the mean frequencies of the F1 formants of vowels (in Hz) for male and female speakers of Apache. We forward the following hypotheses:

\(H_0:\) \(\mu_{\text{men}} = \mu_{\text{women}}\)
\(H_1:\) \(\mu_{\text{men}} \neq \mu_{\text{women}}\)

These are so-called two-sided hypotheses: We are only interested in whether or not the means are different, regardless of one mean being higher or lower than the other. By contrast, a one-sided alternative hypothesis predicts an effect in a specific direction (e.g., \(\mu_{\text{men}} > \mu_{\text{women}}\) or \(\mu_{\text{men}} < \mu_{\text{women}}\)).

The \(t\)-test

The way \(t\) is calculated depends on the sources of \(X\) and \(Y\): Do they originate from the same sample or from two (in-)dependent ones?

First, we consider two independent samples from a population:

Sample \(X\) with the observations \(\{x_1, x_2, ..., {x_n}_1\}\), sample size \(n_x\), sample mean \(\hat{\mu}_{x}\) and sample variance \(\hat{\sigma}^2_x\).
Sample \(Y\) with the observations \(\{y_1, y_2, ..., {y_n}_2\}\), sample size \(n_y\), sample mean \(\hat{\mu}_{y}\) and sample variance \(\hat{\sigma}^2_y\).

Descriptive overview

We select the variables of interest and proceed calculate the mean F1 frequencies for each level of SEX, requiring a grouped data frame.

Code

# Filter data so as to show only those observations that are relevant
data_vowels %>% 
  # Filter columns
  select(HZ_F1, SEX) %>%
    # Define grouping variable
    group_by(SEX) %>% 
      # Compute mean and standard deviation for each sex
      summarise(mean = mean(HZ_F1),
                sd = sd(HZ_F1)) -> data_vowels_stats

knitr::kable(data_vowels_stats)

SEX	mean	sd
F	528.8548	110.80099
M	484.2740	87.90112

Code

# Plot distributions
data_vowels_stats %>% 
  ggplot(aes(x = SEX, y = mean)) +
    geom_col() +
    geom_errorbar(aes(x = SEX,
                    ymin = mean-sd,
                    ymax = mean+sd), width = .2) +
    theme_classic()

Code

# Plot quartiles
data_vowels %>% 
  ggplot(aes(x = SEX, y = HZ_F1)) +
    geom_boxplot() +
    theme_classic()

Two-sample \(t\)-test

Suppose that the observations in samples \(X\) and \(Y\) each follow a normal distribution parametrised by \(\mathcal{N}(\mu, \sigma^2)\).

The rationale of the \(t\)-test is then to check how many standard errors \(\mu_{x}\) is away from \(\mu_{y}\). In short:

\[ t = \frac{\hat{\mu}_{x}-\hat{\mu}_{y}}{se} \tag{1}\]

The difficulty lies in selecting an appropriate estimate of the standard error. In the literature, the following scenarios are distinguished (Heumann, Schomaker, and Shalabh 2022: 232–233):

The population variances are known: This is hardly ever the case; instead, we tend to rely on sample estimates (cf. next point).
The population variances are unknown, but equal: As long as the variances (and hence the standard deviations) of the two group samples are not too different, we can estimate the standard error via the pooled variances \(\hat{\sigma}_{\text{pooled}}^2\):

\[ \hat{\sigma}_{\text{pooled}}^2 = \frac{(n_x - 1) \hat{\sigma}^2_x + (n_y - 1) \hat{\sigma}_y^2}{n_x + n_y - 2} \tag{2}\]

Then the standard error corresponds to

\[ se = \sqrt{\frac{\hat{\sigma}_{\text{pooled}}^2}{n_x} + \frac{\hat{\sigma}_{\text{pooled}}^2}{n_y}} \tag{3}\]

# Two-sample test with equal variances
apache_test1 <- t.test(data_vowels$HZ_F1 ~ data_vowels$SEX, 
                      var.equal = TRUE,
                      paired = FALSE)

print(apache_test1)


    Two Sample t-test

data:  data_vowels$HZ_F1 by data_vowels$SEX
t = 2.4416, df = 118, p-value = 0.01611
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
  8.423043 80.738624
sample estimates:
mean in group F mean in group M 
       528.8548        484.2740

The population variances are unknown and unequal: According to Agresti & Kateri (2022: 171), it is better to employ a different \(se\) estimate “[i]f the data show evidence of greatly different standard deviations (with, say, one being at least 50% larger than the other one).” In short, we replace the pooled variances by the unbiased sample variance estimators \(\hat{\sigma}_x^2\) and \(\hat{\sigma}_y^2\) (cf. Equation 4). This special case is also known as the \(t\)-test after Welch.

\[ se = \sqrt{\frac{\hat{\sigma}_x^2}{n_1} + \frac{\hat{\sigma}_y^2}{n_2}} \tag{4}\]

# Welch Two Sample t-test
apache_test2 <- t.test(data_vowels$HZ_F1 ~ data_vowels$SEX, 
                       var.eqal = FALSE,
                       paired = FALSE)

print(apache_test2)


    Welch Two Sample t-test

data:  data_vowels$HZ_F1 by data_vowels$SEX
t = 2.4416, df = 112.19, p-value = 0.01619
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
  8.403651 80.758016
sample estimates:
mean in group F mean in group M 
       528.8548        484.2740

Paired \(t\)-test

If there is more than one observation for a given subject (e.g, before and after an experiment), the samples are called paired (or dependent). Here, the variable \(d_i\) denotes the difference between two data points \(x_i - y_i\) in the \(i\)th observation. The corresponding \(t\)-statistic is obtained by relating the mean difference \(\bar{d} = \frac{1}{n}\sum_{i=1}^n{d_i}\) to its standard error:

\[ t(x, y) = t(d) = \frac{\bar{d}}{\hat{\sigma}_d} \sqrt{n}. \tag{5}\]

The variance of the mean difference is estimated by

\[ \hat{\sigma}_d = \frac{\sum_{i=1}^n({d_i} - \bar{d})^2}{n-1}. \tag{6}\]

Assumptions

Traditionally, the \(t\)-test is based on the assumptions of normality and variance homogeneity (i.e., equal sample variances). As long as the sample sizes are more or less comparable, the two-sided \(t\)-test remains robust even if there is some evidence of non-normality. The \(t\)-test after Welch can additionally handle unequal variances.

If the violations are substantial, it is advisable to use a non-parametric test such as the Wilcoxon-Mann-Whitney (WMW) U-Test instead. In essence, this test compares the probabilities of encountering a value \(x\) from sample \(X\) that is greater than a value \(y\) from sample \(Y\).

Is it possible to test for violations?

Yes, there are formal tests that allow us to test for (non-)normality and (un-)equal variances. However, a visual inspection of some diagnostic plots (e.g., histograms or qq-plots) is usually more than enough!

# Shapiro-Wilk Test of Normality
shapiro.test(data_vowels$HZ_F1) # H0: data points follow the normal distribution.


    Shapiro-Wilk normality test

data:  data_vowels$HZ_F1
W = 0.98996, p-value = 0.5311

# Check histogram
ggplot(data_vowels, aes(x = HZ_F1)) +
  geom_histogram(bins = 30) +
  theme_classic()

# Check quantile-quantile plot
qqnorm(data_vowels$HZ_F1, main = "Q-Q Plot for F1 Frequencies")
qqline(data_vowels$HZ_F1, col = "red")

# F test to compare two variances
var.test(data_vowels$HZ_F1 ~ data_vowels$SEX) # H0: variances are not too different from each other


    F test to compare two variances

data:  data_vowels$HZ_F1 by data_vowels$SEX
F = 1.5889, num df = 59, denom df = 59, p-value = 0.07789
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.949093 2.660040
sample estimates:
ratio of variances 
          1.588907

Effect size

Cohen’s d is a possible effect size measure for continuous data and is obtained by dividing the difference of both sample means by the pooled standard deviation.

cohen.d(data_vowels$HZ_F1, data_vowels$SEX) # see also ?cohen.d for more details


Cohen's d

d estimate: 0.4457697 (small)
95 percent confidence interval:
     lower      upper 
0.07976048 0.81177897

The documentation of the test (type ?cohen.d into the console for details) lists the following interpretation guidelines:

\(|d| < 0.2\): negligible effect
\(|d| < 0.5\): small effect
\(|d| < 0.8\): moderate effect
\(|d| \geq 0.8\): large effect

Reporting the results

According to a two-sample \(t\)-test, there is a significant difference between the mean F1 frequencies of male and female speakers of Apache (\(t = 2.44\), \(df = 112.19\), \(p = 0.02\)), yet the effect is rather small (Cohen’s \(d\) = 0.45). This provides sufficient evidence to reject the null hypothesis at \(\alpha = 0.05\).

ANOVA

A disadvantage of the \(t\)-test is that it only allows comparing two groups at most. For contrasts between three or more groups, ANOVA (Analysis of Variance) is the method of choice. For details, see Kirk (1982: Chapter 5) or Mendenhall et al. (1986: Chapter 13).

The following dataset comprises various semantic and distributional measures on 407 transitive verbs.

psycholing_data <- read_xlsx("psycholing_data.xlsx")

We will group the verbs into four frequency bins: ‘low’, ‘medium-low’, ‘medium-high’ and ‘high’.

Code

# Create frequency bins
psycholing_data <- psycholing_data %>%
  mutate(
    Frequency_bin = case_when(
      Frequency <= quantile(Frequency, 0.25, na.rm = TRUE) ~ "low",
      Frequency <= quantile(Frequency, 0.50, na.rm = TRUE) ~ "moderate", 
      Frequency <= quantile(Frequency, 0.75, na.rm = TRUE) ~ "high",
      TRUE ~ "very_high"
    ),
     # Convert to factor with proper ordering
    Frequency_bin = factor(Frequency_bin, 
                          levels = c("low", "moderate", "high", "very_high"))
  )


# Show some examples
psycholing_data %>% 
  select(Verb, Frequency_bin) %>% 
  head(n = 10)

# A tibble: 10 × 2
   Verb       Frequency_bin
   <chr>      <fct>        
 1 accelerate moderate     
 2 accept     very_high    
 3 accompany  low          
 4 accumulate moderate     
 5 acquire    high         
 6 adore      low          
 7 advance    low          
 8 advise     high         
 9 affect     high         
10 alter      high

The column Neighbourhood density represents the mean distances of word neighbours (collocates) in semantic vector spaces. According to Reilly and Desai (2017), “abstract words tend to have more sparse neighborhoods than concrete words” (2017: 51). This raises the question of how semantic neighbourhoods could be reflected in a verb’s usage patterns, if at all.

We will examine the mean neighbourhood density \(\mu\) in four frequency bins, testing

\[ H_0: \mu_{\text{low}} = \mu_{\text{moderate}} = \mu_{\text{high}} = \mu_{\text{very high}}. \]

From a descriptive point of view, there is indeed some evidence that neighbourhood density and frequency are somehow linked:

Code

psycholing_data %>%
  group_by(Frequency_bin) %>%
  summarise(
    n = n(),
    mean_density = mean(Neighbourhood_density, na.rm = TRUE),
    sd_density = sd(Neighbourhood_density, na.rm = TRUE),
    median_density = median(Neighbourhood_density, na.rm = TRUE),
    .groups = 'drop'
  )

# A tibble: 4 × 5
  Frequency_bin     n mean_density sd_density median_density
  <fct>         <int>        <dbl>      <dbl>          <dbl>
1 low             123         61.9       37.1           53.2
2 moderate         81         89.6       41.6           81.4
3 high            103        131.        45.3          126. 
4 very_high       100        152.        51.8          148.

Code

ggplot(psycholing_data, aes(x = Frequency_bin, y = Neighbourhood_density)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5) +
  stat_summary(
    fun = mean,
    geom = "point",
    shape = 23,  # filled diamond
    size = 3,
    fill = "white",
    color = "black",
    stroke = 0.8
  ) +
  labs(
    title = "Neighbourhood Density of Transitive Verbs by Frequency",
    subtitle = "Bins based on frequency quartiles",
    x = "Frequency Bin", 
    y = "Neighbourhood Density"
  ) +
  theme_minimal()

In order to this hypothesis, R’s aov() function performs an \(F\) test, contrasting the Mean Square for Treatments (MST) with the Mean Square for Error (MSE).

To begin with, the total sum of squared errors \(\sum_{i=1}^n (y_i - \bar{y})^2\) is computed by summing over the squared errors of each of the \(1, \dots, k\) groups.

\[ \text{Total SS} = \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 \tag{7}\]

Let \(\bar{T_i}\) denote the sample mean of the \(i\)th group. Then the remaining quantities are defined by

\[ SST = \sum_{i=1}^k n_i(\bar{T_i} - \bar{y})^2 \tag{8}\] \[ SSE = \sum_{i=1}^k\sum_{j=1}^{n_i} (y_{ij} - \bar{T_i})^2 \tag{9}\]

From these, one can obtain the MST (i.e, \(\text{SST} / k-1\)) and MSE (i.e, \(\text{SSE} / n_1 + n_2 + \dots + n_k - k\)).

In R:

anova_model <- aov(Neighbourhood_density ~ Frequency_bin, data = psycholing_data)

summary(anova_model)

               Df Sum Sq Mean Sq F value Pr(>F)    
Frequency_bin   3 539765  179922    92.8 <2e-16 ***
Residuals     403 781328    1939                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To compute an effect size measure, we could use eta_squared() from the effectsize package:

eta_squared(anova_model)

For one-way between subjects designs, partial eta squared is equivalent
  to eta squared. Returning eta squared.

# Effect Size for ANOVA

Parameter     | Eta2 |       95% CI
-----------------------------------
Frequency_bin | 0.41 | [0.35, 1.00]

- One-sided CIs: upper bound fixed at [1.00].

According to a one-way ANOVA, the main effect of frequency is statistically significant and large (F(3, 403) = 92.80, \(p\) < 0.001, \(\eta^2\) = 0.41, 95% CI [0.35, 1.00]).

Exercises

Tier 1

Exercise 1 You are investigating whether high-frequency words are processed faster than low-frequency words in a lexical decision task. State the null and alternative hypotheses for:

A two-sided test comparing reaction times
A one-sided test (assuming high-frequency words are processed faster)

Exercise 2 How does the \(t\)-test differ from the chi-squared test in terms of the type of data and research questions it addresses?

Tier 2

Exercise 3 The Dispersion column indicates how evenly words are distributed across text categories in a corpus. We will perform a (rather crude) division into words with high dispersion (‘even’) and low dispersion (‘uneven’).

# Create dispersion bins
psycholing_data <- psycholing_data %>%
  mutate(
    Dispersion_bin = case_when(
      Dispersion <= quantile(Dispersion, 0.5, na.rm = TRUE) ~ "even",
      TRUE ~ "uneven"
    ),
     # Convert to factor with proper ordering
    Dispersion_bin = factor(Dispersion_bin, 
                          levels = c("even", "uneven"))
  )

Earlier, we showed that word frequency and neighbourhood density go hand in hand. Since dispersion also expresses distributional information, we might expect differences in neighbourhood density here as well. Investigate their relationship by

forwarding undirected hypotheses;
computing and visualising the density means and standard deviations for each dispersion bin;
conducting an appropriate statistical test and assessing the extent to which its assumptions are violated;
reporting the results;
giving a brief linguistic interpretation.

Tier 3

Exercise 4 When an ANOVA reveals significant differences between groups, it tells us that at least one group differs from the others, but doesn’t specify which groups differ. Post-hoc tests address this by examining all possible pairwise comparisons between groups.

You previously conducted an ANOVA to test whether neighborhood density differs across frequency groups (‘low’, ‘moderate’, ‘high’, and ‘very high’):

Recall the ANOVA model:

anova_model <- aov(formula = Neighbourhood_density ~ Frequency_bin, data = psycholing_data)

Now perform a post-hoc analysis using Tukey’s Honest Significant Difference (HSD) test, which controls for multiple comparisons:

TukeyHSD(anova_model)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Neighbourhood_density ~ Frequency_bin, data = psycholing_data)

$Frequency_bin
                       diff       lwr       upr     p adj
moderate-low       27.69311 11.438849  43.94737 0.0000833
high-low           69.37941 54.207854  84.55098 0.0000000
very_high-low      90.30297 75.008055 105.59789 0.0000000
high-moderate      41.68630 24.817084  58.55552 0.0000000
very_high-moderate 62.60986 45.629614  79.59011 0.0000000
very_high-high     20.92356  4.976643  36.87047 0.0043341

Interpret the findings by addressing which frequency groups significantly differ from each other. How do these results relate to the original ANOVA findings?

Exercise 5 Multiple testing is a serious problem in the Null Hypothesis Significance Testing framework. Baguley (2012: 366) writes:

Multiple testing arises for several reasons. One reason is that the research questions rarely reduce to a single hypothesis test. There are usually several hypotheses of interest to a researcher and several tests are conducted. As each test has its own Type I error rate, the overall rate for multiple tests (the chance of at one Type I error) rises rapidly with the number of uncorrected tests.

The Tukey HSD or the Configural Frequency Analysis for categorical data are typical examples of multiple testing. Inspect the documentation of these approaches and find out whether (and, if possible, how) these methods adjust p-values or critical values to control the familywise error rate.

References

Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With r and Python. Boca Raton: CRC Press.

Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Houndmills, Basingstoke: Palgrave Macmillan.

Heumann, Christian, Michael Schomaker, and Shalabh. 2022. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in r. 2nd ed. Cham: Springer. https://doi.org/10.1007/978-3-031-11833-3.

Kirk, Roger E. 1982. Experimental Design: Procedures for the Behavioral Sciences. 2. ed. Monterey, Calif.: Brooks, Cole.

Reilly, Megan, and Rutvik H. Desai. 2017. “Effects of Semantic Neighborhood Density in Abstract and Concrete Words.” Cognition 169: 46–53. https://doi.org/10.1016/j.cognition.2017.08.004.

Scheaffer, Richard L., William Mendenhall, and Dennid D. Wackerly. 1986. Mathematical Statistics with Applications. 3. ed. Boston, Mass.: Duxbury Pr.

Recommended reading

Preparation

Overview and hypotheses

The \(t\)-test

Descriptive overview

Two-sample \(t\)-test

Paired \(t\)-test

Assumptions

Effect size

Reporting the results

ANOVA

Exercises

Tier 1

Tier 2

Tier 3

References