library("readxl")
library("tidyverse")
<- read.csv("../datasets/Vowels_Apache.csv", sep = "\t") data_vowels
4.7 t-test
Recommended reading
Agresti & Kateri (2022: Chapter 5.3)
Heumann et al. (2022: Chapter 10.3.3)
Preparation
Load packages and data:
Overview and hypotheses
Since the \(\chi^2\) measure exclusively works with categorical variables, a separate test statistic is required if one of them is a continuous variable. The \(t\)-statistic is often used for research questions involving differences between sample means.
For instance, we will compare the mean frequencies of the F1 formants of vowels (in Hz) for male and female speakers of Apache. We forward the following hypotheses:
\(H_0:\) \(\mu_{\text{men}} = \mu_{\text{women}}\)
\(H_1:\) \(\mu_{\text{men}} \neq \mu_{\text{women}}\)
These are so-called two-sided hypotheses: We are only interested in whether or not the means are different, regardless of one mean being higher or lower than the other. By contrast, a one-sided alternative hypothesis predicts an effect in a specific direction (e.g., \(\mu_{\text{men}} > \mu_{\text{women}}\) or \(\mu_{\text{men}} < \mu_{\text{women}}\)).
The \(t\)-test
The way \(t\) is calculated depends on the sources of \(X\) and \(Y\): Do they originate from the same sample or from two (in-)dependent ones?
First, we consider two independent samples from a population:
Sample \(X\) with the observations \(\{x_1, x_2, ..., {x_n}_1\}\), sample size \(n_x\), sample mean \(\hat{\mu}_{x}\) and sample variance \(\hat{\sigma}^2_x\).
Sample \(Y\) with the observations \(\{y_1, y_2, ..., {y_n}_2\}\), sample size \(n_y\), sample mean \(\hat{\mu}_{y}\) and sample variance \(\hat{\sigma}^2_y\).
Descriptive overview
We select the variables of interest and proceed calculate the mean F1 frequencies
for each level of SEX
, requiring a grouped data frame.
Code
# Filter data so as to show only those observations that are relevant
%>%
data_vowels # Filter columns
select(HZ_F1, SEX) %>%
# Define grouping variable
group_by(SEX) %>%
# Compute mean and standard deviation for each sex
summarise(mean = mean(HZ_F1),
sd = sd(HZ_F1)) -> data_vowels_stats
::kable(data_vowels_stats) knitr
SEX | mean | sd |
---|---|---|
F | 528.8548 | 110.80099 |
M | 484.2740 | 87.90112 |
Code
# Plot distributions
%>%
data_vowels_stats ggplot(aes(x = SEX, y = mean)) +
geom_col() +
geom_errorbar(aes(x = SEX,
ymin = mean-sd,
ymax = mean+sd), width = .2) +
theme_classic()
Code
# Plot quartiles
%>%
data_vowels ggplot(aes(x = SEX, y = HZ_F1)) +
geom_boxplot() +
theme_classic()
Two-sample \(t\)-test
Suppose that the observations in samples \(X\) and \(Y\) each follow a normal distribution parametrised by \(\mathcal{N}(\mu, \sigma^2)\).
The rationale of the \(t\)-test is then to check how many standard errors \(\mu_{x}\) is away from \(\mu_{y}\). In short:
\[ t = \frac{\hat{\mu}_{x}-\hat{\mu}_{y}}{se} \tag{1}\]
The difficulty lies in selecting an appropriate estimate of the standard error. In the literature, the following scenarios are distinguished (Heumann, Schomaker, and Shalabh 2022: 232–233):
- The population variances are known: This is hardly ever the case; instead, we tend to rely on sample estimates (cf. next point).
- The population variances are unknown, but equal: As long as the variances (and hence the standard deviations) of the two group samples are not too different, we can estimate the standard error via the pooled variances \(\hat{\sigma}_{\text{pooled}}^2\):
\[ \hat{\sigma}_{\text{pooled}}^2 = \frac{(n_x - 1) \hat{\sigma}^2_x + (n_y - 1) \hat{\sigma}_y^2}{n_x + n_y - 2} \tag{2}\]
Then the standard error corresponds to
\[ se = \sqrt{\frac{\hat{\sigma}_{\text{pooled}}^2}{n_x} + \frac{\hat{\sigma}_{\text{pooled}}^2}{n_y}} \tag{3}\]
# Two-sample test with equal variances
<- t.test(data_vowels$HZ_F1 ~ data_vowels$SEX,
apache_test1 var.equal = TRUE,
paired = FALSE)
print(apache_test1)
Two Sample t-test
data: data_vowels$HZ_F1 by data_vowels$SEX
t = 2.4416, df = 118, p-value = 0.01611
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
8.423043 80.738624
sample estimates:
mean in group F mean in group M
528.8548 484.2740
- The population variances are unknown and unequal: According to Agresti & Kateri (2022: 171), it is better to employ a different \(se\) estimate “[i]f the data show evidence of greatly different standard deviations (with, say, one being at least 50% larger than the other one).” In short, we replace the pooled variances by the unbiased sample variance estimators \(\hat{\sigma}_x^2\) and \(\hat{\sigma}_y^2\) (cf. Equation 4). This special case is also known as the \(t\)-test after Welch.
\[ se = \sqrt{\frac{\hat{\sigma}_x^2}{n_1} + \frac{\hat{\sigma}_y^2}{n_2}} \tag{4}\]
# Welch Two Sample t-test
<- t.test(data_vowels$HZ_F1 ~ data_vowels$SEX,
apache_test2 var.eqal = FALSE,
paired = FALSE)
print(apache_test2)
Welch Two Sample t-test
data: data_vowels$HZ_F1 by data_vowels$SEX
t = 2.4416, df = 112.19, p-value = 0.01619
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
8.403651 80.758016
sample estimates:
mean in group F mean in group M
528.8548 484.2740
Paired \(t\)-test
If there is more than one observation for a given subject (e.g, before and after an experiment), the samples are called paired (or dependent). Here, the variable \(d_i\) denotes the difference between two data points \(x_i - y_i\) in the \(i\)th observation. The corresponding \(t\)-statistic is obtained by relating the mean difference \(\bar{d} = \frac{1}{n}\sum_{i=1}^n{d_i}\) to its standard error:
\[ t(x, y) = t(d) = \frac{\bar{d}}{\hat{\sigma}_d} \sqrt{n}. \tag{5}\]
The variance of the mean difference is estimated by
\[ \hat{\sigma}_d = \frac{\sum_{i=1}^n({d_i} - \bar{d})^2}{n-1}. \tag{6}\]
Assumptions
Traditionally, the \(t\)-test is based on the assumptions of normality and variance homogeneity (i.e., equal sample variances). As long as the sample sizes are more or less comparable, the two-sided \(t\)-test remains robust even if there is some evidence of non-normality. The \(t\)-test after Welch can additionally handle unequal variances.
If the violations are substantial, it is advisable to use a non-parametric test such as the Wilcoxon-Mann-Whitney (WMW) U-Test instead. In essence, this test compares the probabilities of encountering a value \(x\) from sample \(X\) that is greater than a value \(y\) from sample \(Y\).
Yes, there are formal tests that allow us to test for (non-)normality and (un-)equal variances. However, a visual inspection of some diagnostic plots (e.g., histograms or qq-plots) is usually more than enough!
# Shapiro-Wilk Test of Normality
shapiro.test(data_vowels$HZ_F1) # H0: data points follow the normal distribution.
Shapiro-Wilk normality test
data: data_vowels$HZ_F1
W = 0.98996, p-value = 0.5311
# Check histogram
ggplot(data_vowels, aes(x = HZ_F1)) +
geom_histogram(bins = 30) +
theme_classic()
# Check quantile-quantile plot
qqnorm(data_vowels$HZ_F1, main = "Q-Q Plot for F1 Frequencies")
qqline(data_vowels$HZ_F1, col = "red")
# F test to compare two variances
var.test(data_vowels$HZ_F1 ~ data_vowels$SEX) # H0: variances are not too different from each other
F test to compare two variances
data: data_vowels$HZ_F1 by data_vowels$SEX
F = 1.5889, num df = 59, denom df = 59, p-value = 0.07789
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.949093 2.660040
sample estimates:
ratio of variances
1.588907
Effect size
Cohen’s d is a possible effect size measure for continuous data and is obtained by dividing the difference of both sample means by the pooled standard deviation.
cohen.d(data_vowels$HZ_F1, data_vowels$SEX) # see also ?cohen.d for more details
Cohen's d
d estimate: 0.4457697 (small)
95 percent confidence interval:
lower upper
0.07976048 0.81177897
The documentation of the test (type ?cohen.d
into the console for details) lists the following interpretation guidelines:
- \(|d| < 0.2\): negligible effect
- \(|d| < 0.5\): small effect
- \(|d| < 0.8\): moderate effect
- \(|d| \geq 0.8\): large effect
Reporting the results
According to a two-sample \(t\)-test, there is a significant difference between the mean F1 frequencies
of male and female speakers of Apache (\(t = 2.44\), \(df = 112.19\), \(p = 0.02\)), yet the effect is rather small (Cohen’s \(d\) = 0.45). This provides sufficient evidence to reject the null hypothesis at \(\alpha = 0.05\).
ANOVA
A disadvantage of the \(t\)-test is that it only allows comparing two groups at most. For contrasts between three or more groups, ANOVA (Analysis of Variance) is the method of choice. For details, see Kirk (1982: Chapter 5) or Mendenhall et al. (1986: Chapter 13).
The following dataset comprises various semantic and distributional measures on 407 transitive verbs.
<- read_xlsx("psycholing_data.xlsx") psycholing_data
We will group the verbs into four frequency bins: ‘low’, ‘medium-low’, ‘medium-high’ and ‘high’.
Code
# Create frequency bins
<- psycholing_data %>%
psycholing_data mutate(
Frequency_bin = case_when(
<= quantile(Frequency, 0.25, na.rm = TRUE) ~ "low",
Frequency <= quantile(Frequency, 0.50, na.rm = TRUE) ~ "moderate",
Frequency <= quantile(Frequency, 0.75, na.rm = TRUE) ~ "high",
Frequency TRUE ~ "very_high"
),# Convert to factor with proper ordering
Frequency_bin = factor(Frequency_bin,
levels = c("low", "moderate", "high", "very_high"))
)
# Show some examples
%>%
psycholing_data select(Verb, Frequency_bin) %>%
head(n = 10)
# A tibble: 10 × 2
Verb Frequency_bin
<chr> <fct>
1 accelerate moderate
2 accept very_high
3 accompany low
4 accumulate moderate
5 acquire high
6 adore low
7 advance low
8 advise high
9 affect high
10 alter high
The column Neighbourhood density
represents the mean distances of word neighbours (collocates) in semantic vector spaces. According to Reilly and Desai (2017), “abstract words tend to have more sparse neighborhoods than concrete words” (2017: 51). This raises the question of how semantic neighbourhoods could be reflected in a verb’s usage patterns, if at all.
We will examine the mean neighbourhood density \(\mu\) in four frequency bins, testing
\[ H_0: \mu_{\text{low}} = \mu_{\text{moderate}} = \mu_{\text{high}} = \mu_{\text{very high}}. \]
From a descriptive point of view, there is indeed some evidence that neighbourhood density and frequency are somehow linked:
Code
%>%
psycholing_data group_by(Frequency_bin) %>%
summarise(
n = n(),
mean_density = mean(Neighbourhood_density, na.rm = TRUE),
sd_density = sd(Neighbourhood_density, na.rm = TRUE),
median_density = median(Neighbourhood_density, na.rm = TRUE),
.groups = 'drop'
)
# A tibble: 4 × 5
Frequency_bin n mean_density sd_density median_density
<fct> <int> <dbl> <dbl> <dbl>
1 low 123 61.9 37.1 53.2
2 moderate 81 89.6 41.6 81.4
3 high 103 131. 45.3 126.
4 very_high 100 152. 51.8 148.
Code
ggplot(psycholing_data, aes(x = Frequency_bin, y = Neighbourhood_density)) +
geom_boxplot() +
geom_jitter(width = 0.2, alpha = 0.5) +
stat_summary(
fun = mean,
geom = "point",
shape = 23, # filled diamond
size = 3,
fill = "white",
color = "black",
stroke = 0.8
+
) labs(
title = "Neighbourhood Density of Transitive Verbs by Frequency",
subtitle = "Bins based on frequency quartiles",
x = "Frequency Bin",
y = "Neighbourhood Density"
+
) theme_minimal()
In order to this hypothesis, R’s aov()
function performs an \(F\) test, contrasting the Mean Square for Treatments (MST) with the Mean Square for Error (MSE).
To begin with, the total sum of squared errors \(\sum_{i=1}^n (y_i - \bar{y})^2\) is computed by summing over the squared errors of each of the \(1, \dots, k\) groups.
\[ \text{Total SS} = \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 \tag{7}\]
Let \(\bar{T_i}\) denote the sample mean of the \(i\)th group. Then the remaining quantities are defined by
\[ SST = \sum_{i=1}^k n_i(\bar{T_i} - \bar{y})^2 \tag{8}\] \[ SSE = \sum_{i=1}^k\sum_{j=1}^{n_i} (y_{ij} - \bar{T_i})^2 \tag{9}\]
From these, one can obtain the MST (i.e, \(\text{SST} / k-1\)) and MSE (i.e, \(\text{SSE} / n_1 + n_2 + \dots + n_k - k\)).
In R:
<- aov(Neighbourhood_density ~ Frequency_bin, data = psycholing_data)
anova_model
summary(anova_model)
Df Sum Sq Mean Sq F value Pr(>F)
Frequency_bin 3 539765 179922 92.8 <2e-16 ***
Residuals 403 781328 1939
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To compute an effect size measure, we could use eta_squared()
from the effectsize
package:
eta_squared(anova_model)
For one-way between subjects designs, partial eta squared is equivalent
to eta squared. Returning eta squared.
# Effect Size for ANOVA
Parameter | Eta2 | 95% CI
-----------------------------------
Frequency_bin | 0.41 | [0.35, 1.00]
- One-sided CIs: upper bound fixed at [1.00].
According to a one-way ANOVA, the main effect of frequency is statistically significant and large (F(3, 403) = 92.80, \(p\) < 0.001, \(\eta^2\) = 0.41, 95% CI [0.35, 1.00]).
Exercises
Tier 1
Exercise 1 You are investigating whether high-frequency words are processed faster than low-frequency words in a lexical decision task. State the null and alternative hypotheses for:
- A two-sided test comparing reaction times
- A one-sided test (assuming high-frequency words are processed faster)
Exercise 2 How does the \(t\)-test differ from the chi-squared test in terms of the type of data and research questions it addresses?
Tier 2
Exercise 3 The Dispersion
column indicates how evenly words are distributed across text categories in a corpus. We will perform a (rather crude) division into words with high dispersion (‘even’) and low dispersion (‘uneven’).
# Create dispersion bins
<- psycholing_data %>%
psycholing_data mutate(
Dispersion_bin = case_when(
<= quantile(Dispersion, 0.5, na.rm = TRUE) ~ "even",
Dispersion TRUE ~ "uneven"
),# Convert to factor with proper ordering
Dispersion_bin = factor(Dispersion_bin,
levels = c("even", "uneven"))
)
Earlier, we showed that word frequency and neighbourhood density go hand in hand. Since dispersion also expresses distributional information, we might expect differences in neighbourhood density here as well. Investigate their relationship by
- forwarding undirected hypotheses;
- computing and visualising the density means and standard deviations for each dispersion bin;
- conducting an appropriate statistical test and assessing the extent to which its assumptions are violated;
- reporting the results;
- giving a brief linguistic interpretation.
Tier 3
Exercise 4 When an ANOVA reveals significant differences between groups, it tells us that at least one group differs from the others, but doesn’t specify which groups differ. Post-hoc tests address this by examining all possible pairwise comparisons between groups.
You previously conducted an ANOVA to test whether neighborhood density differs across frequency groups (‘low’, ‘moderate’, ‘high’, and ‘very high’):
Recall the ANOVA model:
<- aov(formula = Neighbourhood_density ~ Frequency_bin, data = psycholing_data) anova_model
Now perform a post-hoc analysis using Tukey’s Honest Significant Difference (HSD) test, which controls for multiple comparisons:
TukeyHSD(anova_model)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Neighbourhood_density ~ Frequency_bin, data = psycholing_data)
$Frequency_bin
diff lwr upr p adj
moderate-low 27.69311 11.438849 43.94737 0.0000833
high-low 69.37941 54.207854 84.55098 0.0000000
very_high-low 90.30297 75.008055 105.59789 0.0000000
high-moderate 41.68630 24.817084 58.55552 0.0000000
very_high-moderate 62.60986 45.629614 79.59011 0.0000000
very_high-high 20.92356 4.976643 36.87047 0.0043341
Interpret the findings by addressing which frequency groups significantly differ from each other. How do these results relate to the original ANOVA findings?
Exercise 5 Multiple testing is a serious problem in the Null Hypothesis Significance Testing framework. Baguley (2012: 366) writes:
Multiple testing arises for several reasons. One reason is that the research questions rarely reduce to a single hypothesis test. There are usually several hypotheses of interest to a researcher and several tests are conducted. As each test has its own Type I error rate, the overall rate for multiple tests (the chance of at one Type I error) rises rapidly with the number of uncorrected tests.
The Tukey HSD or the Configural Frequency Analysis for categorical data are typical examples of multiple testing. Inspect the documentation of these approaches and find out whether (and, if possible, how) these methods adjust p-values or critical values to control the familywise error rate.