21 t-test
21.1 Preparation
- Load packages and data:

library("readxl")
library("tidyverse")

data_vowels <- read.csv("Vowels_Apache.csv", sep = "\t")
21.2 The \(t\)-test
Since the \(\chi^2\) measure works exclusively with categorical variables, a separate test statistic is required when one of the variables is continuous. The \(t\) statistic is often used for research questions involving differences between sample means. How \(t\) is calculated depends on the origin of \(X\) and \(Y\): do they come from two independent samples, or are they paired measurements from the same sample?
First, we consider two independent samples from a population:
Sample \(X\) with the observations \(\{x_1, x_2, ..., x_{n_1}\}\), sample size \(n_1\), sample mean \(\bar{x}\) and sample variance \(s^2_x\).
Sample \(Y\) with the observations \(\{y_1, y_2, ..., y_{n_2}\}\), sample size \(n_2\), sample mean \(\bar{y}\) and sample variance \(s^2_y\).
The \(t\)-statistic after Welch is given by:
\[ t(x, y) = \frac{|\bar{x} - \bar{y}|}{\sqrt{\frac{s^2_x}{n_1} + \frac{s^2_y}{n_2}}} \tag{21.1}\]
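Equation 21.1 can be checked directly in R. The sketch below uses two small simulated samples (not the chapter's vowel data) and compares the hand-computed statistic with the value returned by `t.test()`, which performs the Welch test by default:

```r
# Two toy samples (simulated; not the Apache vowel data)
set.seed(1)
x <- rnorm(30, mean = 520, sd = 100)
y <- rnorm(25, mean = 480, sd = 80)

# Welch's t by hand, following Equation 21.1
t_manual <- abs(mean(x) - mean(y)) /
  sqrt(var(x) / length(x) + var(y) / length(y))

# Built-in Welch test; take the absolute value to match Equation 21.1
t_builtin <- abs(unname(t.test(x, y)$statistic))

all.equal(t_manual, t_builtin)  # the two values agree
```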
If there is more than one observation per subject (e.g., before and after an experiment), the samples are called dependent or paired. The paired \(t\)-test assumes two continuous variables \(X\) and \(Y\).
In the paired test, the variable \(d\) denotes the difference between them, i.e., \(x - y\). The corresponding test statistic is obtained via
\[ t(x, y) = t(d) = \frac{\bar{d}}{s_d} \sqrt{n}. \tag{21.2}\]
Here, \(\bar{d} = \frac{1}{n}\sum_{i=1}^n{d_i}\) denotes the mean difference, and the variance is given by
\[ s^2_d = \frac{\sum_{i=1}^n({d_i} - \bar{d})^2}{n-1}. \tag{21.3}\]
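The paired statistic from Equation 21.2 can likewise be verified by hand. The example below uses simulated before/after measurements (hypothetical data, not from the chapter) and compares the manual result with `t.test(..., paired = TRUE)`:

```r
# Paired toy data: one measurement before and after per subject (simulated)
set.seed(2)
before <- rnorm(20, mean = 500, sd = 90)
after  <- before + rnorm(20, mean = 15, sd = 30)

d <- before - after   # per-subject differences
n <- length(d)

# Equation 21.2: t(d) = (dbar / s_d) * sqrt(n)
t_manual <- mean(d) / sd(d) * sqrt(n)

# Built-in paired test
t_builtin <- unname(t.test(before, after, paired = TRUE)$statistic)

all.equal(t_manual, t_builtin)  # the two values agree
```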
Traditionally, the \(t\)-test is based on the assumptions of …
- Normality and
- Variance homogeneity (i.e., equal sample variances). Note that this does not apply to the \(t\)-test after Welch, which can handle unequal variances.
The implementation in R is very straightforward:
t.test(data_vowels$HZ_F1 ~ data_vowels$SEX, paired = FALSE) # there is a significant difference!
Welch Two Sample t-test
data: data_vowels$HZ_F1 by data_vowels$SEX
t = 2.4416, df = 112.19, p-value = 0.01619
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
8.403651 80.758016
sample estimates:
mean in group F mean in group M
528.8548 484.2740
If at least one assumption of the \(t\)-test is violated, it is advisable to use a non-parametric test such as the Wilcoxon-Mann-Whitney (WMW) U-test instead. In essence, this test compares the probability that a value \(x\) drawn from sample \(X\) exceeds a value \(y\) drawn from sample \(Y\) with the reverse probability. For details, see ?wilcox.test.
wilcox.test(data_vowels$HZ_F1 ~ data_vowels$SEX)
Wilcoxon rank sum test with continuity correction
data: data_vowels$HZ_F1 by data_vowels$SEX
W = 2270, p-value = 0.01373
alternative hypothesis: true location shift is not equal to 0
21.3 Workflow in R
21.3.1 Define hypotheses
\(H_0:\) mean F1 frequency of men \(=\) mean F1 frequency of women.
\(H_1:\) mean F1 frequency of men \(\ne\) mean F1 frequency of women.
21.3.2 Descriptive overview
We select the variables of interest and then calculate the mean F1 frequency for each level of SEX, which requires a grouped data frame.
# Filter data so as to show only those observations that are relevant
data_vowels %>%
  # Filter columns
  select(HZ_F1, SEX) %>%
  # Define grouping variable
  group_by(SEX) %>%
  # Compute mean and standard deviation for each sex
  summarise(mean = mean(HZ_F1),
            sd = sd(HZ_F1)) -> data_vowels_stats

knitr::kable(data_vowels_stats)
| SEX | mean | sd |
|---|---|---|
| F | 528.8548 | 110.80099 |
| M | 484.2740 | 87.90112 |
# Plot group means with error bars (mean ± sd)
data_vowels_stats %>%
  ggplot(aes(x = SEX, y = mean)) +
  geom_col() +
  geom_errorbar(aes(x = SEX,
                    ymin = mean - sd,
                    ymax = mean + sd), width = .2) +
  theme_classic()
# Plot quartiles
data_vowels %>%
  ggplot(aes(x = SEX, y = HZ_F1)) +
geom_boxplot() +
theme_classic()
21.3.3 Check \(t\)-test assumptions
# Normality
shapiro.test(data_vowels$HZ_F1) # H0: the data follow a normal distribution; note that this test can be unreliable (insensitive for small samples, oversensitive for large ones)
Shapiro-Wilk normality test
data: data_vowels$HZ_F1
W = 0.98996, p-value = 0.5311
# Check histogram
ggplot(data_vowels, aes(x = HZ_F1)) +
geom_histogram(bins = 30) +
theme_classic()
# Variance homogeneity
var.test(data_vowels$HZ_F1 ~ data_vowels$SEX) # H0: the variances are equal
F test to compare two variances
data: data_vowels$HZ_F1 by data_vowels$SEX
F = 1.5889, num df = 59, denom df = 59, p-value = 0.07789
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.949093 2.660040
sample estimates:
ratio of variances
1.588907
21.3.4 Running the test
# t-test for two independent samples
t.test(data_vowels$HZ_F1 ~ data_vowels$SEX, paired = FALSE) # there is a significant difference between sample means!
Welch Two Sample t-test
data: data_vowels$HZ_F1 by data_vowels$SEX
t = 2.4416, df = 112.19, p-value = 0.01619
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
8.403651 80.758016
sample estimates:
mean in group F mean in group M
528.8548 484.2740
21.3.5 Effect size
Cohen’s d is a possible effect size measure for continuous data and is obtained by dividing the difference of both sample means by the pooled standard deviation:
\[\frac{\bar{x} - \bar{y}}{\sqrt{\frac{{(n_1 - 1)s_x^2 + (n_2 - 1)s_y^2}}{{n_1 + n_2 - 2}}}}.\]
library("effsize") # assuming cohen.d() from the effsize package
cohen.d(data_vowels$HZ_F1, data_vowels$SEX) # see also ?cohen.d for more details
Cohen's d
d estimate: 0.4457697 (small)
95 percent confidence interval:
lower upper
0.07976048 0.81177897
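The pooled-standard-deviation formula above can also be computed by hand. The sketch below uses simulated toy samples (hypothetical, not the vowel data) and applies the formula term by term:

```r
# Cohen's d by hand on toy data (simulated; not the Apache vowel data)
set.seed(3)
x <- rnorm(60, mean = 530, sd = 110)
y <- rnorm(60, mean = 485, sd = 90)
n1 <- length(x)
n2 <- length(y)

# Pooled standard deviation from the denominator of the formula above
s_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))

# Cohen's d: difference of sample means divided by the pooled sd
d_manual <- (mean(x) - mean(y)) / s_pooled
d_manual
```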
21.3.6 Reporting the results
According to a two-sample \(t\)-test, there is a significant difference between the mean F1 frequencies
of male and female speakers of Apache (\(t = 2.44\), \(df = 112.19\), \(p < 0.05\)). Therefore, \(H_0\) is rejected.