A binomial distribution is a statistical distribution that arises in experiments with only two possible outcomes (0 and 1, also referred to as failure and success, respectively) whose trials are independent of each other when the experiment is repeated.
For linguists: Gries (2021: 39–45)
General: Baguley (2012: Chapters 2.3.1, 4.5.1, 4.8.4)
To illustrate this, consider a binomial distribution in linguistics, such as gender bias in pronoun use. The research question is as follows:
Do people tend to exhibit gender bias in pronoun use (e.g. he or she) when referring to professionals in leadership roles?
One might expect pronouns to be used in a way that aligns with traditional gender stereotypes, or one might notice efforts to counteract such biases through more inclusive and neutral language. Either way, each pronoun analysed here is either masculine (he) or feminine (she), so the data are binary and have the structure required for a binomial distribution.
A test that is built on the assumptions of the binomial distribution is called the Binomial Test. The Binomial Test calculates the likelihood of observing a particular pronoun (e.g., he) being used to refer to professionals in leadership roles, under the assumption of a given expected proportion, and assesses whether the observed frequency differs significantly from the expected one.
This approach is appropriate if you are working with corpus-based frequency data that involve binary categories (yes or no, male or female, VO or OV, British or US-American, …) and if your hypotheses specify a probability of success that does not change from one trial to the next.
The binomial distribution can be used to calculate the probability of observing up to and including \(x\) successes in \(n\) trials, given a fixed probability of success \(P\). This calculation uses the Cumulative Mass Function (CMF), more commonly called the cumulative distribution function (CDF), which is a running total of probabilities. The CMF is a valuable tool for calculating probabilities for ranges, such as “what is the probability of 0 to 3 successes?”, as opposed to the probability of a specific number of successes (see Equation 19.1).1
1 For more information on the probability of a specific number of successes, please refer to the Probability Mass Function (PMF).
\[ F(x; n, P) = \sum_{i=0}^{x} \binom{n}{i} P^i (1-P)^{n-i} \tag{19.1}\]
Imagine studying gender bias in pronoun use. If \(P = 0.5\) (i.e., the probability of obtaining the masculine pronoun), and you observe up to 3 uses of he out of 10 trials, the CMF calculates the probability of getting 0, 1, 2, or 3 successes (usage of he) combined.
The CMF is especially useful when testing hypotheses about ranges of outcomes.
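As a minimal sketch of this running total, the cumulative probability for the pronoun example (at most 3 uses of he in 10 trials, \(P = 0.5\)) can be obtained either by summing the individual probabilities or directly with pbinom(). The variable name single_probs is illustrative only.

```r
# Probabilities of exactly 0, 1, 2 and 3 uses of "he" in 10 trials (P = 0.5)
single_probs <- dbinom(x = 0:3, size = 10, prob = 0.5)

# Running total of these probabilities
cumsum(single_probs)

# The last value of the running total equals the cumulative probability
sum(single_probs)
pbinom(q = 3, size = 10, prob = 0.5)
# [1] 0.171875
```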
Steps for a Binomial Test Using the CMF:
1. Formulating the hypotheses:
Null hypothesis (\(H_0\)): The observed data match the expected probability. In this example: the pronouns he and she referring to professionals in leadership roles each occur with the expected proportion of 50%, indicating that both pronouns are used equally often.
Alternative hypothesis (\(H_1\)): The observed data deviate from the expected probability. In this example: the observed frequencies of he and she referring to professionals in leadership roles deviate from the expected proportion of 50% each, indicating a gender bias.
2. Calculating the cumulative probability:
pbinom(q = 3, size = 10, prob = 0.5)
3. Interpreting the results:
A p-value smaller than 0.05 indicates a significant deviation from the expected probability at the conventional 5% level.
The test thus helps to determine whether the observed data allow \(H_0\) to be rejected.
The confidence interval (CI) provides the range of likely values for the true success probability.
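The interpretation step can be sketched with binom.test(), which returns the p-value and the confidence interval in one object. The data here (3 uses of he in 10 pronoun tokens) and the directional alternative "less" are assumptions made for illustration, chosen so that the p-value corresponds to the cumulative probability of 0 to 3 successes.

```r
# Hypothetical data: 3 uses of "he" in 10 pronoun tokens, expected proportion 0.5
result <- binom.test(x = 3, n = 10, p = 0.5, alternative = "less")

result$p.value    # equals pbinom(3, 10, 0.5), i.e. 0.171875
result$conf.int   # confidence interval for the true probability of "he"

# Decision at the conventional 5% level: TRUE would mean rejecting H0
result$p.value < 0.05
# [1] FALSE
```

With these illustrative numbers the p-value is well above 0.05, so \(H_0\) would not be rejected.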
The Probability Mass Function (PMF) is used to calculate the probability of observing exactly \(x\) successes in \(n\) trials, given a fixed probability of success \(P\).
\[ f(x; n, P) = \binom{n}{x} P^x (1-P)^{n-x} \]
To calculate the binomial PMF in R, the following code can be used:
dbinom(x = 3, size = 10, prob = 0.5)
[1] 0.1171875
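To connect this result to the PMF formula, the same value can be reproduced by hand. This is only a sketch: the variable names n, x and P mirror the symbols in the formula and are not part of any R function.

```r
n <- 10   # number of trials
x <- 3    # number of successes
P <- 0.5  # probability of success

# Binomial coefficient times P^x times (1 - P)^(n - x)
choose(n, x) * P^x * (1 - P)^(n - x)
# [1] 0.1171875
```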
The choice between one-sided and two-sided tests depends on the research question.
One-Sided Test: Used when only deviations in one direction are of interest (e.g. a proportion that is greater than expected).
Two-Sided Test: Used when deviations in either direction are of interest.
Two-Sided Binomial Test
binom.test(x = 15, n = 20, p = 0.5, alternative = "two.sided")
Exact binomial test
data: 15 and 20
number of successes = 15, number of trials = 20, p-value = 0.04139
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5089541 0.9134285
sample estimates:
probability of success
0.75
One-Sided Binomial Test
binom.test(x = 15, n = 20, p = 0.5, alternative = "greater")
Exact binomial test
data: 15 and 20
number of successes = 15, number of trials = 20, p-value = 0.02069
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
0.5444176 1.0000000
sample estimates:
probability of success
0.75
The function binom.test() in R bases its p-values on cumulative (tail) probabilities rather than on the probability of a single outcome.
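As a rough check, the p-values of the two tests above (15 successes in 20 trials) can be rebuilt from cumulative probabilities with pbinom(). The shortcut used for the two-sided case relies on the distribution being symmetric when the expected probability is 0.5; it is a sketch for this example, not a general description of how binom.test() works internally.

```r
# One-sided ("greater"): probability of 15 or more successes in 20 trials
p_greater <- pbinom(q = 14, size = 20, prob = 0.5, lower.tail = FALSE)
p_greater

# Two-sided: with prob = 0.5 the distribution is symmetric, so the lower tail
# (5 or fewer successes) is exactly as large as the upper tail
p_two_sided <- pbinom(q = 5, size = 20, prob = 0.5) + p_greater
p_two_sided
```

Rounded to five decimal places, the two values correspond to the p-values reported above (0.02069 and 0.04139).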
For the comparison of proportions between two groups, the prop.test() function can be used.
Example: In an experiment, participants have to decide whether a word they are shown is an actual lemma of the English language or not.
Group A achieved 24 correct answers out of 25 trials.
Group B obtained 19 correct answers out of 24 trials.
To see whether the proportions are significantly different, the following code can be used:
prop.test(c(24,19), c(25,24))
2-sample test for equality of proportions with continuity correction
data: c(24, 19) out of c(25, 24)
X-squared = 1.8525, df = 1, p-value = 0.1735
alternative hypothesis: two.sided
95 percent confidence interval:
-0.05222032 0.38888699
sample estimates:
prop 1 prop 2
0.9600000 0.7916667
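To make the output easier to read, the sample proportions can be computed by hand and compared with the values stored in the test object. This is a sketch using the counts passed to prop.test() above; the variable names correct, trials and result are illustrative.

```r
correct <- c(24, 19)   # correct answers in Group A and Group B
trials  <- c(25, 24)   # number of trials as passed to prop.test() above

# Sample proportions computed by hand
correct / trials

# The same estimates and the p-value extracted from the test object
# (R may warn that the chi-squared approximation is imprecise for small counts)
result <- prop.test(correct, trials)
result$estimate
result$p.value
```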
For analysing event counts over time, such as pauses in speech, the Poisson distribution with the poisson.test() function can be useful.
Example: In an experiment, speakers are asked to talk for a predetermined time, and the researcher investigates how often pauses occur in their speech.
Participant A pauses 18 times in 5 minutes.
The researcher expects 2 pauses per minute.
To see whether the observed pause rate differs from the expected pause rate, the following test can be used:
poisson.test(18, T=5, r=2)
Exact Poisson test
data: 18 time base: 5
number of events = 18, time base = 5, p-value = 0.01705
alternative hypothesis: true event rate is not equal to 2
95 percent confidence interval:
2.133588 5.689552
sample estimates:
event rate
3.6
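The quantities behind this output can be sketched step by step. The variable names are illustrative, and the tail probability computed at the end is a one-sided quantity, so it is smaller than the two-sided p-value reported above.

```r
events        <- 18  # observed pauses
minutes       <- 5   # observation time in minutes
expected_rate <- 2   # expected pauses per minute

events / minutes          # observed rate: 3.6 pauses per minute
expected_rate * minutes   # expected number of pauses in 5 minutes: 10

# Probability of 18 or more pauses if the expected rate were true
upper_tail <- ppois(q = events - 1, lambda = expected_rate * minutes,
                    lower.tail = FALSE)
upper_tail
```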