Statistics for Corpus Linguists
4.2 Probability theory

Author: Vladimir Buskin
Affiliation: Catholic University of Eichstätt-Ingolstadt

Abstract
This section covers some essential concepts from probability theory.
Script

You can find the full R script associated with this unit here.

Recommended reading

Most accessible:

Baguley (2012): Chapter 2

Technical:

Heumann, Schomaker, and Shalabh (2022a): Chapter 7

Agresti and Kateri (2022): Chapter 2

Proof-based:

Casella and Berger (2002): Chapters 1 & 2

Preparation

We will draw on the genitive data published by Grafmiller (2014), who investigates the influence of a series of phonetic, semantic, contextual, and psycholinguistic factors on the choice between the synthetic s-genitive and the periphrastic of-genitive. Some representative examples are given below (Grafmiller 2014: 471):

  1. and ran the Grizzlies’ winning streak to four straight. (Brown Corpus, A13)
  2. He was the sidekick of Gene Autry I believe (Switchboard Corpus, 2131)
library(tidyverse)
library(readxl)

genitive <- read_xlsx("Grafmiller_genitive_alternation.xlsx")
# Overview
glimpse(genitive)

What is probability?

The OED provides several non-technical definitions of the term probability, which include

  • ‘the extent to which something is likely to happen or be the case’1 and

  • a ‘thing judged likely to be true, to exist, or to happen’2,

1 See https://doi.org/10.1093/OED/1639707847.

2 See https://doi.org/10.1093/OED/3638534852.

While these dictionary definitions capture the main intuition behind probability in everyday language, they leave important questions unanswered. How can we quantify probability? How can we describe the full range of possible outcomes — and their respective probabilities — in a systematic way? To address these concerns, we need to refine our current view of probability.

Relative frequency

Agresti & Kateri (2022: 29) adduce a frequency-based interpretation of probability:

“For an observation of a random phenomenon, the probability of a particular outcome is the proportion of times that outcome would occur in an indefinitely long sequence of like observations, under the same conditions.”

In other words, the probability of an event3 \(A\), denoted \(P(A)\), is equivalent to the long-term relative frequency of \(A\) as the number of observations \(n\) increases. The relative frequency \(f(A)\) is obtained by dividing the frequency of occurrence \(n(A)\) by the sample size \(n\), i.e.,

3 An event is understood as any subset of the sample space \(S\), comprising one or more outcomes.

\[ f(A) = \frac{n(A)}{n}. \tag{1}\]

Heumann et al. (2022b: 118) explain that \(f(A)\) converges to the probability \(P(A)\) as \(n\) approaches infinity:

\[ P(A) = \lim_{n\to\infty} \frac{n(A)}{n}. \tag{2}\]
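The convergence in Equation 2 can be made tangible with a short simulation. The following is only a minimal sketch: the event \(A\) and its true probability of 0.4 are hypothetical values chosen for illustration, not quantities from the genitive data.

```r
# Simulate binary observations of a hypothetical event A with P(A) = 0.4
set.seed(42)
p_true <- 0.4                      # assumed true probability (illustrative)
n_max  <- 10000                    # total number of observations
outcomes <- rbinom(n_max, size = 1, prob = p_true)  # 1 = A occurred, 0 = A did not occur

# Running relative frequency f(A) = n(A) / n after each observation
running_freq <- cumsum(outcomes) / seq_len(n_max)

# Relative frequency after 10, 100, 1,000 and 10,000 observations
running_freq[c(10, 100, 1000, 10000)]
```

With few observations the relative frequency can be far from 0.4; as \(n\) grows, it should settle ever closer to the assumed probability.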

Example 1 (Relative frequency) In the genitive data, we can use relative frequencies to estimate the probability that possession is expressed by the of-genitive or the s-genitive, respectively.

# Probability of genitive type
genitive %>% 
  count(Type) %>% 
  mutate(rel_freq = n/sum(n))
# A tibble: 2 × 3
  Type      n rel_freq
  <chr> <int>    <dbl>
1 of     3103    0.609
2 s      1995    0.391
Kolmogorov’s axioms

On the most abstract level, probabilities are defined as functions that associate events, i.e., subsets of the sample space \(S = \{s_1, s_2, ..., s_n\}\), with values in the interval \([0, 1]\), subject to certain conditions.

The Russian mathematician Andrei Kolmogorov (1903–1987) proposed three axioms that a probability function \(P\) must satisfy:

  1. Every event \(A\) in the sample space \(S\) has a non-negative probability:

\[ P(A) \geq 0. \tag{3}\]

  2. The probability of the sample space \(S\) is 1:

\[ P(S) = 1. \tag{4}\]

  3. For two disjoint (i.e., mutually exclusive) events \(A\) and \(B\),

\[ P(A \cup B) = P(A) + P(B). \tag{5}\]

Note that this axiom can be generalised to infinite series of pairwise disjoint events \(A_i\):

\[ P\left(\bigcup\limits_{i=1}^{\infty} A_i\right) = \sum\limits_{i=1}^{\infty} P(A_i). \]
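To see the axioms at work, they can be checked numerically for a simple case. The sketch below assumes a fair six-sided die; the helper `prob()` and the events `A` and `B` are hypothetical names introduced purely for illustration.

```r
# pmf of a fair six-sided die: every outcome in S has probability 1/6
S <- 1:6
p <- setNames(rep(1/6, 6), S)

# Hypothetical helper: the probability of an event is the sum over its outcomes
prob <- function(event) sum(p[as.character(event)])

A <- c(1, 2)   # event "roll a 1 or a 2"
B <- c(5, 6)   # event "roll a 5 or a 6" (disjoint from A)

all(p >= 0)                                              # Axiom 1: non-negativity
isTRUE(all.equal(prob(S), 1))                            # Axiom 2: P(S) = 1
isTRUE(all.equal(prob(union(A, B)), prob(A) + prob(B)))  # Axiom 3: additivity
```

All three checks should return `TRUE`. The comparison uses `all.equal()` rather than `==` because sums of fractions such as 1/6 are subject to floating-point rounding.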

Conditional probability

In many linguistic contexts, we’re interested in the probability of one event occurring given that another event has already occurred. This concept is captured by conditional probability, which measures the probability of event \(A\) happening when we know that event \(B\) has taken place. It is a way of capturing prior knowledge.

The conditional probability of \(A\) given \(B\) is denoted \(P(A \mid B)\) and is defined as:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \tag{6}\]

provided that \(P(B) > 0\). This formula tells us that the conditional probability is the ratio of the probability that both events occur to the probability that the conditioning event occurs.

One of the most important results in probability theory is Bayes’ theorem, which allows us to “reverse” conditional probabilities:

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}. \tag{7}\]

This theorem is fundamental to Bayesian statistics.
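Equations 6 and 7 can be traced step by step on a small contingency table. The counts below are invented for illustration (they are not from the genitive data), and the object names are assumptions of this sketch.

```r
# Hypothetical 2 x 2 table of counts (invented for illustration)
counts <- matrix(c(30, 70,    # row A:     B, not B
                   20, 80),   # row not A: B, not B
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("A", "not_A"), c("B", "not_B")))
n <- sum(counts)

p_A       <- sum(counts["A", ]) / n   # P(A)       = 100 / 200
p_B       <- sum(counts[, "B"]) / n   # P(B)       =  50 / 200
p_A_and_B <- counts["A", "B"] / n     # P(A and B) =  30 / 200

# Conditional probabilities following Equation 6
p_A_given_B <- p_A_and_B / p_B
p_B_given_A <- p_A_and_B / p_A

# Bayes' theorem (Equation 7) recovers P(A | B) from P(B | A)
bayes <- p_B_given_A * p_A / p_B
c(direct = p_A_given_B, bayes = bayes)
```

Both routes should give the same value, which shows that Bayes' theorem is simply a rearrangement of the definition of conditional probability.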

Example 2 (Conditional probability) Using the genitive data, we will compute the probabilities:

  • \(P(\text{Genitive = s})\): The probability that an NP takes the ‘s’ genitive.
# P(s)
genitive %>%
  count(Type) %>%
  mutate(prob = n / sum(n)) %>%
  filter(Type == "s")
# A tibble: 1 × 3
  Type      n  prob
  <chr> <int> <dbl>
1 s      1995 0.391
  • \(P(\text{Genitive = s} \mid \text{Possessor Animacy = inanimate})\): The probability that an NP takes the ‘s’ genitive, given that the possessor is inanimate.
# P(s | inanimate)  
genitive %>%
  filter(Possessor_Animacy2 == "inanimate") %>%
  count(Type) %>%
  mutate(prob = n / sum(n)) %>% 
  filter(Type == "s")
# A tibble: 1 × 3
  Type      n  prob
  <chr> <int> <dbl>
1 s       544 0.178
  • \(P(\text{Genitive = s} \mid \text{Possessor Animacy = inanimate}, \text{Genre = Press})\): The probability that an NP takes the ‘s’ genitive, given that the possessor is inanimate and that the NP occurs in the text category ‘Press’.
# P(s | inanimate, press)  
genitive %>%
  filter(Possessor_Animacy2 == "inanimate", Genre == "Press") %>%
  count(Type) %>%
  mutate(prob = n / sum(n)) %>% 
  filter(Type == "s")
# A tibble: 1 × 3
  Type      n  prob
  <chr> <int> <dbl>
1 s       288 0.448
  • What is the probability that the possessor is inanimate, given that the genitive is ‘s’, i.e. \(P(\text{inanimate} \mid \text{s})\)? Using Bayes’ theorem, we compute

\[ P(\text{inanimate} \mid \text{s}) = \frac{P(\text{s} \mid \text{inanimate}) \cdot P(\text{inanimate})}{P(\text{s})}. \]

# Total number of observations
N_total <- nrow(genitive)

# Count of animacy = inanimate
n_inanimate <- genitive %>%
  filter(Possessor_Animacy2 == "inanimate") %>%
  count() %>%
  pull(n)

# Count of type = s
n_sgen <- genitive %>%
  filter(Type == "s") %>%
  count() %>%
  pull(n)

# Count of type = s AND animacy = inanimate
n_sgen_inanim <- genitive %>%
  filter(Possessor_Animacy2 == "inanimate", Type == "s") %>%
  count() %>%
  pull(n)

# Now compute the three components of Bayes' theorem:
# P(s | inanimate) = n_sgen_inanim / n_inanimate
# P(inanimate)     = n_inanimate / N_total
# P(s)             = n_sgen / N_total

p_inanim_sgen <- (n_sgen_inanim / n_inanimate) * (n_inanimate / N_total) / (n_sgen / N_total)

p_inanim_sgen
[1] 0.2726817

Probability distributions

Recall the concept of random variables introduced in 4.1 Data, variables, samples. They describe random processes, which means that the outcomes of the experiment are not pre-determined in any way; there is always some degree of uncertainty involved. Each outcome occurs with a certain probability.

When we associate each outcome of a random variable with a probability, we obtain its probability distribution. A function that explicitly maps the outcomes of a discrete random variable \(X\) onto their probabilities is called a probability mass function (\(pmf\)). In the continuous case, we speak of a probability density function (\(pdf\)).

Example 3 (Pmf) We can establish the probability distribution of genitive types as follows:

# Count tokens and compute relative frequencies
gen_type <- genitive %>% 
  count(Type) %>% 
  mutate(rel_freq = n/sum(n)) 

# Ensure that probabilities sum up to 1
sum(gen_type$rel_freq)
[1] 1
# Plot the PMF
gen_type %>% 
  ggplot(aes(x = Type, y = rel_freq)) +
  geom_col() +
  labs(title = "Probability mass function of genitive types")

Discrete distributions

Uniform distribution

If all discrete outcomes have the same probability, their distribution is called uniform. A well-known example is the toss of a coin: If the coin is fair, the outcomes \(\{\text{Heads}\}\) and \(\{\text{Tails}\}\) are equally likely. More generally, if a discrete random variable \(X\) has \(k\) different outcomes, then any outcome \(x_i\) has the probability

\[ P(X = x_i) = \frac{1}{k}. \tag{8}\]

If we toss a fair coin once, the sample space is \(S = \{\text{Heads}, \text{Tails}\}\), and we have \(P(\text{Heads}) = P(\text{Tails}) = \frac{1}{2}\).

Note that all probabilities have to add up to 1, i.e.,

\[ \sum_{x} P(x) = 1. \tag{9}\]

Example 4 (Uniform distribution) A single throw of a die has the sample space \(S = \{1, 2, 3, 4, 5, 6\}\). The long-term relative frequency of each outcome, and thus its probability, should be \(1/6 \approx 0.17\).

# Load libraries
library(tidyverse)

# First, set a seed for reproducibility; this ensures that we always generate the 'same' random numbers
set.seed(123)

# Simulate 10,000 rolls of a fair six-sided die
die_rolls <- sample(1:6, size = 10000, replace = TRUE)

# Plot the probability mass function
ggplot(data.frame(die_rolls), aes(x = factor(die_rolls))) +
  geom_bar(aes(y = after_stat(count / sum(count))), fill = "steelblue") +  # after_stat() replaces the deprecated ..count.. notation
  geom_hline(yintercept = 1/6, col = "black") +
  labs(title = "Rolling a die 10000 times (k = 6, p = 1/6)",
       y = "Relative frequency",
       x = "Outcome") +
  scale_y_continuous(limits = c(0, 0.25)) +
  theme_minimal()

Binomial distribution

Many corpus-linguistic studies are concerned with discrete random variables \(X\) that have exactly two outcomes, such as the dative alternation (give somebody something vs. give something to somebody), the particle placement alternation (pick something up vs. pick up something), or subject and object realisation (I’ve eaten something vs. I’ve eaten Ø).

Assume we make \(n\) independent binary observations (also known as Bernoulli trials) of \(X\). If one of the two outcomes (often called ‘success’) has a fixed probability \(\pi\) and the other one (‘failure’) the probability \(1 - \pi\), the number of successes in the \(n\) trials follows a binomial distribution.4 We can also use the shorthand notation

4 The natural extension of the binomial distribution to \(k\) different outcomes is called the multinomial distribution with index \(n\) and probabilities \(\pi_1, \pi_2, \dots, \pi_k\).

\[ X \sim Binom(n, \pi). \]

The elements inside the parentheses are the parameters of the distribution, determining the outcomes of \(X\): the number \(n\) of independent observations and the probability \(\pi\) of ‘success’. As such, they affect the shape of the probability mass (or density) function.

Example 5 (Binomial distribution) Suppose we toss a fair coin \(30\) times and count how often we obtain ‘heads’. On average, we’d expect to see this outcome \(n \cdot \pi = 30 \cdot 0.5 = 15\) times.

# Set seed for reproducibility
set.seed(123)

# Set parameters for binomial distribution
n <- 30          # Number of trials
p_heads <- 0.5   # Probability of HEADS (not called pi, which would mask R's built-in constant)

# Set up the pmf
successes <- 0:n
prob_mass <- dbinom(successes, size = n, prob = p_heads)

# Create a data frame for plotting
binom_data <- data.frame(successes = successes, probability = prob_mass)

# Plot the pmf as vertical spikes
ggplot(binom_data, aes(x = successes, y = probability)) +
  geom_segment(aes(xend = successes, yend = 0), linewidth = 1.2) +
  labs(title = "PMF of n = 30 coin tosses with P({Heads}) = 0.5",
       x = "Occurrences of {Heads}", y = "Probability") +
  theme_minimal()
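
The values returned by `dbinom()` can also be reproduced by hand from the standard binomial pmf, \(\binom{n}{x}\pi^x(1-\pi)^{n-x}\). The following sketch, which uses only base R and the same illustrative parameters as the coin example, checks this and confirms that the expected value is \(n \cdot \pi\).

```r
n <- 30          # number of tosses
p_heads <- 0.5   # probability of heads
x <- 0:n         # possible numbers of heads

# pmf computed directly from the formula choose(n, x) * p^x * (1 - p)^(n - x)
pmf_manual <- choose(n, x) * p_heads^x * (1 - p_heads)^(n - x)

# The manual computation agrees with R's built-in dbinom()
all.equal(pmf_manual, dbinom(x, size = n, prob = p_heads))

# The probabilities sum to 1 ...
sum(pmf_manual)

# ... and the expected value sum(x * P(X = x)) equals n * p
sum(x * pmf_manual)
```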

Poisson distribution

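The Poisson distribution is particularly suitable for frequency data, which is ubiquitous in corpus linguistics. A Poisson-distributed random variable is fully characterised by a single rate parameter \(\lambda\), the expected number of events in a fixed unit of observation (e.g., per 1,000 words):

\[ X \sim Pois(\lambda). \]

Assume a word occurs 3 times per 1,000 words on average. We would define the rate parameter as \(\lambda = 3\), which is the expected number of occurrences. Note that for an integer rate such as this one, the outcomes \(\lambda - 1 = 2\) and \(\lambda = 3\) are equally probable and jointly the likeliest.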

# Define the occurrence rate
lambda <- 3

# Define the range of x values
x_range <- 0:10

# Compute PMF values using dpois()
poisson_pmf <- data.frame(
  x = x_range,
  probability = dpois(x_range, lambda)
)

# Create the PMF plot using ggplot
ggplot(poisson_pmf, aes(x = x, y = probability)) +
  geom_segment(aes(xend = x, yend = 0), linewidth = 1.2) + 
  labs(title = "Poisson PMF with λ = 3",
       x = "Number of events",
       y = "Probability") +
  scale_x_continuous(breaks = x_range) +
  theme_minimal()
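
As a cross-check, the Poisson probabilities can be computed from the pmf \(\lambda^{x} e^{-\lambda} / x!\) and compared with `dpois()`. This is a minimal sketch in base R; the cut-off of 50 for the numerical sum is an assumption that is harmless here because the tail probabilities beyond it are negligible for \(\lambda = 3\).

```r
lambda <- 3
x <- 0:10

# pmf computed by hand: lambda^x * exp(-lambda) / x!
pmf_manual <- lambda^x * exp(-lambda) / factorial(x)
all.equal(pmf_manual, dpois(x, lambda))

# Expected value, approximated by summing over 0, ..., 50
x_wide <- 0:50
sum(x_wide * dpois(x_wide, lambda))

# Compare the probabilities of observing exactly 2 and exactly 3 events
c(two = dpois(2, lambda), three = dpois(3, lambda))
```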

Continuous distributions

Normal distribution

A great number of numerical variables in the world follow the well-known normal (or Gaussian) distribution, including test scores, weight and height, among many others. The plot below illustrates its characteristic bell shape: Most observations are in the middle, with considerably fewer near the fringes. For example, most people are rather “average” in height; there are only a few people who are extremely short or extremely tall.

The normal distribution is described in terms of two parameters: The population mean \(\mu\) and the population variance \(\sigma^2\). If a random variable \(X\) is normally distributed, we typically use the notation in Equation 10.

\[ X \sim \mathcal{N}(\mu, \sigma^2). \tag{10}\]

The \(\mu\) parameter corresponds to the expected value \(E(X)\), which is a typical (or average) value of a distribution.

The spread of data points around the expectation is the population variance \(\sigma^2\), i.e., the expected squared deviation from \(E(X)\):

\[ Var(X) = E\left[(X-E(X))^2\right]. \]

The population standard deviation \(\sigma\) is defined as \(\sqrt{Var(X)}\); it expresses the typical distance from the expectation in the original units of \(X\).
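
For a continuous variable, the expectations in these definitions are integrals weighted by the density. The sketch below evaluates them numerically with `integrate()` for a normal distribution with hypothetical parameters \(\mu = 2\) and \(\sigma = 3\) (arbitrary values chosen for illustration). Integrating over \(\mu \pm 10\sigma\) instead of the whole real line is a simplifying assumption; the probability mass outside that range is negligible.

```r
mu    <- 2   # hypothetical mean
sigma <- 3   # hypothetical standard deviation

# Density of N(mu, sigma^2)
f <- function(x) dnorm(x, mean = mu, sd = sigma)

lower <- mu - 10 * sigma
upper <- mu + 10 * sigma

# E(X): integral of x * f(x)
expectation <- integrate(function(x) x * f(x), lower, upper)$value

# Var(X): integral of (x - E(X))^2 * f(x)
variance <- integrate(function(x) (x - expectation)^2 * f(x), lower, upper)$value

c(expectation = expectation, variance = variance, sd = sqrt(variance))
```

The results should closely match \(\mu = 2\), \(\sigma^2 = 9\) and \(\sigma = 3\).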

Example 6 (Normal distribution) The plot illustrates a standard normal distribution for \(X \sim \mathcal{N}(0, 1)\). The \(y\)-axis indicates the density of population values; note that since the Gaussian distribution is a continuous distribution with infinitely many possible \(x\)-values, the probability of any single value must be 0. We can only obtain probabilities for intervals of values, which correspond to the area under the density curve \(f(x)\) and are given by

\[ P(a \leq X \leq b) = \int_a^b f(x)dx. \tag{11}\]

# Set parameters for normal distribution
mu <- 0     # mean (expectation)
sigma <- 1  # standard deviation (square root of variance)
variance <- sigma^2

# Generate a sequence of x-values in a range of +/- 4 standard deviations from the mean
x_values <- seq(mu - 4*sigma, mu + 4*sigma, length.out = 1000)

# Collect everything in a data frame
norm_data <- data.frame(
  x = x_values,
  density = dnorm(x_values, mean = mu, sd = sigma)
)

# Plot the density curve
ggplot(norm_data, aes(x = x, y = density)) +
  geom_line(linewidth = 1, color = "#0D47A1") +
  labs(
    title = "PDF for N(0, 1)",
    x = "x",
    y = "Probability density") +
  theme_minimal() 

Quick facts about the Gaussian bell curve

For any normal distribution, approximately

  • 68% of all values fall within one standard deviation of the mean,

  • 95% within two, and

  • 99.7% within three.
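
These percentages are interval probabilities in the sense of Equation 11 and can be verified in R. The following sketch uses the cumulative distribution function `pnorm()` for the standard normal distribution and, as a cross-check, integrates the density `dnorm()` numerically.

```r
# P(-k <= X <= k) for k = 1, 2, 3 standard deviations, with X ~ N(0, 1)
k <- 1:3
probs <- pnorm(k) - pnorm(-k)
round(probs, 4)
# [1] 0.6827 0.9545 0.9973

# The same probability for k = 1, obtained by integrating the density
integrate(dnorm, lower = -1, upper = 1)$value
```

Because any normal variable can be standardised, the same proportions hold for arbitrary \(\mu\) and \(\sigma\).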

Lognormal distribution

While many variables follow a normal distribution, others exhibit a characteristic right-skewed pattern where most observations cluster near zero but some extend far into the positive tail. This is particularly common with reaction times or survival data.

The log-normal distribution describes variables whose natural logarithm follows a normal distribution. If \(\ln(X) \sim \mathcal{N}(\mu, \sigma^2)\), then \(X\) follows a log-normal distribution, denoted as:

\[ X \sim LogN(\mu, \sigma^2). \]

The parameters \(\mu\) and \(\sigma^2\) refer to the mean and variance of the underlying normal distribution (i.e., of \(\ln(X)\)), not of \(X\) itself.
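
The difference between the two scales is easy to overlook, so a brief simulation may help. The sketch below assumes illustrative parameters \(\mu = 2\) and \(\sigma = 1\) and relies on two standard properties of the log-normal distribution: its median is \(e^{\mu}\) and its mean is \(e^{\mu + \sigma^2/2}\).

```r
mu_log    <- 2   # mean of ln(X) (illustrative value)
sigma_log <- 1   # standard deviation of ln(X) (illustrative value)

set.seed(1)
x <- rlnorm(100000, meanlog = mu_log, sdlog = sigma_log)

# On the log scale, the sample recovers mu and sigma
c(mean_log = mean(log(x)), sd_log = sd(log(x)))

# On the original scale, the median is exp(mu) ...
c(theory = exp(mu_log), quantile_fn = qlnorm(0.5, mu_log, sigma_log))

# ... whereas the mean is exp(mu + sigma^2 / 2), pulled upwards by the long right tail
c(theory = exp(mu_log + sigma_log^2 / 2), sample = mean(x))
```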

Example 7 (Lognormal distribution) Word frequencies in natural language corpora are often modelled with a log-normal distribution. Consider the frequency distribution of lemmas in a corpus, where most words occur rarely but a few occur very frequently (the pattern described by Zipf’s law).

# Set parameters for log-normal distribution
mu_log <- 2      # Mean of the underlying normal distribution (log scale)
sigma_log <- 1   # Standard deviation of the underlying normal distribution

# Generate x values on the positive real line
x_values <- seq(0.1, 50, length.out = 1000)

# Calculate the probability density
lognorm_data <- data.frame(
  x = x_values,
  density = dlnorm(x_values, meanlog = mu_log, sdlog = sigma_log)
)

# Plot the log-normal distribution
ggplot(lognorm_data, aes(x = x, y = density)) +
  geom_line(linewidth = 1, color = "purple") +
  labs(
    title = "PDF for Log-Normal Distribution (μ = 2, σ = 1)",
    x = "Frequency (occurrences per million words)",
    y = "Probability density"
  ) +
  theme_minimal()

# Show the relationship between normal and log-normal
# Generate sample data
set.seed(123)
normal_sample <- rnorm(1000, mean = mu_log, sd = sigma_log)
lognormal_sample <- exp(normal_sample)

# Create comparison plot
par(mfrow = c(1, 2))
hist(normal_sample, main = "Normal Distribution\n(log scale)", 
     xlab = "ln(X)", col = "#90CAF9", breaks = 30)
hist(lognormal_sample, main = "Log-Normal Distribution\n(original scale)", 
     xlab = "X", col = "#FFCDD2", breaks = 30)

Gamma distribution

The Gamma distribution is a flexible family for continuous, strictly positive random variables. It is parameterised by a shape \(k > 0\) and a rate \(\lambda > 0\) (equivalently, a scale \(\theta = 1/\lambda\)):

\[ X \sim \text{Gamma}(k, \lambda). \]

The chi-squared distribution, which is central to many hypothesis tests (see the Chi-squared test), is a specific case of the Gamma distribution.
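
More precisely, a chi-squared distribution with \(\nu\) degrees of freedom corresponds to a Gamma distribution with shape \(k = \nu/2\) and rate \(\lambda = 1/2\). The short check below is a sketch with an arbitrarily chosen \(\nu = 4\); the object names are assumptions made for illustration.

```r
nu <- 4                        # degrees of freedom (arbitrary illustrative value)
x  <- seq(0.1, 20, by = 0.1)   # grid of positive values

# Chi-squared density vs. Gamma density with shape = nu / 2 and rate = 1 / 2
dens_chisq <- dchisq(x, df = nu)
dens_gamma <- dgamma(x, shape = nu / 2, rate = 1 / 2)

all.equal(dens_chisq, dens_gamma)
```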

Example 8 (Gamma distribution) In linguistics, the Gamma family is well suited to ratio-scaled acceptability data collected via Magnitude Estimation, where participants assign positive real-valued scores to stimuli. Because such scores have a natural zero, are right-skewed, and are bounded below by zero, the normal distribution is a poor fit; the Gamma accommodates this naturally (see Horsch & Buskin 2026 for a recent study).

gamma_data <- tibble(x = seq(0.01, 12, length.out = 500)) %>%
  mutate(density = dgamma(x, shape = 1.25, rate = 1.5))

ggplot(gamma_data, aes(x = x, y = density)) +
  geom_line(linewidth = 1, color = "steelblue2") +
  labs(
    title  = "Gamma PDF for k = 1.25 and λ = 1.5",
    x      = "Acceptability rating",
    y      = "Probability density",
    colour = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Exercises

Solutions

You can find the solutions to the exercises here.

Tier 1

Exercise 1 Make suggestions for probability distributions that could be helpful for modelling the following linguistic variables and explain your choice:

  1. Analytic vs. synthetic comparatives (scarier vs. more scary)
  2. Number of pronunciation errors per minute
  3. Frequencies of modal verbs in American English
  4. Article choice (indefinite vs. definite vs. zero-article)
  5. The time it takes to read a word aloud (in ms)
  6. Grammatical acceptability ratings on a Likert-scale ranging from 1 to 5

Exercise 2 Let genitive Type be a binomially distributed variable, which is defined as

\[ \text{Type} = \begin{cases} \text{s} & \text{with probability } \pi \\ \text{of} & \text{with probability } 1-\pi \end{cases} \]

for \(0 < \pi < 1\).

We can model this variable using the binomial probability mass function, which has the form

\[ f(x; n, \pi) = \binom{n}{x}\pi^x(1-\pi)^{n-x}, \]

where \(n\) is the number of independent trials, \(\pi\) the probability of ‘success’, and \(x\) the observed number of successes.

Baguley (2012: 67) provides the code example below for plotting a discrete \(pmf\).

# Vector containing the numbers of successes
heads <- 0:10 # observing 'heads' 0, ..., 10 times

# Get probability mass for each success (fair coin -> P(Heads) = 0.5)
prob.mass <- dbinom(heads, 10, 0.5) # throwing the coin 10 times

# Plot the PMF
plot(heads, prob.mass, pch = NA, xlab = "Number of heads", ylab = "Probability")

# Create a spikey plot
segments(x0 = heads, y0 = 0, x1 = heads, y1 = prob.mass)

Adjust the code to plot the \(pmf\) for Type where \(\pi = 0.85\) and \(n = 50\), i.e., the Bernoulli trial is repeated 50 times. Provide a brief visual description of the distribution. What does the shape tell you about the likelihood of observing fewer than 40 successes?

Exercise 3 Let \(R\) be a continuous random variable measuring reaction times in lexical decision tasks. We approximate the measurements with a normal distribution parametrised by

\[ R \sim \mathcal{N}(\mu = 500, \sigma = 150), \]

which can be visualised as follows:

curve(dnorm(x, mean = 500, sd = 150), xlim = c(0, 1000))

Describe how the shape of the bell curve changes for \(\sigma = 50\), \(\sigma = 200\) and \(\sigma = 350\). What do these changes mean for the distribution of reaction times?

Tier 2

Exercise 4 If a variable follows a binomial distribution and we know the number of trials \(n\) and the probability of success \(\pi\), we can compute:

  • the probability of observing a specific number of successes with dbinom():
# Example: P(4 heads from 10 tosses of a fair coin)
dbinom(x = 4, size = 10, prob = 0.5)
[1] 0.2050781
  • the cumulative probability of up to \(x\) successes with pbinom():
# Example: P(up to 4 heads from 10 tosses of a fair coin)
pbinom(q = 4, size = 10, prob = 0.5)
[1] 0.3769531
# P(5 or more heads from 10 tosses of a fair coin)
1 - pbinom(q = 4, size = 10, prob = 0.5)
[1] 0.6230469
  • random simulated successes with rbinom():
# Simulates the number of successes in four independent runs of 10 tosses each
rbinom(n = 4, size = 10, prob = 0.5)
[1] 3 3 3 5

Interpret the output of the two code chunks below with reference to the Type variable from Exercise 2.

1 - pbinom(q = 80, size = 100, prob = 0.85)
[1] 0.8934557
rbinom(n = 50, size = 15, prob = 0.15)
 [1] 2 3 2 0 0 1 4 1 2 3 6 1 3 1 2 0 7 2 7 1 1 1 3 2 2 3 2 2 2 1 1 3 3 5 5 3 2 1
[39] 2 3 4 4 3 1 0 2 2 1 5 1

Tier 3

Exercise 5 Let \(D\) be a Poisson-distributed random variable which counts the total number of fillers such as uhm, er, and like per 100 words. It has the \(pmf\)

\[ f(x; \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}. \]

The study by Bortfeld et al. (2001) suggests that \(\lambda = 2\) would be a suitable rate parameter.

  1. Start by plotting the \(pmf\) using dpois() for \(x \in \{0, \dots, 10\}\). (Tip: Check the documentation by typing ?dpois into the console.)

  2. Using ppois(), which returns the cumulative probability \(P(X \leq x)\) for a Poisson-distributed variable, compute the probability of observing more than 5 fillers per 100 words. What does this suggest about the likelihood of highly disfluent speech under this model?

Exercise 6 The empirical study of dispersion has attracted significant attention in recent years (Sönning 2024; Gries 2024). A key challenge is finding dispersion measures that are minimally correlated with a word’s frequency of occurrence. One such measure is Kullback-Leibler divergence (KLD), which comes from information theory and is closely related to entropy.

Mathematically, KLD measures the difference between two probability distributions \(p\) and \(q\):

\[ KLD(p \parallel q) = \sum\limits_{x} p(x) \log \frac{p(x)}{q(x)} \]
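
To get a feel for the general formula before applying it to corpus data, consider a toy example. The sketch below is not part of the exercise data: the function name `kld()` and the two distributions are hypothetical, and the use of base-2 logarithms (giving results in bits) is an assumption of this illustration.

```r
# Hypothetical helper: KL divergence between two discrete distributions p and q
kld <- function(p, q, base = 2) {
  stopifnot(length(p) == length(q), all(q > 0))
  nonzero <- p > 0   # terms with p(x) = 0 contribute 0 by convention
  sum(p[nonzero] * log(p[nonzero] / q[nonzero], base = base))
}

q      <- c(0.25, 0.25, 0.25, 0.25)   # reference distribution (uniform)
p_even <- q                           # identical to the reference
p_skew <- c(0.70, 0.10, 0.10, 0.10)   # concentrated on the first category

kld(p_even, q)   # identical distributions: divergence of 0
kld(p_skew, q)   # the more p departs from q, the larger the divergence
```

Note that KLD is not symmetric: `kld(p, q)` and `kld(q, p)` generally differ, so it matters which distribution is treated as the reference.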

For corpus dispersion (cf. Equation 12), we compare the posterior (actual) distribution of a word across corpus parts \(\frac{v_i}{f}\) with the prior distribution that assumes uniform spread across parts (weighted by part size \(s_i\)):

\[ KLD = \sum\limits_{i=1}^n \frac{v_i}{f} \times \log_2\left({\frac{v_i}{f} \times \frac{1}{s_i}}\right) \tag{12}\]

where:

  • \(f\) = total frequency of the word in the corpus

  • \(v_i\) = frequency of the word in corpus part \(i\)

  • \(s_i\) = size of corpus part \(i\) (as fraction of total corpus)

  • \(n\) = number of corpus parts

First, let’s create simulated corpus data to work with:

library(tidyverse)

# Create simulated corpus data
set.seed(123)

# Simulate a corpus with 5 parts of different sizes
corpus_parts <- c("Fiction", "News", "Academic", "Spoken", "Web")

part_sizes <- c(0.25, 0.20, 0.15, 0.25, 0.15)  # Sizes of the corpus parts

total_tokens <- 1000000 # Corpus size

# Create some example words with different dispersion patterns
words_data <- tibble(
  word = rep(c("the", "however", "DNA", "like", "therefore"), each = 5),
  corpus_part = rep(corpus_parts, 5),
  frequency = c(
    # "the" - highly frequent, evenly distributed
    c(12500, 10000, 7500, 12500, 7500),
    # "however" - academic bias
    c(150, 200, 800, 100, 50),
    # "DNA" - strong academic bias  
    c(10, 5, 200, 2, 8),
    # "like" - spoken bias
    c(200, 150, 50, 600, 300),
    # "therefore" - academic/formal bias
    c(50, 100, 300, 30, 20)
  ),
  part_size = rep(part_sizes, 5)
)

# Show full dataset
print(words_data, n = Inf)
# A tibble: 25 × 4
   word      corpus_part frequency part_size
   <chr>     <chr>           <dbl>     <dbl>
 1 the       Fiction         12500      0.25
 2 the       News            10000      0.2 
 3 the       Academic         7500      0.15
 4 the       Spoken          12500      0.25
 5 the       Web              7500      0.15
 6 however   Fiction           150      0.25
 7 however   News              200      0.2 
 8 however   Academic          800      0.15
 9 however   Spoken            100      0.25
10 however   Web                50      0.15
11 DNA       Fiction            10      0.25
12 DNA       News                5      0.2 
13 DNA       Academic          200      0.15
14 DNA       Spoken              2      0.25
15 DNA       Web                 8      0.15
16 like      Fiction           200      0.25
17 like      News              150      0.2 
18 like      Academic           50      0.15
19 like      Spoken            600      0.25
20 like      Web               300      0.15
21 therefore Fiction            50      0.25
22 therefore News              100      0.2 
23 therefore Academic          300      0.15
24 therefore Spoken             30      0.25
25 therefore Web                20      0.15

Compute the dispersion values of the, however, DNA, like, and therefore based on the simulated corpus data. How does dispersion reflect their (lack of) register bias?

References

Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With R and Python. Boca Raton: CRC Press.
Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Houndmills, Basingstoke: Palgrave Macmillan.
Bortfeld, Heather, Silvia D. Leon, Jonathan E. Bloom, Michael F. Schober, and Susan E. Brennan. 2001. “Disfluency Rates in Conversation: Effects of Age, Relationship, Topic, Role, and Gender.” Language and Speech 44 (2): 123–47. https://doi.org/10.1177/00238309010440020101.
Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Pacific Grove, Calif.: Duxbury/Thomson Learning.
Grafmiller, Jason. 2014. “Variation in English Genitives Across Modality and Genres.” English Language and Linguistics 18 (3): 471–96.
Gries, Stefan Thomas. 2024. Frequency, Dispersion, Association, and Keyness: Revising and Tupleizing Corpus-Linguistic Measures. Amsterdam & Philadelphia: John Benjamins Publishing Company.
Heumann, Christian, Michael Schomaker, and Shalabh. 2022a. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer. https://doi.org/10.1007/978-3-031-11833-3.
Heumann, Christian, Michael Schomaker, and Shalabh. 2022b. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer.
Sönning, Lukas. 2024. “Evaluation of Keyness Metrics: Performance and Reliability.” Corpus Linguistics and Linguistic Theory 20 (2): 263–88. https://doi.org/10.1515/cllt-2022-0116.