library(tidyverse)
library(readxl)
genitive <- read_xlsx("Grafmiller_genitive_alternation.xlsx")
4.2 Probability theory
You can find the full R script associated with this unit here.
Recommended reading
Most accessible:
Baguley (2012): Chapter 2
Technical:
Heumann, Schomaker, and Shalabh (2022a): Chapter 7
Agresti and Kateri (2022): Chapter 2
Proof-based:
Casella and Berger (2002): Chapters 1 & 2
Preparation
We will draw on the genitive data published by Grafmiller (2014), who investigates the influence of a series of phonetic, semantic, contextual, and psycholinguistic factors on the choice between the synthetic s-genitive and the periphrastic of-genitive. Some representative examples are given below (Grafmiller 2014: 471):
- and ran the Grizzlies’ winning streak to four straight. (Brown Corpus, A13)
- He was the sidekick of Gene Autry I believe (Switchboard Corpus, 2131)
# Overview
glimpse(genitive)
What is probability?
The OED provides several non-technical definitions of the term probability, which include
‘the extent to which something is likely to happen or be the case’ and
a ‘thing judged likely to be true, to exist, or to happen’.
While these dictionary definitions capture the main intuition behind probability in everyday language, they leave important questions unanswered. How can we quantify probability? How can we describe the full range of possible outcomes — and their respective probabilities — in a systematic way? To address these concerns, we need to refine our current view of probability.
Relative frequency
Agresti & Kateri (2022: 29) adduce a frequency-based interpretation of probability:
“For an observation of a random phenomenon, the probability of a particular outcome is the proportion of times that outcome would occur in an indefinitely long sequence of like observations, under the same conditions.”
In other words, the probability of an event \(A\), denoted \(P(A)\), is equivalent to the long-term relative frequency of \(A\) as the number of observations \(n\) increases. Here, an event is understood as any subset of the sample space \(S\), comprising one or more outcomes. The relative frequency \(f(A)\) is obtained by dividing the frequency of occurrence \(n(A)\) by the sample size \(n\), i.e.,
\[ f(A) = \frac{n(A)}{n}. \tag{1}\]
Heumann et al. (2022b: 118) explain that \(f(A)\) converges to the probability \(P(A)\) as \(n\) approaches infinity:
\[ P(A) = \lim_{n\to\infty} \frac{n(A)}{n}. \tag{2}\]
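The limit in Equation 2 can be made tangible with a small simulation. The sketch below is only an illustration under stated assumptions: the "true" probability `p_true = 0.4`, the seed, and all variable names are invented for this example and are not taken from the genitive data.

```r
# Simulate a binary outcome whose true probability is known (assumed value)
set.seed(42)
p_true <- 0.4
n_max  <- 100000

# 1 = event A occurred, 0 = it did not
outcomes <- rbinom(n_max, size = 1, prob = p_true)

# Running relative frequency f(A) = n(A) / n after each observation
rel_freq <- cumsum(outcomes) / seq_len(n_max)

# Inspect the relative frequency at increasing sample sizes
checkpoints <- c(10, 100, 1000, 10000, 100000)
data.frame(n = checkpoints, rel_freq = rel_freq[checkpoints])
```

With small \(n\) the relative frequency can stray noticeably from `p_true`; as \(n\) grows, it should settle close to it.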
Example 1 (Relative frequency) In the genitive data, we can use relative frequencies to estimate the probability that possession is indicated using the of or s genitive, respectively.
# Probability of genitive type
genitive %>%
  count(Type) %>%
  mutate(rel_freq = n / sum(n))
# A tibble: 2 × 3
  Type      n rel_freq
  <chr> <int>    <dbl>
1 of     3103    0.609
2 s      1995    0.391
On the most abstract level, probabilities are defined as functions that associate elements from the sample space \(S = \{s_1, s_2, ..., s_n\}\) with values in the interval \([0, 1]\), subject to certain conditions.
The Russian mathematician Andrei Kolmogorov (1903–1987) proposed three axioms that a probability function \(P\) must satisfy:
- Every event \(A \subseteq S\) has a non-negative probability:
\[ P(A) \geq 0. \tag{3}\]
- The probability of the sample space \(S\) is 1:
\[ P(S) = 1. \tag{4}\]
- If two events \(A\) and \(B\) are disjoint (i.e., mutually exclusive), then
\[ P(A \cup B) = P(A) + P(B). \tag{5}\]
Note that this axiom can be generalised to countably infinite sequences of pairwise disjoint events \(A_i\):
\[ P\left(\bigcup\limits_{i=1}^{\infty} A_i\right) = \sum\limits_{i=1}^{\infty} P(A_i). \]
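To see the axioms at work, one can check them on a toy probability function. The following is a minimal sketch, assuming a fair six-sided die; the helper `P()` and the events `A`, `B` and `C` are hypothetical names chosen for illustration.

```r
# Probability mass of each outcome of a fair die (assumed toy example)
pmf <- rep(1 / 6, 6)
names(pmf) <- 1:6

# P() sums the mass of all outcomes contained in an event
P <- function(event) sum(pmf[as.character(event)])

A <- c(1, 2)   # event "1 or 2"
B <- c(5)      # event "5" (disjoint from A)
C <- c(2, 3)   # event "2 or 3" (overlaps with A)

# Axiom 1: no negative probabilities
axiom1 <- all(pmf >= 0)
# Axiom 2: the sample space has probability 1
axiom2 <- isTRUE(all.equal(P(1:6), 1))
# Axiom 3: additivity holds for the disjoint events A and B ...
axiom3 <- isTRUE(all.equal(P(union(A, B)), P(A) + P(B)))
# ... but simple addition fails for the overlapping events A and C
overlap_additive <- isTRUE(all.equal(P(union(A, C)), P(A) + P(C)))

c(axiom1 = axiom1, axiom2 = axiom2, axiom3 = axiom3,
  overlap_additive = overlap_additive)
```

The last check illustrates why Equation 5 requires disjoint events: for overlapping events, adding the two probabilities counts the shared outcome twice.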
Conditional probability
In many linguistic contexts, we’re interested in the probability of one event occurring given that another event has already occurred. This concept is captured by conditional probability, which measures the probability of event \(A\) happening when we know that event \(B\) has taken place. It is a way of capturing prior knowledge.
The conditional probability of \(A\) given \(B\) is denoted \(P(A \mid B)\) and is defined as:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \tag{6}\]
provided that \(P(B) > 0\). This formula tells us that the conditional probability is the ratio of the probability that both events occur to the probability that the conditioning event occurs.
One of the most important results in probability theory is Bayes’ theorem, which allows us to “reverse” conditional probabilities:
\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}. \tag{7}\]
This theorem is fundamental to Bayesian statistics.
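Equation 7 follows from the definition in Equation 6 in two short steps. As a sketch, and assuming \(P(A) > 0\) and \(P(B) > 0\), the definition can be applied once in each direction:

\[ P(A \cap B) = P(A \mid B) \cdot P(B) \quad \text{and} \quad P(A \cap B) = P(B \mid A) \cdot P(A). \]

Setting the two right-hand sides equal and dividing by \(P(B)\) gives

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}. \]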
Example 2 (Conditional probability) Using the genitive data, we will compute the probabilities:
- \(P(\text{Genitive = s})\): The probability that an NP takes the ‘s’ genitive.
# P(s)
genitive %>%
  count(Type) %>%
  mutate(prob = n / sum(n)) %>%
  filter(Type == "s")
# A tibble: 1 × 3
  Type      n  prob
  <chr> <int> <dbl>
1 s      1995 0.391
- \(P(\text{Genitive = s} \mid \text{Possessor Animacy = inanimate})\): The probability that an NP takes the ‘s’ genitive, given that the possessor is inanimate.
# P(s | inanimate)
genitive %>%
  filter(Possessor_Animacy2 == "inanimate") %>%
  count(Type) %>%
  mutate(prob = n / sum(n)) %>%
  filter(Type == "s")
# A tibble: 1 × 3
  Type      n  prob
  <chr> <int> <dbl>
1 s       544 0.178
- \(P(\text{Genitive = s} \mid \text{Possessor Animacy = inanimate}, \text{Genre = Press})\): The probability that an NP takes the ‘s’ genitive, given that the possessor is inanimate and that the NP occurs in the text category ‘Press’.
# P(s | inanimate, press)
genitive %>%
  filter(Possessor_Animacy2 == "inanimate", Genre == "Press") %>%
  count(Type) %>%
  mutate(prob = n / sum(n)) %>%
  filter(Type == "s")
# A tibble: 1 × 3
  Type      n  prob
  <chr> <int> <dbl>
1 s       288 0.448
- What is the probability that the possessor is inanimate, given that the genitive is ‘s’, i.e. \(P(\text{inanimate} \mid \text{s})\)? Applying Bayes’ theorem (Equation 7), we compute
\[ P(\text{inanimate} \mid \text{s}) = \frac{P(\text{s} \mid \text{inanimate}) \cdot P(\text{inanimate})}{P(\text{s})}. \]
# Total number of observations
N_total <- nrow(genitive)
# Count of animacy = inanimate
n_inanimate <- genitive %>%
  filter(Possessor_Animacy2 == "inanimate") %>%
  count() %>%
  pull(n)
# Count of type = s
n_sgen <- genitive %>%
  filter(Type == "s") %>%
  count() %>%
  pull(n)
# Count of type = s AND animacy = inanimate
n_sgen_inanim <- genitive %>%
  filter(Possessor_Animacy2 == "inanimate", Type == "s") %>%
  count() %>%
  pull(n)
# Now compute:
# P(s | inanimate) = n_sgen_inanim / n_inanimate
# P(inanimate) = n_inanimate / N_total
# P(s) = n_sgen / N_total
p_inanim_sgen <- (n_sgen_inanim / n_inanimate) * (n_inanimate / N_total) / (n_sgen / N_total)
p_inanim_sgen
[1] 0.2726817
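In the Bayes expression above, the count of inanimate possessors and the total sample size cancel out, so the result reduces to the share of inanimate possessors among all s-genitives. The sketch below re-enters the counts printed in the earlier outputs (544 and 1995, and 3103 + 1995 observations in total) so that it runs without the data file. The function `bayes_form()` is a hypothetical helper, and its argument `n_inanim` is deliberately left free to show that its value does not matter.

```r
# Counts copied from the outputs shown earlier in this unit
n_sgen_inanim <- 544           # s-genitives with an inanimate possessor
n_sgen        <- 1995          # all s-genitives
N_total       <- 3103 + 1995   # all observations (of + s)

# Bayes form: P(s | inanimate) * P(inanimate) / P(s)
# n_inanim is a placeholder argument; it cancels in the product
bayes_form <- function(n_inanim) {
  (n_sgen_inanim / n_inanim) * (n_inanim / N_total) / (n_sgen / N_total)
}

# Direct form: P(inanimate | s) = n(s and inanimate) / n(s)
direct <- n_sgen_inanim / n_sgen

# Both routes agree, whatever (positive) value n_inanim takes
c(direct = direct, bayes_a = bayes_form(3000), bayes_b = bayes_form(100))
```

The direct form is the definition of conditional probability (Equation 6) expressed in counts; Bayes' theorem becomes useful when only the "reversed" conditional probability is available.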
Probability distributions
Recall the concept of random variables introduced in 4.1 Data types. They describe random processes, which means that the outcomes of the experiment are not pre-determined in any way; there is always some degree of uncertainty involved. Each outcome occurs with a certain probability.
When we associate each outcome of a random variable with a probability, we obtain its probability distribution. A function that assigns a probability to each outcome of a discrete random variable \(X\) is called a probability mass function (\(pmf\)). In the continuous case, we speak of a probability density function (\(pdf\)), which assigns densities rather than probabilities to individual values.
Example 3 (Pmf) We can establish the probability distribution of genitive types as follows:
# Count tokens and compute relative frequencies
gen_type <- genitive %>%
  count(Type) %>%
  mutate(rel_freq = n / sum(n))
# Ensure that probabilities sum up to 1
sum(gen_type$rel_freq)
[1] 1
# Plot the PMF
gen_type %>%
  ggplot(aes(x = Type, y = rel_freq)) +
  geom_col() +
  labs(title = "Probability mass function of genitive types")
Discrete distributions
Uniform distribution
If all discrete outcomes have the same probability, their distribution is called uniform. A well-known example is the toss of a coin: If the coin is fair, the outcomes \(\{\text{Heads}\}\) and \(\{\text{Tails}\}\) are equally likely. More generally, if a discrete random variable \(X\) has \(k\) different outcomes, then any outcome \(x_i\) has the probability
\[ P(X = x_i) = \frac{1}{k}. \tag{8}\]
If we toss a fair coin once, the sample space is \(S = \{\text{Heads}, \text{Tails}\}\), and we have \(P(\text{Heads}) = P(\text{Tails}) = \frac{1}{2}\).
Note that all probabilities have to add up to 1, i.e.,
\[ \sum_{x} P(x) = 1. \tag{9}\]
Example 4 (Uniform distribution) A single throw of a die has the sample space \(S = \{1, 2, 3, 4, 5, 6\}\). The long-term relative frequency of each outcome, and thus its probability, should be \(1/6 \approx 0.17\).
# Load libraries
library(tidyverse)
# First, set a seed for reproducibility; this ensures that we always generate the 'same' random numbers
set.seed(123)
# Simulate 10,000 rolls of a fair six-sided die
die_rolls <- sample(1:6, size = 10000, replace = TRUE)
# Plot the relative frequencies against the theoretical probability of 1/6
# (after_stat() replaces the deprecated ..count.. notation)
ggplot(data.frame(die_rolls), aes(x = factor(die_rolls))) +
  geom_bar(aes(y = after_stat(count / sum(count))), fill = "steelblue") +
  geom_hline(yintercept = 1/6, col = "black") +
  labs(title = "Rolling a die 10,000 times (k = 6, p = 1/6)",
       y = "Relative frequency",
       x = "Outcome") +
  scale_y_continuous(limits = c(0, 0.25)) +
  theme_minimal()
Binomial distribution
Many corpus-linguistic studies are concerned with discrete random variables \(X\) that have exactly two outcomes, such as the dative alternation (give somebody something vs. give something to somebody), the particle placement alternation (pick something up vs. pick up something), or object realisation (I’ve eaten something vs. I’ve eaten Ø).
Assume we make \(n\) independent binary observations (also known as Bernoulli trials) of \(X\). If one of the two outcomes has a fixed probability \(\pi\) (often labelled ‘success’) and the other one the probability \(1 - \pi\) (‘failure’), the number of successes among the \(n\) observations follows a binomial distribution. The natural extension of the binomial distribution to \(k\) different outcomes is called the multinomial distribution with index \(n\) and probabilities \(\pi_1, \pi_2, \dots, \pi_k\). With \(X\) now standing for the number of successes, we can also use the shorthand notation
\[ X \sim Binom(n, \pi). \]
The elements inside the parentheses are the parameters of the distribution, determining the outcomes of \(X\): the number \(n\) of independent observations and the probability \(\pi\) of ‘success’. As such, they affect the shape of the probability mass (or density) function.
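How the two parameters enter the probability mass function can be sketched by computing the binomial pmf "by hand", \(\binom{n}{x}\pi^x(1-\pi)^{n-x}\), and comparing it with R's built-in `dbinom()`. The parameter values and variable names below are assumptions chosen for illustration.

```r
# Assumed parameters: 30 trials, success probability 0.5
n_trials     <- 30
prob_success <- 0.5
x            <- 0:n_trials   # all possible numbers of successes

# Binomial pmf written out explicitly
pmf_manual <- choose(n_trials, x) * prob_success^x *
  (1 - prob_success)^(n_trials - x)

# The same pmf via the built-in density function
pmf_builtin <- dbinom(x, size = n_trials, prob = prob_success)

# Both versions agree, the masses sum to 1, and the mean equals n * pi
all.equal(pmf_manual, pmf_builtin)
sum(pmf_builtin)
sum(x * pmf_builtin)
```

Changing `prob_success` shifts the bulk of the probability mass towards 0 or `n_trials`, while increasing `n_trials` spreads it over more possible counts.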
Example 5 (Binomial distribution) Suppose we throw a fair coin \(30\) times and count how often we obtain ‘heads’. We’d expect to see this outcome \(30 \cdot 0.5 = 15\) times.
# Set seed for reproducibility
set.seed(123)
# Set parameters for binomial distribution
n <- 30 # Number of trials
p_heads <- 0.5 # Probability of HEADS (not called `pi`, which would mask R's built-in constant)
# Set up the pmf
successes <- 0:n
prob_mass <- dbinom(successes, size = n, prob = p_heads)
# Create a data frame for plotting
binom_data <- data.frame(successes = successes, probability = prob_mass)
# Plot the pmf as vertical spikes
ggplot(binom_data, aes(x = successes, y = probability)) +
  geom_segment(aes(xend = successes, yend = 0), linewidth = 1.2) +
  labs(title = "PMF of n = 30 coin tosses with P({Heads}) = 0.5",
       x = "Occurrences of {Heads}", y = "Probability") +
  theme_minimal()
Poisson distribution
The Poisson distribution is particularly suitable for frequency data, which is ubiquitous in corpus linguistics. A Poisson-distributed random variable is fully determined by its rate parameter \(\lambda\), the expected number of events per unit of observation (e.g., per 1,000 words).
\[ X \sim Pois(\lambda). \]
Assume a word occurs on average 3 times per 1,000 words. We would define the rate parameter as \(\lambda = 3\), which is the expected count. Because \(\lambda\) is an integer here, the counts \(\lambda - 1 = 2\) and \(\lambda = 3\) are equally probable and jointly the likeliest outcomes.
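The role of \(\lambda\) as the expected count, and the fact that an integer rate such as \(\lambda = 3\) produces two equally likely modes, can be checked in a few lines. This is a minimal sketch; the variable names and the truncation of the support at 100 are assumptions made for convenience.

```r
# Assumed rate: 3 occurrences per 1,000 words
rate <- 3

# Probability mass for counts 0 to 100 (the mass beyond 100 is negligible)
counts <- 0:100
pmf    <- dpois(counts, lambda = rate)

# Expected value: the probability-weighted average of the counts
expected_count <- sum(counts * pmf)

# Mode(s): counts whose mass equals the maximum (up to rounding error)
modes <- counts[abs(pmf - max(pmf)) < 1e-12]

expected_count
modes
```

The expected count recovers the rate, and the modes come out as the two neighbouring counts 2 and 3.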
# Define the occurrence rate
lambda <- 3
# Define the range of x values
x_range <- 0:10
# Compute PMF values using dpois()
poisson_pmf <- data.frame(
  x = x_range,
  probability = dpois(x_range, lambda)
)
# Create the PMF plot using ggplot
ggplot(poisson_pmf, aes(x = x, y = probability)) +
  geom_segment(aes(xend = x, yend = 0), linewidth = 1.2) +
  labs(title = "Poisson PMF with λ = 3",
       x = "Number of events",
       y = "Probability") +
  scale_x_continuous(breaks = x_range) +
  theme_minimal()
Continuous distributions
Normal distribution
A great number of numerical variables in the world approximately follow the well-known normal (or Gaussian) distribution, including test scores, weight and height, among many others. The plot below illustrates its characteristic bell shape: most observations are in the middle, with considerably fewer near the fringes. For example, most people are rather “average” in height; there are only a few people who are extremely short or extremely tall.
The normal distribution is typically described in terms of two parameters: the population mean \(\mu\) and the population variance \(\sigma^2\). If a random variable \(X\) is normally distributed, we typically use the notation in Equation 10.
\[ X \sim \mathcal{N}(\mu, \sigma^2). \tag{10}\]
The \(\mu\) parameter corresponds to the expected value \(E(X)\), which is a typical (or average) value of a distribution.
The spread of data points around the expectation is captured by the population variance, which corresponds to \(\sigma^2\):
\[ Var(X) = E\left[(X - E(X))^2\right]. \]
The population standard deviation \(\sigma\) measures the typical distance of values from the expectation and is defined as \(\sqrt{Var(X)}\).
Example 6 (Normal distribution) The plot illustrates the standard normal distribution \(X \sim \mathcal{N}(0, 1)\). The \(y\)-axis indicates the density of population values; note that since the Gaussian distribution is a continuous distribution with uncountably many possible \(x\)-values, the probability of any single value is 0. We can only obtain probabilities for intervals of values, which are given by
\[ P(a \leq X \leq b) = \int_a^b f(x)dx. \tag{11}\]
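Equation 11 can be evaluated numerically. The sketch below integrates the standard normal density over an arbitrarily chosen interval and compares the result with the difference of two cumulative probabilities from `pnorm()`. The interval bounds and variable names are assumptions for illustration only.

```r
# Arbitrary interval bounds (assumed for illustration)
a <- -0.5
b <- 1.2

# P(a <= X <= b) as the area under the standard normal density
p_integral <- integrate(dnorm, lower = a, upper = b)$value

# The same probability via the cumulative distribution function
p_cdf <- pnorm(b) - pnorm(a)

# An "interval" of zero width carries no probability
p_point <- pnorm(b) - pnorm(b)

c(integral = p_integral, cdf = p_cdf, point = p_point)
```

In practice, `pnorm()` is the more convenient route; `integrate()` is used here only to make the link to the integral explicit.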
# Set parameters for normal distribution
mu <- 0 # mean (expectation)
sigma <- 1 # standard deviation (square root of variance)
variance <- sigma^2
# Generate a sequence of x-values in a range of +/- 4 standard deviations from the mean
x_values <- seq(mu - 4*sigma, mu + 4*sigma, length.out = 1000)
# Collect everything in a data frame
norm_data <- data.frame(
  x = x_values,
  density = dnorm(x_values, mean = mu, sd = sigma)
)
# Plot the density curve
ggplot(norm_data, aes(x = x, y = density)) +
  geom_line(linewidth = 1, color = "#0D47A1") +
  labs(
    title = "PDF for N(0, 1)",
    x = "x",
    y = "Probability density") +
  theme_minimal()
Quite interestingly,
- about 68% of all values fall within one standard deviation of the mean,
- about 95% within two, and
- about 99.7% within three.
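These three percentages can be reproduced from the cumulative distribution function of the standard normal distribution. A minimal sketch follows (the variable names are illustrative):

```r
# Number of standard deviations around the mean
k <- 1:3

# P(-k <= Z <= k) for a standard normal variable Z
coverage <- pnorm(k) - pnorm(-k)

round(coverage, 4)
# [1] 0.6827 0.9545 0.9973
```

Because any normal variable can be standardised, the same proportions hold for every choice of \(\mu\) and \(\sigma\).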
Lognormal distribution
While many variables follow a normal distribution, others exhibit a characteristic right-skewed pattern where most observations cluster near zero but some extend far into the positive tail. This is particularly common with reaction times or survival data.
The log-normal distribution describes variables whose natural logarithm follows a normal distribution. If \(\ln(X) \sim \mathcal{N}(\mu, \sigma^2)\), then \(X\) follows a log-normal distribution, denoted as:
\[ X \sim LogN(\mu, \sigma^2). \]
The parameters \(\mu\) and \(\sigma^2\) refer to the mean and variance of the underlying normal distribution (i.e., of \(\ln(X)\)), not of \(X\) itself.
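That \(\mu\) is not the mean of \(X\) itself can be checked directly: for a log-normal variable, \(e^{\mu}\) is the median, whereas the mean is \(e^{\mu + \sigma^2/2}\). The sketch below uses assumed parameter values (\(\mu = 2\), \(\sigma = 1\)) and a simulated sample; all names are illustrative.

```r
# Assumed parameters on the log scale
mu_log    <- 2
sigma_log <- 1

# Median of X: the 50% quantile, which equals exp(mu)
median_x <- qlnorm(0.5, meanlog = mu_log, sdlog = sigma_log)

# Theoretical mean of X: exp(mu + sigma^2 / 2)
mean_theory <- exp(mu_log + sigma_log^2 / 2)

# Simulated check of the mean
set.seed(1)
x_sim    <- rlnorm(100000, meanlog = mu_log, sdlog = sigma_log)
mean_sim <- mean(x_sim)

# Link between the two distributions: P(X <= 5) = P(ln X <= ln 5)
cdf_match <- all.equal(
  plnorm(5, meanlog = mu_log, sdlog = sigma_log),
  pnorm(log(5), mean = mu_log, sd = sigma_log)
)

c(median = median_x, exp_mu = exp(mu_log),
  mean_theory = mean_theory, mean_sim = mean_sim)
```

The gap between median and mean reflects the long right tail: a few very large values pull the mean upwards.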
Example 7 (Lognormal distribution) Word frequencies in natural language corpora are heavily right-skewed and are sometimes approximated by a log-normal distribution. Consider the frequency distribution of lemmas in a corpus, where most words occur rarely but a few occur very frequently (a pattern usually described in terms of Zipf’s law).
# Set parameters for log-normal distribution
mu_log <- 2 # Mean of the underlying normal distribution (log scale)
sigma_log <- 1 # Standard deviation of the underlying normal distribution
# Generate x values on the positive real line
x_values <- seq(0.1, 50, length.out = 1000)
# Calculate the probability density
lognorm_data <- data.frame(
  x = x_values,
  density = dlnorm(x_values, meanlog = mu_log, sdlog = sigma_log)
)
# Plot the log-normal distribution
ggplot(lognorm_data, aes(x = x, y = density)) +
  geom_line(linewidth = 1, color = "purple") +
  labs(
    title = "PDF for Log-Normal Distribution (μ = 2, σ = 1)",
    x = "Frequency (occurrences per million words)",
    y = "Probability density"
  ) +
  theme_minimal()
# Show the relationship between normal and log-normal
# Generate sample data
set.seed(123)
normal_sample <- rnorm(1000, mean = mu_log, sd = sigma_log)
lognormal_sample <- exp(normal_sample)
# Create comparison plot
par(mfrow = c(1, 2))
hist(normal_sample, main = "Normal Distribution\n(log scale)",
     xlab = "ln(X)", col = "#90CAF9", breaks = 30)
hist(lognormal_sample, main = "Log-Normal Distribution\n(original scale)",
     xlab = "X", col = "#FFCDD2", breaks = 30)
Gamma distribution
The Gamma distribution is a flexible family for continuous, strictly positive random variables. It is parameterised by a shape \(k > 0\) and a rate \(\lambda > 0\) (equivalently, a scale \(\theta = 1/\lambda\)):
\[ X \sim \text{Gamma}(k, \lambda). \]
The chi-squared distribution, which is central to many hypothesis tests (see the Chi-squared test), is a specific case of the Gamma distribution.
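More precisely, a chi-squared distribution with \(\nu\) degrees of freedom is a Gamma distribution with shape \(k = \nu/2\) and rate \(\lambda = 1/2\). The sketch below checks this numerically; the choice of 4 degrees of freedom and the grid of \(x\)-values are arbitrary assumptions.

```r
# Arbitrary degrees of freedom and evaluation grid (assumed for illustration)
deg_free <- 4
x_grid   <- seq(0.1, 20, by = 0.1)

# Chi-squared density ...
dens_chisq <- dchisq(x_grid, df = deg_free)

# ... and the Gamma density with shape = df / 2 and rate = 1 / 2
dens_gamma <- dgamma(x_grid, shape = deg_free / 2, rate = 1 / 2)

# The two densities coincide
all.equal(dens_chisq, dens_gamma)
```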
Example 8 (Gamma distribution) In linguistics, the Gamma family is well suited to ratio-scaled acceptability data collected via Magnitude Estimation, where participants assign positive real-valued scores to stimuli. Because such scores have a natural zero, are right-skewed, and are bounded below by zero, the normal distribution is a poor fit; the Gamma accommodates this naturally (see Horsch & Buskin 2026 for a recent study).
gamma_data <- tibble(x = seq(0.01, 12, length.out = 500)) %>%
  mutate(density = dgamma(x, shape = 1.25, rate = 1.5))
ggplot(gamma_data, aes(x = x, y = density)) +
  geom_line(linewidth = 1, color = "steelblue2") +
  labs(
    title = "Gamma PDF for k = 1.25 and λ = 1.5",
    x = "Acceptability rating",
    y = "Probability density"
  ) +
  theme_minimal()
Exercises
You can find the solutions to the exercises here.
Tier 1
Exercise 1 Make suggestions for probability distributions that could be helpful for modelling the following linguistic variables and explain your choice:
- Analytic vs. synthetic comparatives (scarier vs. more scary)
- Number of pronunciation errors per minute
- Frequencies of modal verbs in American English
- Article choice (indefinite vs. definite vs. zero-article)
- The time it takes to read a word aloud (in ms)
- Grammatical acceptability ratings on a Likert-scale ranging from 1 to 5
Exercise 2 Let genitive Type be a binomially distributed variable, which is defined as
\[ \text{Type} = \begin{cases} \text{s} & \text{with probability } \pi \\ \text{of} & \text{with probability } 1-\pi \end{cases} \]
for \(0 < \pi < 1\).
We can model this variable using the binomial probability mass function, which has the form
\[ f(x; n, \pi) = \binom{n}{x}\pi^x(1-\pi)^{n-x}, \]
where \(n\) is the number of independent trials, \(\pi\) the probability of ‘success’, and \(x\) the observed number of successes.
Baguley (2012: 67) provides the code example below for plotting a discrete \(pmf\).
# Vector containing the numbers of successes
heads <- 0:10 # observing 'heads' 0, ..., 10 times
# Get probability mass for each success (fair coin -> P(Heads) = 0.5)
prob.mass <- dbinom(heads, 10, 0.5) # throwing the coin 10 times
# Plot the PMF
plot(heads, prob.mass, pch = NA, xlab = "Number of heads", ylab = "Probability")
# Create a spikey plot
segments(x0 = heads, y0 = 0, x1 = heads, y1 = prob.mass)
Adjust the code to plot the \(pmf\) for Type where \(\pi = 0.85\) and \(n = 50\), i.e., the binary trial is repeated 50 times. Provide a brief visual description of the distribution. What does the shape tell you about the likelihood of observing fewer than 40 successes?
Exercise 3 Let \(R\) be a continuous random variable measuring reaction times in lexical decision tasks. We approximate the measurements with a normal distribution parametrised by
\[ R \sim \mathcal{N}(\mu = 500, \sigma = 150), \]
which can be visualised as follows:
curve(dnorm(x, mean = 500, sd = 150), xlim = c(0, 1000))
Describe how the shape of the bell curve changes for \(\sigma = 50\), \(\sigma = 200\) and \(\sigma = 350\). What do these changes mean for the distribution of reaction times?
Tier 2
Exercise 4 If a variable follows a binomial distribution and we know the number of trials \(n\) and the probability of success \(\pi\), we can compute:
- the probability of observing a specific number of successes with dbinom():
# Example: P(4 heads from 10 tosses of a fair coin)
dbinom(x = 4, size = 10, prob = 0.5)
[1] 0.2050781
- the cumulative probability of up to \(x\) successes with pbinom():
# Example: P(up to 4 heads from 10 tosses of a fair coin)
pbinom(q = 4, size = 10, prob = 0.5)
[1] 0.3769531
# P(5 or more heads from 10 tosses of a fair coin)
1 - pbinom(q = 4, size = 10, prob = 0.5)
[1] 0.6230469
- random simulated successes with rbinom():
# Shows the numbers of successes in four simulated experiments of 10 tosses each
rbinom(n = 4, size = 10, prob = 0.5)
[1] 3 3 3 5
Interpret the output of the two code chunks below with reference to the Type variable from Exercise 2.
1 - pbinom(q = 80, size = 100, prob = 0.85)
[1] 0.8934557
rbinom(n = 50, size = 15, prob = 0.15)
 [1] 2 3 2 0 0 1 4 1 2 3 6 1 3 1 2 0 7 2 7 1 1 1 3 2 2 3 2 2 2 1 1 3 3 5 5 3 2 1
[39] 2 3 4 4 3 1 0 2 2 1 5 1
Tier 3
Exercise 5 Let \(D\) be a Poisson-distributed random variable which counts the total number of fillers such as uhm, er, and like per 100 words. It has the \(pmf\)
\[ f(x; \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}. \]
The study by Bortfeld et al. (2001) suggests that \(\lambda = 2\) would be a suitable rate parameter.
- Start by plotting the \(pmf\) using dpois() for \(x \in \{0, \dots, 10\}\). (Tip: Check the documentation by typing ?dpois into the console.)
- Using ppois(), which returns the cumulative probability \(P(X \leq x)\) for a Poisson-distributed variable, compute the probability of observing more than 5 fillers per 100 words. What does this suggest about the likelihood of highly disfluent speech under this model?
Exercise 6 The empirical study of dispersion has attracted significant attention in recent years (Sönning 2024; Gries 2024). A key challenge is finding dispersion measures that are minimally correlated with a word’s frequency of occurrence. One such measure is Kullback-Leibler divergence (KLD), which comes from information theory and is closely related to entropy.
Mathematically, KLD measures the difference between two probability distributions \(p\) and \(q\):
\[ KLD(p \parallel q) = \sum\limits_{x} p(x) \log \frac{p(x)}{q(x)} \]
For corpus dispersion (cf. Equation 12), we compare the posterior (actual) distribution of a word across corpus parts \(\frac{v_i}{f}\) with the prior distribution that assumes uniform spread across parts (weighted by part size \(s_i\)):
\[ KLD = \sum\limits_{i=1}^n \frac{v_i}{f} \times \log_2\left({\frac{v_i}{f} \times \frac{1}{s_i}}\right) \tag{12}\]
where:
- \(f\) = total frequency of the word in the corpus
- \(v_i\) = frequency of the word in corpus part \(i\)
- \(s_i\) = size of corpus part \(i\) (as fraction of total corpus)
- \(n\) = number of corpus parts
First, let’s create simulated corpus data to work with:
library(tidyverse)
# Create simulated corpus data
set.seed(123)
# Simulate a corpus with 5 parts of different sizes
corpus_parts <- c("Fiction", "News", "Academic", "Spoken", "Web")
part_sizes <- c(0.25, 0.20, 0.15, 0.25, 0.15) # Sizes of the corpus parts
total_tokens <- 1000000 # Corpus size
# Create some example words with different dispersion patterns
words_data <- tibble(
  word = rep(c("the", "however", "DNA", "like", "therefore"), each = 5),
  corpus_part = rep(corpus_parts, 5),
  frequency = c(
    # "the" - highly frequent, evenly distributed
    c(12500, 10000, 7500, 12500, 7500),
    # "however" - academic bias
    c(150, 200, 800, 100, 50),
    # "DNA" - strong academic bias
    c(10, 5, 200, 2, 8),
    # "like" - spoken bias
    c(200, 150, 50, 600, 300),
    # "therefore" - academic/formal bias
    c(50, 100, 300, 30, 20)
  ),
  part_size = rep(part_sizes, 5)
)
# Show full dataset
print(words_data, n = Inf)
# A tibble: 25 × 4
   word      corpus_part frequency part_size
   <chr>     <chr>           <dbl>     <dbl>
 1 the       Fiction         12500      0.25
 2 the       News            10000      0.2
 3 the       Academic         7500      0.15
 4 the       Spoken          12500      0.25
 5 the       Web              7500      0.15
 6 however   Fiction           150      0.25
 7 however   News              200      0.2
 8 however   Academic          800      0.15
 9 however   Spoken            100      0.25
10 however   Web                50      0.15
11 DNA       Fiction            10      0.25
12 DNA       News                5      0.2
13 DNA       Academic          200      0.15
14 DNA       Spoken              2      0.25
15 DNA       Web                 8      0.15
16 like      Fiction           200      0.25
17 like      News              150      0.2
18 like      Academic           50      0.15
19 like      Spoken            600      0.25
20 like      Web               300      0.15
21 therefore Fiction            50      0.25
22 therefore News              100      0.2
23 therefore Academic          300      0.15
24 therefore Spoken             30      0.25
25 therefore Web                20      0.15
Compute the dispersion values of the, however, DNA, like, and therefore based on the simulated corpus data. How does dispersion reflect their (lack of) register bias?