4.2 Probability theory

Author: Vladimir Buskin

Affiliation: Catholic University of Eichstätt-Ingolstadt

Abstract

This section covers some essential concepts from probability theory, such as the concept of probability, probability distributions, and expectations.

Preparation

The framenet dataset contains observations from 193,349 annotated sentences in the FrameNet database.

Defining probability

The OED provides several non-technical definitions of the term probability, which include

‘the extent to which something is likely to happen or be the case’¹ and
a ‘thing judged likely to be true, to exist, or to happen’².

¹ See https://doi.org/10.1093/OED/1639707847.

² See https://doi.org/10.1093/OED/3638534852.

While these dictionary definitions capture the intuitive (and rather subjective) nature of probability in everyday language, they leave important questions unanswered. How can we quantify probability? How can we describe the full range of possible outcomes — and their respective probabilities — in a systematic way? To address these concerns, we need a more objective and mathematically grounded notion of probability.

Relative frequency

Agresti & Kateri (2022: 29) adduce a frequency-based interpretation of probability:

“For an observation of a random phenomenon, the probability of a particular outcome is the proportion of times that outcome would occur in an indefinitely long sequence of like observations, under the same conditions.”

In other words, the probability of an event³ \(A\), denoted \(P(A)\), is equivalent to the long-term relative frequency of \(A\) as the number of observations \(n\) increases. The relative frequency \(f(A)\) is obtained by dividing the frequency of occurrence \(n(A)\) by the sample size \(n\), i.e.,

³ An event is understood as any subset of the sample space \(S\), comprising one or more outcomes.

\[ f(A) = \frac{n(A)}{n}. \tag{1}\]

Heumann et al. (2022b: 118) explain that \(f(A)\) converges to the probability \(P(A)\) as \(n\) approaches infinity:

\[ P(A) = \lim_{n\to\infty} \frac{n(A)}{n}. \tag{2}\]
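The limit in Equation 2 can be illustrated with a small simulation. The sketch below is not part of the FrameNet analysis: it assumes a fair coin, so that \(P(\text{Heads}) = 0.5\), and tracks the running relative frequency of heads. All object names are chosen for illustration only.

```r
# Minimal sketch: running relative frequency of 'Heads' for a simulated fair coin
set.seed(123)

n_max <- 10000
tosses <- sample(c("Heads", "Tails"), size = n_max, replace = TRUE)

# n(A) after each toss, divided by the number of tosses so far (Equation 1)
running_rel_freq <- cumsum(tosses == "Heads") / seq_len(n_max)

# Relative frequency after 10, 100, 1,000 and 10,000 tosses
running_rel_freq[c(10, 100, 1000, 10000)]
```

The values tend to fluctuate for small \(n\) and settle close to 0.5 as \(n\) grows.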

Example 1 In the FrameNet data, we can use relative frequencies to estimate the probability that an element is a core element, i.e., unique and essential to a frame.

# Probability that an element is a (non-)core element
framenet %>% 
  count(Coreness) %>% 
  mutate(rel_freq = n/sum(n))

# A tibble: 2 × 3
  Coreness      n rel_freq
  <chr>     <int>    <dbl>
1 core     159371    0.824
2 non-core  33978    0.176

Kolmogorov’s axioms

On the most abstract level, a probability is defined as a function that assigns each event, i.e. each subset of the sample space \(S = \{s_1, s_2, ..., s_n\}\), a value in the interval \([0, 1]\), subject to certain conditions.

The Russian mathematician Andrei Kolmogorov (1903–1987) proposed three axioms that a probability function \(P\) must satisfy:

Every event \(A\) in the sample space \(S\) has a probability

\[ P(A) \geq 0. \tag{3}\]

The probability of the sample space \(S\) is

\[ P(S) = 1. \tag{4}\]

For two disjoint (i.e., mutually exclusive) events \(A\) and \(B\),

\[ P(A \cup B) = P(A) + P(B). \tag{5}\]

Note that this axiom can be generalised to infinite series of pairwise disjoint events \(A_i\):

\[ P\left(\bigcup\limits_{i=1}^{\infty} A_i\right) = \sum\limits_{i=1}^{\infty} P(A_i). \]
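The three axioms can be checked on a small, fully specified example. The sketch below assumes a fair six-sided die (not part of the FrameNet data); the helper function `P()` and the events `A` and `B` are hypothetical names introduced for illustration.

```r
# Minimal sketch: checking the three axioms for a fair die (assumed example)
S <- 1:6
p <- setNames(rep(1/6, length(S)), S)   # probability of each single outcome

# Probability of an event = sum of the probabilities of its outcomes
P <- function(event) sum(p[as.character(event)])

A <- c(1, 2)   # event "1 or 2"
B <- c(5)      # event "5"; A and B are disjoint

axiom_1 <- all(p >= 0)                                      # Equation 3
axiom_2 <- isTRUE(all.equal(P(S), 1))                       # Equation 4
axiom_3 <- isTRUE(all.equal(P(union(A, B)), P(A) + P(B)))   # Equation 5

c(axiom_1, axiom_2, axiom_3)
```

All three checks should return `TRUE`. The comparisons use `all.equal()` rather than `==` because sums of fractions such as 1/6 are subject to floating-point rounding.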

Conditional probability

In many linguistic contexts, we’re interested in the probability of one event occurring given that another event has already occurred. This concept is captured by conditional probability, which measures the probability of event \(A\) happening when we know that event \(B\) has taken place. It is a way of capturing prior knowledge.

The conditional probability of \(A\) given \(B\) is denoted \(P(A|B)\) and is defined as:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \tag{6}\]

provided that \(P(B) > 0\). This formula tells us that the conditional probability is the ratio of the probability that both events occur to the probability that the conditioning event occurs.

One of the most important results in probability theory is Bayes’ theorem, which allows us to “reverse” conditional probabilities:

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}. \tag{7}\]

This theorem is fundamental to Bayesian statistics.
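Both formulas can be traced on a toy example before turning to corpus data. The sketch below again assumes a fair die; the events `A` ("even number") and `B` ("greater than 4") and the helper `P()` are illustrative choices, not taken from the dataset. Bayes' theorem is applied with the roles of \(A\) and \(B\) swapped relative to Equation 7, i.e. to obtain \(P(B \mid A)\) from \(P(A \mid B)\).

```r
# Minimal sketch: conditional probability and Bayes' theorem for a fair die
S <- 1:6
A <- c(2, 4, 6)   # event "even number"
B <- c(5, 6)      # event "greater than 4"

# With equally likely outcomes, P(event) = |event| / |S|
P <- function(event) length(event) / length(S)

# Equation 6: P(A | B) = P(A and B) / P(B)
p_A_given_B <- P(intersect(A, B)) / P(B)

# Bayes' theorem: reverse the conditioning to obtain P(B | A)
p_B_given_A_bayes  <- p_A_given_B * P(B) / P(A)

# Direct computation from the definition, for comparison
p_B_given_A_direct <- P(intersect(A, B)) / P(A)

c(A_given_B = p_A_given_B,
  B_given_A_bayes = p_B_given_A_bayes,
  B_given_A_direct = p_B_given_A_direct)
```

Here \(P(A \mid B) = 1/2\), whereas \(P(B \mid A) = 1/3\) by either route, which shows that the two conditional probabilities are generally not interchangeable.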

Example 2 Using the framenet data, we will compute the following probabilities:

\(P(\text{Agent})\): The probability that a frame element is an AGENT.

# P(Agent)
framenet %>%
  count(Frame.Element) %>%
  mutate(prob = n / sum(n)) %>%
  filter(Frame.Element == "Agent")

# A tibble: 1 × 3
  Frame.Element     n   prob
  <chr>         <int>  <dbl>
1 Agent         12421 0.0642

\(P(\text{Agent} \mid \text{Abandonment})\): The probability that a frame element is an AGENT, given the Abandonment frame.

# P(Agent | Abandonment)  
framenet %>%
  filter(Frame == "Abandonment") %>%
  count(Frame.Element) %>%
  mutate(prob = n / sum(n)) %>%
  filter(Frame.Element == "Agent")

# A tibble: 1 × 3
  Frame.Element     n  prob
  <chr>         <int> <dbl>
1 Agent            34 0.298

\(P(\text{Agent} \mid \text{Abandonment}, \text{leave})\): The probability that a frame element is an AGENT, given the Abandonment frame and the verb leave.

# P(Agent | Abandonment, leave)  
framenet %>%
  filter(Frame == "Abandonment", Verb == "leave.v") %>%
  count(Frame.Element) %>%
  mutate(prob = n / sum(n)) %>%
  filter(Frame.Element == "Agent")

# A tibble: 1 × 3
  Frame.Element     n  prob
  <chr>         <int> <dbl>
1 Agent            10 0.233

What is the probability that the frame is Abandonment, given that the frame element is an AGENT, i.e. \(P(\text{Abandonment} \mid \text{Agent})\)? We compute

\[ P(\text{Abandonment} \mid \text{Agent}) = \frac{P(\text{Agent} \mid \text{Abandonment}) \cdot P(\text{Abandonment})}{P(\text{Agent})}. \]

# Total number of observations
N_total <- nrow(framenet)

# Count of Frame = "Abandonment"
n_abandonment <- framenet %>%
  filter(Frame == "Abandonment") %>%
  count() %>%
  pull(n)

# Count of Frame.Element = "Agent"
n_agent <- framenet %>%
  filter(Frame.Element == "Agent") %>%
  count() %>%
  pull(n)

# Count of Frame.Element = "Agent" and Frame = "Abandonment"
n_agent_abandonment <- framenet %>%
  filter(Frame == "Abandonment", Frame.Element == "Agent") %>%
  count() %>%
  pull(n)

# Now compute:
# P(Agent | Abandonment) = n_agent_abandonment / n_abandonment
# P(Abandonment) = n_abandonment / N_total
# P(Agent) = n_agent / N_total

p_abandonment_given_agent <- (n_agent_abandonment / n_abandonment) * (n_abandonment / N_total) / (n_agent / N_total)

p_abandonment_given_agent

[1] 0.0027373
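Since \(P(\text{Abandonment})\) and the total number of observations cancel out in this expression, the result reduces to the number of AGENT tokens in the Abandonment frame divided by the number of all AGENT tokens. The sketch below plugs in the two counts reported in the outputs of this example (34 and 12,421) instead of querying the dataset, so it can be run without the framenet object.

```r
# Counts taken from the outputs shown earlier in this example
n_agent_abandonment <- 34      # AGENT tokens within the Abandonment frame
n_agent <- 12421               # AGENT tokens overall

# P(Abandonment | Agent) = n(Agent and Abandonment) / n(Agent)
p_abandonment_given_agent <- n_agent_abandonment / n_agent
p_abandonment_given_agent
```

This should reproduce the value of roughly 0.0027 obtained via Bayes' theorem.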

Probability distributions

Recall the concept of random variables introduced in 4.1 Data types. They describe random processes, which means that the outcomes of the experiment are not pre-determined in any way; there is always some degree of uncertainty involved. Each outcome occurs with a certain probability.

When we associate each outcome of a random variable with a probability, we obtain its probability distribution. A function that explicitly maps probabilities onto the discrete outcomes of a discrete random variable \(X\) is called a probability mass function (\(pmf\)). In the continuous case, we speak of a probability density function (\(pdf\)).

Example 3 We can establish the probability distribution of frame elements for the verb eat (INGESTION frame) as follows:

# Obtain observations on the verb "eat"
eat <- framenet %>% 
  filter(Verb == "eat.v", Frame == "Ingestion")

# Count tokens and compute relative frequencies
eat_data <- eat %>% 
  count(Frame.Element) %>% 
  mutate(rel_freq = n/sum(n)) %>% 
  arrange(desc(rel_freq))

# Ensure that probabilities sum up to 1
sum(eat_data$rel_freq)

[1] 1

# Plot the PMF
eat_data %>% 
  ggplot(aes(x = Frame.Element, y = rel_freq)) +
  geom_col() +
  labs(title = "PMF of frame elements for 'eat' (INGESTION frame) ")

Discrete distributions

Uniform distribution

If all discrete outcomes have the same probability, their distribution is called uniform. A well-known example is the toss of a coin: If the coin is fair, the outcomes \(\{\text{Heads}\}\) and \(\{\text{Tails}\}\) are equally likely. More generally, if a discrete random variable \(X\) has \(k\) different outcomes, then any outcome \(x_i\) has the probability

\[ P(X = x_i) = \frac{1}{k}. \tag{8}\]

If we toss a fair coin once, the sample space is \(S = \{\text{Heads}, \text{Tails}\}\), and we have \(P(\text{Heads}) = P(\text{Tails}) = \frac{1}{2}\).

Note that all probabilities have to add up to 1, i.e.,

\[ \sum_{x\in X} P(x) = 1. \tag{9}\]

Example 4 A single throw of a die has the sample space \(S = \{1, 2, 3, 4, 5, 6\}\). The long-term relative frequency of each outcome, and thus its probability, should be \(1/6 \approx 0.17\).

# Load libraries
library(tidyverse)

# First, set a seed for reproducibility; this ensures that we always generate the 'same' random numbers
set.seed(123)

# Simulate 10,000 rolls of a fair six-sided die
die_rolls <- sample(1:6, size = 10000, replace = TRUE)

# Plot the probability mass function
ggplot(data.frame(die_rolls), aes(x = factor(die_rolls))) +
  geom_bar(aes(y = after_stat(count / sum(count))), fill = "steelblue") +
  geom_hline(yintercept = 1/6, col = "black") +
  labs(title = "Rolling a die 10000 times (k = 6, p = 1/6)",
       y = "Relative frequency",
       x = "Outcome") +
  scale_y_continuous(limits = c(0, 0.25)) +
  theme_minimal()

Binomial distribution

Many corpus-linguistic studies are concerned with discrete random variables \(X\) that have exactly two outcomes, such as the dative alternation (give somebody something vs. give something to somebody), the particle placement alternation (pick something up vs. pick up something), or subject and object realisation (I’ve eaten something vs. I’ve eaten Ø).

Assume we make \(n\) independent binary observations (also known as Bernoulli trials) of \(X\). If one of the two outcomes of \(X\) has a fixed probability \(\pi\) (often denoted ‘success’) and the other one the probability \(1 - \pi\) (‘failure’), \(X\) follows a binomial distribution.⁴ We can also use the shorthand notation

⁴ The natural extension of the binomial distribution to \(k\) different outcomes is called the multinomial distribution with index \(n\) and probabilities \(\pi_1, \pi_2, \dots, \pi_k\).

\[ X \sim Binom(n, \pi). \]

The elements inside the parentheses are the parameters of the distribution, determining the outcomes of \(X\): the number \(n\) of independent observations and the probability \(\pi\) of ‘success’. As such, they affect the shape of the probability mass (or density) function.

Example 5 Suppose we throw a fair coin \(30\) times and count how often we obtain ‘heads’. We’d expect to see this outcome \(30 \cdot 0.5 = 15\) times.

# Set seed for reproducibility
set.seed(123)

# Set parameters for binomial distribution
n <- 30    # Number of trials
pi <- 0.5  # Probability of HEADS

# Set up the pmf
successes <- 0:n
prob_mass <- dbinom(successes, size = n, prob = pi)

# Create a data frame for plotting
binom_data <- data.frame(successes = successes, probability = prob_mass)

# Basic plot approach
ggplot(binom_data, aes(x = successes, y = probability)) +
  geom_segment(aes(xend = successes, yend = 0), linewidth = 1.2) +
  labs(title = "PMF of n = 30 coin tosses with P({Heads}) = 0.5",
       x = "Occurrences of {Heads}", y = "Probability") +
  theme_minimal()
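The probabilities returned by dbinom() can also be computed by hand from the binomial probability mass function \(\binom{n}{x}\pi^x(1-\pi)^{n-x}\). The sketch below does this for the coin example; it uses the name `prob` rather than `pi` for the success probability so that R's built-in constant `pi` is not masked, and `pmf_manual` is an illustrative name.

```r
# Minimal sketch: binomial probabilities by hand vs. dbinom()
n <- 30          # number of trials
prob <- 0.5      # probability of 'heads'
x <- 0:n         # possible numbers of heads

# Binomial pmf: choose(n, x) * prob^x * (1 - prob)^(n - x)
pmf_manual <- choose(n, x) * prob^x * (1 - prob)^(n - x)

all.equal(pmf_manual, dbinom(x, size = n, prob = prob))  # agreement with dbinom()
sum(pmf_manual)            # probabilities sum to 1
sum(x * pmf_manual)        # expected number of heads, n * prob
x[which.max(pmf_manual)]   # most probable number of heads
```

For these parameters, both the expected and the most probable number of heads are 15, in line with \(30 \cdot 0.5 = 15\).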

Poisson distribution

The Poisson distribution is particularly suitable for frequency data, which is ubiquitous in corpus linguistics. A Poisson-distributed random variable is fully determined by its rate parameter \(\lambda\), which specifies how often events occur on average:

\[ X \sim Pois(\lambda). \]

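Assume a word occurs 3 times per 1,000 words on average. We would then set the rate parameter to \(\lambda = 3\), which is the expected number of occurrences per 1,000 words. The most probable counts are 2 and 3, which are equally likely for this value of \(\lambda\).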

# Define the occurrence rate
lambda <- 3

# Define the range of x values
x_range <- 0:10

# Compute PMF values using dpois()
poisson_pmf <- data.frame(
  x = x_range,
  probability = dpois(x_range, lambda)
)

# Create the PMF plot using ggplot
ggplot(poisson_pmf, aes(x = x, y = probability)) +
  geom_segment(aes(xend = x, yend = 0), linewidth = 1.2) + 
  labs(title = "Poisson PMF with λ = 3",
       x = "Number of events",
       y = "Probability") +
  scale_x_continuous(breaks = x_range) +
  theme_minimal()
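The values returned by dpois() follow from the Poisson probability mass function \(\lambda^{x} e^{-\lambda} / x!\). The sketch below recomputes them by hand for \(\lambda = 3\) and approximates the expectation by summing over the counts 0 to 100, a cut-off chosen here for illustration (the probability mass beyond it is negligible for this rate).

```r
# Minimal sketch: Poisson probabilities by hand vs. dpois()
lambda <- 3
x <- 0:10

# Poisson pmf: lambda^x * exp(-lambda) / x!
pmf_manual <- lambda^x * exp(-lambda) / factorial(x)
all.equal(pmf_manual, dpois(x, lambda))   # agreement with dpois()

# Probabilities of observing exactly 2 and exactly 3 events
dpois(c(2, 3), lambda)

# Expectation: sum of x * P(X = x), truncated at x = 100
expectation <- sum(0:100 * dpois(0:100, lambda))
expectation
```

The expectation comes out at \(\lambda = 3\) (up to rounding), and the counts 2 and 3 receive the same probability.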

Continuous distributions

The normal distribution

A great number of numerical variables in the world approximately follow the well-known normal (or Gaussian) distribution, including test scores, weight and height, among many others. The plot below illustrates its characteristic bell shape: most observations are in the middle, with considerably fewer near the fringes. For example, most people are rather “average” in height; there are only a few people who are extremely short or extremely tall.

The normal distribution is typically described in terms of two parameters: the population mean \(\mu\) and the population variance \(\sigma^2\). If a random variable \(X\) is normally distributed, we typically use the notation in Equation 10.

\[ X \sim \mathcal{N}(\mu, \sigma^2). \tag{10}\]

The \(\mu\) parameter corresponds to the expected value \(E(X)\), which is a typical (or average) value of a distribution.

The spread of data points around the expectation is the population variance and corresponds to \(\sigma^2\):

\[ Var(X) = E\left[(X-E(X))^2\right]. \]

The population standard deviation \(\sigma\) can be read as a typical distance of values from the expectation and is defined as \(\sqrt{Var(X)}\).
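The expectation and variance can be approximated from a large simulated sample. The sketch below assumes \(\mu = 0\) and \(\sigma = 2\) (arbitrary illustrative values) and applies the variance formula directly rather than calling var(), which divides by \(n - 1\) instead of \(n\).

```r
# Minimal sketch: estimating E(X), Var(X) and the standard deviation by simulation
set.seed(123)
mu <- 0
sigma <- 2
x <- rnorm(100000, mean = mu, sd = sigma)

mean_hat <- mean(x)                  # estimate of E(X), close to mu
var_hat  <- mean((x - mean_hat)^2)   # estimate of Var(X), close to sigma^2 = 4
sd_hat   <- sqrt(var_hat)            # estimate of the standard deviation, close to sigma

c(mean = mean_hat, variance = var_hat, sd = sd_hat)
```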

Example 6 The plot illustrates a standard normal distribution for \(X \sim \mathcal{N}(0, 1)\). The \(y\)-axis indicates the density of population values; note that since the Gaussian distribution is a continuous distribution with infinitely many possible \(x\)-values, the probability of any single value is 0. We can only obtain probabilities for intervals of values, which are given by

\[ P(a \leq X \leq b) = \int_a^b f(x)dx. \tag{11}\]

# Set parameters for normal distribution
mu <- 0     # mean (expectation)
sigma <- 1  # standard deviation (square root of variance)
variance <- sigma^2

# Generate a sequence of x-values in a range of +/- 4 standard deviations from the mean
x_values <- seq(mu - 4*sigma, mu + 4*sigma, length.out = 1000)

# Collect everything in a data frame
norm_data <- data.frame(
  x = x_values,
  density = dnorm(x_values, mean = mu, sd = sigma)
)

# Plot the simulated data
ggplot(norm_data, aes(x = x, y = density)) +
  geom_line(linewidth = 1, color = "#0D47A1") +
  labs(
    title = "PDF for N(0, 1)",
    x = "x",
    y = "Probability density") +
  theme_minimal()

Quick facts about the Gaussian bell curve

Quite interestingly,

about 68% of all values fall within one standard deviation of the mean,
95% within two, and
99.7% within three.
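These percentages are interval probabilities in the sense of Equation 11 and can be verified numerically. The sketch below, for the standard normal distribution, evaluates the integral once with integrate() and once via the cumulative distribution function pnorm(); the object names are illustrative.

```r
# Equation 11 for the standard normal: P(-1 <= X <= 1)
p_integral <- integrate(dnorm, lower = -1, upper = 1)$value  # numerical integration of the pdf
p_cdf <- pnorm(1) - pnorm(-1)                                # same quantity via the CDF
c(integral = p_integral, cdf = p_cdf)

# Share of values within k = 1, 2, 3 standard deviations of the mean
k <- 1:3
within_k <- round(pnorm(k) - pnorm(-k), 4)
within_k
# 0.6827 0.9545 0.9973
```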

The log-normal distribution

While many variables follow a normal distribution, others exhibit a characteristic right-skewed pattern where most observations cluster near zero but some extend far into the positive tail. This is particularly common with reaction times or survival data.

The log-normal distribution describes variables whose natural logarithm follows a normal distribution. If \(\ln(X) \sim \mathcal{N}(\mu, \sigma^2)\), then \(X\) follows a log-normal distribution, denoted as:

\[ X \sim LogN(\mu, \sigma^2). \]

The parameters \(\mu\) and \(\sigma^2\) refer to the mean and variance of the underlying normal distribution (i.e., of \(\ln(X)\)), not of \(X\) itself.

Example 7 Word frequencies in natural language corpora are strongly right-skewed and are often modelled with a log-normal distribution. Consider the frequency distribution of lemmas in a corpus, where most words occur rarely but a few occur very frequently (cf. Zipf’s law).

# Set parameters for log-normal distribution
mu_log <- 2      # Mean of the underlying normal distribution (log scale)
sigma_log <- 1   # Standard deviation of the underlying normal distribution

# Generate x values on the positive real line
x_values <- seq(0.1, 50, length.out = 1000)

# Calculate the probability density
lognorm_data <- data.frame(
  x = x_values,
  density = dlnorm(x_values, meanlog = mu_log, sdlog = sigma_log)
)

# Plot the log-normal distribution
ggplot(lognorm_data, aes(x = x, y = density)) +
  geom_line(linewidth = 1, color = "purple") +
  labs(
    title = "PDF for Log-Normal Distribution (μ = 2, σ = 1)",
    x = "Frequency (occurrences per million words)",
    y = "Probability density"
  ) +
  theme_minimal()

# Show the relationship between normal and log-normal
# Generate sample data
set.seed(123)
normal_sample <- rnorm(1000, mean = mu_log, sd = sigma_log)
lognormal_sample <- exp(normal_sample)

# Create comparison plot
par(mfrow = c(1, 2))
hist(normal_sample, main = "Normal Distribution\n(log scale)", 
     xlab = "ln(X)", col = "#90CAF9", breaks = 30)
hist(lognormal_sample, main = "Log-Normal Distribution\n(original scale)", 
     xlab = "X", col = "#FFCDD2", breaks = 30)
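Because \(\mu\) and \(\sigma^2\) describe \(\ln(X)\) rather than \(X\), the centre of a log-normal variable on the original scale has to be derived from them. The sketch below relies on two standard results, namely that the median of \(X\) is \(e^{\mu}\) and its mean is \(e^{\mu + \sigma^2/2}\), and compares them with a simulated sample. It also checks the relationship between the two density functions, dlnorm(x) = dnorm(log(x)) / x. The evaluation points in `x` are arbitrary.

```r
# Minimal sketch: linking the log-normal distribution to the underlying normal
mu_log <- 2
sigma_log <- 1

# Density relationship: dlnorm(x) = dnorm(log(x)) / x
x <- c(0.5, 1, 5, 20)
all.equal(dlnorm(x, meanlog = mu_log, sdlog = sigma_log),
          dnorm(log(x), mean = mu_log, sd = sigma_log) / x)

# Median and mean of X on the original scale
median_theory <- qlnorm(0.5, meanlog = mu_log, sdlog = sigma_log)  # equals exp(mu_log)
mean_theory <- exp(mu_log + sigma_log^2 / 2)

# Compare with a large simulated sample
set.seed(123)
sim <- rlnorm(100000, meanlog = mu_log, sdlog = sigma_log)
rbind(theory = c(median = median_theory, mean = mean_theory),
      simulated = c(median = median(sim), mean = mean(sim)))
```

The mean exceeds the median, which reflects the long right tail visible in the histogram of the original scale.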

Exercises

Tier 1

Exercise 1 Make suggestions for probability distributions that could be helpful for modelling the following variables:

Analytic vs. synthetic comparatives (scarier vs. more scary)
Number of pronunciation errors per minute
Frequencies of modal verbs in American English
Article choice (indefinite vs. definite vs. zero-article)
The time it takes to read a word aloud (in ms)

Exercise 2 Let Coreness be a binomially distributed variable, which is defined as

\[ \text{Coreness} = \begin{cases} \text{core} & \text{with probability } \pi \\ \text{non-core} & \text{with probability } 1-\pi \end{cases} \]

for \(0 < \pi < 1\).

We can model this variable using the binomial probability mass function, which has the form

\[ f(x; n, \pi) = \binom{n}{x}\pi^x(1-\pi)^{n-x}, \]

where \(n\) is the number of independent trials, \(\pi\) the probability of ‘success’, and \(x\) the observed number of successes.

Baguley (2012: 67) provides the code example below for plotting a discrete \(pmf\).

# Vector containing the numbers of successes
heads <- 0:10 # observing 'heads' 0, ..., 10 times

# Get probability mass for each success (fair coin -> P(Heads) = 0.5)
prob.mass <- dbinom(heads, 10, 0.5) # throwing the coin 10 times

# Plot the PMF
plot(heads, prob.mass, pch = NA, xlab = "Number of heads", ylab = "Probability")

# Create a spikey plot
segments(x0 = heads, y0 = 0, x1 = heads, y1 = prob.mass)

Adjust the code to plot the \(pmf\) for Coreness where \(\pi = 0.85\) and \(n = 50\). The binomial experiment is repeated 50 times. Provide a brief visual description of the distribution.

Exercise 3 Let \(R\) be a continuous random variable measuring reaction times in lexical decision tasks. We approximate the measurements with a normal distribution parametrised by

\[ R \sim \mathcal{N}(\mu = 500, \sigma^2 = 150^2), \]

which can be visualised as follows:

curve(dnorm(x, mean = 500, sd = 150), xlim = c(0, 1000))

Describe how the shape of the bell curve changes for \(\sigma = 50\), \(\sigma = 200\) and \(\sigma = 350\). What do these changes mean for the distribution of reaction times?

Tier 2

Exercise 4 If a variable follows a binomial distribution and we know the number of trials \(n\) and the probability of success \(\pi\), we can compute:

the probability of observing a specific number of successes with dbinom():

# Example: P(4 heads from 10 tosses of a fair coin)
dbinom(x = 4, size = 10, prob = 0.5)

[1] 0.2050781

the cumulative probability of up to \(x\) successes with pbinom():

# Example: P(up to 4 heads from 10 tosses of a fair coin)
pbinom(q = 4, size = 10, prob = 0.5)

[1] 0.3769531

# P(5 or more heads from 10 tosses of a fair coin)
1 - pbinom(q = 4, size = 10, prob = 0.5)

[1] 0.6230469

random simulated successes with rbinom():

# Shows the number of successes in each of four simulated experiments of 10 tosses
rbinom(n = 4, size = 10, prob = 0.5)

[1] 3 3 3 5

Interpret the output of the two code chunks below with reference to the Coreness variable from Exercise 2.

1 - pbinom(q = 80, size = 100, prob = 0.85)

[1] 0.8934557

rbinom(n = 50, size = 15, prob = 0.15)

 [1] 2 3 2 0 0 1 4 1 2 3 6 1 3 1 2 0 7 2 7 1 1 1 3 2 2 3 2 2 2 1 1 3 3 5 5 3 2 1
[39] 2 3 4 4 3 1 0 2 2 1 5 1

Exercise 5 Let \(D\) be a Poisson-distributed random variable which counts the total number of fillers such as uhm, er, and like per 100 words. It has the \(pmf\)

\[ f(x, \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}. \]

The study by Bortfeld et al. (2001) suggests that \(\lambda = 2\) would be a suitable rate parameter. Plot the \(pmf\) using dpois() for \(x \in \{0, \dots, 10\}\). (Tip: Check the documentation by typing ?dpois into the console.)

Exercise 6 Consider the distribution of frame elements for the verb eat (INGESTION):

eat_data

# A tibble: 6 × 3
  Frame.Element     n rel_freq
  <chr>         <int>    <dbl>
1 Ingestor         24   0.414 
2 Ingestibles      23   0.397 
3 Place             5   0.0862
4 Manner            3   0.0517
5 Source            2   0.0345
6 Time              1   0.0172

What probability distribution would be suitable for modelling the variable Frame.Element?

Tier 3

Exercise 7 The empirical study of dispersion has attracted significant attention in recent years (Sönning 2024; Gries 2024). A key challenge is finding dispersion measures that are minimally correlated with token frequency. One such measure is the Kullback-Leibler divergence (KLD), which comes from information theory and is closely related to entropy.

Mathematically, KLD measures the difference between two probability distributions \(p\) and \(q\):

\[ KLD(p \parallel q) = \sum\limits_{x} p(x) \log \frac{p(x)}{q(x)} \]

For corpus dispersion (cf. Equation 12), we compare the posterior (actual) distribution of a word across corpus parts \(\frac{v_i}{f}\) with the prior distribution that assumes uniform spread across parts (weighted by part size \(s_i\)):

\[ KLD = \sum\limits_{i=1}^n \frac{v_i}{f} \times \log_2\left({\frac{v_i}{f} \times \frac{1}{s_i}}\right) \tag{12}\]

where:

\(f\) = total frequency of the word in the corpus
\(v_i\) = frequency of the word in corpus part \(i\)
\(s_i\) = size of corpus part \(i\) (as fraction of total corpus)
\(n\) = number of corpus parts

First, let’s create simulated corpus data to work with:

library(tidyverse)

# Create simulated corpus data
set.seed(123)

# Simulate a corpus with 5 parts of different sizes
corpus_parts <- c("Fiction", "News", "Academic", "Spoken", "Web")

part_sizes <- c(0.25, 0.20, 0.15, 0.25, 0.15)  # Sizes of the corpus parts

total_tokens <- 1000000 # Corpus size

# Create some example words with different dispersion patterns
words_data <- tibble(
  word = rep(c("the", "however", "DNA", "like", "therefore"), each = 5),
  corpus_part = rep(corpus_parts, 5),
  frequency = c(
    # "the" - highly frequent, evenly distributed
    c(12500, 10000, 7500, 12500, 7500),
    # "however" - academic bias
    c(150, 200, 800, 100, 50),
    # "DNA" - strong academic bias  
    c(10, 5, 200, 2, 8),
    # "like" - spoken bias
    c(200, 150, 50, 600, 300),
    # "therefore" - academic/formal bias
    c(50, 100, 300, 30, 20)
  ),
  part_size = rep(part_sizes, 5)
)

# Show full dataset
print(words_data, n = Inf)

# A tibble: 25 × 4
   word      corpus_part frequency part_size
   <chr>     <chr>           <dbl>     <dbl>
 1 the       Fiction         12500      0.25
 2 the       News            10000      0.2 
 3 the       Academic         7500      0.15
 4 the       Spoken          12500      0.25
 5 the       Web              7500      0.15
 6 however   Fiction           150      0.25
 7 however   News              200      0.2 
 8 however   Academic          800      0.15
 9 however   Spoken            100      0.25
10 however   Web                50      0.15
11 DNA       Fiction            10      0.25
12 DNA       News                5      0.2 
13 DNA       Academic          200      0.15
14 DNA       Spoken              2      0.25
15 DNA       Web                 8      0.15
16 like      Fiction           200      0.25
17 like      News              150      0.2 
18 like      Academic           50      0.15
19 like      Spoken            600      0.25
20 like      Web               300      0.15
21 therefore Fiction            50      0.25
22 therefore News              100      0.2 
23 therefore Academic          300      0.15
24 therefore Spoken             30      0.25
25 therefore Web                20      0.15

Compute the dispersion values of the, however, DNA, like, and therefore based on the simulated corpus data. How does dispersion reflect their (lack of) register bias?

References

Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With R and Python. Boca Raton: CRC Press.

Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Houndmills, Basingstoke: Palgrave Macmillan.

Bortfeld, Heather, Silvia D. Leon, Jonathan E. Bloom, Michael F. Schober, and Susan E. Brennan. 2001. “Disfluency Rates in Conversation: Effects of Age, Relationship, Topic, Role, and Gender.” Language and Speech 44 (2): 123–47. https://doi.org/10.1177/00238309010440020101.

Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Pacific Grove, Calif.: Duxbury/Thomson Learning.

Fillmore, Charles J., Christopher R. Johnson, and Miriam R. L. Petruck. 2003. “Background to FrameNet.” International Journal of Lexicography 16 (3): 235–50.

Gries, Stefan Thomas. 2024. Frequency, Dispersion, Association, and Keyness: Revising and Tupleizing Corpus-Linguistic Measures. Amsterdam & Philadelphia: John Benjamins Publishing Company.

Heumann, Christian, Michael Schomaker, and Shalabh. 2022b. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer.

———. 2022a. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer. https://doi.org/10.1007/978-3-031-11833-3.

Ruppenhofer, Josef, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, Collin F. Baker, and Jan Scheffczyk. 2016. “FrameNet II: Extended Theory and Practice.” 2016. https://framenet2.icsi.berkeley.edu/docs/r1.7/book.pdf.

Sönning, Lukas. 2024. “Evaluation of Keyness Metrics: Performance and Reliability.” Corpus Linguistics and Linguistic Theory 20 (2): 263–88. https://doi.org/10.1515/cllt-2022-0116.
