Statistics for Corpus Linguists
4.2 Probability theory

Author: Vladimir Buskin

Affiliation: Catholic University of Eichstätt-Ingolstadt

Abstract
This section covers some essential concepts from probability theory, including the definition of probability, probability distributions, and expectations.

Recommended reading

Most accessible:

Baguley (2012): Chapter 2

Technical:

Heumann, Schomaker, and Shalabh (2022a): Chapter 7

Agresti and Kateri (2022): Chapter 2

Proof-based:

Casella and Berger (2002): Chapters 1 & 2

Preparation

The framenet dataset contains observations from 193,349 annotated sentences in the FrameNet database.

More on FrameNet

FrameNet emerged as part of Charles Fillmore’s research on frame semantics (Fillmore, Johnson, and Petruck 2003) and is currently maintained by Ruppenhofer et al. (2016). It is fundamentally based on the idea that lexical items can be meaningfully grouped according to the semantic roles they evoke.

For instance, lexical units such as boil, brown, or fry are said to evoke the APPLY_HEAT frame, in which a COOK applies heat to FOOD using a HEATING_INSTRUMENT, a CONTAINER, or a MEDIUM (cf. exs. 1–3 below). Speakers may moreover include circumstantial elements like MANNER or PLACE.

  1. [They]\(_\text{COOK}\) [boiled] [them]\(_\text{FOOD}\) [in an iron saucepan]\(_\text{CONTAINER}\).

  2. [You]\(_\text{COOK}\) [can brown] [it]\(_\text{FOOD}\) [in the hot fat]\(_\text{MEDIUM}\).

  3. [She]\(_\text{COOK}\) [was frying] [eggs and bacon and mushrooms]\(_\text{FOOD}\) [on a camp stove]\(_\text{MEDIUM}\) [in Woolley’s billet]\(_\text{PLACE}\).

Elements that are essential for identifying and understanding a frame are called core; they can be considered “obligatory” complements of the verb (here: COOK and FOOD). By contrast, non-core elements are not unique to a particular frame and can be freely added or dropped (similar to adjuncts; here: PLACE).

library(tidyverse)
library(readxl)

framenet <- read_xlsx("FrameNet_full.xlsx")
# Overview
str(framenet)
head(framenet)

# How many distinct verbs?
length(unique(framenet$Verb))

# How many distinct frames?
length(unique(framenet$Frame))

# How many distinct frame elements?
length(unique(framenet$Frame.Element))

Defining probability

The OED provides several non-technical definitions of the term probability, which include

  • ‘the extent to which something is likely to happen or be the case’1 and

  • a ‘thing judged likely to be true, to exist, or to happen’2.

  • 1 See https://doi.org/10.1093/OED/1639707847.

  • 2 See https://doi.org/10.1093/OED/3638534852.

While these dictionary definitions capture the intuitive (and rather subjective) nature of probability in everyday language, they leave important questions unanswered. How can we quantify probability? How can we describe the full range of possible outcomes, and their respective probabilities, in a systematic way? To address these concerns, we need a more objective and mathematically grounded notion of probability.

    Relative frequency

    Agresti & Kateri (2022: 29) adduce a frequency-based interpretation of probability:

    “For an observation of a random phenomenon, the probability of a particular outcome is the proportion of times that outcome would occur in an indefinitely long sequence of like observations, under the same conditions.”

    In other words, the probability of an event3 \(A\), denoted \(P(A)\), is equivalent to the long-term relative frequency of \(A\) as the number of observations \(n\) increases. The relative frequency \(f(A)\) is obtained by dividing the frequency of occurrence \(n(A)\) by the sample size \(n\), i.e.,

  • 3 An event is understood as any subset of the sample space \(S\), comprising one or more outcomes.

    \[ f(A) = \frac{n(A)}{n}. \tag{1}\]

    Heumann et al. (2022b: 118) explain that \(f(A)\) converges to the probability \(P(A)\) as \(n\) approaches infinity:

    \[ P(A) = \lim_{n\to\infty} \frac{n(A)}{n}. \tag{2}\]
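
    To see this convergence in action, here is a small simulation sketch (illustrative only, not part of the FrameNet data): we toss a fair virtual coin 10,000 times and track the running relative frequency of heads, which settles near \(P(\text{Heads}) = 0.5\) as \(n\) grows.

    # Set a seed so the simulated tosses are reproducible
    set.seed(123)
    
    # Simulate 10,000 tosses of a fair coin (1 = heads, 0 = tails)
    tosses <- rbinom(10000, size = 1, prob = 0.5)
    
    # Running relative frequency f(Heads) after each toss
    running_freq <- cumsum(tosses) / seq_along(tosses)
    
    # Inspect the relative frequency after 10, 100, 1,000 and 10,000 tosses
    running_freq[c(10, 100, 1000, 10000)]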

    Example 1 In the FrameNet data, we can use relative frequencies to estimate the probability that an element is a core element, i.e., unique and essential to a frame.

    # Probability that an element is a (non-)core element
    framenet %>% 
      count(Coreness) %>% 
      mutate(rel_freq = n/sum(n))
    # A tibble: 2 × 3
      Coreness      n rel_freq
      <chr>     <int>    <dbl>
    1 core     159371    0.824
    2 non-core  33978    0.176
    Kolmogorov’s axioms

    On the most abstract level, probabilities are defined as functions that map events (subsets of the sample space \(S = \{s_1, s_2, ..., s_n\}\)) to values in the interval \([0, 1]\), subject to certain conditions.

    The Russian mathematician Andrei Kolmogorov (1903–1987) proposed three axioms that a probability function \(P\) must satisfy:

    1. Every event \(A\) in the sample space \(S\) has a probability

    \[ P(A) \geq 0. \tag{3}\]

    2. The probability of the sample space \(S\) is

    \[ P(S) = 1. \tag{4}\]

    3. If two events \(A\) and \(B\) are disjoint (i.e., mutually exclusive), then

    \[ P(A \cup B) = P(A) + P(B). \tag{5}\]

    Note that this axiom can be generalised to an infinite series of pairwise disjoint events \(A_i\):

    \[ P\left(\bigcup\limits_{i=1}^{\infty} A_i\right) = \sum\limits_{i=1}^{\infty} P(A_i). \]
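
    As a quick sanity check (assuming the framenet data frame from the Preparation section, with the Coreness values shown in Example 1), we can confirm that the empirical relative frequencies behave like a probability function in the sense of these axioms:

    # Empirical probabilities of the two (disjoint) Coreness outcomes
    p_core <- mean(framenet$Coreness == "core")
    p_noncore <- mean(framenet$Coreness == "non-core")
    
    # Axiom 1: probabilities are non-negative
    c(p_core, p_noncore) >= 0
    
    # Axiom 2: the entire sample space {core, non-core} has probability 1
    p_core + p_noncore
    
    # Axiom 3: for disjoint events, probabilities add up
    p_either <- mean(framenet$Coreness %in% c("core", "non-core"))
    all.equal(p_either, p_core + p_noncore)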

    Conditional probability

    In many linguistic contexts, we’re interested in the probability of one event occurring given that another event has already occurred. This concept is captured by conditional probability, which measures the probability of event \(A\) happening when we know that event \(B\) has taken place. It is a way of capturing prior knowledge.

    The conditional probability of \(A\) given \(B\) is denoted \(P(A|B)\) and is defined as:

    \[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \tag{6}\]

    provided that \(P(B) > 0\). This formula tells us that the conditional probability is the ratio of the probability that both events occur to the probability that the conditioning event occurs.
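
    To make the definition concrete, the following sketch (assuming the framenet data frame is loaded) applies Equation 6 directly to the frame element AGENT and the Abandonment frame; the result matches the filtering approach used in Example 2 below.

    # P(Agent and Abandonment): joint probability across all observations
    p_joint <- mean(framenet$Frame.Element == "Agent" & framenet$Frame == "Abandonment")
    
    # P(Abandonment): probability of the conditioning event
    p_abandonment <- mean(framenet$Frame == "Abandonment")
    
    # P(Agent | Abandonment) according to Equation 6
    p_joint / p_abandonment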

    One of the most important results in probability theory is Bayes’ theorem, which allows us to “reverse” conditional probabilities:

    \[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}. \tag{7}\]

    This theorem is fundamental to Bayesian statistics.

    Example 2 Using the framenet data, we will compute the probabilities:

    • \(P(\text{Agent})\): The probability that a frame element is an AGENT.
    # P(Agent)
    framenet %>%
      count(Frame.Element) %>%
      mutate(prob = n / sum(n)) %>%
      filter(Frame.Element == "Agent")
    # A tibble: 1 × 3
      Frame.Element     n   prob
      <chr>         <int>  <dbl>
    1 Agent         12421 0.0642
    • \(P(\text{Agent} \mid \text{Abandonment})\): The probability that a frame element is an AGENT, given the Abandonment frame.
    # P(Agent | Abandonment)  
    framenet %>%
      filter(Frame == "Abandonment") %>%
      count(Frame.Element) %>%
      mutate(prob = n / sum(n)) %>%
      filter(Frame.Element == "Agent")
    # A tibble: 1 × 3
      Frame.Element     n  prob
      <chr>         <int> <dbl>
    1 Agent            34 0.298
    • \(P(\text{Agent} \mid \text{Abandonment}, \text{leave})\): The probability that a frame element is an AGENT, given the Abandonment frame and the verb leave.
    # P(Agent | Abandonment, leave)  
    framenet %>%
      filter(Frame == "Abandonment", Verb == "leave.v") %>%
      count(Frame.Element) %>%
      mutate(prob = n / sum(n)) %>%
      filter(Frame.Element == "Agent")
    # A tibble: 1 × 3
      Frame.Element     n  prob
      <chr>         <int> <dbl>
    1 Agent            10 0.233

    What is the probability that the frame is Abandonment, given that the frame element is an AGENT, i.e. \(P(\text{Abandonment} \mid \text{Agent})\)? We compute

    \[ P(\text{Abandonment} \mid \text{Agent}) = \frac{P(\text{Agent} \mid \text{Abandonment}) \cdot P(\text{Abandonment})}{P(\text{Agent})}. \]

    Show the code
    # Total number of observations
    N_total <- nrow(framenet)
    
    # Count of Frame = "Abandonment"
    n_abandonment <- framenet %>%
      filter(Frame == "Abandonment") %>%
      count() %>%
      pull(n)
    
    # Count of Frame.Element = "Agent"
    n_agent <- framenet %>%
      filter(Frame.Element == "Agent") %>%
      count() %>%
      pull(n)
    
    # Count of Frame.Element = "Agent" and Frame = "Abandonment"
    n_agent_abandonment <- framenet %>%
      filter(Frame == "Abandonment", Frame.Element == "Agent") %>%
      count() %>%
      pull(n)
    
    # Now compute:
    # P(Agent | Abandonment) = n_agent_abandonment / n_abandonment
    # P(Abandonment) = n_abandonment / N_total
    # P(Agent) = n_agent / N_total
    
    p_abandonment_given_agent <- (n_agent_abandonment / n_abandonment) * (n_abandonment / N_total) / (n_agent / N_total)
    
    p_abandonment_given_agent
    [1] 0.0027373

    Probability distributions

    Recall the concept of random variables introduced in 4.1 Data types. They describe random processes, which means that the outcomes of the experiment are not pre-determined in any way; there is always some degree of uncertainty involved. Each outcome occurs with a certain probability.

    When we associate each outcome of a random variable with a probability, we obtain its probability distribution. A function that maps each outcome of a discrete random variable \(X\) onto its probability is called a probability mass function (\(pmf\)). In the continuous case, we speak of a probability density function (\(pdf\)).

    Example 3 We can establish the probability distribution of frame elements for the verb eat (INGESTION frame) as follows:

    # Obtain observations on the verb "eat"
    eat <- framenet %>% 
      filter(Verb == "eat.v", Frame == "Ingestion")
    
    # Count tokens and compute relative frequencies
    eat_data <- eat %>% 
      count(Frame.Element) %>% 
      mutate(rel_freq = n/sum(n)) %>% 
      arrange(desc(rel_freq))
    
    # Check that the relative frequencies sum to 1
    sum(eat_data$rel_freq)
    [1] 1
    # Plot the PMF
    eat_data %>% 
      ggplot(aes(x = Frame.Element, y = rel_freq)) +
      geom_col() +
      labs(title = "PMF of frame elements for 'eat' (INGESTION frame) ")

    Discrete distributions

    Uniform distribution

    If all discrete outcomes have the same probability, their distribution is called uniform. A well-known example is the toss of a coin: If the coin is fair, the outcomes \(\{\text{Heads}\}\) and \(\{\text{Tails}\}\) are equally likely. More generally, if a discrete random variable \(X\) has \(k\) different outcomes, then any outcome \(x_i\) has the probability

    \[ P(X = x_i) = \frac{1}{k}. \tag{8}\]

    If we toss a fair coin once, the sample space is \(S = \{\text{Heads}, \text{Tails}\}\), and we have \(P(\text{Heads}) = P(\text{Tails}) = \frac{1}{2}\).

    Note that all probabilities have to add up to 1, i.e.,

    \[ \sum_{x\in X} P(x) = 1. \tag{9}\]

    Example 4 A single throw of a die has the sample space \(S = \{1, 2, 3, 4, 5, 6\}\). The long-term relative frequency of each outcome, and thus its probability, should be \(1/6 \approx 0.17\).

    Show the code
    # Load libraries
    library(tidyverse)
    
    # First, set a seed for reproducibility; this ensures that we always generate the ‘same’ random numbers
    set.seed(123)
    
    # Simulate 10,000 rolls of a fair six-sided die
    die_rolls <- sample(1:6, size = 10000, replace = TRUE)
    
    # Plot the probability mass function
    ggplot(data.frame(die_rolls), aes(x = factor(die_rolls))) +
      geom_bar(aes(y = after_stat(count / sum(count))), fill = "steelblue") +
      geom_hline(yintercept = 1/6, col = "black") +
      labs(title = "Rolling a die 10000 times (k = 6, p = 1/6)",
           y = "Relative frequency",
           x = "Outcome") +
      scale_y_continuous(limits = c(0, 0.25)) +
      theme_minimal()

    Binomial distribution

    Many corpus-linguistic studies are concerned with discrete random variables \(X\) that have exactly two outcomes, such as the dative alternation (give somebody something vs. give something to somebody), the particle placement alternation (pick something up vs. pick up something), or subject and object realisation (I’ve eaten something vs. I’ve eaten Ø).

    Assume we make \(n\) independent binary observations (also known as Bernoulli trials) of \(X\). If one of the two outcomes has a fixed probability \(\pi\) (this outcome is conventionally termed the ‘success’) and the other the probability \(1 - \pi\) (‘failure’), then the number of successes \(X\) observed across the \(n\) trials follows a binomial distribution.4 We can also use the shorthand notation

  • 4 The natural extension of the binomial distribution to \(k\) different outcomes is called the multinomial distribution with index \(n\) and probabilities \(\pi_1, \pi_2, \dots, \pi_k\).

    \[ X \sim Binom(n, \pi). \]

    The elements inside the parentheses are the parameters of the distribution: the number \(n\) of independent observations and the probability \(\pi\) of ‘success’. Together, they determine the shape of the probability mass function.

    Example 5 Suppose we throw a fair coin \(30\) times and count how often we obtain ‘heads’. We’d expect to see this outcome \(30 \cdot 0.5 = 15\) times.
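
    A one-line sketch confirms this expectation: the probability mass of \(Binom(30, 0.5)\), computed with dbinom(), peaks at 15 successes.

    # Most probable number of heads in 30 tosses of a fair coin;
    # subtract 1 because the vector dbinom(0:30, ...) starts at 0 successes
    which.max(dbinom(0:30, size = 30, prob = 0.5)) - 1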

    Show the code
    # Set seed for reproducibility
    set.seed(123)
    
    # Set parameters for binomial distribution
    n <- 30    # Number of trials
    pi <- 0.5  # Probability of HEADS
    
    # Set up the pmf
    successes <- 0:n
    prob_mass <- dbinom(successes, size = n, prob = pi)
    
    # Create a data frame for plotting
    binom_data <- data.frame(successes = successes, probability = prob_mass)
    
    # Basic plot approach
    ggplot(binom_data, aes(x = successes, y = probability)) +
      geom_segment(aes(xend = successes, yend = 0), linewidth = 1.2) +
      labs(title = "PMF of n = 30 coin tosses with P({Heads}) = 0.5",
           x = "Occurrences of {Heads}", y = "Relative frequency") +
      theme_minimal()

    Poisson distribution

    The Poisson distribution is particularly suitable for frequency data, which is ubiquitous in corpus linguistics. A Poisson-distributed random variable is fully determined by its parameter \(\lambda\), the average rate at which events occur.

    \[ X \sim Pois(\lambda). \]

    Assume a word occurs, on average, 3 times per 1,000 words. We would then set the rate parameter to \(\lambda = 3\), which is also the expected number of occurrences per 1,000 words.
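
    As a quick simulation check (a sketch with artificial data, not corpus counts), the mean of a large sample drawn with rpois() recovers the rate parameter:

    # Set a seed for reproducibility
    set.seed(123)
    
    # Draw 10,000 simulated per-1,000-word counts and compare their mean with lambda = 3
    mean(rpois(10000, lambda = 3))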

    Show the code
    # Define the occurrence rate
    lambda <- 3
    
    # Define the range of x values
    x_range <- 0:10
    
    # Compute PMF values using dpois()
    poisson_pmf <- data.frame(
      x = x_range,
      probability = dpois(x_range, lambda)
    )
    
    # Create the PMF plot using ggplot
    ggplot(poisson_pmf, aes(x = x, y = probability)) +
      geom_segment(aes(xend = x, yend = 0), linewidth = 1.2) + 
      labs(title = "Poisson PMF with λ = 3",
           x = "Number of events",
           y = "Probability") +
      scale_x_continuous(breaks = x_range) +
      theme_minimal()

    Continuous distributions

    The normal distribution

    Many numerical variables, such as test scores, weight, and height, follow the well-known normal (or Gaussian) distribution. The plot below illustrates its characteristic bell shape: Most observations are in the middle, with considerably fewer near the fringes. For example, most people are rather “average” in height; only a few are extremely short or extremely tall.

    The normal distribution is typically described in terms of two parameters: the population mean \(\mu\) and the population variance \(\sigma^2\). If a random variable \(X\) is normally distributed, we typically use the notation in Equation 10.

    \[ X \sim \mathcal{N}(\mu, \sigma^2). \tag{10}\]

    The \(\mu\) parameter corresponds to the expected value \(E(X)\), which is a typical (or average) value of a distribution.

    The spread of data points around the expectation is the population variance and corresponds to \(\sigma^2\):

    \[ Var(X) = E[(X-E(X))^2]. \]

    The population standard deviation \(\sigma\) describes the typical spread of values around the expectation and is defined as \(\sqrt{Var(X)}\).

    Example 6 The plot illustrates a standard normal distribution for \(X \sim \mathcal{N}(0, 1)\). The \(y\)-axis indicates the density of population values; note that since the Gaussian distribution is continuous, with infinitely many possible \(x\)-values, the probability of any single value is 0. We can only obtain probabilities for intervals of values, which are given by

    \[ P(a \leq X \leq b) = \int_a^b f(x)dx. \tag{11}\]
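
    In R, such interval probabilities can be computed from the cumulative distribution function with pnorm(); the following sketch for the standard normal also anticipates the quick facts listed further below.

    # P(-1 <= X <= 1) for X ~ N(0, 1): area under the curve between -1 and 1 (approx. 0.68)
    pnorm(1, mean = 0, sd = 1) - pnorm(-1, mean = 0, sd = 1)
    
    # P(-2 <= X <= 2) and P(-3 <= X <= 3) (approx. 0.95 and 0.997)
    pnorm(2) - pnorm(-2)
    pnorm(3) - pnorm(-3)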

    Show the code
    # Set parameters for normal distribution
    mu <- 0     # mean (expectation)
    sigma <- 1  # standard deviation (square root of variance)
    variance <- sigma^2
    
    # Generate a sequence of x-values in a range of +/- 4 standard deviations from the mean
    x_values <- seq(mu - 4*sigma, mu + 4*sigma, length.out = 1000)
    
    # Collect everything in a data frame
    norm_data <- data.frame(
      x = x_values,
      density = dnorm(x_values, mean = mu, sd = sigma)
    )
    
    # Plot the simulated data
    ggplot(norm_data, aes(x = x, y = density)) +
      geom_line(linewidth = 1, color = "#0D47A1") +
      labs(
        title = "PDF for N(0, 1)",
        x = "x",
        y = "Probability density") +
      theme_minimal() 

    Quick facts about the Gaussian bell curve

    Quite interestingly,

    • 68% of all values fall within one standard deviation of the mean,

    • 95% within two, and

    • 99.7% within three.

    The lognormal distribution

    While many variables follow a normal distribution, others exhibit a characteristic right-skewed pattern in which most observations cluster at lower values while some extend far into the positive tail. This is particularly common with reaction times or survival data.

    The log-normal distribution describes variables whose natural logarithm follows a normal distribution. If \(\ln(X) \sim \mathcal{N}(\mu, \sigma^2)\), then \(X\) follows a log-normal distribution, denoted as:

    \[ X \sim LogN(\mu, \sigma^2). \]

    The parameters \(\mu\) and \(\sigma^2\) refer to the mean and variance of the underlying normal distribution (i.e., of \(\ln(X)\)), not of \(X\) itself.

    Example 7 Word frequencies in natural language corpora typically follow a log-normal distribution. Consider the frequency distribution of lemmas in a corpus, where most words occur rarely but a few occur very frequently (following Zipf’s law).

    Show the code
    # Set parameters for log-normal distribution
    mu_log <- 2      # Mean of the underlying normal distribution (log scale)
    sigma_log <- 1   # Standard deviation of the underlying normal distribution
    
    # Generate x values on the positive real line
    x_values <- seq(0.1, 50, length.out = 1000)
    
    # Calculate the probability density
    lognorm_data <- data.frame(
      x = x_values,
      density = dlnorm(x_values, meanlog = mu_log, sdlog = sigma_log)
    )
    
    # Plot the log-normal distribution
    ggplot(lognorm_data, aes(x = x, y = density)) +
      geom_line(linewidth = 1, color = "purple") +
      labs(
        title = "PDF for Log-Normal Distribution (μ = 2, σ = 1)",
        x = "Frequency (occurrences per million words)",
        y = "Probability density"
      ) +
      theme_minimal()

    Show the code
    # Show the relationship between normal and log-normal
    # Generate sample data
    set.seed(123)
    normal_sample <- rnorm(1000, mean = mu_log, sd = sigma_log)
    lognormal_sample <- exp(normal_sample)
    
    # Create comparison plot
    par(mfrow = c(1, 2))
    hist(normal_sample, main = "Normal Distribution\n(log scale)", 
         xlab = "ln(X)", col = "#90CAF9", breaks = 30)
    hist(lognormal_sample, main = "Log-Normal Distribution\n(original scale)", 
         xlab = "X", col = "#FFCDD2", breaks = 30)

    Exercises

    Tier 1

    Exercise 1 Make suggestions for probability distributions that could be helpful for modelling the following variables:

    1. Analytic vs. synthetic comparatives (scarier vs. more scary)
    2. Number of pronunciation errors per minute
    3. Frequencies of modal verbs in American English
    4. Article choice (indefinite vs. definite vs. zero-article)
    5. The time it takes to read a word aloud (in ms)

    Exercise 2 Let Coreness be a binomially distributed variable, which is defined as

    \[ \text{Coreness} = \begin{cases} \text{core} & \text{with probability } \pi \\ \text{non-core} & \text{with probability } 1-\pi \end{cases} \] for \(0 < \pi < 1\).

    We can model this variable using the binomial probability mass function, which has the form

    \[ f(x; n; \pi) = \binom{n}{x}\pi^x(1-\pi)^{n-x}, \] where \(n\) is the number of independent trials, \(\pi\) the probability of ‘success’, and \(x\) the observed number of successes.

    Baguley (2012: 67) provides the code example below for plotting a discrete \(pmf\).

    # Vector containing the numbers of successes
    heads <- 0:10 # observing 'heads' 0, ..., 10 times
    
    # Get probability mass for each success (fair coin -> P(Heads) = 0.5)
    prob.mass <- dbinom(heads, 10, 0.5) # throwing the coin 10 times
    
    # Plot the PMF
    plot(heads, prob.mass, pch = NA, xlab = "Number of heads", ylab = "Probability")
    
    # Create a spikey plot
    segments(x0 = heads, y0 = 0, x1 = heads, y1 = prob.mass)

    Adjust the code to plot the \(pmf\) for Coreness where \(\pi = 0.85\) and \(n = 50\). The binomial experiment is repeated 50 times. Provide a brief visual description of the distribution.

    Exercise 3 Let \(R\) be a continuous random variable measuring reaction times in lexical decision tasks. We approximate the measurements with a normal distribution parametrised by

    \[ R \sim \mathcal{N}(\mu = 500, \sigma = 150), \]

    which can be visualised as follows:

    curve(dnorm(x, mean = 500, sd = 150), xlim = c(0, 1000))

    Describe how the shape of the bell curve changes for \(\sigma = 50\), \(\sigma = 200\) and \(\sigma = 350\). What do these changes mean for the distribution of reaction times?

    Tier 2

    Exercise 4 If a variable follows a binomial distribution and we know the number of trials \(n\) and the probability of success \(\pi\), we can compute:

    • the probability of observing a specific number of successes with dbinom():
    # Example: P(4 heads from 10 tosses of a fair coin)
    dbinom(x = 4, size = 10, prob = 0.5)
    [1] 0.2050781
    • the cumulative probability of up to \(x\) successes with pbinom():
    # Example: P(up to 4 heads from 10 tosses of a fair coin)
    pbinom(q = 4, size = 10, prob = 0.5)
    [1] 0.3769531
    # P(5 or more heads from 10 tosses of a fair coin)
    1 - pbinom(q = 4, size = 10, prob = 0.5)
    [1] 0.6230469
    • random simulated successes with rbinom():
    # Shows the number of successes in each of four simulated runs of 10 tosses
    rbinom(n = 4, size = 10, prob = 0.5)
    [1] 3 3 3 5

    Interpret the output of the two code chunks below with reference to the Coreness variable from Exercise 2.

    1 - pbinom(q = 80, size = 100, prob = 0.85)
    [1] 0.8934557
    rbinom(n = 50, size = 15, prob = 0.15)
     [1] 2 3 2 0 0 1 4 1 2 3 6 1 3 1 2 0 7 2 7 1 1 1 3 2 2 3 2 2 2 1 1 3 3 5 5 3 2 1
    [39] 2 3 4 4 3 1 0 2 2 1 5 1

    Exercise 5 Let \(D\) be a Poisson-distributed random variable which counts the total number of fillers such as uhm, er, and like per 100 words. It has the \(pmf\)

    \[ f(x, \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}. \]

    The study by Bortfeld et al. (2001) suggests that \(\lambda = 2\) would be a suitable rate parameter. Plot the \(pmf\) using dpois() for \(x \in \{0, \dots, 10\}\). (Tip: Check the documentation by typing ?dpois into the console.)

    Exercise 6 Consider the distribution of frame elements for the verb eat (INGESTION):

    eat_data
    # A tibble: 6 × 3
      Frame.Element     n rel_freq
      <chr>         <int>    <dbl>
    1 Ingestor         24   0.414 
    2 Ingestibles      23   0.397 
    3 Place             5   0.0862
    4 Manner            3   0.0517
    5 Source            2   0.0345
    6 Time              1   0.0172

    What probability distribution would be suitable for modelling the variable Frame.Element?

    Tier 3

    Exercise 7 The empirical study of dispersion has attracted significant attention in recent years (Sönning 2024; Gries 2024). A key challenge is finding dispersion measures that are minimally correlated with token frequency. One such measure is the Kullback-Leibler divergence (KLD), which comes from information theory and is closely related to entropy.

    Mathematically, KLD measures the difference between two probability distributions \(p\) and \(q\):

    \[ KLD(p \parallel q) = \sum\limits_{x} p(x) \log \frac{p(x)}{q(x)} \]

    For corpus dispersion (cf. Equation 12), we compare the posterior (actual) distribution of a word across corpus parts \(\frac{v_i}{f}\) with the prior distribution that assumes uniform spread across parts (weighted by part size \(s_i\)):

    \[ KLD = \sum\limits_{i=1}^n \frac{v_i}{f} \times \log_2\left({\frac{v_i}{f} \times \frac{1}{s_i}}\right) \tag{12}\]

    where:

    • \(f\) = total frequency of the word in the corpus

    • \(v_i\) = frequency of the word in corpus part \(i\)

    • \(s_i\) = size of corpus part \(i\) (as fraction of total corpus)

    • \(n\) = number of corpus parts

    First, let’s create simulated corpus data to work with:

    library(tidyverse)
    
    # Create simulated corpus data
    set.seed(123)
    
    # Simulate a corpus with 5 parts of different sizes
    corpus_parts <- c("Fiction", "News", "Academic", "Spoken", "Web")
    
    part_sizes <- c(0.25, 0.20, 0.15, 0.25, 0.15)  # Sizes of the corpus parts
    
    total_tokens <- 1000000 # Corpus size
    
    # Create some example words with different dispersion patterns
    words_data <- tibble(
      word = rep(c("the", "however", "DNA", "like", "therefore"), each = 5),
      corpus_part = rep(corpus_parts, 5),
      frequency = c(
        # "the" - highly frequent, evenly distributed
        c(12500, 10000, 7500, 12500, 7500),
        # "however" - academic bias
        c(150, 200, 800, 100, 50),
        # "DNA" - strong academic bias  
        c(10, 5, 200, 2, 8),
        # "like" - spoken bias
        c(200, 150, 50, 600, 300),
        # "therefore" - academic/formal bias
        c(50, 100, 300, 30, 20)
      ),
      part_size = rep(part_sizes, 5)
    )
    
    # Show full dataset
    print(words_data, n = Inf)
    # A tibble: 25 × 4
       word      corpus_part frequency part_size
       <chr>     <chr>           <dbl>     <dbl>
     1 the       Fiction         12500      0.25
     2 the       News            10000      0.2 
     3 the       Academic         7500      0.15
     4 the       Spoken          12500      0.25
     5 the       Web              7500      0.15
     6 however   Fiction           150      0.25
     7 however   News              200      0.2 
     8 however   Academic          800      0.15
     9 however   Spoken            100      0.25
    10 however   Web                50      0.15
    11 DNA       Fiction            10      0.25
    12 DNA       News                5      0.2 
    13 DNA       Academic          200      0.15
    14 DNA       Spoken              2      0.25
    15 DNA       Web                 8      0.15
    16 like      Fiction           200      0.25
    17 like      News              150      0.2 
    18 like      Academic           50      0.15
    19 like      Spoken            600      0.25
    20 like      Web               300      0.15
    21 therefore Fiction            50      0.25
    22 therefore News              100      0.2 
    23 therefore Academic          300      0.15
    24 therefore Spoken             30      0.25
    25 therefore Web                20      0.15

    Compute the dispersion values of the, however, DNA, like, and therefore based on the simulated corpus data. How does dispersion reflect their (lack of) register bias?

    References

    Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With R and Python. Boca Raton: CRC Press.
    Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Houndmills, Basingstoke: Palgrave Macmillan.
    Bortfeld, Heather, Silvia D. Leon, Jonathan E. Bloom, Michael F. Schober, and Susan E. Brennan. 2001. “Disfluency Rates in Conversation: Effects of Age, Relationship, Topic, Role, and Gender.” Language and Speech 44 (2): 123–47. https://doi.org/10.1177/00238309010440020101.
    Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Pacific Grove, Calif.: Duxbury/Thomson Learning.
    Fillmore, Charles J., Christopher R. Johnson, and Miriam R. L. Petruck. 2003. “Background to FrameNet.” International Journal of Lexicography 16 (3): 235–50.
    Gries, Stefan Thomas. 2024. Frequency, Dispersion, Association, and Keyness: Revising and Tupleizing Corpus-Linguistic Measures. Amsterdam & Philadelphia: John Benjamins Publishing Company.
    Heumann, Christian, Michael Schomaker, and Shalabh. 2022b. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer.
    ———. 2022a. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer. https://doi.org/10.1007/978-3-031-11833-3.
    Ruppenhofer, Josef, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, Collin F. Baker, and Jan Scheffczyk. 2016. “FrameNet II: Extended Theory and Practice.” 2016. https://framenet2.icsi.berkeley.edu/docs/r1.7/book.pdf.
    Sönning, Lukas. 2024. “Evaluation of Keyness Metrics: Performance and Reliability.” Corpus Linguistics and Linguistic Theory 20 (2): 263–88. https://doi.org/10.1515/cllt-2022-0116.