Statistics for Corpus Linguists

1.4 Set theory and mathematical notation

Author: Vladimir Buskin
Affiliation: Catholic University of Eichstätt-Ingolstadt

Abstract
This unit introduces key concepts from set theory and mathematical notation (sets, subsets, unions, intersections, sums, and products) to build foundational skills for formal reasoning in corpus linguistics.

Introduction

Throughout this reader—particularly in later chapters—we will frequently rely on fundamental concepts from set theory and mathematics. These frameworks provide a precise and concise way to express complex ideas, making them essential tools for statistical science. The aim of this unit is to introduce key formalisms and build foundational literacy in their use. While a comprehensive treatment is beyond the scope of this reader, the following resources offer further guidance:

Partee, Meulen, and Wall (1990)

Halmos (1974)

Spivak (1994)

Hammack (2018)

Set theory

Sets are finite or infinite collections of elements. Typically, they are denoted by capital letters (\(A\), \(B\), etc.). Their elements can be virtually anything – letters, numbers, functions, or even other sets. For example, define a set \(L\) which contains the words eat, drink, sleep, bathe:1

    \[ L = \{eat, drink, sleep, bathe\}. \]

    1 Note that the order of the elements in a set is irrelevant; rearranging the elements of \(L\) yields the same set: \(\{eat, drink, sleep, bathe\} = \{drink, bathe, sleep, eat\}\).

    Show R code
    # Define L
    L <- c("eat", "drink", "sleep", "bathe") 

    The cardinality (or ‘size’) of \(L\) is 4 because it contains 4 unique elements. This is usually written as \(\lvert L \rvert = 4\).

    Show R code
    # Number of elements in L
    length(L)
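One caveat when mirroring sets with R vectors: unlike a set, a vector keeps track of order and may contain repeated entries. The following is a minimal sketch of how this affects cardinality and equality checks; the vector `tokens` and the name `L_rearranged` are made up for illustration.

```r
# Hypothetical token vector with a repeated word (illustration only)
tokens <- c("eat", "drink", "eat", "sleep")

length(tokens)          # counts every entry, including the repeat
length(unique(tokens))  # counts distinct elements only

# The set L and a rearranged copy of it
L <- c("eat", "drink", "sleep", "bathe")
L_rearranged <- c("drink", "bathe", "sleep", "eat")

identical(L, L_rearranged)  # FALSE: as vectors, the order differs
setequal(L, L_rearranged)   # TRUE: as sets, they are equal
```

For vectors that may contain duplicates, `length(unique(x))` is therefore the closer analogue of cardinality.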

    We can characterise the membership of individual elements with regard to a certain set. The element eat is in \(L\), i.e., \(eat \in L\). By contrast, convince is not, which is written \(convince \notin L\).

    Show R code
    # Check element membership
    "eat" %in% L
    
    "convince" %in% L

    A set that does not contain any elements is the empty set \(\emptyset = \{\}\). As such, it is a subset of every set – even of itself!2

    2 Here is another strange observation: \(\emptyset\) and \(\{\emptyset\}\) do not express the same thing. While \(\emptyset\) contains nothing (\(\lvert\emptyset\rvert = 0\)), \(\{\emptyset\}\) is a set that has one element (\(\lvert\{\emptyset\}\rvert = 1\)), and this one element is the empty set itself.
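These properties of the empty set can be imitated in R, at least approximately. In the sketch below, an empty character vector stands in for \(\emptyset\) and a one-element list stands in for \(\{\emptyset\}\); both are analogies chosen for illustration, and the names `empty` and `set_of_empty` are hypothetical.

```r
L <- c("eat", "drink", "sleep", "bathe")

# An empty character vector stands in for the empty set
empty <- character(0)
length(empty)  # 0

# Subset check: no element of `empty` can fail to be in L,
# so all() returns TRUE
all(empty %in% L)

# The empty set is also a subset of itself
all(empty %in% empty)

# A list whose only element is the empty vector mimics {∅}
set_of_empty <- list(empty)
length(set_of_empty)  # 1
```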

    Subset, union, and intersection

    Let’s define two new sets

    \[ G = \{eat, drink\} \]

    and

    \[ M = \{eat, run\}. \]

    Show R code
    # Define new sets
    G <- c("eat", "drink")
    
    M <- c("eat", "run")

    When comparing \(G\) with \(L\), we can see that all of the elements in \(G\) are also in \(L\). This makes \(G\) a subset of \(L\), written \(G \subseteq L\). However, not all elements of \(M\) are also in \(G\). Specifically, run is in \(M\), but not in \(G\); therefore, \(M \nsubseteq G\).

    Show R code
    # Check if G is a subset of L
    all(G %in% L)
    
    # Check if M is a subset of G
    all(M %in% G)
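If subset checks are needed repeatedly, they can be wrapped in a small function. The helper `is_subset()` below is a hypothetical name, not part of base R; `setdiff()` is a base R function that returns the elements responsible for a failed check.

```r
G <- c("eat", "drink")
M <- c("eat", "run")
L <- c("eat", "drink", "sleep", "bathe")

# Hypothetical helper: TRUE if every element of x is also in y
is_subset <- function(x, y) all(x %in% y)

is_subset(G, L)  # TRUE
is_subset(M, G)  # FALSE

# Elements of M that are missing from G
setdiff(M, G)    # "run"
```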

    If we are looking for an element that is in \(M\) or in \(L\) or in both, we are talking about the union \(M \cup L\):

    \[ M \cup L = \{eat, drink, sleep, bathe, run\}. \]

    Show R code
    # Union of M and L
    union(M, L)

    By contrast, if we are interested in the elements that all sets \(L\), \(G\) and \(M\) have in common, we are dealing with their intersection \(L \cap G \cap M\). In fact, there is only one element that satisfies this condition:

    \[ L \cap G \cap M = \{eat\}. \]

    Show R code
    # Intersection of two sets
    intersect(L, G)
    
    # Intersection of multiple sets
    Reduce(intersect, list(L, G, M))

    Large operators

    Sums and products

    Long summation expressions tend to be rather unwieldy. For example, if we are taking the sum of all natural numbers from 1 up to 15, writing this out would take up a lot of space (and time):

    \[ 1+2+3+4+5+6+7+8+9+10+11+12+13+14+15. \]

    For this very reason, mathematicians have introduced a more compact notation that takes only a fraction of the time to write while communicating exactly the same idea. Using the Greek letter \(\Sigma\) (‘sigma’), we could rewrite it as

    \[ \sum_{i=1}^{15} i. \]

    This notation can be read as ‘the sum of \(i\) from \(i = 1\) to \(15\)’. It is the sum of the numbers obtained by letting \(i = 1, 2, \dots, 15\).

    Show R code
    # Sum of the numbers from 1 to 15
    sum(1:15)
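To see what the index \(i\) is doing, the same sum can be spelled out as a loop. This is only an illustrative sketch (in practice, `sum()` is the idiomatic choice); the variable name `total` is arbitrary.

```r
# Start from zero and add each value of the index i in turn
total <- 0
for (i in 1:15) {
  total <- total + i
}

total               # 120
total == sum(1:15)  # TRUE
```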

    The same principle applies if the terms of the sum are letters. If we have

    \[ a_1 + a_2 + a_3 + a_4 + a_5, \]

    then we can define a sum where the index \(i\) iterates through the elements in the set \(\{1, 2, 3, 4, 5\}\). The letter \(a\) stays the same; only its subscript changes.

    \[ \sum_{i=1}^5 a_i \]

    Alternatively, to make it even more explicit, we could write

    \[ \sum_{i\in \{1, 2, 3, 4, 5\}} a_i. \]

    The index can be denoted by any letter (\(j\), \(k\), etc.) that is not already in use. In particular, it should not be \(n\), which by convention denotes the upper limit:

    \[ \sum_{i=1}^n a_i = a_1 + a_2 + a_3 + \dots + a_n. \]
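In R, the terms \(a_1, \dots, a_n\) correspond naturally to the elements of a vector, with the index \(i\) acting as the position. The values in `a` below are invented for illustration.

```r
# Hypothetical values a_1, ..., a_5 (illustration only)
a <- c(4, 8, 15, 16, 23)

# Sum over i = 1, ..., n, where n is the number of terms
n <- length(a)
sum(a[1:n])         # 66

# The same sum over an explicit index set
index_set <- c(1, 2, 3, 4, 5)
sum(a[index_set])   # 66

# A smaller index set picks out only some of the terms
sum(a[c(1, 3, 5)])  # 4 + 15 + 23 = 42
```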

    We can apply the same reasoning to other operations such as products. The only difference is that we’d have to use \(\Pi\) (‘pi’) instead of \(\Sigma\). If we are taking the product of all natural numbers from 5 to 10, we’d write

    \[ \prod_{i=5}^{10} i = 5 \cdot 6 \cdot 7 \cdot 8 \cdot 9 \cdot 10. \]

    Show R code
    # Product of the numbers from 5 to 10
    prod(5:10)

    The set of all values that the index below the operator can take on is known as the index set. It can be defined as one sees fit. Say we need the product of all numbers in \(K = \{1, 3, 5, 7\}\). Then the corresponding notation would be

    \[ \prod_{i\in K} i = 1 \cdot 3 \cdot 5 \cdot 7. \]
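A product over an arbitrary index set could be computed along the following lines, assuming \(K\) is stored as a numeric vector. The last line is an additional illustration of a sum over the same index set and does not come from the text above.

```r
# The index set K as a numeric vector
K <- c(1, 3, 5, 7)

# Product of all i in K
prod(K)   # 105

# The same index set can drive other operators,
# e.g. the sum of i^2 for all i in K
sum(K^2)  # 1 + 9 + 25 + 49 = 84
```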

    Union and intersection over indexed sets

    So far, we’ve seen how the large-operator symbols \(\Sigma\) and \(\Pi\) can express sums and products over index ranges or sets. The same notation style can also be used for set operations, such as unions and intersections across a large number of sets.

    Assume we have indexed sets \(A_1, A_2, \dots, A_n\). Instead of writing out the full union

    \[ A_1 \cup A_2 \cup A_3 \cup \dots \cup A_n, \]

    we can use a large union symbol to express this more compactly:

    \[ \bigcup_{i=1}^n A_i. \]

    Show R code
    # Define multiple sets
    A1 <- c("a", "b", "c")
    A2 <- c("a", "e", "f")
    A3 <- c("a", "h", "i")
    A4 <- c("a", "k", "l")
    
    # Union of all sets
    Reduce(union, list(A1, A2, A3, A4))

    Similarly, if we’re interested in the elements common to all sets in the sequence—their intersection—we write:

    \[ \bigcap_{i=1}^n A_i = A_1 \cap A_2 \cap A_3 \cap \dots \cap A_n. \]

    Show R code
    # Intersection of all sets
    Reduce(intersect, list(A1, A2, A3, A4))
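What `Reduce()` does here can be unpacked with an explicit loop: start with \(A_1\) and combine it with \(A_2, \dots, A_n\) one set at a time. The sketch below assumes the sets are stored in a list `A`, so that `A[[i]]` plays the role of \(A_i\); the names `big_union` and `big_intersection` are arbitrary.

```r
# The indexed sets A_1, ..., A_4 stored in a list
A <- list(
  c("a", "b", "c"),
  c("a", "e", "f"),
  c("a", "h", "i"),
  c("a", "k", "l")
)

# Start with A_1, then fold in the remaining sets one by one
big_union <- A[[1]]
big_intersection <- A[[1]]
for (i in seq_along(A)[-1]) {
  big_union <- union(big_union, A[[i]])
  big_intersection <- intersect(big_intersection, A[[i]])
}

big_union         # all nine distinct letters
big_intersection  # "a"
```

Storing the sets in a list rather than in separate variables means the same code works for any number of sets.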

    Exercises

    Exercise 1 Define a set that lists the frequencies of words in a corpus.

    Exercise 2 How could you then express the size of the corpus using \(\Sigma\)-notation?

    Exercise 3 Given two corpora \(C_1\) and \(C_2\) with vocabulary sets \(V_1\) and \(V_2\), define

    1. the set of words that appear in both corpora.
    2. the formula to calculate what percentage of each corpus vocabulary this shared set represents.

    Exercise 4 Using set notation, express how to calculate the Type-Token Ratio (TTR) of a corpus.

    Exercise 5 Consider a corpus divided into \(n\) different text genres (e.g., news, fiction, academic writing). Each genre \(i\) has its own vocabulary set \(V_i\). Express mathematically

    • the complete vocabulary of the entire corpus using the large union operator \(\bigcup\).

    • the core vocabulary (words that appear in all genres) using the large intersection operator \(\bigcap\).

    References

    Halmos, Paul R. 1974. Naive Set Theory. Repr. Heidelberg: Springer.
    Hammack, Richard. 2018. Book of Proof. 3rd ed. Richmond, Virginia: Richard Hammack. https://richardhammack.github.io/BookOfProof/.
    Partee, Barbara H., Alice G. ter Meulen, and Robert E. Wall. 1990. Mathematical Methods in Linguistics. Dordrecht: Kluwer Acad. Press.
    Spivak, Michael. 1994. Calculus. 3rd ed. Houston, TX: Publish or Perish.