# Define L as a character vector
L <- c("eat", "drink", "sleep", "bathe")
Throughout this reader—particularly in later chapters—we will frequently rely on fundamental concepts from set theory and mathematics. These frameworks provide a precise and concise way to express complex ideas, making them essential tools for statistical science. The aim of this unit is to introduce key formalisms and build foundational literacy in their use. While a comprehensive treatment is beyond the scope of this reader, the following resources offer further guidance:
Partee, Meulen, and Wall (1990)
Halmos (1974)
Spivak (1994)
Hammack (2018)
Sets are finite or infinite collections of elements. Typically, they are denoted by capital letters (\(A\), \(B\), etc.). Their elements can be virtually anything – letters, numbers, functions, or even other sets. For example, define a set \(L\) which contains the words eat, drink, sleep, bathe:1
1 Note that the order of the elements in a set is irrelevant; even if the elements of \(L\) were rearranged, the elements themselves would remain identical: \(\{eat, drink, sleep, bathe\} = \{drink, bathe, sleep, eat\}\).
\[ L = \{eat, drink, sleep, bathe\}. \]
The cardinality (or ‘size’) of \(L\) is 4 because it contains 4 unique elements. This is usually written as \(\mid L \mid = 4\).
We can characterise the membership of individual elements with regard to a certain set. The element eat is in \(L\), i.e., \(eat \in L\). By contrast, convince is not, which is written \(convince \notin L\).
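Cardinality and membership can be checked directly in R. The following is a minimal sketch that assumes a set can be represented as a character vector (base R has no dedicated set type); the variable names `card_L`, `eat_in_L`, `convince_in_L` and `same_set` are chosen here for illustration only.

```r
# Represent the set L as a character vector (an assumption, see above)
L <- c("eat", "drink", "sleep", "bathe")

# Cardinality of L: count the unique elements
card_L <- length(unique(L))

# Membership: is an element in L?
eat_in_L <- "eat" %in% L
convince_in_L <- "convince" %in% L

# Order is irrelevant: setequal() ignores how the elements are arranged
same_set <- setequal(L, c("drink", "bathe", "sleep", "eat"))

print(card_L)
print(c(eat_in_L, convince_in_L, same_set))
```

Wrapping the vector in `unique()` matters because a vector, unlike a set, may hold the same value more than once.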
A set that does not contain any elements is the empty set \(\emptyset = \{\}\). As such, it is a subset of every set – even of itself!2
2 Here is another strange observation: \(\emptyset\) and \(\{\emptyset\}\) don’t express the same thing. While \(\emptyset\) contains nothing (\(\mid\emptyset\mid = 0\)), \(\{ \emptyset \}\) is a set that has one element (\(\mid\{ \emptyset \}\mid = 1\)), and this one element is the empty set itself.
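The behaviour of the empty set can be imitated in R, with some caveats. This sketch assumes that a zero-length character vector stands in for \(\emptyset\) and that a list holding that vector stands in for \(\{\emptyset\}\); both are modelling choices made here, not built-in set objects.

```r
L <- c("eat", "drink", "sleep", "bathe")

# A zero-length vector plays the role of the empty set
empty <- character(0)
card_empty <- length(empty)

# Subset check: every element of `empty` is in L (vacuously true,
# because all() of an empty logical vector is TRUE)
empty_sub_L <- all(empty %in% L)
empty_sub_empty <- all(empty %in% empty)

# A list containing the empty vector has exactly one element
nested <- list(empty)
card_nested <- length(nested)

print(c(card_empty, card_nested))
print(c(empty_sub_L, empty_sub_empty))
```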
Let’s define two new sets
\[ G = \{eat, drink\} \]
and
\[ M = \{eat, run\}. \]
When comparing \(G\) with \(L\), we can see that all of the elements in \(G\) are also represented in \(L\). This renders \(G\) a subset of \(L\), written \(G \subseteq L\). However, not all elements of \(M\) are also in \(G\). Specifically, run is in \(M\), but not in \(G\); therefore, \(M \nsubseteq G\).
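The subset relation can be tested in R by asking whether all elements of one vector occur in another. In the sketch below, `is_subset()` is a hypothetical helper defined here for illustration; it is not a base R function.

```r
L <- c("eat", "drink", "sleep", "bathe")
G <- c("eat", "drink")
M <- c("eat", "run")

# Hypothetical helper: TRUE if every element of a is also in b
is_subset <- function(a, b) all(a %in% b)

G_sub_L <- is_subset(G, L)
M_sub_G <- is_subset(M, G)

# The elements of M that prevent it from being a subset of G
not_in_G <- setdiff(M, G)

print(c(G_sub_L, M_sub_G))
print(not_in_G)
```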
If we are looking for an element that is in \(M\) or in \(L\) or in both, we are talking about the union \(M \cup L\):
\[ M \cup L = \{eat, drink, sleep, bathe, run\}. \]
By contrast, if we are interested in the elements that all sets \(L\), \(G\) and \(M\) have in common, we are dealing with their intersection \(L \cap G \cap M\). In fact, there is only one element that satisfies this condition:
\[ L \cap G \cap M = \{eat\}. \]
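Union and intersection have direct counterparts in base R: `union()` and `intersect()`. Since both take exactly two arguments, the sketch below uses `Reduce()` to chain the intersection across three sets. The order in which R returns the elements may differ from the order shown in the formulas, which is immaterial for sets.

```r
L <- c("eat", "drink", "sleep", "bathe")
G <- c("eat", "drink")
M <- c("eat", "run")

# Union of M and L
M_union_L <- union(M, L)

# Intersection of L, G and M: apply intersect() pairwise from left to right
common <- Reduce(intersect, list(L, G, M))

print(M_union_L)
print(common)
```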
Long summation expressions tend to be rather unwieldy. For example, if we are taking the sum of all natural numbers from 1 up to 15, writing this out would take up a lot of space (and time):
\[ 1+2+3+4+5+6+7+8+9+10+11+12+13+14+15. \] For this very reason, mathematicians have introduced a more compact notation that only takes a fraction of the time to write while communicating the exact same idea. Using the Greek letter \(\Sigma\) (‘sigma’), we could rewrite it as
\[ \sum_{i=1}^{15} i. \] This notation can be read as the sum from \(i = 1\) to \(15\). It is the sum of the numbers obtained by letting \(i = 1, 2, \dots, 15\).
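To see what the index does, it can help to compute the same sum in two ways. This is an illustrative sketch: the explicit loop mirrors the notation step by step, while `sum()` gives the result in a single call.

```r
# Explicit loop: i takes the values 1, 2, ..., 15 in turn
total_loop <- 0
for (i in 1:15) {
  total_loop <- total_loop + i
}

# The same sum in one call
total_sum <- sum(1:15)

print(total_loop)
print(total_sum)
```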
The same principle applies if the terms of the sum are letters. If we have
\[ a_1 + a_2 + a_3 + a_4 + a_5, \] then we can define a sum where the index \(i\) iterates through the elements in the set \(\{1, 2, 3, 4, 5\}\), while the letter \(a\) remains constant:
\[ \sum_{i=1}^5 a_i. \] Alternatively, to make it even more explicit, we could write
\[ \sum_{i\in \{1, 2, 3, 4, 5\}} a_i. \]
The index can be denoted by any letter (\(j\), \(k\), etc.), as long as it differs from the symbol used for the upper limit, which is conventionally \(n\):
\[ \sum_{i=1}^n a_i = a_1 + a_2 + a_3 + \dots + a_n. \]
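The indexed sum can be sketched in R as follows. The values stored in `a` are hypothetical and serve only as an illustration; `n` is taken to be the number of terms.

```r
# Hypothetical values for a_1, ..., a_5 (illustrative only)
a <- c(2, 4, 6, 8, 10)
n <- length(a)

# Sum over the index range 1, ..., n
total_range <- sum(a[seq_len(n)])

# Sum over an explicitly given index set
index_set <- c(1, 2, 3, 4, 5)
total_set <- sum(a[index_set])

# The name of the index is arbitrary: j works just as well as i
total_j <- 0
for (j in seq_len(n)) {
  total_j <- total_j + a[j]
}

print(c(total_range, total_set, total_j))
```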
We can apply the same reasoning to other operations such as products. The only difference is that we’d have to use \(\Pi\) (‘pi’) instead of \(\Sigma\). If we are taking the product of all natural numbers from 5 to 10, we’d write
\[ \prod_{i=5}^{10} i = 5 \cdot 6 \cdot 7 \cdot 8 \cdot 9 \cdot 10. \]
The set of all values that the index below the operator can take on is known as the index set. It can be defined as one sees fit. Say, we need the product of all numbers in \(K = \{1, 3, 5, 7\}\). Then the corresponding notation would be
\[ \prod_{i\in K} i = 1 \cdot 3 \cdot 5 \cdot 7. \]
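Both products can be checked with the base R function `prod()`. In this short sketch, the index set \(K\) is again assumed to be stored as a numeric vector.

```r
# Product of all natural numbers from 5 to 10
prod_range <- prod(5:10)

# Product over the index set K
K <- c(1, 3, 5, 7)
prod_K <- prod(K)

print(prod_range)
print(prod_K)
```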
So far, we’ve seen how the large-operator symbols \(\Sigma\) and \(\Pi\) can express sums and products over index ranges or sets. The same notation style can also be used for set operations, such as unions and intersections across a large number of sets.
Assume we have indexed sets \(A_1, A_2, \dots, A_n\). Instead of writing out the full union
\[ A_1 \cup A_2 \cup A_3 \cup \dots \cup A_n, \] we can use a large union symbol to express this more compactly:
\[ \bigcup_{i=1}^n A_i. \]
Similarly, if we’re interested in the elements common to all sets in the sequence—their intersection—we write:
\[ \bigcap_{i=1}^n A_i = A_1 \cap A_2 \cap A_3 \cap \dots \cap A_n. \]
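The large union and intersection can be imitated in R by storing the indexed sets in a list and folding a binary set operation over it with `Reduce()`. The three sets below are hypothetical and chosen only for illustration.

```r
# Hypothetical indexed sets A_1, A_2, A_3 stored in a list
A <- list(
  c("eat", "drink", "sleep"),
  c("eat", "run", "sleep"),
  c("eat", "sleep", "bathe")
)

# Large union: every element that occurs in at least one A_i
big_union <- Reduce(union, A)

# Large intersection: only the elements shared by all A_i
big_intersection <- Reduce(intersect, A)

print(big_union)
print(big_intersection)
```

Because the list can hold any number of sets, the same two lines work unchanged for larger \(n\).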
Exercise 1 Define a set that lists the frequencies of words in a corpus.
Exercise 2 How could you then express the size of the corpus using \(\Sigma\)-notation?
Exercise 3 Given two corpora \(C_1\) and \(C_2\) with vocabulary sets \(V_1\) and \(V_2\), define
Exercise 4 Using set notation, express how to calculate the Type-Token Ratio (TTR) of a corpus.
Exercise 5 Consider a corpus divided into \(n\) different text genres (e.g., news, fiction, academic writing). Each genre \(i\) has its own vocabulary set \(V_i\). Express mathematically
the complete vocabulary of the entire corpus using the large union operator \(\bigcup\).
the core vocabulary (words that appear in all genres) using the large intersection operator \(\bigcap\).