# Load libraries
library(readxl) # install this library prior to loading
library(tidyverse)
# Load data
<- read_xlsx("Paquot_Larsson_2020_data.xlsx")
cl.order
# Inspect data
str(cl.order)
head(cl.order)
4.1 Data, variables, samples
This handout introduces key statistical concepts—such as samples, populations, variables, datasets, and data types—with a focus on their application to empirical linguistic research, using R to explore and illustrate these ideas.
Recommended reading
Heumann, Schomaker, and Shalabh (2022): Chapters 1 & 7
Baguley (2012): Chapter 1
Agresti and Kateri (2022): Chapter 1.2
Samples and populations
In order to investigate one or more linguistic features, we need to collect relevant observations. Actions that generate observations, such as decision tasks or even corpus queries, are called experiments. Since it is not feasible to examine, for instance, the entirety of all linguistic utterances ever produced, i.e., the virtually infinite population, we rely on subsets of it: samples. Good research is characterised by good sampling procedures that limit the bias present in any sample, thus improving the generalisability of potential findings.
To state it more formally, let \(\omega\) (‘lower-case Omega’) denote a single observation of interest. The full set of theoretically possible observations is contained in \(\Omega\) (‘upper-case Omega’), with
\[ \Omega = \{\omega_1, \omega_2, \dots, \omega_n \}. \]
Here \(\omega_n\) denotes the \(n\)-th observation, where \(n\) is a natural number. A sample is then simply a selection of elements from \(\Omega\).
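As a toy illustration in R (the vector omega_toy and the sample size are invented for this example), drawing a sample from a small ‘population’ could look like this:
# A made-up 'population' Omega of six observations
omega_toy <- c("utt_1", "utt_2", "utt_3", "utt_4", "utt_5", "utt_6")
# Draw a random sample of n = 3 elements without replacement
set.seed(1)
sample(omega_toy, size = 3)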
(Random) Variables
The concept of the variable is very handy in that it enables researchers to quantify different aspects of linguistic observations, such as the age or gender of the speaker, the register of the speech situation, the variety of English, the length of an utterance, a syntactic construction or a grammatical category, among many other theoretically possible features.
For each observation, we should be able to assign a specific outcome to our variables. While the variable Genitive could have a small set of outcomes such as ‘s’ and ‘of’, other variables such as Utterance length or Reaction time could technically assume any value in the range [0, \(\infty\)). The set of all possible outcomes in an experiment is known as the sample space \(S\).
Given a single observed NP which is marked for possessive case, we may characterise its formal marking either as ‘s’ (synthetic genitive) or as ‘of’ (analytic/periphrastic genitive). For simplicity, we will call these options \(S\) and \(O\). The sample space of Genitive could thus be described as
\[ S_{\text{Genitive}} = \{\text{S}, \text{O}\}. \]
Experiments normally do not terminate after obtaining the first observation. As a result, the sample space becomes considerably more complex (and unwieldy), the more observations are gathered. For \(n = 3\) possessive NPs, the sample space would comprise the following combinations of outcomes:
\[ S_{\text{Genitive (3 NPs)}} = \{\text{SSS}, \text{SSO},\text{SOS}, \text{SOO},\text{OSS}, \text{OSO}, \text{OOS}, \text{OOO}\}. \]
With \(n = 10\) we’d be dealing with \(2^{10} = 1024\) possible outcomes. If, for instance, we’re only interested in the total number of s-genitives, this isn’t a particularly efficient approach.
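To see this growth concretely, we can let R enumerate the sample space for three NPs; a minimal sketch (the object name genitive_space_3 is chosen purely for illustration):
# All combinations of 'S' and 'O' across three possessive NPs
genitive_space_3 <- expand.grid(NP1 = c("S", "O"),
                                NP2 = c("S", "O"),
                                NP3 = c("S", "O"),
                                stringsAsFactors = FALSE)
# Number of possible outcomes: 2^3 = 8
nrow(genitive_space_3)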
Random variables allow us to map outcomes from the sample space onto the real numbers \(\mathbb{R}\). Quite conveniently, we could define a discrete random variable \(X\) that counts the total number of s-genitives in a sample of size \(n\). Note, however, that the random variable itself is not random – its value is the result of a random process. Other random variables could be
- the number of ‘heads’ after tossing a coin 30 times,
- the number of times we get a ‘6’ when rolling a die 500 times, or
- how many years it takes for your TV to break down.
A random variable is a function \(X\) which assigns to each outcome \(s \in S\) exactly one number \(X(s) = x\) with \(x \in \mathbb{R}\).
In formal terms:
\[ X : S \to \mathbb{R}. \]
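In R, such a random variable can be written as an ordinary function that assigns a number to each outcome. A sketch, reusing the hypothetical genitive_space_3 object from above:
# X maps each outcome (a sequence of 'S'/'O' markings) onto its number of s-genitives
X <- function(outcome) sum(outcome == "S")
# Apply X to every outcome in the sample space for three NPs
apply(genitive_space_3, 1, X)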
Datasets
Information on variables and their values is conventionally arranged in a dataset, which is essentially a matrix with \(n\) rows (= observations) and \(p\) columns (= variables). The data matrix has the general form in Equation 1 (see Heumann (2022: Chap. 1.4) for details).
\[ \begin{pmatrix} \text{Observation } \omega & \text{Variable } X_1 & \text{Variable } X_2 & \cdots & X_p \\ 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 2 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ n & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \tag{1}\]
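In R, such a data matrix corresponds to a data frame (or tibble), with one row per observation and one column per variable. A minimal made-up example (the column names and values are invented):
# A toy dataset with n = 3 observations and p = 2 variables
toy_data <- data.frame(GENITIVE  = c("s", "of", "of"),
                       NP_LENGTH = c(3, 5, 2))
toy_data
# n rows and p columns
dim(toy_data)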
Data types
In general, we distinguish between discrete variables, which can only take a finite or countably infinite set of distinct values, and continuous variables, which can take infinitely many values within a specified range.
We can further subdivide discrete and continuous variables into nominal, ordinal, interval-scaled and ratio-scaled ones (an R sketch of how these scales are represented follows the list):
- Nominal variables comprise a limited number of discrete categories which cannot be ordered in a meaningful way. For instance, the genitive forms of and ’s cannot be ordered like numbers (e.g., 1, 2, 3 …).
- Ordinal variables are ordered. However, the intervals between their individual values are not interpretable. Heumann (2022: 6) provides a pertinent example:
[T]he satisfaction with a product (unsatisfied–satisfied–very satisfied) is an ordinal variable because the values this variable can take can be ordered but the differences between ‘unsatisfied–satisfied’ and ‘satisfied–very satisfied’ cannot be compared in a numerical way.
- In the case of interval-scaled variables, the differences between the values can be interpreted, but their ratios must be treated with caution. A temperature of 4°C is 6 degrees warmer than -2°C; however, it does not make sense to say that 4°C is ‘so many times warmer’ than -2°C. This is because the temperature scale has no true zero point; 0°C simply signifies another point on the scale and not the absence of temperature altogether.
- Ratio-scaled variables allow a meaningful interpretation both of the differences between their values and (!) of the ratios between them. In the context of clause length, LENGTH_DIFF values such as 4 and 8 not only indicate that the latter is four units greater than the former, but also that their ratio \(\frac{8}{4} = 2\) is a valid way to describe the relationship between these values. Here, a LENGTH_DIFF of 0 can be clearly viewed as the absence of a length difference.
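In R, these measurement scales map onto different data types. A sketch with invented values: nominal variables are typically stored as (unordered) factors, ordinal variables as ordered factors, and interval- or ratio-scaled variables as numeric vectors.
# Nominal: unordered factor
genitive <- factor(c("s", "of", "of", "s"))
# Ordinal: ordered factor with a meaningful level order
satisfaction <- factor(c("unsatisfied", "very satisfied", "satisfied"),
                       levels = c("unsatisfied", "satisfied", "very satisfied"),
                       ordered = TRUE)
# Ratio-scaled: numeric vector with a true zero point
clause_length <- c(4, 8, 0, 12)
str(genitive)
str(satisfaction)
str(clause_length)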
Dependent vs. independent variables
In empirical studies, it is often of interest whether one variable is linked to changes in the values of another variable. When exploring such associations (or correlations), we need to take another heuristic step to clarify the direction of the influence.
In a linguistic context, we denote the variable whose usage patterns we’d like to explain as the dependent or response variable. A list of possible dependent variables is provided in the section on Linguistic variables.
Its outcomes are said to depend on one or more independent variables. These are also often referred to as explanatory variables as they are supposed to explain variation in the response variable. In (socio-)linguistics, these include but are not limited to a speaker’s age, gender, socio-economic class (e.g., working class vs. middle class), their local variety, or the register of the speech context (e.g., casual exchange vs. business transaction).
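In R’s model formula syntax, this distinction is made explicit: the dependent (response) variable goes on the left of the tilde and the independent (explanatory) variables on the right. Purely as an illustration (the formulas below are not fitted to any model here):
# Response on the left, explanatory variable(s) on the right
ORDER ~ LENGTH_DIFF
# Several explanatory variables can be combined
ORDER ~ LENGTH_DIFF + SUBORDTYPE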
Exercises
Tier 1
Exercise 1 Consider the following statement:
This paper examines the influence of clause length on the ordering of main and subordinate clauses.
- What is the dependent variable?
- What is the independent variable?
- What would it mean if they were reversed?
Exercise 2 Explain why frequency data is actually discrete.
Tier 2
Exercise 3 Download the dataset Paquot_Larsson_2020_data.xlsx from the supplementary materials and store it in your working directory.1
1 The dataset was published as part of Paquot & Larsson (2020).
Load it into R using the code below (see also 2.6 Importing/Exporting for details). Then identify the variable types (nominal, ordinal etc.) for all columns in the cl.order dataset.
The str() function
The easiest way to get a general overview of the full data set is to apply the str() function to the respective data frame.
str(cl.order)
tibble [403 × 8] (S3: tbl_df/tbl/data.frame)
$ CASE : num [1:403] 4777 1698 953 1681 4055 ...
$ ORDER : chr [1:403] "sc-mc" "mc-sc" "sc-mc" "mc-sc" ...
$ SUBORDTYPE : chr [1:403] "temp" "temp" "temp" "temp" ...
$ LEN_MC : num [1:403] 4 7 12 6 9 9 9 4 6 4 ...
$ LEN_SC : num [1:403] 10 6 7 15 5 5 12 2 24 11 ...
$ LENGTH_DIFF: num [1:403] -6 1 5 -9 4 4 -3 2 -18 -7 ...
$ CONJ : chr [1:403] "als/when" "als/when" "als/when" "als/when" ...
$ MORETHAN2CL: chr [1:403] "no" "no" "yes" "no" ...
This shows us that the data frame has 8 columns, as the $ operators indicate ($ CASE, $ ORDER, …). The column names are followed by
- the data type (num for numeric and chr for character strings),
- the number of values ([1:403]), and
- the first few observations.
Another intuitive way to display the structure of a data matrix is to simply show the first few rows:
head(cl.order)
# A tibble: 6 × 8
CASE ORDER SUBORDTYPE LEN_MC LEN_SC LENGTH_DIFF CONJ MORETHAN2CL
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 4777 sc-mc temp 4 10 -6 als/when no
2 1698 mc-sc temp 7 6 1 als/when no
3 953 sc-mc temp 12 7 5 als/when yes
4 1681 mc-sc temp 6 15 -9 als/when no
5 4055 sc-mc temp 9 5 4 als/when yes
6 967 sc-mc temp 9 5 4 als/when yes
The columns encode the following information:
- ORDER: Does the subordinate clause come before or after the main clause?
- SUBORDTYPE: Is the subordinate clause temporal or causal?
- MORETHAN2CL: Are there more clauses in the sentence than just one subordinate clause and one main clause?
- LEN_MC: How many words does the main clause contain?
- LEN_SC: How many words does the subordinate clause contain?
- LENGTH_DIFF: What is the length difference in words between the main clause and the subordinate clause?
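Judging from the rows shown above, LENGTH_DIFF appears to be LEN_MC minus LEN_SC; this assumption can be verified directly on the full dataset:
# Check the assumption that LENGTH_DIFF = LEN_MC - LEN_SC holds for every row
all(cl.order$LENGTH_DIFF == cl.order$LEN_MC - cl.order$LEN_SC)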
To count the number of items in a vector, which corresponds to the total number \(n\) of attested values of \(X\), we can use length():
length(cl.order$SUBORDTYPE)
[1] 403
In fact, it is equivalent to the number of rows in the full data frame:
nrow(cl.order)
[1] 403
The function unique() shows all unique items (= types) in a vector, reflecting the possible outcomes for a single observation:
unique(cl.order$SUBORDTYPE)
[1] "temp" "caus"
Exercise 4 For each of the following linguistic variables, identify whether it is discrete or continuous and then specify the type (nominal, ordinal, interval, or ratio):
- Particle placement (Cut off the flowers vs. Cut the flowers off)
- Number of subordinate clauses in a paragraph
- Speaker’s country of origin
- Reaction time in a language processing task (measured in milliseconds)
- Likert scale rating of grammaticality (possible ratings: 1, 2, 3, 4, 5)
Tier 3
Exercise 5 The tidyverse function slice_sample() allows users to randomly sample rows from a data frame, i.e., with every row having the same probability of being selected. We will do so without replacement for the cl.order data, choosing 50 rows at random.
library(tidyverse)
# For reproducibility, it is highly recommended to set a random seed. This makes sure we will get the "same" random result every time we execute our code.
set.seed(123)
# Extract a random sample of 50 rows
cl.order.random <- cl.order %>%
  slice_sample(n = 50, replace = FALSE)
print(cl.order.random)
# A tibble: 50 × 8
CASE ORDER SUBORDTYPE LEN_MC LEN_SC LENGTH_DIFF CONJ MORETHAN2CL
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 3490 sc-mc temp 15 10 5 nachdem/after no
2 4608 mc-sc temp 5 11 -6 als/when yes
3 2458 sc-mc temp 29 4 25 nachdem/after no
4 3178 sc-mc caus 4 19 -15 weil/because yes
5 3616 sc-mc temp 6 5 1 bevor/before yes
6 2405 mc-sc caus 6 5 1 weil/because no
7 1799 mc-sc caus 3 7 -4 weil/because no
8 2752 mc-sc caus 13 16 -3 weil/because no
9 350 mc-sc caus 8 11 -3 weil/because no
10 342 mc-sc caus 10 15 -5 weil/because no
# ℹ 40 more rows
Another common approach is stratified sampling: the population is divided into distinct subgroups (strata) based on a characteristic, and a sample is drawn from each subgroup to ensure proportional representation.
library(sampling)
library(tidyverse)
strat_sample <- function(data, variable, size) {

  # Capture the variable symbol
  var_sym <- enquo(variable)
  var_name <- rlang::as_name(var_sym)

  # Proportions in the population (= sampling weights)
  data_prop <- table(pull(data, !!var_sym)) / nrow(data)

  # Sizes of the stratified sample
  strat_sample_sizes <- round(size * data_prop)

  # Convert variable of interest to factor
  data[[var_name]] <- as.factor(data[[var_name]])

  # Draw the sample
  clause_strat_sample <- strata(data,
                                stratanames = var_name,
                                size = strat_sample_sizes,
                                # Stratified sampling without replacement
                                method = "srswor")

  # Output df
  output_sample <- tibble(getdata(data, clause_strat_sample)) %>%
    select(-Prob, -Stratum, -ID_unit)

  return(output_sample)
}

# Apply the function
cl.order.strat <- strat_sample(cl.order, ORDER, size = 50)
Let’s compare the results of the different sampling procedures:
# Check proportions in the original dataset
cl.order %>%
  count(ORDER) %>%
  mutate(pct = n/sum(n))
# A tibble: 2 × 3
ORDER n pct
<chr> <int> <dbl>
1 mc-sc 275 0.682
2 sc-mc 128 0.318
# Check proportions in the random sample
cl.order.random %>%
  count(ORDER) %>%
  mutate(pct = n/sum(n))
# A tibble: 2 × 3
ORDER n pct
<chr> <int> <dbl>
1 mc-sc 29 0.58
2 sc-mc 21 0.42
# Check proportions in the stratified sample
cl.order.strat %>%
  count(ORDER) %>%
  mutate(pct = n/sum(n))
# A tibble: 2 × 3
ORDER n pct
<fct> <int> <dbl>
1 mc-sc 16 0.32
2 sc-mc 34 0.68
Discuss possible linguistic research scenarios in which stratified sampling may be preferable to random sampling, and vice versa.
Exercise 6 Assuming three randomly chosen observations from the cl.order dataset, describe the sample space \(S\) for the variable ORDER. What random variables could you define with domain \(S\)?