14 Data types

Author: Vladimir Buskin

Affiliation: Catholic University of Eichstätt-Ingolstadt

14.1 Recommended reading

Heumann, Schomaker, and Shalabh (2022): Chapter 1

Baguley (2012): Chapter 1

Agresti and Kateri (2022): Chapter 1.2

14.2 Preparation

Script

You can find the full R script associated with this unit here.

Please download the file Paquot_Larsson_2020_data.xlsx (Paquot and Larsson 2020) and store it in your working directory.

# Load libraries
library(readxl)
library(tidyverse)

# Load data
cl.order <- read_xlsx("Paquot_Larsson_2020_data.xlsx")

# Inspect data
str(cl.order)
head(cl.order)

What’s in the file? The str() function

The easiest way to get a general overview of the full data set is to apply the str() function to the respective data frame.

str(cl.order)

tibble [403 × 8] (S3: tbl_df/tbl/data.frame)
 $ CASE       : num [1:403] 4777 1698 953 1681 4055 ...
 $ ORDER      : chr [1:403] "sc-mc" "mc-sc" "sc-mc" "mc-sc" ...
 $ SUBORDTYPE : chr [1:403] "temp" "temp" "temp" "temp" ...
 $ LEN_MC     : num [1:403] 4 7 12 6 9 9 9 4 6 4 ...
 $ LEN_SC     : num [1:403] 10 6 7 15 5 5 12 2 24 11 ...
 $ LENGTH_DIFF: num [1:403] -6 1 5 -9 4 4 -3 2 -18 -7 ...
 $ CONJ       : chr [1:403] "als/when" "als/when" "als/when" "als/when" ...
 $ MORETHAN2CL: chr [1:403] "no" "no" "yes" "no" ...

This shows us that the data frame has 8 columns, each introduced by a $ sign ($ CASE, $ ORDER, …). The column names are followed by

the data type (num for numeric and chr for character strings),
the index range of the values ([1:403], i.e., 403 values per column), and
the first few observations.

Another intuitive way to display the structure of a data matrix is to simply show the first few rows:

head(cl.order)

# A tibble: 6 × 8
   CASE ORDER SUBORDTYPE LEN_MC LEN_SC LENGTH_DIFF CONJ     MORETHAN2CL
  <dbl> <chr> <chr>       <dbl>  <dbl>       <dbl> <chr>    <chr>      
1  4777 sc-mc temp            4     10          -6 als/when no         
2  1698 mc-sc temp            7      6           1 als/when no         
3   953 sc-mc temp           12      7           5 als/when yes        
4  1681 mc-sc temp            6     15          -9 als/when no         
5  4055 sc-mc temp            9      5           4 als/when yes        
6   967 sc-mc temp            9      5           4 als/when yes

Further details on the variables

ORDER: Does the subordinate clause come before or after the main clause?
SUBORDTYPE: Is the subordinate clause temporal or causal?
MORETHAN2CL: Are there more clauses in the sentence than just one subordinate clause and one main clause?
LEN_MC: How many words does the main clause contain?
LEN_SC: How many words does the subordinate clause contain?
LENGTH_DIFF: What is the length difference in words between the main clause and the subordinate clause (LEN_MC minus LEN_SC)?
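As a quick sanity check on these descriptions, the following sketch re-enters the six rows shown by head() by hand, so that it runs without the Excel file, and confirms that LENGTH_DIFF equals LEN_MC minus LEN_SC for these rows. The object names cl.head and recomputed are introduced here purely for illustration.

```r
# First six rows of cl.order, typed in by hand from the head() output above
cl.head <- data.frame(
  LEN_MC      = c(4, 7, 12, 6, 9, 9),
  LEN_SC      = c(10, 6, 7, 15, 5, 5),
  LENGTH_DIFF = c(-6, 1, 5, -9, 4, 4)
)

# Recompute the difference and compare it with the stored column
recomputed <- cl.head$LEN_MC - cl.head$LEN_SC
all(recomputed == cl.head$LENGTH_DIFF)  # TRUE for these six rows
```

A negative value therefore means that the subordinate clause is longer than the main clause. To run the same check on the full data, replace cl.head with cl.order.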

14.3 The big picture: Populations and samples

In order to investigate one or more linguistic features, we first need to collect relevant observations, which typically correspond to linguistic utterances in a corpus-linguistic context. Since it is not feasible to examine, for instance, the entirety of all linguistic utterances ever produced, i.e., the virtually infinite population, we rely on subsets of it: samples. Good research is characterised by good sampling procedures that limit the bias present in any sample, thus improving the generalisability of potential findings.

To state this more formally, let $\omega$ (‘lower-case Omega’) denote a single observation of interest. The full set of theoretically possible observations is contained in $\Omega$ (‘upper-case Omega’), with $\Omega = \{\omega_1, \omega_2, \dots, \omega_n \}$. Here $n$ denotes the total number of observations and is a natural number; $\omega_n$ is accordingly the $n$-th observation. A sample is then simply a selection of elements from $\Omega$.
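The relationship between $\Omega$ and a sample can be mimicked in R with a toy example. The sketch below assumes a hypothetical population of 1,000 observations, identified only by their index, and draws a simple random sample from it. The population size, sample size and seed are arbitrary choices made for illustration.

```r
# Hypothetical toy population: 1,000 observations, identified by their index
Omega <- 1:1000

# Fix the random seed so that the draw is reproducible
set.seed(123)

# Draw a simple random sample of 50 observations without replacement
smpl <- sample(Omega, size = 50)

length(smpl)          # sample size
all(smpl %in% Omega)  # every sampled element belongs to the population
```

Random selection is only one of several possible sampling procedures. It is used here because sample() makes it easy to demonstrate.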

14.4 Variables

The concept of the variable is very handy in that it allows us to quantify different aspects of our linguistic observations, such as the age or gender of the speaker, the register of the speech situation, the variety of English, the length of an utterance, a syntactic construction or a grammatical category, among many other theoretically possible features. For each observation, we should be able to assign a specific value to our variables; for instance, a variable Register could assume a value such as informal, or Utterance length a hypothetical value of 5 (e.g., 5 words).

In the statistical literature, variables are usually represented by upper-case letters, such as $X$. For each observation $\omega \in \Omega$, $X$ has a value $x$. The set $S$ comprises all possible unique values of $X$, e.g.,

$S_{\text{Register}} = \{\text{informal}, \text{formal}, ...\}$,
$S_{\text{Utterance length}} = \{1, 2, 3, \dots \}$,
…

Number of (unique) values in R

To count the number of items in a vector, which corresponds to the total number $n$ of attested values of $X$, we can use length():

length(cl.order$SUBORDTYPE)

[1] 403

In fact, it is equivalent to the number of rows in the full data frame:

nrow(cl.order)

[1] 403

The function unique() shows all unique items (= types) in a vector, which reflect the possible outcomes in $S$:

unique(cl.order$SUBORDTYPE)

[1] "temp" "caus"
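The two functions can also be combined to count the types directly, and table() additionally reports how often each type occurs. The sketch below uses a short hypothetical vector (subordtype) in place of cl.order$SUBORDTYPE so that it runs without the data file. Its values are made up for illustration.

```r
# Hypothetical toy vector standing in for cl.order$SUBORDTYPE
subordtype <- c("temp", "temp", "caus", "temp", "caus")

length(subordtype)          # number of tokens (attested values)
length(unique(subordtype))  # number of types (distinct values)
table(subordtype)           # frequency of each type
```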

14.4.1 Datasets

Information on variables and their values is conventionally arranged in a dataset, which is essentially a matrix with rows (= observations) and columns (= variables). Consider the head() of the clause data:

	CASE	ORDER	SUBORDTYPE	LEN_MC	LEN_SC	LENGTH_DIFF	CONJ	MORETHAN2CL
1	4777	sc-mc	temp	4	10	-6	als/when	no
2	1698	mc-sc	temp	7	6	1	als/when	no
3	953	sc-mc	temp	12	7	5	als/when	yes
4	1681	mc-sc	temp	6	15	-9	als/when	no
5	4055	sc-mc	temp	9	5	4	als/when	yes
6	967	sc-mc	temp	9	5	4	als/when	yes

Assuming that $n$ denotes the number of rows and $p$ the number of columns, the abstract form of a dataset is an $n \times p$ matrix of the form in Equation 14.1.

\[ \begin{pmatrix} \omega & X_1 & X_2 & \cdots & X_p \\ 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 2 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ n & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \tag{14.1}\]
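In R, the indices of Equation 14.1 map onto the square-bracket notation df[i, j], where i selects the row and j the column. The sketch below illustrates this with three columns of the head() output, re-entered by hand. Treating these three columns as $X_1$, $X_2$ and $X_3$ is an assumption made for the example only, and cl.head is an illustrative name.

```r
# Three columns of the head() output, typed in by hand
cl.head <- data.frame(
  ORDER      = c("sc-mc", "mc-sc", "sc-mc", "mc-sc", "sc-mc", "sc-mc"),
  SUBORDTYPE = c("temp", "temp", "temp", "temp", "temp", "temp"),
  LEN_MC     = c(4, 7, 12, 6, 9, 9)
)

dim(cl.head)          # n = 6 rows, p = 3 columns
cl.head[2, 3]         # value in row 2, column 3: 7
cl.head[2, "LEN_MC"]  # the same cell, addressed by column name
cl.head[2, ]          # all values of the second observation
```

Note that read_xlsx() returns a tibble rather than a plain data frame; single-cell indexing on a tibble returns a one-cell tibble instead of a bare value.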

14.4.2 Data types

In general, we distinguish between discrete variables, which can only take a finite or countably infinite set of distinct values, and continuous variables, which can take infinitely many values within a specified range.

We can further subdivide discrete and continuous variables into nominal, ordinal, interval-scaled and ratio-scaled ones:

Nominal/categorical

These variables comprise a limited number of categories which cannot be ordered in a meaningful way. For instance, it does not matter which value of SUBORDTYPE or MORETHAN2CL comes first or last:

    unique(cl.order$SUBORDTYPE)

[1] "temp" "caus"

    unique(cl.order$MORETHAN2CL)

[1] "no"  "yes"

Ordinal/categorical

Ordinal variables are ordered. However, the intervals between their individual values are not interpretable. Heumann, Schomaker, and Shalabh (2022: 6) provide a pertinent example:

[T]he satisfaction with a product (unsatisfied–satisfied–very satisfied) is an ordinal variable because the values this variable can take can be ordered but the differences between ‘unsatisfied–satisfied’ and ‘satisfied–very satisfied’ cannot be compared in a numerical way.
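One way to represent such a variable in R is an ordered factor. The sketch below encodes the three satisfaction levels from the quotation; the four example responses are made up for illustration.

```r
# Hypothetical responses on the three-level satisfaction scale
satisfaction <- c("satisfied", "unsatisfied", "very satisfied", "satisfied")

# Encode them as an ordered factor, listing the levels from lowest to highest
satisfaction <- factor(
  satisfaction,
  levels  = c("unsatisfied", "satisfied", "very satisfied"),
  ordered = TRUE
)

levels(satisfaction)               # the categories in their defined order
satisfaction[1] < satisfaction[3]  # TRUE: "satisfied" ranks below "very satisfied"
```

Comparisons with < and > work because the levels are ordered. Arithmetic such as subtraction is not defined for factors, which matches the point that the intervals between ordinal values cannot be interpreted.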

Interval-scaled/continuous

In the case of interval-scaled variables, the differences between the values can be interpreted, but their ratios cannot. A temperature of 4°C is 6 degrees warmer than -2°C; however, the ratio $\frac{4}{-2} = -2$ has no meaningful interpretation, since 4°C is not "minus two times as warm" as -2°C. This is because the Celsius scale has no true zero point; 0°C simply signifies another point on the scale and not the absence of temperature altogether.
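The temperature example can be checked numerically. The sketch below converts the two Celsius values to Kelvin, a scale that does have a true zero point, and compares differences and ratios on both scales. The conversion constant 273.15 is the standard offset between the two scales; the object names are illustrative.

```r
# The two temperatures from the example, in degrees Celsius
celsius <- c(-2, 4)

# The same temperatures in Kelvin (a scale with a true zero point)
kelvin <- celsius + 273.15

# Differences are the same on both scales
diff(celsius)
diff(kelvin)

# Ratios are not: the Celsius ratio is -2, the Kelvin ratio is about 1.02
celsius[2] / celsius[1]
kelvin[2] / kelvin[1]
```

Because the ratio changes when the arbitrary zero point is shifted, it says nothing about the temperatures themselves, whereas the difference remains stable.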

Ratio-scaled/continuous

Ratio-scaled variables allow both a meaningful interpretation of the differences between their values and (!) of the ratios between them. Within the context of clause length, LENGTH_DIFF values such as 4 and 8 not only suggest that the latter is four units greater than the former but also that their ratio $\frac{8}{4} = 2$ is a valid way to describe the relationship between these values. Here a LENGTH_DIFF of 0 can be clearly viewed as the absence of a length difference.

14.4.3 Dependent vs. independent variables

In empirical studies, it is often of interest whether one variable leads to changes in the values of another variable. When exploring such associations (or correlations), we need to take another heuristic step to clarify the direction of the influence.

In a linguistic context, we denote the variable whose usage patterns we’d like to explain as the dependent or response variable. A list of possible dependent variables is provided in the section on Linguistic variables.

Its outcomes are said to depend on one or more independent variables. These are also often referred to as explanatory variables, as they are supposed to explain variation in the response variable. Typical examples are AGE, SEX or the VARIETY of English at hand.
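R's formula notation mirrors this distinction: the response variable is written to the left of the tilde (~) and the explanatory variables to its right. The sketch below only constructs and inspects a formula object; it does not fit a model. The response name CONSTRUCTION is hypothetical, and the choice of predictors is purely illustrative.

```r
# Response on the left of ~, explanatory variables on the right
# (CONSTRUCTION is a hypothetical response variable)
f <- CONSTRUCTION ~ AGE + SEX + VARIETY

all.vars(f)[1]   # the dependent (response) variable
all.vars(f)[-1]  # the independent (explanatory) variables
```

Swapping the two sides of the formula would reverse the assumed direction of influence.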

14.5 Exercises

Exercise 14.1 Consider the following statement:

This paper examines the influence of clause length on the ordering of main and subordinate clauses.

What is the dependent variable?
What is the independent variable?
What would it mean if they were reversed?

Exercise 14.2 Is frequency data a discrete or a continuous variable?

Exercise 14.3 Identify the variable types (nominal, ordinal etc.) for all columns in the cl.order dataset.

Exercise 14.4 Consider the general form of the data matrix from Equation 14.1.

What would the labels $\omega$, $X$ and $x$ correspond to in the cl.order dataset?
What would be $\Omega$ and $S$?