Statistics for Corpus Linguists

1.2 Linguistic variables

Theory
Authors

Thomas Brunner, Catholic University of Eichstätt-Ingolstadt
Vladimir Buskin, Catholic University of Eichstätt-Ingolstadt

Abstract
This handout introduces linguistic variables from classical and sociolinguistic perspectives, explores their subtypes and salience, discusses the principle of accountability, and provides examples of morphosyntactic variation in English.

What is a linguistic variable?

  1. The classical view: Labov (1972: 7-8) explains that the linguistic variable should be salient on several levels:
  • It occurs often enough that small data samples could already hint at its distributional idiosyncrasies.
  • Previous research hints at the influence of extra-linguistic factors.
  • Individual speakers should have some awareness of the variable's realisations, but not so much that it distorts their linguistic output.
  • The variable is deeply embedded in the language system.
  2. A restriction: Meyerhoff (2009: 11) summarises: “In sum, a sociolinguistic variable can be defined as a linguistic variable that is constrained by social or non-linguistic factors […]”

  3. A more open view: Kiesling (2011) argues: “Given the variability of what counts as a variable, we must define what counts as a variable more broadly than ‘two or more ways of saying the same thing’. We will simply say that a linguistic variable is a choice or option about speaking in a speech community… Note that this definition does not in any way require us to state that the meaning be the same, although there should be some kind of equivalence noted.”

Subtypes of variables

Linguistic perspective

  1. Phonetic/phonological
  2. Morphological
  3. Syntactic
  4. Pragmatic

Sociolinguistic perspective

Sociolinguistic variables also differ with regard to their salience in society.

  1. Stereotypes are strongly socially marked and part of popular discourse about language.
    • h-dropping in Cockney
    • Canadian eh at the end of sentences
    • Australian dinkum: I was fair dinkum about my interest in their culture ‘authentic, genuine’
  2. Markers show both social and style stratification; all members of a society react similarly in taking care to avoid the pattern in formal registers.
    • (r)
    • (th)

  3. Indicators differentiate social groups. However, people are not aware of them and therefore do not avoid them in formal registers.
    • Same vowel in God and Guard in New York City

Cf. Mesthrie (2011).

Many morphosyntactic variables in English

The table below lists some of the best-known (morpho-)syntactic alternations in English. For an up-to-date overview with various case studies, see Szmrecsanyi & Grafmiller (2023).

| Variable | Example |
|---|---|
| Indefinite pronouns | everybody vs. everyone |
| Case and order of coordinated pronouns | my husband and I vs. my husband and me vs. me and my husband |
| that vs. zero complementation | I don’t think that/Ø it’s a problem. |
| that vs. gerundial complementation | remember that vs. remember V-ing; try to vs. try and vs. try V-ing |
| Particle placement alternation | set the computer up vs. set up the computer |
| The dative alternation | give the book to John vs. give John the book |
| The genitive alternation | John’s house vs. the house of John |
| Relativization strategies | wh-word vs. that vs. Ø |
| Analytic vs. synthetic comparatives | warmer vs. more scary |
| Plural existentials | there are some places vs. there’s some places |
| Future temporal reference | will vs. going to vs. progressive etc. |
| Deontic modality | must vs. have to vs. need to vs. got to etc. |
| Stative possession | have vs. have got vs. got |
| Quotatives | say vs. be like vs. go etc. |
| not vs. no | not anybody vs. nobody; not anyone vs. no one; not anything vs. nothing |
| NOT vs. AUX contraction | that’s not vs. that isn’t etc. |

Cf. Gardner et al. (2021).

A statistical perspective

Sociolinguistic variables are fundamentally sets of discrete (i.e., clearly delimited) outcomes. This means that each token in the data represents a choice among several possible, mutually exclusive variants. There may be exactly two outcomes (a binary, or binomial, variable) or more than two (a multinomial variable), depending on the specific variable under study.

The distribution of outcomes for sociolinguistic variables is never truly continuous. Continuous variables such as Age or vowel formant measurements (e.g., F1) are treated separately – the study of sociolinguistic variation typically focuses on the selection among categorical alternatives.
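
To make the idea of discrete, mutually exclusive outcomes concrete, here is a minimal Python sketch. The token list is invented for illustration (hypothetical observations of the genitive alternation, coded as "s" vs. "of"), and the helper name `variant_proportions` is likewise an assumption rather than part of any library.

```python
from collections import Counter

# Hypothetical tokens of the genitive alternation (invented data):
# every token realises exactly one variant, so the outcomes are discrete.
tokens = ["s", "of", "s", "s", "of", "s", "of", "s"]


def variant_proportions(observations):
    """Return the relative frequency of each variant in a list of tokens."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.items()}


proportions = variant_proportions(tokens)
for variant, share in sorted(proportions.items()):
    print(f"{variant}: {share:.3f}")
```

With two variants the variable is binomial; adding a third label to the list (say, a noun-noun variant) would make it multinomial without any change to the function.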

Quantitative analyses attempt to model the probabilities of these binomial or multinomial outcomes. Some common statistical questions involve:

  • Are all outcomes equally likely?

  • If we condition the sociolinguistic variable on another independent variable (such as Gender, Variety, Social Class, …), does its probability distribution change?

  • Are there any statistically meaningful patterns (hypothesis testing)?
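
The three questions above can be sketched in a few lines of Python using only the standard library. All counts are invented for illustration (hypothetical tokens of that vs. zero complementation by speaker gender), and the helper `chi_squared` is an assumed name. The p-value shortcut via `math.erfc` holds only for one degree of freedom, i.e. a 2×2 table as used here.

```python
import math
from collections import Counter

# Hypothetical (variant, gender) observations; the counts are invented.
data = (
    [("that", "female")] * 30 + [("zero", "female")] * 20
    + [("that", "male")] * 20 + [("zero", "male")] * 30
)
n = len(data)


def chi_squared(observed, expected):
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Question 1: are all outcomes equally likely?
# Goodness of fit against a uniform distribution over the variants.
variant_counts = Counter(variant for variant, _ in data)
variants = sorted(variant_counts)
observed = [variant_counts[v] for v in variants]
uniform = [n / len(variants)] * len(variants)
gof_stat = chi_squared(observed, uniform)

# Question 2: does the distribution change once we condition on Gender?
cell_counts = Counter(data)
group_totals = Counter(group for _, group in data)
groups = sorted(group_totals)
conditional = {
    g: {v: cell_counts[(v, g)] / group_totals[g] for v in variants}
    for g in groups
}

# Question 3: is the difference statistically meaningful?
# Test of independence: expected count = row total * column total / n.
obs_cells = [cell_counts[(v, g)] for v in variants for g in groups]
exp_cells = [variant_counts[v] * group_totals[g] / n
             for v in variants for g in groups]
indep_stat = chi_squared(obs_cells, exp_cells)

# For 1 degree of freedom (2x2 table only), P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(indep_stat / 2))

print(gof_stat, conditional, indep_stat, round(p_value, 4))
```

In this invented data set the two variants are equally frequent overall, yet their shares differ between the two groups, which is exactly the situation the second and third questions are designed to detect. For larger tables the p-value would need the chi-squared distribution with the appropriate degrees of freedom, which is normally taken from a statistics package.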

Exercises

Exercise 1 Which of the following variables could be considered ‘good’ sociolinguistic variables, and which of them poor ones? Justify your answer.

  1. /fɔːθ flɔː/ vs. /fɔːrθ flɔːr/
  2. This enables him to preside over the process which I have described vs. This enables him to preside over the process that I have described vs. This enables him to preside over the process ∅ I have described.
  3. The pair found the briefcase on a bus station bench at Bath central bus station. vs. The briefcase was found on a bus station bench at Bath central bus station by the pair.
  4. Art is after all the subject of attention for both critic and historian, even though the functions and methods of the two sorts of writer have drawn apart. vs. Art histories often make an attempt to keep to chronology, although the difficulties include the crucial fact that in art there is no clear sequence of events. vs. Many of his readers approved his sensitive and appreciative understanding of paintings, though without sharing his political views.
  5. /pleɪɪŋ/ vs. /pleɪɪn/
  6. [tʰ] in /tɔp/ vs. [t] in stop.

Exercise 2 Two linguists aim to study the preference for passives among men and women. They extract all the passives from 500,000 words of male speech and all passives from 500,000 words of female speech and report the results. What is wrong with this approach?

References

Gardner, Matt Hunt et al. 2021. “Variation Isn’t That Hard: Morphosyntactic Choice Does Not Predict Production Difficulty.” PLoS ONE 16 (6): e0252602.
Kiesling, Scott F. 2011. Linguistic Variation and Change. Edinburgh: Edinburgh University Press.
Labov, William. 1972. Sociolinguistic Patterns. Philadelphia: University of Pennsylvania Press.
Mesthrie, Rajend. 2011. Introducing Sociolinguistics. 2nd ed. Edinburgh: Edinburgh University Press.
Meyerhoff, Miriam. 2009. Introducing Sociolinguistics. London: Routledge.
Szmrecsanyi, Benedikt, and Jason Grafmiller. 2023. Comparative Variation Analysis: Grammatical Alternations in World Englishes. Cambridge: Cambridge University Press.