Statistics for Corpus Linguists


Overview

Authors and affiliations

  • Vladimir Buskin, Catholic University of Eichstätt-Ingolstadt
  • Thomas Brunner, Catholic University of Eichstätt-Ingolstadt
  • Philippa Adolf, University of Vienna

This collection of handouts provides a hands-on introduction to data analysis and statistical methods in quantitative corpus linguistics with R. It is designed with accessibility in mind, assuming no prior knowledge of programming or statistics. All you need to get started is a laptop; everything else will be explained within these pages.

This reader is geared primarily towards students attending the classes Language Variation (BA) and Statistics for Linguistics (MA) at the Catholic University of Eichstätt-Ingolstadt (Germany). However, it is also meant to equip students currently working on their BA/MA/PhD theses with the tools they need to conduct empirical studies on a wide array of linguistic phenomena. The methods presented here reflect the state of the art in corpus-linguistic research.

Fundamentals of Corpus-based Research

| Title | Description |
| --- | --- |
| 1.1 Basics | A short introduction to the basic structure of a sociolinguistic study. |
| 1.2 Research questions | What is a research question and how do I write a good one? |
| 1.3 Linguistic variables | This handout introduces linguistic variables from classical and sociolinguistic perspectives, explores their subtypes and salience, discusses the principle of accountability, and provides examples of morphosyntactic variation in English. |
| 1.4 Set theory and mathematical notation | This unit introduces key concepts from set theory and mathematical notation, including sets, subsets, unions, intersections, sums, and products, to build foundational skills for formal reasoning in corpus linguistics. |
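The set operations and the sum and product notation mentioned for unit 1.4 can be tried out directly in base R. The following is only a minimal sketch: the word lists and numbers are invented for illustration and are not taken from the handouts.

```r
# Two small sets of word types (invented for illustration)
A <- c("the", "of", "and", "corpus")
B <- c("corpus", "and", "token")

union(A, B)                  # elements that are in A or in B
intersect(A, B)              # elements that are in both A and B
setdiff(A, B)                # elements that are in A but not in B
"corpus" %in% A              # membership: is "corpus" an element of A?
all(c("the", "of") %in% A)   # subset: is {"the", "of"} a subset of A?

# Sum and product notation over a numeric vector x_1, ..., x_n
x <- c(2, 3, 4)
sum(x)    # x_1 + x_2 + x_3
prod(x)   # x_1 * x_2 * x_3
```

Note that R vectors are not true sets: they may contain duplicates and are ordered, so the base R set functions treat their inputs as sets by discarding duplicates.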

Introduction to R

| Title | Description |
| --- | --- |
| 2.1 First steps | When it comes to data analysis, learning R offers an overwhelming number of short- and long-term advantages over conventional spreadsheet software such as Microsoft Excel or… |
| 2.2 Exploring RStudio | Introduction to the RStudio interface, illustrating how to interact with R via the console, store and reuse values using variables, and preserve your work using R scripts. |
| 2.3 Vectors | We introduce vectors as our first data structure in R and explore some common applications and manipulations. |
| 2.4 Data frames | This unit introduces data frames in R, covering their creation, subsetting, and filtering (with both base R and the tidyverse), and includes practice exercises on accessing and manipulating structured linguistic data. |
| 2.5 Libraries | Winter (2020): Chapter 1.13 |
| 2.6 Import/export data | You can find the full R script associated with this unit here. |
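To give a first impression of what creating, subsetting, and filtering a data frame (unit 2.4) looks like, here is a minimal base R sketch. The dataset, its column names, and its values are hypothetical and serve only as an illustration; the handout itself may use different data.

```r
# Hypothetical mini-dataset (all values invented for illustration)
df <- data.frame(
  speaker = c("S1", "S2", "S3", "S4"),
  variant = c("-ing", "-in", "-ing", "-in"),
  age     = c(23, 41, 35, 19)
)

df$age                      # access a single column by name
df[2, ]                     # second row, all columns
df[df$variant == "-in", ]   # keep only rows where variant is "-in"

# Filter rows and select columns in one step
older <- subset(df, age > 30, select = c(speaker, age))

# A tidyverse equivalent of the row filter, assuming dplyr is installed:
# dplyr::filter(df, variant == "-in")
```

The base R forms need no additional packages, which is why the dplyr call is left as a comment here.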

NLP with R

| Title | Description |
| --- | --- |
| 3.1 Concordancing | Schweinberger (2024) |
| 3.2 Regular expressions | You can find the full R script associated with this unit here. |
| 3.3 Data annotation | Exemplifying the data collection and annotation workflow. |

Statistics

| Title | Description |
| --- | --- |
| 4.1 Data, variables, samples | This handout introduces key statistical concepts (samples, populations, variables, datasets, and data types) with a focus on their application to empirical linguistic research, using R to explore and illustrate these ideas. |
| 4.2 Probability theory | This section covers some essential concepts from probability theory, such as probability itself, probability distributions, and expectations. |
| 4.3 Descriptive statistics | Theoretical introduction: |
| 4.4 Continuous data | Heumann et al. (2022: Chapter 3) |
| 4.4 Hypothesis testing | For linguists: |
| 4.5 Binomial test | A binomial distribution describes the number of successes in a fixed number of repeated trials, where each trial has only two possible outcomes (0 and 1, also referred to as failure and success, respectively), the trials are independent of each other, and the probability of success is the same in every trial. |
| 4.6 Chi-squared test | The chi-squared (\(\chi^2\)) test helps determine whether there is a statistically significant association between two categorical variables. It compares the observed frequencies of categories with those expected under the null hypothesis. The \(\chi^2\) score summarises the differences between observed and expected frequencies across all cells of a contingency table. The greater these differences, the higher the \(\chi^2\) score and the lower the \(p\)-value, given the degrees of freedom. It is recommended to compute effect size measures and inspect the residuals to assess the nature of the association. |
| 4.7 t-test | Agresti & Kateri (2022: Chapter 5.3) |
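The descriptions of the binomial test and the chi-squared test above can be illustrated with a few lines of base R. This is a sketch under stated assumptions, not the handouts' own analysis: all counts are invented, and Cramér's V is just one of several possible effect size measures.

```r
# --- Binomial test (invented data: 14 "successes" in 20 independent trials) ---
# Null hypothesis: the probability of success is 0.5
bt <- binom.test(x = 14, n = 20, p = 0.5)
bt$p.value

# --- Chi-squared test (invented 2 x 2 contingency table) ---
observed <- matrix(
  c(40, 60,
    70, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(register = c("spoken", "written"),
                  variant  = c("A", "B"))
)

# correct = FALSE switches off Yates' continuity correction, so the
# statistic equals the plain formula sum((O - E)^2 / E)
chi <- chisq.test(observed, correct = FALSE)

chi$expected    # expected frequencies under the null hypothesis
chi$statistic   # chi-squared score
chi$parameter   # degrees of freedom
chi$p.value
chi$residuals   # Pearson residuals: (O - E) / sqrt(E), one per cell

# The same score computed by hand from observed and expected frequencies
chi_sq_manual <- sum((observed - chi$expected)^2 / chi$expected)

# Cramér's V as an effect size: sqrt(chi^2 / (n * (min(rows, cols) - 1)))
n <- sum(observed)
cramers_v <- sqrt(unname(chi$statistic) / (n * (min(dim(observed)) - 1)))
```

Cells with large absolute residuals are the ones that contribute most to the \(\chi^2\) score, which is why inspecting them helps to interpret the association.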

Models

| Title | Description |
| --- | --- |
| 6.1 Linear regression | Modelling continuous response variables. |
| 6.2 Logistic regression | Modelling categorical (binary) response variables. |
| 6.3 Mixed-effects regression | For linguists: |
| 6.4 Poisson regression | Modelling count data. |
| 6.5 Ordinal regression | Modelling categorical (ordinal) response variables. |

Machine Learning

| Title | Description |
| --- | --- |
| 7.1 Tree-based methods | We familiarise ourselves with powerful non-parametric models based on recursive partitioning. |
| 7.2 Gradient boosting | Gradient boosting constitutes a powerful extension of tree-based methods and is generally appreciated for its high predictive performance. Nevertheless, this family of methods, which includes implementations such as AdaBoost, XGBoost, and CatBoost, among many others, is not yet established in corpus-linguistic statistics. A practical scenario is presented to introduce the core ideas of gradient boosting, demonstrate its application to linguistic data, and point out its advantages and drawbacks. |
| 7.3 Principal Components Analysis | For linguists: |
| 7.4 Exploratory Factor Analysis | For linguists: |
| 7.5 Clustering | James et al. (2021): Chapter 12 |