Overview

Authors

Affiliations

Catholic University of Eichstätt-Ingolstadt

University of Vienna

This collection of handouts provides a hands-on introduction to data analysis and statistical methods in quantitative corpus linguistics with R. It is designed with accessibility in mind, assuming no prior knowledge of programming or statistics. All you need to get started is a laptop; everything else will be explained within these pages.

Primarily, this reader is geared towards students attending the classes Language Variation (BA) and Statistics for Linguistics (MA) at the Catholic University of Eichstätt-Ingolstadt (Germany). However, it is also meant to equip students currently working on their BA/MA/PhD theses with the tools they need to conduct empirical studies on a wide array of linguistic phenomena. The methods presented here reflect the state of the art in corpus-linguistic research, providing readers with current and relevant analytical techniques.

Fundamentals of Corpus-based Research

Title	Description
1.1 Basics	A short introduction to the basic structure of a sociolinguistic study.
1.2 Linguistic variables	This handout introduces linguistic variables from classical and sociolinguistic perspectives, explores their subtypes and salience, discusses the principle of accountability, and provides examples of morphosyntactic variation in English.
1.3 Research questions	What is a research question and how do I write a good one?
1.4 Set theory and mathematical notation	This unit introduces key concepts from set theory and mathematical notation (including sets, subsets, unions, intersections, sums, and products) to build foundational skills for formal reasoning in corpus linguistics.

Introduction to R

Title	Description
2.1 First steps	When it comes to data analysis, learning R offers an overwhelming number of short- and long-term advantages over conventional spreadsheet software such as Microsoft Excel or…
2.2 Exploring RStudio	Introduction to the RStudio interface, illustrating how to interact with R via the console, store and reuse values using variables, and preserve your work using R scripts.
2.3 Vectors	We introduce vectors as our first data structure in R and explore some common applications and manipulations.
2.4 Data frames	This unit introduces data frames in R, covering their creation, subsetting, and filtering (with both base R and the tidyverse), and includes practice exercises on accessing and manipulating structured linguistic data.
2.5 Libraries	This handout introduces the installation, loading, and citation of R packages essential for data analysis and linguistic applications.
2.6 Import/export data	This handout provides practical guidance on importing, exporting, and storing data in R using CSV, Excel, and RDS formats, with strategies for troubleshooting common issues.

NLP with R

Title	Description
3.1 Concordancing	This unit introduces concordancing with R using the `quanteda` package, demonstrating keyword-in-context searches, dispersion analysis, and data export for transparent corpus-linguistic research.
3.2 Regular expressions	This handout introduces regular expressions for advanced corpus queries in R and on corpus platforms, showing how to construct, refine, and apply search patterns to linguistic data.
3.3 The CQP interface	This handout introduces regular expressions for advanced corpus queries in R and on corpus platforms, showing how to construct, refine, and apply search patterns to linguistic data.
3.4 Data annotation	Exemplifying the data collection and annotation workflow.

Statistics

Title	Description
4.1 Data, variables, samples	This handout introduces key statistical concepts—such as samples, populations, variables, datasets, and data types—with a focus on their application to empirical linguistic research, using R to explore and illustrate these ideas.
4.2 Probability theory	This section covers some essential concepts from probability theory, such as the notion of probability, probability distributions, and expectations.
4.3 Descriptive statistics	This handout provides a comprehensive, hands-on introduction to descriptive and bivariate statistics for both categorical and continuous linguistic data, including data visualisation, summary measures, correlation, and exporting tables, with practical exercises in R using real-world datasets.
4.4 Hypothesis testing	This handout introduces the principles of scientific inference and null hypothesis significance testing (NHST) for both categorical and continuous linguistic data, covering hypothesis formulation, test statistics, sampling distributions, \(p\)-values, Type I/II errors, statistical power, and common misinterpretations of significance, with illustrative examples from linguistic research.
4.5 Chi-squared test	The chi-squared (\(\chi^2\)) test helps determine whether there is a statistically significant association between two categorical variables. It compares the observed frequencies in each cell of a contingency table with those expected under the null hypothesis, and the \(\chi^2\) score aggregates these discrepancies across all cells. The greater the difference between observed and expected frequencies, the higher the \(\chi^2\) score and, for given degrees of freedom, the lower the \(p\)-value. It is recommended to compute effect size measures and inspect the residuals to assess the nature of the association.
4.6 t-test	This handout covers the application of \(t\)-tests for independent and paired samples, effect size calculation, and ANOVA for comparing multiple groups, with practical examples from phonetic and psycholinguistic data, including visualization, assumption checks, post-hoc testing, and exercises on statistical interpretation and reporting.
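
As a rough sketch of the computation described under 4.5, the \(\chi^2\) score can be written out as follows. The notation is an assumption made here for illustration and may differ from the handout: \(O_{ij}\) and \(E_{ij}\) are the observed and expected frequencies in row \(i\) and column \(j\) of a contingency table with \(r\) rows, \(c\) columns, and \(N\) observations in total.

\[
E_{ij} = \frac{\left(\sum_{j'} O_{ij'}\right)\left(\sum_{i'} O_{i'j}\right)}{N},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
df = (r - 1)(c - 1)
\]

The Pearson residual of a single cell, \((O_{ij} - E_{ij}) / \sqrt{E_{ij}}\), indicates how strongly and in which direction that cell deviates from its expected frequency; the squared residuals sum to the \(\chi^2\) score.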

Models

Title	Description
6.1 Linear regression	Modelling continuous response variables.
6.2 Logistic Regression	Modelling categorical (binary) response variables.
6.3 Mixed-effects regression	Introduction to multilevel models.
6.4 Poisson regression	Modelling count data.
6.5 Ordinal regression	Modelling categorical (ordinal) response variables.

Machine Learning

Title	Description
7.1 Tree-based methods	We familiarise ourselves with powerful non-parametric models based on recursive partitioning.
7.2 Gradient boosting	Gradient boosting constitutes a powerful extension of tree-based methods and is generally appreciated for its high predictive performance. Nevertheless, this family of methods, which includes implementations such as AdaBoost, XGBoost, and CatBoost, among many others, is not yet established in corpus-linguistic statistics. A practical scenario is presented to introduce the core ideas of gradient boosting, demonstrate its application to linguistic data, and point out its advantages and drawbacks.
7.3 Principal Components Analysis	For linguists:
7.4 Exploratory Factor Analysis	For linguists:
7.5 Clustering	James et al. (2021): Chapter 12