This collection of handouts provides a hands-on introduction to data analysis and statistical methods in quantitative corpus linguistics with R. It is designed with accessibility in mind, assuming no prior knowledge of programming or statistics. All you need to get started is a laptop; everything else will be explained within these pages.
Primarily, this reader is geared towards students attending the classes Language Variation (BA) and Statistics for Linguistics (MA) at the Catholic University of Eichstätt-Ingolstadt (Germany). However, it is also meant to equip students currently working on their BA/MA/PhD theses with the tools they need to conduct empirical studies on a wide array of linguistic phenomena. The methods presented here reflect the state of the art in corpus-linguistic research, providing readers with current and relevant analytical techniques.
Fundamentals of Corpus-based Research
| Title | Description |
|---|---|
| 1.1 Basics | A short introduction to the basic structure of a sociolinguistic study. |
| 1.2 Research questions | What is a research question, and how do I write a good one? |
| 1.3 Linguistic variables | This handout introduces linguistic variables from classical and sociolinguistic perspectives, explores their subtypes and salience, discusses the principle of accountability, and provides examples of morphosyntactic variation in English. |
| 1.4 Set theory and mathematical notation | This unit introduces key concepts from set theory and mathematical notation (sets, subsets, unions, intersections, sums, and products) to build foundational skills for formal reasoning in corpus linguistics. |
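The set operations and the sum and product notation covered in unit 1.4 map directly onto base R functions. The following is a minimal sketch, not taken from the handout itself; the word lists and frequency counts are invented for illustration.

```r
# Two hypothetical sets of verb types (invented for illustration)
A <- c("walk", "talk", "sing")
B <- c("sing", "dance")

a_union_b     <- union(A, B)       # elements that occur in A or in B
a_intersect_b <- intersect(A, B)   # elements that occur in both A and B
a_minus_b     <- setdiff(A, B)     # elements of A that are not in B

# Subset check: is every element of the smaller set also in A?
is_subset <- all(c("walk", "talk") %in% A)

# Sums and products over a vector of hypothetical frequencies
freqs         <- c(3, 5, 2)
total_freq    <- sum(freqs)    # corresponds to the summation sign
product_freqs <- prod(freqs)   # corresponds to the product sign

print(a_union_b)
print(a_intersect_b)
print(total_freq)
```

Note that R vectors are not sets in the strict sense: they may contain duplicates, which `union()`, `intersect()`, and `setdiff()` silently discard.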
Introduction to R
| Title | Description |
|---|---|
| 2.1 First steps | When it comes to data analysis, learning R offers an overwhelming number of short- and long-term advantages over conventional spreadsheet software such as Microsoft Excel or… |
| 2.2 Exploring RStudio | Introduction to the RStudio interface, illustrating how to interact with R via the console, store and reuse values using variables, and preserve your work using R scripts. |
| 2.3 Vectors | We introduce vectors as our first data structure in R and explore some common applications and manipulations. |
| 2.4 Data frames | This unit introduces data frames in R, covering their creation, subsetting, and filtering (with both base R and the tidyverse), and includes practice exercises on accessing and manipulating structured linguistic data. |
| 2.5 Libraries | Winter (2020: Chapter 1.13) |
| 2.6 Import/export data | You can find the full R script associated with this unit here. |
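To give a flavour of what units 2.3 and 2.4 cover, the sketch below creates a small data frame and then subsets and filters it with base R. It is an illustration only: the column names and values are hypothetical and do not come from the handouts.

```r
# Hypothetical mini data set (all values invented for illustration)
df <- data.frame(
  speaker  = c("S1", "S2", "S3", "S4"),
  register = c("spoken", "written", "spoken", "written"),
  tokens   = c(120, 340, 95, 410)
)

# Subsetting: extract one column as a vector, or one row as a data frame
token_counts <- df$tokens
second_row   <- df[2, ]

# Filtering with a logical condition inside square brackets
spoken_only <- df[df$register == "spoken", ]

# Filtering with subset(), combining two conditions
spoken_long <- subset(df, register == "spoken" & tokens > 100)

# A simple summary computed on a column
mean_tokens <- mean(df$tokens)

print(spoken_only)
print(mean_tokens)
```

Assuming the dplyr package is installed, the first filtering step could equivalently be written as `dplyr::filter(df, register == "spoken")`.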
Statistics
| Title | Description |
|---|---|
| 4.1 Data, variables, samples | This handout introduces key statistical concepts, such as samples, populations, variables, datasets, and data types, with a focus on their application to empirical linguistic research, using R to explore and illustrate these ideas. |
| 4.2 Probability theory | This section covers some essential concepts from probability theory, such as the notion of probability, probability distributions, and expectations. |
| 4.3 Descriptive statistics | Theoretical introduction: |
| 4.4 Continuous data | Heumann et al. (2022: Chapter 3) |
| 4.4 Hypothesis testing | For linguists: |
| 4.5 Binomial test | The binomial distribution is a probability distribution that describes the number of successes in repeated runs of an experiment with only two possible outcomes (0 and 1, also referred to as failure and success, respectively), where the repetitions are independent of each other. |
| 4.6 Chi-squared test | The chi-squared (\(\chi^2\)) test helps determine whether there is a statistically significant association between two categorical variables. It compares the observed frequencies of categories with those expected under the null hypothesis. The \(\chi^2\) score aggregates the differences between observed and expected frequencies across all cells of a contingency table. The greater these differences, the higher the \(\chi^2\) score and the lower the \(p\)-value, given the degrees of freedom. It is recommended to compute effect size measures and inspect the residuals to assess the nature of the association. |
| 4.7 t-test | Agresti & Kateri (2022: Chapter 5.3) |
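The workflow described for the chi-squared test in unit 4.6 (observed versus expected frequencies, test statistic, residuals, effect size) can be sketched in base R as follows. This is only an illustration under stated assumptions: the contingency table is invented, the continuity correction is switched off so that the manual calculation matches the output of `chisq.test()`, and Cramér's V is picked here as one possible effect size measure.

```r
# Hypothetical counts (invented): two variants of a linguistic variable by register
observed <- matrix(
  c(40, 60,
    70, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(register = c("spoken", "written"),
                  variant  = c("A", "B"))
)

# Pearson's chi-squared test without Yates' continuity correction
test <- chisq.test(observed, correct = FALSE)

expected  <- test$expected    # frequencies expected under the null hypothesis
residuals <- test$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)

# The test statistic is the sum of squared, scaled deviations over all cells
chi2_manual <- sum((observed - expected)^2 / expected)

# Cramér's V as an effect size (assumption: one of several possible measures)
n         <- sum(observed)
cramers_v <- sqrt(chi2_manual / (n * (min(dim(observed)) - 1)))

print(expected)
print(test$statistic)
print(test$p.value)
print(residuals)
print(cramers_v)
```

Cells with large positive residuals are more frequent than expected under the null hypothesis, and cells with large negative residuals are less frequent, which shows where the association comes from.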
Machine Learning
| Title | Description |
|---|---|
| 7.1 Tree-based methods | We familiarise ourselves with powerful non-parametric models based on recursive partitioning. |
| 7.2 Gradient boosting | Gradient boosting constitutes a powerful extension of tree-based methods and is generally appreciated for its high predictive performance. Nevertheless, this family of methods, which includes implementations such as AdaBoost, XGBoost, and CatBoost, among many others, is not yet established in corpus-linguistic statistics. A practical scenario is presented to introduce the core ideas of gradient boosting, demonstrate its application to linguistic data, and point out its advantages and drawbacks. |
| 7.3 Principal Components Analysis | For linguists: |
| 7.4 Exploratory Factor Analysis | For linguists: |
| 7.5 Clustering | James et al. (2021: Chapter 12) |