This collection of handouts provides a hands-on introduction to data analysis and statistical methods in quantitative corpus linguistics with R. It is designed with accessibility in mind, assuming no prior knowledge of programming or statistics. All you need to get started is a laptop; everything else will be explained within these pages.
This reader is primarily geared towards students attending the classes Language Variation (BA) and Statistics for Linguistics (MA) at the Catholic University of Eichstätt-Ingolstadt (Germany). It is also meant to equip students currently working on their BA, MA, or PhD theses with the tools they need to conduct empirical studies on a wide array of linguistic phenomena. The methods presented here reflect the current state of the art in corpus-linguistic research.
Fundamentals of Corpus-based Research

| Title | Description |
|---|---|
| 1.1 Basics | A short introduction to the basic structure of a sociolinguistic study. |
| 1.2 Linguistic variables | This handout introduces linguistic variables from classical and sociolinguistic perspectives, explores their subtypes and salience, discusses the principle of accountability, and provides examples of morphosyntactic variation in English. |
| 1.3 Research questions | What is a research question and how do I write a good one? |
| 1.4 Set theory and mathematical notation | This unit introduces key concepts from set theory and mathematical notation (sets, subsets, unions, intersections, sums, and products) to build foundational skills for formal reasoning in corpus linguistics. |

Introduction to R

| Title | Description |
|---|---|
| 2.1 First steps | When it comes to data analysis, learning R offers an overwhelming number of short- and long-term advantages over conventional spreadsheet software such as Microsoft Excel or… |
| 2.2 Exploring RStudio | Introduction to the RStudio interface, illustrating how to interact with R via the console, store and reuse values using variables, and preserve your work using R scripts. |
| 2.3 Vectors | We introduce vectors as our first data structure in R and explore some common applications and manipulations. |
| 2.4 Data frames | This unit introduces data frames in R, covering their creation, subsetting, filtering (including with base R and tidyverse), and includes practice exercises on accessing and manipulating structured linguistic data. |
| 2.5 Libraries | This handout introduces the installation, loading, and citation of R packages essential for data analysis and linguistic applications. |
| 2.6 Import/export data | This handout provides practical guidance on importing, exporting, and storing data in R using CSV, Excel, and RDS formats, with strategies for troubleshooting common issues. |

NLP with R

| Title | Description |
|---|---|
| 3.1 Concordancing | This unit introduces concordancing with R using the quanteda package, demonstrating keyword-in-context searches, dispersion analysis, and data export for transparent corpus-linguistic research. |
| 3.2 Regular expressions | This handout introduces regular expressions for advanced corpus queries in R and on corpus platforms, showing how to construct, refine, and apply search patterns to linguistic data. |
| 3.3 The CQP interface | This handout introduces regular expressions for advanced corpus queries in R and on corpus platforms, showing how to construct, refine, and apply search patterns to linguistic data. |
| 3.4 Data annotation | Exemplifying the data collection and annotation workflow. |

Statistics

| Title | Description |
|---|---|
| 4.1 Data, variables, samples | This handout introduces key statistical concepts (samples, populations, variables, datasets, and data types) with a focus on their application to empirical linguistic research, using R to explore and illustrate these ideas. |
| 4.2 Probability theory | This section covers some essential concepts from probability theory, such as the notion of probability itself, probability distributions, and expectations. |
| 4.3 Descriptive statistics | This handout provides a comprehensive, hands-on introduction to descriptive and bivariate statistics for both categorical and continuous linguistic data, including data visualisation, summary measures, correlation, and exporting tables, with practical exercises in R using real-world datasets. |
| 4.4 Hypothesis testing | This handout introduces the principles of scientific inference and null hypothesis significance testing (NHST) for both categorical and continuous linguistic data, covering hypothesis formulation, test statistics, sampling distributions, \(p\)-values, Type I/II errors, statistical power, and common misinterpretations of significance, with illustrative examples from linguistic research. |
| 4.5 Chi-squared test | The chi-squared (\(\chi^2\)) test helps determine whether there is a statistically significant association between two categorical variables. It compares the observed frequencies of categories with those expected under the null hypothesis. The \(\chi^2\) score summarises the differences between observed and expected frequencies across all cells of a contingency table. The greater these differences, the higher the \(\chi^2\) score and the lower the \(p\)-value, given the degrees of freedom. It is recommended to compute effect size measures and inspect the residuals to assess the nature of the association. |
| 4.6 t-test | This handout covers the application of \(t\)-tests for independent and paired samples, effect size calculation, and ANOVA for comparing multiple groups, with practical examples from phonetic and psycholinguistic data, including visualization, assumption checks, post-hoc testing, and exercises on statistical interpretation and reporting. |

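
As a brief illustration of the chi-squared entry (4.5) in the table above, the test statistic can be written in its standard textbook form. The symbols \(O_{ij}\), \(E_{ij}\), \(r\), \(c\), and \(N\) are notation chosen here for illustration and are not taken from the handout itself.

\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row total}_i)\,(\text{column total}_j)}{N},
\qquad
df = (r - 1)(c - 1)
\]

Here \(O_{ij}\) and \(E_{ij}\) are the observed and expected frequencies in row \(i\) and column \(j\) of an \(r \times c\) contingency table, and \(N\) is the total number of observations. Each cell contributes its squared deviation, scaled by the expected frequency, so larger gaps between observed and expected counts raise the score.
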
Machine Learning

| Title | Description |
|---|---|
| 7.1 Tree-based methods | We familiarise ourselves with powerful non-parametric models based on recursive partitioning. |
| 7.2 Gradient boosting | Gradient boosting constitutes a powerful extension of tree-based methods and is generally appreciated for its high predictive performance. Nevertheless, this family of methods, which includes implementations such as AdaBoost, XGBoost, and CatBoost, among many others, is not yet established in corpus-linguistic statistics. A practical scenario is presented to introduce the core ideas of gradient boosting, demonstrate its application to linguistic data, and point out its advantages and drawbacks. |
| 7.3 Principal Components Analysis | For linguists: |
| 7.4 Exploratory Factor Analysis | For linguists: |
| 7.5 Clustering | James et al. (2021): Chapter 12 |
