
2.4 Data frames

Author: Vladimir Buskin
Affiliation: Catholic University of Eichstätt-Ingolstadt

Abstract
This unit introduces data frames in R, covering their creation, subsetting, and filtering (in both base R and the tidyverse), and includes practice exercises on accessing and manipulating structured linguistic data.

Preparation

Script

You can find the full R script associated with this unit here.

Recommended reading

Winter (2020): Chapter 1.10-1.16

Suggested video tutorials:

Using the Data Frame in R (DataCamp, 5min)

Learn How to Subset, Extend & Sort Data Frames in R (DataCamp, 7min)

Word frequencies II

Recall our simple linguistic dataset from the previous unit:

Table 1: Verb lemma frequencies

Verb    Frequency
start         418
enjoy         139
begin         337
help          281

We thought of the columns as one-dimensional, indexed lists of elements:

lemma <- c("start", "enjoy", "begin", "help")

frequency <- c(418, 139, 337, 281)

In fact, R allows us to combine these two vectors into a single spreadsheet-like table. To this end, we apply the data.frame() function to two (or more) vectors of our choice. Note that they must all have the same length:

data <- data.frame(lemma, frequency)

print(data)
  lemma frequency
1 start       418
2 enjoy       139
3 begin       337
4  help       281

Essential R concepts

The variable data is no longer a vector, but a data frame (often abbreviated as ‘df’). Once again, each element carries its own label and can therefore be accessed or manipulated individually.
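Before we proceed, it is worth verifying this new structure. A quick sketch using standard base R inspection functions (str(), nrow() and names()):

str(data)
'data.frame':   4 obs. of  2 variables:
 $ lemma    : chr  "start" "enjoy" "begin" "help"
 $ frequency: num  418 139 337 281
nrow(data)
[1] 4
names(data)
[1] "lemma"     "frequency"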

Since data frames are two-dimensional objects, the subsetting notation in square brackets [ ] needs to reflect that. This is the general pattern:

df[row, column]    (1)

Say we’re looking for the element at the intersection of the first row and the first column. Applying the pattern above, we can access it like so:

data[1,1]
[1] "start"

But what if we needed the entire first row? We’d simply omit the column part. Note, however, that the comma , needs to remain:

data[1,]
  lemma frequency
1 start       418
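As a side note, base R also accepts negative indices, which exclude the specified rows rather than select them:

data[-1, ]
  lemma frequency
2 enjoy       139
3 begin       337
4  help       281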

For subsetting by columns, we have two options: the square bracket notation [ ] or the $ operator, which extracts a column by its name:

data[,1]
[1] "start" "enjoy" "begin" "help" 
data$lemma
[1] "start" "enjoy" "begin" "help" 

Filtering

Not all the observations contained in a data frame are necessarily relevant for our research. In such cases, it may be important to subset the rows and columns according to certain criteria.

Assume we only need those observations where the lemma frequencies are greater than 300. We can filter the dataset accordingly by specifying

  1. the data frame,
  2. the column of interest, and
  3. the condition to apply to the rows.

You can read the code below as

‘Take the data frame data and subset it according to the column data$frequency. Show me those rows where the values of data$frequency are greater than 300.’

data[data$frequency > 300, ]
  lemma frequency
1 start       418
3 begin       337
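What exactly happens under the hood here? The condition on its own evaluates to a logical vector with one TRUE or FALSE per row; the square brackets then keep exactly those rows that are TRUE:

data$frequency > 300
[1]  TRUE FALSE  TRUE FALSE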

What if we wanted to filter by lemma instead? To make it more concrete, assume we’re looking for frequency data on the verbs start and help (but not on begin and enjoy).

We can start by accessing the rows containing data on start:

data[data$lemma == "start", ]
  lemma frequency
1 start       418

Next, we add a second, analogous condition. Combining multiple statements requires a logical operator. In this code chunk, we’re using |, which corresponds to a logical ‘or’ (also known as a ‘disjunction’).

data[data$lemma == "start" | data$lemma == "help", ]
  lemma frequency
1 start       418
4  help       281
Why do we need to use “or” (|) and not “and” (&)?

The idea of combining statements somewhat naturally suggests a conjunction, which could be achieved via &. How come R doesn’t return anything if we do it that way?

data[data$lemma == "start" & data$lemma == "help", ]
[1] lemma     frequency
<0 rows> (or 0-length row.names)

This looks unintuitive – is there another way to filter in R?

Yes, absolutely. The callouts below demonstrate a few popular alternatives. In the end, the exact way you filter doesn’t really matter, so long as you (as well as the people who have to work with your script) can understand what you’re trying to achieve with your code. Always make sure to add comments to your filtering operations!

subset()

Almost every subsetting operation we perform with square brackets can also be performed using the subset() function. Here are some expressions equivalent to the ones above:

subset(data, frequency > 300)
  lemma frequency
1 start       418
3 begin       337
subset(data, lemma == "start" | lemma == "help")
  lemma frequency
1 start       418
4  help       281
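A related shorthand is the %in% operator, which condenses several equality checks into a single condition and works with square brackets and subset() alike:

data[data$lemma %in% c("start", "help"), ]
  lemma frequency
1 start       418
4  help       281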
tidyverse

The tidyverse ecosystem is a collection of packages specifically designed to handle typical data science tasks as comfortably and elegantly as possible, supplying countless helper functions for data manipulation, transformation and visualisation. Installation instructions are provided in 2.5 Libraries.

An extensive guide to the main tidyverse functions is provided in Chapter 3 of the free eBook R for Data Science (2nd edition). Due to its clarity, most of the more advanced code in this reader will draw on tidyverse syntax.

Let’s generate a tidyverse-style data frame, known as a tibble:

library(tidyverse)

data2 <- tibble(
  lemma = c("start", "enjoy", "begin", "help"),
  frequency = c(418, 139, 337, 281)
)

print(data2)
# A tibble: 4 × 2
  lemma frequency
  <chr>     <dbl>
1 start       418
2 enjoy       139
3 begin       337
4 help        281

We can single out certain columns with select():

select(data2, lemma)
# A tibble: 4 × 1
  lemma
  <chr>
1 start
2 enjoy
3 begin
4 help 

It is very easy to filter the data frame according to certain criteria:

filter(data2, frequency > 300)
# A tibble: 2 × 2
  lemma frequency
  <chr>     <dbl>
1 start       418
2 begin       337
filter(data2, lemma == "start" | lemma == "help")
# A tibble: 2 × 2
  lemma frequency
  <chr>     <dbl>
1 start       418
2 help        281

The tidyverse features a special pipe operator %>% that passes the output of one function on to the next. It can be read roughly as ‘and then’. The code above can be rewritten in pipe notation as follows:

# Read as: "Take data2 and select the column with the name 'lemma'."
data2 %>% 
  select(lemma)

# Read as: "Take data2 and show me those rows where frequency is greater than 300."
data2 %>% 
  filter(frequency > 300)

# Read as: "Take data2 and show me those rows that correspond to the lemma  'start' or 'help' or both."
data %>% 
  filter(lemma == "start" | lemma == "help")
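The real payoff comes when several operations are chained. As a brief sketch, we can filter and select in a single pipeline:

# Read as: "Take data2, keep the rows where frequency is greater than 300, then show only the lemma column."
data2 %>% 
  filter(frequency > 300) %>% 
  select(lemma)
# A tibble: 2 × 1
  lemma
  <chr>
1 start
2 begin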

Exercises

Solutions

You can find the solutions to the exercises here.

Tier 1

Exercise 1 Recreate the barplot from the previous unit by subsetting the data variable accordingly.

Exercise 2 Print the following elements by subsetting the data frame data accordingly.

  • 337

  • begin

  • enjoy

  • enjoy 139

  • the entire frequency column

Tier 2

Exercise 3 (Extension of Ex. 3 from Vectors) Verify that the following verbs are represented in the lemma column: enjoy, hit, find, begin. If they are in the data frame, print their frequency information.

Exercise 4 (Extension of Ex. 4 from Vectors) Use which() to find the rows where the frequency is greater than 200, and then print the lemma and frequency of those rows only.

Tier 3

Exercise 5 Diachronic corpora comprise data on language use across different time periods. This data frame indicates the frequencies of certain modal verbs across two time periods:

modals_df <- data.frame(
  modal = c("can", "could", "may", "might", "must", "shall", "should", "will", "would"),
  period1 = c(128, 68, 55, 21, 44, 19, 35, 85, 97),
  period2 = c(142, 83, 41, 30, 39, 12, 52, 94, 119)
)
  • Find the most and least frequent modal verb in each time period.

  • Calculate the percentage change in frequency for each modal verb between period1 and period2.

  • Create a new column trend with the values "increasing" and "decreasing" based on whether the frequency increased or decreased across periods.

Exercise 6 (Extension of Ex. 8 in Vectors.) Write a function that performs part-of-speech (POS) annotation on the sentence The quick brown fox jumps over the lazy dog. Here are a few code snippets to help you get started:

  • You can split up sentences into tokens using tokenize_words() from the tokenizers library.
library(tokenizers)
library(tidyverse)

text <- "Colorless green ideas sleep furiously."
text_tokenized <- tokenize_words(text)

# To lowercase the tokens
tokens_lower <- tolower(text_tokenized[[1]])
  • Vectors can have name attributes:
word <- "read"

# Give it a name
names(word) <- "verb"

# Get rid of its name
word <- unname(word)
  • There are several ways to apply conditional logic:
things <- c("apple", "cherry", "pear", "cucumber", "coconut")
fruits <- c("apple", "cherry", "pear")
vegetables <- c("cabbage", "carrot", "cucumber")

# Base R
food_analysis <- ifelse(things %in% fruits, "fruit", "not_fruit")

# Tidyverse
food_analysis2 <- case_when(
  things %in% fruits ~ "yes", # if elements from "things" are in "fruits", print "yes", else
  TRUE ~ "no"                 # print "no" (default)
)
  • If multiple conditions should be checked, the statements/cases have to be nested appropriately:
# Base R
complex_food_analysis <- ifelse(things %in% fruits, "fruit",
       ifelse(things %in% vegetables, "vegetable",
              "unknown"))

# Tidyverse
complex_food_analysis2 <- case_when(
  things %in% fruits ~ "fruit",         # if elements from "things" are in "fruits", print "fruit", else
  things %in% vegetables ~ "vegetable", # if they're in "vegetables", print "vegetable", else
  TRUE ~ "unknown"                      # print "unknown" (default)
)

References

Winter, Bodo. 2020. Statistics for Linguists: An Introduction Using R. New York; London: Routledge.