8  Data frames

Author: Vladimir Buskin
Affiliation: Catholic University of Eichstätt-Ingolstadt

8.1 Preparation

Script

You can find the full R script associated with this unit here.

8.3 Word frequencies II

Recall our corpus-linguistic data from the previous unit:

Lemma Frequency
start 418
enjoy 139
begin 337
help 281

We thought of the columns as one-dimensional, indexed lists of elements:

lemma <- c("start", "enjoy", "begin", "help")

frequency <- c(418, 139, 337, 281)

R lets us combine these two vectors into something that resembles a real spreadsheet. To this end, we pass both vectors to the data.frame() function.

data <- data.frame(lemma, frequency)

print(data)
  lemma frequency
1 start       418
2 enjoy       139
3 begin       337
4  help       281
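
To check what data.frame() has produced, a few Base R inspection functions can help. The following is a minimal sketch that rebuilds the data frame from above so that it runs on its own:

```r
# Rebuild the data frame from the two vectors
lemma <- c("start", "enjoy", "begin", "help")
frequency <- c(418, 139, 337, 281)
data <- data.frame(lemma, frequency)

dim(data)     # number of rows, then number of columns: 4 2
nrow(data)    # 4
ncol(data)    # 2
names(data)   # column names, taken from the names of the vectors
str(data)     # compact overview of every column and its type
```

The column names are inherited from the vector names, which is why the output shows lemma and frequency as headers.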

8.3.1 Essential R concepts

The variable data is no longer a vector, but a data frame (often abbreviated as ‘df’). As with vectors, each element has its own position and can, therefore, be accessed or manipulated.

Since we now have two dimensions, the subsetting notation in square brackets [ ] has to reflect that. This is the general pattern:

\[ \text{df[row, column]} \tag{8.1}\]

Say we’re looking for the element at the intersection of the first row and first column. Applying the pattern above, we can access it like so:

data[1,1]
[1] "start"

But what if we need the entire first row? We simply omit the column part. Note, however, that the comma , needs to remain:

data[1,]
  lemma frequency
1 start       418
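
The same pattern extends to several rows at once and to changing a value. The sketch below is illustrative only: data_copy and the replacement value 140 are hypothetical and not part of the original data.

```r
lemma <- c("start", "enjoy", "begin", "help")
frequency <- c(418, 139, 337, 281)
data <- data.frame(lemma, frequency)

data[c(1, 3), ]        # rows 1 and 3, all columns
data[-2, ]             # every row except the second
data[2, "frequency"]   # a single cell, with the column addressed by name

# Work on a copy so that the original data frame stays intact
data_copy <- data
data_copy[2, "frequency"] <- 140   # hypothetical value, for illustration only
data_copy[2, ]
```

Assigning to a subset overwrites only the selected cell; all other values remain unchanged.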

There are two ways to subset by columns. We can either use the square bracket notation [ ] or the column operator $:

data[,1]
[1] "start" "enjoy" "begin" "help" 
data$lemma
[1] "start" "enjoy" "begin" "help" 
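
Both calls above return a plain vector rather than a data frame. A few further Base R variants are sketched below; which one is preferable depends on whether a vector or a one-column data frame is needed afterwards.

```r
lemma <- c("start", "enjoy", "begin", "help")
frequency <- c(418, 139, 337, 281)
data <- data.frame(lemma, frequency)

data[["frequency"]]       # vector, same result as data$frequency
data[, "frequency"]       # vector, with the column addressed by name
data[, 1, drop = FALSE]   # drop = FALSE keeps a one-column data frame
data["lemma"]             # single brackets without a comma: also a data frame
```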

8.3.2 Filtering

Not all the information contained in a data frame is always relevant for our research. In those cases, it’s important to subset the rows and columns according to certain criteria.

Assume we only need those observations where the lemma frequencies are greater than 300. We can obtain those by specifying

  1. the data frame,
  2. the column of interest and
  3. the condition to apply.

You can read the code below as

Take the data frame data and subset it according to the column data$frequency. Show me those rows where the values of data$frequency are greater than 300.

data[data$frequency > 300, ]
  lemma frequency
1 start       418
3 begin       337
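
It may help to look at the condition on its own. The comparison produces one TRUE or FALSE per row, and the square brackets keep only the rows marked TRUE. In this sketch, keep is a hypothetical helper name:

```r
lemma <- c("start", "enjoy", "begin", "help")
frequency <- c(418, 139, 337, 281)
data <- data.frame(lemma, frequency)

# The condition yields a logical vector with one value per row
keep <- data$frequency > 300
keep           # TRUE FALSE TRUE FALSE

# Using that vector in the row position returns the matching rows
data[keep, ]

# TRUE counts as 1, so sum() gives the number of matching rows
sum(keep)      # 2
```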

What if we wanted to filter by lemma instead? Let’s say we’re looking for frequency data on the verbs start and help.

This will give us the row associated with start:

data[data$lemma == "start", ]
  lemma frequency
1 start       418

Combining multiple statements requires a logical operator. Here we’re using | , which corresponds to a logical ‘or’ (disjunction).

data[data$lemma == "start" | data$lemma == "help", ]
  lemma frequency
1 start       418
4  help       281

Why do we need to use “or” (|) and not “and” (&)?

The idea of combining statements naturally suggests a conjunction, which could be achieved via &. So why does R return an empty data frame if we do it that way?

data[data$lemma == "start" & data$lemma == "help", ]
[1] lemma     frequency
<0 rows> (or 0-length row.names)
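
One way to see what happens is to inspect the two conditions separately. Both operators compare the conditions row by row: & is TRUE only if both conditions hold for the same row, and no single lemma can be "start" and "help" at once. The names is_start and is_help in this sketch are hypothetical helpers:

```r
lemma <- c("start", "enjoy", "begin", "help")
frequency <- c(418, 139, 337, 281)
data <- data.frame(lemma, frequency)

is_start <- data$lemma == "start"   # TRUE FALSE FALSE FALSE
is_help  <- data$lemma == "help"    # FALSE FALSE FALSE TRUE

is_start | is_help   # TRUE FALSE FALSE TRUE: either condition suffices
is_start & is_help   # FALSE FALSE FALSE FALSE: never both in the same row

# & is the right choice when the conditions concern different columns
data[data$lemma == "start" & data$frequency > 300, ]
```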

8.3.3 I don’t like the way this looks – is there another way to filter in R?

Yes, absolutely. The subsections below demonstrate a few popular alternatives. In the end, the exact way you filter doesn’t really matter, so long as you (as well as the people who have to work with your script) can understand what you’re trying to achieve. Always consider adding comments to your filtering operations!

Almost every subsetting operation we perform with square brackets can also be performed using the subset() function. Here are some expressions that are equivalent to the ones above:

subset(data, frequency > 300)
  lemma frequency
1 start       418
3 begin       337
subset(data, lemma == "start" | lemma == "help")
  lemma frequency
1 start       418
4  help       281
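
subset() can also restrict the columns in the same call via its select argument. A brief sketch, with high_freq as a hypothetical variable name:

```r
lemma <- c("start", "enjoy", "begin", "help")
frequency <- c(418, 139, 337, 281)
data <- data.frame(lemma, frequency)

# Rows with a frequency above 300, keeping only the lemma column
high_freq <- subset(data, frequency > 300, select = lemma)
high_freq
```

Unlike data[data$frequency > 300, 1], the result here remains a data frame with a single column.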

The tidyverse ecosystem is a collection of packages specifically designed for handling typical data science tasks as comfortably and elegantly as possible, supplying many helper functions for data manipulation, transformation and visualisation. Installation instructions are provided in Section 9.2.1.

It offers some appealing alternatives to the Base R subsetting functions. Let’s generate a tidyverse-style data frame, the tibble:

library(tidyverse)

data2 <- tibble(
  lemma = c("start", "enjoy", "begin", "help"),
  frequency = c(418, 139, 337, 281)
)

print(data2)
# A tibble: 4 × 2
  lemma frequency
  <chr>     <dbl>
1 start       418
2 enjoy       139
3 begin       337
4 help        281

We can single out certain columns by using select():

select(data2, lemma)
# A tibble: 4 × 1
  lemma
  <chr>
1 start
2 enjoy
3 begin
4 help 

To keep only the rows that meet certain criteria, we can use filter():

filter(data2, frequency > 300)
# A tibble: 2 × 2
  lemma frequency
  <chr>     <dbl>
1 start       418
2 begin       337
filter(data2, lemma == "start" | lemma == "help")
# A tibble: 2 × 2
  lemma frequency
  <chr>     <dbl>
1 start       418
2 help        281
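
filter() and select() can also be chained with the pipe operator %>%, which passes the result of one step on to the next. The sketch below assumes that the dplyr package (part of the tidyverse) is installed; high_freq and no_help are hypothetical names.

```r
library(dplyr)

data2 <- tibble(
  lemma = c("start", "enjoy", "begin", "help"),
  frequency = c(418, 139, 337, 281)
)

# Keep rows with a frequency above 300, then keep only the lemma column
high_freq <- data2 %>%
  filter(frequency > 300) %>%
  select(lemma)
high_freq

# Conditions separated by a comma are combined with a logical 'and'
no_help <- filter(data2, frequency > 200, lemma != "help")
no_help
```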

An extensive guide to the main tidyverse functions is provided in Chapter 3 of the free eBook R For Data Science (2nd edition).

8.4 Exercises

Solutions

You can find the solutions to the exercises here.

Exercise 8.1 Recreate the barplot from the previous unit by subsetting the data variable accordingly.

Exercise 8.2 Print the following elements by subsetting the data frame data accordingly.

  • 337

  • begin

  • enjoy

  • enjoy 139

  • the entire frequency column

Exercise 8.3 Extension of Exercise 7.3. Verify that the following verbs are represented in the lemma column: enjoy, hit, find, begin. If they are in the data frame, print their frequency information.

Exercise 8.4 Extension of Exercise 7.4. Use which() to find the rows where the frequency is greater than 200, and then print the lemma and frequency of only those rows.