7 Vectors
7.1 Preparation
You can find the full R script associated with this unit here.
7.2 Recommended reading
Winter (2020): Chapter 1.1–1.9
Suggested video tutorial:
How to Create and Name Vectors in R (DataCamp; 5min)
7.3 Word frequencies I
You are given the following token counts of English verb lemmas in the International Corpus of English.
Lemma | Frequency |
---|---|
start | 418 |
enjoy | 139 |
begin | 337 |
help | 281 |
While this table is small and easy to interpret, it is still a good idea to give readers a simple visual representation of the frequency distribution (e.g., a barplot). Conveniently, R provides an abundance of plotting functions. To use them, all we need to do is tell R which data we want to visualise. We can supply the data either
1. by manually listing all the elements of interest or
2. automatically by importing it from an existing spreadsheet file (e.g., from Microsoft Excel).
For now, we will stick to option 1 and move on to option 2 in a later unit (cf. Section 10.2.2).
7.3.1 Storing data in R
To create a two-dimensional plot, we will first need to generate two objects in R: one for the individual lemmas and one for the frequency counts.
Let’s start by combining the lemmas start, enjoy, begin and help into an object lemma using R’s c() function. Enter the following line into a new R script and click on Run (or simply press Ctrl+Enter/Cmd+Enter).
lemma <- c("start", "enjoy", "begin", "help")
To make sure this worked, we can apply the print() function to lemma to view the elements it holds:
print(lemma)
[1] "start" "enjoy" "begin" "help"
Naturally, it is also possible to combine numeric information with c().
frequency <- c(418, 139, 337, 281)
The print() function allows us to inspect the contents of frequency:
print(frequency)
[1] 418 139 337 281
Letters and numbers represent two distinct data types in R. Anything that should be understood as a simple sequence of letters must be enclosed in quotation marks "...". A linguistic item such as start will be evaluated as a string if it’s encoded as "start".
Numbers, by contrast, appear without quotation marks and are stored as numeric values.
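A quick way to check how R has stored an object is class(). The sketch below re-creates the two vectors from this section so that it runs on its own; the object name mixed is hypothetical and only serves to show what happens when strings and numbers end up in the same vector.

```r
# Re-create the two vectors from this section
lemma <- c("start", "enjoy", "begin", "help")
frequency <- c(418, 139, 337, 281)

class(lemma)      # "character": a vector of strings
class(frequency)  # "numeric": a vector of numbers

# A vector can hold only one data type. If strings and numbers are mixed,
# R converts the numbers to strings without warning.
mixed <- c("start", 418)  # 'mixed' is a hypothetical name for illustration
class(mixed)      # "character"
```

If a column of counts unexpectedly shows up with quotation marks around its values, a stray string in the data is a likely culprit.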
7.3.2 Creating the barplot
Our linguistic data is now stored in two variables, lemma and frequency, which you can conceptualise as virtual container-like objects. These ‘containers’ now appear in the Environment tab in the top right corner of your RStudio interface.
The combination of categorical labels and numeric information renders our data ideally suited for a barplot. R’s most basic barplot function (which is, unsurprisingly, called barplot()) needs at the very least …
- a height argument, i.e., our y-axis values, and
- a names.arg argument, i.e., our x-axis labels.
barplot(frequency, names.arg = lemma)
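In the call above, frequency is matched to height by position because height is the first argument of barplot(). The following sketch spells out both argument names and first checks that the two vectors have the same number of elements, so that every bar gets a label. The vectors are re-created here so that the code runs on its own.

```r
lemma <- c("start", "enjoy", "begin", "help")
frequency <- c(418, 139, 337, 281)

# Each bar needs exactly one label, so both vectors should be equally long
length(lemma) == length(frequency)  # TRUE

# Same plot as before, with the height argument named explicitly
barplot(height = frequency, names.arg = lemma)
```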
After some tinkering, our plot looks more presentable:
barplot(frequency, names.arg = lemma,
        main = "Frequency of Lemmas", # title
        xlab = "Lemmas",              # label for x-axis
        ylab = "Frequency",           # label for y-axis
        col = "steelblue")            # color
In R, everything that follows the hash sign # on a line is treated as a comment and won’t be evaluated by the R interpreter. While comments don’t affect the output of our code in the slightest, they are crucial to any kind of programming project.
Adding prose annotations will make your code easier to understand, not only for others but also for your future self. Poor documentation is a common, yet unnecessary source of frustration for all parties involved …
In RStudio, you now have the option to save the plot to your computer. Once the figure has appeared in your “Plots” panel, you can click on “Export” in the menu bar below and proceed to choose the desired output format and file directory.
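The export can also be scripted, which is handy when a figure has to be regenerated later. The sketch below is one possible approach using base R’s pdf() graphics device. The file name lemma_barplot.pdf and the object out_file are hypothetical, and the figure is written to a temporary folder so that nothing in your project is overwritten.

```r
lemma <- c("start", "enjoy", "begin", "help")
frequency <- c(418, 139, 337, 281)

# Hypothetical output path; replace tempdir() with a folder of your choice
out_file <- file.path(tempdir(), "lemma_barplot.pdf")

pdf(out_file, width = 7, height = 5)  # open a PDF device (size in inches)
barplot(frequency, names.arg = lemma,
        main = "Frequency of Lemmas",
        xlab = "Lemmas",
        ylab = "Frequency",
        col = "steelblue")
dev.off()                             # close the device to finish the file

file.exists(out_file)                 # check that the file was created
```

Other devices such as png() follow the same open, plot, close pattern.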
7.3.3 Essential R concepts
The example above demonstrates one of the most important data structures in R: vectors. They form the cornerstone of various more complex objects such as data frames, and are essential to handling large data sets (e.g., corpora). And yet, vectors are very simple in that they are merely one-dimensional sequences of characters or numbers — no more, no less.
print(lemma)
[1] "start" "enjoy" "begin" "help"
print(frequency)
[1] 418 139 337 281
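Because a vector is a single object, many functions operate on all of its elements at once. As a small illustration (not part of the original script), the sketch below counts the elements of frequency, adds them up, and converts the counts to percentages. The name rel_freq is hypothetical.

```r
frequency <- c(418, 139, 337, 281)

length(frequency)  # number of elements: 4
sum(frequency)     # total number of tokens: 1175

# Arithmetic is applied to every element in turn
rel_freq <- round(frequency / sum(frequency) * 100, 1)  # hypothetical name
print(rel_freq)
```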
The individual elements in these two vectors are not floating around randomly in virtual space, but follow a clear order. Each element comes with an “ID” (or index), by which it can be accessed. For example, if we want to print the first lemma in our lemma variable, we append square brackets [ ] containing the index to it. This allows us to subset the vector.
lemma[1]
[1] "start"
Similarly, we can subset frequency according to, for example, its third element:
frequency[3]
[1] 337
It is also possible to obtain entire ranges of elements, such as everything from the second to the fourth element:
frequency[2:4]
[1] 139 337 281
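Square brackets accept more than single positions and ranges. The sketch below shows a few further patterns that base R supports, using the two vectors from this unit (re-created here so that the code runs on its own).

```r
lemma <- c("start", "enjoy", "begin", "help")
frequency <- c(418, 139, 337, 281)

# Non-adjacent positions: wrap the indices in c()
frequency[c(1, 3)]     # 418 337

# A negative index drops the element at that position
lemma[-2]              # "start" "begin" "help"

# length() gives the position of the last element
lemma[length(lemma)]   # "help"
```

Note that indexing in R starts at 1, so lemma[1] is the first element.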
7.4 Exercises
You can find the solutions to the exercises here.
Exercise 7.1 Create a vector that lists the third person personal pronouns of English (subject and object forms). Store them in a variable pp3.
Exercise 7.2 Now print …
- … the fourth element in pp3.
- … elements 3 through 5.
- … all elements.
- … elements 1, 3 and 5.
Exercise 7.3 When working with large datasets, we often don’t know whether an element is in the vector to begin with, let alone its position. For instance, if we wanted to check whether they is in pp3 or not, we could use the handy notation below, which returns a TRUE or FALSE value:
"they" %in% pp3
Ascertain whether the following items are in pp3:
- him
- you
- it and them
- we, us and me
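If you would like to try out %in% before tackling Exercise 7.3, here is a minimal sketch based on the lemma vector from Section 7.3.1 rather than pp3. The item "stop" is simply a made-up example of a word that is not in the vector.

```r
lemma <- c("start", "enjoy", "begin", "help")

"begin" %in% lemma            # TRUE
"stop" %in% lemma             # FALSE

# Several items can be tested at once; the result has one value per item
c("help", "stop") %in% lemma  # TRUE FALSE
```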
Exercise 7.4 Once we are sure that an element is in the vector of interest, another common problem that arises is finding its location. In this case, we can use which() to return the index of an element.
which(pp3 == "they")
You can read the code above as “Which element in pp3 is they?”. Note that the index number depends on the order of elements you’ve chosen when creating pp3.
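The same pattern can be tried on the vectors from Section 7.3.1. The sketch below looks up the position of a lemma and then uses that position to retrieve the matching frequency. The name pos is a hypothetical helper variable.

```r
lemma <- c("start", "enjoy", "begin", "help")
frequency <- c(418, 139, 337, 281)

which(lemma == "begin")         # 3

# Store the position and reuse it to subset the other vector
pos <- which(lemma == "begin")  # 'pos' is a hypothetical name
frequency[pos]                  # 337

# If there is no match, which() returns an empty result
which(lemma == "stop")          # integer(0)
```

This works because lemma and frequency list their elements in the same order.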
Find the locations of it and them in pp3!
Exercise 7.5 Consider the vector numbers.
numbers <- c(500:1000)
- What does the following code do? How does the output change when you subset numbers according to this expression?
which(numbers > 600)
- Describe the output of these code chunks:
numbers[numbers != 500]
numbers[numbers > 500 & numbers < 550]
numbers[numbers < 510 | numbers > 990]