# Load libraries
library("tidyverse")
library("readxl")
library("tree") # for CART
library("randomForest") # for traditional random forests
library("party") # for Conditional Inference Trees
library("pdp") # for partial dependence plots
## Reaction time data
ELP <- read_xlsx("data/ELP.xlsx")
ELP$POS <- as.factor(ELP$POS)
## Levshina's (2020) data set on T/V forms in Russian
tv <- read.csv("data/Levshina_2020_tv_data.csv", sep = ",", header = TRUE, stringsAsFactors = TRUE)
25 Decision trees and random forests
25.1 Recommended reading
For linguists:
Levshina (2015): Chapter 14
Levshina (2020)
Gries (2021): Chapter 7
General:
James et al. (2021): Chapter 8
Hastie, Tibshirani, and Friedman (2017): Chapters 9.2 & 15
25.2 Introduction
Decision trees and random forests are very popular non-parametric methods. As such, “they do not make explicit assumptions about the functional form of \(f\)” (James et al. 2021: 23).
In this unit, we will cover the conceptual basics of these methods as well as their implementation in R using the tv data from Levshina (2020) in addition to the ELP data from the unit on Linear Regression. The libraries and data sets we will need are loaded at the top of this unit.
In the tv data frame, our target variable will be the T/V form (Form) with the two outcomes ty (Russian 2.p.sg., informal) and vy (Russian 2.p.pl., polite).
str(tv)
head(tv)
25.3 Decision trees
Core concepts:
Segmenting the feature space: “[T]he feature space (i.e., the space spanned by all predictor variables) is recursively partitioned into a set of rectangular areas” (Strobl, Malley, and Tutz 2009: 325); a toy sketch of such a partition follows this list.
Impurity reduction: These simplified prediction areas should consist of mostly homogeneous (i.e., ‘pure’ rather than ‘mixed’) observations.
Tree construction: The ‘decisions’ made when partitioning the training data can be visualised using tree structures. The nodes of a tree represent variables, the branches represent decision rules, and leaf nodes indicate the final outcome (e.g., a prediction).
CART: The original computational implementation of decision trees is known as the CART (Classification and Regression Trees) algorithm developed by Breiman (1984).
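To make the idea of rectangular partitioning concrete, the sketch below hard-codes a hypothetical partition of the (Length, Freq) plane as nested if/else rules; the cut-off values and area labels are purely illustrative and not taken from any fitted model.
# Toy partition: two splits cut the feature space into three rectangular areas
toy_partition <- function(Length, Freq) {
  if (Freq < 5) {
    if (Length < 8) "area 1 (infrequent, short)" else "area 2 (infrequent, long)"
  } else {
    "area 3 (frequent)"
  }
}
toy_partition(Length = 10, Freq = 2) # falls into area 2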
25.3.1 Classification trees
If we are dealing with a categorical response variable, the tree() function can be used to fit a classification tree in accordance with Breiman’s CART algorithm. For illustration, consider the tv data frame. We will model the choice of the pronoun Form based on the speaker’s and hearer’s social circle (Rel_Circle) and their difference in social class (Rel_Class).
# Set random number generator for reproducibility
set.seed(123)
# Supply model formula
tree.tv <- tree(Form ~ Rel_Circle + Rel_Class, data = tv)
# View tree statistics
summary(tree.tv)
Classification tree:
tree(formula = Form ~ Rel_Circle + Rel_Class, data = tv)
Number of terminal nodes: 9
Residual mean deviance: 0.974 = 213.3 / 219
Misclassification error rate: 0.2412 = 55 / 228
# Visualisation
plot(tree.tv)
text(tree.tv, pretty = 3)
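To see how well the tree classifies the training data, its class predictions can be compared with the observed forms; this is a quick check rather than part of the original analysis, and the resulting accuracy is simply 1 minus the misclassification error rate reported above.
# Class predictions for the training data
tree.pred <- predict(tree.tv, newdata = tv, type = "class")
# Proportion of correctly classified observations
mean(tree.pred == tv$Form)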
An important problem that arises during tree construction is that of split selection. When should the tree split a node into two further nodes and when not? Furthermore, when should the tree stop the splitting process entirely? In this respect, CART relies on the principle of impurity reduction: “The fundamental idea is to select each split of a subset so that the data in each of the descendent subsets are ‘purer’ than the data in the parent subset” (Breiman 1984: 23). A measure for node purity is the Gini index \(G\), which is defined as
\[ G = \sum_{k=1}^{K}{\hat{p}_{mk}(1-\hat{p}_{mk}),} \]
where \(\hat{p}_{mk}\) measures the proportion of observations of response level \(k\) in the \(m\)th prediction area of the training data set. Values close to 0 are indicative of high node purity, meaning that most observations belong to the same class (e.g., Form = ty). If splitting a node no longer leads to a substantial increase in purity, it becomes a terminal node, i.e., it is not split further. This terminal node returns the tree’s class prediction.
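To make the formula more concrete, the Gini index of a single node can be computed by hand from its class proportions; the helper function below is purely illustrative and not part of the tree package.
# Gini index of a node, given its class proportions
gini <- function(p) sum(p * (1 - p))
gini(c(1, 0))                     # perfectly pure node: G = 0
gini(c(0.5, 0.5))                 # maximally mixed node (two classes): G = 0.5
gini(prop.table(table(tv$Form)))  # impurity of the root node before any split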
It is worth noting that modern CART implementations rely on different splitting criteria. For instance, Conditional Inference Trees use the \(p\) values of internal association tests to identify which variables warrant further subdivision of the training data. The presence or absence of a significant association thus also determines whether or not a given node will be terminal (for more details, see B. Greenwell 2022: 122).
# Fitting a conditional inference tree
ctree.tv <- ctree(Form ~ ., data = tv) # dot . means 'include all predictors'
plot(ctree.tv)
25.3.2 Regression trees
Regression trees are used for continuous response variables. Instead of providing class predictions, they return the mean value of observations in a given prediction area. The algorithm now strives to minimize the residual sum of squares (\(RSS\)). Consider the regression tree for reaction times depending on word length, frequency and part of speech:
# CART tree
tree.rt <- tree(RT ~ Length + Freq + POS, data = ELP)
summary(tree.rt)
Regression tree:
tree(formula = RT ~ Length + Freq + POS, data = ELP)
Variables actually used in tree construction:
[1] "Freq" "Length"
Number of terminal nodes: 10
Residual mean deviance: 8629 = 7507000 / 870
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-220.10 -59.09 -11.17 0.00 49.66 397.40
plot(tree.rt)
text(tree.rt, pretty = 0)
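The splits shown above were selected so as to minimise the RSS. As a toy illustration of this criterion, the RSS of a single candidate split on word length can be computed by hand; the cut-off of 8 letters is chosen arbitrarily for the sketch.
# RSS of one candidate split: predict the mean RT within each resulting subset
left  <- ELP$RT[ELP$Length < 8]
right <- ELP$RT[ELP$Length >= 8]
sum((left - mean(left))^2) + sum((right - mean(right))^2)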
# Conditional inference tree
ctree.rt <- ctree(RT ~ Length + Freq + POS, data = ELP)
plot(ctree.rt)
25.4 Random forests
Random forests (Breiman 2001) belong to the class of ensemble methods because they combine simpler models (e.g., individual decision trees) into a more complex and potentially more accurate model. As part of the RF algorithm, a large number of decision trees are trained on bootstrapped samples of the training data.
Up to this point, random forests are essentially identical to bagging (bootstrap aggregation); however, an important additional characteristic of the RF algorithm is that only a random subset of the predictors is taken into consideration at each split. According to Strobl et al. (2009: 332), the resulting variability in tree structure is advantageous: “By combining the prediction of such a diverse set of trees, ensemble methods utilize the fact that classification trees are unstable, but, on average, produce the right prediction”.
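The two key ingredients, bootstrapped samples and random predictor subsets, can be illustrated directly in R. The sketch below is purely illustrative: it draws one bootstrap sample and one random set of four predictors, mirroring the mtry = 4 setting used for the classification forest later in this unit.
# One bootstrap sample: row indices drawn with replacement
set.seed(123)
boot_rows <- sample(nrow(tv), replace = TRUE)
# Roughly a third of the observations are left 'out of bag'
mean(!(seq_len(nrow(tv)) %in% boot_rows))
# At each split, only a random subset of the predictors is considered
sample(setdiff(names(tv), "Form"), size = 4)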
25.4.1 Regression forest
For regression tasks, random forests return the average prediction of all trees in the ensemble.
# For regression
rt.rf.reg <- randomForest(RT ~ Length + Freq + POS, data = ELP,
                          mtry = 1, # = sqrt(number of variables)
                          ntree = 500) # number of trees
rt.rf.reg
Call:
randomForest(formula = RT ~ Length + Freq + POS, data = ELP, mtry = 1, ntree = 500)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 1
Mean of squared residuals: 8972.927
% Var explained: 43.64
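Both figures are based on the model’s out-of-bag predictions, which predict() returns when it is called without new data; the quick check below is not part of the original output.
# OOB predictions for the training data
oob.pred <- predict(rt.rf.reg)
# 1 - MSE/Var(RT), i.e. (approximately) the '% Var explained' reported above
1 - mean((ELP$RT - oob.pred)^2) / var(ELP$RT)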
# Conditional random forest
rt.crf.reg <- cforest(RT ~ Length + Freq + POS, data = ELP,
                      controls = cforest_unbiased(ntree = 500, mtry = 1))
25.4.2 Classification forest
For classification, all trees cast a vote for one of the response classes. The OOB estimate of the error rate is based on out-of-bag (OOB) predictions: for each tree, the bootstrapping procedure leaves roughly a third of the training data unused. These observations, which were not used for fitting that tree, can serve as an internal test data set, and predictions based on them are called OOB predictions.
# For classification
tv.rf.class <- randomForest(Form ~ ., data = tv,
                            mtry = 4,
                            ntree = 500)
tv.rf.class
Call:
randomForest(formula = Form ~ ., data = tv, mtry = 4, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 18.86%
Confusion matrix:
ty vy class.error
ty 86 22 0.2037037
vy 21 99 0.1750000
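The confusion matrix above is built from the OOB predictions stored in the fitted object; as a quick check, it can be reproduced by hand.
# OOB class predictions stored in the randomForest object
table(observed = tv$Form, predicted = tv.rf.class$predicted)
# Overall OOB accuracy = 1 - OOB error rate
mean(tv.rf.class$predicted == tv$Form)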
# Conditional random forest
tv.crf.class <- cforest(Form ~ ., data = tv,
                        controls = cforest_unbiased(ntree = 500, mtry = 4))
tv.crf.class
Random Forest using Conditional Inference Trees
Number of trees: 500
Response: Form
Inputs: Film, Rel_Age, Rel_Sex, Rel_Power, Rel_Circle, S_Class, H_Class, S_Age, H_Age, Rel_Class, Before68, Others, Office, S_Sex, H_Sex, Place
Number of observations: 228
25.4.3 Variable importance
Random forests allow users to assess whether or not certain predictors are useful for the model. The Gini index can be re-used to identify those variables that have led to the greatest reduction in impurity. However, this measure is biased towards predictors with many values (cf. Strobl et al. 2007).
# Gini importance (Reaction times)
varImpPlot(rt.rf.reg)
# Gini importance (Form of 2.p.)
varImpPlot(tv.rf.class)
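The raw scores behind these plots can be retrieved with importance(); for the regression forest the relevant column is IncNodePurity (based on the reduction in RSS), for the classification forest it is MeanDecreaseGini.
# Numeric importance scores underlying the plots
importance(rt.rf.reg)
importance(tv.rf.class)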
A more robust measure is (Conditional) Permutation Accuracy Importance, which compares the predictive accuracy of the random forest model before and after randomly permuting the values of a predictor (cf. Strobl et al. 2008; Debeer and Strobl 2020).
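The logic can be sketched by hand for a single predictor: permute its values, re-predict, and record the drop in accuracy. Unlike the actual algorithm, which works tree by tree on the out-of-bag observations, this simplified sketch uses predictions for the full training data and is for illustration only.
# Permutation importance 'by hand' for Rel_Circle (simplified, not the OOB-based version)
set.seed(123)
tv.perm <- tv
tv.perm$Rel_Circle <- sample(tv.perm$Rel_Circle) # break the association with Form
acc.orig <- mean(predict(tv.rf.class, newdata = tv) == tv$Form)
acc.perm <- mean(predict(tv.rf.class, newdata = tv.perm) == tv$Form)
acc.orig - acc.perm # drop in accuracy after permutation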
# Conditional permutation accuracy importance
library("permimp")
# Refit RF model with additional parameters
tv.rf.class <- randomForest(Form ~ .,
                            data = tv,
                            mtry = 4,
                            ntree = 500,
                            keep.inbag = TRUE,
                            keep.forest = TRUE)
# Compute CPI scores
tv.rf.permimp <- permimp(tv.rf.class, conditional = TRUE, progressBar = FALSE, threshold = .95) # Choose "Yes" in the console
# Plot CPI scores
plot(tv.rf.permimp, horizontal = TRUE, type = "dot", sort = TRUE)
25.4.4 Visualising random forest models
Partial dependence plots provide averaged predictions \(\hat{y}\) for a given constellation of predictors. These averages are produced by the partial dependence function \(f_k\). If a categorical response variable has \(K\) possible values, it has the form \[ f_k(X) = \log [p_k (x)] - \frac{1}{K}\sum_{k'=1}^{K} \log [p_{k'} (x)], \] with \(p_k(x)\) corresponding to the fitted probability of the \(k\)th level of the response variable for \(k \in \{1, 2, ..., K\}\) (B. M. Greenwell 2017: 430; see also Hastie, Tibshirani, and Friedman 2017).
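Conceptually, the partial dependence of the prediction on one variable is obtained by fixing that variable at a given value for all observations and averaging the model’s predictions; for a continuous response this is simply the mean predicted value. The by-hand sketch below illustrates this for the regression forest, with Length fixed at the arbitrary value of 8.
# Partial dependence 'by hand': fix Length at 8 for every observation ...
ELP.fixed <- ELP
ELP.fixed$Length <- 8
# ... and average the forest's predictions over the modified data set
mean(predict(rt.rf.reg, newdata = ELP.fixed))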
# Form ~ Rel_Circle
Rel_Circle.partial <- pdp::partial(tv.rf.class, pred.var = "Rel_Circle", which.class = "ty")
Rel_Circle.partial %>%
  ggplot(aes(x = Rel_Circle, y = yhat, group = 1)) +
  geom_point(col = "steelblue") +
  geom_line(col = "steelblue") +
  theme_minimal() +
  labs(
    title = "Probability of 'ty' (2.p.sg.) depending on social circle",
    y = "Log odds of 'ty'"
  )
# RT ~ POS
pos.partial <- pdp::partial(rt.rf.reg, pred.var = "POS")
pos.partial %>%
  ggplot(aes(x = POS, y = yhat, group = 1)) +
  geom_point(col = "steelblue") +
  geom_line(col = "steelblue") +
  theme_minimal() +
  labs(
    title = "Predicted reaction times by POS",
    y = "Predicted reaction time"
  )
# RT ~ Length
length.partial <- pdp::partial(rt.rf.reg, pred.var = "Length")
length.partial %>%
  ggplot(aes(x = Length, y = yhat)) +
  geom_line(col = "steelblue") +
  theme_minimal() +
  labs(
    title = "Predicted reaction times by word length",
    y = "Predicted reaction time"
  )