Statistics for Corpus Linguists
  • Overview
  • Fundamentals
    • 1.1 Basics
    • 1.2 Linguistic variables
    • 1.3 Research questions
    • 1.4 Set theory and mathematical notation
  • Introduction to R
    • 2.1 First steps
    • 2.2 Exploring R Studio
    • 2.3 Vectors
    • 2.4 Data frames
    • 2.5 Libraries
    • 2.6 Importing/Exporting
  • NLP
    • 3.1 Concordancing
    • 3.2 Regular expressions
    • 3.3 The CQP interface
    • 3.4 Data annotation
  • Statistics
    • 4.1 Data, variables, samples
    • 4.2 Probability theory
    • 4.3 Descriptive statistics
    • 4.4 Hypothesis testing
    • 4.5 Chi-squared test
    • 4.6 t-test
  • Models
    • 6.1 Linear regression
    • 6.2 Logistic regression
    • 6.3 Mixed-effects regression
    • 6.4 Poisson regression
    • 6.5 Ordinal regression
  • Machine Learning
    • 7.1 Tree-based methods
    • 7.2 Gradient boosting
    • 7.3 PCA
    • 7.4 EFA
    • 7.5 Clustering
  1. 7. Machine Learning
  2. 7.1 Tree-based methods
  • 7. Machine Learning
    • 7.1 Tree-based methods
    • 7.2 Gradient boosting
    • 7.3 PCA
    • 7.4 EFA
    • 7.5 Clustering

On this page

  • Recommended reading
  • Introduction
  • Decision trees
    • Classification trees
    • Regression trees
  • Random forests
    • Regression forest
    • Classification forest
    • Variable importance
    • Visualising random forest models
  1. 7. Machine Learning
  2. 7.1 Tree-based methods

7.1 Tree-based methods

Author
Affiliation

Vladimir Buskin

Catholic University of Eichstätt-Ingolstadt

Abstract
We familiarise ourselves with powerful non-parametric models based on recursive partitioning.

Recommended reading

For linguists:

Levshina (2015): Chapter 14

Levshina (2020)

Gries (2021): Chapter 7

General:

James et al. (2021): Chapter 8

Hastie, Tibshirani, and Friedman (2017): Chapters 9.2 & 15

Introduction

Decision trees and random forests are very popular non-parametric methods. As such, “they do not make explicit assumptions about the functional form of \(f\)’’ (James et al. 2021: 23).

In this unit, we will cover the conceptual basics of these methods as well as their implementation in R using the tv data from Levshina (2020) in addition to the ELP data from the unit on Linear Regression. The libraries we will need are listed below:

In the tv data frame, our target variable will be T/V Form with the two outcomes ty (Russian 2.p.sg., informal) and vy (Russian 2.p.pl., polite).

str(tv)

head(tv)

Decision trees

Core concepts:

  • Segmenting the feature space: “[T]he feature space (i.e., the space spanned by all predictor variables) is recursively partitioned into a set of rectangular areas” (Strobl, Malley, and Tutz 2009: 325).

  • Impurity reduction: These simplified prediction areas should consist of mostly homogeneous (i.e., ‘pure’ rather than ‘mixed’) observations.

  • Tree construction: The ‘decisions’ made when partitioning the training data can be visualised using tree structures. The nodes of a tree represent variables, the branches represent decision rules, and leaf nodes indicate the final outcome (e.g., a prediction).

  • CART: The original computational implementation of decision trees is known as the CART (Classification and Regression Trees) algorithm developed by Breiman (1984).

Classification trees

If we are dealing with a categorical response variable, the tree() function can be used to fit a classification tree in accordance with Breiman’s CART algorithm. For illustration, consider the tv data frame. We will model the choice of the pronoun Form based on the speaker’s and hearer’s social circle (Rel_Circle) and their difference in social class (Rel_Class).

# Set random number generator for reproducibility
set.seed(123)

# Supply model formula
tree.tv <- tree(Form ~ Rel_Circle + Rel_Class, data = tv)

# View tree statistics
summary(tree.tv)

Classification tree:
tree(formula = Form ~ Rel_Circle + Rel_Class, data = tv)
Number of terminal nodes:  9 
Residual mean deviance:  0.974 = 213.3 / 219 
Misclassification error rate: 0.2412 = 55 / 228 
# Visualisation
plot(tree.tv)

text(tree.tv, pretty = 3)

An important problem that arises during tree construction is that of split selection. When should the tree split a node into two further nodes and when not? Furthermore, when should the tree stop the splitting process entirely? In this respect, CART relies on the principle of impurity reduction: “The fundamental idea is to select each split of a subset so that the data in each of the descendent subsets are ‘purer’ than the data in the parent subset” (Breiman 1984: 23). A measure for node purity is the Gini index \(G\), which is defined as

\[ G = \sum_{k=1}^{K}{\hat{p}_{mk}(1-\hat{p}_{mk}),} \]

where \(\hat{p}_{mk}\) measures the proportion of observations of a response level \(k\) in the \(m\)th prediction area of the training data set. Values close to 0 are indicative of high node purity, meaning that most observations belong to the same class (e.g., Form = ty). If splitting a node no longer leads to a substantial increase in purity, it becomes the terminal node, i.e., it is not split further. This terminal node returns the tree’s class prediction.

It is worth noting that modern CART implementations rely on different splitting criteria. For instance, Conditional Inference Trees use the \(p\) values of internal association tests to identify which variables warrant further subdivision of the training data. The presence or absence of correlation thus also determines whether or not a given node will be terminal (for more details, see B. Greenwell 2022: 122).

# Fitting a conditional inference tree
ctree.tv <- ctree(Form ~ ., data = tv) # dot . means 'include all predictors'

plot(ctree.tv)

Regression trees

Regression trees are used for continuous response variables. Instead of providing class predictions, they return the mean value of observations in a given prediction area. The algorithm now strives to minimize the residual sum of squares (\(RSS\)). Consider the regression tree for reaction times depending on word length, frequency and part of speech:

# CART tree
tree.rt <- tree(RT ~ Length + Freq + POS, data = ELP)

summary(tree.rt)

Regression tree:
tree(formula = RT ~ Length + Freq + POS, data = ELP)
Variables actually used in tree construction:
[1] "Freq"   "Length"
Number of terminal nodes:  10 
Residual mean deviance:  8629 = 7507000 / 870 
Distribution of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-220.10  -59.09  -11.17    0.00   49.66  397.40 
plot(tree.rt)

text(tree.rt, pretty = 0)

# Conditional inference tree
ctree.rt <- ctree(RT ~ Length + Freq + POS, data = ELP) 

plot(ctree.rt)

Random forests

Random forests (Breiman 2001) belong to the class of ensemble methods because they combine simpler models (e.g., individual decision trees) into a more complex and possibly more accurate model. As part of the RF algorithm, a great number of decision trees is trained on bootstrapped samples of the training data.

So far, random forests are essentially identical with Bagging (= bootstrap aggregation); however, an important additional characteristic of the RF algorithm is that only a random subset of the predictors is taken into consideration at each split. According to Strobl et al. (2009: 332), the resulting variability in tree structure is advantageous: “By combining the prediction of such a diverse set of trees, ensemble methods utilize the fact that classification trees are unstable, but, on average, produce the right prediction”.

Regression forest

For regression tasks, random forests return the average prediction of all trees in the ensemble.

# For regression
rt.rf.reg <- randomForest(RT ~ Length + Freq + POS, data = ELP,
                                mtry = 1, # = sqrt(number of variables)
                                ntree = 500) # number of trees

rt.rf.reg

Call:
 randomForest(formula = RT ~ Length + Freq + POS, data = ELP,      mtry = 1, ntree = 500) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 8972.927
                    % Var explained: 43.64
# Conditional random forest
rt.crf.reg <- cforest(RT ~ Length + Freq + POS, data = ELP, 
                    controls = cforest_unbiased(ntree = 500, mtry = 1))

Classification forest

For classification, all trees cast a vote for one of the response classes. The OOB error estimate refers to the accuracy of out-of-bag (OOB) predictions. After the initial bootstrapping procedure, roughly a third of the training data remains unused. These observations, which were not used for fitting trees, can be used as a test data set. Predictions based on this internal test data set are called OOB predictions.

# For classification
tv.rf.class <- randomForest(Form ~ ., data = tv,
                            mtry = 4,
                            ntree = 500)

tv.rf.class

Call:
 randomForest(formula = Form ~ ., data = tv, mtry = 4, ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 18.86%
Confusion matrix:
   ty vy class.error
ty 86 22   0.2037037
vy 21 99   0.1750000
# Conditional random forest
tv.crf.class <- cforest(Form ~ ., data = tv,
                    controls = cforest_unbiased(ntree = 500, mtry = 4))

tv.crf.class

     Random Forest using Conditional Inference Trees

Number of trees:  500 

Response:  Form 
Inputs:  Film, Rel_Age, Rel_Sex, Rel_Power, Rel_Circle, S_Class, H_Class, S_Age, H_Age, Rel_Class, Before68, Others, Office, S_Sex, H_Sex, Place 
Number of observations:  228 

Variable importance

Random forests allow users to assess whether or not certain predictors are useful for the model. The Gini index can be re-used to identify those variables that have led to the greatest reduction in impurity. However, this measure is biased towards predictors with many values (cf. Strobl et al. 2007).

# Gini importance (Reaction times)
varImpPlot(rt.rf.reg)

# Gini importance (Form of 2.p.)
varImpPlot(tv.rf.class)

A more robust measure is (Conditional) Permutation Accuracy Importance which compares the predictive accuracy of the random forest model before and after randomly permuting the values of the predictors (cf. Strobl et al. 2008; Debeer and Strobl 2020).

# Conditional permutation accuracy importance
library("permimp")

# Refit RF model with additional parameters
tv.rf.class <- randomForest(Form ~ .,
                            data = tv,
                            mtry = 4,
                            ntree = 500,
                            keep.inbag = TRUE,
                            keep.forest = TRUE)

# Compute CPI scores
tv.rf.permimp <- permimp(tv.rf.class, conditional = TRUE, progressBar = FALSE, threshold = .95) # Choose "Yes" in the console

# Plot CPI scores
plot(tv.rf.permimp, horizontal = TRUE, type = "dot", sort = TRUE)

Visualising random forest models

Partial dependence plots provide averaged predictions \(\hat{y}\) for a given constellation of predictors. These averages are produced by the partial dependence function \(f_k\). If a categorical response variable has \(K\) possible values, it has the form \[ f_k(X) = \log [p_k (x)] - \frac{1}{N}\sum_{k=1}^{K} \log [p_k (x)], \] with \(p_k(x)\) corresponding to the fitted probability of the 𝑘th level of the response variable for \(k \in \{1, 2, ..., K\}\) (B. M. Greenwell 2017: 430; see also Hastie, Tibshirani, and Friedman 2017).

Show the code
# Form ~ Rel_Circle
Rel_Circle.partial <- pdp::partial(tv.rf.class, pred.var = "Rel_Circle", which.class = "ty")

Rel_Circle.partial %>% 
  ggplot(aes(x = Rel_Circle, y = yhat, group = 1)) +
  geom_point(col = "steelblue") +
  geom_line(col = "steelblue") +
  theme_minimal() +
  labs(
    title = "Probability of 'ty' (2.p.sg.) depending on social circle",
    y = "Log odds of 'ty'"
  )

Show the code
# RT ~ POS
pos.partial <- pdp::partial(rt.rf.reg, pred.var = "POS")

pos.partial %>% 
  ggplot(aes(x = POS, y = yhat, group = 1)) +
  geom_point(col = "steelblue") +
  geom_line(col = "steelblue") +
  theme_minimal() +
  labs(
    title = "Predicted reaction times by POS",
    y = "Predicted reaction time"
  )

Show the code
# RT ~ Length
length.partial <- pdp::partial(rt.rf.reg, pred.var = "Length")

length.partial %>% 
  ggplot(aes(x = Length, y = yhat)) +
  geom_line(col = "steelblue") +
  theme_minimal() +
  labs(
    title = "Predicted reaction times by word length",
    y = "Predicted reaction time"
  )

References

Breiman, Leo. 1984. Classification and Regression Trees. Belmont, Calif.: Wadsworth International Group.
———. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Debeer, Dries, and Carolin Strobl. 2020. “Conditional Permutation Importance Revisited.” Bioinformatics 21 (1): 307.
Greenwell, Brandon. 2022. Tree-Based Methods for Statistical Learning in r. London & New York: Taylor & Francis Group. https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=3288358.
Greenwell, Brandon M. 2017. “Pdp: An r Package for Constructing Partial Dependence Plots.” The R Journal 9 (1): 421–36.
Gries, Stefan Thomas. 2021. Statistics for Linguistics with r: A Practical Introduction. 3rd rev. ed. Berlin; Boston: De Gruyter Mouton.
Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. 2017. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in r. New York: Springer. https://doi.org/10.1007/978-1-0716-1418-1.
Levshina, Natalia. 2015. How to Do Linguistics with r: Data Exploration and Statistical Analysis. Amsterdam; Philadelphia: John Benjamins Publishing Company.
———. 2020. “Conditional Inference Trees and Random Forests.” In A Practical Handbook of Corpus Linguistics, edited by Magali Paquot and Stefan Thomas Gries, 611–43. Cham: Springer.
Strobl, Carolin, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. 2008. “Conditional Variable Importance for Random Forests.” BMC Bioinformatics 9 (1): 307.
Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. 2007. “Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution.” BMC Bioinformatics 8 (1): 25.
Strobl, Carolin, James Malley, and Gerhard Tutz. 2009. “An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests.” Psychological Methods 14: 323–48. https://doi.org/10.1037/a0016973.
7.2 Gradient boosting