26 Gradient boosting
Gradient boosting constitutes a powerful extension of tree-based methods and is generally appreciated for its high predictive performance. Nevertheless, this family of methods, whose representatives include AdaBoost, XGBoost, and CatBoost, among many others, is not yet established in corpus-linguistic statistics. A practical scenario is presented to introduce the core ideas of gradient boosting, to demonstrate its application to linguistic data, and to point out its advantages and drawbacks.
Keywords: Machine learning, gradient descent, loss function, regularization
26.1 Recommended reading
James et al. (2021): Chapter 8.2
Hastie, Tibshirani, and Friedman (2017): Chapter 10
26.2 Preparation
26.3 Boosting
The core idea of boosting is as simple as it is intuitive: by aggregating the insights of multiple weak models, a much more powerful composite model can be formed. The resulting ensemble typically outperforms each of its individual members in terms of predictive performance. Boosting is quite versatile, but we will restrict our scope to decision trees as introduced in the previous unit.
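To make this concrete, the sketch below builds a small ensemble by hand: each shallow regression tree is fitted to the residuals of the current ensemble, so that every new tree corrects the errors of its predecessors. The toy data, the choice of Python with scikit-learn, and all parameter settings are illustrative assumptions rather than part of this unit's case study.

```python
# A minimal, hand-rolled boosting loop with shallow regression trees.
# Data and settings are made up for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(200, 1))              # one toy predictor
y = np.sin(x).ravel() + rng.normal(0, 0.3, 200)    # noisy non-linear response

n_trees, learning_rate = 100, 0.1
prediction = np.full_like(y, y.mean())             # start from a constant model
trees = []

for _ in range(n_trees):
    residuals = y - prediction                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)      # a deliberately weak learner
    tree.fit(x, residuals)                         # fit the tree to the residuals
    prediction += learning_rate * tree.predict(x)  # nudge the ensemble towards y
    trees.append(tree)

print(np.mean((y - prediction) ** 2))              # training error shrinks as trees accumulate
```

The shrinkage factor learning_rate illustrates the role of regularization in boosting: smaller values mean that each tree contributes only a small correction, which typically calls for more trees but reduces the risk of overfitting.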
26.3.1 Loss functions and gradient descent
A possible way of quantifying a model’s errors is by means of a loss function. Given \(N\) observations, let \(y_i\) denote the observed value of the response variable for a data point \(x_i\) and \(f(x_i)\) the corresponding predicted value. The loss incurred at a single data point is then \(L(y_i, f(x_i))\), and the total loss of the model takes the form in Equation 26.1. Various metrics can serve as concrete measures of loss: squared errors are common for continuous target variables, whereas deviance, which is based on the log-likelihood, is more appropriate for categorical ones.
\[ L(f) = \sum_{i=1}^N L(y_i, f(x_i)) \tag{26.1}\]
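As a concrete illustration of Equation 26.1, the sketch below sums the pointwise losses over a handful of made-up observations, once using squared errors for a continuous response and once using the (binomial) deviance for a binary one. All values and variable names are assumptions for demonstration only.

```python
# Total loss as the sum of pointwise losses (Equation 26.1), for toy values.
import numpy as np

# Continuous response: squared-error loss (y_i - f(x_i))^2
y_cont = np.array([2.0, 3.5, 1.0])   # observed values y_i
f_cont = np.array([2.4, 3.0, 1.1])   # predicted values f(x_i)
squared_error_loss = np.sum((y_cont - f_cont) ** 2)

# Binary response: deviance, i.e. -2 times the log-likelihood,
# with f(x_i) interpreted as the predicted probability of class 1
y_bin = np.array([1, 0, 1])
p_hat = np.array([0.8, 0.3, 0.6])
deviance = -2 * np.sum(y_bin * np.log(p_hat) + (1 - y_bin) * np.log(1 - p_hat))

print(squared_error_loss, deviance)
```

In both cases, a better-fitting model yields a smaller total loss, which is what the minimisation described below exploits.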
In order to find the best possible model, it is crucial to find the values of \(f\) that minimise \(L(f)\). Gradient descent approaches this minimisation numerically: the model is improved step by step, where each increment \(h_m\) moves the predictions of the current model \(f_{m-1}\) in the direction in which the loss decreases most steeply. This direction is determined by the gradient \(g_m\), i.e. the rate of change of the loss with respect to the predicted values, whose components are computed as
\[ g_{mi} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}. \tag{26.2}\]
Essentially, \(g_{mi}\) is the derivative of the loss \(L(y_i, f(x_i))\) with respect to the fitted value \(f(x_i)\), evaluated at the current model \(f_{m-1}\); because the total loss depends on all \(N\) fitted values, this derivative is a partial one (as signified by the \(\partial\) symbol).
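For squared-error loss this derivative takes a particularly instructive form. Using the scaled loss \(L(y_i, f(x_i)) = \tfrac{1}{2}\{y_i - f(x_i)\}^2\) (a standard choice, adopted here purely for illustration), we obtain

\[ g_{mi} = \left[ \frac{\partial \, \tfrac{1}{2}\{y_i - f(x_i)\}^2}{\partial f(x_i)} \right]_{f = f_{m-1}} = -\{y_i - f_{m-1}(x_i)\}, \]

so the negative gradient, i.e. the direction of steepest descent, is simply the residual of the current model. Stepping against the gradient therefore amounts to fitting the next weak learner to these residuals, exactly as in the hand-rolled boosting loop sketched above.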
26.3.2 Gradient boosting
This page is still under construction. More content will be added soon!