26 Gradient boosting
Gradient boosting constitutes a powerful extension of tree-based methods and is generally appreciated for its high predictive performance. Nevertheless, this family of methods, whose representatives include AdaBoost, XGBoost, and CatBoost, among many others, is not yet established in corpus-linguistic statistics. A practical scenario is presented to introduce the core ideas of gradient boosting, to demonstrate its application to linguistic data, and to point out its advantages and drawbacks.
Keywords: Machine learning, gradient descent, loss function, regularization
26.1 Recommended reading
James et al. (2021): Chapter 8.2
Hastie, Tibshirani, and Friedman (2017): Chapter 10
26.2 Preparation
26.3 Boosting
The core idea of boosting is as simple as it is intuitive: by aggregating the insights of multiple weak models, a much more powerful composite model can be formed. The resulting ensemble typically outperforms each of its individual members in terms of predictive performance. Boosting is quite versatile, but we will restrict our scope to decision trees as introduced in the previous unit.
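To make this concrete, the sketch below builds a small ensemble by hand: each shallow regression tree is fitted to the residuals of the current ensemble, so that every new tree corrects the errors of its predecessors. The toy data, the choice of Python with scikit-learn, and all parameter settings are illustrative assumptions rather than part of this unit's case study.

```python
# A minimal, hand-rolled boosting loop with shallow regression trees.
# Data and settings are made up for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(200, 1))              # one toy predictor
y = np.sin(x).ravel() + rng.normal(0, 0.3, 200)    # noisy non-linear response

n_trees, learning_rate = 100, 0.1
prediction = np.full_like(y, y.mean())             # start from a constant model
trees = []

for _ in range(n_trees):
    residuals = y - prediction                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)      # a deliberately weak learner
    tree.fit(x, residuals)                         # fit the tree to the residuals
    prediction += learning_rate * tree.predict(x)  # nudge the ensemble towards y
    trees.append(tree)

print(np.mean((y - prediction) ** 2))              # training error shrinks as trees accumulate
```

The shrinkage factor learning_rate illustrates the role of regularization in boosting: smaller values mean that each tree contributes only a small correction, which typically calls for more trees but reduces the risk of overfitting.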
26.3.1 Loss functions and gradient descent
A possible way of quantifying a model’s errors is by means of a loss function. Given \(N\) observations, let \(y_i\) denote the observed value of the response variable for a data point \(x_i\) and \(f(x_i)\) the corresponding predicted value. The loss incurred at a single data point is then \(L(y_i, f(x_i))\), and the total loss of the model takes the form in Equation 26.1. Various metrics can serve as concrete measures of loss: squared errors are common for continuous target variables, whereas deviance, which is based on the log-likelihood, is more appropriate for categorical ones.
\[ L(f) = \sum_{i=1}^N L(y_i, f(x_i)) \tag{26.1}\]
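As a concrete illustration of Equation 26.1, the sketch below sums the pointwise losses over a handful of made-up observations, once using squared errors for a continuous response and once using the (binomial) deviance for a binary one. All values and variable names are assumptions for demonstration only.

```python
# Total loss as the sum of pointwise losses (Equation 26.1), for toy values.
import numpy as np

# Continuous response: squared-error loss (y_i - f(x_i))^2
y_cont = np.array([2.0, 3.5, 1.0])   # observed values y_i
f_cont = np.array([2.4, 3.0, 1.1])   # predicted values f(x_i)
squared_error_loss = np.sum((y_cont - f_cont) ** 2)

# Binary response: deviance, i.e. -2 times the log-likelihood,
# with f(x_i) interpreted as the predicted probability of class 1
y_bin = np.array([1, 0, 1])
p_hat = np.array([0.8, 0.3, 0.6])
deviance = -2 * np.sum(y_bin * np.log(p_hat) + (1 - y_bin) * np.log(1 - p_hat))

print(squared_error_loss, deviance)
```

In both cases, a better-fitting model yields a smaller total loss, which is what the minimisation described below exploits.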
In order to find the best possible model, it is crucial to find the values of \(f\) that minimise \(L(f)\). Gradient descent approaches this minimisation numerically: the model is improved step by step, where each increment \(h_m\) moves the predictions of the current model \(f_{m-1}\) in the direction in which the loss decreases most steeply. This direction is determined by the gradient \(g_m\), i.e. the rate of change of the loss with respect to the predicted values, whose components are computed as
\[ g_{mi} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}. \tag{26.2}\]
Essentially, \(g_{mi}\) is the derivative of the loss \(L(y_i, f(x_i))\) with respect to the fitted value \(f(x_i)\), evaluated at the current model \(f_{m-1}\); because the total loss depends on all \(N\) fitted values, this derivative is a partial one (as signified by the \(\partial\) symbol).
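For squared-error loss this derivative takes a particularly instructive form. Using the scaled loss \(L(y_i, f(x_i)) = \tfrac{1}{2}\{y_i - f(x_i)\}^2\) (a standard choice, adopted here purely for illustration), we obtain

\[ g_{mi} = \left[ \frac{\partial \, \tfrac{1}{2}\{y_i - f(x_i)\}^2}{\partial f(x_i)} \right]_{f = f_{m-1}} = -\{y_i - f_{m-1}(x_i)\}, \]

so the negative gradient, i.e. the direction of steepest descent, is simply the residual of the current model. Stepping against the gradient therefore amounts to fitting the next weak learner to these residuals, exactly as in the hand-rolled boosting loop sketched above.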
26.3.2 Gradient boosting
This page is still under construction. More content will be added soon!