Consider the distribution of the variables RT (reaction times) and Freq from the ELP (English Lexicon Project) dataset (Balota et al. 2007).
We will apply a \(\log\)-transformation to both variables in order to even out the differences between extremely high and extremely low values (Winter 2020: 90-94).
# Log-transformed
ggplot(ELP, aes(x = log(RT))) +
  geom_histogram() +
  geom_vline(xintercept = mean(log(ELP$RT)), color = "steelblue") +
  geom_vline(xintercept = median(log(ELP$RT)), color = "red") +
  theme_minimal() +
  labs(x = "Reaction time (log)") +
  annotate("text", x = mean(log(ELP$RT)), y = 0, label = "mean",
           vjust = 1.5, hjust = -0.2, color = "steelblue", parse = TRUE) +
  annotate("text", x = median(log(ELP$RT)) - 0.2, y = .7, label = "median",
           vjust = 1.5, hjust = -0.2, color = "red", parse = TRUE)

# Skewed
ggplot(ELP, aes(x = RT)) +
  geom_histogram() +
  geom_vline(xintercept = mean(ELP$RT), color = "steelblue") +
  geom_vline(xintercept = median(ELP$RT), color = "red") +
  theme_minimal() +
  labs(x = "Reaction time") +
  annotate("text", x = mean(ELP$RT) + 10, y = 0, label = "mean",
           vjust = 1.5, hjust = -0.2, color = "steelblue", parse = TRUE) +
  annotate("text", x = median(ELP$RT) - 175, y = .7, label = "median",
           vjust = 1.5, hjust = -0.2, color = "red", parse = TRUE)
We are particularly interested in the relationship between reaction times RT and the (log-)frequency Freq of a lexical stimulus. What kind of pattern does the scatter plot below suggest?
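A sketch of such a scatter plot, using the same ELP data frame and ggplot2 setup as above (axis labels and point styling are assumptions):

# Scatter plot of log reaction time against log frequency
ggplot(ELP, aes(x = log(Freq), y = log(RT))) +
  geom_point(alpha = 0.3) +
  theme_minimal() +
  labs(x = "Word frequency (log)", y = "Reaction time (log)")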
Some open questions
Can word frequency help us explain variation in reaction times?
If it can, then how could we characterise the effect of word frequency? In other words, does it increase or decrease reaction times?
What reaction times should we expect for new observations?
21.3.1 A simple statistical model
RT is the response or target that we wish to explain. We generically refer to the response as \(Y\).
Freq is the feature, input, or predictor, which we will call \(X\).
We can thus summarise our preliminary and fairly general statistical model as
\[Y = f(X) + \epsilon.
\tag{21.1}\]
The term \(f(X)\) describes the contribution of \(X\) to the explanation of \(Y\). Since no model can explain everything perfectly, we expect there to be some degree of error \(\epsilon\).
21.4 Linear regression
Linear regression is a simple approach to supervised machine learning where the response variable is known.1
It assumes that the dependence of \(Y\) on \(X\) is linear, i.e., their relationship is a straight line.
This approach is suitable for numerical response variables. The predictors, however, can be either continuous or discrete.
1 If the response variable is unknown or irrelevant, we speak of unsupervised machine learning. Unsupervised models are mostly concerned with finding patterns in high-dimensional datasets with dozens or even hundreds of variables.
Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.
21.4.1 Model with a single predictor \(X\)
The simple linear model has the general form
\[ Y = \beta_0 + \beta_1X + \epsilon.
\tag{21.2}\]
The model parameters (or coefficients) \(\beta_0\) and \(\beta_1\) specify the functional relationship \(f\) between \(Y\) and \(X\).
The first parameter \(\beta_0\) determines the intercept of the regression line, and \(\beta_1\) indicates the slope.
Once again, \(\epsilon\) captures the model error, i.e., the deviation of each data point from the regression line.
Applying the model formula to our dataset, we get the following updated regression equation:
\[\log(RT) = \beta_0 + \beta_1 \cdot \log(Freq) + \epsilon.
\]
But how do we find the exact values of the intercept and the slope? In short: We can’t! We are dealing with population parameters and can, therefore, only provide an approximation of the true relationship between RT and Freq.
To reflect the tentative nature of the model coefficients, we use the hat symbol ^ (e.g., \(\hat{\beta_0}\)) to indicate estimations rather than true values. The estimation procedure requires training data, based on which the algorithm “learns” the relationship between \(Y\) and \(X\) (hence the term “Machine Learning”).
How exactly do you estimate the model parameters?
The most common way of estimating parameters for linear models is the Least Squares approach. In essence, the parameters are chosen such that the residual sum of squares (RSS), i.e., the sum of the squared differences between observed and predicted values, is as low as possible. In other words, we are trying to minimise the (squared) distances between the data points and the regression line.
The residual sum of squares can be computed using the equivalence in Equation 21.3.
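A standard formulation of this equivalence (cf. James et al. 2021) is
\[RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2,
\]
where \(\hat{y}_i\) is the value the model predicts for the \(i\)-th observation.

The simple model examined below is stored as rt.lm1; a sketch of the presumed fitting call, assuming the ELP data frame introduced above:

# Fit a simple linear model: log reaction time as a function of log frequency
rt.lm1 <- lm(log(RT) ~ log(Freq), data = ELP)
# Display the model statistics discussed in the following
summary(rt.lm1)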
The model statistics comprise the following elements:
Call
i.e., the model formula.
Residuals
These indicate the difference between the observed values in the data set and the values predicted by the model (= the fitted values). These correspond to the error term \(\epsilon\). The lower the residuals, the better the model describes the data.
# Show fitted values (= predictions) for the first six observations
head(rt.lm1$fitted.values)
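The residuals can be extracted analogously (a small illustrative addition):

# Show residuals (= observed minus fitted values) for the first six observations
head(residuals(rt.lm1))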
Coefficients
The regression coefficients correspond to \(\hat{\beta}_0\) (“Intercept”) and \(\hat{\beta}_1\) (“log(Freq)”), respectively. The model shows that for a one-unit increase in log-frequency the log-reaction time decreases by approx. 0.05.
# Convert coefficients to a tibble (= tidyverse-style data frame)
library(broom)
tidy_model <- tidy(rt.lm1)
tidy_model
\(p\)-values and \(t\)-statistic: Given the null hypothesis \(H_0\) that there is no correlation between log(RT) and log(Freq) (i.e., \(H_0: \beta_1 = 0\)), a \(p\)-value lower than 0.05 indicates that \(\beta_1\) considerably deviates from 0, thus providing evidence for the alternative hypothesis \(H_1: \beta_1 \ne 0\). Since \(p < 0.001\), we can reject \(H_0\).
The \(p\)-value itself crucially depends on the \(t\)-statistic2, which measures “the number of standard deviations that \(\hat{\beta}_1\) is away from 0” (James et al. 2021: 67). The standard error (SE) reflects how much an estimated coefficient differs on average from the true value of \(\beta_0\) or \(\beta_1\). Standard errors can also be used to compute the 95% confidence interval \[[\hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1), \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1)].
\tag{21.6}\]
The interval will contain the true value of the parameter \(\beta_1\) approximately 95% of the time.
# Compute confidence intervals for intercept and log(Freq)
tidy_model_ci <- tidy(rt.lm1, conf.int = TRUE)
tidy_model_ci
The estimated parameter for log(Freq), which is -0.049, thus has the 95% confidence interval [-0.053, -0.044].
2 The \(t\)-statistic is obtained by dividing the estimated coefficient by its standard error, i.e., \(t = \hat{\beta}_1 / SE(\hat{\beta}_1)\).
Residual standard error (RSE)
This is an estimate of the average deviation of the predictions from the observed values.
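For a model with \(p\) predictors and \(n\) observations, it is commonly computed as
\[RSE = \sqrt{\frac{RSS}{n - p - 1}} = \sqrt{\frac{1}{n - p - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}.
\]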
The \(R^2\) score is important for assessing model fit because it “measures the proportion of variability in \(Y\) that can be explained using \(X\)” (James et al. 2021: 70), varying between 0 and 1.
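It is based on the total sum of squares \(TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2\), i.e., the total variability in the response, and is defined as
\[R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}.
\]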
F-statistic
The \(F\)-statistic measures the association between the dependent variable and the independent variable(s). Generally speaking, values considerably greater than 1 indicate such an association. A sufficiently low \(p\)-value suggests that the null hypothesis \(H_0: \beta_1 = 0\) can be rejected. The \(F\)-statistic is computed as shown below (cf. Agresti and Kateri 2022: 232) and follows an \(F\)-distribution with two different \(df\) values.
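A standard way of writing this statistic for a model with \(p\) predictors and \(n\) observations is
\[F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)},
\]
which follows an \(F\)-distribution with \(p\) and \(n - p - 1\) degrees of freedom.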
In multiple linear regression, more than one predictor variable is taken into account. For instance, modelling log(RT) as a function of log(Freq), POS and Length requires a more complex model of the form
\[\log(RT) = \beta_0 + \beta_1 \cdot \log(Freq) + \beta_2 \cdot POS + \beta_3 \cdot Length + \epsilon.
\]
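This model is stored as rt.lm2 in the code below; based on the model comparison further down, a sketch of the presumed fitting call is:

# Multiple linear regression: frequency, part of speech and word length as predictors
rt.lm2 <- lm(log(RT) ~ log(Freq) + POS + Length, data = ELP)
summary(rt.lm2)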
As a parametric method, linear regression makes numerous assumptions about the training data. It is, therefore, essential to run further tests to rule out possible violations. Among other things, the model assumptions include:
A linear relationship between the response and the quantitative predictors: The residuals should not display a clear pattern. For this reason, it is recommended to use component residual plots (e.g., crPlot() from the car library) for the visual identification of potentially non-linear trends.
library(car)  # provides crPlot(), ncvTest() and vif()

# pink line = main tendency vs. blue line = slope coefficients;
# some minor non-linearity can be observed
crPlot(rt.lm2, var = "log(Freq)")
crPlot(rt.lm2, var = "POS")
crPlot(rt.lm2, var = "Length")  # potentially problematic
No heteroscedasticity (i.e., non-constant variance of error terms): Visually, a violation of this assumption becomes apparent if the residuals form a funnel-like shape. It is also possible to conduct a non-constant variance test with ncvTest(): a \(p\)-value < 0.05 suggests non-constant variance.
plot(rt.lm2, which = 1)  # residuals vs. fitted values
ncvTest(rt.lm2)  # significant, meaning that the error variance is not constant
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 30.0101, Df = 1, p = 4.298e-08
No multicollinearity: Predictors should not be correlated with each other. In the model output, correlated variables show unusually high standard errors, which reduces the explanatory power of both the coefficients and the model as a whole. Another diagnostic measure is the variance inflation factor (VIF); predictors with VIF scores > 5 are potentially collinear. VIF scores can be computed using the vif() function.
vif(rt.lm2)  # VIF < 5 indicates that predictors are not correlated
Normally distributed residuals: The residuals should follow the normal distribution and be centered around 0:
\[
\epsilon \sim N(0, \sigma^2)
\tag{21.12}\]
Usually, a visual inspection using qqnorm() is sufficient, but the Shapiro-Wilk test shapiro.test() can also be run on the model residuals. Note that a \(p\)-value below 0.05 provides evidence for non-normality.
plot(rt.lm2, which = 2)  # normal Q-Q plot of the residuals
shapiro.test(residuals(rt.lm2))  # residuals are not normally distributed because p < 0.05
Shapiro-Wilk normality test
data: residuals(rt.lm2)
W = 0.99062, p-value = 2.139e-05
Important
Besides the points mentioned above, it is always recommended to examine the model with regard to
outliers that might skew the regression estimates,
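interactions between predictors, as illustrated below: the model rt.lm.int adds a log(Freq):Length interaction term to the main-effects model and is then compared to it via an ANOVA. A sketch of the presumed fitting call:

# Model with an interaction between log frequency and word length
rt.lm.int <- lm(log(RT) ~ log(Freq) + POS + Length + log(Freq):Length, data = ELP)
summary(rt.lm.int)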
Call:
lm(formula = log(RT) ~ log(Freq) + POS + Length + log(Freq):Length,
data = ELP)
Residuals:
Min 1Q Median 3Q Max
-0.27355 -0.07723 -0.00651 0.06623 0.39971
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4669304 0.0171836 376.343 < 2e-16 ***
log(Freq) -0.0243748 0.0062631 -3.892 0.000107 ***
POSNN -0.0054973 0.0101366 -0.542 0.587735
POSVB -0.0344559 0.0120992 -2.848 0.004506 **
Length 0.0217620 0.0018007 12.085 < 2e-16 ***
log(Freq):Length -0.0017681 0.0007606 -2.325 0.020319 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1111 on 874 degrees of freedom
Multiple R-squared: 0.4816, Adjusted R-squared: 0.4786
F-statistic: 162.4 on 5 and 874 DF, p-value: < 2.2e-16
# ANOVA (analysis of variance)
## Compare interaction model with main effects model
anova(rt.lm.int, rt.lm2)  # interaction term improves the model
Analysis of Variance Table
Model 1: log(RT) ~ log(Freq) + POS + Length + log(Freq):Length
Model 2: log(RT) ~ log(Freq) + POS + Length
Res.Df RSS Df Sum of Sq F Pr(>F)
1 874 10.792
2 875 10.858 -1 -0.066726 5.404 0.02032 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
overfitting, which results in poor model performance outside the training data.
library("rms")

# Refit the model with ols(), which is equivalent to lm()
ols.rt <- ols(log(RT) ~ log(Freq) + POS + Length, data = ELP, x = TRUE, y = TRUE)

# Cross-validate: perform 200 random resampling iterations (= bootstrapping) and
# compare model performance on training vs. test (= new) data.
# The slope optimism should be below 0.05 to rule out overfitting.
ols.val <- validate(ols.rt, bw = TRUE, B = 200)
Backwards Step-down - Original Model
No Factors Deleted
Factors in Final Model
[1] Freq POS Length
Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With R and Python. Boca Raton: CRC Press.
Balota, David A., Melvin J. Yap, Keith A. Hutchison, Michael J. Cortese, Brett Kessler, Bjorn Loftis, James H. Neely, Douglas L. Nelson, Greg B. Simpson, and Rebecca Treiman. 2007. “The English Lexicon Project.” Behavior Research Methods 39 (3): 445–59. https://doi.org/10.3758/BF03193014.
Heumann, Christian, Michael Schomaker, and Shalabh. 2022. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer. https://doi.org/10.1007/978-3-031-11833-3.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. New York: Springer. https://doi.org/10.1007/978-1-0716-1418-1.
Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam; Philadelphia: John Benjamins Publishing Company.
Winter, Bodo. 2020. Statistics for Linguists: An Introduction Using R. New York; London: Routledge.