Consider the distribution of the variables RT (reaction times) and Freq from the ELP (English Lexicon Project) dataset (Balota et al. 2007).
We will apply a \(\log\)-transformation to both variables in order to even out the differences between extremely high and extremely low values (Winter 2020: 90-94).
# Log-transformed
ggplot(ELP, aes(x = log(RT))) +
  geom_histogram() +
  geom_vline(xintercept = mean(log(ELP$RT)), color = "steelblue") +
  geom_vline(xintercept = median(log(ELP$RT)), color = "red") +
  theme_minimal() +
  labs(x = "Reaction time (log)") +
  annotate("text", x = mean(log(ELP$RT)), y = 0, label = "mean",
           vjust = 1.5, hjust = -0.2, color = "steelblue", parse = TRUE) +
  annotate("text", x = median(log(ELP$RT)) - 0.2, y = .7, label = "median",
           vjust = 1.5, hjust = -0.2, color = "red", parse = TRUE)

# Skewed
ggplot(ELP, aes(x = RT)) +
  geom_histogram() +
  geom_vline(xintercept = mean(ELP$RT), color = "steelblue") +
  geom_vline(xintercept = median(ELP$RT), color = "red") +
  theme_minimal() +
  labs(x = "Reaction time") +
  annotate("text", x = mean(ELP$RT) + 10, y = 0, label = "mean",
           vjust = 1.5, hjust = -0.2, color = "steelblue", parse = TRUE) +
  annotate("text", x = median(ELP$RT) - 175, y = .7, label = "median",
           vjust = 1.5, hjust = -0.2, color = "red", parse = TRUE)
We are particularly interested in the relationship between reaction times RT and the (log-)frequency Freq of a lexical stimulus. What kind of pattern does the scatter plot below suggest?
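A sketch of such a scatter plot, using the same ELP data frame and ggplot2 setup as above (axis labels and point styling are assumptions):

# Scatter plot of log reaction time against log frequency
ggplot(ELP, aes(x = log(Freq), y = log(RT))) +
  geom_point(alpha = 0.3) +
  theme_minimal() +
  labs(x = "Word frequency (log)", y = "Reaction time (log)")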
Some open questions
Can word frequency help us explain variation in reaction times?
If it can, then how could we characterise the effect of word frequency? In other words, does it increase or decrease reaction times?
What reaction times should we expect for new observations?
21.3.1 A simple statistical model
RT is the response or target that we wish to explain. We generically refer to the response as \(Y\).
Freq is the feature, input, or predictor, which we will call \(X\).
We can thus summarise our preliminary and fairly general statistical model as
\[Y = f(X) + \epsilon.
\tag{21.1}\]
The term \(f(X)\) describes the contribution of \(X\) to the explanation of \(Y\). Since no model can explain everything perfectly, we expect there to be some degree of error \(\epsilon\).
21.4 Linear regression
Linear regression is a simple approach to supervised machine learning where the response variable is known.1
It assumes that the dependence of \(Y\) on \(X\) is linear, i.e., their relationship is a straight line.
This approach is suitable for numerical response variables. The predictors, however, can be either continuous or discrete.
1 If the response variable is unknown or irrelevant, we speak of unsupervised machine learning. Unsupervised models are mostly concerned with finding patterns in high-dimensional datasets with dozens or even hundreds of variables.
Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.
21.4.1 Model with a single predictor \(X\)
The simple linear model has the general form
\[ Y = \beta_0 + \beta_1X + \epsilon.
\tag{21.2}\]
The model parameters (or coefficients) \(\beta_0\) and \(\beta_1\) specify the functional relationship \(f\) between \(Y\) and \(X\).
The first parameter \(\beta_0\) determines the intercept of the regression line, and \(\beta_1\) indicates the slope.
Once again, \(\epsilon\) captures the model error, i.e., the deviation of each data point from the regression line.
Applying the model formula to our dataset, we get the following updated regression equation:
\[\log(RT) = \beta_0 + \beta_1 \cdot \log(Freq) + \epsilon.
\]
But how do we find the exact values of the intercept and the slope? In short: We can’t! We are dealing with population parameters and can, therefore, only provide an approximation of the true relationship between RT and Freq.
To reflect the tentative nature of the model coefficients, we use the hat symbol ^ (e.g., \(\hat{\beta_0}\)) to indicate estimations rather than true values. The estimation procedure requires training data, based on which the algorithm “learns” the relationship between \(Y\) and \(X\) (hence the term “Machine Learning”).
How exactly do you estimate the model parameters?
The most common way of estimating parameters for linear models is the Least Squares approach. In essence, the parameters are chosen such that the residual sum of squares (RSS), i.e., the sum of the squared differences between observed and predicted values, is as low as possible. In other words, we are trying to minimise the (squared) distances between the data points and the regression line.
The residual sum of squares can be computed using the equivalence in Equation 21.3.
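A standard formulation of this equivalence (cf. James et al. 2021) is
\[RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2,
\]
where \(\hat{y}_i\) is the value the model predicts for the \(i\)-th observation.

The simple model examined below is stored as rt.lm1; a sketch of the presumed fitting call, assuming the ELP data frame introduced above:

# Fit a simple linear model: log reaction time as a function of log frequency
rt.lm1 <- lm(log(RT) ~ log(Freq), data = ELP)
# Display the model statistics discussed in the following
summary(rt.lm1)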
The model statistics comprise the following elements:
Call
i.e., the model formula.
Residuals
These indicate the difference between the observed values in the data set and the values predicted by the model (= the fitted values). These correspond to the error term \(\epsilon\). The lower the residuals, the better the model describes the data.
# Show fitted values (= predictions) for the first six observations
head(rt.lm1$fitted.values)
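The residuals can be extracted analogously (a small illustrative addition):

# Show residuals (= observed minus fitted values) for the first six observations
head(residuals(rt.lm1))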
Coefficients
The regression coefficients correspond to \(\hat{\beta}_0\) (“Intercept”) and \(\hat{\beta}_1\) (“log(Freq)”), respectively. The model shows that for a one-unit increase in log-frequency the log-reaction time decreases by approx. 0.05.
# Convert coefficients to a tibble (= tidyverse-style data frame)
library(broom)
tidy_model <- tidy(rt.lm1)
tidy_model
\(p\)-values and \(t\)-statistic: Given the null hypothesis \(H_0\) that there is no correlation between log(RT) and log(Freq) (i.e., \(H_0: \beta_1 = 0\)), a \(p\)-value lower than 0.05 indicates that \(\beta_1\) considerably deviates from 0, thus providing evidence for the alternative hypothesis \(H_1: \beta_1 \ne 0\). Since \(p < 0.001\), we can reject \(H_0\).
The \(p\)-value itself crucially depends on the \(t\)-statistic2, which measures “the number of standard deviations that \(\hat{\beta}_1\) is away from 0” (James et al. 2021: 67). The standard error (SE) reflects how much an estimated coefficient differs on average from the true value of \(\beta_0\) or \(\beta_1\). Standard errors can also be used to compute the 95% confidence interval \[[\hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1), \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1)].
\tag{21.6}\]
The interval will contain the true value of the parameter \(\beta_1\) approximately 95% of the time.
# Compute confidence intervals for intercept and log(Freq)
tidy_model_ci <- tidy(rt.lm1, conf.int = TRUE)
tidy_model_ci
The estimated parameter for log(Freq), which is -0.049, thus has the 95% confidence interval [-0.053, -0.044].
2 The \(t\)-statistic is obtained by dividing the estimated coefficient by its standard error, i.e., \(t = \hat{\beta}_1 / SE(\hat{\beta}_1)\).
Residual standard error (RSE)
This is an estimate of the average deviation of the predictions from the observed values.
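For a model with \(p\) predictors and \(n\) observations, it is commonly computed as
\[RSE = \sqrt{\frac{RSS}{n - p - 1}} = \sqrt{\frac{1}{n - p - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}.
\]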
The \(R^2\) score is important for assessing model fit because it “measures the proportion of variability in \(Y\) that can be explained using \(X\)” (James et al. 2021: 70), varying between 0 and 1.
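It is based on the total sum of squares \(TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2\), i.e., the total variability in the response, and is defined as
\[R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}.
\]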
F-statistic
The \(F\)-statistic measures the association between the dependent variable and the independent variable(s). Generally speaking, values considerably greater than 1 indicate such an association. A sufficiently low \(p\)-value suggests that the null hypothesis \(H_0: \beta_1 = 0\) can be rejected. The \(F\)-statistic is computed as shown below (cf. Agresti and Kateri 2022: 232) and follows an \(F\)-distribution with two different \(df\) values.
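A standard way of writing this statistic for a model with \(p\) predictors and \(n\) observations is
\[F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)},
\]
which follows an \(F\)-distribution with \(p\) and \(n - p - 1\) degrees of freedom.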
In multiple linear regression, more than one predictor variable is taken into account. For instance, modelling log(RT) as a function of log(Freq), POS and Length requires a more complex model of the form
\[\log(RT) = \beta_0 + \beta_1 \cdot \log(Freq) + \beta_2 \cdot POS + \beta_3 \cdot Length + \epsilon.
\]
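This model is stored as rt.lm2 in the code below; based on the model comparison further down, a sketch of the presumed fitting call is:

# Multiple linear regression: frequency, part of speech and word length as predictors
rt.lm2 <- lm(log(RT) ~ log(Freq) + POS + Length, data = ELP)
summary(rt.lm2)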
As a parametric method, linear regression makes numerous assumptions about the training data. It is, therefore, essential to run further tests to rule out possible violations. Among other things, the model assumptions include:
A linear relationship between the response and the quantitative predictors: The residuals should not display a clear pattern. For this reason, it is recommended to use component residual plots (e.g., crPlot() from the car library) for the visual identification of potentially non-linear trends.
library(car)  # provides crPlot(), ncvTest() and vif()

# pink line = main tendency vs. blue line = slope coefficients;
# some minor non-linearity can be observed
crPlot(rt.lm2, var = "log(Freq)")
crPlot(rt.lm2, var = "POS")
crPlot(rt.lm2, var = "Length")  # potentially problematic
No heteroscedasticity (i.e., non-constant variance of error terms): Visually, a violation of this assumption becomes apparent if the residuals form a funnel-like shape. It is also possible to conduct a non-constant variance test with ncvTest(): a \(p\)-value < 0.05 suggests non-constant variance.
plot(rt.lm2, which = 1)  # residuals vs. fitted values
ncvTest(rt.lm2)  # significant, meaning that the error variance is not constant
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 30.0101, Df = 1, p = 4.298e-08
No multicollinearity: Predictors should not be correlated with each other. In the model output, correlated variables show unusually high standard errors, which reduces the explanatory power of both the coefficients and the model as a whole. Another diagnostic measure is the variance inflation factor (VIF); predictors with VIF scores > 5 are potentially collinear. VIF scores can be computed using the vif() function.
vif(rt.lm2)  # VIF < 5 indicates that predictors are not correlated
Normally distributed residuals: The residuals should follow the normal distribution and be centered around 0:
\[
\epsilon \sim N(0, \sigma^2)
\tag{21.12}\]
Usually, a visual inspection using qqnorm() is sufficient, but the Shapiro-Wilk test shapiro.test() can also be run on the model residuals. Note that a \(p\)-value below 0.05 provides evidence for non-normality.
plot(rt.lm2, which = 2)  # normal Q-Q plot of the residuals
shapiro.test(residuals(rt.lm2))  # residuals are not normally distributed because p < 0.05
Shapiro-Wilk normality test
data: residuals(rt.lm2)
W = 0.99062, p-value = 2.139e-05
Important
Besides the points mentioned above, it is always recommended to examine the model with regard to
outliers that might skew the regression estimates,
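interactions between predictors, as illustrated below: the model rt.lm.int adds a log(Freq):Length interaction term to the main-effects model and is then compared to it via an ANOVA. A sketch of the presumed fitting call:

# Model with an interaction between log frequency and word length
rt.lm.int <- lm(log(RT) ~ log(Freq) + POS + Length + log(Freq):Length, data = ELP)
summary(rt.lm.int)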
Call:
lm(formula = log(RT) ~ log(Freq) + POS + Length + log(Freq):Length,
data = ELP)
Residuals:
Min 1Q Median 3Q Max
-0.27355 -0.07723 -0.00651 0.06623 0.39971
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4669304 0.0171836 376.343 < 2e-16 ***
log(Freq) -0.0243748 0.0062631 -3.892 0.000107 ***
POSNN -0.0054973 0.0101366 -0.542 0.587735
POSVB -0.0344559 0.0120992 -2.848 0.004506 **
Length 0.0217620 0.0018007 12.085 < 2e-16 ***
log(Freq):Length -0.0017681 0.0007606 -2.325 0.020319 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1111 on 874 degrees of freedom
Multiple R-squared: 0.4816, Adjusted R-squared: 0.4786
F-statistic: 162.4 on 5 and 874 DF, p-value: < 2.2e-16
# ANOVA (analysis of variance)
## Compare interaction model with main effects model
anova(rt.lm.int, rt.lm2)  # interaction term improves the model
Analysis of Variance Table
Model 1: log(RT) ~ log(Freq) + POS + Length + log(Freq):Length
Model 2: log(RT) ~ log(Freq) + POS + Length
Res.Df RSS Df Sum of Sq F Pr(>F)
1 874 10.792
2 875 10.858 -1 -0.066726 5.404 0.02032 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
overfitting, which results in poor model performance outside the training data.
library("rms")

# Refit the model with ols(), which is equivalent to lm()
ols.rt <- ols(log(RT) ~ log(Freq) + POS + Length, data = ELP, x = TRUE, y = TRUE)

# Cross-validate: perform 200 random resampling iterations (= bootstrapping) and
# compare model performance on training vs. test (= new) data.
# The slope optimism should be below 0.05 to rule out overfitting.
ols.val <- validate(ols.rt, bw = TRUE, B = 200)
Backwards Step-down - Original Model
No Factors Deleted
Factors in Final Model
[1] Freq POS Length
Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With R and Python. Boca Raton: CRC Press.
Balota, David A., Melvin J. Yap, Keith A. Hutchison, Michael J. Cortese, Brett Kessler, Bjorn Loftis, James H. Neely, Douglas L. Nelson, Greg B. Simpson, and Rebecca Treiman. 2007. “The English Lexicon Project.” Behavior Research Methods 39 (3): 445–59. https://doi.org/10.3758/BF03193014.
Heumann, Christian, Michael Schomaker, and Shalabh. 2022. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer. https://doi.org/10.1007/978-3-031-11833-3.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. New York: Springer. https://doi.org/10.1007/978-1-0716-1418-1.
Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam; Philadelphia: John Benjamins Publishing Company.
Winter, Bodo. 2020. Statistics for Linguists: An Introduction Using R. New York; London: Routledge.