21  Linear regression

Author: Vladimir Buskin

Affiliation: Catholic University of Eichstätt-Ingolstadt

21.2 Preparation

# Load libraries
library(readxl) # for reading in Excel data
library(tidyverse) # data manipulation and visualisation framework
library(broom) # converting models to data frames
library(sjPlot) # exporting regression tables
library(effects) # plot marginal effects
library(ggeffects) # generating predictions
library(car) # model diagnostics

# Load data
ELP <- read_xlsx("ELP.xlsx")

# Inspect data structure
str(ELP)
tibble [880 × 5] (S3: tbl_df/tbl/data.frame)
 $ Word  : chr [1:880] "rackets" "stepmother" "delineated" "swimmers" ...
 $ Length: num [1:880] 7 10 10 8 6 5 5 8 8 6 ...
 $ Freq  : num [1:880] 0.96 4.24 0.04 1.49 1.06 3.33 0.1 0.06 0.43 5.41 ...
 $ POS   : chr [1:880] "NN" "NN" "VB" "NN" ...
 $ RT    : num [1:880] 791 693 960 771 882 ...

21.3 Introduction

Consider the distribution of the variables RT (reaction times) and Freq from the ELP (English Lexicon Project) dataset (Balota et al. 2007).

We will apply a \(\log\)-transformation to both variables in order to even out the differences between extremely high and extremely low frequency counts (Winter 2020: 90-94).
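If you prefer to store the transformed values explicitly rather than calling log() inside each plotting or modelling function (as the code in this chapter does), a minimal sketch would be:

# Optional: add log-transformed copies of RT and Freq as new columns
ELP <- ELP %>%
  mutate(log_RT   = log(RT),
         log_Freq = log(Freq))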

Code
# Log-transformed
ggplot(ELP, aes(x = log(RT))) +
  geom_histogram() +
  geom_vline(xintercept = mean(log(ELP$RT)), color = "steelblue") +
  geom_vline(xintercept = mean(log(ELP$RT)), color = "red") +
  theme_minimal() +
  labs(
    x = "Reaction time (log)"
  ) +
   annotate("text", x = log(mean(ELP$RT)), y = 0, 
           label = "mean", 
           vjust = 1.5, hjust = -0.2, color = "steelblue", parse = TRUE) +
  annotate("text", x = log(mean(ELP$RT)) + -0.2, y = .7, 
           label = "median", 
           vjust = 1.5, hjust = -0.2, color = "red", parse = TRUE)

Code
# Skewed
ggplot(ELP, aes(x = RT)) +
  geom_histogram() +
  geom_vline(xintercept = mean(ELP$RT), color = "steelblue") +
  geom_vline(xintercept = median(ELP$RT), color = "red") +
  theme_minimal() +
  labs(
    x = "Reaction time"
  ) +
annotate("text", x = mean(ELP$RT) + 10, y = 0, 
           label = "mean", 
           vjust = 1.5, hjust = -0.2, color = "steelblue", parse = TRUE) +
  annotate("text", x = median(ELP$RT) -175, y = .7, 
           label = "median", 
           vjust = 1.5, hjust = -0.2, color = "red", parse = TRUE)

We are particularly interested in the relationship between reaction times RT and the (log-)frequency Freq of a lexical stimulus. What kind of pattern does the scatter plot below suggest?
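The scatter plot itself is not reproduced here; a minimal sketch that recreates it, assuming both variables on the log scale, might look as follows:

# Scatter plot of log reaction times against log frequencies
ggplot(ELP, aes(x = log(Freq), y = log(RT))) +
  geom_point(alpha = 0.4) +
  theme_minimal() +
  labs(x = "Word frequency (log)", y = "Reaction time (log)")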

Some open questions
  • Can word frequency help us explain variation in reaction times?

  • If it can, then how could we characterise the effect of word frequency? In other words, does it increase or decrease reaction times?

  • What reaction times should we expect for new observations?

21.3.1 A simple statistical model

RT is the response or target that we wish to explain. We generically refer to the response as \(Y\).

Freq is the feature, input, or predictor, which we will call \(X\).

We can thus summarise our preliminary and fairly general statistical model as

\[Y = f(X) + \epsilon. \tag{21.1}\]

The term \(f(X)\) describes the contribution of \(X\) to the explanation of \(Y\). Since no model can explain everything perfectly, we expect there to be some degree of error \(\epsilon\).

21.4 Linear regression

  • Linear regression is a simple approach to supervised machine learning where the response variable is known.1

  • It assumes that the dependence of \(Y\) on \(X\) is linear, i.e., their relationship is a straight line.

  • This approach is suitable for numerical response variables. The predictors, however, can be either continuous or discrete.

  • 1 If the response variable is unknown or irrelevant, we speak of unsupervised machine learning. Unsupervised models are mostly concerned with finding patterns in high-dimensional datasets with dozens or even hundreds of variables.

  • Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

    21.4.1 Model with a single predictor \(X\)

    The simple linear model has the general form

    \[ Y = \beta_0 + \beta_1X + \epsilon. \tag{21.2}\]

    • The model parameters (or coefficients) \(\beta_0\) and \(\beta_1\) specify the functional relationship \(f\) between \(Y\) and \(X\).

    • The first parameter \(\beta_0\) determines the intercept of the regression line, and \(\beta_1\) indicates the slope.

    • Once again, \(\epsilon\) captures the model error, i.e., the deviation of the observed data points from the regression line that the model cannot account for.

    Applying the model formula to our dataset, we get the following updated regression equation:

    \[ \text{Reaction time} = \beta_0 + \beta_1\text{Frequency} + \text{Model Error}. \]

    But how do we find the exact values of the intercept and the slope? In short: We can’t! We are dealing with population parameters and can, therefore, only provide an approximation of the true relationship between RT and Freq.

    To reflect the tentative nature of the model coefficients, we use the hat symbol ^ (e.g., \(\hat{\beta_0}\)) to indicate estimates rather than true values. The estimation procedure requires training data, based on which the algorithm “learns” the relationship between \(Y\) and \(X\) (hence the term “Machine Learning”).

    The most common way of estimating the parameters of a linear model is the Least Squares approach. In essence, the parameters are chosen such that the residual sum of squares, i.e., the sum of the squared differences between observed and predicted values, is as low as possible. In other words, we are trying to minimise the distances between the data points and the regression line.

    The slope estimate \(\hat{\beta}_1\) can be computed using Equation 21.3.

    \[ \hat{\beta}_1 = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i- \bar{y})}{\sum_{i=1}^n(x_i- \bar{x})^2}. \tag{21.3}\]

    We can then obtain the intercept via Equation 21.4.

    \[ \hat{\beta_0}= \bar{y}- \hat{\beta}_1\bar{x} \tag{21.4}\]
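    As a sanity check, Equations 21.3 and 21.4 can be computed directly in R. The following is a minimal sketch; the objects x and y are shorthand introduced here for the log-transformed predictor and response that the model in Section 21.4.2 will use.

    # Hand-computed least-squares estimates (Equations 21.3 and 21.4)
    x <- log(ELP$Freq)
    y <- log(ELP$RT)

    b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) # slope
    b0 <- mean(y) - b1 * mean(x)                                    # intercept

    c(intercept = b0, slope = b1) # should match the coefficients of lm() below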

    Once we’ve fitted the model, we can then predict reaction times if we know the frequency of a lexical stimulus:

    \[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x, \tag{21.5}\]

    where \(\hat{y}\) indicates a prediction of \(Y\) on the basis of the predictor values \(X = x\).
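    For illustration, a prediction based on the hand-computed estimates b0 and b1 from the sketch above could look as follows; the raw frequency of 10 is invented purely for this example, and the fitted model in the next section offers the more convenient predict() route.

    # Predicted log reaction time for a hypothetical word with Freq = 10
    y_hat <- b0 + b1 * log(10)
    y_hat      # prediction on the log scale
    exp(y_hat) # back-transformed to milliseconds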

    21.4.2 Application in R

    In R, we can fit a linear model with the lm() function.

    # Fit linear model
    rt.lm1 <- lm(log(RT) ~ log(Freq), data = ELP)
    
    # View model data
    summary(rt.lm1)
    tab_model(rt.lm1, show.se = TRUE, show.aic = TRUE, show.dev = TRUE, show.fstat = TRUE, digits = 3)
                          log(RT)
    Predictors          Estimates   std. Error   CI                 p
    (Intercept)          6.633      0.004         6.625 –  6.642    <0.001
    Freq [log]          -0.049      0.002        -0.053 – -0.044    <0.001
    Observations        880
    R2 / R2 adjusted    0.357 / 0.356
    Deviance            13.385
    AIC                 10534.255

    The model statistics comprise the following elements:

    Call: the model formula that was fitted.

    Residuals: These indicate the difference between the observed values in the data set and the values predicted by the model (= the fitted values). They correspond to the error term \(\epsilon\). The lower the residuals, the better the model describes the data.

    # Show fitted values (= predictions) for the first six observations
    head(rt.lm1$fitted.values)
           1        2        3        4        5        6 
    6.635345 6.563152 6.789806 6.613980 6.630529 6.574894 
    # Show deviation of the fitted values from the observed values
    head(rt.lm1$residuals)
              1           2           3           4           5           6 
     0.03778825 -0.02277165  0.07759556  0.03387713  0.15222949 -0.10432670 

    Coefficients: The regression coefficients correspond to \(\hat{\beta}_0\) (“Intercept”) and \(\hat{\beta}_1\) (“log(Freq)”), respectively. The model shows that for a one-unit increase in log-frequency the log-reaction time decreases by approx. 0.05.
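    Because both variables are log-transformed, the slope can also be read multiplicatively. A small sketch, assuming the model rt.lm1 fitted above:

    # A one-unit increase in log(Freq) multiplies the predicted RT by
    # exp(-0.049) (about 0.95), i.e., it shortens reaction times by roughly 5%
    exp(coef(rt.lm1)["log(Freq)"])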

    # Convert coefficients to a tibble (= tidyverse-style data frame)
    tidy_model <- tidy(rt.lm1)
    
    tidy_model
    # A tibble: 2 × 5
      term        estimate std.error statistic  p.value
      <chr>          <dbl>     <dbl>     <dbl>    <dbl>
    1 (Intercept)   6.63     0.00429    1548.  0       
    2 log(Freq)    -0.0486   0.00220     -22.1 2.88e-86

    \(p\)-values and \(t\)-statistic: Given the null hypothesis \(H_0\) that there is no correlation between log(RT) and log(Freq) (i.e., \(H_0: \beta_1 = 0\)), a \(p\)-value lower than 0.05 indicates that \(\beta_1\) considerably deviates from 0, thus providing evidence for the alternative hypothesis \(H_1: \beta_1 \ne 0\). Since \(p < 0.001\), we can reject \(H_0\).

    The \(p\)-value itself crucially depends on the \(t\)-statistic2, which measures “the number of standard deviations that \(\hat{\beta_1}\) is away from 0” (James et al. 2021: 67). The standard error (SE) reflects how much an estimated coefficient differs on average from the true values of \(\beta_0\) and \(\beta_1\). It can be used to compute the 95% confidence interval \[[\hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1), \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1)]. \tag{21.6}\]

    The true value of the parameter \(\beta_1\) lies within the specified range 95% of the time.
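    To see where these numbers come from, the \(t\)-statistic and the approximate interval in Equation 21.6 can be reproduced by hand. This is a sketch based on the model rt.lm1 fitted above.

    # Extract estimate and standard error for log(Freq)
    coefs <- coef(summary(rt.lm1))
    est <- coefs["log(Freq)", "Estimate"]
    se  <- coefs["log(Freq)", "Std. Error"]

    est / se                                      # t-statistic
    c(lower = est - 2 * se, upper = est + 2 * se) # approximate 95% CI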

    # Compute confidence intervals for intercept and log(Freq)
    tidy_model_ci <- tidy(rt.lm1, conf.int = TRUE)
    
    tidy_model_ci
    # A tibble: 2 × 7
      term        estimate std.error statistic  p.value conf.low conf.high
      <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
    1 (Intercept)   6.63     0.00429    1548.  0          6.62      6.64  
    2 log(Freq)    -0.0486   0.00220     -22.1 2.88e-86  -0.0529   -0.0443

    The estimated parameter for log(Freq), which is -0.049, thus has the 95% confidence interval [-0.053, -0.044].

  • 2 The \(t\)-statistic is obtained by dividing the estimated coefficient by its standard error, i.e., \(t = \hat{\beta}_1 / SE(\hat{\beta}_1)\).

  • Residual standard error (RSE): This is an estimate of the average deviation of the predictions from the observed values, computed as shown in Equation 21.7.

    \[RSE = \sqrt{\frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{y}_i)^2} \tag{21.7}\]

    The \(R^2\) score is important for assessing model fit because it “measures the proportion of variability in \(Y\) that can be explained using \(X\)” (James et al. 2021: 70), varying between 0 and 1.

    \[R^2 = 1-\frac{RSS}{TSS} = 1-\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \tag{21.8}\]
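    Both quantities can be reproduced from the residuals of rt.lm1; the code below is a minimal sketch of Equations 21.7 and 21.8.

    # Hand-computed RSE and R^2 for rt.lm1
    res <- residuals(rt.lm1)
    y   <- log(ELP$RT)
    n   <- length(res)

    RSE <- sqrt(sum(res^2) / (n - 2))
    R2  <- 1 - sum(res^2) / sum((y - mean(y))^2)

    c(RSE = RSE, R2 = R2) # compare with summary(rt.lm1)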

    \(F\)-statistic: It measures the overall association between the dependent variable and the independent variable(s). Generally speaking, values considerably greater than 1 indicate such an association, and a sufficiently low \(p\)-value suggests that the null hypothesis \(H_0: \beta_1 = 0\) can be rejected. The \(F\)-statistic is computed as shown below (cf. Agresti and Kateri 2022: 232) and follows an \(F\)-distribution with two different \(df\) values.

    \[ F = \frac{(TSS - SSE) / p}{SSE / [n - (p + 1)]} \tag{21.9}\]
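    Again, the statistic can be computed step by step; the sketch below assumes the simple model rt.lm1, where the number of predictors is \(p = 1\).

    # Hand-computed F-statistic for rt.lm1 (Equation 21.9)
    y   <- log(ELP$RT)
    TSS <- sum((y - mean(y))^2)     # total sum of squares
    SSE <- sum(residuals(rt.lm1)^2) # residual (error) sum of squares
    p   <- 1                        # number of predictors
    n   <- length(y)

    ((TSS - SSE) / p) / (SSE / (n - (p + 1))) # compare with summary(rt.lm1)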

    21.4.3 Multiple linear regression

    In multiple linear regression, more than one predictor variable is taken into account. For instance, modelling log(RT) as a function of log(Freq), POS and Length requires a more complex model of the form

    \[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon. \tag{21.10}\]

    Predictions are then obtained via the formula

    \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + ... + \hat{\beta}_px_p. \tag{21.11}\]

    21.4.4 Application in R

    In R, a multiple regression model is fitted as in the code example below:

    # Fit multiple regression model
    rt.lm2 <- lm(log(RT) ~ log(Freq) + POS + Length, data = ELP)
    
    # View model statistics
    summary(rt.lm2)
    tab_model(rt.lm2, show.se = TRUE, show.aic = TRUE, show.dev = TRUE, show.fstat = TRUE, digits = 3)
                          log(RT)
    Predictors          Estimates   std. Error   CI                 p
    (Intercept)          6.460      0.017         6.426 –  6.493    <0.001
    Freq [log]          -0.038      0.002        -0.042 – -0.034    <0.001
    POS [NN]            -0.006      0.010        -0.026 –  0.014     0.539
    POS [VB]            -0.035      0.012        -0.059 – -0.011     0.004
    Length               0.023      0.002         0.020 –  0.026    <0.001
    Observations        880
    R2 / R2 adjusted    0.478 / 0.476
    Deviance            10.858
    AIC                 10356.130
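    Once the model has been fitted, predictions for new observations follow Equation 21.11 and are most easily obtained with predict(); the two stimuli below are invented purely for illustration.

    # Hypothetical new stimuli (values made up for illustration)
    new_items <- data.frame(
      Freq   = c(0.5, 25),
      POS    = c("NN", "VB"),
      Length = c(9, 5)
    )

    # Predicted reaction times (log scale, then back-transformed to ms)
    pred_log <- predict(rt.lm2, newdata = new_items)
    exp(pred_log)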

    21.5 Visualising regression models

    • Plot coefficient estimates:
    Code
    # Tidy the model output
    tidy_model <- tidy(rt.lm2, conf.int = TRUE)
    
    # Remove intercept
    tidy_model <- tidy_model %>% filter(term != "(Intercept)")
    
    # Create the coefficient plot
    ggplot(tidy_model, aes(x = estimate, y = term)) +
      geom_point() +
      geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
      geom_vline(xintercept = 0, linetype = "dashed", color = "steelblue") +
      theme_minimal() +
      labs(
        x = "Coefficient Estimate",
        y = "Predictor",
        title = "Coefficient Estimates with Confidence Intervals"
      )

    • Plot predictions:
    Code
    plot(ggeffect(rt.lm2, "Freq"), residuals = TRUE) + geom_line(col = "steelblue") + labs(subtitle = "Untransformed frequencies", y = "log(RT)")

    Code
    plot(ggeffect(rt.lm2, "Freq [log]"), residuals = TRUE) + geom_line(col = "steelblue") + labs(subtitle = "Log-transformed frequencies", y = "log(RT)")

    Code
    plot(ggeffect(rt.lm2, "POS"), residuals = TRUE) + geom_line(col = "steelblue") + labs(y = "log(RT)")

    Code
    plot(ggeffect(rt.lm2, "Length"), residuals = TRUE) + geom_line(col = "steelblue") + labs(y = "log(RT)")

    21.6 Model assumptions and diagnostics

    As a parametric method, linear regression makes numerous assumptions about the training data. It is, therefore, essential to run further tests to rule out possible violations. Among other things, the model assumptions include:

    • A linear relationship between the response and the quantitative predictors: The residuals should not display a clear pattern. For this reason, it is recommended to use component residual plots (e.g., crPlot() from the car library) for the visual identification of potentially non-linear trends.
    Code
    # pink line = main tendency vs. blue line = slope coefficients;
    # some minor non-linearity can be observed
    
    crPlot(rt.lm2, var = "log(Freq)") 

    Code
    crPlot(rt.lm2, var = "POS")

    Code
    crPlot(rt.lm2, var = "Length") # potentially problematic

    • No heteroscedasticity (i.e., non-constant variance of error terms): Visually, a violation of this assumption becomes apparent if the residuals form a funnel-like shape. It is also possible to conduct a non-constant variance test with ncvTest(): a \(p\)-value < 0.05 suggests non-constant variance.
    Code
    plot(rt.lm2, which = 1)

    Code
    ncvTest(rt.lm2) # significant, meaning that errors do not vary constantly
    Non-constant Variance Score Test 
    Variance formula: ~ fitted.values 
    Chisquare = 30.0101, Df = 1, p = 4.298e-08
    • No multicollinearity: Predictors should not be correlated with each other. In the model output, correlated variables have unusually high standard errors, thereby decreasing the explanatory power of both the coefficients and the model as a whole. Another diagnostic measure is the variance inflation factor (VIF): predictors with VIF scores > 5 are potentially collinear. VIF scores can be computed using the vif() function.
    Code
    vif(rt.lm2) # vif < 5 indicates that predictors are not correlated
                  GVIF Df GVIF^(1/(2*Df))
    log(Freq) 1.150140  1        1.072446
    POS       1.026925  2        1.006664
    Length    1.151054  1        1.072872
    • Normally distributed residuals: The residuals should follow the normal distribution and be centered around 0:

    \[ \epsilon \sim N(0, \sigma^2) \tag{21.12}\]

    Usually, a visual inspection using qqnorm() is sufficient, but the Shapiro-Wilk test shapiro.test() can also be run on the model residuals. Note that a \(p\)-value below 0.05 provides evidence for non-normality.

    Code
    plot(rt.lm2, which = 2)

    Code
    shapiro.test(residuals(rt.lm2)) # residuals are not normally distributed because p < 0.05
    
        Shapiro-Wilk normality test
    
    data:  residuals(rt.lm2)
    W = 0.99062, p-value = 2.139e-05
    Important

    Besides the points mentioned above, it is always recommended to examine the model with regard to

    • outliers that might skew the regression estimates,
    Code
    influencePlot(rt.lm2, id.method = "identify")

           StudRes         Hat       CookD
    16   0.4910868 0.027128347 0.001346143
    207 -0.7674391 0.030963411 0.003765568
    452  3.3366052 0.009338916 0.020749639
    498  3.5794954 0.004047706 0.010275907
    660  3.0847082 0.008750676 0.016638370
    • interactions, i.e., combined effects of predictors, and
    Code
    rt.lm.int <- lm(log(RT) ~ log(Freq) + POS + Length + log(Freq):Length, data = ELP)
    
    summary(rt.lm.int)
    
    Call:
    lm(formula = log(RT) ~ log(Freq) + POS + Length + log(Freq):Length, 
        data = ELP)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.27355 -0.07723 -0.00651  0.06623  0.39971 
    
    Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
    (Intercept)       6.4669304  0.0171836 376.343  < 2e-16 ***
    log(Freq)        -0.0243748  0.0062631  -3.892 0.000107 ***
    POSNN            -0.0054973  0.0101366  -0.542 0.587735    
    POSVB            -0.0344559  0.0120992  -2.848 0.004506 ** 
    Length            0.0217620  0.0018007  12.085  < 2e-16 ***
    log(Freq):Length -0.0017681  0.0007606  -2.325 0.020319 *  
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.1111 on 874 degrees of freedom
    Multiple R-squared:  0.4816,    Adjusted R-squared:  0.4786 
    F-statistic: 162.4 on 5 and 874 DF,  p-value: < 2.2e-16
    Code
    # ANOVA (analysis of variance)
    
    ## Compare interaction model with main effects model
    
    anova(rt.lm.int, rt.lm2) # interaction term improves the model
    Analysis of Variance Table
    
    Model 1: log(RT) ~ log(Freq) + POS + Length + log(Freq):Length
    Model 2: log(RT) ~ log(Freq) + POS + Length
      Res.Df    RSS Df Sum of Sq     F  Pr(>F)  
    1    874 10.792                             
    2    875 10.858 -1 -0.066726 5.404 0.02032 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    • overfitting, which results in poor model performance outside the training data.
    Code
    library("rms")
    
    # Refit the model with ols(), which is equivalent to lm()
    ols.rt <- ols(log(RT) ~ log(Freq) + POS + Length, data = ELP, x = TRUE, y = TRUE)
    
    # Cross-validate
    # Perform 200 random resampling iterations (= bootstrapping) and compare
    # model performance on training vs. test (= new) data. The slope optimism
    # should be below 0.05 to rule out overfitting.
    ols.val <- validate(ols.rt, bw = TRUE, B = 200)
    
            Backwards Step-down - Original Model
    
    No Factors Deleted
    
    Factors in Final Model
    
    [1] Freq   POS    Length
    Code
    ols.val[,1:5] # The model does not overfit.
              index.orig  training       test      optimism index.corrected
    R-square  0.47839094 0.4796176 0.47570919  0.0039084156      0.47448252
    MSE       0.01233904 0.0122714 0.01240248 -0.0001310708      0.01247011
    g         0.11875152 0.1187177 0.11856383  0.0001538558      0.11859767
    Intercept 0.00000000 0.0000000 0.01919518 -0.0191951836      0.01919518
    Slope     1.00000000 1.0000000 0.99706225  0.0029377543      0.99706225