# Load libraries
library(tidyverse)
library(broom)
library(DescTools)
library(lmtest)
library(car) # needed for vif() below
# Load data
data_pro <- read.csv("INPUT_pronouns.csv", sep = ",", header = TRUE)
# Inspect data
str(data_pro)
head(data_pro)
6.2 Logistic Regression
Recommended reading
For linguists:
Levshina (2015: Chapter 12)
Winter (2020: Chapter 12)
Full theoretical treatment:
James et al. (2021: Chapter 4)
Hosmer & Lemeshow (2008)
Preparation
Consider the data from Buskin’s (2025)1 corpus study on subject pronoun realisation:
1 The input data can be downloaded from this OSF repository: https://osf.io/qgnms.
Target variable:
- Reference (‘overt’, ‘null’)

Explanatory variables:
- Person (‘1.p.’, ‘2.p.’, ‘3.p.’ as well as the dummy pronouns ‘it’ and ‘there’)
- Register (the text category in the International Corpus of English; ‘S1A’ are informal conversations, whereas ‘S1B’ comprises formal class lessons)
- Variety (British English ‘GB’, Singapore English ‘SING’ and Hong Kong English ‘HK’)
- Referentiality (‘referential’ with an identifiable referent or ‘non-referential’ with no/generic reference)
head(data_pro)
Reference Person Register Variety Referentiality
1 overt 3 S1A GB referential
2 overt 3 S1A GB referential
3 overt 3 S1A GB referential
4 overt 3 S1A GB referential
5 overt 3 S1A GB referential
6 overt 3 S1A GB referential
table(data_pro$Reference)
null overt
174 4664
Descriptive overview
Show the code
data_pro_stats1 <- data_pro %>%
  group_by(Variety) %>%
  count(Reference) %>%
  mutate(pct = n/sum(n) * 100)
| Variety | Reference | n    | pct       |
|---------|-----------|------|-----------|
| GB      | null      | 29   | 1.817043  |
| GB      | overt     | 1567 | 98.182957 |
| HK      | null      | 75   | 4.507212  |
| HK      | overt     | 1589 | 95.492789 |
| SING    | null      | 70   | 4.435995  |
| SING    | overt     | 1508 | 95.564005 |
Show the code
data_pro_stats2 <- data_pro %>%
  group_by(Variety, Register) %>%
  count(Reference) %>%
  mutate(pct = n/sum(n) * 100)
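These proportions can also be inspected visually. A minimal ggplot2 sketch based on the data_pro_stats2 summary might look as follows (the labels are illustrative, not part of the original analysis):

# Sketch: percentage of null vs. overt subjects by Variety and Register
ggplot(data_pro_stats2, aes(x = Variety, y = pct, fill = Reference)) +
  geom_col(position = "dodge") +
  facet_wrap(~ Register) +
  labs(x = "Variety", y = "Percentage") +
  theme_minimal()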
Logistic regression
In contrast to linear regression, logistic regression models a qualitative response variable \(Y\) with two outcomes2. In the present study, \(Y\) is pronominal Reference and has the outcomes Reference = null and Reference = overt, which represent null and overt subjects, respectively. Dichotomous variables of this kind are also often coded as yes/no or 1/0.
2 Logistic regression can also be used for \(\geq 3\) classes by breaking down the response variable into a series of dichotomous variables. This is also known as multinomial logistic regression or softmax regression.
Another difference from linear regression is the output of the model:
In linear regression, we obtain a predicted value for the continuous response variable we’re interested in. For instance, if we’re modelling reaction times, the model will return an estimated mean reaction time (given the predictors).
In logistic regression, the model will return a probability. In variational linguistics, this may correspond to the probability that a speaker will use one syntactic variant versus the other.
A core component of logistic regression is the logistic function. The rationale for using it is that its output always lies between \(0\) and \(1\) and can therefore be interpreted as a probability.
The simple logistic model
Assuming a binary response variable \(Y\) with the values 1 and 0 and a single predictor \(X\), the conditional probability \(P(Y = 1 \mid X)\) is then equivalent to the inverse logit in Equation 1.
\[ P(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}. \tag{1}\]
With some algebraic manipulation, it can be rearranged into a more familiar linear form:
\[ \log\left(\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)}\right) = \beta_0 + \beta_1X. \tag{2}\]
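To see the link between log odds and probabilities in R, consider the following sketch; the coefficient values are purely illustrative and not estimates from the pronoun data:

# Illustrative coefficients (not estimated from the data)
beta0 <- -3
beta1 <- 1

x <- 0:1                                      # a binary predictor
log_odds <- beta0 + beta1 * x                 # right-hand side of Equation 2
exp(log_odds) / (1 + exp(log_odds))           # inverse logit, Equation 1: always between 0 and 1
plogis(log_odds)                              # base-R shortcut for the same conversion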
The logistic model has several characteristic components. The fraction \(\frac{P(Y = 1 \mid X)}{1-P(Y = 1 \mid X)}\) represents the odds, which stand for the probability of one outcome (e.g., Reference = null) relative to the other (e.g., Reference = overt). Its logarithmic transformation yields the log odds (or logits) of one outcome versus the other.
When interpreting the output of a logistic model, note that
positive log odds indicate an increase in \(\frac{P(Y = 1 \mid X)}{1-P(Y = 1 \mid X)}\), whereas
negative log odds indicate a decrease in \(\frac{P(Y = 1 \mid X)}{1-P(Y = 1 \mid X)}\).
In more concrete terms: if we are interested in the probability that the form of pronominal reference is null (our \(Y\)) while taking into account the extra-linguistic context (Register; our \(X\)), the model would then have the general form in Equation 3.
\[ \log\left(\frac{P(\text{Reference} = \text{null} \mid \text{Register})}{1- P(\text{Reference} = \text{null} \mid \text{Register})}\right) = \beta_0 + \beta_1\text{Register} \tag{3}\]
Multiple logistic regression
If more than one predictor is included, the above equations can be expanded so as to take into account \(p\) slopes for \(p\) independent variables \(X_1, X_2, ..., X_p\).
\[ P(Y = 1 \mid X_1, ..., X_p) = \frac{e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}}{1 + e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}}. \tag{4}\]
Consequently, the log odds correspond to the intercept plus the sum of the linear predictors \(\beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p\) (cf. Equation 5).
\[ \log\left(\frac{P(Y = 1 \mid X_1, ..., X_p)}{1 - P(Y = 1 \mid X_1, ..., X_p)}\right) = \beta_0 + \sum_{i=1}^{p} \beta_i X_i \tag{5}\]
Odds ratios
To assess the strength of an effect, it is instructive to examine the odds ratios that correspond to the model coefficients. Odds ratios (OR) are defined as
\[ OR(X_1) = e^{\beta_1}. \]
Essentially, the OR describes the ratio of the odds of the outcome at one level of a predictor to the odds at another level. This is illustrated for Reference given Register below:
\[ \text{OR}(\text{Reference} \mid \text{Register}) = \frac{\frac{P(\text{Reference} = \text{null} \mid \text{Register} = \text{S1A})}{P(\text{Reference} = \text{overt} \mid \text{Register} = \text{S1A})}}{\frac{P(\text{Reference} = \text{null} \mid \text{Register} = \text{S1B})}{P(\text{Reference} = \text{overt} \mid \text{Register} = \text{S1B})}} \]
Read as: ‘The ratio between the odds of a null vs. overt subject in S1A and the odds of a null vs. overt subject in S1B’.
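As a sanity check, this odds ratio can be computed directly from a cross-tabulation of the raw data. Note that this simple two-way calculation is only a sketch and would coincide with a model-based OR only if Register were the sole predictor:

# Cross-tabulate Reference by Register
ref_by_reg <- table(data_pro$Register, data_pro$Reference)

# Odds of a null (vs. overt) subject in each register
odds_S1A <- ref_by_reg["S1A", "null"] / ref_by_reg["S1A", "overt"]
odds_S1B <- ref_by_reg["S1B", "null"] / ref_by_reg["S1B", "overt"]

# Odds ratio for S1A relative to S1B
odds_S1A / odds_S1B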
Maximum Likelihood Estimation
In contrast to continuous data, the estimation of parameters for discrete response variables is less straightforward in that there is no unique solution. Rather than finding a regression line that minimises the distance to all data points, the default approach of logistic models is to find the parameter values that are most likely, given the data. Hence this procedure is also known as Maximum Likelihood Estimation (MLE).
The model first makes an assumption about the probability distribution of the data. For categorical data, the binomial distribution is a common choice. Equation 6 shows the corresponding probability mass function, which gives the probability of observing \(y\) successes in \(k\) independent Bernoulli trials, each with success probability \(\pi\). In other words: if we tossed a coin \(k = 10\) times, how likely is it that we observe \(y = 5\) heads (i.e., 5 successes), given the probability of success \(\pi\)?
\[ f(y; k; \pi) = \binom{k}{y} \pi^y (1 - \pi)^{k-y} \tag{6}\]
Now, let \(\beta\) denote some parameter of interest (e.g., the slope coefficient of a logistic regression model). Given some observed data, the likelihood of this parameter can be described in terms of the likelihood function \(L(\beta)\) in Equation 7. It assumes \(n\) binomial probability mass functions with \(k = 1\) trials each and computes their product. Since the binomial coefficient \(\binom{k}{y}\) equals 1 in this case, it drops out. In essence, we’re multiplying successes \(\pi_i^{y_i}\) with failures \((1 - \pi_i)^{1-y_i}\) for each data point.
\[ L(\beta) = \prod_{i=1}^n \pi^{y_i}_i (1 - \pi_i)^{1-y_i} \tag{7}\]
Within the context of our pronoun data, this can be understood more intuitively as the product of the probability that a subject is a null pronoun with the probability that it is not:
\[ L(\beta) = \prod_{i=1}^n \pi_i^{\text{null}_i} (1 - \pi_i)^{1-\text{null}_i} \tag{8}\]
Conventionally, this expression is log-transformed in order to convert the product into a sum because, for one, sums are easier to handle computationally. The log-likelihood function \(\ell(\beta)\) in Equation 9 forms the basis for a variety of goodness-of-fit measures used to evaluate logistic regression models.
\[ \ell(\beta) = \sum_{i=1}^n y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i) \tag{9}\]
And in more concrete terms:
\[ \ell(\beta) = \sum_{i=1}^n \text{null}_i \log(\pi_i) + (1 - \text{null}_i) \log(1 - \pi_i) \tag{10}\]
In a next step, each \(\pi_i\) would be substituted by the corresponding probability expression from the right-hand side of Equation 4.
The goal is to find the value that maximises \(\ell(\beta)\), i.e., the maximum likelihood estimator \(\hat{\beta}\). Approximate solutions can be attained via iterative optimisation techniques (e.g., Newton-Raphson or gradient descent). Sometimes the algorithm may fail to find an optimal solution, which R may report as a model convergence error. For further technical details, see Wood (2006: 63-66) or Agresti & Kateri (2022: 291-294).
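For illustration, the log-likelihood in Equation 9 can be maximised ‘by hand’ for an intercept-only model of \(P(\text{Reference} = \text{null})\) using optim(). This is only a sketch of the principle; glm() performs an equivalent, more sophisticated optimisation internally:

# Sketch: maximum likelihood estimation 'by hand' for an intercept-only model
y <- as.numeric(data_pro$Reference == "null")    # 1 = null, 0 = overt

neg_log_lik <- function(beta0) {
  pi_i <- exp(beta0) / (1 + exp(beta0))          # inverse logit (Equation 1)
  -sum(y * log(pi_i) + (1 - y) * log(1 - pi_i))  # negative of Equation 9
}

# Numerical optimisation; the estimate should match the intercept of
# glm(Reference ~ 1, data = data_pro, family = "binomial")
optim(par = 0, fn = neg_log_lik, method = "BFGS")$par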
Workflow in R
Research question and hypotheses
How do the intra- and extra-linguistic variables suggested in the literature affect subject pronoun realisation (Definite Null Instantiation) in British English, Singapore English and Hong Kong English?
Given the significance level \(\alpha = 0.05\), the hypotheses are:
\[ \begin{aligned} H_0: & \quad \text{None of the predictor coefficients deviate from 0}.\\ H_1: & \quad \text{At least one predictor coefficient deviates from 0}. \end{aligned} \]
These can be restated mathematically as:
\[ \begin{aligned} H_0: & \quad \beta_1 = \beta_2 = \cdots = \beta_p = 0 \\ H_1: & \quad \text{At least one } \beta_i \neq 0 \text{ for } i \in \{1, 2, \ldots, p\} \end{aligned} \]
Convert to factors and specify reference levels
The next step involves specifying reference levels for all categorical variables. This step is very important because it will directly impact the parameter estimation procedure and, consequently, influence our interpretation of the model output.
- The reference level of the response is usually chosen such that it corresponds to the unmarked or most frequent case. Since overt pronouns are much more common in the data, the reference level of the Reference variable will be set to Reference = overt. This way, the model coefficients will directly represent the probability of the null subject variant (i.e., the special case) given certain predictor configurations.
- The predictor levels need to be specified as well. Among other things, we are interested in how the Asian Englishes pattern relative to British English. Therefore, we will define British English as the baseline for comparison.
We will use the following specifications:
| Variable       | Factor Levels                | Preferred Reference Level |
|----------------|------------------------------|---------------------------|
| Register       | S1A, S1B                     | S1A                       |
| Variety        | GB, SING, HK                 | GB                        |
| Person         | 1, 2, 3, it, there           | 3                         |
| Referentiality | referential, non-referential | referential               |
# Store "Reference" as factor
$Reference <- as.factor(data_pro$Reference)
data_pro
## Specify reference level (the 'unmarked' case)
$Reference <- relevel(data_pro$Reference, "overt")
data_pro
## Print levels
levels(data_pro$Reference)
[1] "overt" "null"
Repeat the procedure for the remaining categorical variables.
Code
# Store "Register" as factor
$Register <- as.factor(data_pro$Register)
data_pro
## Specify reference level
$Register <- relevel(data_pro$Register, "S1A")
data_pro
# Store "Variety" as factor
$Variety <- as.factor(data_pro$Variety)
data_pro
## Specify reference level
$Variety <- relevel(data_pro$Variety, "GB")
data_pro
# Store "Person" as factor
$Person <- as.factor(data_pro$Person)
data_pro
## Specify reference level
$Person <- relevel(data_pro$Person, "3")
data_pro
# Store "Referentiality" as factor
$Referentiality <- as.factor(data_pro$Referentiality)
data_pro
## Specify reference level
$Referentiality <- relevel(data_pro$Referentiality, "referential") data_pro
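Alternatively, the same conversions can be condensed into a single mutate() call using forcats (loaded with the tidyverse). This is an equivalent sketch, not the procedure used above:

# Sketch: set factors and reference levels in one step
data_pro <- data_pro %>%
  mutate(
    Reference      = fct_relevel(as.factor(Reference), "overt"),
    Register       = fct_relevel(as.factor(Register), "S1A"),
    Variety        = fct_relevel(as.factor(Variety), "GB"),
    Person         = fct_relevel(as.factor(Person), "3"),
    Referentiality = fct_relevel(as.factor(Referentiality), "referential")
  )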
Model fitting
# With glm(); available in base R
# Note the additional "family" argument!
Reference.glm <- glm(Reference ~ Register + Variety + Person + Referentiality,
                     data = data_pro, family = "binomial")

# View model statistics
summary(Reference.glm)
Call:
glm(formula = Reference ~ Register + Variety + Person + Referentiality,
family = "binomial", data = data_pro)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6257 -0.2719 -0.2512 -0.1597 3.1990
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4402 0.2221 -15.489 < 2e-16 ***
RegisterS1B 0.1102 0.1603 0.688 0.49173
VarietyHK 0.9856 0.2248 4.385 1.16e-05 ***
VarietySING 0.9662 0.2272 4.252 2.12e-05 ***
Person1 -0.9155 0.1800 -5.086 3.65e-07 ***
Person2 -1.6676 0.2691 -6.198 5.72e-10 ***
Personit 0.8130 0.2960 2.747 0.00602 **
Personthere -2.6367 1.0078 -2.616 0.00889 **
Referentialitynon-referential NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1498.8 on 4837 degrees of freedom
Residual deviance: 1387.5 on 4830 degrees of freedom
AIC: 1403.5
Number of Fisher Scoring iterations: 7
# Get Pseudo-R^2 (requires the DescTools library)
DescTools::PseudoR2(Reference.glm, "Nagelkerke")
Nagelkerke
0.08538986
# Likelihood ratio test
lmtest::lrtest(Reference.glm)
Likelihood ratio test
Model 1: Reference ~ Register + Variety + Person + Referentiality
Model 2: Reference ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 8 -693.75
2 1 -749.42 -7 111.33 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Further model diagnostics
Post-hoc evaluation of the logistic regression model is just as important as it is for linear regression. However, several assumptions can be relaxed. While independence of observations, a linear relationship between the predictors and the log odds of the response, as well as uncorrelated predictors remain essential, the stipulations on the distribution and constant variance of the residuals may be disregarded.
Multicollinearity can be inspected via the vif() function from the car package:
Code
vif(Reference.glm)
RegisterS1B VarietyHK
NA NA
VarietySING Person1
NA NA
Person2 Personit
NA NA
Personthere Referentialitynon-referential
NA NA
The fact that the variance inflation factors cannot be estimated properly is highly problematic, suggesting severe multicollinearity. In fact, checking for linear dependencies via alias() reveals Referentiality = non-referential as the culprit:
alias(Reference.glm)
Model :
Reference ~ Register + Variety + Person + Referentiality
Complete :
(Intercept) RegisterS1B VarietyHK VarietySING
Referentialitynon-referential 0 0 0 0
Person1 Person2 Personit Personthere
Referentialitynon-referential 0 0 1 1
Refitting the model without said predictor solves the problem:
Reference.glm2 <- glm(Reference ~ Register + Variety + Person,
                      data = data_pro, family = "binomial")

summary(Reference.glm2)
Call:
glm(formula = Reference ~ Register + Variety + Person, family = "binomial",
data = data_pro)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6257 -0.2719 -0.2512 -0.1597 3.1990
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4402 0.2221 -15.489 < 2e-16 ***
RegisterS1B 0.1102 0.1603 0.688 0.49173
VarietyHK 0.9856 0.2248 4.385 1.16e-05 ***
VarietySING 0.9662 0.2272 4.252 2.12e-05 ***
Person1 -0.9155 0.1800 -5.086 3.65e-07 ***
Person2 -1.6676 0.2691 -6.198 5.72e-10 ***
Personit 0.8130 0.2960 2.747 0.00602 **
Personthere -2.6367 1.0078 -2.616 0.00889 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1498.8 on 4837 degrees of freedom
Residual deviance: 1387.5 on 4830 degrees of freedom
AIC: 1403.5
Number of Fisher Scoring iterations: 7
vif(Reference.glm2)
RegisterS1B VarietyHK VarietySING Person1 Person2 Personit
1.036261 2.018748 2.023863 1.129088 1.087312 1.098571
Personthere
1.006525
Removing it does not affect the fit at all:
anova(Reference.glm, Reference.glm2)
Analysis of Deviance Table
Model 1: Reference ~ Register + Variety + Person + Referentiality
Model 2: Reference ~ Register + Variety + Person
Resid. Df Resid. Dev Df Deviance
1 4830 1387.5
2 4830 1387.5 0 0
With the function drop1(), it is possible to remove each variable from the full model in turn and ascertain which ones improve the model significantly (i.e., whose inclusion decreases the deviance and AIC scores).
drop1(Reference.glm, test = "Chisq")
Single term deletions
Model:
Reference ~ Register + Variety + Person + Referentiality
Df Deviance AIC LRT Pr(>Chi)
<none> 1387.5 1403.5
Register 1 1388.0 1402.0 0.471 0.4926
Variety 2 1413.7 1425.7 26.179 2.067e-06 ***
Person 3 1471.1 1481.1 83.607 < 2.2e-16 ***
Referentiality 0 1387.5 1403.5 0.000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output suggests that Register and Referentiality do not render the model substantially worse upon their removal.
Confidence intervals and odds ratios
# Tidy the model output
tidy_model <- tidy(Reference.glm2, conf.int = TRUE)

# Remove intercept, compute odds ratios and their CIs
(tidy_model <- tidy_model %>%
  filter(term != "(Intercept)") %>%
  mutate(
    odds_ratio = exp(estimate),
    odds.conf.low = exp(conf.low),
    odds.conf.high = exp(conf.high)
  ))
# A tibble: 7 × 10
term estimate std.error statistic p.value conf.low conf.high odds_ratio
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RegisterS… 0.110 0.160 0.688 4.92e- 1 -0.206 0.423 1.12
2 VarietyHK 0.986 0.225 4.38 1.16e- 5 0.556 1.44 2.68
3 VarietySI… 0.966 0.227 4.25 2.12e- 5 0.532 1.43 2.63
4 Person1 -0.916 0.180 -5.09 3.65e- 7 -1.27 -0.567 0.400
5 Person2 -1.67 0.269 -6.20 5.72e-10 -2.23 -1.17 0.189
6 Personit 0.813 0.296 2.75 6.02e- 3 0.202 1.37 2.25
7 Personthe… -2.64 1.01 -2.62 8.89e- 3 -5.51 -1.12 0.0716
# ℹ 2 more variables: odds.conf.low <dbl>, odds.conf.high <dbl>
Visualisation
- Plot model coefficients:
Code
# Create the coefficient plot
ggplot(tidy_model, aes(x = estimate, y = term)) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
geom_vline(xintercept = 0, linetype = "dashed", color = "steelblue") +
theme_minimal() +
labs(
x = "Coefficient Estimate (log-odds)",
y = "Predictor",
title = "Coefficient Estimates with Confidence Intervals",
caption = "*Note that the CIs of singificant predictors do not include 0."
)
- Plot odds ratios:
Code
# Plot odds ratios
ggplot(tidy_model, aes(x = exp(estimate), y = term)) +
geom_point() +
geom_errorbarh(aes(xmin = odds.conf.low, xmax = odds.conf.high), height = 0.2) +
geom_vline(xintercept = 1, linetype = "dashed", color = "steelblue") +
theme_minimal() +
labs(
x = "Coefficient Estimate (odds ratios)",
y = "Predictor",
title = "Odds ratios with Confidence Intervals",
caption = "*Note that the CIs of singificant predictors do not include 1."
)
Interpret the model
The full logistic regression model is significantly better than a null model with no predictors (Likelihood Ratio Test: \(\chi^2 = 111.33\), \(df = 7\), \(p < 0.001\)) and has acceptable fit (Nagelkerke’s \(R^2\) = \(0.09\)). However, heavy multicollinearity was detected and traced back to Referentiality; removing this feature resolved the issue and did not affect model fit.
The model coefficients of the reduced model indicate that null subjects are significantly more likely in Singapore English compared to British English (Estimate = 0.97, 95% CI [0.53, 1.43], \(p < 0.001\)). This effect is moderate with an \(OR\) of 2.63 (95% CI [1.70, 4.16]), suggesting that the odds of subject omission are approximately 2.6 times higher in the Singaporean variety.
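To translate these coefficients back onto the probability scale, predicted probabilities can be obtained from the reduced model via predict() with type = "response". A brief sketch with hypothetical predictor combinations:

# Sketch: predicted probability of a null subject for two illustrative
# (hypothetical) contexts, based on Reference.glm2
new_obs <- data.frame(
  Register = c("S1A", "S1A"),
  Variety  = c("GB", "SING"),
  Person   = c("3", "3")
)
predict(Reference.glm2, newdata = new_obs, type = "response")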
…