In contrast to linear regression, logistic regression models a qualitative response variable \(Y\) with two outcomes.² In the present study, \(Y\) is pronominal Reference and has the outcomes Reference = null and Reference = overt, which represent null and overt subjects, respectively. Dichotomous variables of this kind are also often coded as yes/no or 1/0.
² Logistic regression can also be used for \(\geq 3\) classes by modelling the response as a series of dichotomous contrasts against a baseline category. This generalisation is known as multinomial logistic regression or softmax regression.
Another difference from linear regression is the output of the model:
In linear regression, we obtain a predicted value for the continuous response variable we’re interested in. For instance, if we’re modelling reaction times, the model will return an estimated mean reaction time (given the predictors).
In logistic regression, however, we either get a class label or a probability. When modelling pronominal reference, the model will thus either tell us
whether a speaker would use an overt or a null subject in a given observation (class prediction).
what the probability of using one variant vs. the other would be (probability prediction).
A core component of logistic regression is the logistic function. The rationale for using it is that its output always lies between \(0\) and \(1\) and can therefore be interpreted as a probability.
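A minimal R sketch (not part of the original analysis) illustrates this property; plogis() is the built-in implementation of the logistic function:

# The logistic function maps any real number onto the open interval (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

logistic(c(-5, 0, 5))   # 0.0067 0.5000 0.9933
plogis(c(-5, 0, 5))     # identical results with the built-in function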
22.3.1 The simple logistic model
Assuming a binary response variable \(Y\) with the values 1 and 0 and a single predictor \(X\), the conditional probability \(P(Y = 1 \mid X)\) is then equivalent to the inverse logit in Equation 22.1.
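Spelled out for a single predictor (a standard restatement rather than a reproduction of the original display equation):
\[
P(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.
\]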
The logistic model has several characteristic components. The fraction \(\frac{P(Y = 1 \mid X)}{1-P(Y = 1 \mid X)}\) represents the odds, i.e., the probability of one outcome (e.g., Reference = null) relative to the other (e.g., Reference = overt). Their logarithmic transformation yields the log odds (or logits) of one outcome versus the other.
Understanding log odds
When interpreting the output of a logistic model, note that
positive log odds indicate an increase in \(\frac{P(Y = 1 \mid X)}{1-P(Y = 1 \mid X)}\), whereas
negative log odds indicate a decrease in \(\frac{P(Y = 1 \mid X)}{1-P(Y = 1 \mid X)}\).
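A short illustration with made-up probability values shows how probabilities, odds and log odds map onto one another in R (qlogis() and plogis() are the built-in logit and inverse-logit functions):

p <- c(0.25, 0.50, 0.75)
odds <- p / (1 - p)       # 0.33  1.00  3.00
log(odds)                 # -1.10  0.00  1.10

qlogis(p)                 # same log odds as above
plogis(qlogis(p))         # recovers the original probabilities

Probabilities above 0.5 thus correspond to positive log odds, probabilities below 0.5 to negative ones.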
In more concrete terms: If we are interested in the probability that the form of pronominal reference is null (our \(Y\)) while taking into account the extra-linguistic context (Register; our \(X\)), the model would then have the general form in Equation 22.3.
If more than one predictor is included, the above equations can be expanded so as to take into account \(p\) slopes for \(p\) independent variables \(X_1, X_2, ..., X_p\).
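In log-odds form, this expansion can be sketched as
\[
\log \frac{P(Y = 1 \mid X_1, \ldots, X_p)}{1 - P(Y = 1 \mid X_1, \ldots, X_p)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p.
\]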
To assess the strength of an effect, it is instructive to examine the odds ratios that correspond to the model coefficients. Odds ratios (OR) are defined as
\[
OR(X_1) = e^{\beta_1}.
\]
Understanding odds ratios
Essentially, the OR compares the odds of the outcome across the levels of an independent variable. This is illustrated for Reference given Register below:
Read as: ‘The ratio between the odds of a null vs. overt subject in S1A and the odds of a null vs. overt subject in S1B’.
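A small R sketch with invented counts (not the actual corpus frequencies) shows how such an odds ratio could be computed by hand:

# Hypothetical cross-tabulation of Reference by Register (made-up counts)
tab <- matrix(c(40, 460,    # S1A: null, overt
                15, 485),   # S1B: null, overt
              nrow = 2, byrow = TRUE,
              dimnames = list(Register = c("S1A", "S1B"),
                              Reference = c("null", "overt")))

odds_S1A <- tab["S1A", "null"] / tab["S1A", "overt"]   # ~0.087
odds_S1B <- tab["S1B", "null"] / tab["S1B", "overt"]   # ~0.031
odds_S1B / odds_S1A                                    # odds ratio, ~0.36

In a model with Register as the only predictor (and S1A as reference level), exp() of the coefficient for RegisterS1B would equal exactly this ratio.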
22.4 Finding \(\beta_0\) and \(\beta_1\): Maximum Likelihood Estimation
In contrast to continuous data, the estimation of parameters for discrete response variables is less straightforward in that there is generally no closed-form solution. Rather than finding a regression line that minimises the distance to all data points, the default approach of logistic models is to find the parameter values under which the observed data are most probable. Hence this procedure is also known as Maximum Likelihood Estimation (MLE).
The model first makes an assumption about the probability distribution of the data. For categorical data, the binomial distribution is a common choice. Equation 22.6 indicates the corresponding probability mass function, which describes the probability of observing \(y\) successes in \(k\) independent Bernoulli trials, each with success probability \(\pi\). In other words, if we tossed a coin \(k = 10\) times and observed \(y = 5\) heads (i.e., 5 successes), which value of \(\pi\) best accounts for this outcome?
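The coin-toss question can be explored numerically. The sketch below (purely illustrative) evaluates the binomial probability mass for several candidate values of \(\pi\):

# Probability of y = 5 heads in 10 tosses under different candidate values of pi
dbinom(5, size = 10, prob = c(0.3, 0.5, 0.7))
# 0.103 0.246 0.103  -> pi = 0.5 makes the observed outcome most probable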
Now, let \(\beta\) denote some parameter of interest (e.g., the slope coefficient of a logistic regression model). Given some observed data, the likelihood of this parameter can be described in terms of the likelihood function \(L(\beta)\) in Equation 22.7. It assumes \(n\) binomial probability mass functions with trials \(k = 1\) and computes their product. Since the binomial coefficient \(\binom{n}{y}\) is a constant term, it is typically dropped. In essence, we multiply the success term \(\pi_i^{y_i}\) and the failure term \((1 - \pi_i)^{1-y_i}\) for each data point.
Conventionally, this expression is log-transformed in order to convert the product into a sum because, for one, sums are easier to handle computationally. The log-likelihood function \(\ell(\beta)\) in Equation 22.8 forms the basis for a variety of goodness-of-fit measures used to evaluate logistic regression models.
The goal is to find the value that maximises \(\ell(\beta)\), i.e., the maximum likelihood estimator \(\hat{\beta}\). Approximate solutions can be obtained via iterative optimisation techniques (e.g., Newton-Raphson). Sometimes the algorithm may fail to find an optimal solution, which R reports as a convergence warning or error. For further technical details, see Wood (2006: 63-66) or Agresti & Kateri (2022: 291-294).
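For illustration, here is a minimal sketch (using simulated data, not the corpus data) of how the Bernoulli log-likelihood can be maximised numerically and checked against glm():

set.seed(1)

# Simulate a binary response from a known model: beta0 = -1, beta1 = 2
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))

# Bernoulli log-likelihood, summed over all observations
loglik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

# Maximise the log-likelihood (optim() minimises by default, hence fnscale = -1)
mle <- optim(c(0, 0), loglik, control = list(fnscale = -1))
mle$par                                  # close to c(-1, 2)

# glm() arrives at (essentially) the same estimates
coef(glm(y ~ x, family = "binomial"))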
22.5 Workflow in R
22.5.1 Research question and hypotheses
RQ: How do the intra- and extra-linguistic variables suggested in the literature affect subject pronoun realisation (Definite Null Instantiation) in British English, Singapore English and Hong Kong English?
Given a significance level \(\alpha = 0.05\), the hypotheses are: \[
\begin{aligned}
H_0: & \quad \text{None of the predictor coefficients deviate from 0}.\\
H_1: & \quad \text{At least one predictor coefficient deviates from 0}.
\end{aligned}
\]
These can be restated mathematically as:
\[
\begin{aligned}
H_0: & \quad \beta_1 = \beta_2 = \cdots = \beta_p = 0 \\
H_1: & \quad \text{At least one } \beta_i \neq 0 \text{ for } i \in \{1, 2, \ldots, p\}
\end{aligned} \]
Caution
Note that in practice, these hypotheses are rarely stated explicitly in corpus-linguistic publications.
22.5.2 Convert to factors and specify reference levels
The next step involves specifying reference levels for all categorical variables. This step is very important because it will directly impact the parameter estimation procedure and, consequently, influence our interpretation of the model output.
The reference level of the response is usually chosen such that it corresponds to the unmarked or most frequent case. Since overt pronouns are much more common in the data, the reference level of the Reference variable will be set to Reference = overt. This way, the model will predict the probability of the null-subject variant (i.e., the special case) given certain predictor configurations.
The predictor levels need to be specified as well. Among other things, we are interested in how the Asian Englishes pattern relative to British English. Therefore, we will define British English as the baseline for comparison.
We will use the following specifications:
Variable         Factor levels                    Preferred reference level
Register         S1A, S1B                         S1A
Variety          GB, SING, HK                     GB
Person           1, 2, 3, it, there               3
Referentiality   referential, non-referential     referential
# Store "Reference" as factordata_pro$Reference <-as.factor(data_pro$Reference)## Specify reference level (the 'unmarked' case)data_pro$Reference <-relevel(data_pro$Reference, "overt")## Print levelslevels(data_pro$Reference)
[1] "overt" "null"
Repeat the procedure for the remaining categorical variables.
Code
# Store "Register" as factordata_pro$Register <-as.factor(data_pro$Register)## Specify reference leveldata_pro$Register <-relevel(data_pro$Register, "S1A")# Store "Variety" as factordata_pro$Variety <-as.factor(data_pro$Variety)## Specify reference leveldata_pro$Variety <-relevel(data_pro$Variety, "GB")# Store "Person" as factordata_pro$Person <-as.factor(data_pro$Person)## Specify reference leveldata_pro$Person <-relevel(data_pro$Person, "3")# Store "Referentiality" as factordata_pro$Referentiality <-as.factor(data_pro$Referentiality)## Specify reference leveldata_pro$Referentiality <-relevel(data_pro$Referentiality, "referential")
22.5.3 Model fitting
Two commonly used functions for fitting logistic models in R are lrm() from the rms package and glm() from base R.
Note
The model formula below does not include Referentiality because several intermediary steps revealed it to be almost completely irrelevant for predicting Reference. In addition, the existing (and significant) interaction Variety:Person has been excluded to improve the interpretability of the model.
# With lrm(); requires library("rms")
# Fit interaction model
Reference.lrm <- lrm(Reference ~ Register + Variety + Register:Variety + Person,
                     data = data_pro)

# View model statistics
Reference.lrm
# With glm(); available in base R
# Note the additional "family" argument!
Reference.glm <- glm(Reference ~ Register + Variety + Register:Variety + Person,
                     data = data_pro, family = "binomial")

# View model statistics
summary(Reference.glm)
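Once a model has been fitted, the two kinds of output discussed earlier in this chapter (probability vs. class prediction) can be obtained from the glm() fit; a brief sketch (the 0.5 cut-off is a common but adjustable choice):

# Predicted probabilities of the non-reference level, here Reference = "null"
pred_prob <- predict(Reference.glm, type = "response")

# Class predictions: assign "null" if the predicted probability exceeds 0.5
pred_class <- ifelse(pred_prob > 0.5, "null", "overt")

head(pred_prob)
table(pred_class)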
Stepwise variable selection
With the function drop1(), it is possible to successively remove variables from the complex model to ascertain which ones contribute significantly to it (i.e., whose removal would increase the deviance and AIC scores).
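A possible call, assuming the glm() fit from above:

# Likelihood ratio tests for dropping each term from the model
drop1(Reference.glm, test = "Chisq")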
# Requires library("broom") and library("dplyr")
# Tidy the model output
tidy_model <- tidy(Reference.glm, conf.int = TRUE)

# Remove intercept, compute odds ratios and their CIs
tidy_model <- tidy_model %>%
  filter(term != "(Intercept)") %>%
  mutate(
    odds_ratio = exp(estimate),
    odds.conf.low = exp(conf.low),
    odds.conf.high = exp(conf.high)
  )
22.5.5 Visualisation
Plot model coefficients:
Code
# Requires library("ggplot2")
# Create the coefficient plot
ggplot(tidy_model, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "steelblue") +
  theme_minimal() +
  labs(
    x = "Coefficient Estimate (log-odds)",
    y = "Predictor",
    title = "Coefficient Estimates with Confidence Intervals",
    caption = "*Note that the CIs of significant predictors do not include 0."
  )
Code
# Plot odds ratios
ggplot(tidy_model, aes(x = exp(estimate), y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = odds.conf.low, xmax = odds.conf.high), height = 0.2) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "steelblue") +
  theme_minimal() +
  labs(
    x = "Coefficient Estimate (odds ratios)",
    y = "Predictor",
    title = "Odds Ratios with Confidence Intervals",
    caption = "*Note that the CIs of significant predictors do not include 1."
  )
Plot predicted probabilities:
Code
# Use ggeffect() from the ggeffects package
plot(ggeffect(Reference.glm, terms = c("Register"))) +
  geom_line(col = "steelblue")
The logistic regression model is statistically significant at \(p < 0.001\) (\(\chi^2 = 120.43\), \(df = 9\)) and has acceptable fit (Nagelkerke’s \(R^2 = 0.09\), \(C = 0.73\)).
The model coefficients indicate that null subjects are significantly more likely in Singapore English compared to British English (Estimate = 1.12, 95% CI [0.56, 1.73], \(p < 0.001\)). This effect is moderate with an \(OR\) of 3.06 (95% CI [1.75, 5.64]), suggesting that the probability of subject omission is elevated by a factor of approximately 3 in the Singaporean variety.
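The model \(\chi^2\) reported by lrm() corresponds to the likelihood ratio test of the fitted model against an intercept-only model; as a sketch, it can also be reproduced from the glm() object:

# Likelihood ratio chi-square, degrees of freedom and p-value
chisq <- Reference.glm$null.deviance - Reference.glm$deviance
df    <- Reference.glm$df.null - Reference.glm$df.residual
pchisq(chisq, df, lower.tail = FALSE)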
…
22.5.7 Further model diagnostics
Post-hoc evaluation of the logistic regression model is just as important as it is for linear regression. However, several assumptions can be relaxed. Although independence of observations, a linear relationship between (continuous) predictors and the log odds of the response, as well as uncorrelated predictors remain essential, the stipulations on the distribution and variance of the residuals may be disregarded.
Multicollinearity can be inspected via the vif() function, which is available in both the rms and the car package.
Code
# Variance inflation factors help detect severe multicollinearity
vif(Reference.lrm)
Cross-validating the model is always recommended to assess the model’s ability to generalise beyond the training data and thus detect overfitting.
Code
# Set seed for reproducibility
set.seed(123)

# Refit the model with additional settings
Reference.val <- lrm(Reference ~ Register + Variety + Register:Variety + Person,
                     data = data_pro, x = TRUE, y = TRUE)

# Validate the model on B = 200 resamples
model.validated <- validate(Reference.val, B = 200)

# Slope optimism should be as low as possible!
model.validated
Agresti, Alan, and Maria Kateri. 2022. Foundations of Statistics for Data Scientists: With R and Python. Boca Raton: CRC Press.
Buskin, Vladimir. n.d. “Definite Null Instantiation in English(es): A Usage-based Construction Grammar Approach.” Constructions and Frames.
Hosmer, David W., and Stanley Lemeshow. 2008. Applied Logistic Regression. 2nd ed. New York: Wiley.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. New York: Springer. https://doi.org/10.1007/978-1-0716-1418-1.
Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam; Philadelphia: John Benjamins Publishing Company.
Winter, Bodo. 2020. Statistics for Linguists: An Introduction Using R. New York; London: Routledge.
Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Boca Raton: Chapman & Hall/CRC.