Statistics for Corpus Linguists

7.4 Exploratory Factor Analysis

Author: Vladimir Buskin
Affiliation: Catholic University of Eichstätt-Ingolstadt

Recommended reading

For linguists:

Levshina (2015: Chapter 18)

General:

Mair (2018: Chapter 2)

Preparation

The implementation of Exploratory Factor Analysis in R is very similar to that of Principal Component Analysis. To highlight these similarities, we will use the same libraries (most importantly psych) and the same dataset scope_sem_sub as in the unit on PCA (see the preparation section of that unit for further details).

# Load libraries
library(tidyverse)
library(purrr)
library(psych)
library(GPArotation)
library(gridExtra)

# Load data
scope_sem_df <- readRDS("../datasets/scope_sem.RDS")

# Select subset
scope_sem_sub <- scope_sem_df[,1:11]

# Overview
glimpse(scope_sem_sub)
Rows: 1,702
Columns: 11
$ Verb               <chr> "abstain", "abstract", "abuse", "accelerate", "acce…
$ Resnik_strength    <dbl> 0.40909889, 0.18206692, 0.12473608, -0.76972217, -1…
$ Conc_Brys          <dbl> -0.94444378, -1.92983639, -0.59478833, 0.22107437, …
$ Nsenses_WordNet    <dbl> -0.68843996, 0.27755219, 0.00155443, -0.68843996, 0…
$ Nmeanings_Websters <dbl> -0.95559835, 0.73781281, 0.73781281, -0.27823388, 0…
$ Visual_Lanc        <dbl> -2.2545455, 0.6103733, 1.3354358, -0.4342084, -0.34…
$ Auditory_Lanc      <dbl> -0.84225787, -0.35605108, 1.54797548, 0.18795651, 1…
$ Haptic_Lanc        <dbl> -0.75523987, -0.29089287, 1.25099360, -0.18911818, …
$ Olfactory_Lanc     <dbl> -0.14444936, -0.37350419, -0.53335522, -0.37350419,…
$ Gustatory_Lanc     <dbl> 0.27698988, -0.10105698, -0.36148925, -0.52110903, …
$ Interoceptive_Lanc <dbl> 1.08153427, -0.06560311, 1.64313895, 1.45452985, 0.…

Exploratory Factor Analysis vs. PCA

Exploratory Factor Analysis (EFA) is quite similar to PCA in that it compresses a high-dimensional feature space. The core idea, however, is not to capture as much variance as possible with as few variables as possible, but rather to reveal latent (= unobserved) variables, i.e., factors.

The computation bears some resemblance to that of PCA. The main difference is that an observed variable \(x_m\) is assumed to be generated by a linear combination of the underlying factors \(\xi_{1}, \xi_{2}, \dots, \xi_{p}\), weighted by the factor loadings \(\lambda_{m1}, \lambda_{m2}, \dots, \lambda_{mp}\), plus an error term \(\epsilon_m\) (see Equation 1, which shows the case \(m = 1\)). Everything on the right-hand side of the equation is unobserved and can only be obtained by running estimation procedures such as Principal Axis Factoring or Maximum Likelihood Estimation.

\[ x_1 = \lambda_{11}\xi_{1} + \lambda_{12}\xi_{2} + \dots + \lambda_{1p}\xi_{p} + \epsilon_1 \tag{1}\]
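
To make the factor model more tangible, the following sketch simulates data from it in base R. All numbers are hypothetical (two factors, six observed variables, invented loadings) and have nothing to do with scope_sem_sub. The sketch also assumes standardised variables and uncorrelated factors, in which case the correlation matrix implied by the model is \(\Lambda\Lambda^\top + \Psi\), where \(\Psi\) holds the variances of the error terms.

```r
# Minimal simulation of the common factor model (all values hypothetical)
set.seed(42)
n <- 20000                                   # number of simulated observations

# Two uncorrelated, standardised latent factors (xi_1, xi_2)
xi <- matrix(rnorm(n * 2), ncol = 2)

# Hypothetical loading matrix: 6 observed variables x 2 factors
Lambda <- cbind(c(0.8, 0.7, 0.6, 0.0, 0.0, 0.0),
                c(0.0, 0.0, 0.0, 0.8, 0.7, 0.6))

# Unique variances chosen so that every observed variable has variance 1
psi <- 1 - rowSums(Lambda^2)

# Error terms epsilon_m: standard normal columns scaled by sqrt(psi_m)
eps <- sweep(matrix(rnorm(n * 6), ncol = 6), 2, sqrt(psi), "*")

# Equation 1 for all variables at once:
# x_m = lambda_m1 * xi_1 + lambda_m2 * xi_2 + epsilon_m
X <- xi %*% t(Lambda) + eps

# Correlation matrix implied by the model vs. the simulated one
R_implied <- Lambda %*% t(Lambda) + diag(psi)
max_gap   <- max(abs(cor(X) - R_implied))

# Under these assumptions, a loading is also the correlation
# between an observed variable and its factor
r_x1_xi1 <- cor(X[, 1], xi[, 1])

max_gap
r_x1_xi1
```

In a real analysis only X is observed; the loadings, factors and error variances have to be estimated, which is what fa() does.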

When retrieving PCA and EFA loadings, several interpretive differences must be kept in mind:

Key differences between EFA and PCA
  • PCA: PCA weights can be conceptualised as “directions in feature space along which the data vary the most” (James et al. 2021: 503) and are analogous to regression slopes. Features with similar loadings on a given PC will be very close to each other in a biplot and could be understood as correlated with each other.

  • EFA: The factor loadings in an EFA, on the other hand, directly indicate how strongly a factor is correlated with an observed variable in the dataset. As such, they help identify and interpret the underlying constructs that have given rise to the data. As long as the variables are standardised and the factors are uncorrelated, we can think of EFA loadings as regression coefficients and correlation coefficients at the same time.
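
The difference can be illustrated with a small sketch on simulated data. Note the assumptions: the data and loadings are hypothetical, and the EFA is fitted with base R's factanal() (maximum likelihood) rather than psych's fa(), simply to keep the example free of dependencies. PCA weights are unit-length direction vectors, whereas the squared EFA loadings of a variable add up to the share of its variance explained by the factors.

```r
# Hypothetical data from a two-factor model (not the course dataset)
set.seed(1)
n <- 50000
xi <- matrix(rnorm(n * 2), ncol = 2)
Lambda <- cbind(c(0.8, 0.7, 0.6, 0.0, 0.0, 0.0),
                c(0.0, 0.0, 0.0, 0.8, 0.7, 0.6))
psi <- 1 - rowSums(Lambda^2)
X <- xi %*% t(Lambda) +
  sweep(matrix(rnorm(n * 6), ncol = 6), 2, sqrt(psi), "*")
colnames(X) <- paste0("x", 1:6)

# PCA: weights are directions in feature space with length 1
pca <- prcomp(X, scale. = TRUE)
pca_weights <- pca$rotation[, 1:2]
pca_lengths <- colSums(pca_weights^2)

# EFA via base R's factanal(): loadings are variable-factor correlations
efa <- factanal(X, factors = 2, rotation = "varimax")
efa_loadings <- unclass(loadings(efa))

# Row sums of squared loadings estimate the explained variance per variable
communalities <- rowSums(efa_loadings^2)

pca_lengths
round(communalities, 2)
```

The estimated values in communalities should lie close to the true ones, i.e., the squared hypothetical loadings 0.64, 0.49 and 0.36 for each block of three variables.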

Application in R

We use our insights from the PCA, according to which three latent variables are enough to capture the bulk of variance in the dataset. The model below is fitted with principal axis factoring (fm = "pa"); setting fm = "ml" instead performs Maximum Likelihood Estimation. Note that fa() itself defaults to minimum residual factoring (fm = "minres"), so the estimation method has to be requested explicitly.

efa1 <- fa(scope_sem_sub[,-1], nfactors = 3, rotate = "none", fm = "pa")

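The remaining printing and plotting methods are identical to those used for PCA. Blank cells in the printed loadings are values whose absolute value falls below the default print cutoff of 0.1.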

  • Print loadings:
loadings(efa1)

Loadings:
                   PA1    PA2    PA3   
Resnik_strength            0.419 -0.269
Conc_Brys           0.800  0.228 -0.300
Nsenses_WordNet     0.560 -0.628  0.236
Nmeanings_Websters  0.495 -0.555  0.233
Visual_Lanc         0.576  0.141 -0.255
Auditory_Lanc      -0.270 -0.132  0.153
Haptic_Lanc         0.608  0.123       
Olfactory_Lanc      0.291  0.482  0.441
Gustatory_Lanc      0.263  0.513  0.657
Interoceptive_Lanc -0.245         0.386

                 PA1   PA2   PA3
SS loadings    2.196 1.482 1.143
Proportion Var 0.220 0.148 0.114
Cumulative Var 0.220 0.368 0.482
  • Plot loadings:
plot(efa1, labels = colnames(scope_sem_sub[,-1]), main = NA)

  • Plot PA scores and loadings:
biplot(efa1, choose = c(1, 2), main = NA,
       pch = 20, col = c("darkgrey", "blue"))

biplot(efa1, choose = c(2, 3), main = NA,
       pch = 20, col = c("darkgrey", "blue"))
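
The three summary rows beneath the printed loadings follow from simple arithmetic: "SS loadings" is the column sum of the squared loadings, "Proportion Var" divides it by the number of observed variables (ten here), and "Cumulative Var" is the running total. The sketch below recomputes the last two rows from the SS loadings reported above; the object names are only illustrative.

```r
# SS loadings as printed for the unrotated solution
ss_loadings <- c(PA1 = 2.196, PA2 = 1.482, PA3 = 1.143)

# Number of observed variables entering the EFA (all columns except Verb)
n_vars <- 10

# Share of total variance per factor and its running total
prop_var <- ss_loadings / n_vars
cum_var  <- cumsum(prop_var)

round(rbind("SS loadings"    = ss_loadings,
            "Proportion Var" = prop_var,
            "Cumulative Var" = cum_var), 3)
```

The last value of cum_var shows that the three factors jointly account for roughly 48% of the variance in the ten variables.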

Rotation

Factors are typically rotated to aid their interpretation, which results in much clearer loading patterns. Varimax rotation is a widely used technique and does not affect the model fit, i.e., there is no loss in explained variance (for details see Mair 2018: 26-29).¹ Since fa() applies the oblique oblimin rotation by default, Varimax has to be requested explicitly.

¹ Varimax is a so-called orthogonal rotation technique and, therefore, does not introduce correlations between the factors. If correlated factors are explicitly desired, oblique rotations such as oblimin and promax provide apt alternatives (Mair 2018: 27).

    efa2 <- fa(scope_sem_sub[,-1], nfactors = 3, rotate = "Varimax", fm = "pa")
    
    loadings(efa2)
    
    Loadings:
                       PA1    PA2    PA3   
    Resnik_strength     0.188 -0.470       
    Conc_Brys           0.868  0.130  0.107
    Nsenses_WordNet     0.145  0.861       
    Nmeanings_Websters  0.115  0.771       
    Visual_Lanc         0.638              
    Auditory_Lanc      -0.336              
    Haptic_Lanc         0.572  0.192  0.165
    Olfactory_Lanc      0.164         0.694
    Gustatory_Lanc                    0.873
    Interoceptive_Lanc -0.410         0.199
    
                     PA1   PA2   PA3
    SS loadings    1.868 1.627 1.326
    Proportion Var 0.187 0.163 0.133
    Cumulative Var 0.187 0.350 0.482
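
Why an orthogonal rotation leaves the explained variance untouched can be checked with a short sketch. It uses a hypothetical loading matrix and base R's varimax() rather than the rotation performed inside fa(), so it illustrates the principle and does not reproduce efa2. The last two lines use the SS loadings printed in this unit.

```r
# Hypothetical unrotated loadings for six variables on two factors
L <- cbind(F1 = c(0.70, 0.65, 0.60, 0.55, 0.50, 0.45),
           F2 = c(0.40, 0.35, 0.30, -0.45, -0.50, -0.40))

# Base R varimax returns the rotated loadings and the rotation matrix
rot   <- varimax(L)
L_rot <- L %*% rot$rotmat

# The rotation matrix is orthogonal: t(T) %*% T is the identity matrix
is_orthogonal <- isTRUE(all.equal(crossprod(rot$rotmat), diag(2)))

# Communalities (row sums of squared loadings) are unchanged ...
same_communalities <- isTRUE(all.equal(rowSums(L^2), rowSums(L_rot^2)))

# ... and so is the total variance explained by all factors together
total_before <- sum(L^2)
total_after  <- sum(L_rot^2)

# The same holds for the SS loadings printed in this unit
total_unrotated <- sum(c(2.196, 1.482, 1.143))
total_rotated   <- sum(c(1.868, 1.627, 1.326))
```

Rotation merely redistributes the variance across the factors, which is why the individual SS loadings change while the cumulative proportion stays at 0.482.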

    The rotated EFA object paints a picture that is very similar to the PCA result from the previous unit.

    diagram(efa2, main = NA)

    biplot(efa2, choose = c(1, 2), main = NA,
           pch = 20, col = c("darkgrey", "blue"))

    biplot(efa2, choose = c(2, 3), main = NA,
           pch = 20, col = c("darkgrey", "blue"))

    Interpreting the EFA output
    • Perception: The first principal axis once more shows high positive loadings for concreteness (0.87) as well as for visual (0.64) and haptic ratings (0.57). These loadings indicate strong linear relationships between the features and the factor. The negative loading of interoceptive ratings (-0.41) suggests that referents which tend to be perceived directly through the senses (concreteness) tend not to be experienced inside the body.

    • Senses: In PA2 we find the inverse pattern of PC2: very strong positive correlations with the sense-related features (0.86 and 0.77) and a weaker, yet notable negative correlation with selectional preference strength (-0.47). If a verb has more senses, it tends to carry less information about its context.

    • Ingestion: On PA3, interoceptive ratings are no longer a major part of the picture (0.20), thus giving way to the gustatory (0.87) and olfactory (0.69) perception of referents.

    References

    James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. New York: Springer. https://doi.org/10.1007/978-1-0716-1418-1.
    Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam; Philadelphia: John Benjamins Publishing Company.
    Mair, Patrick. 2018. Modern Psychometrics with R. Cham: Springer. https://doi.org/10.1007/978-3-319-93177-7.