27  Principal Components Analysis

Author: Vladimir Buskin

Affiliation: Catholic University of Eichstätt-Ingolstadt

27.2 Preparation

This unit relies on psycholinguistic data from the South Carolina Psycholinguistic Metabase (Gao, Shinkareva, and Desai 2022).1 Detailed descriptions of the variables can be found here.

  • 1 One exception is the variable Resnik_strength (Resnik 1996), which was computed manually and appended to the data frame.

  • The data frame scope_sem_df contains semantic ratings for a sample of 1,702 transitive verbs. Note that all columns have been standardised (cf. ?scale() for details).

    # Load libraries
    library(tidyverse)
    library(purrr)
    library(lattice)
    library(corrplot)
    library(psych)
    library(GPArotation)
    library(gridExtra)
    
    # Load data
    scope_sem_df <- readRDS("scope_sem.RDS")
    
    # Select subset
    scope_sem_sub <- scope_sem_df[,1:11]
    
    # Overview
    glimpse(scope_sem_sub)
    Rows: 1,702
    Columns: 11
    $ Verb               <chr> "abstain", "abstract", "abuse", "accelerate", "acce…
    $ Resnik_strength    <dbl> 0.40909889, 0.18206692, 0.12473608, -0.76972217, -1…
    $ Conc_Brys          <dbl> -0.94444378, -1.92983639, -0.59478833, 0.22107437, …
    $ Nsenses_WordNet    <dbl> -0.68843996, 0.27755219, 0.00155443, -0.68843996, 0…
    $ Nmeanings_Websters <dbl> -0.95559835, 0.73781281, 0.73781281, -0.27823388, 0…
    $ Visual_Lanc        <dbl> -2.2545455, 0.6103733, 1.3354358, -0.4342084, -0.34…
    $ Auditory_Lanc      <dbl> -0.84225787, -0.35605108, 1.54797548, 0.18795651, 1…
    $ Haptic_Lanc        <dbl> -0.75523987, -0.29089287, 1.25099360, -0.18911818, …
    $ Olfactory_Lanc     <dbl> -0.14444936, -0.37350419, -0.53335522, -0.37350419,…
    $ Gustatory_Lanc     <dbl> 0.27698988, -0.10105698, -0.36148925, -0.52110903, …
    $ Interoceptive_Lanc <dbl> 1.08153427, -0.06560311, 1.64313895, 1.45452985, 0.…

    27.3 Descriptive overview

    A popular descriptive measure for associations between continuous variables \(x\) and \(y\) is the Pearson product-moment correlation coefficient (or simply Pearson’s \(r\); cf. Equation 27.1). It varies on a scale from \(-1\) to \(1\) and indicates the extent to which two variables form a straight-line relationship (Heumann, Schomaker, and Shalabh 2022: 153-154). One of its core components is the covariance between \(x\) and \(y\) which “measures the average tendency of two variables to covary (change together)” (Baguley 2012: 206).

    \[ r_{xy} = \frac{Cov(x, y)}{\sqrt{Var(x)}\sqrt{Var(y)}}= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}. \tag{27.1}\]

    In R, we can compute Pearson’s \(r\) by using the cor() function.

    # Check correlation between number of senses and concreteness
    cor(scope_sem_sub$Nsenses_WordNet, scope_sem_sub$Conc_Brys) # low
    [1] 0.2351554
    # Check correlation between haptic experience and concreteness
    cor(scope_sem_sub$Haptic_Lanc, scope_sem_sub$Conc_Brys) # high
    [1] 0.5676945
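    To see how Equation 27.1 maps onto code, Pearson’s \(r\) can also be computed by hand. A minimal sketch using two toy vectors (not part of the scope_sem data):

```r
# Manual computation of Pearson's r (Equation 27.1) on two toy vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

# Numerator: covariance term; denominator: product of the standard deviation terms
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

r_manual
all.equal(r_manual, cor(x, y)) # agrees with the built-in function
```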

    If the data frame consists of numeric columns only, we can apply cor() to the entire dataset and obtain the correlation matrix. Since all columns have been standardised, this matrix is identical to the covariance matrix of the data.

    # Generate correlation matrix
    cor_mat1 <- cor(scope_sem_sub[,-1])
    
    head(cor_mat1)
                       Resnik_strength  Conc_Brys Nsenses_WordNet
    Resnik_strength         1.00000000  0.1166670     -0.37442983
    Conc_Brys               0.11666697  1.0000000      0.23515537
    Nsenses_WordNet        -0.37442983  0.2351554      1.00000000
    Nmeanings_Websters     -0.34225250  0.2023356      0.68509560
    Visual_Lanc             0.05471417  0.5519836      0.17154846
    Auditory_Lanc          -0.11162700 -0.2683646     -0.02960745
                       Nmeanings_Websters Visual_Lanc Auditory_Lanc  Haptic_Lanc
    Resnik_strength            -0.3422525  0.05471417   -0.11162700  0.008260683
    Conc_Brys                   0.2023356  0.55198358   -0.26836458  0.567694470
    Nsenses_WordNet             0.6850956  0.17154846   -0.02960745  0.239470104
    Nmeanings_Websters          1.0000000  0.14597243   -0.04656650  0.193226862
    Visual_Lanc                 0.1459724  1.00000000   -0.11674896  0.404536416
    Auditory_Lanc              -0.0465665 -0.11674896    1.00000000 -0.289586292
                       Olfactory_Lanc Gustatory_Lanc Interoceptive_Lanc
    Resnik_strength        0.05131717    0.015087346      -9.459551e-02
    Conc_Brys              0.21354305    0.123459754      -3.257702e-01
    Nsenses_WordNet       -0.03353627   -0.018262178      -1.392934e-02
    Nmeanings_Websters    -0.01898766    0.001409646      -4.460375e-05
    Visual_Lanc            0.15319007    0.055176064      -3.424087e-01
    Auditory_Lanc         -0.06123191   -0.047086877       1.799479e-01
    # Plot correlation matrix
    corrplot(cor_mat1, col = topo.colors(200), tl.col = "darkgrey", number.cex = 0.5, tl.cex = 0.5)

    Since the upper triangle mirrors the lower one, it is enough to only examine one of them. The diagonal values are not particularly insightful and can be ignored.
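    Because only one triangle carries information, the strongest associations can also be extracted programmatically rather than read off the plot. A minimal sketch using the built-in mtcars data as a stand-in (the same recipe applies to cor_mat1):

```r
# Convert the correlation matrix to long format and keep each pair once
R <- cor(mtcars)
pairs_df <- as.data.frame(as.table(R)) # columns: Var1, Var2, Freq
pairs_df <- subset(pairs_df, as.integer(Var1) < as.integer(Var2)) # upper triangle only

# Five strongest correlations by absolute value
head(pairs_df[order(-abs(pairs_df$Freq)), ], 5)
```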

    # Levelplot
    seq1 <- seq(-1, 1, by = 0.01)
    
    levelplot(cor_mat1, aspect = "fill", col.regions = topo.colors(length(seq1)),
              at = seq1, scales = list(x = list(rot = 45)),
              xlab = "", ylab = "")

    Needless to say, the above correlation matrices are hard to interpret – even more so if the number of variables were to increase further.

    Principal Components Analysis (PCA) offers a technique for breaking down a high-dimensional dataset into a much smaller set of “meta-variables”, i.e., principal components (PCs), which capture the bulk of the variance in the data. This is also known as dimension reduction; it allows researchers to see overarching patterns in the data and to re-use the output for further analysis (e.g., clustering or predictive modelling).

    27.4 Basics of PCA

    PCA “repackages” large sets of variables by forming uncorrelated linear combinations of them, yielding \(k\) principal components (PCs) \(Z_1, ..., Z_k\). PCs are ordered such that the first PC explains the most variance in the data, with each subsequent PC explaining the maximum remaining variance while being uncorrelated with all previous PCs.

    Each PC comprises a set of loadings (or weights) \(w_{mk}\), which are comparable to the coefficients of regression equations. For instance, the first PC has the general form shown in Equation 27.2, where the \(x_m\) stand for the continuous input variables in the \(n \times m\) data matrix \(\mathbf{X}\).

    \[ Z_{1} = w_{11} \begin{pmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{pmatrix} + w_{21} \begin{pmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{pmatrix} + \dots + w_{m1} \begin{pmatrix} x_{1m} \\ x_{2m} \\ \vdots \\ x_{nm} \end{pmatrix} \tag{27.2}\]

    If a feature loads positively on a principal component (i.e., \(w > 0\)), then as the value of this feature increases, the score for this principal component also increases. The magnitude of \(w\) indicates the strength of this relationship. Conversely, a negative loading (\(w < 0\)) indicates that as the feature value increases, the PC score decreases.

    PCs are identified using common techniques from matrix algebra, namely singular value decomposition and eigenvalue decomposition. By breaking down the input data into products of several further matrices, it becomes possible to characterise the exact ‘shape’ of its variance (Mair 2018: 181).
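    A minimal sketch of this idea, using eigen() on the correlation matrix of a few built-in mtcars variables as a stand-in for the scope_sem data: the eigenvalues are the PC variances, and the eigenvectors supply the weights.

```r
# PCA via eigendecomposition of the correlation matrix
vars <- c("mpg", "disp", "hp", "wt")
X <- scale(mtcars[, vars]) # standardised inputs
eig <- eigen(cor(mtcars[, vars]))

eig$values       # variances of the PCs (eigenvalues), in decreasing order
eig$vectors[, 1] # weights of the first PC (the sign is arbitrary)

# PC scores are linear combinations of the standardised variables
Z1 <- X %*% eig$vectors[, 1]
var(Z1)          # equals the first eigenvalue
```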

    The figure below offers a visual summary of PCA:

    27.5 Application in R

    27.5.1 Fitting the model and identifying number of PCs

    First, we fit a PCA object with the number of PCs equal to the number of numeric columns in scope_sem_sub (i.e., all columns except Verb).

    # Fit initial PCA
    pca1 <- principal(scope_sem_sub[,-1],
                      nfactors = ncol(scope_sem_sub[,-1]),
                      rotate = "none")
    
    # Print loadings
    loadings(pca1)
    
    Loadings:
                       PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8   
    Resnik_strength            0.666 -0.271 -0.100  0.250  0.627              
    Conc_Brys           0.813  0.210 -0.173         0.149        -0.170 -0.260
    Nsenses_WordNet     0.523 -0.696  0.124                0.241              
    Nmeanings_Websters  0.493 -0.683  0.149                0.343              
    Visual_Lanc         0.691  0.168 -0.236  0.382  0.152 -0.127  0.484  0.136
    Auditory_Lanc      -0.388 -0.228  0.199  0.734  0.413        -0.208       
    Haptic_Lanc         0.728  0.113        -0.272  0.401 -0.254 -0.266  0.120
    Olfactory_Lanc      0.324  0.444  0.671  0.163 -0.200               -0.347
    Gustatory_Lanc      0.256  0.377  0.759        -0.160                0.377
    Interoceptive_Lanc -0.341 -0.164  0.577 -0.366  0.543         0.265 -0.100
                       PC9    PC10  
    Resnik_strength                 
    Conc_Brys          -0.378       
    Nsenses_WordNet            0.405
    Nmeanings_Websters        -0.373
    Visual_Lanc                     
    Auditory_Lanc                   
    Haptic_Lanc         0.266       
    Olfactory_Lanc      0.234       
    Gustatory_Lanc     -0.187       
    Interoceptive_Lanc -0.120       
    
                     PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9  PC10
    SS loadings    2.629 1.898 1.595 0.937 0.806 0.657 0.461 0.378 0.330 0.309
    Proportion Var 0.263 0.190 0.160 0.094 0.081 0.066 0.046 0.038 0.033 0.031
    Cumulative Var 0.263 0.453 0.612 0.706 0.787 0.852 0.898 0.936 0.969 1.000

    It is common practice to retain only those PCs with eigenvalues (variances) \(> 1\) (cf. scree plot).
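    This retention rule (the Kaiser criterion) is easy to apply in code. A minimal sketch with base R’s prcomp() on the built-in mtcars data as a stand-in:

```r
# Kaiser criterion: retain PCs whose eigenvalues (variances) exceed 1
pca_demo <- prcomp(mtcars, scale. = TRUE)
eigenvalues <- pca_demo$sdev^2 # variances of the PCs

sum(eigenvalues > 1) # number of components retained under the rule
```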

    # Scree plot
    barplot(pca1$values, main = "Scree plot", ylab = "Variances", xlab = "PC",
            names.arg = 1:length(pca1$values)) # first three PCs exceed 1
    abline(h = 1, col = "blue", lty = "dotted")

    Alternatively, one can perform parallel analysis to identify statistically significant PCs whose variances are “larger than the 95% quantile […] of those obtained from random or resampled data” (Mair 2018: 31). The corresponding function is fa.parallel() from the psych package.

    pca.pa <- fa.parallel(scope_sem_sub[,-1], # raw data
                         fa = "pc", # Use PCA instead of factor analysis
                         cor = "cor",  # Use Pearson correlations (default for PCA)
                         n.iter = 200, # Number of iterations (increase for more stable results)
                         quant = 0.95, # Use 95th percentile (common choice)
                         fm = "minres") # Factor method

    Parallel analysis suggests that the number of factors =  NA  and the number of components =  3 
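    The logic behind fa.parallel() can be replicated in a few lines of base R: draw random normal data of the same dimensions, collect its eigenvalues, and keep only observed components that beat the 95th percentile of the random ones. A bare-bones sketch on the built-in mtcars data:

```r
# Bare-bones parallel analysis: observed vs. resampled eigenvalues
set.seed(1)
n <- nrow(mtcars); p <- ncol(mtcars)

obs <- eigen(cor(mtcars))$values # observed eigenvalues

# Eigenvalues from 200 random datasets of the same size
rand <- replicate(200, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
threshold <- apply(rand, 1, quantile, probs = 0.95)

sum(obs > threshold) # number of components exceeding the random baseline
```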

    27.5.2 Accessing and visualising the loadings

    Since three PCs appear to be enough to explain the majority of variance in the data, we will refit the model with nfactors = 3.

    pca2 <- principal(scope_sem_sub[,-1],
                      nfactors = 3,
                      rotate = "none")

    A convenient function for printing the PCA loadings is loadings(). Weights close to \(0\) are not displayed.

    loadings(pca2)
    
    Loadings:
                       PC1    PC2    PC3   
    Resnik_strength            0.666 -0.271
    Conc_Brys           0.813  0.210 -0.173
    Nsenses_WordNet     0.523 -0.696  0.124
    Nmeanings_Websters  0.493 -0.683  0.149
    Visual_Lanc         0.691  0.168 -0.236
    Auditory_Lanc      -0.388 -0.228  0.199
    Haptic_Lanc         0.728  0.113       
    Olfactory_Lanc      0.324  0.444  0.671
    Gustatory_Lanc      0.256  0.377  0.759
    Interoceptive_Lanc -0.341 -0.164  0.577
    
                     PC1   PC2   PC3
    SS loadings    2.629 1.898 1.595
    Proportion Var 0.263 0.190 0.160
    Cumulative Var 0.263 0.453 0.612

    In order to see what features load particularly strongly on the PCs, we can draw a path diagram with diagram(). Note that the red arrows indicate negative weights (i.e., negative “regression coefficients”).

    diagram(pca2, main = NA)

    The generic plot method returns a scatterplot of the loadings:

    plot(pca2, labels = colnames(scope_sem_sub[,-1]), main = NA)

    Finally, you can obtain the PC scores for each observation in the input data by accessing the $scores element:

    head(pca2$scores, n = 15)
                  PC1         PC2         PC3
     [1,] -1.45999990  0.38323657  0.61549383
     [2,] -0.32170158 -0.60027352 -0.08271852
     [3,]  0.12196548 -0.68757984  0.33959072
     [4,] -0.57929327 -0.35785887  0.26877126
     [5,] -0.34097381 -1.35963060  0.54717808
     [6,] -0.04799048 -0.34404820 -0.19668070
     [7,] -0.33873248  0.52372694 -0.28318588
     [8,] -1.11868861  0.26178424 -0.66465979
     [9,]  0.15263031  0.60489417 -0.84324699
    [10,] -1.75834143 -0.47957110  0.44313029
    [11,] -1.26440095 -1.15766536  0.46594800
    [12,]  0.10641410  0.05075197  0.48556702
    [13,] -1.26133394 -0.35022899 -0.36512925
    [14,] -0.28070472  0.60992380 -1.29547347
    [15,] -0.72805598 -0.45777808  0.56031788
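    The scores are nothing more than the standardised data multiplied by the weight matrix. A minimal sketch with base R’s prcomp() on the built-in mtcars data (psych::principal() stores the analogous weights in its $weights element):

```r
# PC scores = standardised data %*% weight (rotation) matrix
pca_demo <- prcomp(mtcars, scale. = TRUE)

manual_scores <- scale(mtcars) %*% pca_demo$rotation
all.equal(manual_scores, pca_demo$x) # identical to the stored scores
```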

    Biplots offer juxtaposed visualisations of PC scores (points) and loadings (arrows).

    # PC1 and PC2
    biplot(pca2, choose = c(1, 2), 
           main = NA, pch = 20, col = c("darkgrey", "blue"))

    # PC2 and PC3
    biplot(pca2, choose = c(2, 3), 
           main = NA, pch = 20, col = c("darkgrey", "blue"))

    Interpreting the PCA output

    After inspecting the loadings and biplots, we can see the following patterns:

    • External sensation: Higher ratings in concreteness (i.e., direct perception with one’s senses) as well as the visual and haptic dimensions of verbs are associated with an increase in PC1.

    • Senses and selection: PC2 displays notable negative loadings for the features relating to the number of meanings a verb has, but a strong positive loading for how much information the verb carries about the meaning of its objects. Accordingly, PC2 scores decrease as a verb’s number of meanings increases, while they increase with higher selectional preference strength.

    • Internal sensation: PC3 captures variance in olfactory, gustatory and interoceptive2 ratings.

  • 2 Here interoceptive means “[t]o what extent one experiences the referent by sensations inside one’s body” (Gao, Shinkareva, and Desai 2022: 2859).