27  Principal Components Analysis

Author: Vladimir Buskin

Affiliation: Catholic University of Eichstätt-Ingolstadt

27.2 Preparation

This unit relies on psycholinguistic data from the South Carolina Psycholinguistic Metabase (Gao, Shinkareva, and Desai 2022).1 Detailed descriptions of the variables can be found here.

  • 1 One exception is the variable Resnik_strength (Resnik 1996), which was computed manually and appended to the data frame.

  • The data frame scope_sem_df contains semantic ratings for a sample of 1,702 transitive verbs. Note that all columns have been standardised (cf. ?scale() for details).

    # Load libraries
    library(tidyverse)
    library(purrr)
    library(lattice)
    library(corrplot)
    library(psych)
    library(GPArotation)
    library(gridExtra)
    
    # Load data
    scope_sem_df <- readRDS("scope_sem.RDS")
    
    # Select subset
    scope_sem_sub <- scope_sem_df[,1:11]
    
    # Overview
    glimpse(scope_sem_sub)
    Rows: 1,702
    Columns: 11
    $ Verb               <chr> "abstain", "abstract", "abuse", "accelerate", "acce…
    $ Resnik_strength    <dbl> 0.40909889, 0.18206692, 0.12473608, -0.76972217, -1…
    $ Conc_Brys          <dbl> -0.94444378, -1.92983639, -0.59478833, 0.22107437, …
    $ Nsenses_WordNet    <dbl> -0.68843996, 0.27755219, 0.00155443, -0.68843996, 0…
    $ Nmeanings_Websters <dbl> -0.95559835, 0.73781281, 0.73781281, -0.27823388, 0…
    $ Visual_Lanc        <dbl> -2.2545455, 0.6103733, 1.3354358, -0.4342084, -0.34…
    $ Auditory_Lanc      <dbl> -0.84225787, -0.35605108, 1.54797548, 0.18795651, 1…
    $ Haptic_Lanc        <dbl> -0.75523987, -0.29089287, 1.25099360, -0.18911818, …
    $ Olfactory_Lanc     <dbl> -0.14444936, -0.37350419, -0.53335522, -0.37350419,…
    $ Gustatory_Lanc     <dbl> 0.27698988, -0.10105698, -0.36148925, -0.52110903, …
    $ Interoceptive_Lanc <dbl> 1.08153427, -0.06560311, 1.64313895, 1.45452985, 0.…

    27.3 Descriptive overview

    A popular descriptive measure for associations between continuous variables \(x\) and \(y\) is the Pearson product-moment correlation coefficient (or simply Pearson’s \(r\); cf. Equation 27.1). It varies on a scale from \(-1\) to \(1\) and indicates the extent to which two variables form a straight-line relationship (Heumann, Schomaker, and Shalabh 2022: 153-154). One of its core components is the covariance between \(x\) and \(y\) which “measures the average tendency of two variables to covary (change together)” (Baguley 2012: 206).

    \[ r_{xy} = \frac{Cov(x, y)}{\sqrt{Var(x)}\sqrt{Var(y)}}= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}. \tag{27.1}\]

    In R, we can compute Pearson’s \(r\) by using the cor() function.

    # Check correlation between number of senses and concreteness
    cor(scope_sem_sub$Nsenses_WordNet, scope_sem_sub$Conc_Brys) # low
    [1] 0.2351554
    # Check correlation between haptic experience and concreteness
    cor(scope_sem_sub$Haptic_Lanc, scope_sem_sub$Conc_Brys) # high
    [1] 0.5676945
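    To see how Equation 27.1 maps onto code, Pearson’s \(r\) can also be computed by hand. A minimal sketch using two toy vectors (not part of the scope_sem data):

```r
# Manual computation of Pearson's r (Equation 27.1) on two toy vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

# Numerator: covariance term; denominator: product of the standard deviation terms
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

r_manual
all.equal(r_manual, cor(x, y)) # agrees with the built-in function
```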

    If the data frame consists of numeric columns only, we can apply cor() to the entire dataset and obtain the correlation matrix. Since all columns have been standardised, this matrix is identical to the covariance matrix of the data.

    # Generate correlation matrix
    cor_mat1 <- cor(scope_sem_sub[,-1])
    
    head(cor_mat1)
                       Resnik_strength  Conc_Brys Nsenses_WordNet
    Resnik_strength         1.00000000  0.1166670     -0.37442983
    Conc_Brys               0.11666697  1.0000000      0.23515537
    Nsenses_WordNet        -0.37442983  0.2351554      1.00000000
    Nmeanings_Websters     -0.34225250  0.2023356      0.68509560
    Visual_Lanc             0.05471417  0.5519836      0.17154846
    Auditory_Lanc          -0.11162700 -0.2683646     -0.02960745
                       Nmeanings_Websters Visual_Lanc Auditory_Lanc  Haptic_Lanc
    Resnik_strength            -0.3422525  0.05471417   -0.11162700  0.008260683
    Conc_Brys                   0.2023356  0.55198358   -0.26836458  0.567694470
    Nsenses_WordNet             0.6850956  0.17154846   -0.02960745  0.239470104
    Nmeanings_Websters          1.0000000  0.14597243   -0.04656650  0.193226862
    Visual_Lanc                 0.1459724  1.00000000   -0.11674896  0.404536416
    Auditory_Lanc              -0.0465665 -0.11674896    1.00000000 -0.289586292
                       Olfactory_Lanc Gustatory_Lanc Interoceptive_Lanc
    Resnik_strength        0.05131717    0.015087346      -9.459551e-02
    Conc_Brys              0.21354305    0.123459754      -3.257702e-01
    Nsenses_WordNet       -0.03353627   -0.018262178      -1.392934e-02
    Nmeanings_Websters    -0.01898766    0.001409646      -4.460375e-05
    Visual_Lanc            0.15319007    0.055176064      -3.424087e-01
    Auditory_Lanc         -0.06123191   -0.047086877       1.799479e-01
    # Plot correlation matrix
    corrplot(cor_mat1, col = topo.colors(200), tl.col = "darkgrey", number.cex = 0.5, tl.cex = 0.5)

    Since the upper triangle mirrors the lower one, it is enough to only examine one of them. The diagonal values are not particularly insightful and can be ignored.
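    Because only one triangle carries information, the strongest associations can also be extracted programmatically rather than read off the plot. A minimal sketch using the built-in mtcars data as a stand-in (the same recipe applies to cor_mat1):

```r
# Convert the correlation matrix to long format and keep each pair once
R <- cor(mtcars)
pairs_df <- as.data.frame(as.table(R)) # columns: Var1, Var2, Freq
pairs_df <- subset(pairs_df, as.integer(Var1) < as.integer(Var2)) # upper triangle only

# Five strongest correlations by absolute value
head(pairs_df[order(-abs(pairs_df$Freq)), ], 5)
```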

    # Levelplot
    seq1 <- seq(-1, 1, by = 0.01)
    
    levelplot(cor_mat1, aspect = "fill", col.regions = topo.colors(length(seq1)),
              at = seq1, scales = list(x = list(rot = 45)),
              xlab = "", ylab = "")

    Needless to say, the above correlation matrices are hard to interpret – even more so if the number of variables were to increase further.

    Principal Components Analysis (PCA) offers a technique for breaking down a high-dimensional dataset into a much smaller set of “meta-variables”, i.e., principal components (PCs), which capture the bulk of the variance in the data. This is also known as dimension reduction; it allows researchers to see overarching patterns in the data and to re-use the output for further analysis (e.g., clustering or predictive modelling).

    27.4 Basics of PCA

    PCA “repackages” large sets of variables by forming uncorrelated linear combinations of them, yielding \(k\) principal components (PCs) \(Z_1, ..., Z_k\). PCs are ordered such that the first PC explains the most variance in the data, with each subsequent PC explaining the maximum remaining variance while being uncorrelated with all previous PCs.

    Each PC comprises a set of loadings (or weights) \(w_{mk}\), which are comparable to the coefficients of regression equations. For instance, the first PC has the general form shown in Equation 27.2, where the \(x_m\) stand for the continuous input variables in the \(n \times m\) data matrix \(\mathbf{X}\).

    \[ Z_{1} = w_{11} \begin{pmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{pmatrix} + w_{21} \begin{pmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{pmatrix} + \dots + w_{m1} \begin{pmatrix} x_{1m} \\ x_{2m} \\ \vdots \\ x_{nm} \end{pmatrix} \tag{27.2}\]

    If a feature loads positively on a principal component (i.e., \(w > 0\)), then as the value of this feature increases, the score for this principal component also increases. The magnitude of \(w\) indicates the strength of this relationship. Conversely, a negative loading (\(w < 0\)) indicates that as the feature value increases, the PC score decreases.

    PCs are identified using common techniques from matrix algebra, namely singular value decomposition and eigenvalue decomposition. By breaking down the input data into products of several further matrices, it becomes possible to characterise the exact ‘shape’ of its variance (Mair 2018: 181).
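    A minimal sketch of this idea, using eigen() on the correlation matrix of a few built-in mtcars variables as a stand-in for the scope_sem data: the eigenvalues are the PC variances, and the eigenvectors supply the weights.

```r
# PCA via eigendecomposition of the correlation matrix
vars <- c("mpg", "disp", "hp", "wt")
X <- scale(mtcars[, vars]) # standardised inputs
eig <- eigen(cor(mtcars[, vars]))

eig$values       # variances of the PCs (eigenvalues), in decreasing order
eig$vectors[, 1] # weights of the first PC (the sign is arbitrary)

# PC scores are linear combinations of the standardised variables
Z1 <- X %*% eig$vectors[, 1]
var(Z1)          # equals the first eigenvalue
```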

    The figure below offers a visual summary of PCA:

    27.5 Application in R

    27.5.1 Fitting the model and identifying number of PCs

    First, we fit a PCA object with the number of PCs equal to the number of numeric columns in scope_sem_sub (i.e., all columns except Verb).

    # Fit initial PCA
    pca1 <- principal(scope_sem_sub[,-1],
                      nfactors = ncol(scope_sem_sub[,-1]),
                      rotate = "none")
    
    # Print loadings
    loadings(pca1)
    
    Loadings:
                       PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8   
    Resnik_strength            0.666 -0.271 -0.100  0.250  0.627              
    Conc_Brys           0.813  0.210 -0.173         0.149        -0.170 -0.260
    Nsenses_WordNet     0.523 -0.696  0.124                0.241              
    Nmeanings_Websters  0.493 -0.683  0.149                0.343              
    Visual_Lanc         0.691  0.168 -0.236  0.382  0.152 -0.127  0.484  0.136
    Auditory_Lanc      -0.388 -0.228  0.199  0.734  0.413        -0.208       
    Haptic_Lanc         0.728  0.113        -0.272  0.401 -0.254 -0.266  0.120
    Olfactory_Lanc      0.324  0.444  0.671  0.163 -0.200               -0.347
    Gustatory_Lanc      0.256  0.377  0.759        -0.160                0.377
    Interoceptive_Lanc -0.341 -0.164  0.577 -0.366  0.543         0.265 -0.100
                       PC9    PC10  
    Resnik_strength                 
    Conc_Brys          -0.378       
    Nsenses_WordNet            0.405
    Nmeanings_Websters        -0.373
    Visual_Lanc                     
    Auditory_Lanc                   
    Haptic_Lanc         0.266       
    Olfactory_Lanc      0.234       
    Gustatory_Lanc     -0.187       
    Interoceptive_Lanc -0.120       
    
                     PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9  PC10
    SS loadings    2.629 1.898 1.595 0.937 0.806 0.657 0.461 0.378 0.330 0.309
    Proportion Var 0.263 0.190 0.160 0.094 0.081 0.066 0.046 0.038 0.033 0.031
    Cumulative Var 0.263 0.453 0.612 0.706 0.787 0.852 0.898 0.936 0.969 1.000

    It is common practice to retain only those PCs with eigenvalues (variances) \(> 1\) (cf. scree plot).
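    This retention rule (the Kaiser criterion) is easy to apply in code. A minimal sketch with base R’s prcomp() on the built-in mtcars data as a stand-in:

```r
# Kaiser criterion: retain PCs whose eigenvalues (variances) exceed 1
pca_demo <- prcomp(mtcars, scale. = TRUE)
eigenvalues <- pca_demo$sdev^2 # variances of the PCs

sum(eigenvalues > 1) # number of components retained under the rule
```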

    # Scree plot
    barplot(pca1$values, main = "Scree plot", ylab = "Variances", xlab = "PC",
            names.arg = 1:length(pca1$values)) # first three PCs exceed 1
    abline(h = 1, col = "blue", lty = "dotted")

    Alternatively, one can perform parallel analysis to identify statistically significant PCs whose variances are “larger than the 95% quantile […] of those obtained from random or resampled data” (Mair 2018: 31). The corresponding function is fa.parallel() from the psych package.

    pca.pa <- fa.parallel(scope_sem_sub[,-1], # raw data
                         fa = "pc", # Use PCA instead of factor analysis
                         cor = "cor",  # Use Pearson correlations (default for PCA)
                         n.iter = 200, # Number of iterations (increase for more stable results)
                         quant = 0.95, # Use 95th percentile (common choice)
                         fm = "minres") # Factor method

    Parallel analysis suggests that the number of factors =  NA  and the number of components =  3 
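    The logic behind fa.parallel() can be replicated in a few lines of base R: draw random normal data of the same dimensions, collect its eigenvalues, and keep only observed components that beat the 95th percentile of the random ones. A bare-bones sketch on the built-in mtcars data:

```r
# Bare-bones parallel analysis: observed vs. resampled eigenvalues
set.seed(1)
n <- nrow(mtcars); p <- ncol(mtcars)

obs <- eigen(cor(mtcars))$values # observed eigenvalues

# Eigenvalues from 200 random datasets of the same size
rand <- replicate(200, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
threshold <- apply(rand, 1, quantile, probs = 0.95)

sum(obs > threshold) # number of components exceeding the random baseline
```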

    27.5.2 Accessing and visualising the loadings

    Since three PCs appear to be enough to explain the majority of variance in the data, we will refit the model with nfactors = 3.

    pca2 <- principal(scope_sem_sub[,-1],
                      nfactors = 3,
                      rotate = "none")

    A convenient function for printing the PCA loadings is loadings(). Weights close to \(0\) are not displayed.

    loadings(pca2)
    
    Loadings:
                       PC1    PC2    PC3   
    Resnik_strength            0.666 -0.271
    Conc_Brys           0.813  0.210 -0.173
    Nsenses_WordNet     0.523 -0.696  0.124
    Nmeanings_Websters  0.493 -0.683  0.149
    Visual_Lanc         0.691  0.168 -0.236
    Auditory_Lanc      -0.388 -0.228  0.199
    Haptic_Lanc         0.728  0.113       
    Olfactory_Lanc      0.324  0.444  0.671
    Gustatory_Lanc      0.256  0.377  0.759
    Interoceptive_Lanc -0.341 -0.164  0.577
    
                     PC1   PC2   PC3
    SS loadings    2.629 1.898 1.595
    Proportion Var 0.263 0.190 0.160
    Cumulative Var 0.263 0.453 0.612

    In order to see what features load particularly strongly on the PCs, we can draw a path diagram with diagram(). Note that the red arrows indicate negative weights (i.e., negative “regression coefficients”).

    diagram(pca2, main = NA)

    The generic plot method returns a scatterplot of the loadings:

    plot(pca2, labels = colnames(scope_sem_sub[,-1]), main = NA)

    Finally, you can obtain the PC scores for each observation in the input data by accessing the $scores element:

    head(pca2$scores, n = 15)
                  PC1         PC2         PC3
     [1,] -1.45999990  0.38323657  0.61549383
     [2,] -0.32170158 -0.60027352 -0.08271852
     [3,]  0.12196548 -0.68757984  0.33959072
     [4,] -0.57929327 -0.35785887  0.26877126
     [5,] -0.34097381 -1.35963060  0.54717808
     [6,] -0.04799048 -0.34404820 -0.19668070
     [7,] -0.33873248  0.52372694 -0.28318588
     [8,] -1.11868861  0.26178424 -0.66465979
     [9,]  0.15263031  0.60489417 -0.84324699
    [10,] -1.75834143 -0.47957110  0.44313029
    [11,] -1.26440095 -1.15766536  0.46594800
    [12,]  0.10641410  0.05075197  0.48556702
    [13,] -1.26133394 -0.35022899 -0.36512925
    [14,] -0.28070472  0.60992380 -1.29547347
    [15,] -0.72805598 -0.45777808  0.56031788
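    The scores are nothing more than the standardised data multiplied by the weight matrix. A minimal sketch with base R’s prcomp() on the built-in mtcars data (psych::principal() stores the analogous weights in its $weights element):

```r
# PC scores = standardised data %*% weight (rotation) matrix
pca_demo <- prcomp(mtcars, scale. = TRUE)

manual_scores <- scale(mtcars) %*% pca_demo$rotation
all.equal(manual_scores, pca_demo$x) # identical to the stored scores
```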

    Biplots offer juxtaposed visualisations of PC scores (points) and loadings (arrows).

    # PC1 and PC2
    biplot(pca2, choose = c(1, 2), 
           main = NA, pch = 20, col = c("darkgrey", "blue"))

    # PC2 and PC3
    biplot(pca2, choose = c(2, 3), 
           main = NA, pch = 20, col = c("darkgrey", "blue"))

    Interpreting the PCA output

    After inspecting the loadings and biplots, we can see the following patterns:

    • External sensation: Higher ratings in concreteness (i.e., direct perception with one’s senses) as well as the visual and haptic dimensions of verbs are associated with an increase in PC1.

    • Senses and selection: PC2 displays notable negative loadings for the features relating to the number of meanings a verb has, but a strong positive loading for how much information the verb carries about the meaning of its objects. Accordingly, PC2 scores decrease as a verb’s number of meanings increases, while they increase with higher selectional preference strength.

    • Internal sensation: PC3 captures variance in olfactory, gustatory and interoceptive2 ratings.

  • 2 Here interoceptive means “[t]o what extent one experiences the referent by sensations inside one’s body” (Gao, Shinkareva, and Desai 2022: 2859).