One exception is the variable Resnik_strength (Resnik 1996), which was computed manually and appended to the data frame.
The data frame scope_sem_df contains semantic ratings for a sample of 1,702 transitive verbs. Note that all columns have been standardised (cf. ?scale() for details).
A popular descriptive measure for associations between continuous variables \(x\) and \(y\) is the Pearson product-moment correlation coefficient (or simply Pearson’s \(r\); cf. Equation 27.1). It varies on a scale from \(-1\) to \(1\) and indicates the extent to which two variables form a straight-line relationship (Heumann, Schomaker, and Shalabh 2022: 153-154). One of its core components is the covariance between \(x\) and \(y\) which “measures the average tendency of two variables to covary (change together)” (Baguley 2012: 206).
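For reference, Pearson's \(r\) is defined as the covariance of \(x\) and \(y\) divided by the product of their standard deviations, i.e.

\[
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \tag{27.1}
\]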
In R, we can compute Pearson’s \(r\) by using the cor() function.
# Check correlation between number of senses and concreteness
cor(scope_sem_sub[,-1]$Nsenses_WordNet, scope_sem_sub[,-1]$Conc_Brys) # low
[1] 0.2351554
# Check correlation between haptic experience and concreteness
cor(scope_sem_sub[,-1]$Haptic_Lanc, scope_sem_sub[,-1]$Conc_Brys) # high
[1] 0.5676945
If the data frame consists of numeric columns only, we can apply cor() to the full dataset and obtain the correlation matrix. (Note that this is not the same as the covariance matrix in general, although the two coincide here because all columns have been standardised.)
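A minimal call (dropping, as before, the first column):

# Correlation matrix of all numeric columns
cor(scope_sem_sub[,-1])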
Since the upper triangle mirrors the lower one, it is enough to examine only one of them. The diagonal values are always 1 (each variable correlates perfectly with itself) and can be ignored.
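If desired, the redundant entries can be blanked out before printing, for instance like this:

# Show only the lower triangle of the correlation matrix
cm <- cor(scope_sem_sub[,-1])
cm[upper.tri(cm, diag = TRUE)] <- NA # blank out diagonal and upper triangle
round(cm, 2)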
Needless to say, the above correlation matrices are hard to interpret – even more so if the number of variables were to increase further.
Principal Components Analysis offers a technique for breaking down a high-dimensional dataset into a much smaller set of “meta-variables”, i.e., principal components (PCs), which capture the bulk of the variance in the data. This is also known as dimension reduction, which allows researchers to see overarching patterns in the data and re-use the output for further analysis (e.g., clustering or predictive modelling).
27.4 Basics of PCA
PCA “repackages” large sets of variables by forming uncorrelated linear combinations of them, yielding \(k\) principal components \(Z_1, ..., Z_k\) (henceforth PCs). PCs are ordered such that the first PC explains the most variance in the data, with each subsequent PC explaining the maximum remaining variance while being uncorrelated with the previous PCs.
Each PC comprises a set of loadings (or weights) \(w_{nm}\), which are comparable to the coefficients of regression equations. For instance, the first PC has the general form shown in Equation 27.2, where \(x_1, ..., x_m\) stand for the continuous input variables (columns) of the \(n \times m\) data matrix \(\mathbf{X}\).
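In this notation, Equation 27.2 takes the form of a weighted sum:

\[
Z_1 = w_{11}x_1 + w_{12}x_2 + \dots + w_{1m}x_m \tag{27.2}
\]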
If a feature loads positively on a principal component (i.e., \(w > 0\)), then as the value of this feature increases, the score for this principal component also increases. The magnitude of \(w\) indicates the strength of this relationship. Conversely, negative loadings (\(w < 0\)) indicate that as the feature value increases, the PC score decreases.
How do we find PCs?
PCs are identified using common techniques from matrix algebra, namely singular value decomposition and eigenvalue decomposition. By breaking down the input data into products of several further matrices, it becomes possible to characterise the exact ‘shape’ of its variance (Mair 2018: 181).
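As a minimal sketch of the underlying idea (illustrated here with the built-in mtcars data; the principal() function used below wraps these steps):

# PCA 'by hand': eigendecomposition of the correlation matrix
X <- scale(mtcars[, c("mpg", "hp", "wt", "disp")]) # standardised toy data
eig <- eigen(cor(X))          # eigenvalue decomposition
eig$values                    # eigenvalues = variances of the PCs
eig$vectors                   # eigenvectors = PC loadings (one column per PC)
scores <- X %*% eig$vectors   # PC scores for each observation
head(scores)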
The figure below offers a visual summary of PCA:
27.5 Application in R
27.5.1 Fitting the model and identifying number of PCs
First, we load the psych package and fit a PCA object with the number of PCs equal to the number of numeric columns in scope_sem_sub (i.e., all columns except the first).
# Load the psych package, which provides principal() and related functions
library(psych)

# Fit initial PCA
pca1 <- principal(scope_sem_sub[,-1],
                  nfactors = ncol(scope_sem_sub[,-1]),
                  rotate = "none")

# Print loadings
loadings(pca1)
It is common practice (the so-called Kaiser criterion) to retain only those PCs with eigenvalues (variances) \(> 1\) (cf. the scree plot below).
# Scree plot
barplot(pca1$values, main = "Scree plot",
        ylab = "Variances", xlab = "PC",
        names.arg = 1:length(pca1$values))
abline(h = 1, col = "blue", lty = "dotted") # first three PCs exceed the cut-off
Alternatively, one can perform parallel analysis to identify statistically significant PCs whose variances are “larger than the 95% quantile […] of those obtained from random or resampled data” (Mair 2018: 31). The corresponding function is fa.parallel() from the psych package.
pca.pa <- fa.parallel(scope_sem_sub[,-1], # raw data
                      fa = "pc",          # use PCA instead of factor analysis
                      cor = "cor",        # use Pearson correlations (default for PCA)
                      n.iter = 200,       # number of iterations (increase for more stable results)
                      quant = 0.95,       # use 95th percentile (common choice)
                      fm = "minres")      # factor method
Parallel analysis suggests that the number of factors = NA and the number of components = 3
27.5.2 Accessing and visualising the loadings
Since three PCs appear to be enough to explain the majority of variance in the data, we will refit the model with nfactors = 3.
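The refit might look as follows (rotate = "none" is assumed here, matching the initial fit; pca2 is the object name used in the calls below):

# Refit PCA, keeping only the first three components
pca2 <- principal(scope_sem_sub[,-1], nfactors = 3, rotate = "none")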
In order to see what features load particularly strongly on the PCs, we can draw a path diagram with diagram(). Note that the red arrows indicate negative weights (i.e., negative “regression coefficients”).
diagram(pca2, main = NA)
The generic plot method returns a scatterplot of the loadings:
plot(pca2, labels = colnames(scope_sem_sub[,-1]), main = NA)
Finally, you can obtain the PC scores for each observation in the input data by accessing the $scores element:
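For instance (head() is used here merely to preview the first few rows):

# PC scores of the first observations
head(pca2$scores)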
Biplots offer juxtaposed visualisations of PC scores (points) and loadings (arrows).
# PC1 and PC2
biplot(pca2, choose = c(1, 2), main = NA, pch = 20, col = c("darkgrey", "blue"))
# PC2 and PC3
biplot(pca2, choose = c(2, 3), main = NA, pch = 20, col = c("darkgrey", "blue"))
Interpreting the PCA output
After inspecting the loadings and biplots, we can see the following patterns:
External sensation: Higher ratings in concreteness (i.e., direct perception with one’s senses) as well as the visual and haptic dimensions of verbs are associated with an increase in PC1.
Senses and selection: PC2 displays notable negative loadings in features relating to the number of meanings a verb has and how much information it carries about the meaning of its objects. Accordingly, PC2 scores decrease if a verb has more meanings and if it displays higher selectional preference strength.
Internal sensation: PC3 captures variance in olfactory, gustatory and interoceptive2 ratings.
2 Here interoceptive means “[t]o what extent one experiences the referent by sensations inside one’s body” (Gao, Shinkareva, and Desai 2022: 2859).
Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Houndmills, Basingstoke: Palgrave Macmillan.
Gao, Chuanji, Svetlana V. Shinkareva, and Rutvik H. Desai. 2022. “SCOPE: The South Carolina Psycholinguistic Metabase.” Behavior Research Methods 55 (6): 2853–84. https://doi.org/10.3758/s13428-022-01934-0.
Heumann, Christian, Michael Schomaker, and Shalabh. 2022. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R. 2nd ed. Cham: Springer. https://doi.org/10.1007/978-3-031-11833-3.
Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam; Philadelphia: John Benjamins Publishing Company.