A “dimension reduction” method of rearranging multivariate data. It’s been used for decades in psychology to look at the correlations between items of a questionnaire (where it’s often, wrongly for pedants like me, referred to as factor analysis!). It’s also something you may come across in genetic studies now, as its ability to simplify data from many people on very, very many gene loci is valuable. I think it may also be used in MRI (Magnetic Resonance Imaging), again for its ability to simplify data from umpteen “voxels” in a scan.
Details #
The basic maths is this: if you have scores on a lot of variables, typically the k items of a questionnaire (k is the letter usually used for that number), and you have values for all k items for more than k + 1 observations (typically independent, i.e. not repeated, completions of the questionnaire), then the maths of PCA will rearrange the data to give you “component scores” on k “principal components”. So you seem simply to have translated one n by k matrix of numbers into another n by k matrix: how does that help?
The crucial thing is that the new matrix has the nice property that the first principal component (PC) accounts for the largest possible share of the variance across all the k items and n values. (Hence “principal”, I assume.) The second component accounts for the most variance once that first component is removed … and so on down to the last (kth) component. The other crucial thing is that the maths also ensures (in fact depends upon) all k components being “orthogonal” to one another: uncorrelated.
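Those two properties are easy to check numerically. Here is a minimal sketch, using made-up data and computing the PCA by hand with a singular value decomposition (one standard way to do it; a statistics package would give the same answer):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: n = 200 "respondents" on k = 5 "items",
# mixed together so the columns are correlated
n, k = 200, 5
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, k))

# PCA by hand: centre the columns, then take the SVD
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                   # n by k matrix of component scores
var_explained = s**2 / (n - 1)   # variance of each component

# The first component carries the most variance, and so on down
assert np.all(np.diff(var_explained) <= 0)

# The component scores are orthogonal: uncorrelated with one another
corr = np.corrcoef(scores, rowvar=False)
assert np.allclose(corr - np.diag(np.diag(corr)), 0, atol=1e-8)
```

So we really have just rotated the same n by k matrix into a new one, but now with the variance front-loaded into the early components and all the components uncorrelated.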
I find this analogy helpful: take a map of k cities of some country and measure how far each one is from some random point on the map; then take another random point and measure the distance to each city from that point; keep going until you have at least k + 1 sets of measurements of those distances from the random points. That n by k matrix of numbers isn’t very informative. However, if you push the numbers into a PCA (any statistics package will do this for you, in a split second on modern computers) you will find that you have two PCs with non-zero scores and the remaining k – 2 components will all be zeros (if you have measured the distances near perfectly).
That’s because you started with a 2D reality and the magic of PCA has recovered it for you. The PCs won’t necessarily be aligned North to South and East to West but the first will be orthogonal to the second and the first will actually lie along the line of maximum variance in the values for each city.
It’s not a perfect analogy to what goes on when conducting PCAs on questionnaire data, but it’s pretty good: good enough for our purposes. I think it may be pretty close to what is going on in constructing an MRI scan (though that would be 3D, not 2D).
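The recovery of a 2D reality can be sketched numerically too. A caveat on my part: distances are not linear functions of map coordinates, so with literal distance measurements the trailing components come out only near zero; in this sketch I therefore use projections onto random directions as a linear stand-in for the measurements, which makes the rank-2 structure exact. All the numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical set-up: 30 "cities" living in a 2D plane
n_cities = 30
xy = rng.uniform(0, 100, size=(n_cities, 2))  # true 2D positions

# k = 8 measurements of each city: projections onto random
# directions (a linear stand-in for the distance measurements)
k = 8
directions = rng.normal(size=(2, k))
X = xy @ directions          # 30 by 8 matrix of measurements

# PCA via SVD of the centred matrix
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)

# Only two components carry any variance: the 2D reality is recovered
assert s[0] > 0 and s[1] > 0
assert np.allclose(s[2:], 0, atol=1e-6)
```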
Clearly the variance across the items in a questionnaire is not neatly 2D or 3D but, depending on the questionnaire’s construction, it can be very clear that some simple dimensions are being recovered. For example, for the Hospital Anxiety and Depression Scales (HADS), with seven anxiety oriented items and seven depression oriented items, a PCA tends to show most of the variance in the first two components, and “rotation” of the data to let go of the orthogonality will almost always come up with two components that are strongly positively correlated and that map the anxiety items to one of the components and the depression items to the other.
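A simulation in the spirit of that HADS example shows the pattern. All the numbers here are invented for illustration (they are not real HADS parameters): two correlated latent variables drive seven “anxiety” items and seven “depression” items, and the first two principal components mop up most of the variance:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical simulation: n people, two latent variables
# (correlated .5), each driving seven noisy items
n = 500
latent = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], size=n)
anxiety = latent[:, [0]] + 0.4 * rng.normal(size=(n, 7))
depression = latent[:, [1]] + 0.4 * rng.normal(size=(n, 7))
X = np.hstack([anxiety, depression])     # n by 14 item matrix

# PCA via SVD of the centred item matrix
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
var_explained = s**2 / (s**2).sum()      # proportion per component

# Most of the variance lands in the first two components
assert var_explained[:2].sum() > 0.8
```

With these made-up parameters the first two components carry well over 80% of the total variance and the remaining twelve are essentially noise, which is the “most of the variance in the first two components” pattern described above.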
(Yes, I know I use this for my examples again and again but that’s because it’s nice and simple!)
Try also #
- Confirmatory factor analysis (CFA)
- Correlation
- Factor analysis
- Hospital anxiety and depression scales (HADS)
- Mapping
- Orthogonal
- Psychometrics
- Statistical methods
Chapters #
Not covered in the OMbook.
Online resources #
None yet, nor do I think there are likely to be: it’s probably not something people need, or if they do, it’s best they get into it with a statistician.
Dates #
First created 22.i.25