
Cluster analysis

Pretty much what it says: a family of statistical methods that look for clusters in multivariate data. In our field it has mostly been used to look for clusters of individuals in datasets where we have scores for every individual on a number of variables.

Details #

The first key question is: what is a cluster? To some extent it’s as simple as this: a cluster brings together a set of individuals whose scores on the variables are more similar to each other’s scores than they are to the scores, on those same variables, of individuals in the other clusters.

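The next question is: similar how? This has a number of possible answers as there are many “similarity coefficients”: “Euclidean distances” (straight-line distances in Cartesian space), “city-block distances” and “Mahalanobis distances” are just a selection. Strictly these are distances, so measures of dissimilarity: the smaller the distance, the more similar the two individuals. Different similarity coefficients may turn out to create very similar clusters, or very different ones.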
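To make that concrete, here is a minimal sketch in Python of two of those distances. The three individuals and their scores are invented purely for illustration, and the helper names (`euclidean`, `city_block`, `nearest`) are my own, not from any package. Mahalanobis distance is left out because it also needs the covariance matrix of the variables.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two score profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """Sum of the absolute differences across the variables."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical scores for three individuals on two variables.
scores = {
    "A": (0, 0),
    "B": (3, 3),
    "C": (0, 5),
}

def nearest(target, distance):
    """Name of the individual closest to `target` under `distance`."""
    others = [name for name in scores if name != target]
    return min(others, key=lambda name: distance(scores[target], scores[name]))

print(nearest("A", euclidean))   # B  (about 4.24 to B versus 5 to C)
print(nearest("A", city_block))  # C  (6 to B versus 5 to C)
```

So with exactly the same scores, A’s most similar neighbour is B by one coefficient and C by the other, which is how the choice of coefficient can end up changing the clusters.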

Then there is the question: how do you form the clusters? Again there are a number of possibilities, including forms of “nearest neighbour” and “centroid/group similarity”, and variants within those. Sequential “nearest neighbour” methods work by first pairing up the two most similar individuals, based on whatever similarity coefficient you chose, and then repeatedly joining the next most similar individuals or clusters. “Centroid/group similarity” methods work differently: not looking at pairwise similarities but jumping straight to finding clusters whose members are more similar to each other than they are to the members of other clusters. It’s probably harder to picture how the computer program does this than to imagine the sequential pairing, but there are strong statistical methods to do this, generally related to analysis of variance (q.v.).
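The sequential pairing idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming Euclidean distance and the simplest “nearest neighbour” rule (the distance between two clusters is the distance between their closest members, usually called single linkage). The function name `single_linkage` and the data are hypothetical, and real packages do this far more efficiently.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Straight-line distance between two score profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, n_clusters, distance=euclidean):
    """Keep merging the two closest clusters until n_clusters remain.

    The distance between two clusters is taken as the distance between
    their two closest members (the "nearest neighbour" rule).
    Returns clusters as sorted lists of row indices into `points`.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    # Start with every individual in a cluster of their own.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a, b in combinations(range(len(clusters)), 2):
            d = min(distance(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # a < b, so deleting b does not shift the position of a.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical scores for six individuals on two variables.
points = [(1, 1), (1.5, 1), (1, 2), (8, 8), (8.5, 8), (9, 9)]
print(single_linkage(points, 2))  # [[0, 1, 2], [3, 4, 5]]
```

Passing a different `distance` function is all it takes to see whether another similarity coefficient gives the same clusters.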

The final question is: how many clusters? I find it helps to remember that if you have n individuals in your dataset you are looking at numbers of clusters between 1 and n, and that those two extremes are not achieving any clustering at all. Yet again there are various ways that cluster analyses answer this question.
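A small Python sketch may help show why the two extremes tell you nothing. It assumes a single variable, invented scores, and a deliberately crude rule of my own choosing (cut the sorted scores at the widest gaps); it then reports the within-cluster sum of squares, the analysis of variance style measure of how tightly members sit around their cluster mean, for every possible number of clusters.

```python
def split_by_largest_gaps(values, k):
    """Split sorted 1-D scores into k groups by cutting at the k-1 widest gaps."""
    ordered = sorted(values)
    # Positions 1..n-1, ranked by the size of the gap just before them.
    by_gap = sorted(range(1, len(ordered)),
                    key=lambda i: ordered[i] - ordered[i - 1],
                    reverse=True)
    cuts = sorted(by_gap[:k - 1])
    groups, start = [], 0
    for cut in cuts:
        groups.append(ordered[start:cut])
        start = cut
    groups.append(ordered[start:])
    return groups

def within_ss(groups):
    """Sum of squared deviations of each score from its own group mean."""
    total = 0.0
    for group in groups:
        mean = sum(group) / len(group)
        total += sum((x - mean) ** 2 for x in group)
    return total

# Hypothetical scores for seven individuals on one variable.
scores = [1, 2, 3, 11, 12, 13, 30]
for k in range(1, len(scores) + 1):
    groups = split_by_largest_gaps(scores, k)
    print(k, round(within_ss(groups), 2))
```

With these scores the within-cluster sum of squares falls from about 607 with one cluster, to 154 with two, to 4 with three, and reaches exactly 0 when every individual is their own cluster. It always shrinks as clusters are added, so it cannot pick the number for you on its own: the usual move is to look for the point where adding another cluster stops buying much, here three.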

As you have probably already guessed, the great challenge with cluster analysis is that there are so many ways to do it, and the question of which are best for which datasets and which variables is quite complex. Sometimes it makes sense to try various approaches and see if fairly stable clusters emerge regardless of the methods you use. Even there, though, one has to be quite careful, as using different similarity coefficients, and certainly using different clustering methods, can mean that you are actually asking rather different questions of the data: looking for what might be genuinely, and perhaps interestingly, different clusters.
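One simple way to put a number on “fairly stable” is sketched below in Python. It assumes you already have cluster labels for the same individuals from two different analyses and uses the Rand index, which is my choice for illustration rather than the only option: the proportion of pairs of individuals that the two analyses treat the same way, either together in both or apart in both. The label lists are invented.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Proportion of pairs of individuals on which two clusterings agree.

    A pair counts as agreement if both clusterings put the two individuals
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same individuals")
    if len(labels_a) < 2:
        raise ValueError("need at least two individuals")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Hypothetical cluster labels for the same six individuals from three analyses.
method_1 = [0, 0, 0, 1, 1, 1]
method_2 = [1, 1, 1, 0, 0, 0]  # same grouping, different label numbers
method_3 = [0, 0, 1, 1, 2, 2]  # a genuinely different grouping

print(rand_index(method_1, method_2))            # 1.0
print(round(rand_index(method_1, method_3), 3))  # 0.667
```

The cluster numbers themselves are arbitrary, which is why the comparison works on pairs of individuals rather than on the labels. A low value does not say which analysis is wrong: as noted above, the two methods may simply be asking different questions of the data.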

In our field these complexities probably led to cluster analysis rather going out of fashion at the end of the 20th Century. That may say more about the desperate desire to have definite “best answers” in our field than it says about inherent problems with the methods.

Try also #

  • Similarity coefficients/indices

Chapters #

Not covered in the OMbook.

Online resources #

None yet nor likely I’d say.

Dates #

First created 31.i.25.
