K-means & correlation distance

Here's a useful observation related to the use of K-means together with the Pearson correlation distance (© Alex).

The standard K-means update step, in which each cluster center is replaced by the mean of the points assigned to it, is technically not appropriate for the correlation distance d(x, y) = 1 - corr(x, y). The proper step is to take the sum of the normalized points as the new cluster center (each point x_i is assumed to be centered to zero mean first, so that dividing by its length |x_i| turns dot products into correlations):

c = sum_i (x_i / |x_i|)
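
As a rough numerical illustration of the update step above, here is a minimal Python sketch. It assumes that "normalized" means centered to zero mean and scaled to unit length; the helper names (normalize, corr, correlation_center, mean_center) and the toy points are made up for this example. It compares the total correlation of the points with the sum-of-normalized-points center against the total correlation with the plain mean.

```python
import math

def normalize(x):
    """Center x to zero mean and scale it to unit Euclidean length."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def corr(x, y):
    """Pearson correlation as the dot product of the normalized vectors."""
    return sum(a * b for a, b in zip(normalize(x), normalize(y)))

def correlation_center(points):
    """Sum of the normalized points (the update step described above)."""
    normalized = [normalize(p) for p in points]
    return [sum(col) for col in zip(*normalized)]

def mean_center(points):
    """Plain coordinate-wise mean (the standard k-means update)."""
    return [sum(col) / len(points) for col in zip(*points)]

# Hypothetical toy cluster: similar shapes on very different scales.
points = [
    [1.0, 2.0, 3.0, 5.0],
    [10.0, 30.0, 20.0, 60.0],
    [0.1, 0.0, 0.3, 0.2],
]

c_corr = correlation_center(points)
c_mean = mean_center(points)

# Total correlation of the cluster members with each candidate center.
total_corr = sum(corr(p, c_corr) for p in points)
total_mean = sum(corr(p, c_mean) for p in points)
print(total_corr, total_mean)
```

The plain mean is dominated by the large-scale point, while the sum of normalized points weighs every point's shape equally, so the first total should never be smaller than the second.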

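However, you get an equivalent result by simply running the standard k-means with the Euclidean metric on the normalized dataset. The reason is that for centered, unit-length points the squared Euclidean distance is |x - y|^2 = 2 (1 - corr(x, y)), which is just twice the correlation distance.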
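
The recipe can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: "normalized" is taken to mean centered to zero mean and scaled to unit length, the kmeans function is a bare-bones Lloyd's iteration written for this example (not any library's implementation), and the data and initial centers are invented. The sketch also checks numerically that squared Euclidean distance on normalized points equals twice the correlation distance.

```python
import math

def normalize(x):
    """Center x to zero mean and scale it to unit Euclidean length."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(points, centers, n_iter=20):
    """Bare-bones Lloyd's algorithm with the Euclidean metric."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest center in Euclidean distance.
        labels = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        # Update step: plain mean of the assigned points.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical profiles: two rising, two falling, on different scales.
data = [
    [1.0, 2.0, 3.0, 4.0],
    [100.0, 210.0, 290.0, 405.0],
    [4.0, 3.0, 2.0, 1.0],
    [0.9, 0.7, 0.4, 0.1],
]
normalized = [normalize(p) for p in data]

# Check the identity |x - y|^2 = 2 (1 - corr(x, y)) on one pair.
x, y = normalized[0], normalized[1]
corr_xy = sum(a * b for a, b in zip(x, y))
identity_gap = abs(sq_dist(x, y) - 2 * (1 - corr_xy))

# Cluster the normalized points, starting from two of them as centers.
labels, _ = kmeans(normalized, centers=[list(normalized[0]), list(normalized[2])])
print(identity_gap, labels)
```

On the raw data a Euclidean k-means would mostly separate the large-scale profile from the rest; after normalization the rising and falling profiles end up in separate clusters regardless of scale.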

In short: the standard k-means is only meant to be used with the Euclidean metric, so don't use the correlation distance with k-means. Just normalize the points before clustering if you want correlation to be the measure of similarity.

PS: The above statements are not blind rules of thumb; they can be substantiated by some straightforward maths, which I leave for you to work out if you wish.