Created
December 12, 2014 21:59
-
-
Save gdbassett/528d816d035f2deaaca1 to your computer and use it in GitHub Desktop.
Efficient python implementation of canopy clustering. (A method for efficiently generating centroids and clusters, most commonly as input to a more robust clustering algorithm.)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from sklearn.metrics.pairwise import pairwise_distances | |
import numpy as np | |
# X shoudl be a numpy matrix, very likely sparse matrix: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix | |
# T1 > T2 for overlapping clusters | |
# T1 = Distance to centroid point to not include in other clusters | |
# T2 = Distance to centroid point to include in cluster | |
# T1 > T2 for overlapping clusters | |
# T1 < T2 will have points which reside in no clusters | |
# T1 == T2 will cause all points to reside in mutually exclusive clusters | |
# Distance metric can be any from here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html | |
# filemap may be a list of point names in their order in X. If included, row numbers from X will be replaced with names from filemap. | |
def canopy(X, T1, T2, distance_metric='euclidean', filemap=None): | |
canopies = dict() | |
X1_dist = pairwise_distances(X, metric=distance_metric) | |
canopy_points = set(range(X.shape[0])) | |
while canopy_points: | |
point = canopy_points.pop() | |
i = len(canopies) | |
canopies[i] = {"c":point, "points": list(np.where(X1_dist[point] < T2)[0])} | |
canopy_points = canopy_points.difference(set(np.where(X1_dist[point] < T1)[0])) | |
if filemap: | |
for canopy_id in canopies.keys(): | |
canopy = canopies.pop(canopy_id) | |
canopy2 = {"c":filemap[canopy['c']], "points":list()} | |
for point in canopy['points']: | |
canopy2["points"].append(filemap[point]) | |
canopies[canopy_id] = canopy2 | |
return canopies |
I think that in lines 5, 8 and 9 the signs should be reversed.
I agree with MitraMitraMitra, those lines seem backwards. Also, if you are using this on larger datasets, you can use pairwise_distance_chunked to avoid the need to compute the distance matrix all at once.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
One potential problem here is that this method relies on computing the distance matrix directly. This is somewhat of a brute force approach and will end up meaning that the time and memory requirements for this method will be, at best, O(n^2). That may be fine for smallish data sets, since computing the pairwise distances may be quite fast. However, for application to large datasets, you might want to consider using something like ball tree or KD tree. I.e. instead of computing the distance matrix directly, build a skl.neighbors.BallTree, then use the query_radius function to obtain a 'sparse' distance matrix. This can backfire if T1 and T2 are too larger, but if T1 and T2 are small relative to the volume / density of your clustering, this could yield considerable savings.