Data Clustering Algorithms


K-means Algorithm

  1. Initialize: Randomly choose K initial cluster centroids

  2. Assign clusters: Assign each data point to the nearest centroid based on Euclidean distance

  3. Update centroids: Calculate new centroids by averaging the points assigned to each cluster

  4. Repeat steps 2-3 until the centroids no longer change (convergence)
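The four steps above can be sketched directly in NumPy. This is a minimal illustration (the `kmeans` helper and toy dataset are for demonstration, not an optimized implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: random init, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: randomly choose K initial centroids from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign: each point goes to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs: K-means with k=2 recovers them
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans(X, k=2)
```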

  • Centroid: the center of a cluster; for a given cluster, the centroid is the mean of all the data points assigned to it

  • Limitations

    • Choice of k

    • Sensitivity to initialization: the final result depends on the initial centroid placement

    • Only works with numeric data

    • Assumes spherical clusters

    • Sensitivity to outliers

  • Choice of k

    • The elbow method: plot the sum of squared errors (SSE, also called inertia) against different values of K and look for an elbow point, where SSE begins to decrease at a slower rate

    • SSE: the sum of squared distances between each point and the centroid of the cluster to which it is assigned: $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} ||x - \mu_i||^2$
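A minimal sketch of the SSE computation, assuming the assignments and centroids are already known (the `sse` helper is illustrative); in practice you would run K-means for several values of K and plot the resulting SSEs to find the elbow:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(float(np.sum((X[labels == j] - c) ** 2))
               for j, c in enumerate(centroids))

# Two clusters on a line, centroids at 1 and 11:
# every point is distance 1 from its centroid, so SSE = 4
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0], [11.0]])
```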

  • Silhouette Analysis

    • Measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation): $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$

    • $a(i)$ is the average distance between i and all other points in the same cluster (cohesion)

    • $b(i)$ is the smallest average distance from i to all points in any cluster that i is not assigned to, i.e., the mean distance to the nearest neighboring cluster (separation)
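The coefficient can be computed per point; a small NumPy sketch (the `silhouette` helper and toy data are illustrative):

```python
import numpy as np

def silhouette(X, labels, i):
    """Silhouette coefficient s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    d = np.linalg.norm(X - X[i], axis=1)      # distances from point i to all points
    same = labels == labels[i]
    # a(i): mean distance to the *other* points of i's own cluster (cohesion)
    a = d[same].sum() / (same.sum() - 1)      # d[i] = 0, so the sum skips i itself
    # b(i): smallest mean distance to the points of any other cluster (separation)
    b = min(d[labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
    return (b - a) / max(a, b)

# Two tight clusters on a line: point 0 is close to its own cluster, far from the other
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
s0 = silhouette(X, labels, 0)   # a(0) = 1, b(0) = 10.5
```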

K-Modes Algorithm

  • Because K-means cannot handle categorical data, K-modes was developed for it

  • Uses a different similarity measure (simple matching dissimilarity instead of Euclidean distance)

  • Updates cluster centroids using the mode instead of the mean

  1. Initialization: randomly choose k initial cluster modes

  2. Assignment: for each data point $x_j$, calculate the dissimilarity to each cluster mode using the simple matching dissimilarity. Assign $x_j$ to the cluster with the smallest dissimilarity

  3. Update: update each cluster's mode by taking the most frequent value of each attribute

  4. Repeat steps 2-3 until assignments no longer change
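The two K-modes building blocks (matching dissimilarity for step 2, the per-attribute mode for step 3) are easy to sketch in plain Python (helper names are illustrative):

```python
from collections import Counter

def matching_dissimilarity(x, mode):
    """Simple matching dissimilarity: the number of attributes whose values differ."""
    return sum(a != b for a, b in zip(x, mode))

def update_mode(points):
    """New mode of a cluster: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*points))
```

For example, `("red", "S", "cotton")` and `("red", "M", "cotton")` differ in exactly one attribute, so their dissimilarity is 1.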

K-Prototype Algorithm

  1. Initialization: select k initial prototypes (centroids) for the clusters. These prototypes contain both numeric and categorical values.

  2. Dissimilarity Measure: a combination of Euclidean distance (for the numeric attributes) and simple matching dissimilarity (for the categorical attributes)
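A sketch of the combined measure, assuming squared Euclidean distance for the numeric part and a weight `gamma` on the categorical part, as in Huang's K-prototypes formulation (`gamma` is an assumption; these notes do not specify a weight):

```python
import numpy as np

def prototype_dissimilarity(x_num, x_cat, p_num, p_cat, gamma=1.0):
    """Squared Euclidean distance on numeric attributes plus gamma times
    the simple matching dissimilarity on categorical attributes."""
    numeric = float(np.sum((np.asarray(x_num) - np.asarray(p_num)) ** 2))
    categorical = sum(a != b for a, b in zip(x_cat, p_cat))
    return numeric + gamma * categorical

# Numeric part: (2 - 4)^2 = 4; categorical part: 1 mismatch, weighted by gamma = 0.5
d = prototype_dissimilarity([1.0, 2.0], ("red",), [1.0, 4.0], ("blue",), gamma=0.5)
```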

Hierarchical Clustering

  • Agglomerative (bottom-up): start with each data point as its own cluster and successively merge clusters until only k clusters remain

  • Divisive (top-down): start with all data points in a single cluster and split until we reach k clusters

Agglomerative Clustering

  • Start with all data points in their own clusters

  • Repeat until only one cluster remains:

    • Find 2 clusters (C1, C2) that are most similar (that have the smallest pairwise cluster dissimilarity d(C1, C2) )

    • Merge C1, C2 into one cluster

    • Single linkage: compare the closest points between clusters
    • Complete linkage: compare the farthest points between clusters
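The two pairwise cluster dissimilarities can be sketched as follows (helper names are illustrative):

```python
import numpy as np

def single_linkage(C1, C2):
    """Distance between the closest pair of points, one from each cluster."""
    return min(float(np.linalg.norm(p - q)) for p in C1 for q in C2)

def complete_linkage(C1, C2):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(float(np.linalg.norm(p - q)) for p in C1 for q in C2)

# On a line: closest pair is (1, 3), farthest pair is (0, 5)
C1 = np.array([[0.0], [1.0]])
C2 = np.array([[3.0], [5.0]])
```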

Dendrogram Construction

  1. Leaves

    1. The leaves represent individual data points/clusters

  2. Vertical lines

    1. Each vertical line represents a merger between clusters

    2. The height of the vertical line is the distance (dissimilarity) between clusters being merged

  3. Horizontal lines

    1. Connect the clusters being merged

    2. The higher up a merge occurs, the more dissimilar the merged clusters are

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm designed to identify clusters of varying shapes and sizes while distinguishing between noise points (outliers) and core points.

  • Epsilon (ε): The radius or distance within which points are considered to be neighbors

  • MinPts: The minimum number of points required to form a dense region (i.e., for a point to be considered a core point).

  • Core Point: A point is classified as a core point if at least a specified number of points (MinPts) are within a certain distance (ε).

  • Border Point: A point that is within ε distance of a core point but doesn’t have enough neighbors to qualify as a core point.

  • Noise (Outlier): A point that is neither a core point nor a border point; it lies in a region of low density.

Algorithm

  1. Decide on the values of MinPts and ε

  2. Identify each point as either a core point or not

  3. Classify the remaining points as border points or noise points.

  4. For all of the unclustered core points:

    1. Create a new cluster

    2. Add all the points that are unclustered and density connected to the current point in the cluster

  5. For each unclustered border point, assign it to the cluster of the nearest core point

  6. Leave all the noise points as they are.
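Steps 2-3 (labeling every point core, border, or noise) can be sketched in NumPy. The `classify_points` helper is illustrative and assumes a point counts itself among its ε-neighbors, a convention that varies between implementations:

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (steps 2-3 above).
    A point counts itself among its eps-neighbors here."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps
    core = neighbors.sum(axis=1) >= min_pts
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif neighbors[i][core].any():   # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Four dense points, one point on the fringe, one far-away outlier
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [0.5, 0.5], [1.4, 0.0], [5.0, 5.0]])
kinds = classify_points(X, eps=1.0, min_pts=3)
```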

Large Scale Clustering

BFR Clustering Algorithm

The BFR algorithm is designed for clustering large datasets that are too big to fit into main memory, especially in high-dimensional spaces.

Key Assumptions:

  • Clusters are normally distributed around the centroid

  • The features are uncorrelated

Mahalanobis Distance

  • Unitless measure: normalizes distances based on variance instead of units

  • Accounts for correlation: considers covariance between variables, unlike Euclidean distance

  • Multivariate Generalization: extends the concept of standard scores to multiple dimensions

    • Unlike the standard Euclidean distance, which assumes features are independent and have equal variance, Mahalanobis distance considers correlations between features and unequal variances
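A minimal NumPy sketch of the distance (the `mahalanobis` helper is illustrative); with an identity covariance matrix it reduces to the ordinary Euclidean distance:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Identity covariance: plain Euclidean distance, sqrt(3^2 + 4^2) = 5
d_euclid = mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2))

# Larger variance along the first axis shrinks that axis's contribution
d_scaled = mahalanobis([3.0, 4.0], [0.0, 0.0], np.diag([4.0, 1.0]))
```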

Application

  • Outlier Detection: identify observations that are unusually distant from the mean

  • Cluster Analysis: assign points to clusters

  • Classification: in discriminant analysis, Mahalanobis distance is used to classify observations into predefined categories

Limitations of BFR

  • Assumes normality

  • Assumes no correlation between features

CURE (Clustering Using REpresentatives)

CURE addresses these BFR limitations: it does not assume normally distributed clusters and can detect clusters of arbitrary shape.

  1. Take a small sample of the data and cluster it in main memory using Hierarchical Clustering

  2. Select a small set of points from each cluster to be representative points. These points should be chosen to be as far from one another as possible

  3. Move each of the representative points a fixed fraction of the distance between its location and the centroid of its cluster. 20% is usually a good fraction to pick.

  4. Merge two clusters if they have a pair of representative points, one from each cluster, that are “sufficiently close”. This merging step can repeat, until there are no more sufficiently close clusters, or until you find a sufficiently small number of clusters.

  5. Each point p is brought from secondary storage and compared with the representative points. We assign p to the cluster of the representative point that is closest to p.
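Step 3, shrinking the representatives toward the centroid, can be sketched as follows (the helper is illustrative; the cluster centroid is assumed to be precomputed, and alpha = 0.2 matches the 20% fraction suggested above):

```python
import numpy as np

def shrink_representatives(reps, centroid, alpha=0.2):
    """Move each representative point a fraction alpha of the way
    from its location toward the cluster centroid."""
    reps = np.asarray(reps, dtype=float)
    return reps + alpha * (np.asarray(centroid, dtype=float) - reps)

# Representatives at 0 and 10 with centroid at 5 each move 20% inward
shrunk = shrink_representatives([[0.0, 0.0], [10.0, 0.0]], [5.0, 0.0])
```

This shrinking is what gives CURE its robustness to outliers: an outlying representative is pulled back toward the body of the cluster.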

Advantages

  1. Robustness to Outliers:

    1. Shrinking representative points towards the centroid reduces the influence of outliers.

  2. Ability to Detect Arbitrary Shapes:

    1. Multiple representative points capture the geometry of clusters better than a single centroid.

  3. Scalability:

    1. Sampling and partitioning allow CURE to handle large datasets efficiently.

  4. Flexibility:

    1. The number of representative points and the shrinkage factor α can be adjusted based on the dataset

Limitations

  1. Parameter Sensitivity:

    1. The performance depends on the choice of parameters like the number of representative points and the shrinkage factor.

  2. Computational Complexity:

    1. The hierarchical clustering of representative points can be computationally intensive for large numbers of clusters.

  3. Sampling Bias:

    1. If the sample is not representative of the entire dataset, cluster quality may be affected.
