Data Clustering Algorithms
Clustering
K-means Algorithm
Initialize: Randomly choose K initial cluster centroids
Assign clusters: Assign each data point to the nearest cluster based on Euclidean distance
Update centroids: Calculate new centroids by averaging the points in each cluster
Repeat
Centroid: the center of a cluster. For a given cluster, the centroid is the mean of all the data points within that cluster
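The steps above can be sketched as a minimal NumPy implementation (the choice of random data points as initial centroids and the fixed iteration cap are implementation choices, not part of the algorithm definition):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: initialize, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Initialize: randomly pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign: each point goes to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids
```

Because each initial centroid is itself a data point, every cluster is non-empty after the first assignment; a production implementation would still guard against empty clusters.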
Limitations
Choice of k
Sensitivity to initialization: the final result is sensitive to the initial centroid placement
Only works with numeric data
Assumes spherical clusters
Sensitivity to outliers
Choice of k
The elbow method: plot the sum of squared errors (SSE), or inertia, against different values of K and look for an elbow point, where the SSE begins to decrease at a slower rate
SSE: the sum of squared distances between each point and the centroid of the cluster to which it is assigned $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} ||x - \mu_i||^2$
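In code, the SSE is just the double sum in the formula (a small NumPy sketch; `sse` is an illustrative helper name):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    return float(sum(np.sum((X[labels == j] - c) ** 2)
                     for j, c in enumerate(centroids)))
```

Running this for increasing values of K and plotting the results gives the elbow curve.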
Silhouette Analysis
Measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation) $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$
a(i) is the average distance between i and all other points in the same cluster (cohesion)
b(i) is the minimum average distance between i and all points in the nearest cluster to which i is not assigned (separation)
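A direct NumPy sketch of the definition (assumes every cluster has at least two points, since a(i) is undefined for a singleton):

```python
import numpy as np

def silhouette(X, labels):
    """Mean s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    scores = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)  # distances from point i
        same = labels == labels[i]
        # a(i): average distance to the other points in i's own cluster
        # (d[i] = 0, so summing over the cluster and dividing by size-1 works)
        a = d[same].sum() / (same.sum() - 1)
        # b(i): smallest average distance to the points of any other cluster
        b = min(d[labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Values near 1 indicate tight, well-separated clusters; values near 0 or below indicate overlapping or misassigned points.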
K-Modes Algorithm
Because K-means cannot handle categorical data, K-modes is used instead
Uses a different similarity measure
Updates cluster centroids using the mode instead of the mean
Initialization: randomly choose k initial cluster modes
Assignment: for each data point x_j, calculate the dissimilarity to each cluster mode using the simple matching dissimilarity. Assign x_j to the cluster with the smallest dissimilarity
Update: update each cluster's mode by taking the most frequent value for each attribute
Repeat
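The two ingredients that differ from K-means can be sketched directly (function names are illustrative):

```python
from collections import Counter

def matching_dissimilarity(x, mode):
    """Simple matching dissimilarity: number of attributes where the values differ."""
    return sum(a != b for a, b in zip(x, mode))

def update_mode(points):
    """New mode of a cluster: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*points))
```

The assignment and update loops are otherwise identical in shape to K-means, with these two functions replacing the Euclidean distance and the mean.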
K-Prototype Algorithm
Initialization: select k initial prototypes (centroids) for the clusters. These prototypes contain both numeric and categorical values.
Dissimilarity measure: a combination of Euclidean distance (numeric attributes) and simple matching dissimilarity (categorical attributes)
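A sketch of the combined measure. The weight γ that balances the two parts is a standard K-prototypes parameter, though it is not named above, and squared Euclidean distance is used here as is conventional:

```python
def prototype_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Squared Euclidean distance on the numeric attributes plus a
    gamma-weighted simple matching dissimilarity on the categorical ones."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric + gamma * categorical
```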
Hierarchical Clustering
Agglomerative (bottom-up): start with each data point as its own cluster and successively merge clusters until only k clusters remain
Divisive (top-down): start with all data points in a single cluster and split them until we reach k clusters
Agglomerative Clustering
Start with each data point in its own cluster
Repeat until only one cluster remains:
Find the 2 clusters (C1, C2) that are most similar (that have the smallest pairwise cluster dissimilarity d(C1, C2))
Merge C1, C2 into one cluster

Single linkage: compare the closest points between clusters
Complete linkage: compare the farthest points between clusters
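The two linkage criteria differ only in which pairwise distance they report (a small NumPy sketch):

```python
import numpy as np

def single_linkage(A, B):
    """Distance between the closest pair of points, one from each cluster."""
    return min(float(np.linalg.norm(a - b)) for a in A for b in B)

def complete_linkage(A, B):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(float(np.linalg.norm(a - b)) for a in A for b in B)
```

Single linkage tends to produce elongated, chain-like clusters; complete linkage favors compact ones.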
Dendrogram Construction
Leaves
The leaves represent individual data points (each starting as its own cluster)
Vertical lines
Each vertical line represents a merger between clusters
The height of the vertical line is the distance (dissimilarity) between clusters being merged
Horizontal lines
connect clusters being merged
As you move up, the merged clusters become more dissimilar
DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm designed to identify clusters of varying shapes and sizes while distinguishing between noise points (outliers) and core points.
Epsilon (ε): The radius or distance within which points are considered to be neighbors
MinPts: The minimum number of points required to form a dense region (i.e., for a point to be considered a core point).
Core Point: A point is classified as a core point if at least a specified number of points (MinPts) are within a certain distance (ε).
Border Point: A point that is within ε distance of a core point but doesn’t have enough neighbors to qualify as a core point.
Noise (Outlier): A point that is neither a core point nor a border point; it lies in a region of low density.

Algorithm
Decide on the values of MinPts and ε
Classify each point as a core point or not
Classify the remaining points as border points or noise points.
For each unclustered core point:
Create a new cluster
Add all unclustered points that are density-connected to the current point to the cluster
For each unclustered border point, assign it to the cluster of the nearest core point
Leave all the noise points as they are.
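The steps above can be sketched as a brute-force implementation (quadratic in the number of points; real implementations use a spatial index for the neighborhood queries):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Return a cluster id per point, with -1 for noise."""
    X = np.asarray(X, float)
    n = len(X)
    # Neighborhoods: indices of points within eps of each point (incl. itself)
    neighbors = [np.flatnonzero(np.linalg.norm(X - p, axis=1) <= eps) for p in X]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = -1
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue  # skip non-core points and already-clustered core points
        cluster += 1
        labels[i] = cluster
        # Grow the cluster from this core point via density connectivity
        stack = [i]
        while stack:
            j = stack.pop()
            for q in neighbors[j]:
                if labels[q] == -1:
                    labels[q] = cluster       # border or core point joins
                    if core[q]:
                        stack.append(q)       # only core points expand further
    return labels
```

Border points are assigned to the cluster of the first core point that reaches them; unreachable low-density points keep the label -1 (noise).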
Large Scale Clustering
BFR Clustering Algorithm
The BFR algorithm is designed for clustering large datasets that are too big to fit into main memory, especially in high-dimensional spaces.
Key Assumptions:
Clusters are normally distributed around the centroid
The features are uncorrelated

Mahalanobis Distance
Unitless measure: normalizes distances based on variance instead of units
Accounts for correlation: considers covariance between variables, unlike Euclidean distance
Multivariate Generalization: extends the concept of standard scores to multiple dimensions
Unlike the standard Euclidean distance, which assumes features are independent and have equal variance, Mahalanobis distance considers correlations between features and unequal variances
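Concretely, $D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$, where $\mu$ is the mean vector and $\Sigma$ the covariance matrix; with an identity covariance it reduces to Euclidean distance. A NumPy sketch:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from a distribution with the given mean and covariance."""
    diff = np.asarray(x, float) - np.asarray(mean, float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```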
Applications
Outlier detection: identify observations that are unusually distant from the mean
Cluster analysis: assign points to clusters
Classification: in discriminant analysis, Mahalanobis distance is used to classify observations into predefined categories

Limitations
Assumes normality
Assumes no correlation between features

CURE (Clustering Using REpresentatives)
Take a small sample of the data and cluster it in main memory using Hierarchical Clustering
Select a small set of points from each cluster to be representative points. These points should be chosen to be as far from one another as possible
Move each of the representative points a fixed fraction of the distance between its location and the centroid of its cluster. 20% is usually a good fraction to pick.
Merge two clusters if they have a pair of representative points, one from each cluster, that are “sufficiently close”. This merging step can repeat, until there are no more sufficiently close clusters, or until you find a sufficiently small number of clusters.
Each point p is brought from secondary storage and compared with the representative points. We assign p to the cluster of the representative point that is closest to p.
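The shrink toward the centroid in step 3 is a one-liner (sketch; `alpha` is the shrink fraction, 0.2 per the text above):

```python
import numpy as np

def shrink_representatives(reps, centroid, alpha=0.2):
    """Move each representative point a fraction alpha of the way
    from its location toward the cluster centroid."""
    reps = np.asarray(reps, float)
    return reps + alpha * (np.asarray(centroid, float) - reps)
```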
Advantages
Robustness to Outliers:
Shrinking representative points towards the centroid reduces the influence of outliers.
Ability to Detect Arbitrary Shapes:
Multiple representative points capture the geometry of clusters better than a single centroid.
Scalability:
Sampling and partitioning allow CURE to handle large datasets efficiently.
Flexibility:
The number of representative points and the shrinkage factor α can be adjusted based on the dataset
Limitations
Parameter Sensitivity:
The performance depends on the choice of parameters like the number of representative points and the shrinkage factor.
Computational Complexity:
The hierarchical clustering of representative points can be computationally intensive for large numbers of clusters.
Sampling Bias:
If the sample is not representative of the entire dataset, cluster quality may be affected.