Data Clustering Algorithms


K-means Algorithm

  1. Initialize: Randomly choose K initial cluster centroids

  2. Assign clusters: Assign each data point to the nearest centroid based on Euclidean distance

  3. Update centroids: Calculate new centroids by averaging the points assigned to each cluster

  4. Repeat steps 2-3 until the centroids no longer change (convergence)
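The four steps above can be sketched directly in NumPy. This is a minimal illustration (the `kmeans` helper and toy dataset are for demonstration, not an optimized implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: random init, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: randomly choose K initial centroids from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign: each point goes to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs: K-means with k=2 recovers them
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans(X, k=2)
```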

  • Centroid: the center of a cluster; for a given cluster, the centroid is the mean of all the data points assigned to it

  • Limitations

    • Choice of k

    • Sensitivity to initialization: the final result depends on the initial centroid placement

    • Only works with numeric data

    • Assumes spherical clusters

    • Sensitivity to outliers

  • Choice of k

    • The elbow method: plot the sum of squared errors (SSE, also called inertia) against different values of K and look for an elbow point, where SSE begins to decrease at a slower rate

    • SSE: the sum of squared distances between each point and the centroid of the cluster to which it is assigned: $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} ||x - \mu_i||^2$
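A minimal sketch of the SSE computation, assuming the assignments and centroids are already known (the `sse` helper is illustrative); in practice you would run K-means for several values of K and plot the resulting SSEs to find the elbow:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(float(np.sum((X[labels == j] - c) ** 2))
               for j, c in enumerate(centroids))

# Two clusters on a line, centroids at 1 and 11:
# every point is distance 1 from its centroid, so SSE = 4
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0], [11.0]])
```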

  • Silhouette Analysis

    • Measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation): $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$

    • $a(i)$ is the average distance between i and all other points in the same cluster (cohesion)

    • $b(i)$ is the smallest average distance from i to all points in any cluster that i is not assigned to, i.e., the mean distance to the nearest neighboring cluster (separation)
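The coefficient can be computed per point; a small NumPy sketch (the `silhouette` helper and toy data are illustrative):

```python
import numpy as np

def silhouette(X, labels, i):
    """Silhouette coefficient s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    d = np.linalg.norm(X - X[i], axis=1)      # distances from point i to all points
    same = labels == labels[i]
    # a(i): mean distance to the *other* points of i's own cluster (cohesion)
    a = d[same].sum() / (same.sum() - 1)      # d[i] = 0, so the sum skips i itself
    # b(i): smallest mean distance to the points of any other cluster (separation)
    b = min(d[labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
    return (b - a) / max(a, b)

# Two tight clusters on a line: point 0 is close to its own cluster, far from the other
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
s0 = silhouette(X, labels, 0)   # a(0) = 1, b(0) = 10.5
```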

K-Modes Algorithm

  • Because K-means cannot handle categorical data, K-modes was developed for it

  • Uses a different similarity measure (simple matching dissimilarity instead of Euclidean distance)

  • Updates cluster centroids using the mode instead of the mean

  1. Initialization: randomly choose k initial cluster modes

  2. Assignment: for each data point $x_j$, calculate the dissimilarity to each cluster mode using the simple matching dissimilarity. Assign $x_j$ to the cluster with the smallest dissimilarity

  3. Update: update each cluster's mode by taking the most frequent value of each attribute

  4. Repeat steps 2-3 until assignments no longer change
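The two K-modes building blocks (matching dissimilarity for step 2, the per-attribute mode for step 3) are easy to sketch in plain Python (helper names are illustrative):

```python
from collections import Counter

def matching_dissimilarity(x, mode):
    """Simple matching dissimilarity: the number of attributes whose values differ."""
    return sum(a != b for a, b in zip(x, mode))

def update_mode(points):
    """New mode of a cluster: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*points))
```

For example, `("red", "S", "cotton")` and `("red", "M", "cotton")` differ in exactly one attribute, so their dissimilarity is 1.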

K-Prototype Algorithm

  1. Initialization: select k initial prototypes (centroids) for the clusters. These prototypes contain both numeric and categorical values.

  2. Dissimilarity Measure: a combination of Euclidean distance (for the numeric attributes) and simple matching dissimilarity (for the categorical attributes)
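A sketch of the combined measure, assuming squared Euclidean distance for the numeric part and a weight `gamma` on the categorical part, as in Huang's K-prototypes formulation (`gamma` is an assumption; these notes do not specify a weight):

```python
import numpy as np

def prototype_dissimilarity(x_num, x_cat, p_num, p_cat, gamma=1.0):
    """Squared Euclidean distance on numeric attributes plus gamma times
    the simple matching dissimilarity on categorical attributes."""
    numeric = float(np.sum((np.asarray(x_num) - np.asarray(p_num)) ** 2))
    categorical = sum(a != b for a, b in zip(x_cat, p_cat))
    return numeric + gamma * categorical

# Numeric part: (2 - 4)^2 = 4; categorical part: 1 mismatch, weighted by gamma = 0.5
d = prototype_dissimilarity([1.0, 2.0], ("red",), [1.0, 4.0], ("blue",), gamma=0.5)
```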

Hierarchical Clustering

  • Agglomerative (bottom-up): start with each data point as its own cluster and successively merge clusters until only k clusters remain

  • Divisive (top-down): start with all data points in a single cluster and split until we reach k clusters

Agglomerative Clustering

  • Start with all data points in their own clusters

  • Repeat until only one cluster remains:

    • Find 2 clusters (C1, C2) that are most similar (that have the smallest pairwise cluster dissimilarity d(C1, C2) )

    • Merge C1, C2 into one cluster

    • Single linkage: compare the closest points between clusters
    • Complete linkage: compare the farthest points between clusters
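The two pairwise cluster dissimilarities can be sketched as follows (helper names are illustrative):

```python
import numpy as np

def single_linkage(C1, C2):
    """Distance between the closest pair of points, one from each cluster."""
    return min(float(np.linalg.norm(p - q)) for p in C1 for q in C2)

def complete_linkage(C1, C2):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(float(np.linalg.norm(p - q)) for p in C1 for q in C2)

# On a line: closest pair is (1, 3), farthest pair is (0, 5)
C1 = np.array([[0.0], [1.0]])
C2 = np.array([[3.0], [5.0]])
```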

Dendrogram Construction

  1. Leaves

    1. The leaves represent individual data points/clusters

  2. Vertical lines

    1. Each vertical line represents a merger between clusters

    2. The height of the vertical line is the distance (dissimilarity) between clusters being merged

  3. Horizontal lines

    1. Connect the clusters being merged

    2. The higher up a merge occurs, the more dissimilar the merged clusters are

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm designed to identify clusters of varying shapes and sizes while distinguishing between noise points (outliers) and core points.

  • Epsilon (ε): The radius or distance within which points are considered to be neighbors

  • MinPts: The minimum number of points required to form a dense region (i.e., for a point to be considered a core point).

  • Core Point: A point is classified as a core point if at least a specified number of points (MinPts) are within a certain distance (ε).

  • Border Point: A point that is within ε distance of a core point but doesn’t have enough neighbors to qualify as a core point.

  • Noise (Outlier): A point that is neither a core point nor a border point; it lies in a region of low density.

Algorithm

  1. Decide on the values of MinPts and ε

  2. Identify each point as either a core point or not

  3. Classify the remaining points as border points or noise points.

  4. For all of the unclustered core points:

    1. Create a new cluster

    2. Add all the points that are unclustered and density connected to the current point in the cluster

  5. For each unclustered border point, assign it to the cluster of the nearest core point

  6. Leave all the noise points as they are.
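Steps 2-3 (labeling every point core, border, or noise) can be sketched in NumPy. The `classify_points` helper is illustrative and assumes a point counts itself among its ε-neighbors, a convention that varies between implementations:

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (steps 2-3 above).
    A point counts itself among its eps-neighbors here."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps
    core = neighbors.sum(axis=1) >= min_pts
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif neighbors[i][core].any():   # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Four dense points, one point on the fringe, one far-away outlier
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [0.5, 0.5], [1.4, 0.0], [5.0, 5.0]])
kinds = classify_points(X, eps=1.0, min_pts=3)
```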

Large Scale Clustering

BFR Clustering Algorithm

The BFR algorithm is designed for clustering large datasets that are too big to fit into main memory, especially in high-dimensional spaces.

Key Assumptions:

  • Clusters are normally distributed around the centroid

  • The features are uncorrelated

Mahalanobis Distance

  • Unitless measure: normalizes distances based on variance instead of units

  • Accounts for correlation: considers covariance between variables, unlike Euclidean distance

  • Multivariate Generalization: extends the concept of standard scores to multiple dimensions

    • Unlike the standard Euclidean distance, which assumes features are independent and have equal variance, Mahalanobis distance considers correlations between features and unequal variances
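A minimal NumPy sketch of the distance (the `mahalanobis` helper is illustrative); with an identity covariance matrix it reduces to the ordinary Euclidean distance:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Identity covariance: plain Euclidean distance, sqrt(3^2 + 4^2) = 5
d_euclid = mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2))

# Larger variance along the first axis shrinks that axis's contribution
d_scaled = mahalanobis([3.0, 4.0], [0.0, 0.0], np.diag([4.0, 1.0]))
```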

Application

  • Outlier Detection: identify observations that are unusually distant from the mean

  • Cluster Analysis: assign points to clusters

  • Classification: in discriminant analysis, Mahalanobis distance is used to classify observations into predefined categories

Limitations of BFR

  • Assumes normality

  • Assumes no correlation between features

CURE (Clustering Using REpresentatives)

CURE addresses these BFR limitations: it does not assume normally distributed clusters and can detect clusters of arbitrary shape.

  1. Take a small sample of the data and cluster it in main memory using Hierarchical Clustering

  2. Select a small set of points from each cluster to be representative points. These points should be chosen to be as far from one another as possible

  3. Move each of the representative points a fixed fraction of the distance between its location and the centroid of its cluster. 20% is usually a good fraction to pick.

  4. Merge two clusters if they have a pair of representative points, one from each cluster, that are “sufficiently close”. This merging step can repeat, until there are no more sufficiently close clusters, or until you find a sufficiently small number of clusters.

  5. Each point p is brought from secondary storage and compared with the representative points. We assign p to the cluster of the representative point that is closest to p.
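Step 3, shrinking the representatives toward the centroid, can be sketched as follows (the helper is illustrative; the cluster centroid is assumed to be precomputed, and alpha = 0.2 matches the 20% fraction suggested above):

```python
import numpy as np

def shrink_representatives(reps, centroid, alpha=0.2):
    """Move each representative point a fraction alpha of the way
    from its location toward the cluster centroid."""
    reps = np.asarray(reps, dtype=float)
    return reps + alpha * (np.asarray(centroid, dtype=float) - reps)

# Representatives at 0 and 10 with centroid at 5 each move 20% inward
shrunk = shrink_representatives([[0.0, 0.0], [10.0, 0.0]], [5.0, 0.0])
```

This shrinking is what gives CURE its robustness to outliers: an outlying representative is pulled back toward the body of the cluster.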

Advantages

  1. Robustness to Outliers:

    1. Shrinking representative points towards the centroid reduces the influence of outliers.

  2. Ability to Detect Arbitrary Shapes:

    1. Multiple representative points capture the geometry of clusters better than a single centroid.

  3. Scalability:

    1. Sampling and partitioning allow CURE to handle large datasets efficiently.

  4. Flexibility:

    1. The number of representative points and the shrinkage factor α can be adjusted based on the dataset

Limitations

  1. Parameter Sensitivity:

    1. The performance depends on the choice of parameters like the number of representative points and the shrinkage factor.

  2. Computational Complexity:

    1. The hierarchical clustering of representative points can be computationally intensive for large numbers of clusters.

  3. Sampling Bias:

    1. If the sample is not representative of the entire dataset, cluster quality may be affected.
