Constellation, Meteor, or Just Another Plane?
Our brains have the natural ability to see clusters of like objects and identify outliers. With unsupervised learning algorithms, computers can start doing the same.
- By Troy Hiltbrand
- September 8, 2020
In July, stargazers in the Northern Hemisphere left the comfort of their homes to peer into the heavens, hoping to see a phenomenon that won't be visible again for another 6,800 years -- Comet NEOWISE. Armed with binoculars, telescopes, or just the naked eye, they looked skyward. The first thing they probably saw was a sky full of tiny stars. As they continued to stare upward, their eyes naturally started to pick out groupings of stars that almost came alive. They might have intentionally or unintentionally caught sight of the same groupings of stars that Ptolemy, the ancient Greek astronomer, saw as he defined the constellations that we know today.
If they followed the handle of the Big Dipper down to its cup and dropped toward the horizon, they might have seen an object that looked a little different from the stars around it and appeared to have a tail of light. If they did, they happily caught a glimpse of the rare Comet NEOWISE.
At the same time, many of them saw other lights that moved too quickly or flashed, and realized they were looking at objects that were not celestial in origin but were just airplanes or satellites crossing the sky.
This process is similar to what computers do every day through a form of unsupervised learning called clustering, and the results are often the same. First, like the constellations, clustering identifies groupings of data that have meaning. These groups might represent customers with similar demographics, orders with similar characteristics, or homogeneous network traffic. Identifying a cluster is like seeing a group of related stars. Naming that cluster and giving it a business definition is like Ptolemy naming the constellations and associating them with figures from Greek mythology that were familiar to the people of his time. With these names, they were able to see hunters, queens, dragons, and a myriad of other creatures in the sky to which they could relate. When you give a cluster a name and a personification, it takes on meaning and value to the business.
Occasionally, you find instances that fall outside those clusters, and you must determine whether you are seeing something truly important, like the comet, or merely noise, like the airplanes that provide no insight into your celestial pursuits. With business data, these comet-like outliers might represent instances of fraud that, if caught early, will mean real savings for your company. They might also represent that unique candidate whose skills and experience will take your team to a whole new level and who stands apart from the rest of the candidate pool.
How do you start identifying the clusters of meaning in your data and find those points that fall outside those groupings? Your first step is to understand what tools and algorithms are available to process your data and tease out those patterns and isolate your outliers.
Among the algorithms that power unsupervised learning, there are two main methods of creating these clusters of similarity: k-means clustering and DBSCAN.
K-means clustering, or its close cousins k-medians and k-medoids, takes the data set and a predefined number of clusters. The clusters start with randomized centroids scattered throughout the distribution of data. With each iteration, the distance between the points and the centroid is measured. The cluster boundaries are re-evaluated to position like points closer together in the same cluster.
Once these boundaries are reassessed, the centroid is shifted to reflect the center point of the new group. Over multiple iterations, this centering process stabilizes with the centroid moving less and less. Once this happens, each cluster is defined as a centroid with a defined distance to the outermost points.
K-means clustering is sensitive to outliers in the training data and these can skew the definition of your clusters. Once the clusters are defined and applied to new data sets, outliers can be identified.
Another popular clustering algorithm is DBSCAN, or density-based spatial clustering of applications with noise. One of the biggest differences between the two is that with DBSCAN you do not have to know beforehand how many clusters exist, as you do with k-means. DBSCAN looks for points in your data that are naturally grouped closely together -- what is termed data density. If a point has a minimum number of neighbors within a given radius, that grouping is considered a cluster.
Points outside of the density-based clusters are viewed as outliers or noise. DBSCAN is less susceptible to outliers in the training data but does not work well with sparse data because it relies on patterns where there are multiple instances grouped closely together.
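A minimal sketch of this behavior, again with scikit-learn; the `eps` (neighborhood radius) and `min_samples` values are assumptions you would tune for your own data set:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same idea: two dense groups and one isolated point
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
                   [20.0, 0.0]])

# No cluster count is supplied; density alone decides the grouping
db = DBSCAN(eps=1.0, min_samples=3).fit(points)

# DBSCAN labels noise points -1 rather than forcing them into a cluster;
# the isolated point at [20, 0] comes back as -1
print(db.labels_)
```

Unlike k-means, the outlier is called out directly in the labels, so no separate distance check is needed.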
Other Clustering Algorithms
These two algorithms make up a large percentage of the clustering processes of unsupervised learning projects. They are, however, not the only options available and other algorithms are being invented to work in scenarios where k-means and DBSCAN are not optimal.
Agglomerative clustering is a bottom-up hierarchical approach to clustering. It starts by treating each instance as its own cluster, first grouping like instances into pairs, then repeatedly merging the most similar groups into larger groups. This continues until the remaining groups are no longer sufficiently similar to merge (or a target number of clusters is reached), leaving you with your final clusters.
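One way to sketch this bottom-up merging in scikit-learn is to set a distance threshold instead of a cluster count, so merging stops once groups are no longer close enough; the data, the threshold of 5.0, and the single-linkage choice are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
                   [20.0, 0.0]])

# Merge the closest clusters bottom-up; stop when the next merge would
# join clusters farther apart than distance_threshold (an assumed value)
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                              linkage="single").fit(points)
print(agg.n_clusters_, agg.labels_)
```

Here the two tight groups each merge internally, while the isolated point never gets close enough to join either one, so it ends up as a cluster of one.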
Another algorithm that is similar to k-means is the mean-shift algorithm. It starts by placing a circle around each of the data points. It computes the mean of all the instances within that circle. It then shifts the circle so that it is centered on that new mean. As the means and circles stabilize, those instances whose circles have a mean that is the same or close enough are grouped as a cluster. Like DBSCAN, it does not require a predefined number of clusters and can assess the clusters based on the data.
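The circle described above corresponds to the bandwidth parameter in scikit-learn's MeanShift; as a sketch, with an assumed bandwidth of 3.0 and invented data:

```python
import numpy as np
from sklearn.cluster import MeanShift

points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
                   [20.0, 0.0]])

# bandwidth is the radius of the shifting circle (an assumed value);
# the number of clusters falls out of where the circles settle
ms = MeanShift(bandwidth=3.0).fit(points)
print(len(ms.cluster_centers_), ms.labels_)
```

No cluster count was specified, yet the circles settle on three centers: one per dense group, plus one for the isolated point.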
Affinity propagation uses a voting system. Each instance votes for similar instances to be its representative, exchanging messages with the other instances to identify which one best represents the cluster. As the cluster definition evolves, the ideal representative instance changes. This algorithm does not require you to know the number of clusters in advance.
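A sketch with scikit-learn's AffinityPropagation on the same invented data; note that the elected representatives (exemplars) are actual instances from the data set, surfaced through `cluster_centers_indices_`:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
                   [20.0, 0.0]])

# Message passing elects exemplar points; no cluster count is supplied
ap = AffinityPropagation(random_state=0).fit(points)

# Each exemplar is a real data point, referenced by its row index
print(ap.cluster_centers_indices_, ap.labels_)
```

This is the key contrast with k-means: a k-means centroid is a synthetic average, while an affinity propagation exemplar is one of your actual instances.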
This is only a sampling of the algorithms available today for identifying clusters within your data. Some identify clusters organically; others start with a known number of clusters and work toward grouping the data into that predefined number.
With cluster definitions in hand, you can lay that model on top of other data sets and start to isolate the instances that fall between the clusters. With these instances, you can assess why they do not fit the norm, whether they are valuable outliers or noise, and what to do with them.
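As a sketch of that last step, here is one way to lay a fitted model over new data and flag instances that sit far from every cluster; the training data, the new instances, and the distance threshold of 3.0 are all assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Fit clusters on training data (values invented for illustration)
train = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)

# Lay the model over new data: assign each instance to its nearest
# cluster, then flag any instance that sits far from every centroid
new = np.array([[1.1, 1.0], [8.0, 8.1], [15.0, 2.0]])
assigned = km.predict(new)
dist_to_nearest = km.transform(new).min(axis=1)

threshold = 3.0  # an assumed cutoff; tune it to your own data
flags = dist_to_nearest > threshold
print(assigned, flags)
```

The flagged instance is exactly the kind of between-cluster point the article describes: it gets assigned somewhere, but the distance tells you it does not really belong anywhere.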
Just as those stargazers strove to bring order to the July night sky by finding patterns and outliers to those patterns, your systems will be able to bring order to your data. You will be able to take a proverbial sky full of data points that represent your customers, their activity, or your employees and find truly profound meaning that will drive business decisions. If you are lucky, you will also find a high-value meaningful outlier among your data and be able to discard instances of distracting noise. Good luck in your data gazing.