Q&A: Classification, Clustering, and ML Challenges
In this Q&A, we look at two key machine learning approaches -- what they are, how they’re used, and the challenges of implementing them -- with Naveed Ahmed Janvekar, a senior data scientist at Amazon.
- By Upside Staff
- February 22, 2022
Upside: What are classification and clustering techniques and how do they improve analytics?
Naveed Ahmed Janvekar: Broadly speaking, machine learning can be divided into three types -- supervised learning, unsupervised learning, and reinforcement learning. Classification is a type of supervised machine learning that separates data into different classes. The value of classification models is the accuracy with which they can separate data into various classes at scale. However, the accuracy of a classification model is contingent on the quality of the training data set and how well the algorithm fits that data set. One example of a classification problem is identifying an email as spam or not spam.
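To make the spam example concrete, here is a minimal sketch of such a classifier; scikit-learn and the tiny hand-made training set are assumptions chosen purely for illustration:

```python
# Minimal spam/not-spam classification sketch (scikit-learn and toy data assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny labeled training set: 1 = spam, 0 = not spam
emails = [
    "win a free prize now", "limited offer claim your reward",
    "meeting agenda for monday", "please review the attached report",
]
labels = [1, 1, 0, 0]

# Bag-of-words features feeding a logistic regression classifier
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(emails, labels)

print(model.predict(["claim your free reward"]))   # likely spam -> [1]
print(model.predict(["report for the meeting"]))   # likely not spam -> [0]
```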
Clustering, on the other hand, is a type of unsupervised learning that involves identifying groups within data, where data points within a given group are similar (related) to each other and different (unrelated) from data points in other groups. The attributes and features of data points help establish relatedness. For example, if we want to cluster customers on a platform and we have information such as income, education, and age of our customers, we can represent each customer as a data point in n-dimensional space and calculate how similar pairs of customers are based on their attributes. Customer1, Customer2, and Customer3 might be grouped into cluster 1 because they are all young, have high incomes, and hold Ph.D.s.
Clustering is valuable when no labeled data is available to train a supervised model, and it is a good first step in exploratory data analysis. Some applications of clustering models include customer segmentation and recommendation engines.
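Along those lines, here is a minimal customer-segmentation sketch using k-means; scikit-learn, the attribute columns, and the customer values are all illustrative assumptions:

```python
# Minimal customer-segmentation sketch with k-means (scikit-learn assumed;
# the customer attributes and values are hypothetical).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row is a customer: [age, income, years_of_education]
customers = np.array([
    [29, 120_000, 21],   # young, high income, Ph.D.-level education
    [31, 135_000, 22],
    [27, 110_000, 21],
    [55,  45_000, 12],
    [60,  40_000, 12],
    [58,  50_000, 14],
])

# Standardize so income doesn't dominate the distance calculation
X = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g., [0 0 0 1 1 1] -- two customer segments
```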
What are some widely used classification and clustering algorithms?
There are several types of classification models to choose from, depending on the training data set that is available and whether the data is linearly separable. Logistic regression, decision trees, random forest, and XGBoost are some of the most popular classification algorithms. Factors such as model evaluation metrics and inference time are used to decide the best classification model for a particular data set.
Similarly, for clustering, algorithms such as k-means, hierarchical clustering, and density-based spatial clustering of applications with noise (DBSCAN) are popular choices, depending on the available data set. Factors such as whether the number of clusters must be specified in advance, interpretability, and algorithm runtime are important in choosing one algorithm over another.
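As a rough illustration of one of those trade-offs, the sketch below contrasts k-means, which needs the cluster count up front, with DBSCAN, which infers it from density; scikit-learn and the synthetic data are assumptions:

```python
# Sketch contrasting k-means (cluster count required) with DBSCAN (count inferred
# from density); scikit-learn and the synthetic blobs are illustrative assumptions.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# k-means: the number of clusters is a required input
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# DBSCAN: clusters emerge from density; eps and min_samples are tuned instead
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(len(set(km_labels)))            # 3, as requested
print(len(set(db_labels) - {-1}))     # clusters found by DBSCAN (-1 marks noise)
```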
What are some challenges in implementing classification algorithms?
The first and foremost challenge is whether there is a high-quality training data set. By high quality, I mean is there enough labeled data from which the model can learn? Are there enough records in the data set across a sufficiently wide time range to help the model generalize and not overfit? Do we drop features with high missing percentages or impute them? What are the methods of imputation? How do you deal with class imbalance issues or model performance degradation over time?
There are multiple ways of dealing with all these issues: experiment with several strategies for each challenge and choose the ones that give the best performance metrics. For example, class imbalance can be addressed by undersampling, oversampling, or the synthetic minority over-sampling technique (SMOTE); model performance can then be evaluated to select the technique that best handles the imbalance.
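A minimal sketch of that kind of comparison, assuming the imbalanced-learn package and a synthetic data set, might look like this:

```python
# Sketch of handling class imbalance with random oversampling vs. SMOTE,
# assuming the imbalanced-learn package; the data is synthetic for illustration.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Roughly 95% majority class, 5% minority class
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))  # e.g., Counter({0: ~950, 1: ~50})

# Option 1: duplicate minority examples at random
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)

# Option 2: SMOTE synthesizes new minority examples between existing neighbors
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)

# Both are now balanced; train a model on each and compare evaluation metrics
print(Counter(y_ros), Counter(y_sm))
```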
Why is there a need to use unsupervised models such as clustering that may not have as good results as supervised models?
This is a great question. Ideally, if we have supervised models that can predict which class or category data points should belong to, we don’t really need an unsupervised mechanism. However, in the real world, that’s not always the case. Many times, we lack labeled data to train classification models, but we still need a way to segment the data or perform exploratory data analysis. That’s where clustering algorithms are useful. By grouping data points based on their proximity or distance to each other, these algorithms give us clusters that we can use to perform tasks such as customer segmentation and anomaly detection.
What are some challenges in implementing clustering solutions?
One such challenge is outliers, which can affect clustering results because they heavily influence the distance calculations. Data sparsity is another challenge; zeros and missing information hurt both computational efficiency and the distance calculations. Large data sets are difficult for some clustering algorithms (e.g., hierarchical clustering) because their runtime does not scale linearly as the amount of data increases. There is also what's known as the curse of dimensionality: as the number of dimensions in a data set grows, distances become less meaningful and clustering results can be suboptimal.
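One common mitigation, not prescribed in the interview but worth sketching, is to apply robust scaling to dampen outlier influence and dimensionality reduction (e.g., PCA) before clustering; scikit-learn and the synthetic data are assumptions:

```python
# Mitigation sketch: robust scaling to reduce outlier influence and PCA to shrink
# dimensionality before k-means (scikit-learn and synthetic data assumed).
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))          # 50-dimensional data
X[:5] *= 100                            # a handful of extreme outliers

X_scaled = RobustScaler().fit_transform(X)                  # median/IQR scaling resists outliers
X_reduced = PCA(n_components=10).fit_transform(X_scaled)    # reduce 50 dimensions to 10

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:10])
```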
What are some metrics that can be used to evaluate classification and clustering models?
Choosing the right set of metrics for classification and clustering largely depends on the problem statement for which these algorithms are being used. For classification algorithms, evaluation metrics such as precision, recall, F1 score, accuracy, PR AUC, and ROC AUC are used, based on whether it is more important to correctly predict true positives, true negatives, or both. For clustering, there are metrics such as the silhouette coefficient, which gives a sense of how cohesive each cluster is and how well separated the clusters are from one another. In addition, cluster-level metrics aggregated from the underlying data can help characterize the generated clusters and identify clusters of interest.
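A quick sketch of computing those metrics, assuming scikit-learn and small illustrative arrays:

```python
# Sketch of the evaluation metrics mentioned above (scikit-learn assumed).
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, roc_auc_score, silhouette_score)

# Classification: true labels vs. predicted labels / scores
y_true  = [0, 0, 1, 1, 1, 0]
y_pred  = [0, 1, 1, 1, 0, 0]
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1]

print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred), accuracy_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))

# Clustering: silhouette coefficient over the data and its cluster labels
X = np.array([[1, 1], [1.2, 0.9], [0.9, 1.1], [8, 8], [8.2, 7.9], [7.9, 8.1]])
labels = [0, 0, 0, 1, 1, 1]
print(silhouette_score(X, labels))  # close to 1 -> compact, well-separated clusters
```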
Is there a need to unbox ML model outputs and provide interpretability?
There is significant effort and research going into ML/AI explainability. I believe the main motivation is to understand the factors affecting the final output of machine learning models. For example, if a loan application is rejected due to an ML model, you need an explanation of the factors that led to the rejection so these findings can be communicated to various stakeholders, or to customers so they can work towards filling any gaps. It also helps in identifying possible bias that an ML model may have. Techniques such as LIME, SHAP, and GNNExplainer currently exist to help unbox ML model outputs.
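As a hedged example of that last point, here is a minimal SHAP sketch on a toy loan-style model; the shap package, scikit-learn, and the synthetic features are all assumptions:

```python
# Minimal sketch of unboxing a model's output with SHAP (the shap package,
# scikit-learn, and the toy loan-style features are assumptions).
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                          # e.g., income, debt, age, tenure
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)     # toy risk score

model = RandomForestRegressor(random_state=0).fit(X, y)

# TreeExplainer attributes each prediction to the individual input features
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])

# Per-feature contributions for the first applicant; large positive or negative
# values point to the factors driving that prediction
print(np.round(shap_values[0], 3))
```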
What skills do you think aspiring data science professionals should acquire to gain an edge in the industry? Do you have any suggestions about any advanced topics?
My suggestion would be to get a good understanding of both the breadth and the depth of data science. Knowing the mathematical details of how various algorithms work, as well as which algorithms work well in which problem space and why, is also important. Taking part in data science challenges online and creating an online project portfolio will give a good boost to your overall skill set.
In terms of programming, I'd suggest Python and R for data exploration and model building, and SQL or Hive for data extraction and manipulation. Shell scripting skills can come in handy during the model deployment stage. Beyond traditional machine learning, advanced topics such as graph modeling, active learning, and BERT for NLP are good skills to add.
[Editor’s note: Naveed Ahmed Janvekar is a senior data scientist at Amazon where he is responsible for fraud and abuse prevention using machine learning. You can reach him via email. The opinions expressed in this article are solely those of Mr. Janvekar and do not express the views or opinions of his employer.]