TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing


TDWI Articles

Q&A: Classification, Clustering, and ML Challenges

In this Q&A, we look at two key machine learning approaches -- what they are, how they’re used, and the challenges of implementing them -- with Naveed Ahmed Janvekar, a senior data scientist at Amazon.

By Upside Staff
February 22, 2022

Upside: What are classification and clustering techniques and how do they improve analytics?

Naveed Ahmed Janvekar: Broadly speaking, machine learning can be divided into three types -- supervised learning, unsupervised learning, and reinforcement learning. Classification is a type of supervised machine learning that separates data into different classes. The value of classification models is the accuracy with which they can separate data into various classes at scale. However, the accuracy of a classification model is contingent on the quality of the training data set and how well the algorithm fits that data set. One example of a classification problem is identifying an email as spam or not spam.

Clustering, on the other hand, is a type of unsupervised learning that involves identifying groups within data, where the data points within a given group are similar (related) to one another and different from (unrelated to) data points in other groups. The attributes and features of data points help establish relatedness. For example, if we want to cluster customers on a platform and we have information such as income, education, and age of our customers, we can represent each of these customers as a data point in n-dimensional space and calculate how similar pairs of customers are based on their attributes. Customer1, Customer2, and Customer3 can be grouped in cluster 1 because they are all young, have high incomes, and hold Ph.D.s.
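As a rough illustration of the customer example, the sketch below represents each customer as a point and measures how far apart pairs of customers are. The customer values, the choice of min-max scaling, and the use of Euclidean distance are all assumptions made for illustration; the interview does not specify them.

```python
import math

# Hypothetical customers: (age in years, income in thousands, years of education).
customers = {
    "Customer1": (27, 150, 21),
    "Customer2": (29, 160, 22),
    "Customer3": (26, 155, 21),
    "Customer4": (58, 40, 12),
}

def min_max_scale(rows):
    """Rescale each attribute to [0, 1] so no single unit dominates the distance."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def euclidean(a, b):
    """Straight-line distance between two points in n-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

names = list(customers)
scaled = dict(zip(names, min_max_scale(list(customers.values()))))

d_12 = euclidean(scaled["Customer1"], scaled["Customer2"])
d_14 = euclidean(scaled["Customer1"], scaled["Customer4"])
print(f"Customer1 vs Customer2: {d_12:.3f}")
print(f"Customer1 vs Customer4: {d_14:.3f}")
```

A smaller distance means the two customers are more alike, so Customer1 and Customer2 are candidates for the same cluster while Customer4 is not.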

Clustering is valuable when no labeled data is available to train a supervised model. It can be considered a good first step in exploratory data analysis. Some applications of clustering models include customer segmentation and recommendation engines.

What are some widely used classification and clustering algorithms?

There are several types of classification models that can be used based on the type of training data set that is available and whether the data can be linearly separated or not. Logistic regression, decision trees, random forest, and XGBoost are some of the more popularly used classification algorithms. Factors such as model evaluation metrics and inference time are used in deciding the best classification model for a particular data set.

Similarly, for clustering, algorithms such as k-means, hierarchical clustering, and density-based spatial clustering of applications with noise (DBSCAN) are popular, depending on the available data set. Factors such as whether the number of clusters must be determined in advance, interpretability, and algorithm runtime are important in choosing one algorithm over another.
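To make the point about choosing the number of clusters in advance concrete, here is a minimal k-means sketch in plain Python. It is an illustration under stated assumptions, not a production implementation: the sample points are made up, and the centroids are seeded from the first k points for reproducibility, whereas real libraries typically use random or k-means++ initialization.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm): returns (centroids, labels)."""
    # Assumption for this sketch: seed centroids with the first k points.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Note that `k` has to be supplied by the caller. Density-based methods such as DBSCAN do not take a cluster count, which is one reason the choice of algorithm depends on what is known about the data beforehand.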

What are some challenges in implementing classification algorithms?

The first and foremost challenge is whether there is a high-quality training data set. By high quality, I mean is there enough labeled data from which the model can learn? Are there enough records in the data set across a sufficiently wide time range to help the model generalize and not overfit? Do we drop features with high missing percentages or impute them? What are the methods of imputation? How do you deal with class imbalance issues or model performance degradation over time?

There are multiple ways of dealing with these issues: experiment with various strategies for each challenge and choose the ones that give the best performance metrics. For example, class imbalance can be addressed by undersampling, oversampling, or the synthetic minority over-sampling technique (SMOTE); model performance can then be evaluated under each technique to select the best one for handling the class imbalance.
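The two simplest resampling strategies mentioned above can be sketched as follows. This is an illustrative version using only the standard library, with made-up rows and labels; it does not implement SMOTE, which creates new synthetic minority points by interpolating between neighbors rather than duplicating existing rows.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - n)  # sample with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        kept = rng.sample(pool, target)  # sample without replacement
        out_rows.extend(kept)
        out_labels.extend([cls] * target)
    return out_rows, out_labels

# Hypothetical imbalanced data: 8 legitimate emails, 2 spam emails.
rows = [(i,) for i in range(10)]
labels = ["legit"] * 8 + ["spam"] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

Resampling is normally applied to the training split only, so that the evaluation data still reflects the real class proportions when the techniques are compared.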

Why is there a need to use unsupervised models such as clustering that may not have as good results as supervised models?

This is a great question. Ideally, if we have supervised models that can predict which class or category data points should belong to, we don’t really need an unsupervised mechanism. However, in the real world, that’s not always the case. Many times, we lack labeled data to train classification models, but we still need a way to segment the data or perform exploratory data analysis. That’s where clustering algorithms are useful. By grouping data points based on their proximity or distance to each other, these algorithms give us clusters that we can use to perform tasks such as customer segmentation and anomaly detection.

What are some challenges in implementing clustering solutions?

One such challenge is outliers, which can affect clustering results because they heavily influence the distance calculations. Data sparsity is another challenge: zeros and missing information affect both computational efficiency and the distance calculations. Large data sets are difficult for some clustering algorithms (e.g., hierarchical clustering) because their runtime does not scale linearly as the amount of data increases. There is also what’s known as the curse of dimensionality, in which clustering results become suboptimal as the number of dimensions in a data set increases.
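Two of these challenges are easy to see with small numbers. The sketch below, using made-up points, shows how a single outlier drags a cluster's centroid away from its genuine members, and how the number of pairwise distances (which agglomerative hierarchical clustering typically works from) grows much faster than the number of records.

```python
import math

def centroid(points):
    """Mean position of a set of points, one coordinate at a time."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def pairwise_distance_count(n):
    """Number of distinct point pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Hypothetical tight cluster plus one extreme value.
cluster = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.0)]
outlier = (50.0, 50.0)

clean_center = centroid(cluster)
skewed_center = centroid(cluster + [outlier])
shift = math.dist(clean_center, skewed_center)
print(f"centroid moved by {shift:.2f} units because of one outlier")

for n in (1_000, 10_000):
    print(n, "records ->", pairwise_distance_count(n), "pairwise distances")
```

Ten times more records means roughly a hundred times more pairs, which is why such algorithms are said not to scale linearly.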

What are some metrics that can be used to evaluate classification and clustering models?

Choosing the right set of metrics for classification and clustering largely depends on the problem statement for which these algorithms are being used. For classification algorithms, evaluation metrics such as precision, recall, F1 score, accuracy, PR AUC, and ROC AUC are used based on whether it is more important to correctly predict true positives, true negatives, or both. For clustering, there are metrics such as the silhouette coefficient, which compares how close each data point is to its own cluster with how close it is to the nearest other cluster, giving a sense of how well separated the clusters are. In addition, cluster-level metrics aggregated from the underlying data can help describe the generated clusters and identify clusters of interest.
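For readers who want to see how a few of these metrics are computed, here is a small sketch in plain Python. The labels, predictions, and points are invented for illustration, and the silhouette function assumes Euclidean distance and at least two clusters; in practice a library implementation would normally be used.

```python
import math

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (assumes two or more clusters)."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # a single-point cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance to the rest of its own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

# Hypothetical classifier output: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)

# Hypothetical clustering result with two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
cluster_labels = [0, 0, 0, 1, 1, 1]
score = silhouette_score(points, cluster_labels)
print(f"silhouette: {score:.3f}")
```

A silhouette value close to 1 indicates tight, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.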

Is there a need to unbox ML model outputs and provide interpretability?

There is significant effort and research going into ML/AI explainability. I believe the main motivation is to understand the factors affecting the final output from machine learning models. For example, if a loan application is rejected because of an ML model’s output, you need an explanation of the factors that led to the rejection so these findings can be communicated to various stakeholders, or to customers so they can work towards filling any gaps. Explainability also helps in identifying possible bias that an ML model may have. Techniques such as LIME, SHAP, and GNNExplainer currently exist to help unbox ML model outputs.

What skills do you think aspiring data science professionals should acquire to gain an edge in the industry? Do you have any suggestions about any advanced topics?

My suggestion would be to gain a good understanding of data science in both breadth and depth. Knowing the mathematical details of how various algorithms work, which algorithms work well in which problem space, and why is also important. Taking part in various data science challenges online and creating an online project portfolio will give a good boost to your overall skill set.

In terms of programming, I’d suggest Python and R for data exploration and model building, and SQL or Hive for data extraction and manipulation. Shell scripting skills can come in handy during the model deployment stage. Apart from traditional machine learning, advanced topics such as graph modeling, active learning, and BERT for NLP are good areas to add to your skill set.

[Editor’s note: Naveed Ahmed Janvekar is a senior data scientist at Amazon where he is responsible for fraud and abuse prevention using machine learning. You can reach him via email. The opinions expressed in this article are solely those of Mr. Janvekar and do not express the views or opinions of his employer.]
