TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Supervised vs. Unsupervised Learning (Part 3 in a Series)

Will a new website visitor be a good customer? We look at two approaches to creating an analytical model that can help you answer that question.

By David Loshin
March 17, 2016

Analytical models apply the core machine-learning building blocks I described in my previous article. Using the same example, an electronic commerce company might want to integrate a recommendation engine into its interactive website to present product suggestions to visitors. Those suggestions would be chosen based on presumptions drawn from the visitor's profile, its similarity to other customer profiles, and the types of products that people with those similar profiles buy.

That recommendation-engine model is developed through a sequence of machine-learning methods performing a process of collaborative filtering, which looks for patterns of product affinities among clusters of similar individuals. There are two approaches that can be used to create this analytical model.

The first approach organizes customers into groups based on predefined parameters and expectations. For example, the analyst might define four tiers of customers ("great," "good," "okay," and "undesired"), where each tier is defined by specific sets of attribute values (such as sales volume, number of items purchased, and frequency of purchases). Alternatively, some examples of each type of customer can be set aside and subjected to a classification method to see what similarities the members of each sample group share. After those similarities have been identified, they can be used to form a segmentation model for new website visitors.

As a new visitor registers and provides information, that visitor's attributes are compared against the different models and the new visitor is assigned to one of the classes. This helps to configure the parameters of the recommendation model for that visitor so that it selects and promotes products that are commonly purchased by others within the same class.
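The supervised flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular tool: it assumes a nearest-centroid classifier as the classification method, and the profile attributes (age, income in thousands), the training examples, the product lists, and the function names `fit_centroids`, `classify`, and `recommend` are all hypothetical.

```python
from collections import Counter, defaultdict
from math import dist

# Hypothetical labeled examples: (age, income in thousands) -> analyst-defined tier.
# All values are made up for illustration.
TRAINING = [
    ((52, 150), "great"), ((48, 130), "great"),
    ((35, 80), "good"), ((40, 90), "good"),
    ((22, 30), "okay"), ((26, 40), "okay"),
]

# Hypothetical purchase histories pooled by tier.
PURCHASES = {
    "great": ["espresso machine", "grinder", "espresso machine",
              "grinder", "espresso machine", "kettle"],
    "good": ["kettle", "kettle", "kettle", "mug", "mug", "grinder"],
    "okay": ["mug", "mug", "filter papers"],
}

def fit_centroids(examples):
    """Average the attribute vectors of the labeled examples in each tier."""
    grouped = defaultdict(list)
    for features, tier in examples:
        grouped[tier].append(features)
    return {
        tier: tuple(sum(column) / len(rows) for column in zip(*rows))
        for tier, rows in grouped.items()
    }

def classify(features, centroids):
    """Assign a new visitor to the tier whose centroid is closest."""
    # A real model would normally rescale attributes first so that
    # no single attribute dominates the distance.
    return min(centroids, key=lambda tier: dist(features, centroids[tier]))

def recommend(tier, purchases_by_tier, n=2):
    """Return the n products most often bought by members of the tier."""
    return [product for product, _ in Counter(purchases_by_tier[tier]).most_common(n)]

centroids = fit_centroids(TRAINING)
tier = classify((45, 125), centroids)  # a new visitor's profile
print(tier, recommend(tier, PURCHASES))
```

The tiers come from the analyst, so the model can only ever assign a visitor to one of the classes that appear in the training data.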

The second approach performs the analysis with no prior expectations. Instead of starting with predefined customer tiers, all customer profiles are subjected to a clustering method that tries to allocate each customer into one of some arbitrary number of specific classes. Once those classes have been created, a classification model can be created to segment new customers, and again, the purchase patterns of the members of each class are used as the model for promoting products to each individual visitor.
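A minimal sketch of this second approach follows, assuming k-means as the clustering method (the article does not name one). The profile data, the choice of two clusters, and the function names `kmeans` and `assign` are hypothetical, and the first k profiles seed the centroids only to keep the example deterministic.

```python
from math import dist

# Hypothetical unlabeled profiles: (age, income in thousands). Made-up values.
PROFILES = [(22, 30), (50, 140), (26, 40), (48, 130), (24, 35), (52, 150)]

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm), seeded with the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(column) / len(members) for column in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids

def assign(point, centroids):
    """Segment a new customer by nearest cluster centroid."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

centroids = kmeans(PROFILES, k=2)
print(assign((25, 33), centroids), assign((49, 135), centroids))
```

The cluster identifiers returned here are just indexes. Nothing in the output says what the members of a cluster have in common, so that interpretation is left to the analyst.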

These reflect two common approaches to model development. The first approach employs a "supervised learning" task that infers the resulting model based on sets of training data. The predefined characteristics and behaviors of the desired customer classes form the basis of the desired output of the model, and the machine-learning methods are used to find the best fitting set of attributes for classification of future customers in alignment with the desired outcomes. The second approach, called "unsupervised learning," has no preconceptions about grouping, leaving the determination of how customers are clustered up to the algorithm.

There are benefits and drawbacks to each of these approaches. For example, one benefit of the supervised approach is that your classes can be expected to reflect groups of individuals who share common preferences and behaviors. A drawback, however, is that the training data sets sometimes carry inherent biases that may prevent the creation of a reliable classification and recommendation model.

Alternatively, one benefit of the unsupervised approach is that there are no predetermined biases associated with the assignment of individuals to classes. The drawback is that it may not be clear why individuals were grouped into a particular cluster or what specific commonalities the members of each group share.

Both are valid approaches, and the decision to opt for one over the other may depend on the type of application, the data sets that are available as training data, and predispositions to specific desired outcomes. However, with the availability of easy-to-use machine-learning and analytical modeling tools, it may be worth trying both approaches and testing to see which leads to the best outcomes.
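One possible way to test both approaches is to hold back each customer's most recent purchase and check how often it appears among the top products of that customer's segment. The sketch below assumes that hit-rate measure, which the article does not prescribe. The customer names, purchase data, segment assignments, and the function names `top_products` and `hit_rate` are all hypothetical, and the toy numbers say nothing about which approach performs better in practice.

```python
from collections import Counter

# Hypothetical past purchases and one held-out later purchase per customer.
HISTORY = {
    "ann": ["grinder", "grinder"],
    "bob": ["grinder", "kettle"],
    "cy": ["mug"],
    "dee": ["mug", "kettle"],
}
HELD_OUT = {"ann": "grinder", "bob": "grinder", "cy": "mug", "dee": "mug"}

# Two hypothetical segmentations of the same customers.
supervised_segments = {"ann": "great", "bob": "great", "cy": "okay", "dee": "okay"}
clustered_segments = {"ann": 0, "bob": 1, "cy": 0, "dee": 1}

def top_products(segment_of, history, n=1):
    """For each segment, list the n products its members bought most often."""
    counts = {}
    for customer, products in history.items():
        counts.setdefault(segment_of[customer], Counter()).update(products)
    return {
        segment: [product for product, _ in counter.most_common(n)]
        for segment, counter in counts.items()
    }

def hit_rate(segment_of, history, held_out, n=1):
    """Fraction of customers whose held-out purchase is in their segment's top n."""
    tops = top_products(segment_of, history, n)
    hits = sum(
        held_out[customer] in tops.get(segment_of[customer], [])
        for customer in held_out
    )
    return hits / len(held_out)

print(hit_rate(supervised_segments, HISTORY, HELD_OUT))
print(hit_rate(clustered_segments, HISTORY, HELD_OUT))
```

Whichever measure is used, both models should be scored on the same held-out data so that the comparison is fair.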

For more on integrating predictive analytics into your business processes, see the final article in this series.

About the Author

David Loshin is a recognized thought leader in the areas of data quality and governance, master data management, and business intelligence. David is a prolific author regarding BI best practices via the expert channel at BeyeNETWORK and numerous books on BI and data quality. His valuable MDM insights can be found in his book, Master Data Management, which has been endorsed by data management industry leaders.
