How and Why Your Enterprise Should Democratize Data Science
To overcome the shortage of data scientists, many organizations are turning to democratizing data science. Here's what you need to know.
- By Ryohei Fujimaki
- January 29, 2019
Data science is now a major area of technology investment given its impact on customer experience, revenue, operations, supply chain, risk management and many other business functions. However, recent research indicates that although digital transformation and AI journeys are key initiatives, companies are struggling to get them off the ground. One of the key challenges is hiring the right team, including a scarce commodity -- the data scientist.
One of the most noticeable trends for overcoming this challenge and accelerating enterprise data science is data science democratization: empowering citizen data scientists (such as business analysts and business intelligence engineers) to solve complex data science problems so that a broader range of practitioners can execute data science projects. Although this concept has been widely discussed, many enterprises have struggled to truly democratize data science. This article discusses best practices for enterprises to follow when democratizing data science.
Data Science Project Requirements
A typical enterprise data science project is highly complex and involves many steps (as illustrated below), not just machine learning.
You'll first need to understand the business and domain so you can define the use cases with the highest impact. Any data science project needs to have clear business objectives with well-defined business hypotheses.
The data and feature pipelining step transforms business data in preparation for machine learning; it is the most manual, challenging, and time-consuming step. It requires deep domain knowledge to extract meaningful patterns from business data as well as data engineering skills to deal with large and complex data.
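To make the feature pipelining step concrete, here is a minimal Python sketch of turning raw business records into model-ready features. The transaction data, the field names, and the build_customer_features helper are all hypothetical and chosen only for illustration; a real pipeline would handle far larger and messier data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw business data: one row per purchase transaction.
transactions = [
    {"customer_id": "c1", "amount": 20.0},
    {"customer_id": "c1", "amount": 40.0},
    {"customer_id": "c2", "amount": 15.0},
]


def build_customer_features(rows):
    """Aggregate transaction rows into one feature row per customer."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row["customer_id"]].append(row["amount"])
    return {
        customer_id: {
            "purchase_count": len(values),
            "total_spend": sum(values),
            "avg_spend": mean(values),
        }
        for customer_id, values in amounts.items()
    }


features = build_customer_features(transactions)
print(features["c1"])
```

Deciding which aggregates are meaningful (counts, totals, averages, or something more specific to the business) is where the domain knowledge mentioned above comes in.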
The machine learning model development step is about developing the best machine learning model to solve your problem; this often involves testing different machine learning algorithms to find those with the best performance. Different algorithms typically have different characteristics (e.g., accuracy versus interpretability) so you must choose the right algorithm based on the nature of the project.
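The idea of testing several candidate algorithms and keeping the best performer can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the tiny churn data set, the two toy models (a majority-class baseline and a single-threshold rule), and accuracy on a held-out split as the selection metric. Real projects would use a machine learning library and more careful validation.

```python
from statistics import mean

# Hypothetical labeled data: (monthly_spend, churned) pairs.
data = [(5, 1), (8, 1), (12, 1), (18, 0), (25, 0), (30, 0), (9, 1), (22, 0)]
train, test = data[:6], data[6:]  # simple hold-out split


def fit_majority(rows):
    """Baseline: always predict the most common label seen in training."""
    label = 1 if sum(y for _, y in rows) * 2 >= len(rows) else 0
    return lambda x: label


def fit_threshold(rows):
    """Interpretable rule: predict churn when spend is below a cut point
    halfway between the average spend of churners and non-churners."""
    churn_avg = mean(x for x, y in rows if y == 1)
    stay_avg = mean(x for x, y in rows if y == 0)
    cut = (churn_avg + stay_avg) / 2
    return lambda x: 1 if x < cut else 0


def accuracy(model, rows):
    """Fraction of rows for which the model predicts the true label."""
    return mean(1 if model(x) == y else 0 for x, y in rows)


candidates = {"majority": fit_majority, "threshold": fit_threshold}
scores = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Accuracy is only one possible criterion; if interpretability matters more for the project, the selection rule would weigh that as well.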
In the visualization and validation step, you'll present the modelling outcomes to your business team and solicit their feedback. This is critically important because these outcomes must be accepted by the business team before implementation and adoption in business.
Production and operationalization is the final step that puts the data science pipeline into production. This is a highly complex area because many practical operational issues of data science (such as quality, scalability, maintainability, integration, and portability) have to be addressed during this step.
Figure 1. Steps and skills in typical data science projects.
Given the complexity of the data science process, one enterprise data science project typically takes several months to complete, even for an experienced team. Overall, democratization of data science is neither easy nor trivial for enterprises.
How to Democratize Data Science
Let's examine the practical steps to enable data science democratization for an organization.
Establish a data-driven culture. The outcomes of data science projects will be ultimately consumed by a business team. Therefore, it is important to educate them on the value -- as well as the limitations -- of data science. Data-driven culture and literacy for data science consumers are critical to a successful implementation of data science projects in business.
Enforce data and analytics governance. Data is undoubtedly the single most important asset in a data science practice. Strong data governance, which allows for secure and flexible data utilization, could enable "citizens" to execute more data science projects.
Train BI and analytics talent. "Reskill" your current staff from adjacent fields -- such as BI engineers or business analysts -- to be citizen data scientists. They already have a relevant background in data analytics and are familiar with the data science process. Experienced data scientists can act as evangelists, sharing data science best practices and guiding the citizen data scientists through the process.
Introduce appropriate tools. Data science is an interdisciplinary field by nature; it involves many technical components, such as mathematics, statistics, and computer science as well as SQL, R and/or Python coding. The data science process itself is also highly complex as we've mentioned. Utilizing appropriate toolsets could significantly shorten the learning curve, empowering citizen data scientists to focus on what problems to solve, not how to solve them.
Tools to Support Data Science Democratization
Although traditional visual programming-based platforms (such as SAS, SPSS, and Alteryx) do not require coding skills (including SQL, Python, and R), they still require significant knowledge and expertise as well as manual effort to draw the entire pipeline from source data and feature engineering to machine learning.
Machine learning automation tools (such as DataRobot and H2O.ai) have recently become popular in data science democratization. These tools significantly simplify several of the steps we've mentioned, such as machine learning model development and visualization and validation.
However, data and feature pipelining for both development and production is still manual and time-consuming, impeding true data science democratization. Smart data preparation platforms (such as Trifacta and Paxata) are essentially intelligent data pipelining tools that provide an interactive environment for data processing and warehousing, simplifying the master data management process.
In contrast, a new category of automation tools -- data science automation platforms (such as dotData, the company I lead) -- supports all of the above tasks by automating the entire end-to-end data science process, including feature engineering and machine learning, to support data science democratization.
How Will Democratization Change Data Science?
Will data scientists be replaced by citizen data scientists? The answer is no. Enterprises still need skilled experts to deliver high-impact data science projects. Meanwhile, data science democratization will impact enterprises from a different perspective in three major areas.
Mitigate scarce resources. Experienced data scientists are important but scarce, given the significant shortage of data science talent in the market (an interesting finding from a recent blog post). From the perspective of enterprise growth strategy, data science democratization enables an enterprise to mitigate the risk of a data scientist shortage by leveraging citizen data scientists to execute data science projects.
Focus important resources on tasks that create higher value. There are hundreds of potential use cases that data science can address, but they are not equally important. You can focus your highly skilled data scientists on use cases that create more value while leveraging your citizen data science resources to address the rest of them.
Deliver 10x more data science projects. As we mentioned, typical data science projects are manual and time-consuming, which slows their time-to-market. Automation tools democratize data science practices, empowering your team to deliver more projects faster than before.
Data science innovation becomes indispensable as more enterprises transform themselves into data-driven organizations. Data science democratization through automation will unquestionably help accelerate both data science and business innovation, delivering greater and broader business impact.