Automated Machine Learning and the Future of Data Science Teams
With the advent of automated machine learning, data scientists will need to adapt their role in the data science life cycle.
- By Troy Hiltbrand
- April 12, 2019
Automated machine learning (autoML) is being adopted more broadly across all industries as companies try to get the most out of their data science programs. As this trend continues, many data scientists are questioning their value to the organization and what they can offer that autoML cannot. To understand this, it is important to understand just what autoML is and how it fits into the full data science life cycle.
AutoML is the umbrella term for tools and platforms that automate the steps of selecting the right model and optimizing its hyperparameters to generate the best model possible for a given data set. Libraries such as auto-sklearn and auto-WEKA provide these autoML capabilities. There are also cloud platforms in this space that provide an entire ecosystem for automating machine learning, including Azure Machine Learning, Amazon Machine Learning, Google Cloud Platform, and IBM Watson. These cloud offerings fall under the category of machine learning as a service (MLaaS).
The goal of autoML is to shorten the cycle of experimentation and trial and error. It iterates through a large number of candidate models and the hyperparameters used to configure them to determine the best model for the data presented. This is a tedious and time-consuming activity for even a highly skilled data scientist. AutoML platforms can perform this repetitive task more quickly and exhaustively, reaching a solution faster and more effectively.
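At its core, this search is an exhaustive loop over configurations, each scored against the data. The following minimal sketch shows the idea with a toy line-fitting problem and invented hyperparameter grids; a real autoML tool runs the same loop over far larger model and parameter spaces.

```python
from itertools import product

# Toy training data: points on the line y = 2x (hypothetical example).
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def mse(slope, intercept):
    """Mean squared error of the line y = slope*x + intercept on the data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in data) / len(data)

# The search space: every combination of candidate hyperparameter values.
slopes = [0.5, 1.0, 1.5, 2.0, 2.5]
intercepts = [-1.0, 0.0, 1.0]

# Exhaustively score every configuration and keep the best one --
# the repetitive work an autoML platform performs at much larger scale.
best = min(product(slopes, intercepts), key=lambda p: mse(*p))
print(best)  # (2.0, 0.0) -- the configuration with the lowest error
```

In practice the candidates are whole model families (trees, linear models, ensembles) rather than two numbers, but the shape of the computation is the same.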
The ultimate value of the autoML tools is not to replace data scientists but to offload their routine work and streamline their process to free them and their teams to focus their energy and attention on other aspects of the process that require a higher level of thinking and creativity. As their priorities change, it is important for data scientists to understand the full life cycle so that they can shift their energy to higher-value tasks and hone their skills to further elevate their value to their organizations.
Business Case Development
The first step in any machine learning initiative is to identify the problem the business needs to solve. During problem identification, data science teams evaluate what defines success for the business and determine where machine learning can help achieve those targets.
In this step, it is vital that data science teams understand their business (and business in general) well. Team members must understand business processes, have expertise in existing and potential markets, know the competitive and regulatory landscape within which the business operates, and be able to navigate the political ecosystem in which the data science program lives.
Such business acumen is not always among the strengths of traditional data scientists -- whose focus has traditionally been the mathematical and computer programming aspects of the role -- but it must become so for the future. This is an opportunity to broaden the team's composition: the more technical data scientists can coach individuals with business expertise on the intrinsic value of machine learning, and technical data science teams working jointly with tech-savvy business partners can improve their outreach into (and value to) the business.
Data Procurement
Machine learning lives and dies by its ability to ingest high-quality data. The lower the quality of the incoming data, the lower the quality of the model. This requirement holds just as strongly with autoML: it accelerates the process, but it can generate a poor model from poor-quality data just as quickly as it can a good model from high-quality data.
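One common safeguard is a validation pass that screens records before they ever reach model training. A minimal sketch follows; the field names, ranges, and sample records are hypothetical, chosen only to illustrate the pattern.

```python
def validate(record):
    """Return a list of quality problems found in one input record."""
    problems = []
    if record.get("age") is None:
        problems.append("missing age")
    elif not 0 <= record["age"] <= 120:
        problems.append("age out of range")
    if not record.get("country"):
        problems.append("missing country")
    return problems

records = [
    {"age": 34, "country": "US"},   # clean
    {"age": -5, "country": "US"},   # impossible value
    {"age": 41, "country": ""},     # missing attribute
]

# Keep only clean records; a bad row degrades any model, autoML-built or not.
clean = [r for r in records if not validate(r)]
print(len(clean))  # 1
```

Collecting the rejected records and their reasons, rather than silently dropping them, also gives the team evidence to take back to upstream data owners.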
The role of the data science team is to procure high-quality data sources both within and outside the organization. Data procurement also includes negotiating effectively with other departments to persuade them to share their information assets for the betterment of the organization. It includes finding and negotiating with third-party vendors who have valuable data that will strengthen the model.
If the data is not widely available within the organization or from a third party, data science teams often have to turn to techniques such as Web scraping or even set up data acquisition processes to capture the data they need. Acquiring the right data can include operationalizing a data pipeline from upstream sources, such as the Internet of Things (IoT), to generate the data needed for model development.
Feature Engineering
The next step is to provide structure to the input data. With correctly structured source data, autoML can effectively create a model and optimize its hyperparameters. This process of input data manipulation is commonly known as data munging or feature engineering.
Structuring the data includes converting numerical attributes to categorical attributes, breaking attributes into more granular components, deriving attributes from other attributes, cleansing attributes, and normalizing attributes. Feature engineering is often as much an art as a science: data scientists must vigilantly curate the data to make it usable by a model, and do so in a way that is repeatable and preserves a lineage from each data attribute back to its source.
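Three of those operations can be sketched concretely. In this hypothetical example (the record fields, bands, and values are invented for illustration), an age is converted to a categorical band, a date attribute is broken into a component, and an income is min-max normalized.

```python
# Hypothetical raw records; field names are invented for illustration.
rows = [
    {"age": 25, "income": 40000.0, "signup": "2018-03-01"},
    {"age": 52, "income": 90000.0, "signup": "2017-01-15"},
]

def age_band(age):
    """Convert a numerical attribute into a categorical one."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"

incomes = [r["income"] for r in rows]
lo, hi = min(incomes), max(incomes)

features = []
for r in rows:
    features.append({
        "age_band": age_band(r["age"]),                # numeric -> categorical
        "signup_year": int(r["signup"][:4]),           # break into components
        "income_norm": (r["income"] - lo) / (hi - lo), # min-max normalization
    })
print(features[0])  # {'age_band': 'young', 'signup_year': 2018, 'income_norm': 0.0}
```

Keeping each derivation in a named, versioned function like `age_band` is one simple way to make the curation repeatable and traceable back to source attributes.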
With the advent of deep learning, there have been many discussions as to whether feature engineering is also becoming irrelevant and is a candidate for automation in the future. Although many companies have started to show remarkable results with deep learning, its niche today is in cognitive domains such as image recognition, machine translation, and natural language processing. It is in these domains that the number of raw inputs is vast and each variable has little or no meaning on its own, and it is here that deep learning can extract features from this plethora of data points.
In other business domains, the curation of features by the data science team is still a highly valuable data preparation step. Even when companies use deep learning methodologies, they will often combine these automatically identified features with human-curated features to produce optimal results in model development.
Model Evaluation and Business Impact Evaluation
Finally, once autoML has efficiently run through thousands (or tens of thousands) of permutations to identify the model that works best with the data provided, it is still important that the data science team evaluate the results and validate that they will drive the intended business case. A highly tuned model can still miss the mark, and it is up to the team to assess its fitness for purpose.
It is also vital that the data science team monitor the model once it has been deployed into production. The team needs to ensure that the model can perform as well with real data as it did with training data and that the business objectives are being achieved through the application of the model. This process of model evaluation requires close engagement with the business to identify metrics that have organizational value. Data science teams need to converse in terms that resonate with their business counterparts.
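A simple form of this monitoring is to track a rolling window of production outcomes and flag the model when live performance drops below the training baseline. The sketch below is illustrative only; the baseline, tolerance, and window size are invented, and a real deployment would tie the alert to business metrics agreed with stakeholders.

```python
from collections import deque

TRAINING_ACCURACY = 0.92  # hypothetical accuracy achieved on training data
TOLERANCE = 0.05          # how far live accuracy may drop before we flag it

window = deque(maxlen=100)  # rolling window of recent prediction outcomes

def record_outcome(correct):
    """Log whether the production model's latest prediction was correct."""
    window.append(1 if correct else 0)

def model_degraded():
    """True when live accuracy falls meaningfully below the training baseline."""
    if len(window) < window.maxlen:
        return False  # not enough evidence yet
    live = sum(window) / len(window)
    return live < TRAINING_ACCURACY - TOLERANCE

# Simulate 100 production predictions: 80 correct, 20 wrong -> 0.80 live accuracy.
for i in range(100):
    record_outcome(i % 5 != 0)
print(model_degraded())  # True: 0.80 < 0.92 - 0.05
```

When the flag trips, the team can investigate whether the cause is data drift, an upstream pipeline change, or a genuine shift in the business, and retrain accordingly.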
The Future of the Data Science Team
As autoML becomes a larger part of the industry, it relieves the data science team of the repetitive and cumbersome process of model selection and hyperparameter optimization. As this evolution occurs, data science teams will continue to play a vital role in the organization, but they must identify which other processes in the data science life cycle need their attention and shift their energy there.
This transformation is also an opportunity for data scientists to assess their skills and determine whether they have the competencies needed to strengthen these other processes as their previously in-demand tasks are automated away.