Why Context, Consistency, and Collaboration are Key to Data Science Success
If you want your data science team to achieve more, make sure your data science practice meets these three criteria.
- By Joshua Poduska
- November 19, 2021
Given how quickly the fields of artificial intelligence and machine learning are growing and the resulting opportunities to discover profound insights, best-in-class data science requires more than one scientist on a laptop. Once you have a data science team, the members must work together; there's important information that needs to be shared about data prep, results of prior projects, and the best way to deploy a model.
Today, if you want your team to move faster, you need context, consistency, and secure collaboration in data science. In this article, we'll examine each of these requirements.
Context

Model building is an iterative, try-it-and-fail experimental practice, and it is often performed by one data scientist at a time. However, a great deal of institutional knowledge is lost if that data scientist doesn't document and store their work and make it searchable by others.
Further, what about junior or citizen data scientists looking to jump into a project to build their skills? Both synchronous and asynchronous collaboration rely on context: collaborators need to know more about the data they're looking at, how people have addressed the problem in the past, and how prior work informs the current landscape.
The process of documenting projects, models, and workflows can feel like a distraction from the more immediate need to move a model into production. Leaders need to support a culture of knowledge sharing so that the whole company benefits and the data science team can build a foundation of expertise.
For example, leaders might consider making the insights data scientists contribute to the broader knowledge base part of standard review and feedback sessions, so that collaboration is recognized as an essential principle at the company. Software systems, workbenches, and best practices can help streamline the process of capturing context and improve discoverability in the future. Without knowledge management and context, new employees struggle to onboard and contribute, and teams spend time re-creating projects instead of building on previous work, which can slow down the entire enterprise.
Building this foundation of knowledge also reduces key person risk. If someone goes on vacation or leaves a project, other team members have the necessary base from which to jump in and keep that project going.
Consistency

We've already witnessed amazing results from the machine learning (ML) and artificial intelligence (AI) fields. Financial services, health and life sciences, manufacturing -- all are going through foundational changes thanks to AI and ML. However, these industries are also heavily regulated, and for an AI project to genuinely change such an industry, it needs to be reproducible, with a clear audit trail. IT and business leaders need to know there's a consistency to the results that will give them confidence in making the strategic business shifts that AI can facilitate. With so much riding on these projects, data scientists need an infrastructure that gives them full reproducibility from beginning to end and convinces top executives of the project's significance.
As data science teams grow and their mix of tools, training sets, and hardware requirements becomes more complex, getting consistent results from older projects can be challenging. Processes and systems for environment management are a must for growing teams. For example, if you're working off your laptop as a data scientist and a data engineer has a different version of a library running on a cloud VM, you may see your model generate different results from one machine to the next. This can happen because open source model-building libraries often change default parameter settings as new best practices become established, so the same code run against two versions of a library can produce different models. Collaborators need a consistent way of sharing the exact same software environments.
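As a minimal sketch of this idea, the snippet below records an environment "fingerprint" (interpreter, OS, and pinned package versions) so collaborators can detect a version mismatch before comparing model results. The package names and versions are illustrative assumptions, not a prescription for any particular stack.

```python
# Sketch: record the exact environment a model was trained in, so
# collaborators can spot version drift before comparing results.
# Package names/versions below are illustrative only.
import hashlib
import json
import platform
import sys


def environment_fingerprint(packages):
    """Capture interpreter, OS, and pinned package versions."""
    env = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }
    # A stable hash makes it cheap to compare environments across machines.
    env["hash"] = hashlib.sha256(
        json.dumps(env, sort_keys=True).encode()
    ).hexdigest()[:12]
    return env


# Example: the laptop and the cloud VM disagree on one library version.
laptop = environment_fingerprint({"scikit-learn": "1.4.2", "pandas": "2.2.1"})
cloud_vm = environment_fingerprint({"scikit-learn": "1.5.0", "pandas": "2.2.1"})

if laptop["hash"] != cloud_vm["hash"]:
    mismatched = {
        name: (laptop["packages"].get(name), cloud_vm["packages"].get(name))
        for name in set(laptop["packages"]) | set(cloud_vm["packages"])
        if laptop["packages"].get(name) != cloud_vm["packages"].get(name)
    }
    print(f"Environment mismatch: {mismatched}")
```

In practice this role is played by lockfiles, container images, or a workbench that snapshots environments; the point is simply that the environment must be captured as data, not remembered.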
Retraining and updating data science models is becoming more important as the field matures and grows in relevance. Models evolve over time, and data can start to drift as more information is captured. Thinking of a model as "one and done" is incompatible with a changing business world that brings new pricing models or product offerings.
The key is to recognize that when the business changes, the data changes, and the best leaders pay attention to refreshing and retraining their models on an ongoing basis. An inventory of model versions helps manage those changes and track performance across versions over time -- and those versions become part of an institution's intellectual property.
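A model inventory can be sketched very simply: a chronological record of versions, each tied to a training-data snapshot and a tracked metric, so a drop in performance between versions prompts a look at data drift. The model name, snapshot identifiers, and metric below are hypothetical.

```python
# Sketch of a model inventory: versions tied to data snapshots and a
# tracked metric. Names and numbers are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: str
    trained_on: str        # identifier of the training-data snapshot
    metric_name: str
    metric_value: float


@dataclass
class ModelInventory:
    name: str
    versions: list = field(default_factory=list)

    def register(self, version: ModelVersion) -> None:
        self.versions.append(version)

    def latest(self) -> ModelVersion:
        return self.versions[-1]

    def history(self):
        """Metric trajectory across versions, oldest first."""
        return [(v.version, v.metric_value) for v in self.versions]


inventory = ModelInventory("churn-model")
inventory.register(ModelVersion("v1", "snapshot-2021-06", "auc", 0.81))
inventory.register(ModelVersion("v2", "snapshot-2021-09", "auc", 0.78))

# A drop between versions is a prompt to investigate drift and retrain.
print(inventory.history())
```

Real model registries add lineage, approvals, and deployment state on top of this, but the core value is the same: changes become visible and comparable instead of living in someone's head.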
We've seen how a foundation of prior knowledge can quickly accelerate new projects and how you need consistent results (or at least trackable results) when solving the complex questions that deliver value for businesses today.
Collaboration

You also need a third component. With the increase in remote work, many enterprises discovered that collaborating on data science is much harder than it was when employees worked shoulder to shoulder. Yes, some core work can be handled by a lone data scientist -- such as prepping the data, researching, and iterating on new models -- but too many leaders have made the mistake of not encouraging collaboration, and productivity has suffered for it.
How do you coordinate data scientists, engineers, and domain experts -- along with IT, operations teams, and executive leadership -- all while keeping your data safe? How do you bring these different perspectives and ideas together, ensuring everyone is working from a single source of truth -- and that this data is secured by enterprise-grade, cloud-based services? Shared documents, emailed spreadsheets, public code repositories, and internal wikis are all quick and easy ways to share information -- but the easier it is to share information, the easier it is for information to leak out.
Not many people like digging through emails or comparing file versions to ensure they have the right data. Having to rely on a variety of sources just adds unnecessary cognitive load. By using a cloud-based tool, data science professionals can bring enterprise security to data science research and leverage IT best practices.
A Final Word
Seeing how far data science has progressed in the past few years has been amazing. Data scientists are helping companies around the world answer formerly unsolvable questions with confidence. However, as our field matures, it's time to move out of the "flying by the seat of our pants" mode. Digital tools such as software workbenches that provide context, facilitate consistency and enable secure collaboration will help us make data science more useful and more consistent with less effort.
Joshua Poduska is the chief data scientist with Domino Data Lab, a data science platform that accelerates the development and deployment of models while enabling best practices like collaboration and reproducibility. He has 18 years of experience in analytics. His work experience includes leading the statistical practice at one of Intel's largest manufacturing sites, working on smarter cities data science projects with IBM, and leading data science teams and strategy with several big data software companies. Josh has a master's degree in applied statistics from Cornell University. You can reach the author via Twitter or LinkedIn.