TDWI Upside - Where Data Means Business

Is Collaboration the Critical Success Factor for Data Science?

The members of a data science team need to collaborate with one another in real time, and data science start-up Dataiku claims to have the best tool for the job.

What's the most critical success factor in data science? According to Florian Douetteau, cofounder and CEO of data science start-up Dataiku, it's teamwork and collaboration. No data scientist is an island, Douetteau argues -- nor could she be.

"What we do see today in the field, and specifically ... in large companies, is actually teams where you have a mix of [business] analysts, data engineers, and data scientists," he says, noting that the business analyst is sometimes the forgotten piece of the data-scientific enterprise.

"To oversimplify, analysts are people with a very good knowledge of the data source, the business process. People who are data-savvy and have obviously a contemplative mindset -- but aren't all that into coding."

These team members need to be able to share ideas, insights, and criticisms. They need to be able to interact with one another -- to communicate and to demonstrate: to show, not just tell.

In sum, they need to be able to meaningfully collaborate.

Most Data Science Tools Fall Short

Unfortunately, Douetteau contends, it isn't that simple. Few data science tools (more precisely, few tools that are used in and for data science) promote a meaningfully collaborative experience. On Douetteau's (decidedly biased) terms, even popular data science notebooks are insufficiently collaborative.

Enter Dataiku's flagship product, Data Science Studio (DSS). In addition to fostering collaboration, Dataiku DSS aims to promote reuse and repeatability, Douetteau says. If collaboration is indispensable for data science research and development, reuse and repeatability are no less essential. It's one thing to discover and refine insights; it's quite another to make practical use of them.

"Because it is team oriented, collaboration is a big bane for companies that embrace data science, just because it's hard to make data scientists work with one another and also with other people within a company," Douetteau argues. "The key problems are repeatability, reuse, and teamwork. In our product, we actually try to enable all three."

How Collaboration Works in Data Science

What do Douetteau and Dataiku mean by "meaningful" collaboration? He uses the example of dataflow design to illustrate how collaboration should happen.

"Think of real-time collaboration on designing a data flow, where people can build a part of the flow, connect [their parts] together with other [parts, and] assign tasks to [others on] the team," he explains.

"Real-time collaboration is important for teamwork just because you need to be able to see what other people are doing. When you're able to see what someone is doing while they're doing it, you get ideas. You can suggest a better way of doing it; you can point out other [factors] they should consider. What you don't really want ... is any system where you're pushing the results out to team members [after the fact] -- e.g., by sending them over email."

The futurists keep telling us email is dead, but this categorically isn't the case in the data science trenches, says Douetteau. For too many data science teams, email lives and even thrives as a means of collaboration and information sharing. The problem is that ideas and insights shared via out-of-band mechanisms such as email are effectively lost to an organization's knowledge base.

Documenting Analytics as Valuable IP

Information loss isn't just intellectual property (IP) loss; it translates into lost productivity, too: brainstorming, experimentation, research, even coding must be re-created from scratch -- each and every time they're needed.

"It is very important to realize that when you're building analytics, you're also creating some very important intellectual property for your company. Today, we're capturing this IP, documenting the process, and understanding where the data is," Douetteau says.

"Our customers usually build most of their analytics and do most of their collaboration inside the product, specifically anything related to sharing data, [such as] data sets or sharing models. [This means that] what the members of a data science team build stays within the product and is also shared within the product," he explains.

"This is actually much easier to do than exporting something as an Excel file and possibly sampling down the data to do so and sending it over email."

A Tool for Sharing Working Models

Dataiku DSS also addresses the problem of reuse and repeatability, he claims. It isn't primarily an environment for sharing analyses -- visualizations, dashboards, and the like. It's for sharing the data transformations, analytical models, and working data sets that enable prediction and analytical interpretation.

"The scope of our product is data transformations and machine learning stuff. When it's about sharing results, charts, dashboards, and so on, the customer would use the plugins we have to export [that] to other software, [such as] Tableau or Qlik. Our goal is to have an environment where people within a data team can easily share [analytics models] and other kinds of assets."

In future revisions, Dataiku expects to introduce integration with popular project management tools too, says Douetteau.

"Something we are specifically thinking about is how well we integrate the product with existing project management technologies and project management products. The analytics project these days is a strange beast. It's like a business project for all intents and purposes, but because of the technical part of it, it's usually a project that's managed as a technical project ... with milestones and tasks and so on, and so it requires some actual project management," he notes.

"We are looking at this problem and plan to deliver this integration next year. To give you some examples of possible integration we are thinking about -- products such as Jira or Confluence."

About the Author

Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about. You can contact him at evets@alwaysbedisrupting.com.

