TDWI Upside - Where Data Means Business

What Makes a Data Scientist Tick?

Data scientist Denny Lee shares three must-have skills for data scientists and argues that being asked to do the impossible brings out the best in an analyst.

In at least one critical respect, Denny Lee, a technology evangelist with Apache Spark's commercial parent company Databricks, is similar to many of his data science colleagues.

Lee didn't come to data science as part of a calculated career decision. In his case and many others, the field seems to exert a kind of tidal force on would-be practitioners. Data scientists are pulled into its orbit from many different directions, some explicitly statistical and analytical, some much less so.

For example, Lee came to data science from complementary work in physiology and statistics. After a post-grad dalliance with a Web analytics start-up, Lee landed a job with the esteemed Fred Hutchinson Cancer Research Center in Seattle, where he worked in HIV/AIDS research. This gave him his first taste of advanced statistical technologies and methods. "Examining all that data was my first exposure to data science and working with tools ranging from SAS and SPSS to S+ and R," Lee explains.

Honing His Spark Skills

After working with Fred Hutch for a few years, Lee took a job with Microsoft, first working on the Bing search engine and then switching over to Microsoft's SQL Server team. That gave him a chance to work with some of the biggest SQL Server implementations in the world -- including a 24 TB Analysis Services cube. Lee also worked on the "incubation team" that brought Apache Hadoop to Windows and Azure. This was his formal introduction to Apache Spark. In his next few stops, he honed his Spark skills, ultimately ending up at Databricks, where Spark means business -- literally.

At Databricks, Lee works with business customers, academic researchers, users in the community, and other Spark enthusiasts. "The work I'm doing is focused on helping Databricks customers in using data science in their analysis and understanding of their own data," he explains.

"We do this through simplifying data science and data engineering with [the] Databricks [platform]."

Asking the Right Questions

After years of working with and interpreting data, Lee has identified three skills would-be data scientists would do well to cultivate and hone. The first, he says, is an ability to ask questions.

Lee isn't talking about questioning in the abstract; he's talking about a capacity to ask the right questions. "Being inquisitive and asking questions helps you to look at problems in different ways -- putting you in the best position to solve data analytics problems," says Lee, who addressed this issue on his personal blog.

A second critical skill has to do with tool selection. Suffice it to say, it's important to use the best tool for the job. Don't try to make a problem fit your favorite tool (or the tool you're most comfortable using); approach problems on their own terms, selecting the best technologies, tools, and methods on that basis.

"It isn't about the technology, it's about understanding the data. Ultimately, [it's about] finding the best tool for the job, whether it be Scala, Python, R, Java, C++, S+, Matlab, or carrier pigeon. I'm such a big fan of Apache Spark because it's incredibly ideal to have one framework to solve many data problems, [irrespective] of the paradigm," Lee comments.

Always Learning

The litany of technologies Lee lists points to the third critical skill: a desire to learn continuously, not just about new technologies, but about new paradigms, techniques, and methods.

"It is important to have the ability to continuously push the envelope," Lee says.

This is no less true of data science in the enterprise. Many of the problems enterprises experience in data analysis are actually products of inside-the-box thinking. For example, Lee distinguishes between traditional sources of information -- "legacy systems," to use his language -- and nontraditional sources, such as open data sets, subscription data services, social media data, multimedia data, and geospatial data.

Because the former are found in abundance in enterprise environments, there's a temptation to rely disproportionately on both traditional data sources and traditional tools for analysis. This is a mistake, he argues. "Only using legacy systems, software, or proprietary solutions will ultimately inhibit analysis. These systems are often too complex and too expensive to address current and new data processing paradigms."

The same is true of legacy processes, Lee cautions. "Having a rigid mindset on using an existing set of processes incapable of changing or evolving will only make the analysis of data suffer," he says.

Just as data scientists must prioritize the acquisition and refinement of new knowledge and skills, enterprises must prioritize training -- with emphasis on new concepts. The idea isn't to reinforce what you already know, Lee urges. "Enterprises should always be focusing on training and teaching, instead of just re-teaching what they already know. Use your current knowledge as a solid foundation, but train your people to be flexible to new ideas and paradigms."

He also stresses the importance of evangelism -- you know, networking. The idea is to meet, greet, and interact as much as possible. You aren't doing this simply to hear yourself talk, Lee explains; you're learning and listening, too. "Continuously evangelizing with customers or at meet-ups gives you the ability to not just speak, but also [to] listen and understand problems, solutions, and pain points."

Databricks' Spark Stack

As a Databricks employee, Lee is a heavy user of Databricks' Spark software stack. Spark functions as an all-in-one toolbox for data preparation, exploration, and analysis. As Lee noted, it's important to pick the right tool for the job -- not to try to adapt the job to the tool. However, with a versatile platform such as Spark, he says this is basically a nonissue. "All the tools that I need are provided in one comfortable and powerful user experience: notebooks, clusters, data sciences, visualizations, [and] languages [such as] Python, R, Scala, SQL," Lee explains, noting that there's no lack of domain-specific Python libraries, many of which also integrate with Spark.

Another important tool that Lee uses on a daily basis is a surprisingly mundane one: Google. The strength of Spark, Python, and other open source software (OSS) technologies is that everybody can pitch in and create. Those domain-specific Python libraries Lee referred to above? Many of them are freely available on the Web. Rather than duplicating the efforts of others, Lee prefers to trawl Google in search of Python libraries, open data sets, case studies, and other prefab artifacts.

"Leverage the open source community to enhance your work and your community," he urges.

In a sense, being a data scientist involves regularly performing the impossible. You're always confronted with resource constraints -- especially with respect to time -- and extremely hard problems. Lee sees the impossible as a feature, not a bug, of data science. "Lack of time forces me to rapidly prioritize and deliver more out-of-the-box ideas and solutions," he comments. "Some problems that seem to be immensely complex often force me to be my most creative self to get them done."

About the Author

Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about. You can contact him at [email protected].