TDWI Articles

Q&A: Training Data Scientists in Eight Weeks

In a free eight-week program, the Data Incubator takes Ph.D.s and trains them to be data scientists, placing students at companies including Yelp, Microsoft, The New York Times, Foursquare, Pfizer, and eBay.

As a Ph.D. student entering industry, Michael Li saw firsthand the challenge facing a student trying to translate an academic science background into something more applicable to employers. Later, as a hiring manager, he said, "I see people with great-looking resumes who don't have science chops -- the actual science knowledge, statistics, and programming really necessary to become a data scientist."

That's why he founded The Data Incubator, a data science education company that offers both corporate training and recruiting services. Its signature program offers a select group -- only 50 students are accepted each quarter out of thousands of applicants -- a free eight-week data scientist training program. The hiring companies (including eBay, Capital One, Palantir, and Pfizer) pay a fee only if they successfully hire a student.

Li has worked as a data scientist at Foursquare, as a quantitative analyst at D.E. Shaw and J.P. Morgan, and as a rocket scientist at NASA. He earned his doctorate at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall scholar. At Foursquare, Li says he discovered that his favorite part of the job was teaching and mentoring others about data science.

"Our philosophy is that data science is not a spectator sport," Li says. "For example, although lecture is an important aspect of our corporate training and fellowship, the focus of our curriculum is hands-on projects."

Upside: Can you describe The Data Incubator?

Michael Li: The Data Incubator is an eight-week fellowship that gives 50 students (who already hold master's degrees and doctorates) the essential skills they need to become data scientists for industry. After the eight weeks, we match them with employers who pay for the training.

We get about 2,000 applications each quarter for those 50 slots. Employers range from tech companies such as Yelp, Betterment, and Foursquare to more established companies, for example, The New York Times, Microsoft, eBay, JPMorganChase, and Pfizer. We have also adapted our curriculum to offer customized corporate training in big data and data analytics.

Where did the concept come from, and what is your background as a data scientist?

The idea came from my experiences on both sides of the interview table. As a Ph.D. student who entered industry, I know the challenges that students with an academic science background have in translating that background into something more worldly. As a hiring manager, I see people with great-looking resumes who don't have science chops -- the actual science knowledge, statistics, and programming really necessary to become a data scientist.

Regarding my background in data science, I earned a B.S. in computer science and a Ph.D. in applied mathematics from Princeton University before moving to Wall Street and then Silicon Valley. I worked at Google, Intel, Foursquare, NASA, and Andreessen Horowitz before starting The Data Incubator.

In your eight-week program, you train students -- often Ph.D.s already -- to be data scientists. How is the program structured?

We've worked really hard to make our curriculum top-notch and to keep it up-to-date with the latest technologies. Our fellows spend an intense eight weeks going through our curriculum, which is divided into four major topics:

  • Software development and the ability to manage data pipelines, handle structured and unstructured data, or handle databases

  • Machine learning, statistics, and specialized applications such as time-series and natural language processing

  • Distributed computing topics, e.g., Hadoop, Hive, MapReduce, and Spark

  • Visualization tools and techniques, including libraries such as matplotlib, bokeh, and d3js

You can read about our students' experiences in our alumni spotlights or view their project videos on our YouTube page.

What do you teach at The Data Incubator that isn't taught in business schools, as part of a technology degree, or on the job?

Colleges and universities are slow-moving institutions with highly siloed departments where instructors are rewarded for research, not teaching. Consequently, curricula are often outdated and theoretical. Our subject, data science, is practical, cutting-edge, and interdisciplinary -- exactly the kind of subject matter that universities have a hard time teaching.

We are constantly updating our curriculum and our technologies. For example, when we started, we did not have a module on Spark, but we have developed one since then. Pandas has had 12 releases since we were founded, and we have been upgrading the version we teach alongside those releases.

Our curriculum is completely interactive and Jupyter-based, which allows us to support multiple languages (including Python, Scala, R, SQL, and JavaScript). Although this is great for students, there's a lot of work that goes into getting all these tools to play nicely with one another (it's one of the biggest barriers to entry an aspiring data scientist faces). Our engineering team works full time keeping everything up to date.

That said, I do think universities are good at teaching fundamental concepts, for example, statistics, programming, or core computer science algorithms. We benefit greatly from U.S. universities and the National Science Foundation, which help train researchers in many of the core skills of data science.

Forty thousand students graduate from STEM Ph.D. programs in the U.S. annually, but only 20,000 continue in academia. Because of this, we are able to draw from a very talented national (and international) pool that has benefited from a strong university education.

In a 2014 article in the Harvard Business Review, you made an interesting distinction in the ways data scientist skills are used, a distinction that you said is critical in hiring. Can you recap that explanation?

Employers need to ask one key question when they hire a data scientist: Is the data scientist producing analytics for machines or humans? This distinction is important across organizations, industries, and job titles (our fellows are being placed at jobs with titles that range from "quant" -- quantitative analyst -- to data scientist, to analyst, to statistician).

Unfortunately, most hiring managers conflate the different types of talent and temperament necessary for these roles.

Data scientists who produce analysis for computers need exceptionally strong mathematical, statistical, and computational fluency to build models that can quickly make good predictions. They can piece together a myriad of technical tricks to build very sophisticated models that drive performance, and when even small gains are aggregated across millions of users and trillions of events, their efforts can result in huge gains in revenue.

Data scientists who produce results for people, on the other hand, have to think about how to tell a story from the data. They have to be comfortable drawing higher-level conclusions -- the "how" and "why." These aren't as easily observed in the data as the clear metrics enjoyed by their analytics-for-machines counterparts.

For an employee who has either some technical skills or some business skills and is interested in becoming a data scientist, what's your advice? What is an ideal set of skills for a data scientist?

Our philosophy is that data science is not a spectator sport. For example, although lecture is an important aspect of our corporate training and fellowship, the focus of our curriculum is hands-on projects that push students to work on canonical workflows in data science from A to Z. If you want to learn data science, you'll need to roll up your sleeves and get your hands dirty.

We have a number of blog posts to help employees interested in learning about data science get started; topics include visualization, manipulating data, and efficiently processing large amounts of data. We also have a YouTube page with plenty of free tutorials (and accompanying code snippets).

From the employer's viewpoint, what skills should they focus on in hiring a data scientist -- strong tech skills, business skills, or soft skills such as communication style (or perhaps all three)?

When hiring data scientists, employers tend to focus primarily on technical qualifications. It's hard to find candidates who have the right mix of computational and statistical skills. What's even harder is finding people who have those skills and are good at communicating the story behind the data.

Although it's important to focus on technical proficiency when hiring data scientists, it should never be at the expense of communication skills. Having valuable recommendations locked up in the mind of your data scientist is about as useful as not having generated any at all.

Knowing that the skill set currently commands a hefty salary, how can employers maximize their use of a data scientist?

How to get the most from your data scientist depends on your business -- and on whether you're hiring for a digital or non-digital department. In our experience, candidates usually come from one of two disciplines: computing or statistics.

Candidates with a strong science or math background have usually had rigorous statistical training in distinguishing between signal and noise and can tell when they are "overfitting" a complex model. Those with a computer science background frequently have the software engineering chops to handle large amounts of data by taking advantage of parallel and distributed computing.

Although all data scientists need to be functional in both areas, we've found that people coming from each of these backgrounds have quite different strengths and weaknesses.

Think about the departments in the digital economy that have a regular profusion of data, generated from mobile, tablet, laptop, or desktop sources. Mobile apps, e-commerce, wearables, and digital advertising companies are just a few examples that fall into this category.

When data is plentiful, analytics often benefits from the unreasonable effectiveness of data -- the idea that as we are able to learn from more data, we are able to achieve increasingly accurate models.

Doing so certainly requires a deep knowledge of statistics, but a strong computational background is needed even more. Companies with large amounts of data often benefit from having data scientists with software engineering backgrounds who can quickly build the systems that learn new trends in real time.

When hiring for non-digital departments, the ideal candidate profile is very different. Here, the data comes in more slowly and is more expensive to collect. In this case, pure software engineering will not be nearly as useful. Instead, a strong statistics background can ensure the findings withstand rigorous statistical scrutiny and do not overfit the data.
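The contrast above between data-rich and data-poor settings can be illustrated with a small sketch. It is not taken from The Data Incubator's curriculum. It assumes a made-up linear signal (y = 2x), a fixed alternating perturbation of ±0.3 standing in for measurement noise so the result is reproducible, and polynomial regression as a stand-in for "a complex model." The names `true_signal` and `fit_and_score` are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_signal(x):
    # Assumed ground truth for illustration only: a simple linear trend.
    return 2.0 * x


def fit_and_score(n_train, degree):
    """Fit a polynomial of the given degree to n_train perturbed points.

    Returns (train_mse, test_mse). The test points are the midpoints
    between neighboring training points, scored against the true signal.
    """
    x_train = np.linspace(0.0, 1.0, n_train)
    # Deterministic stand-in for noise: +0.3, -0.3, +0.3, ...
    noise = 0.3 * (-1.0) ** np.arange(n_train)
    y_train = true_signal(x_train) + noise

    # Least-squares polynomial fit of the requested degree.
    model = Polynomial.fit(x_train, y_train, degree)

    x_test = (x_train[:-1] + x_train[1:]) / 2.0
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - true_signal(x_test)) ** 2))
    return train_mse, test_mse


scenarios = {
    "scarce data, simple model (degree 1)": (10, 1),
    "scarce data, complex model (degree 9)": (10, 9),
    "plentiful data, complex model (degree 9)": (200, 9),
}

for label, (n_train, degree) in scenarios.items():
    train_mse, test_mse = fit_and_score(n_train, degree)
    print(f"{label}: train MSE = {train_mse:.2e}, test MSE = {test_mse:.2e}")
```

Under these assumptions, the degree-9 model fitted to only 10 points reproduces its training data almost exactly yet misses badly between those points, which is the overfitting a statistics background helps catch. The same degree-9 model fitted to 200 points tracks the underlying trend closely, a toy version of how plentiful data can make a flexible model pay off.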

What are some ideas for employers to help them retain data scientists once hired?

Attracting and hiring great data scientists is only the first step: companies also need to motivate and retain them. Best practices for that fall into three main categories: support, ownership, and purpose. Support them by making sure they have the right tools and by investing in their education. Give them ownership by involving them in decision making and recognize them when their contributions have advanced your mission.

How are companies such as eBay and Pfizer using data scientists?

At eBay, the machine learning and data science team conducts a wide variety of activities, including machine learning, data mining, economics, user behavior analytics, information retrieval, and visualization. They analyze users, user behaviors, transactions, items, feedback, and especially searches.

For instance, eBay uses intelligence from advanced users and applies it to help what they call "the naive user" (a user who's not good with queries). A lot of effort goes into the first step of cleaning the data. After that, eBay goes six years back in time to analyze user behavior, pretty much in real time.

Pfizer is "using data to more efficiently develop medicines and better define which patients will most benefit," according to a 2015 article in Forbes. They are using big data to understand the obesity epidemic and how it may affect medicine development. They're combining clinical, genomic, and real world data to identify correlations and define subtypes of obesity among individuals, as well as predict who is at risk of diabetes or cardiovascular disease earlier than we can now.

These kinds of insights can help assess how certain types of people may respond to medication, allowing us to explore how we customize treatment depending on genetic, dietary, or even lifestyle factors.

In which industries are you seeing the most growth in the need for data scientists?

We've seen the most growth in big data analytics in a few industries:

Healthcare: The Affordable Care Act is forcing both payers and providers to look into using healthcare analytics to reduce costs as they shift from a fee-for-service model toward outcome-based medicine.

Finance: Ongoing regulatory scrutiny from laws such as Sarbanes-Oxley and Dodd-Frank is forcing banks to use analytics to strengthen oversight -- which is driving a lot of the innovation in the field. At the same time, alternative credit providers are using big data and analytics to improve their underwriting models to aggressively disrupt larger financial players.

Technology: There is excitement around artificial neural networks, which involve complex machine learning that requires no human intervention or domain expertise. On the other hand, we're also seeing that in the most complex problem domains, a combination of human expertise and advanced analytics is still the dominant strategy.
