TDWI Articles

Career Switch Q&A: Negotiating the Path to Data Engineer or Scientist

There's an art to navigating the challenging path to becoming a data scientist or engineer.

Although the panic over data management staffing may have calmed down somewhat, there are many already on the path to being a data scientist or engineer. However, according to big data expert and educator (and long-time TDWI faculty member) Jesse Anderson, there's an art to navigating the challenging path to becoming a data scientist or engineer. Working with big data sets a much higher technical bar than managing a data warehouse, he says. "DBAs are facing the biggest crises from big data. They're faced with a changing landscape of technologies. Data used to be their purview and now they're finding a brand new team and title emerging."

Anderson is managing director of The Big Data Institute, where he provides training on big data issues and technologies, including Apache Kafka, Hadoop, and Spark. He has trained employees at companies ranging from startups to Fortune 100s, teaching thousands of students the skills to become data engineers. Anderson has published a new e-book, The Ultimate Guide to Switching Careers to Big Data -- Upgrading Your Skills for the Big Data Revolution.

Upside: What's the difference between a data scientist and a data engineer? Are those two terms interchanged improperly?

Jesse Anderson: The two positions are actually quite different, and yes, they are often used improperly. A data scientist, by definition, has much more of a statistical and machine learning background, with perhaps intermediate-level programming ability. A data engineer, by contrast, often comes from a software engineering background and should have advanced programming abilities.

Within a company, data engineers are generally responsible for creating the data pipelines; data scientists are then responsible for consuming those data pipelines, perhaps for creating a machine learning model and engine.

Do data scientists and data engineers generally work together?

Because the two positions are often working with the same sort of data, they often work closely together.

In my new book, which is about how to switch careers to work with big data, I talk about the importance of communication. If your data engineering and data science teams are going to be separate, they should have a very high [rate of information transfer] between them. Questions should be answered quickly and the two teams should be helping each other quite a bit. Note that there are sometimes different job titles -- in some companies the positions may be part of the same team.

One of the things I like to point out to managers is this: The data engineering and data science teams are often actually multidisciplinary. Sometimes there are data scientists embedded in the data engineering team. It's very different from what we're used to with straight software engineering or with the usual types of analytics or BI.

We hear a lot about the shortage of data scientists. Is there a shortage of data engineers as well?

Definitely; there is a high demand for both positions. Part of what I try to do is to help people decide if there is sufficient demand in an area and then see if their skills can fulfill that demand. Definition, as we discussed earlier, is part of the issue -- people need a good definition of what a data engineer or data scientist is in order to determine if they have the right skill set to become one. Nebulous definitions make it tough for people trying to switch jobs -- or to fulfill the correct role on a current team, where the consequence is often lost productivity.

There is also high demand for what I call "qualified" data engineers, by which I mean people who have actually worked on projects and who have some background in distributed systems, as well as for software engineers -- those who have learned specific data skills and then can move over to a data engineer position.

To fill a data scientist position, someone in BI (maybe a data analyst) can use a skills progression. They can take their statistics or math background to a new level by adding better programming and SQL skills, then start doing more interesting analysis and eventually move up to machine learning.

If a company isn't sure whether it needs a data scientist or a data engineer or both, which should come first?

I've seen this happen -- companies hire a data scientist because it's touted as the next big thing and they believe it will turn around the business unit or organization. The problem is that a data scientist usually doesn't have the technical skills to create that data pipeline I mentioned earlier -- in essence, to lay the groundwork for their work. Therefore, I suggest that companies get the data engineer, build the data engineering team, and then start filling out the data science side. Sometimes they'll do it in unison, but it isn't a one-or-the-other approach. You have to do both.

What skills do data engineers, and data scientists in particular, need that aren't taught in business school or as part of a technology degree?

There's no school that can really teach big data skills. Coming out of school with an MBA doesn't prepare you to be a data scientist or a data engineer per se. That isn't to say that people with MBAs can't do it; it's just that school doesn't train them for it. One key issue is that schools -- even technical schools -- aren't turning out people with distributed systems backgrounds. Instead, most universities are focused on general-purpose technical educations, which are more geared toward Web development, mobile development, and backend systems development.

Distributed systems development is much more difficult and isn't generally taught, although there are some schools that are starting to offer classes and even bachelor's degrees in it.

I often find that people who are fresh out of school and end up on a data engineering team have a master's with some sort of distributed systems focus. Those with a bachelor's degree, I find, are usually mid- to senior-level in their careers, not just out of school. At some point, they've done some training in distributed systems and gone from there.

What about some of the other skills that a technical MBA degree might impart? Does anything transfer over to being a data scientist?

Someone with an MBA usually isn't a data scientist. That isn't to say that it hasn't been done; it's just that 99 percent of data scientists aren't MBAs. An MBA might be a team manager, but he or she wouldn't have the title of data scientist.

Speaking of managing data engineering teams, a core principle is that managing a data team is different from managing a software engineering team. That idea needs to be internalized. Otherwise, projects fail because management doesn't realize that big data is different from small data, so they'll use a team that hasn't been given the specific big data skills they need, or the extra time and resources. Then they wonder why the project fails.

The management teams I've mentored and taught understand the difference. They have a level of empathy for their data engineers and data scientists, a deeper understanding of how these projects succeed, and why you need to treat them differently.

I've written a book on this called Data Engineering Teams. It's for managers and it explains the skills that should be on the team and how to run it. Earlier, we talked about multidisciplinary teams. The group manager, the person with the MBA or the means to understand the multidisciplinary nature of the big data team, needs to understand all the skills that need to be on the data engineering team. Otherwise, you run a high risk of failure.

You sent us an early draft of your new book, The Ultimate Guide to Switching Careers to Big Data -- Upgrading Your Skills for the Big Data Revolution. In it, you say, "Of all the careers, DBAs are facing the biggest crises from big data. They're faced with a changing landscape of technologies. Data used to be their purview and now they're finding a brand new team and title emerging, data engineering team and data engineer." Why are database administrators in trouble and what should they be doing to save themselves?

That paragraph describes an interesting issue I see happening. When I first started teaching developer classes, lots of DBAs would attend. These were data warehouse, SQL-focused team members coming to developer classes to learn Java and understand programming concepts.

At some point, I started wondering why so many DBAs were taking a developer class. Turns out, it was because there was no other way for them to increase their skill level and learn some programming. They saw the need and didn't see any other way to meet it. That's partly where this crisis is rooted -- in the fact that working with big data sets a much higher technical bar than managing a data warehouse. The average DBA is faced with the difficult challenge of leveling up to meet the different skills needed to be on big data teams.

Some skills do transfer over but a good portion come from learning how to program -- and program well -- in order to create new systems. In addition, although a data warehousing team is generally using a relational database, data engineering teams are using 10 to 30 different technologies, bringing them together, and making them work together. As a result, the DBAs that come through my classes are often facing a quandary. With all these new technologies and skill sets, they are faced with the difficult proposition of having to learn to program and to learn these new technologies.

Working with big data isn't about a relational database with a little spin -- there is actually very little comparison with previous technologies. It's going to be difficult to move to a new level. For example, most DBAs have heard of NoSQL and may understand it at some level, but DBA teams that work with NoSQL are apt to fail because they don't have the deep technical background to work with it correctly.

What strategy should be followed by an IT person interested in becoming a data scientist -- say, a DBA or data warehouse architect? It sounds like learning programming is a big part of that.

Yes. You don't have to be an expert programmer, but a data engineer has to know technologies such as SQL [to a greater depth] than a data scientist needs to know the math. That's one of my big concerns [in working with clients].

People ask me, Should I be a data scientist? Should I be a data engineer? How well do I need to know these technologies in order to succeed? For a business intelligence person or an analyst, the technical bar is lower, but it's not so low that you don't need to know those technologies at all. The bottom line is, you're going to need to learn the programming in order to be successful.
