Using Data Science to Analyze Data Scientists’ Salaries
O’Reilly Media has released an interesting new data science salary survey. The authors actually used data science to analyze the survey data and even provided information about their model for further exploration.
- By Lindsay Stares
- October 4, 2016
O’Reilly Media has released a new salary survey, “2016 Data Science Salary Survey: Tools, Trends, What Pays (and What Doesn't) for Data Professionals.”
Most organizations would just report the numbers and provide a few simple correlations or charts. O’Reilly used data science algorithms to analyze these data science salaries and even provided information about their model at the end for further exploration.
The survey represents over 900 respondents from 45 countries and 45 U.S. states who work across a variety of industries. All of them are data professionals and almost half (45 percent) have the job title “data scientist.”
In the analysis, the authors attempt to tease out specific factors that distinguish higher and lower salary levels in the data science space.
They report that about half the variance in salary can be attributed to location and experience. The highest median salaries in the U.S. are found in California, the lowest in the Southwest. Salaries rise significantly with both experience and age before dropping at 17-20 years of experience and age 60+.
Job titles are a better predictor of salary level than tasks are, although specific tasks do have an effect. For example, those who spent significant time developing prototype models made $7,000 more than their peers on average.
Gender is a factor; holding other variables constant, the analysis found that women still make less than men.
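To make "holding other variables constant" concrete, here is a minimal sketch of a least-squares fit with two predictors. It is not the report's model. The data is synthetic, the salary figures and the 8,000 gap are arbitrary numbers chosen for illustration, and the helper names (`solve_linear_system`, `fit_ols`) are hypothetical. The point is only that a raw difference between two groups can differ from the difference that remains once experience is accounted for.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations.

    rows holds the predictor values for each respondent; an intercept
    column is added here. Returns [intercept, coef_1, coef_2, ...].
    """
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# SYNTHETIC data (not from the survey): group 1 is less experienced on average.
experience_group0 = [3, 5, 8, 12, 15, 18]
experience_group1 = [1, 2, 3, 5, 8, 10]
rows = [(e, 0) for e in experience_group0] + [(e, 1) for e in experience_group1]
# Arbitrary illustrative rule: base 60,000, +3,000 per year, -8,000 for group 1.
salaries = [60000 + 3000 * e - 8000 * g for e, g in rows]


def mean(values):
    return sum(values) / len(values)


# Raw gap mixes the group effect with the experience difference.
raw_gap = (mean([s for s, (_, g) in zip(salaries, rows) if g == 1])
           - mean([s for s, (_, g) in zip(salaries, rows) if g == 0]))

# The fitted group coefficient is the gap at equal experience.
intercept, per_year, adjusted_gap = fit_ols(rows, salaries)

print(f"raw gap: {raw_gap:.0f}")
print(f"gap holding experience constant: {adjusted_gap:.0f}")
```

In this made-up data, the raw gap is larger than the adjusted one because the two groups also differ in experience. A real analysis would need many more control variables and would carry uncertainty that this sketch ignores.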
Data Outliers May Be Managers or Executives
Despite lower salaries for the group with 17-20 years of experience, the small fraction of respondents (2 percent) with over 20 years of experience had the highest median salaries. Similarly, only 5 percent of respondents spent more than 20 hours a week in meetings, but they had higher salaries than those who spent less time in meetings.
It seems likely that both of these groups are made up largely of people in leadership or management positions, but the report does not confirm this.
It does, however, confirm that respondents with titles related to senior management (C-suite, directors, vice presidents, etc.) had by far the highest median salaries. Tasks related to management roles were correlated with higher salaries as well.
Clustering: A New Way to Define Data Science Roles
The report’s authors used clustering algorithms to sort the respondents for further analysis. The four clusters they found can each be defined by a combination of tools and tasks:
1: Analysts and developers who don’t use many tools. Some in this category might be better defined as programmers rather than data scientists. These respondents had the lowest median salaries.
2: Those who use primarily Microsoft products and other proprietary tools. This group represents mostly analysts who do some data science. Those in this group are more likely to perform tasks that don’t require coding knowledge, such as developing dashboards.
3: Those who do a great deal of coding and use primarily Python tools. This group is, on average, younger and less experienced and they don’t spend much time on tasks that could be considered managerial.
4: Those who use the widest array of tools, including significantly more open source tools. Respondents in this group had the highest median salaries, and they are also more likely to work with ETL and data management systems.
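As a rough illustration of how respondents can be sorted into groups by their tools and tasks, here is a minimal k-means sketch. The article does not say which clustering algorithm the report's authors used, so k-means is an assumption, as are the three features (count of proprietary tools, count of open source tools, weekly coding hours) and every data point, all of which are invented for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its members, until labels stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# SYNTHETIC respondents (not survey data):
# (proprietary tools used, open source tools used, coding hours per week)
respondents = [
    (1, 1, 10), (1, 0, 12),    # few tools overall
    (6, 1, 3), (7, 0, 2),      # mostly proprietary tools, little coding
    (0, 5, 30), (1, 6, 28),    # heavy coding, open source tools
    (5, 12, 15), (6, 14, 18),  # widest array of tools
]

# Seeding with one hand-picked respondent per group keeps the run deterministic.
seeds = [respondents[0], respondents[2], respondents[4], respondents[6]]
labels, centroids = kmeans(respondents, seeds)

for point, label in zip(respondents, labels):
    print(point, "-> cluster", label)
```

With real survey responses the features would need scaling, the number of clusters would have to be chosen rather than assumed, and the resulting groups would rarely separate this cleanly.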
We’ve seen suggestions for how to talk about different types of data scientists or roles within data science. This four-way split is interesting because it distinguishes between those who use many versus just a few tools, those who use proprietary versus open source tools, and respondents who spend more or less time developing models and coding.
Regression Model for Further Study
The authors acknowledge that their model accounts for only three-quarters of the variance in salary; the remaining quarter comes down to individual factors.
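The share of variance a model accounts for is usually reported as R², the coefficient of determination. The sketch below shows the calculation on made-up numbers chosen so the result comes out to three-quarters. The salaries and predictions are synthetic, in thousands of dollars, and the function name `r_squared` is a hypothetical helper rather than anything from the report.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual variation / total variation)."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# SYNTHETIC values in thousands of dollars (not survey data).
actual_salaries = [40, 60, 100, 120]
model_predictions = [60, 50, 110, 100]

share_explained = r_squared(actual_salaries, model_predictions)
print(f"share of variance accounted for: {share_explained:.2f}")
print(f"share left to other factors: {1 - share_explained:.2f}")
```

Here the total squared deviation from the mean salary is 4,000 and the squared prediction error is 1,000, so the model accounts for 75 percent of the variation and leaves 25 percent to other factors.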
The full ebook is available on O’Reilly’s website (registration is required), and it contains many more interesting insights. At the end of the report, you’ll find the authors’ regression model so you can compare their salary data with your own.
Lindsay Stares is a production editor at TDWI. You can contact her at firstname.lastname@example.org.