Your Data Scientist IQ
        
        What are data scientists, what do they do, and what skills do they have?
        
        By Deanne Larson, President, Larson & Associates
[Editor's note: Dr. Larson will be conducting sessions on data mining with R, project management for BI, and data integration principles and practices at the TDWI Conference in Las Vegas (February 22-27, 2015).]  
Ask anyone branding himself or herself as a data scientist, "What is data science and what does a data scientist do?" and you are likely to get a variety of answers. Technically, any claim of more than five years of data science experience is questionable, because data science is still evolving and leading experts are still debating the precise definition of their craft. At a recent lecture on machine and statistical learning at a leading university, the professors, well-known statisticians all, explained that they now refer to themselves as data scientists.
What, then, are data scientists, what do they do, and what skills do they have?
The official definitions of data science and data scientist continue to be debated, but one thing is certain: the need for data science skills is growing. Initially, job descriptions for data scientist positions included the requirement of a Ph.D. in statistics. According to the 2013 Survey of Earned Doctorates, only 1% of new Ph.D.s (approximately 500 of all doctorates earned yearly) are in statistics. A recent search of job boards worldwide returned about 50,000 open positions that included the keywords "data science." Demand for data scientists is clearly greater than the supply, and it doesn't help that the definition of a data scientist keeps changing.
One thing is clear about the majority of job descriptions for data scientists: they expect candidates to be experts in programming, data wrangling, communication, statistics, and data visualization, and to have extensive domain experience as well. Simply put, a person fitting this job profile rarely exists. Harlan Harris, a co-author of Analyzing the Analyzers, took an interesting approach: the study applied a data science technique (clustering) to survey responses from several hundred data science practitioners to build a profile of the field. The results appear to support the job description; however, the amount of expertise in each skill varied with the practitioner's primary role.
The skills listed included domain expertise, machine learning, big data, mathematics, operations research, programming, and statistics. The roles listed (the primary job focus of the data science practitioners surveyed) included domain expert, data developer, data researcher, and data creative. The data creative role is a catch-all for practitioners who considered themselves a jack-of-all-trades, an artist, or a hacker. The convergence between the skills employers ask for and the skills practitioners actually use indicates that the profile of a data scientist is becoming clearer.
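As a rough illustration of how clustering can group practitioners by skill profile, here is a minimal sketch in Python. It is not the method used in Harris' study: it swaps in a plain k-means procedure, and the skill ratings, the three skill dimensions chosen, and the function name `kmeans` are all invented for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means. The first k points seed the centroids, so runs are deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical self-ratings (0-5) per practitioner:
# (programming, statistics, domain expertise)
ratings = [
    (5, 2, 1),
    (1, 5, 2),
    (2, 1, 5),
    (4, 1, 2),
    (2, 4, 1),
    (1, 2, 4),
]

labels, centroids = kmeans(ratings, k=3)
for person, label in zip(ratings, labels):
    print(person, "-> group", label)
```

With these made-up ratings, each group collects the practitioners who are strongest in the same skill, which is the intuition behind profiling roles such as data developer or data researcher from survey answers. A real analysis would need far more respondents, more skill dimensions, and a more careful choice of algorithm and number of groups.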
One thing all data scientists have in common is an interest in figuring out ways to solve important problems with data. Important problems are not always glamorous ones such as accurately predicting presidential election results or changing the game of baseball as in the movie Moneyball. Important problems are often of social value, and people with backgrounds in sociology, journalism, biomedicine, and social welfare are moving to data science to address them.
Who are the real data scientists? They are problem solvers from many backgrounds and industries, who work with computational problems and large data sets.  They deal with the challenges of data structure, quality, size, and complexity.
The results of Harris' study confirm that data scientists are a varied group who nevertheless share a common set of skills: using statistical algorithms and technical skills to solve problems with data. For the sake of simplicity, programming, data wrangling, machine learning, and big data skills are grouped together here as technical skills. Data wrangling is a catch-all phrase for the work of preparing data for processing and, in some cases, for training statistical models.
Much of the rise of data science has been attributed to the technologies that have made working with larger data sets easier. As a result, most people assume big data and data science go hand in hand. Another interesting result of Harris' survey is that most of the data science practitioners surveyed indicated that they rarely used data sets larger than a terabyte. This held even for the group of practitioners with deep skills in machine learning and big data.
With the diverse skill set needed for data science, it is not uncommon for organizations to develop data science teams. These teams focus on developing the core competencies required of a data scientist. A data science team may include engineers, scientists, big data developers, statisticians, and analysts who solve problems collaboratively. These teams often report to a chief data scientist who is responsible for setting the company's data strategy. Data science teams require a significant financial investment, so the demand for a few key data science resources who wear many hats will remain.
Where should you focus your efforts to create a base skill set for a data scientist? The most important skill in data science is the ability to bridge the gap between identifying the problem and solving the problem with data. Technical resources can program and wrangle data but may not be able to choose the most appropriate model, which in turn determines what to code and how. Business-domain experts can identify the problem and possible data sources but may not be able to acquire the data or, if they do, determine what to do next. Statisticians understand how models can be applied but may not have the technical or domain experience to take the problem-solving process further. An understanding of classical statistics, including regression modeling, and of machine learning algorithms (such as classification, clustering, and decision trees) can go far in effective problem solving with data, which essentially is what data science is.
   
Dr. Larson is an active practitioner and academic focusing on BI and data warehousing with over 20 years of experience. Dr. Larson completed her doctorate in management in information technology leadership; her doctoral dissertation was a grounded theory qualitative study on establishing enterprise data strategy. You can contact the author at [email protected].