How Enterprises are Overcoming the Data Scientist Scarcity
What can your organization do to overcome the lack of data scientists?
By Mehul Shah, Senior Manager, Capital One and Vibha Dhawan, Program Manager and Solutions Architect, CGI Consulting
Big data analytics has gone mainstream; organizations in a variety of industries realize the necessity of leveraging it to stay ahead of nimble competitors. Data has become an asset that many companies are using in innovative ways.
Data by itself holds only potential, untapped value until organizations discover and act upon it, which is why data scientists -- who can singlehandedly turn the data (raw material) into actionable insights -- are so important. As companies embark on their big data and analytics journey, their need for data scientists becomes critical. It's no wonder that the data scientist has become pivotal to the success of big data initiatives, or that the role has been called the sexiest profession of the 21st century.
Great Expectations
There are three key dimensions required of a true data scientist.
- Business- or subject-matter-specific domain knowledge and soft skills such as making presentations, communicating analytics results, and assisting in decision making
- Technical expertise in big data tools and in coding (SQL, NoSQL, or scripting-based) for data integration, transformation, and aggregation
- Knowledge of decision sciences, including statistics, mathematics, machine learning, and data mining techniques
Mastering all three areas is an expectation that very few people can live up to. Most individuals master at most two of the three dimensions, which leaves a gap on the path to data-driven success. Enterprises have high expectations for turning data into profits (directly impacting the bottom line) or positive outcomes (indirectly impacting the bottom line).
Combining Jobs: Data Analyst, Data Engineer, and Statistician
Companies are making huge investments in big data-based platforms (such as Hadoop), but these tools require intelligent human analysis to leverage them effectively. Given the scarcity of data scientists, enterprises are looking at how to leverage the existing skills of their staff to fill the data scientist gap without compromising quality.
Organizations have combined three distinct job families that can collaborate to create a true data-driven organization.
Data analysts have sound business knowledge, subject matter expertise acquired over many years, and talent in presenting the results of analysis to management for faster and more effective decision making. They are also usually good with SQL/Excel and some basic tools. With some additional training in how to leverage big data tools, they can readily discover the trends hidden in an organization's data. The primary skills of a data analyst are business domain knowledge and the ability to present and communicate results for decision making; their secondary skill is knowledge of technology.
Data engineers have a technology and coding background and have excellent SQL, NoSQL, and scripting language (such as Python or Perl) skills. They also tend to have a good understanding of data modeling, profiling, and basic data architecture. Additionally, these individuals have a good understanding of business requirements and can translate them into data solutions. Their primary skill is technology knowledge; they also have strong business domain knowledge.
Statisticians have a strong background in mathematics and/or statistics and are good with various modeling techniques such as regression, logistic regression, association rule mining, hypothesis-driven decision making, Bayesian techniques, decision/classification trees, clustering, neural networks, and artificial intelligence-based techniques. They are adept at building, training, and tuning complex analytical models. They also have knowledge of basic SQL/Excel and dashboarding tools to perform standard data preparation. Their skill with statistics is their leading talent; they are also knowledgeable about machine learning and technology.
Data analysts can sit at the project's front end, defining the business intent and preparing and presenting the results of their analyses. Statisticians can build models and work with data analysts to tailor the models to fit specific business problems and opportunities. Data engineers can perform data cleansing, transformations, and munging. Data analysts and statisticians can rely on data engineers to prepare the required data and automate the data collection and aggregation process.
The combination of these three roles can aid in data-driven success and act as a true data scientist, thereby enabling organizations to realize the value of big data. Organizations of all sizes and in various industries can benefit from the ever-increasing promise of data gold as they continue their search for a data scientist.
The recommended approach -- one data scientist role fulfilled by three individuals -- is pragmatic and easily implemented without making significant changes within an organization. However, for a fast-growing start-up, it might still make sense to hire a good data scientist and pay top dollar for that resource. The superhero data scientist would also prefer to work for an innovative enterprise, but in a midsize or large established organization, these three skill sets already exist. Therefore, given the scale and size of some initiatives, it might be easier to have a team of two (or more) data analysts, one statistician, and a shared data engineer who can support two teams at once.
For large data initiatives, it makes sense to run them as a large data science program consisting of multiple teams of four or five resources (two or three data analysts, one or two statisticians, and one data engineer). The program can be broken into loosely interdependent initiatives, each having its own dedicated team. The senior data leader(s) can act as glue to ensure the teams are closely aligned and working towards a broader goal. In addition, teams can be arranged around a specific functional area (marketing, HR, finance, etc.) or a line of business (brand A, brand B, etc.).
Conclusion
By combining the three existing job families into one and dividing them among specific analytic initiatives, organizations of all shapes and sizes can benefit from the data-driven revolution. Organizations can create dedicated small teams that include people from all three job families. These small teams can be aligned with specific analytical projects that have a clear business mandate and agenda. By effectively leveraging the skill sets readily available internally or in the marketplace, organizations can jump-start their big data analytics initiatives and scale them appropriately.
Mehul Shah is a senior manager focusing on information management, data analytics, and governance for a Top Ten financial services company. He has over 14 years of experience managing and architecting large, complex enterprise-wide information management, data warehousing, data governance, data migration, customer MDM, business intelligence, Web analytics, and reporting application projects. Mehul has an MBA in marketing and analytics and an MS in computer science from the University of Maryland and is also a PMP- and CSM-certified practitioner. You can contact the author at [email protected] or visit his blog at http://mehulshah008.blogspot.com.
Vibha Dhawan is a technical manager with CGI, a large global consulting firm. She has over 14 years of experience working in the public and private sectors and has successfully led key IT projects for the Centers for Medicare & Medicaid Services, the Federal Trade Commission, the U.S. Department of Housing and Urban Development, and Marsh & McLennan Companies. Her latest engagement was driving the key project for the Affordable Care Act. Vibha has an MBA from Virginia Tech and a BS in computer science and is a PMP-certified practitioner. You can contact the author at [email protected].