Tomorrow's Heroes: Developing the Data Curation Role
As data becomes ever more valuable, so does the role of the data curator.
- By Brian J. Dooley
- January 10, 2019
The data curation role has emerged from obscurity since the beginning of big data. It is now poised to enter a new level of significance in the AI/machine learning (ML) era.
Data curation is the interface between data engineering and the data user in which business needs are matched to data sets. Chief tasks of this role include addition of metadata, verification of sources, ensuring data fitness to task, ensuring availability and accuracy with respect to business needs, ensuring reusability, providing ongoing oversight of data through its life cycle, and handling specific business-related issues such as regulatory compliance.
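To make the tasks above concrete, here is a minimal sketch of how a curator might record metadata for one data set and flag gaps. The article does not prescribe any tooling: the `CurationRecord` fields, the `curation_gaps` helper, and the one-year review window are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CurationRecord:
    """Hypothetical metadata a curator might keep for one data set."""
    name: str
    source: str
    source_verified: bool = False          # has the source been checked?
    description: str = ""                  # descriptive metadata
    approved_uses: List[str] = field(default_factory=list)  # fitness to task
    regulations: List[str] = field(default_factory=list)    # compliance notes
    last_reviewed: Optional[date] = None   # life cycle oversight


def curation_gaps(record: CurationRecord, today: date,
                  max_age_days: int = 365) -> List[str]:
    """Return a list of open curation issues (assumed rules, not a standard)."""
    gaps = []
    if not record.description:
        gaps.append("missing description metadata")
    if not record.source_verified:
        gaps.append("source not verified")
    if not record.approved_uses:
        gaps.append("no approved business uses recorded")
    if record.last_reviewed is None:
        gaps.append("never reviewed")
    elif (today - record.last_reviewed).days > max_age_days:
        gaps.append("review overdue")
    return gaps


# Example: a documented, verified data set whose last review is over a year old.
record = CurationRecord(
    name="customer_orders",
    source="orders database export",
    source_verified=True,
    description="One row per customer order",
    approved_uses=["churn analysis"],
    last_reviewed=date(2018, 1, 5),
)
gaps = curation_gaps(record, today=date(2019, 1, 10))
print(gaps)
```

In practice the checks would depend on the business and its industry; the point is only that each curation task can be tracked as an explicit, reviewable attribute of the data set.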
Requirements of this role change according to the specific situation of the business and its industry. Although data curation overlaps with data engineering and master data management, it differs by focusing on the interface between business needs and data sources.
Data curation is becoming important because data is becoming important. Good decisions need data that is correct, reusable, and meets the needs of the business. With increased scrutiny of transparency and regulatory compliance in decision making, curation will continue to grow in significance.
Needs of a New Regime
The complexities of data management have exploded under big data as unorganized data lakes filled with unstructured data have proliferated. The number of technologies involved in analytics infrastructure has also grown, adding complexity as differing database systems and storage architectures such as Hadoop, NoSQL, and the cloud increase in importance. Complexity is driving the need for new roles to bridge the gap between technical requirements and business needs.
On the business side, data curation is evolving to ensure that data collection and storage continue to meet real needs as the underlying technology changes. Self-service analytics makes curation more urgent. Citizen data scientists understand enough about data to perform analysis with self-service tools, but they are not proficient in establishing procedures for ensuring optimal fit between data sources and queries. Data engineers understand the mechanics of the data system but not necessarily the nuances of the data sets within the business environment. An intermediary role ensures that the requirements of both business and technology are adequately served.
Although data curation has long been a part of scientific and medical research, big data has brought it into the mainstream of business analytics. However, even as big data has made curation more important, it has also made it more difficult. Enormous volumes of data must now be processed, going well beyond the realm of simple automation. The curator's task of adding value to data is on the verge of further evolution.
Something New in Curation
Data curation is often laborious, demanding special skills and a deep knowledge of data characteristics along with domain knowledge. The tasks are not scalable for an individual or a small team without significant aid. An increasing amount of automation is now being applied, along with crowdsourcing (depending upon the type of data) and ML. ML can find patterns, apply metadata, and help normalize data (among other tasks) but it is only part of a larger puzzle.
New platforms are emerging to automate some work and integrate with data management, but data curation still requires individuals to verify that the process itself is carried out correctly. Individuals must also confirm that guidelines and processes are in place so that curation continues to meet both the requirements of the business and the needs of its citizen data scientists, data scientists, and stakeholders.
As with many other tasks now being automated due to complexity and volume, a truly autonomous data curation solution is unlikely to appear soon. There are startups offering automated and semi-automated solutions, but many issues remain unresolved. The terrain is constantly shifting, as can be seen in the continued development of analytics centers of excellence. Data curation fits naturally into the center, but curation requirements shift with changes in the types of data collected and their purpose, as well as with technological advancements and changes in corporate culture. New concepts of curation provide a few more spokes in the wheel for analytics and data science.
Changes are needed because analytics is approaching a new level of maturity. As data becomes increasingly valuable, it is clear that a finer understanding of quality and fitness to purpose is needed.
About the Author
Brian J. Dooley is an author, analyst, and journalist with more than 30 years' experience in analyzing and writing about trends in IT. He has written six books, numerous user manuals, hundreds of reports, and more than 1,000 magazine features. You can contact the author at [email protected].