The Growing Role of Data Lineage in Modern Data Management
Determining data’s origins and transformations as a way of understanding its business value, trustworthiness, quality, and applicability for specific use cases.
- By Philip Russom
- November 16, 2020
End consumers of BI reports, analyses, data sets, and other data-driven products have questions that they regularly ask: Where did the data come from? How has it been aggregated and transformed? Who has used it? What is the quality of this data? How trustworthy is it?
If users do not receive credible answers, they will not trust and consume the data and BI products. Hence, it behooves data management and analytics professionals to put data lineage solutions in place that can accurately answer these and other questions about data origins, history, transformations, use, condition, and trustworthiness. You need data lineage tool functionality so that accurate information about data is quickly available when users, developers, auditors, governors, and managers question you.
Data Lineage Information Answers Questions About Data
Data lineage records the journey data takes as it moves from original sources, gets repurposed (via aggregation and transformation), and goes into BI and analytics products. Data lineage may also record various attributes of data by cataloging data by domains, subjects, and other categories either manually or automatically. Other attributes include the condition of the data’s quality, metadata, models, plus rankings contributed by users about its trustworthiness and usability.
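As a rough illustration of what such a record might hold, the following sketch models one asset's journey as a list of steps plus a set of descriptive attributes. All class, field, and asset names here are hypothetical and are not drawn from any particular lineage tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LineageStep:
    """One hop in a data asset's journey (illustrative fields only)."""
    source: str     # where the data was read from
    operation: str  # e.g. "filter", "aggregate", "publish"
    target: str     # where the result was written


@dataclass
class LineageRecord:
    """The recorded journey and cataloged attributes of one data asset."""
    asset: str
    steps: List[LineageStep] = field(default_factory=list)
    attributes: Dict[str, str] = field(default_factory=dict)

    def add_step(self, source: str, operation: str, target: str) -> None:
        self.steps.append(LineageStep(source, operation, target))

    def origins(self) -> List[str]:
        """Sources that no recorded step produces, i.e. the original sources."""
        targets = {step.target for step in self.steps}
        return sorted({s.source for s in self.steps if s.source not in targets})


# Hypothetical journey from a source system to a BI report.
record = LineageRecord(
    asset="monthly_sales_report",
    attributes={"domain": "sales", "quality": "validated"},
)
record.add_step("crm.orders", "filter", "staging.orders_clean")
record.add_step("staging.orders_clean", "aggregate", "warehouse.sales_by_month")
record.add_step("warehouse.sales_by_month", "publish", "monthly_sales_report")

print(record.origins())
```

With a record like this, the question "Where did the data come from?" becomes a lookup rather than an investigation.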
In addition to tracking individual data flows, data lineage unifies them so you can draw a comprehensive data map that many types of users and applications can access. When data lineage functionality is fully automated, it can autonomously track, record, and catalog data with little or no human intervention, thereby boosting developer productivity and assuring an up-to-date map of data across an enterprise.
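One simple way to picture this unification is as the merging of separately tracked flows into a single graph. The sketch below assumes each flow is a list of (source, target) pairs; the flow contents and function names are invented for illustration.

```python
from collections import defaultdict

# Two hypothetical flows, each tracked on its own as (source, target) pairs.
flows = [
    [("crm.orders", "staging.orders"), ("staging.orders", "warehouse.sales")],
    [("erp.invoices", "warehouse.sales"), ("warehouse.sales", "report.revenue")],
]


def build_data_map(flows):
    """Merge individual flows into one map: target -> set of direct sources."""
    parents = defaultdict(set)
    for flow in flows:
        for source, target in flow:
            parents[target].add(source)
    return parents


def upstream(data_map, asset):
    """Return every asset that feeds `asset`, directly or indirectly."""
    seen = set()
    stack = [asset]
    while stack:
        node = stack.pop()
        for parent in data_map.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


data_map = build_data_map(flows)
print(sorted(upstream(data_map, "report.revenue")))
# ['crm.orders', 'erp.invoices', 'staging.orders', 'warehouse.sales']
```

Neither flow alone shows that the revenue report depends on both the CRM and the ERP system; the merged map does.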
Ramifications of Missing or Limited Data Lineage Answers
If users do not receive credible answers to lineage questions, they will not trust the data. They will not use the data or the BI products it is delivered in, such as reports, analyses, and data sets. The resulting low adoption of data and BI products can be construed as a failure on your part as a data or analytics professional. In some cases, this can be a career-limiting failure.
Some users will react to data they do not trust by creating their own data sets. For example, rogue data marts and other low-quality, contradictory data silos can be a consequence of poor data lineage information.
Without broad data lineage information, many tasks are slow and inaccurate. For example, developers can take too long to produce a solution because their time is burned up reconstructing data lineage on the fly. Many users select inappropriate data sources when information about data is limited. In particular, mildly technical self-service users need all the guidance they can get, and data lineage information can help them find and use data with greater ease and control.
Finally, one of the worst ramifications of missing data lineage concerns audits required by regulatory auditors, financial institutions, your own management, accountants, or partnering firms. Auditors get suspicious and the audit process takes longer than necessary when data is poorly documented or the documentation lacks credibility.
Business Use Cases for Data Lineage
Improving the information about your data via data lineage practices and tool functions can lead to greater business value from analytics investments. This is especially true of solutions for BI, reporting, advanced analytics, data warehousing, and data integration, because BI products become more accurate, targeted, and trusted, so they are used more often and with better business outcomes.
Credible data lineage information can also contribute to improvements in the governance, stewardship, curation, and compliant use of data. For example, lineage can locate, tag, and track personally identifiable information (PII), which helps to avoid unauthorized or noncompliant uses of it. In addition, when lineage information also tracks users and usage, it can reveal and help remediate noncompliant uses of sensitive data. In case of an audit, data lineage with extended tracking enables you to prove compliance credibly, so the audit goes faster and with fewer business ramifications.
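The PII scenario can be sketched as tag propagation along lineage edges, followed by a check of a usage log. This is a simplified illustration: the asset names, users, and authorization list are hypothetical, and it conservatively assumes that anything derived from a PII source still contains PII, whereas a real tool might track this at the column level.

```python
from collections import deque

# Hypothetical lineage edges: source -> direct downstream assets.
edges = {
    "crm.customers": ["staging.customers"],
    "staging.customers": ["warehouse.customer_dim"],
    "warehouse.customer_dim": ["report.churn"],
    "erp.products": ["warehouse.product_dim"],
}
pii_sources = {"crm.customers"}  # assets tagged as PII at the source
usage_log = [                    # (user, asset) pairs, invented for the example
    ("alice", "report.churn"),
    ("bob", "warehouse.product_dim"),
    ("carol", "warehouse.customer_dim"),
]
authorized_for_pii = {"alice"}


def propagate_pii(edges, pii_sources):
    """Tag every asset reachable from a PII source (conservative assumption)."""
    tagged = set(pii_sources)
    queue = deque(pii_sources)
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in tagged:
                tagged.add(child)
                queue.append(child)
    return tagged


def unauthorized_uses(usage_log, pii_assets, authorized):
    """List accesses to PII-tagged assets by users not cleared for PII."""
    return [(user, asset) for user, asset in usage_log
            if asset in pii_assets and user not in authorized]


pii_assets = propagate_pii(edges, pii_sources)
violations = unauthorized_uses(usage_log, pii_assets, authorized_for_pii)
print(violations)
```

Here the only flagged access is the one to the customer dimension by a user who is not on the authorized list; the product dimension is untouched because it has no PII ancestry.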
Technology Use Cases for Data Lineage
The additional information about data provided via data lineage can guide developers as they look for data and profile it thoroughly when building data-driven applications. Lineage information also enables quick data discovery for rapid-prototyping and agile development methods. In these cases, data developers receive a productivity boost.
Data lineage information also boosts the reverse engineering of data sets, integration solutions, and other BI products. From data lineage information, a developer gets a quick understanding of a BI product built by someone else, which assists in maintenance and updating.
Finally, data lineage information (similar to detailed metadata) can contribute to multiple data disciplines, such as quality programs, migrations, and integration. Lineage information can reveal redundant data sets that need merging, or unused data that needs to be archived. It also contributes to time-consuming tasks such as information life cycle management (ILM) and database administration.
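To make the redundancy and archiving point concrete, here is a minimal sketch that assumes the lineage catalog records each data set's sources, its transformation, and a recent read count. The catalog layout and the 90-day window are assumptions made for the example, not features of any specific product.

```python
from collections import defaultdict

# Hypothetical catalog entries derived from lineage and usage tracking.
catalog = {
    "mart.sales_eu": {
        "sources": ("warehouse.sales",),
        "transform": "filter region=EU",
        "reads_last_90d": 120,
    },
    "mart.sales_eu_2": {
        "sources": ("warehouse.sales",),
        "transform": "filter region=EU",
        "reads_last_90d": 3,
    },
    "mart.legacy_kpi": {
        "sources": ("warehouse.kpi",),
        "transform": "aggregate",
        "reads_last_90d": 0,
    },
}


def redundant_groups(catalog):
    """Group data sets built from the same sources by the same transformation."""
    groups = defaultdict(list)
    for name, info in catalog.items():
        key = (tuple(sorted(info["sources"])), info["transform"])
        groups[key].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def unused(catalog):
    """Data sets with no recorded reads in the tracking window."""
    return sorted(name for name, info in catalog.items()
                  if info["reads_last_90d"] == 0)


print(redundant_groups(catalog))  # candidates for merging
print(unused(catalog))            # candidates for archiving
```

Matching on identical lineage is only a first filter; a person would still need to confirm that the flagged data sets are truly interchangeable before merging them.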
The Role of Automation in Modern Data Lineage
Advanced tools for data lineage scan data automatically regardless of data locations, structures, and other traits. The scanning results in a broad “data map” that represents all known data, along with information about the source, transforms, and uses of most data assets. The fact that a data map is generated automatically -- and is likewise maintained automatically as the data ecosystem evolves -- saves hours of developer and administrator time. This accelerates projects and reduces payroll costs.
The lineage-driven, automated data map also enables further productivity gains. This includes fast reverse engineering for old data-driven solutions, easy source-to-target mapping for new solutions, and impact analysis to understand the ramifications of changing data before making the changes.
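Impact analysis, for instance, amounts to walking the data map downstream from the asset you plan to change. The following sketch uses an invented set of lineage edges and reports how many hops away each affected asset is.

```python
from collections import deque

# Hypothetical lineage edges: source -> direct downstream assets.
edges = {
    "crm.orders": ["staging.orders"],
    "staging.orders": ["warehouse.sales", "warehouse.returns"],
    "warehouse.sales": ["report.revenue", "dashboard.exec"],
    "warehouse.returns": ["report.returns"],
}


def impact_analysis(edges, changed):
    """Return {asset: hops} for everything downstream of `changed`."""
    hops = {}
    queue = deque([(changed, 0)])
    while queue:
        node, distance = queue.popleft()
        for child in edges.get(node, []):
            if child not in hops:
                hops[child] = distance + 1
                queue.append((child, distance + 1))
    return hops


impacted = impact_analysis(edges, "staging.orders")
for asset, distance in sorted(impacted.items(), key=lambda item: (item[1], item[0])):
    print(f"{distance} hop(s): {asset}")
```

Running the traversal before a schema change gives a list of warehouse tables, reports, and dashboards to review, ordered by how directly they depend on the changed asset.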
The Big Picture Seen Via a Data Lineage Map
A data lineage map provides a view into all available data assets. Therefore, the lineage map can be used as an inventory of data to be queried, browsed, searched, and governed.
The map provides semantics for very broad but controlled data exploration, which is key to discovery-oriented analytics (mining, statistics, machine learning) and self-service data practices (self-service data prep and visualization).
Depending on how data lineage services are extended, the data lineage map may also record data structures, interfaces, and dependencies. This information is invaluable to data architects and system optimization experts who are hard pressed to understand -- much less improve and design -- large, multiplatform data architectures and systems architectures. The lineage data map is especially valuable when it covers data in all on-premises and cloud systems. Architecture aside, the “big picture” visualized by a data lineage map can facilitate the design and execution of data platform administration, data migrations, data consolidations, multiplatform process optimization, and enterprisewide data governance.
For Further Learning
Replay the 2020 TDWI webinar this article draws from: “Why Data Lineage is a Must-Have for BI,” available online.