Creating Self-Service Organizations with Data Catalogs
Data catalogs help you create a truly self-service organization by efficiently providing analysts with the means to find, understand, and trust their data.
- By Aaron Kalb
- March 28, 2017
In the BI world, "self-service" often means simply giving individual analysts the ability to define their own metrics and dimensions without involving IT. Although those analysts relish the freedom that comes from such disintermediation, it can result in a proliferation of reports and stats that actually make it harder for business users to self-serve because they don't know what they can trust.
For a careful analyst committed to accuracy, IT isn't the biggest bottleneck. The hardest part of the job is finding and understanding relevant and trustworthy data sets. The modern data catalog, powered by machine learning and designed for collaboration, has emerged to overcome these challenges, allowing analysts and business users to work quickly and correctly.
Find the Data
Finding the right data asset in a modern enterprise can be like trying to find a book in a massive library. In the 20th century, libraries used card catalogs to make it easier for book seekers to search by title, author, or category. In the digital age, Amazon and Google eclipsed their predecessors, largely by developing superior catalogs.
Much as Google indexes the Internet, the modern data catalog crawls, parses, and indexes all of an organization's data -- including information in BI tools, wikis, and usage logs -- to enable a single search function over a diverse array of data assets. Raw data elements can be annotated and tagged via both expert curation and machine learning.
For instance, algorithms can train on existing documentation to make educated guesses about the logical meanings of inscrutable field names full of abbreviations and acronyms. With such "translations" in place, natural language search terms can yield useful results (e.g., a search for "daily revenue" can find "dly_rvnu").
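As a rough illustration of how such translations support search, the sketch below expands abbreviated field names into plain words and matches them against a query. The abbreviation table, the function names, and the sample fields are all hypothetical; a real catalog would learn the mapping from existing documentation rather than rely on a hand-written dictionary.

```python
import re

# Hypothetical abbreviation map. In a real catalog this mapping would be
# learned from existing documentation, not written by hand.
ABBREVIATIONS = {
    "dly": "daily",
    "rvnu": "revenue",
    "cust": "customer",
    "acct": "account",
}


def expand_field_name(field_name):
    """Translate an abbreviated field name into a list of plain words."""
    tokens = re.split(r"[_\W]+", field_name.lower())
    return [ABBREVIATIONS.get(tok, tok) for tok in tokens if tok]


def search(query, field_names):
    """Return the field names whose expanded words contain every query word."""
    wanted = set(query.lower().split())
    return [
        name for name in field_names
        if wanted <= set(expand_field_name(name))
    ]


fields = ["dly_rvnu", "cust_acct_id", "rvnu_total"]
print(search("daily revenue", fields))  # prints ['dly_rvnu']
```

Here "rvnu_total" is not returned because it matches "revenue" but not "daily"; a production search would more likely return partial matches with a lower rank than drop them.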
In addition to identifying all relevant candidates, a modern catalog should -- like Google -- rank them so the most promising are near the top. Ranking by popularity -- a measure capturing recency and frequency of use -- can help data consumers identify the best assets based on the prior behavior of their peers. The result is an easy-to-use single source of reference for all of an organization's data assets with the context necessary to determine which are most applicable to the analytics question at hand.
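One simple way to capture both recency and frequency in a single popularity number is to let each past use count for less as it ages. The sketch below does this with exponential decay; the 30-day half-life, the function names, and the sample usage log are assumptions made for illustration, not a description of how any particular catalog computes popularity.

```python
from datetime import datetime, timedelta

# Assumed half-life: a use from 30 days ago counts half as much as one today.
HALF_LIFE_DAYS = 30.0


def popularity(access_times, now):
    """Sum of time-decayed uses: frequent and recent use both raise the score."""
    score = 0.0
    for accessed in access_times:
        age_days = (now - accessed).total_seconds() / 86400.0
        score += 0.5 ** (age_days / HALF_LIFE_DAYS)
    return score


def rank_by_popularity(usage_log, now):
    """Return asset names ordered from most to least popular."""
    return sorted(
        usage_log,
        key=lambda name: popularity(usage_log[name], now),
        reverse=True,
    )


now = datetime(2017, 3, 28)
# Hypothetical usage log: asset name -> timestamps of past queries.
usage_log = {
    "revenue_daily": [now - timedelta(days=d) for d in (1, 2, 3)],
    "revenue_legacy": [now - timedelta(days=d) for d in (300, 310, 320, 330, 340)],
    "revenue_scratch": [now - timedelta(days=60)],
}
print(rank_by_popularity(usage_log, now))
```

In this example "revenue_legacy" has the most recorded uses but ranks last, because all of them are roughly a year old.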
Understand the Data
For analysts, finding the data is only the first step. Understanding data requires rich context such as definitions and information on history and usage.
An analyst needs to understand the shape of the data set, where it came from, whether it is up to date, who else has used it, and how it was used. In aggregate, organic usage shows roughly how the data has been used historically. Data catalogs also show who in particular has used the data, helping analysts find the experts who know the data best (which can otherwise be quite challenging in organizations with hundreds of analysts).
Top user lists can also indicate meaning: if everyone listed is in a particular team (such as the finance, risk, or marketing department), that can be a helpful hint. Just as a shopper on Amazon considers factors such as star ratings, price, delivery date, pictures, and other users' purchasing patterns to select the right product, a data consumer should be given a 360-degree view of each data asset in a catalog.
Trust the Data
Finally, analysts face the challenge of determining whether they can trust the data -- whether it is accurate and can yield meaningful insights. Think about trying to find a restaurant offering tasty dishes. A restaurant may have "tasty" written all over its website, but you can't necessarily trust that description. Data has a similar issue -- just because a table or file is named "q3_results_final_final_final", that doesn't necessarily mean it's actually final. If anything, such suffixes should raise suspicion -- presumably the "final_final" version looked conclusive at some point.
Traditional data documentation systems limit contribution permissions to a small, trusted group. The result is more accurate documentation for a few data assets but far less breadth of coverage. This method is also slow, and the documentation often becomes stale. It doesn't suffice for a self-service environment.
Modern data catalogs draw on third-party information to verify whether the data can be trusted. They incorporate active signals such as analyst endorsements (similar to the star ratings on Yelp) and mine passive signals (much as Google's PageRank interprets hyperlinks as votes of confidence). Together these active and passive signals provide good indicators of data's trustworthiness.
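To make the idea of combining signals concrete, here is a minimal sketch that scores each asset with a weighted sum of one active signal (endorsements) and one passive signal (how many queries and reports reference it). The weights, field names, and sample numbers are assumptions chosen for illustration. This is deliberately much simpler than PageRank, which also weighs who is casting each vote.

```python
from dataclasses import dataclass

# Assumed weights: an explicit endorsement counts twice as much as a reference.
ACTIVE_WEIGHT = 2.0
PASSIVE_WEIGHT = 1.0


@dataclass
class AssetSignals:
    name: str
    endorsements: int        # active signal: analysts vouching for the asset
    inbound_references: int  # passive signal: queries and reports that use it


def trust_score(asset):
    """Weighted sum of the active and passive signals for one asset."""
    return (ACTIVE_WEIGHT * asset.endorsements
            + PASSIVE_WEIGHT * asset.inbound_references)


def rank_by_trust(assets):
    """Return asset names ordered from most to least trusted."""
    ordered = sorted(assets, key=trust_score, reverse=True)
    return [asset.name for asset in ordered]


# Hypothetical catalog entries.
assets = [
    AssetSignals("q3_results", endorsements=3, inbound_references=40),
    AssetSignals("q3_results_final_final_final", endorsements=0, inbound_references=5),
    AssetSignals("q3_results_audited", endorsements=10, inbound_references=2),
]
print(rank_by_trust(assets))
```

The reassuring file name contributes nothing to the score; only what other people have done with the asset does.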
Modern data catalogs also trace a data asset's lineage to allow analysts to determine its origins. If the data set underlies the CFO's quarterly earnings presentation, it is probably (hopefully!) trustworthy. Using the wisdom of the crowd, aided by machine learning to make recommendations, modern data catalogs overcome the limitations of traditional documentation systems.
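Lineage can be pictured as a graph in which each asset points to the assets it was built from. The sketch below walks such a graph to list everything upstream of an asset and to pick out its raw origins. The asset names and the dictionary layout are hypothetical; a real catalog would typically derive lineage from query logs and pipeline metadata rather than a hand-built table.

```python
# Hypothetical lineage table: each asset maps to the assets it is built from.
LINEAGE = {
    "cfo_earnings_deck": ["q3_results"],
    "q3_results": ["orders_raw", "refunds_raw"],
    "orders_raw": [],
    "refunds_raw": [],
}


def upstream_assets(asset, lineage):
    """Return every asset that feeds, directly or indirectly, into `asset`."""
    seen = set()
    stack = list(lineage.get(asset, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(lineage.get(current, []))
    return seen


def origins(asset, lineage):
    """Return the upstream assets that have no recorded parents of their own."""
    return sorted(
        name for name in upstream_assets(asset, lineage)
        if not lineage.get(name)
    )


print(origins("cfo_earnings_deck", LINEAGE))
# Does q3_results underlie the CFO's presentation?
print("q3_results" in upstream_assets("cfo_earnings_deck", LINEAGE))
```

The same traversal answers both questions raised above: where an asset came from, and whether a given data set underlies a high-stakes report.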
The Insight-Driven Organization
Self-service analytics are critical for any organization striving to be insight driven. In this new paradigm, traditional data documentation is too slow and restrictive. The modern data catalog provides the mechanism for creating a truly self-service organization by efficiently providing analysts with the means to find, understand, and trust data.
Aaron Kalb is head of product at Alation. He has spent his career crafting empowering human-computer interactions, especially through natural language interfaces. After leaving Stanford with a BS and an MS in Symbolic Systems, he worked at Apple on iOS and Siri (doing engineering, research, and design in the Advanced Development Group). Aaron now leads the design team and guides the product vision at Alation. You can contact the author at email@example.com.