Wrangling Data into a Holistic View
What challenges do enterprises face when bringing data together, and what best practices can your organization use to prevent integration mistakes?
- By James E. Powell
- February 20, 2018
Are you struggling to wrangle data from across your enterprise to feed the demand for holistic analytics? You're not alone. We caught up with Bedrock Data's CEO and cofounder, Taylor Barstow, to discuss this growing challenge and how to overcome it. Barstow is a technology innovator with experience as both a developer and an engineering executive. Prior to Bedrock, he was VP of engineering for the enterprise graphics workflow solution Workflow by Design.
Upside: Why is it important for organizations to have "holistic" analytics? What's driving demand?
Taylor Barstow: The first reason is that business leaders want a 360-degree view of their organization. That gives them the perspective and insight to optimize the business -- acquiring customers, growing customers, and making customers successful.
The second reason is that as customers engage with more channels and departments, unifying data from many systems orients teams around solving problems. Rather than debate discrete data sources, departments can analyze integrated data and learn which key activities are leading to success. From product management to customer management to marketing, teams have the tools to connect the dots between investments and outcomes.
Why is providing that 360-degree customer view so difficult?
Companies use many cloud applications. When these different data sets are scattered -- across Web forms, chat logs, marketing automation, support tickets, sales quotes, and ERP systems -- wrangling data is very complex and cleaning it is time-consuming.
What challenges do businesses face when bringing data together?
Traditionally, IT teams have had to manually extract data from each application, format it for consistency, eliminate duplicates, resolve conflicts, and normalize data into a single schema. As the number of data sources grows, cleaning raw data "by hand" is increasingly cumbersome. Even after the data is cleaned, before analysis can begin, IT teams still need to map relationships between fields and join tables into a master set.
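To make those manual steps concrete, here is a minimal Python sketch of normalizing contact records from two hypothetical sources into one schema and merging them on a key field. All system and field names are invented for illustration; a real pipeline would cover far more fields and edge cases.

```python
# A minimal sketch of "by hand" wrangling: normalize two source schemas
# into one, then merge into a master set keyed by email address.
# All field names here are invented for illustration.

crm_rows = [{"Email": "Ann@example.com", "FullName": "Ann Lee", "Phone": "555-0100"}]
marketing_rows = [{"email_address": "ann@example.com", "name": "Ann Lee", "page_views": 42}]

def normalize_crm(row):
    return {"email": row["Email"].lower(), "name": row["FullName"], "phone": row.get("Phone")}

def normalize_marketing(row):
    return {"email": row["email_address"].lower(), "name": row["name"], "page_views": row.get("page_views")}

master = {}
for rec in [normalize_crm(r) for r in crm_rows] + [normalize_marketing(r) for r in marketing_rows]:
    merged = master.setdefault(rec["email"], {})
    for field, value in rec.items():
        merged.setdefault(field, value)  # naive conflict rule: first value seen wins

print(master)  # one merged record per email address
```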
What are the biggest mistakes enterprises make when addressing these challenges?
The first mistake is being overwhelmed by the scope of the project and either doing nothing or going down a path of centralized manual data cleanup that doesn't go anywhere. Another mistake is overlooking the importance of relationships between data objects. Insights require more than just a "flat data file." You need a full set of data objects and their corresponding relationships. For example, you need a contact or person assigned to an account with associated opportunities, activities, tasks, and tickets.
What best practices can you recommend for enterprises to prevent these mistakes?
Rather than managing data cleanup as a central project, I recommend first deduplicating data within each silo. This will speed up the overall project because once that step is done, automation can be applied to mapping fields between systems and building the integrated data set. The integrated data set should also carry over the concept of object relationships from the source applications to build a complete, queryable data set around a customer and all of their attributes and actions.
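Once each silo is clean, the cross-system mapping step lends itself to automation. Here is a minimal, table-driven sketch of mapping each system's field names onto one canonical schema; the mappings shown are assumptions for illustration.

```python
# Sketch: after per-silo deduplication, field mapping can be table-driven
# instead of hand-coded per system. The mappings below are illustrative.

FIELD_MAP = {
    "hubspot":    {"email": "email", "firstname": "first_name", "jobtitle": "job_title"},
    "salesforce": {"Email": "email", "FirstName": "first_name", "Title": "job_title"},
}

def to_canonical(system, record):
    mapping = FIELD_MAP[system]
    return {canonical: record[source] for source, canonical in mapping.items() if source in record}

print(to_canonical("salesforce", {"Email": "ann@example.com", "Title": "CTO"}))
# {'email': 'ann@example.com', 'job_title': 'CTO'}
```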
How might organizations handle conflicts in their data?
There are two main approaches: "system of record" (SoR) and "last modified." SoR ranks systems to determine -- for a specific type of data -- which system should win out in a conflict. "Last modified" means your record uses the most recently updated data across systems for a given field.
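As a rough illustration, here is a minimal Python sketch of both strategies; the system rankings, values, and timestamps are invented, and a real rules engine would be considerably richer.

```python
from datetime import datetime

# Two conflict-resolution strategies for a single field. Each candidate
# value carries its source system and last-modified time. The ranks,
# values, and timestamps below are invented for illustration.

SOR_RANK = {"finance": 1, "crm": 2, "marketing": 3}  # lower rank = higher authority

def resolve_sor(candidates):
    """System of record: the highest-ranked system's value wins."""
    return min(candidates, key=lambda c: SOR_RANK[c["system"]])["value"]

def resolve_last_modified(candidates):
    """Last modified: the most recently updated value wins."""
    return max(candidates, key=lambda c: c["modified"])["value"]

candidates = [
    {"system": "crm",       "value": "VP Sales", "modified": datetime(2018, 1, 3)},
    {"system": "marketing", "value": "Director", "modified": datetime(2018, 2, 1)},
]
print(resolve_sor(candidates))            # VP Sales -- crm outranks marketing
print(resolve_last_modified(candidates))  # Director -- newer timestamp
```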
How should an enterprise choose between the two? Are there any rules of thumb?
First, you shouldn't necessarily have to choose one or the other. The best option will vary based on the type of field.
The question to ask is: What is the most likely deciding factor for resolving a conflict for a given field? Do we give more weight to where the data originates or how recent it is? If the field is closely tied to a specific department -- such as marketing data from a marketing system or finance data from a finance system -- these are the types of fields where SoR makes sense. For fields more commonly shared across systems, such as a phone number or job title, a "last modified" approach will likely best ensure the data is as up-to-date as possible.
SoR also works well when there are values you'd rarely, if ever, want overwritten. For example, marketers might use HubSpot to record a first conversation event, an original source type, the number of page views -- values they wouldn't want a sales rep to overwrite. SoR is also a good fit for automated processes between connected systems. For example, if fields in your events system map to your marketing automation tool, you would use SoR. Many finance systems are also mapped to robust CRMs for reporting purposes. You might bring data from Zuora into Salesforce to report on monthly recurring revenue, number of customers, average selling price, or annual contract value. In this case, SoR is your best bet.
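Putting those rules of thumb together, a per-field policy can mix both strategies. The sketch below extends the previous one (reusing its resolver functions); the field names and strategy assignments are illustrative assumptions, not a prescribed configuration.

```python
# A per-field policy mixing the two strategies, extending the previous
# sketch (reuses resolve_sor, resolve_last_modified, and candidates).
# Field names and strategy assignments are illustrative assumptions.

FIELD_POLICY = {
    "original_source": "sor",            # marketing-owned; rarely overwritten
    "page_views":      "sor",
    "phone":           "last_modified",  # shared across systems; freshest wins
    "job_title":       "last_modified",
}

def resolve_field(field, candidates):
    strategy = FIELD_POLICY.get(field, "last_modified")  # default to freshest
    if strategy == "sor":
        return resolve_sor(candidates)
    return resolve_last_modified(candidates)

print(resolve_field("job_title", candidates))  # Director -- last modified wins
```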
Are there any other benefits to centralizing data?
Absolutely. Creating common definitions that can evolve and be easily maintained vastly improves how your SaaS applications function. Foremost, centralized data orients the entire organization around solving problems, freeing teams to spend their time acting on data rather than debating its accuracy.
A centralized data warehouse is also a smart GDPR compliance strategy. When all customer records are available via a trusted, centralized resource, it's easier to return data or erase sensitive data upon request. You can notify customers about a breach the moment it happens while knowing where the data originates and who can access which records.
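As a toy illustration of that point, here is a sketch, assuming a central store keyed by email address, of how access and erasure requests reduce to single operations; the store layout and function names are invented.

```python
import json

# Toy sketch: with all customer records in one central store keyed by
# email, a GDPR access request (export) or erasure request becomes a
# single lookup. The store layout and field names are invented.

central_store = {
    "ann@example.com": {
        "profile": {"name": "Ann Lee", "phone": "555-0100", "source": "crm"},
        "tickets": [{"id": 101, "subject": "Login issue", "source": "support"}],
    },
}

def export_customer_data(email):
    """Return everything held about a customer (access request)."""
    return json.dumps(central_store.get(email, {}), indent=2)

def erase_customer_data(email):
    """Remove a customer's records (right-to-erasure request)."""
    return central_store.pop(email, None)

print(export_customer_data("ann@example.com"))
erase_customer_data("ann@example.com")
```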
What innovations is Bedrock providing to make the data wrangling process easier?
Over the past year, Bedrock has evolved from syncing data across applications to automating the steps (from gathering to validation) that go into data wrangling and ingestion -- the entire virtual data pipeline.
Our special sauce is the automation we apply around bringing together data from disparate applications into a unified, consolidated, normalized data set of related objects. We map relationships between systems by pulling raw data records from external APIs and running a rules engine to produce a unified set of "fused records." These fused records can be fed into any data warehouse, analytics, dashboard, or business intelligence tool for analytics.
What's a fused record?
A fused record merges and normalizes records into one document. If you have a contact record in one system and a contact record in another, the fused record merges and normalizes the contact record and all related objects, cleaning your output so you see zero repeats and consistent formats everywhere.
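For a rough picture of what that might look like, here is a minimal sketch of fusing a contact held in two systems into one document with its related objects attached; the structures are invented and only hint at what a production rules engine does.

```python
# Sketch of a "fused record": one contact held in two systems, merged into
# a single document with related objects attached. Structures are invented.

crm_contact = {"email": "Ann@example.com", "name": "Ann Lee",
               "opportunities": [{"id": "opp-1", "stage": "proposal"}]}
support_contact = {"email": "ann@example.com", "name": "Ann Lee",
                   "tickets": [{"id": 101, "status": "open"}]}

def fuse(records):
    fused = {"related": {}}
    for rec in records:
        fused["email"] = rec["email"].lower()     # normalize format
        fused.setdefault("name", rec["name"])     # deduplicate: keep first value
        for key in ("opportunities", "tickets"):  # carry over related objects
            if key in rec:
                fused["related"].setdefault(key, []).extend(rec[key])
    return fused

print(fuse([crm_contact, support_contact]))
```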
Is there anything different about this approach?
Yes. Because we arrive at a unified data set of related objects as part of the pipeline itself, data from your business applications can be made available immediately to analytics tools and executives. Other approaches that bring together multiple data sets put a heavy onus on the database administrator to make sense of the data once it's sitting in the data warehouse. Our approach eliminates that step and bridges the gap between applications and analytics.
Where do you see the technology heading in the future?
I see holistic analytics playing an increasingly crucial role in driving business decisions. With the rise of AI and machine learning, in the future you'll be able to automate actions on record changes in real time or create workflows for actions such as notifying an account manager or contacting a customer via social media. This starts with having a trusted data source to feed these predictive analytics systems and technologies that make the data available regardless of format or where it lives.
For Further Reading:
Data Integration and Machine Learning: A Natural Synergy
In the Middle of Data Integration Is AI
Data Wrangling Versus ETL: What's the Difference?
About the Author
James E. Powell is the editorial director of TDWI publications, including research reports, the Business Intelligence Journal, and the Upside newsletter.