5 Use Cases for Integrating Big Data Tools with a Data Warehouse
These five use cases are good examples of how you can use big data tools to stage data for your data warehouse.
- By Stan Pugsley
- July 20, 2018
All of us like to think we are above average. We like to think that our big data is bigger than average, and when we hear about big data tools, we flatter ourselves into thinking that our organization is a perfect candidate for the most advanced tools and architectures.
Is that really the case?
When You Actually Need Big Data Tools
Think of big data tools such as Hadoop, Spark, NoSQL, and massively parallel databases as the freight trains of the data world. A freight train is amazingly powerful and efficient, but it is slow to start up, limited in its routes, and frustrating to reschedule or change. If you are looking for a quick, agile vehicle to deliver packages, a freight train might be your least effective option. But if you are moving a mountain of materials every day, the train could be your best, and perhaps only, option.
In general, you will know you have a big data scenario when your:
- Data velocity increases 100 times -- from thousands of transactions per hour to hundreds of thousands
- Data volume increases 100 times -- from millions of rows to hundreds of millions
- Data variety increases 100 times -- from dozens of data sources to hundreds
At those points of data throughput, your frustration with -- and the limitations of -- traditional SQL databases will grow to a tipping point. You will feel you are moving a growing mountain of data with a hand shovel and only getting further behind. If you aren't absolutely drowning in data, you can probably deal with your data in SQL tools with some tuning and good architecture.
The Best Tool Depends on the Job
Traditional SQL ETL and reporting tools, used within the data warehouse architecture, are best suited for primary business outcomes, such as sales, payments (or other transactions), account sign-ups, and unsubscribes.
Big data tools, used in a data lake architecture, are ideally suited for secondary business events that track the detailed (but often meaningless or redundant) steps on a customer journey or repeated (and often meaningless) messages from an Internet-connected device, such as:
- Browsing history
- Mobile app in-app actions
- Device-activity monitoring
- GPS location tracking
A data warehouse is the ideal destination for summarized trends from such secondary business events, aggregated into models that reflect the business processes.
Using Big Data with the Data Warehouse
One example of how big data tools can complement a data warehouse is an alarm company with Internet-connected sensors in homes across the country. There would be little value (and huge expense) in storing each sensor response in a SQL data warehouse, but that data could be retained in cheap storage in a data lake environment and then aggregated for use in the data warehouse. For instance, the company could define combinations of sensor device events that constitute a person locking up a home and departing. That aggregated event could be stored in the data warehouse in a fact table that records arrivals and departures.
Here are four more use cases for using big data tools to stage data for a data warehouse.
1. Summarize and filter IoT data into fact tables. A large national bed manufacturer is now including biometric sensors in their high-end mattresses. The individual sensor readings could be kept in a data lake (using storage such as Apache Hadoop). Using a tool such as Apache Spark to aggregate and filter the signals, the data warehouse could be populated with aggregated data to create time-trended reports and log alerts when boundary metrics are exceeded.
2. Merge live data with historical data. Financial institutions need real-time access to market data such as interest rates, but they also need to store that market data and show it in the context of historical trends. A tool such as Apache Kafka or Amazon Kinesis could facilitate this integration between the two sets of data. There is no scheduled batch process to delay the information, and data is streamed directly to the visualization tool.
3. ETL based on continuous training of data science models (i.e., machine learning ETL). Internet retailers continue to refine their models for segmenting and targeting customers. These techniques can be applied in Web analytics tools (such as Adobe's Marketing Cloud) to drive Web content. They can also be captured in the data warehouse to shape reporting and forecasting. Data warehouse dimensions can contain hierarchies and attributes that are built dynamically from statistical models, and those values can be modified over time.
4. Sessionization of clickstream, GPS tracking, or device monitoring data. A trucking or delivery company may collect GPS and delivery event data 24 hours a day, but most of that data is of little value as individual data points. The goal is to group those events into trips to show the overall statistics for distance traveled, timeliness of delivery, and other key metrics. Grouping all the event data into distinct trips is a difficult and resource-intensive process that, at high volumes, requires parallel processing in a tool such as Spark. The final trip metrics can then be loaded into a fact table in the data warehouse.
A Final Word
Your goal is to have the best of both sides of the data pipeline -- by collecting as much raw data as possible about any customer or business activity but selecting and organizing the final business outcomes into a data warehouse designed for business decision making. By using the right tool for the right job, you won't bog down your database server with an unmanageable mountain of staged raw data, and you won't impede reporting and business decision by trying to drive it with freight train. A data lake that leverages big data tools feeding into a data warehouse based on SQL is a smart way to keep all of your stakeholders happy.
Stan Pugsley is a senior business intelligence architect at Xerva, a BI consulting and DWaaS (Data Warehouse as a Service) company based in Orem, UT. He has been building business intelligence and data science solutions for 20 years, having worked at PricewaterhouseCoopers, HP, Sharp Analytics, and a number of startup companies. You can reach the author at StanP@xerva.com.