Flashpoint

Feature Story

Rethinking Data Movement: From Data Flow to Data Pipelines

Dave Wells
Data Industry Consultant and Educator

From data sources to applications, the world of data management is substantially more challenging today than in the early days of business intelligence when we worked primarily with structured enterprise data and relational databases.

Variety and complexity of data and processing create new challenges. The demands of agility and reduced time to value add further difficulties. With new challenges, we need to think about the problem differently. It is time for data pipelines that move data efficiently and store data sparingly.

The increase in data management complexity begins with the variety of data sources, including many sources external to the enterprise. New sources drive changes in how we ingest and persist data, in the topology by which we organize data, and in the techniques we use to increase data utility. All of these changes are necessary to expand use and reuse of data. Figure 1 illustrates many of the factors contributing to added complexity.

Handling Complexity and Demand for Speed

Methods of data persistence have expanded rapidly with the emergence of NoSQL databases. In simpler times we stored most data in relational databases and occasionally used XML or other tagged data storage methods. Today we also work with key value stores, document stores, graph databases, and columnar and big table databases. Choosing the optimum storage method for various kinds of data requires more consideration than when RDBMS was the default.

Topology for data management is especially challenging. The data warehouse—once the centerpiece of enterprise data integration—is now just a single component of a complex data environment that may include multiple data warehouses together with any combination of data lakes, data sandboxes, federated and logical data warehouses, and much more.

Data utility is a difficult topic because quality and usefulness of data are directly related to the goals and needs of the individuals using data—suitability to purpose. With the variety of data users and many distinct use cases, there is no “one size fits all” data structure that suits the needs of everyone. Single-point integration for many uses (the data warehouse approach) meets some needs. Individual data blending per use case (the self-service approach) meets other needs. Balancing methods while achieving consistency and honoring governance constraints can be especially difficult.

The variety of use cases for data is both the catalyst and the beneficiary of added complexity. Data applications that range from basic reporting to advanced analytics demand choices and flexibility of data sources, data structures, data access methods, and data preparation techniques.

In addition to complexity, we face pressure for speed. Analytics is often a real-time endeavor where the timeline from data to insight must be highly compressed. Eliminating delay and waste between data sources and data applications is essential. Every point at which data is stored along the path from source to application adds cycle time in getting from questions to answers.

Yet data must be stored to prevent loss of information, lack of history, and failure to meet needs. Storing data many times, however, creates a long path from data to value. When needed data is first staged, then warehoused, and then extracted to an analytics sandbox before analysis begins, time to value lengthens with every stop. Figure 2 illustrates typical data stores that contribute to a long data-to-value path.

Which data to store, and in which data stores to place it, are architectural questions and design challenges. Trade-offs of speed, reliability, accessibility, and maintainability are key considerations.

Utilizing Data Pipelines

Accelerating the data-to-value chain, however, takes more than simply reducing the number of points at which data is stored. Rethinking how we move data is essential. Traditional data warehouse design looks at data movement as data flows. Standard data flows move data through ETL and ELT processes with the goal of storing data in a warehouse or data mart with access by many people and processes. This approach is optimized to store data, not to use data.

We need to rethink design principles for data movement—think data pipeline instead of data flow. Data flow to a single point of integration (data warehouse) is no longer enough. The variety of users and use cases creates the need to have many data pipelines—perhaps as many as one per use case. A pipeline delivers data to the point of consumption, not to a reservoir from which it can be drawn. The pipeline is optimized to use data, not to store data. Figure 3 illustrates the concept of multiple data pipelines.

Design data pipelines for fast, efficient movement of data from source to business application. Store data when needed to capture and retain history. Store data when there is a clear and certain need for many people or processes to access a single, standard data structure. Focus on data movement instead of data storage. Move data as quickly as is practical from where it is stored to where it is needed. You’ll serve the business well and step up to many of the challenges inherent to today’s complex world of data.
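As a rough illustration of these principles, the sketch below chains Python generators so that each record moves from source to consumer without landing in an intermediate store. It is a minimal sketch under stated assumptions, not a reference design: the function names, the order fields (units, unit_price), and the optional history list are all hypothetical and chosen only for the example.

```python
from typing import Callable, Iterable, Iterator, Optional

Record = dict


def read_source(rows: Iterable[Record]) -> Iterator[Record]:
    """Yield records one at a time instead of landing them in a staging area."""
    for row in rows:
        yield row


def keep_complete(records: Iterable[Record], required: tuple) -> Iterator[Record]:
    """Pass along only records that have a value for every required field."""
    for record in records:
        if all(record.get(field) is not None for field in required):
            yield record


def add_revenue(records: Iterable[Record]) -> Iterator[Record]:
    """Derive the value the consumer needs while the record is in motion."""
    for record in records:
        yield {**record, "revenue": record["units"] * record["unit_price"]}


def run_pipeline(
    source_rows: Iterable[Record],
    deliver: Callable[[Record], None],
    history: Optional[list] = None,
) -> int:
    """Move records to the point of consumption; store only if history is requested."""
    delivered = 0
    stream = add_revenue(keep_complete(read_source(source_rows), ("units", "unit_price")))
    for record in stream:
        if history is not None:  # store data only when there is a need to retain it
            history.append(record)
        deliver(record)
        delivered += 1
    return delivered


# Hypothetical use case: feed a dashboard directly from order records.
orders = [
    {"order_id": 1, "units": 2, "unit_price": 5.0},
    {"order_id": 2, "units": None, "unit_price": 3.0},  # incomplete, filtered out
    {"order_id": 3, "units": 1, "unit_price": 7.5},
]
dashboard = []
delivered_count = run_pipeline(orders, dashboard.append)
```

A second use case would get its own pipeline with its own steps and its own consumer. Passing a list as history is the only point at which this sketch persists anything, which mirrors the guidance to store data only when history or shared access is clearly needed.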

Dave Wells is actively involved at the intersection of information management and business management. He is a consultant, educator, and author dedicated to building meaningful connections throughout the path from data to business value. Knowledge sharing and skill building are Dave’s passions, carried out through consulting, speaking, teaching, and writing. He is a continuous learner—fascinated with understanding how we think—and a student and practitioner of systems thinking, critical thinking, design thinking, divergent thinking, and innovation.

Flashpoint Insight

The Cure for Ailing Self-Service Business Intelligence

Marsha Burke, Wayne Simpson, and Shad Staples

Many self-service business intelligence models falter shortly after they are deployed. The underlying reasons are based on the same factors that discourage us from performing other do-it-yourself tasks to accomplish important outcomes. These include lack of know-how, the inability to access the right tools, time constraints, the need for professional quality, and a lack of patience. By understanding these failure points, organizations can take steps to improve the results and longevity of self-service BI.

This article expands on the reasons for failure and suggests how do-it-yourself models can be made more successful through implementation of a centralized approach. The centralized approach offers service and support that directly address the common failure points of self-service BI. By centralizing the approach and backing it with IT support, standardized and customized BI models can be developed, validated, and deployed in a trusted and timely manner. Organizational benefits include cost-effectiveness, validated results, ease of use, and sustainability for the information that fuels decision making.

Learn more: Read the entire article by downloading the Business Intelligence Journal (Vol. 21, No. 3).

TDWI Research Snapshot

Barriers to Making Modernization Happen

Philip Russom

Data modernization has its benefits; however, it also has many potential barriers, according to survey results (see Figure 9). The issues span multiple areas.

Inadequate organizational support. As with most data-driven programs, DW modernization can be limited by poor stewardship or governance (40%) or a lack of a business case or sponsorship (30%).

Technical team deficiencies. Technical success depends on the team, which may suffer inadequate staffing for data warehousing and related disciplines (39%), inadequate skills for new technologies and practices (33%), or a lack of experience with new big data types and their analytics (28%).

Cost issues. Financing modernization can be inhibited by the cost of implementing new technologies (34%) and the cost of hardware and software upgrades (21%).

Data limitations. Whether focused on new big data, traditional enterprise data, or both, modernization can be threatened by the poor quality of data (27%) or metadata (18%).

Design challenges. Applying new architectures to an existing solution requires substantial retrofitting when the current DW was designed for standard reports and OLAP only (20%). Likewise, moving to the complex, multiplatform system architectures typical of modern DWs can be stymied by the difficulty of architecting a modern, complex environment (21%) and the difficulty of managing a multiplatform DW environment (14%).

DW platform limitations. The DBMS and hardware platform under an existing warehouse can be a substantial barrier when the current DW environment cannot scale up to big data (16%) or ingest data fast enough (14%) to leverage large volumes or streaming data.

Missing ancillary tools. Modernizing the ecosystem around a warehouse requires the acquisition or upgrade of many tool types. Otherwise, the results of modernization are limited by a lack of tools for analyzing new big data types (12%) or for integrating and managing new big data types (12%).

Stodgy mindsets. Twenty-seven respondents selected “Other” (6% in Figure 9) and entered additional barriers to modernization, most of them relating to mindset issues. Sometimes the problem stems from upper management mindsets, as when “management does not prioritize infrastructure investment”; “the business does not understand the potential of data”; or “top management is not committed to innovation.” At other times, everyone suffers from an “inability to rethink the technological choices made earlier” or “the momentum of current or traditional thinking.” When it comes to getting resources for modernization, few organizations are immune to “company politics.”

Read the full report: Download TDWI Best Practices Report: Data Warehouse Modernization in the Age of Big Data Analytics (Q2 2016).

Flashpoint Rx

Mistake: Ignoring the Quality of the Data Loaded

William McKnight

Poor data quality is easily disguised by the sheer volume of data we are dealing with (quantity over quality).

NoSQL data is still data, and data will still be processed by systems and humans. Those actions will only be as good as the input—that is, the data.

We recommend spot reviewing the data against technical and business expectations for completeness, accuracy, and conformance. Data could lack referential integrity. It could lack proper unique identification or violate cardinality expectations, and other logical triangulations across the data could have unexpected results. The data itself could simply contain incorrect values that don’t conform to expected domains, repeated rows, or inconsistent representations of standard data. Where violations are found, systemically correct them in the load or, better yet, at the source.
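A spot review of this kind can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: records are flat dictionaries with hashable values, and the function name, field names, and allowed status values are hypothetical. It checks a sample for completeness, unique identification, conformance to expected domains, referential integrity, and repeated rows.

```python
from collections import Counter


def spot_check(rows, key, required, domains, parent_keys, foreign_key):
    """Return the positions of sampled rows that violate each expectation."""
    findings = {
        "incomplete": [],
        "duplicate_keys": [],
        "out_of_domain": [],
        "orphans": [],
        "repeated_rows": [],
    }
    key_counts = Counter(row.get(key) for row in rows)
    seen = set()
    for i, row in enumerate(rows):
        # Completeness: every required field must carry a value.
        if any(row.get(field) in (None, "") for field in required):
            findings["incomplete"].append(i)
        # Unique identification: the key should appear exactly once.
        if key_counts[row.get(key)] > 1:
            findings["duplicate_keys"].append(i)
        # Conformance: present values must fall within the expected domain.
        for field, allowed in domains.items():
            if row.get(field) is not None and row[field] not in allowed:
                findings["out_of_domain"].append((i, field))
        # Referential integrity: the foreign key must point at a known parent.
        if row.get(foreign_key) not in parent_keys:
            findings["orphans"].append(i)
        # Repeated rows: an identical record was already seen in the sample.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            findings["repeated_rows"].append(i)
        seen.add(fingerprint)
    return findings


# Hypothetical sample of order documents loaded into a NoSQL store.
customers = {"C1", "C2"}
sample = [
    {"order_id": 1, "customer_id": "C1", "status": "shipped"},
    {"order_id": 2, "customer_id": "C9", "status": "shipped"},  # unknown customer
    {"order_id": 3, "customer_id": "C2", "status": "SHIPPED"},  # inconsistent value
    {"order_id": 3, "customer_id": "C2", "status": "SHIPPED"},  # repeated row
    {"order_id": 4, "customer_id": "C1", "status": None},       # missing status
]
findings = spot_check(
    sample,
    key="order_id",
    required=("order_id", "customer_id", "status"),
    domains={"status": {"open", "shipped", "cancelled"}},
    parent_keys=customers,
    foreign_key="customer_id",
)
```

The function only reports violations. Correcting them belongs in the load process or, preferably, at the source, as recommended above.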

Those who load NoSQL should not believe they are loading generic data. They must be as familiar with the contents they are injecting into the organization as they are with the NoSQL setup itself.

Read the full issue: Download Ten Mistakes to Avoid in NoSQL (Q3 2016).