Data, Time, and the Data Lake: Putting it All Together
Extensive design and development effort has been expended over more than three decades to allow relational databases to handle time properly. Can data lakes manage it as well?
- By Barry Devlin
- December 4, 2017
Time-dependent data -- particularly from the Internet of Things (IoT) -- is growing fast in volume and business importance. Consider, for example, using the time-based data from a train's onboard computer to plan preventive maintenance. When was this axle temperature measured? How fast is it increasing compared to the temperature of other nearby axles? Did the spike occur during braking or when vibrations increased? What are the implications for analysis if temperature is measured every five seconds but vibration measures are averaged over a minute by the sensor? How do we handle missing measurements?
Such time series data is deceptively simple in structure. It consists of a series of data records, each recording a source (sensor) ID, a timestamp, and one or more measured variables. Time series analytics, however, is a relatively complex and specialized topic in statistics. Data scientists typically undertake such work using specific R or Python analytics packages in the data lake where IoT data normally lands.
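To make the record structure concrete, here is a minimal Python sketch. The names (Reading, minute_averages), the sensor ID, and the sample values are all hypothetical and chosen only for illustration. It assumes a temperature sensor that reports every five seconds, rolls those readings up to one-minute averages so they can be compared with a per-minute vibration figure, and reports how complete each minute is so that missing measurements stay visible rather than being silently averaged away.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Reading:
    sensor_id: str       # source (sensor) ID
    timestamp: datetime  # when the value was measured
    value: float         # one measured variable


def minute_averages(readings, expected_per_minute=12):
    """Average readings per sensor per minute; also return the share of expected samples."""
    buckets = {}
    for r in readings:
        # Truncate the timestamp to the start of its minute.
        minute = r.timestamp.replace(second=0, microsecond=0)
        buckets.setdefault((r.sensor_id, minute), []).append(r.value)
    return {
        key: (mean(values), len(values) / expected_per_minute)
        for key, values in buckets.items()
    }


# Hypothetical data: one axle sensor sampled every 5 seconds, with two samples lost.
start = datetime(2017, 12, 4, 8, 0, 0)
readings = [
    Reading("axle-3-temp", start + timedelta(seconds=5 * i), 60.0 + 0.5 * i)
    for i in range(12)
    if i not in (4, 5)
]

result = minute_averages(readings)
avg, completeness = result[("axle-3-temp", start)]
print(f"average={avg:.2f} completeness={completeness:.0%}")
```

A real analysis would more likely use a dedicated time series package, as noted above; the point here is only that alignment and gap handling are decisions the analyst must make explicitly.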
Predicting the likely date of failure of a train component is, of course, important in terms of both safety and economics. However, real business value accrues from optimizing the date for performing preventive maintenance considering the train's schedule, the impact of withdrawing it from service, availability of spares and skilled engineers, and so on. Such information resides not in the data lake but in the company's traditional production systems. Combining time series data from the data lake with production data is challenging because time is treated differently in these two environments.
Time Series in Production
Business production computing occurs in operational systems and data warehouses. Operational systems "run the business" minute-by-minute, capturing every transaction reliably and promptly and storing it safely to build the formal record of business activities. Data warehouses "manage the business" by building a usable and consistent view of its state as a base for decision making. These business aims drive how time is represented in both systems.
Every business transaction occurs at a single point in time as one in a series of events. In essence, this is also time series data. However, most operational systems and data warehouses store and operate on the business state, which exists over a period of time.
The difference is most clearly seen through an example. Deposits and withdrawals of money are the basic transactions on a bank account. However, it is the balance in the account, at points in time and over varying periods, that is of most interest to both the bank and its customers. Balance is a business state delimited by two timestamps: the start and end of a period of activity (the so-called period of validity). In addition, the times at which data is actually recorded in production systems (for example, during a nightly batch update) are important for control and analysis purposes, although they may differ from the actual business event times.
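The step from events to state can be sketched in a few lines of Python. The transactions, amounts, and function names below are hypothetical, and the sketch assumes at most one transaction per day so that no zero-length periods arise. Each transaction closes the previous balance's period of validity and opens a new one.

```python
from datetime import date

# Hypothetical transactions: (business date, signed amount).
transactions = [
    (date(2017, 11, 1), 500.00),    # deposit
    (date(2017, 11, 10), -120.00),  # withdrawal
    (date(2017, 11, 25), 300.00),   # deposit
]


def balance_periods(transactions, opening=0.0):
    """Turn point-in-time transactions into balance states valid over [valid_from, valid_to)."""
    periods = []
    balance = opening
    ordered = sorted(transactions)
    for i, (day, amount) in enumerate(ordered):
        balance += amount
        # The period ends when the next transaction occurs; None means still current.
        end = ordered[i + 1][0] if i + 1 < len(ordered) else None
        periods.append({"balance": balance, "valid_from": day, "valid_to": end})
    return periods


def balance_on(periods, day):
    """Look up the balance whose period of validity contains the given day."""
    for p in periods:
        if p["valid_from"] <= day and (p["valid_to"] is None or day < p["valid_to"]):
            return p["balance"]
    return None  # before the first transaction


periods = balance_periods(transactions)
print(balance_on(periods, date(2017, 11, 15)))  # 380.0
```

Three events yield three states (500, 380, and 680), each with a start and an end. The query asks about a period, not about an event.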
Business state data tagged with both business and DBMS start and end timestamps is called bitemporal data. Business time records when a state held true in the real world; DBMS time records when the database held that version of the data. This structure is one of the most useful ways that relational database designers record and manage time-related data. Several databases, including IBM DB2 and Teradata, have included internal support for bitemporal data since the early 2010s.
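As an illustration only, the following Python sketch models a bitemporal row and an "as of" lookup. It is not the syntax or implementation of any particular database; the class, field names, account ID, and data are hypothetical. The example assumes a balance that was first recorded by a nightly batch and later corrected, so the same business moment gives different answers depending on when the database is asked.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class BitemporalRow:
    account: str
    balance: float
    valid_from: datetime             # business time: state begins
    valid_to: Optional[datetime]     # business time: state ends (None = open)
    recorded_from: datetime          # DBMS time: row entered the database
    recorded_to: Optional[datetime]  # DBMS time: row superseded (None = current)


def _covers(start, end, t):
    """True if t falls in the half-open interval [start, end)."""
    return start <= t and (end is None or t < end)


def as_of(rows, account, business_time, system_time):
    """Balance valid at business_time, as the database knew it at system_time."""
    for row in rows:
        if (row.account == account
                and _covers(row.valid_from, row.valid_to, business_time)
                and _covers(row.recorded_from, row.recorded_to, system_time)):
            return row.balance
    return None


# Hypothetical history: recorded by a nightly batch on Nov 2, corrected on Nov 5.
rows = [
    BitemporalRow("ACC-1", 500.0, datetime(2017, 11, 1), None,
                  datetime(2017, 11, 2, 1), datetime(2017, 11, 5, 1)),
    BitemporalRow("ACC-1", 550.0, datetime(2017, 11, 1), None,
                  datetime(2017, 11, 5, 1), None),
]

question = datetime(2017, 11, 3)
print(as_of(rows, "ACC-1", question, datetime(2017, 11, 3)))  # 500.0, before the correction
print(as_of(rows, "ACC-1", question, datetime(2017, 11, 6)))  # 550.0, after the correction
```

The corrected row does not overwrite the original; both remain, which is what allows earlier reports to be reproduced exactly as they were issued.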
Bitemporality is at the heart of data warehouse consistency and enables operational systems to manage the creation of state data from time series. As relational databases are the basis of both operational systems and data warehouses, extensive design and development effort has been expended over the decades to handle time properly in this environment.
Time in the Data Lake
The data lake is not, however, architected to handle time as used in current operational systems and data warehouses. Rather, its development has been largely driven by the needs of Web-centric businesses, where time series data and its direct analysis are the norm. Despite this, traditional businesses are now adopting the data lake widely as a replacement for the data warehouse. The lake's one-dimensional time series approach can give rise to significant implementation challenges in more complex data warehouse use cases.
The inability of the data lake (at least for now) to handle bitemporality in data and processing begins with its schema-on-read approach. Because data is landed and stored however the user likes, a consistent temporal representation is missing. Such problems fall within the scope of the "data swamp" quality criticisms of the data lake, and are significant.
However, the process aspects, particularly for IoT data, are more challenging. IoT data arrives first in the data lake and is analyzed to build predictive models of events of interest. To arrive at actions, these models are applied in real time against incoming IoT data, correlating the outcomes with production data to plan and execute preventive maintenance. The work thus moves from the data lake to the operational environment, the reverse of traditional business data and process flow from operational to analytical.
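The flow just described, from model output to operational action, might look something like the following Python sketch. Everything in it is an assumption made for illustration: the linear extrapolation stands in for a real predictive model, the 90-degree limit is arbitrary, and the depot slots and spares date stand in for production data. The rule of picking the latest feasible slot is one possible policy, not a recommendation.

```python
from datetime import date, timedelta


def predicted_failure_date(today, temp_now, temp_rise_per_day, limit=90.0):
    """Stand-in model: extrapolate axle temperature linearly to an assumed limit."""
    if temp_rise_per_day <= 0:
        return None  # no upward trend, so no failure is predicted
    days_left = max(0.0, (limit - temp_now) / temp_rise_per_day)
    return today + timedelta(days=int(days_left))


def choose_maintenance_date(failure_date, depot_slots, spares_available_from):
    """Latest depot slot that falls before the predicted failure and has spares on hand."""
    candidates = [
        d for d in depot_slots
        if spares_available_from <= d < failure_date
    ]
    return max(candidates) if candidates else None


# Model output, derived from incoming IoT data (hypothetical values).
today = date(2017, 12, 4)
failure = predicted_failure_date(today, temp_now=72.0, temp_rise_per_day=1.5)

# Production data (hypothetical): when the train can be withdrawn, when spares arrive.
depot_slots = [date(2017, 12, 6), date(2017, 12, 11),
               date(2017, 12, 15), date(2017, 12, 20)]
chosen = choose_maintenance_date(failure, depot_slots,
                                 spares_available_from=date(2017, 12, 8))
print(failure, chosen)
```

Even in this toy form, the decision depends on joining a time series result (a predicted date) with state data (slots and stock that are valid over periods), which is where the two treatments of time meet.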
To handle this situation, many data lake proponents envisage moving production-level data and computing to the lake. However, today's data lake doesn't have the required reliability and maintainability characteristics. The relational database, in contrast, with more than three decades of history in both data warehousing and operational systems, offers a more likely environment for the operationalization of IoT (and other) analytics work. This leads to the novel concept of the production analytic platform.
Expanding into a New Idea
This idea of a production analytic platform emerged from discussions with Teradata about the functionality it has been adding to its database in recent years. The opportunity is to build on the strengths of traditional relational databases in operational and informational activities, on their mathematically integrated model, and on their extensibility into new data structures and analytic functions.
Built on the foundation of the enterprise data warehouse, the production analytic platform bridges the operational and analytical worlds and provides an environment where analytics models can be exercised with real-time data to drive operational decisions with due respect for temporal considerations and production needs.
I invite readers tired of "doing time" in the data lake to share their thoughts directly with me to explore the possibilities -- and limitations -- of this germ of a new architectural idea.
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.