TDWI Articles

The Case for Smarter Data Integration

To support IoT and Industrial IoT use cases, data integration must become both more automated and more explicitly analytical. In other words, it needs to be smarter.

By now, you've heard of IoT, the Internet of Things. IoT is a shorthand description for the connected reality in which many of us increasingly live and work. It's a reality rife with networked devices of all kinds, from the familiar (laptops and mobile phones) to the still-strange -- connected thermostats and refrigerators or "smart" coffee makers and toasters.

Before consumer-grade IoT appeared, some industries were already using the manufacturing equivalent: Industrial IoT (IIoT). If you think the "Connected House" is filled with smart devices, you ain't seen nothing yet. The factory floor is home to hundreds (in some cases thousands) of machines, each of which can potentially be fitted out with sensors.

As a result, a manufacturing plant might be home to hundreds of thousands of signallers, each of which transmits information at different intervals. Some signallers transmit data as a constant stream; others transmit only intermittently or unpredictably.

Data Deluge, Diversity

The challenge of IIoT is a problem of data deluge and diversity. Simply put: the complexity of ingesting, profiling, transforming, analyzing, and (if necessary) persisting IIoT data at scale far outstrips the capabilities of conventional data integration (DI) technologies. It isn't just the sheer amount of IIoT data -- although an IIoT-capable oil rig produces 7-8 TB of operational data per day -- or the fact that IIoT data tends to stream, pulse, or trickle in ways that aren't conducive to batch ingest.

It's that the combination of massive data volumes and variable data periodicity creates an unprecedented data management problem. Furthermore, if managing data at IIoT scale is hard, integrating data at IIoT scale is that much harder.

In some cases, the transformations you need to apply to IIoT sensor data are trivial -- e.g., parsing event messages to extract key-value pairs. In other cases, however, parsing and transforming IIoT data can be considerably more complicated. Extracting key-value pairs is easy; linking disparate event types on the basis of keys -- or combinations of values buried in event messages and other kinds of data -- is not.
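To illustrate the gap between the two tasks, here is a minimal Python sketch. It parses key-value pairs out of event messages (the easy part), then links events of different types that share a key. The message format, the field names (`ts`, `device`, `type`), and both helper functions are hypothetical, invented for this example rather than taken from any DI product.

```python
from collections import defaultdict

def parse_event(message: str) -> dict:
    """Split a 'key=value key=value' message into a dict (the trivial case)."""
    return dict(pair.split("=", 1) for pair in message.split())

def link_by_key(events, key):
    """Group events of different types that share the same value for `key`."""
    linked = defaultdict(list)
    for event in events:
        if key in event:
            linked[event[key]].append(event)
    return dict(linked)

# Hypothetical sensor messages; real payloads are rarely this uniform.
raw = [
    "ts=100 device=pump7 type=vibration level=0.9",
    "ts=101 device=valve2 type=pressure psi=310",
    "ts=105 device=pump7 type=temperature celsius=88",
]
events = [parse_event(m) for m in raw]
linked = link_by_key(events, "device")
print([e["type"] for e in linked["pump7"]])
```

Even in this toy case the linked events carry different fields (`level` versus `celsius`), which is where the real transformation work starts.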

Think of the unions, intersections, and splits (among countless other transformations) that are used to prepare and transform data for time-series analysis. A time-series database (TSDB) can accelerate these workloads, but it isn't practical or desirable to co-locate a TSDB with each and every oil rig, factory, or electrical substation. Besides, the data must still be transformed -- profiled, prepared, engineered -- before it can be loaded into the TSDB. Transformation of this kind is much more involved than simply parsing messages for timestamps.

In the era of IoT and IIoT, more capability must be built into DI software.

Not Smart Enough

Commodity DI software is insufficiently intelligent today: it doesn't use analytics (such as machine learning algorithms) to automate or accelerate the process of parsing and transforming data payloads -- or of building and instantiating data flows or ETL jobs. It can't be used easily with unstructured or streaming data because commodity DI software gives priority to the manipulation and management of relational data.

In fact, the dominant technology used to manipulate, manage, and access data -- SQL -- is ill-suited to most of the data payloads and workloads involved in IIoT analytics. Think about it: the result of a SQL query is a homogeneous set of things. However, the events we want to collect in IIoT analytics aren't easily reducible to homogeneous sets.

"When the problem is to link event A to B then [to] F or X, it's no longer a simple relational problem. It's a problem that's dependent on ordering between discrete events," comments Mark Madsen, a research analyst with information management consultancy Third Nature.

"Finding patterns and analyzing them requires looking at these dependencies. It also requires linking unrelated things -- not one set of events that are all the same, but many different events."

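Madsen's point about ordering can be made concrete with a small Python sketch. A set-style membership check only says which events occurred; an ordered check also asks whether they occurred in sequence. The event names, the timestamps, and the `matches_in_order` helper are hypothetical and serve only as illustration.

```python
def matches_in_order(event_types, pattern):
    """Return True if `pattern` occurs as an ordered subsequence of `event_types`."""
    remaining = iter(event_types)
    # `step in remaining` consumes the iterator up to the first match,
    # so each step must be found after the previous one.
    return all(step in remaining for step in pattern)

# (timestamp, event type) pairs, arriving out of order.
events = [(3, "B"), (1, "A"), (7, "F"), (5, "C")]
ordered = [name for _, name in sorted(events)]

# A set-style check: are both events present at all?
print({"B", "A"} <= set(ordered))                # True
# An order-aware check: did A lead to B and then to F?
print(matches_in_order(ordered, ["A", "B", "F"]))  # True
# Same events as the set check, but in the wrong order.
print(matches_in_order(ordered, ["B", "A"]))       # False
```
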
Data Value

There's another wrinkle here: not all IIoT data is valuable or important. That IIoT-enabled oil rig that generates as much as 8 TB of data per day? It sends only a tiny fraction of this data off site. Sending 8 TB of data every day would be prohibitively -- ridiculously! -- expensive, especially if you multiply that by the number of IIoT-enabled oil rigs in the field.
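For a sense of scale, the back-of-the-envelope Python calculation below converts 8 TB per day into the sustained link rate needed to ship it all off site. It assumes decimal terabytes and perfectly even transmission around the clock. It says nothing about actual cost, which depends on the link type and is not given here.

```python
BYTES_PER_TB = 10**12            # assumption: decimal terabytes
SECONDS_PER_DAY = 24 * 60 * 60

def sustained_mbps(tb_per_day: float) -> float:
    """Average megabits per second needed to move `tb_per_day` in one day."""
    bits_per_day = tb_per_day * BYTES_PER_TB * 8
    return bits_per_day / SECONDS_PER_DAY / 10**6

print(round(sustained_mbps(8)))  # 741
```

That works out to roughly 741 Mbit/s of continuous uplink from a single rig, before any protocol overhead or retransmission.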

Instead, the oil company uses a combination of onsite DI software and advanced analytics to ingest, parse, and (when necessary) persist IIoT data. It uses statistics, automated machine learning algorithms, and other types of advanced analytics to automatically produce smaller, statistically representative data sets that can be used for analysis. It likewise uses onsite analytical technologies to quickly identify patterns, anomalies, and so on. This permits the company to proactively address critical or emergent events.
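The article does not say which methods the oil company uses. As one possible illustration, the Python sketch below pairs reservoir sampling (Algorithm R), which keeps a uniform random sample from a stream of unknown length, with a simple standard-deviation test for flagging outliers. Both techniques are stand-ins chosen for brevity, and the readings are synthetic.

```python
import random
import statistics

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream (Algorithm R)."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k / (i + 1)
    return sample

def flag_anomalies(readings, threshold=3.0):
    """Return readings more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(readings)
    stdev = statistics.pstdev(readings)
    if stdev == 0:
        return []
    return [r for r in readings if abs(r - mean) / stdev > threshold]

# Synthetic temperature readings with one obvious spike appended.
readings = [20.0 + (i % 5) * 0.1 for i in range(1000)] + [95.0]

sample = reservoir_sample(readings, k=50, rng=random.Random(42))
print(len(sample))               # 50 values kept out of 1,001
print(flag_anomalies(readings))  # [95.0]
```

Only the 50-value sample and the flagged spike would need to leave the site in this toy setup. A production system would use more robust statistics than a global mean and standard deviation.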

Which brings us back to our position: in order to support these and other use cases, data integration in the context of IIoT must become both more automated and more explicitly analytical.

You know, smarter.

There's some evidence that DI and analytics vendors understand this. This April, for example, Microsoft touted a new IIoT customer -- Jabil Circuit Inc., a global contract manufacturer that generates almost $20 billion in annual revenues. Jabil Circuit is using Microsoft's Azure Machine Learning (AML) platform-as-a-service to analyze the sensor data generated by the machines on its factory floors.

The idea is that AML's predictive technology will help Jabil Circuit to improve efficiency and reduce waste in the manufacturing process. Microsoft and Jabil didn't explicitly address the problem of DI and IIoT; it's nonetheless part of this project.

A better example -- one that directly addresses the challenge of IIoT-related DI -- is the recent accord between computing giants Cisco Systems and IBM. To recap, Cisco will embed IBM's Watson analytics technologies in or with its edge switches and routers. The collaboration focuses on critical IIoT use cases -- i.e., the use of onsite analytics to identify emergent issues such as impending equipment failure -- as well as the need to build more automation and intelligence into the software, methods, and processes we use to support DI.

"There are clients ... in remote locations, for whom the cost of transmitting data is high and the reliability is low. There are times when there's no connectivity at all, let alone high-speed connectivity. With every passing minute, the value of their IoT data diminishes," said Harriet Green, general manager of IBM's Watson IoT unit. "We're combining ... cognitive computing and analytics capabilities ... [to eliminate the] need to send the data over cellular, Wi-Fi, or enterprise networks. [You can] send the right data, not all the data, to store and secure in the cloud."

If these partnerships are successful, hopefully we'll see additional examples of more intelligent, more efficient DI systems soon.

About the Author

Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about.

