Wanted: A Data Architecture for On-Demand Data Access
The way we integrate and provision data is incompatible with the requirements of new use cases such as data science.
There's broad consensus that the way we integrate and provision data is incompatible with the requirements of new, "exploratory" use cases such as data science. There's far less consensus about what to do about it, though some people have concrete ideas.
"If we think about what [the new] data architecture should look like, we can isolate a few key attributes: it must facilitate abstraction, promote reuse, be center-less, and be parallel-aware," says Mark Madsen, a research analyst with information management consultancy Third Nature.
The Problem of On-Demand Data Access
The most complex piece of the data architecture puzzle is the piece that enables ad hoc or on-demand data access. After all, the needs of data scientists and similar power users are largely unpredictable; in many cases, they're one-off. Data scientists want to access data whenever they need it from the platform or environment they're most comfortable working in.
This is why next-generation data architectures will likely borrow key features and concepts from the self-service data preparation and data virtualization (DV) paradigms. However, neither paradigm addresses the complete range of key attributes identified by Madsen.
Self-service data prep tools, for example, are end-user-oriented offerings, designed primarily for business analysts and data scientists. By contrast, a next-gen data architecture must be flexible enough to serve both exploratory users (data scientists, business analysts) with unpredictable needs and traditional information consumers whose data access needs are more schedulable. Like it or not, something like a data-fabric middleware is required to support both of these constituencies.
At this point, self-service data prep tools also do little to substantively address the core needs -- reuse and repeatability -- that are prerequisites for manageable, governed access at scale. These are likewise problems the software category we call "middleware" evolved to address.
DV, on the other hand, is a more plausible contender for a next-gen data architecture. It enables a virtual abstraction layer designed to mask the complexities -- e.g., physical location, instance type (physical, virtual, cloud) -- of data sources. DV is notionally middleware, albeit with a twist; think of it as a kind of middleware that is also its own architecture.
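To make the abstraction idea concrete, here is a minimal, hypothetical sketch in Python of what a DV-style virtual layer does: it hides each source's physical details (location, instance type) behind a single logical query interface. The class and source names are invented for illustration; a real DV product does far more (query planning, pushdown, security).

```python
from dataclasses import dataclass, field

@dataclass
class Source:
    name: str        # logical name exposed to consumers
    kind: str        # "physical", "virtual", or "cloud" -- hidden detail
    location: str    # connection detail -- hidden detail
    rows: list = field(default_factory=list)  # stand-in for the source's data

class VirtualLayer:
    """Toy abstraction layer: consumers see logical names, never locations."""
    def __init__(self):
        self._sources = {}

    def register(self, source: Source):
        self._sources[source.name] = source

    def query(self, name: str, predicate=lambda row: True):
        # One entry point, regardless of where the data actually lives.
        source = self._sources[name]
        return [row for row in source.rows if predicate(row)]

layer = VirtualLayer()
layer.register(Source("sales", "cloud", "s3://bucket/sales",
                      [{"amt": 10}, {"amt": 99}]))
layer.register(Source("crm", "physical", "db01:1521", [{"amt": 5}]))

# The consumer never learns that "sales" lives in cloud storage.
big_sales = layer.query("sales", lambda r: r["amt"] > 50)
print(big_sales)  # [{'amt': 99}]
```

The point of the sketch is the seam it draws: everything below `query()` can change (a source moves from on premises to the cloud) without the consumer's code changing at all.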
However, DV architecture isn't in any sense center-less: its virtual abstraction layer is enabled by a DV engine at its center, orchestrating queries, events, and messages.
Another problem is that DV isn't a commoditized technology: best-in-class products are expensive to license, install, and maintain, and there's a dearth of robust open source DV-like technologies. A DV-only architecture risks some degree of vendor lock-in.
Adding Complexity: Platform and Data Source Distribution, Massive Data Volumes
On-demand access isn't strictly a problem of connecting -- via reliable, predictable internal network transport -- to physical instances of data sources running elsewhere in the on-premises enterprise. On-demand access is complicated by the inevitability of data source and data platform distribution, to say nothing of the phenomenon of ever-increasing data volumes. The principal challenges include:
- User-initiated access: The needs of data scientists and other exploratory users are unpredictable, so access must be on demand.
- Disparate sources: Data sources are distributed across the enterprise, cloud services, the Internet, etc., so data access requires negotiating stateful (such as ODBC or JDBC), stateless (such as RESTful cloud APIs), and other types of connections to data sources.
- Minimized movement: Because it can be practically impossible to move data at petabyte-scale volumes, particularly over the Internet, the data access solution must minimize data movement.
- Smart parallelism: Data access must be smart about when and where it processes data. If a user needs to access data on an upstream system, the solution should be able to exploit upstream parallelism, either on the data source itself or on a system that is local to it. This is true, for example, of data stored in Amazon's S3 cloud storage service. Instead of extracting and moving data in bulk from S3 storage, a smart on-demand access solution would exploit local parallelism (e.g., Amazon's Elastic MapReduce service) to reduce the size of the data set before moving it.
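The minimized-movement and smart-parallelism points can be illustrated with a small, stdlib-only Python sketch. An in-memory SQLite database stands in for an upstream source such as S3; the table and column names are invented. The contrast is between bulk-extracting every row and pushing the reduction upstream so only a small summary crosses the wire.

```python
import sqlite3

# Stand-in for an upstream system (e.g., data sitting in cloud storage).
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE events (region TEXT, bytes INTEGER)")
remote.executemany("INSERT INTO events VALUES (?, ?)",
                   [("us", 100), ("us", 250), ("eu", 75)] * 1000)

# Naive approach: bulk-extract every row, then reduce locally.
bulk = remote.execute("SELECT region, bytes FROM events").fetchall()
print(len(bulk))    # 3000 rows moved across the wire

# Smart approach: reduce at the source, move only the summary.
pushed = remote.execute(
    "SELECT region, SUM(bytes) FROM events GROUP BY region").fetchall()
print(len(pushed))  # 2 rows moved across the wire
```

At petabyte scale the same principle holds, except the upstream reduction would itself run in parallel (e.g., an Elastic MapReduce job next to the S3 data) rather than as a single query.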
Madsen's solution for on-demand data access borrows from DV's concept of abstraction between physical sources and targets and also makes use of self-service features and concepts. It exploits the equivalent of query federation -- the underlying technology that enables DV -- to knit together distributed data sources, be they local or far-flung.
This means a data scientist who wants to use a Spark cluster to perform her analysis should be able to initiate access to the data she needs from Spark, no matter where this data is located. She shouldn't have to open up a separate program or go to another system. The solution would bring the data she needs to her, along the lines described above.
In Madsen's view, on-demand data access is implemented as a data-fabric middleware, permitting self-serving users (or IT itself) to monitor data flows and identify jobs that should be instantiated as scheduled processes. (This also promotes reuse, along with repeatability.) On-demand access is center-less too, in that the middleware knits together all systems, such that a user can initiate data access from any system to any system and vice versa. Finally, access is parallel-aware, in that the middleware exploits upstream or downstream parallelism where available.
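A hedged sketch of the query-federation idea underneath this: the same logical query is fanned out to several independent sources and the partial results are knitted together. The source names and schema below are invented, and SQLite in-memory databases stand in for the warehouse, Hadoop, and cloud systems.

```python
import sqlite3

def make_source(rows):
    """Create a stand-in source with a shared logical schema."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (user TEXT, score INTEGER)")
    db.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return db

sources = {
    "warehouse": make_source([("ann", 9), ("bob", 4)]),
    "hadoop":    make_source([("cat", 7)]),
    "cloud":     make_source([("dan", 2)]),
}

def federated_query(sql):
    # Center-less in spirit: any participating system could run this
    # fan-out; no single engine owns the data.
    results = []
    for db in sources.values():
        results.extend(db.execute(sql).fetchall())
    return results

high = federated_query("SELECT user, score FROM t WHERE score > 5")
print(sorted(high))  # [('ann', 9), ('cat', 7)]
```

A real data fabric would also push predicates down to each source (as in the earlier movement example) and handle type conversion between systems; this sketch shows only the knitting-together step.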
The Very Model for an On-Demand Access Solution for Exploratory Use Cases
One good model for the on-demand access solution Madsen envisions is Teradata's QueryGrid technology. QueryGrid enables the equivalent of a center-less data fabric that serves both the more predictable needs of traditional information consumers and those of exploratory users, who require on-demand, self-initiated access to data.
"The query federation approach Teradata takes with QueryGrid is suitable for ad hoc use, where data requirements cannot be anticipated in advance. With QueryGrid, the user can [initiate] access from [the environment] they're most comfortable in. They write a series of queries to explore the data so they can identify the data they actually need. If necessary, they can [use queries to] move the [relevant] data or create views [i.e., presentations of the data]," Madsen says, noting that users can instantiate these as repeatable and reusable data flows.
QueryGrid addresses several other critical issues, too. Madsen points to Teradata's focus on high concurrency and QueryGrid's support for platform-specific optimizations as examples. The upshot, he argues, is that even if a consensus modern data architecture doesn't yet exist, QueryGrid is emerging as a credible, albeit Teradata-specific, alternative. "QueryGrid hides the technical complexity of accessing data on multiple platforms. Access [is via] a single SQL dialect that is available on any connected platform," Madsen says.
"This [abstraction of complexity] extends to data movement, too, so QueryGrid automatically performs type conversion if data is moved between systems, or linked [from one system to another]. It's parallel- and location-aware, too, so [it] tries to move data as efficiently as possible. All of this makes for an easier-to-use environment [for exploratory users]. That's the goal."