Why Data Management and App Dev Must Come Together
For too long, data management and application development have been tracking on two radically different vectors, each with its own distinctive momentum. That must change.
- By Stephen Swoyer
- November 18, 2014
At this summer's O'Reilly Open Source Convention (OSCON), Byron Ruth, a lead analyst/programmer with the Children's Hospital of Philadelphia (CHoP), spoke about a topic that's near and dear to the hearts of data management (DM) practitioners everywhere: ETL.
The title of Ruth's talk, appropriately enough, was ETL: The Dirty Little Secret of Data Science.
Although ETL (data preparation, data transformation, or what the IEEE calls "data engineering") is a well-defined problem for DM, it's a comparatively new one for mainstream IT practitioners.
On the other hand, Ruth and his team at CHoP aren't in any sense "mainstream" IT practitioners: CHoP, a preeminent research hospital, is attached to the University of Pennsylvania. Given that data engineering is a well-defined problem for research -- and it is -- you'd expect the ETL approach Ruth outlined in his OSCON presentation to partake of, or at least be consistent with, DM-like concepts. It did.
Although Ruth discussed problems or challenges that should be familiar to DM practitioners (e.g., the means of access to data; the kinds of transformations and manipulations that are required to prepare data for analysis), he did so using somewhat different language. Not necessarily radically or even disturbingly different -- just different.
Take, for example, the problem of what in DM is known as "data lineage." It's no less important in preparing (and controlling for) the use of data in medical research than in enterprise data management, as Ruth himself acknowledged -- except Ruth didn't use the term "lineage," opting instead for a semantically similar term: "data provenance." This, he explained, is "a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing."
What Ruth calls "provenance" is essentially what a DM practitioner would call "lineage": both terms describe the same concept and (all things being equal) have the same practical scope.
Does this matter? Isn't it arguably just another instance of what research analyst Marc Demarest, a principal with information management consultancy Noumenal Inc., likes to call "terminological vacuity?"
No, actually, it isn't -- and Ruth's preference for the term "provenance" isn't arbitrary or idiosyncratic. In fact, the hugely influential World Wide Web Consortium (W3C) has already defined a "data provenance" standard -- viz., "Provenance," or "PROV-DM" -- for application developers.
According to the W3C, "PROV-DM" is "a core data model for provenance for building representations of the entities, people and processes involved in producing a piece of data or thing in the world. PROV-DM is domain-[agnostic], but [is characterized by] well-defined extensibility points allowing further domain-specific and application-specific extensions to be defined." PROV-DM, the W3C continues, also has its own "abstract syntax notation" (viz., PROV-ASN) "which allows serializations of PROV-DM instances to be created for human consumption, which facilitates its mapping to concrete syntax, and which is used as the basis for a formal semantics."
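The W3C's own serializations (such as PROV-N) are beyond the scope of this piece, but the core model is simple enough to sketch. The Python fragment below is an illustrative sketch only -- the class and field names are mine, chosen to mirror PROV-DM's vocabulary, not an official PROV library or API -- showing how the model's three core types (entity, activity, agent) and a "was generated by" relation might be represented:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch of PROV-DM's three core types. The names
# mirror the standard's vocabulary but are hypothetical, not an API.

@dataclass
class Agent:
    """A person, institution, or piece of software."""
    identifier: str

@dataclass
class Activity:
    """Something that occurs over time and acts on entities."""
    identifier: str
    performed_by: Agent

@dataclass
class Entity:
    """A piece of data or a thing in the world."""
    identifier: str
    was_generated_by: Optional[Activity] = None

# An analyst (agent) runs an ETL job (activity) that produces
# a cleansed research data set (entity).
analyst = Agent("chop:analyst-42")
etl_run = Activity("chop:etl-run-2014-11-18", performed_by=analyst)
cohort = Entity("chop:cohort-table-v3", was_generated_by=etl_run)

# Walking the relations backward answers the provenance (or, in DM
# terms, lineage) question: who and what produced this data?
print(cohort.was_generated_by.performed_by.identifier)  # chop:analyst-42
```

Whether one calls the resulting record "provenance" or "lineage," the information captured is the same.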
Different Domains, Different Discourses
Does this matter? Yes, actually, it does.
At a high level, it's indicative of two radically different domains -- viz., DM and application development -- each of which has its own distinctive force and direction. Think of them, then, as two not-at-all-parallel vectors, each encoded with its own velocity (that is, well-defined terminology, practices, and methods) and directionality (which takes the form of domain-specific priorities, values, and concerns).
Mark Madsen, a research analyst with Third Nature Inc., points to what he says is another frustrating example: the use of predicates such as "structured" and "unstructured" to describe data. The term "unstructured data" is particularly fraught, according to Madsen: what's "unstructured" to a data warehouse architect probably isn't "unstructured" to an application developer, and vice versa.
That there's a vice versa isn't the issue here, says Madsen. Just about everything has structure, from the logs created by systems or devices to the event messages generated by embedded sensors to the fields (checkboxes, validators, or drop-down options) exposed by form-entry systems.
Syntax is structure. Semantics is structure, Madsen says: "Think of it this way: a programmer ... actually created all of this 'unstructured' data. It's generated by data structures stored inside code that [this programmer] wrote. On these terms, it has structure, so why do we keep calling it 'unstructured?' Why do we keep treating it as if it's 'unstructured?'"
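Madsen's point can be made concrete. The log line below looks "unstructured" to a warehouse architect, but it was emitted by a data structure in a programmer's code, and that structure is fully recoverable. (The field names and log format here are invented for illustration.)

```python
import re

# A programmer's data structure...
event = {"device": "sensor-7", "temp_c": 21.5, "status": "OK"}

# ...serialized into a log line of the kind often filed
# under "unstructured" data.
line = "device={device} temp_c={temp_c} status={status}".format(**event)

# The structure never went away: a simple pattern recovers every field.
pattern = re.compile(r"device=(\S+) temp_c=(\S+) status=(\S+)")
device, temp_c, status = pattern.match(line).groups()
recovered = {"device": device, "temp_c": float(temp_c), "status": status}

print(recovered == event)  # True
```

The round trip succeeds precisely because the "unstructured" line was generated by structured code in the first place.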
Thankfully, data warehouse architects and application developers have reached a consensus of some kind concerning the data that's most structured: viz., normalized data sourced from OLTP systems, tabular data, or other kinds of data that can easily be expressed in tuples.
In other words: the kind of stuff that's traditionally managed by relational database systems.
Ironically, this very distinction between "structured" data (which describes data that's in a normalized or tabular format) and "unstructured" data (which includes everything else) has helped produce a NoSQL phenomenon that Madsen argues is both overly simplistic and frustratingly naïve. On the first count, the distinction between "structured" and "unstructured" data isn't just arbitrary -- as Madsen argues above -- it also ignores how data is disseminated and used in practice.
"You are transmitting schema-less, metadata-less data. [From an app-dev perspective] we aren't dealing in objects any more, so leave object-orientation out of this. An object contains information and behavior and hides that data structure from the outside world. In the data collection and use business ... multiple sources of data have to be read. [We] can't ask developers what that is, [each and every time we need it]: instead, we need metadata and schema to figure it out. Most of the time, we have to read code to figure it out anyway," he points out.
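To see why that matters, consider a consumer reading "schema-less" JSON from two upstream sources. With no declared metadata, the reader's only option is to infer field names and types by inspection -- exactly the code-reading exercise Madsen describes. (This is a hypothetical sketch; the sources and field names are invented.)

```python
import json

# Two upstream apps emit the "same" record with no declared schema.
source_a = '{"cust_id": 101, "signup": "2014-11-18"}'
source_b = '{"customerId": "101", "signup_date": "11/18/2014"}'

def infer_schema(raw: str) -> dict:
    """Infer field names and types by inspection -- the only option
    when no metadata accompanies the data."""
    record = json.loads(raw)
    return {name: type(value).__name__ for name, value in record.items()}

schema_a = infer_schema(source_a)
schema_b = infer_schema(source_b)

# Same logical entity, incompatible inferred schemas: the names differ,
# and the customer ID is an int in one feed and a string in the other.
print(schema_a)  # {'cust_id': 'int', 'signup': 'str'}
print(schema_b)  # {'customerId': 'str', 'signup_date': 'str'}
```

Declared schema or metadata would resolve the mismatch up front; without it, every consumer rediscovers it the hard way.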
As for naïveté: customers also fail to consider how extensively a NoSQL repository will be used in practice. Developers tend to code to (and use) what they're comfortable with; if they're more comfortable coding for NoSQL, that's what they'll use -- even for applications (e.g., those involving normalized data from OLTP or other structured sources) for which NoSQL is not optimal.
We're All to Blame
None of this is to say that the W3C, application developers, or the NoSQL community should and must defer to the authority of DM -- for example, because it was there first, or because it can claim to have more experience (or more specialized expertise) with respect to the problem of data engineering. The maturity, soundness, or viability of DM's terminology, practices, or methods isn't the issue here.
The issue has precisely to do with the phenomenon of parallel -- and largely ignored -- evolutionary trends in both camps. The upshot is that data management does its thing without necessarily taking note of (or influencing) what's happening over in app-dev; app-dev, meanwhile, goes about its business without necessarily taking note of (or influencing) what's happening over in data management. Both camps could and should learn from (and influence) one another.
On the app dev side, the shift to "DevOps" -- a fusion of the software development life cycle with system operations that emphasizes rapid-fire, iterative application development; scheduled (and frequent) production of application deliverables; a rapid deploy-and-test cycle; and the integration of operations with all of these phases -- is fundamentally transforming how, and how frequently, apps are developed and delivered. (It is likewise altering what application consumers expect and demand from their internal IT departments -- by far the more important of the two changes.)
Data management has been largely insulated from the DevOps phenomenon: when BI luminary Colin White invoked the term at a prominent industry conference this summer, more than a few attendees had blank looks on their faces. (Similarly, White's simultaneous call-out to "Git," the hugely popular version control system that powers an increasing number of open source and non-open source projects alike, prompted at least one attendee to ask for clarification.)
Data management should and must absorb some of the best qualities of DevOps -- the integration of testing and operations with development as well as rapid, consumer-oriented development and deployment -- to the degree feasible. The claim that DM is somehow a special case, that the constraints of governance, consistency, and an overriding need for "quality" effectively rule out DevOps-like innovations, either in whole or in part, is untenable.
At the same time, data management isn't a house of cards. It comprises a set of methods, practices, and concepts that have been hardened by time and practice. Data engineering is a hard problem -- and it's made even harder by virtue of the non-technological issues (mostly involving people and process) that data management practitioners have been grappling with for decades. Application developers (including DevOps innovators) and the NoSQL community should and must benefit from this expertise. If the feverish pace of activity in the Hadoop space -- much of it focused on developing Hadoop-specific data management features, such as ANSI SQL query support and ACID-like transactional rigor -- is any indication, many in the NoSQL community recognize this.
Madsen spoke truth to power when -- at GOTO's International Software Development Conference 2013 in Aarhus, Denmark -- he told a gathering of developers: "We're both at fault."
What's important isn't finger-pointing, he urged, but outreach, engagement, and collaboration. Developers and DM practitioners have much to learn from one another. Their respective domains aren't merely symbiotic; they're systemic: both ultimately support the same revenue-generating, decision-making business organism, and both groups have a responsibility to learn from and to help one another. "When you change code, you change the output: you change the data itself; this creates downstream incompatibilities," Madsen argued.
The same, he stressed, could just as easily be said of folks on the DM side.