Q&A RE: The State of Big Data Integration
It’s still early days, but users are starting to integrate big data with enterprise data, largely for business value via analytics.
By Philip Russom, TDWI Research Director for Data Management
A journalist from the IT press recently sent me an e-mail containing several very good questions about the state of big data relative to integrating it with other enterprise data. Please allow me to share the journalist’s questions and my answers:
How far along are enterprises in their big data integration efforts?
According to my survey data, approximately 38% of organizations don’t even have big data, by any definition, so they’ve no need to do anything. See Figure 1 in my 2013 TDWI report Managing Big Data. Likewise, 23% have no plans for managing big data with a dedicated solution. See Figure 5 in that same report.
Even so, some organizations have big data, and they are already managing it actively. Eleven percent have a solution in production today, with another 61% coming in the next three years. See Figure 6.
Does data integration now tend to be haphazard, or one-off projects, in many enterprises, or are architectural strategies emerging?
I see all of the above, whether with big data or the usual enterprise data. Many organizations have consolidated most of their data integration efforts into a centralized competency center, along with a centrally controlled DI architecture, whereas a slight majority tend to staff and fund DI on a per-application or per-department basis, without an enterprise strategy or architecture. Personally, I’d like to see more of the former and less of the latter.
What are the best approaches for big data integration architecture?
Depends on many things, including what kind of big data you have (relational, other structures, human language text, XML docs, etc.) and what you’ll do with it (analytics, reporting, archiving, content management). Multiple big data types demand multiple data platforms for storing big data, whereas multiple applications consuming big data require multiple processing types to prepare big data for those applications. For these reasons, in most cases, managing big data and getting business use from it involves multiple data management platforms (from relational DBMSs to Hadoop to NoSQL databases to clouds) and multiple integration tools (from ETL to replication to federation and virtualization).
Furthermore, capturing and integrating big data can be challenging from a data integration viewpoint. For example, the streaming big data that comes from sensors, devices, vehicles, and other machines requires special event-processing technologies to capture, triage, and route time-sensitive data—all in a matter of milliseconds. As with all data, you must transform big data as you move it from a source to a target, and the transformations may be simple (moving a click record from a Web log to a sessionization database) or complex (deducing a fact from human language text and generating a relational record from it).
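To make the "simple" end of that spectrum concrete, here is a minimal sketch of sessionization in Python. It is only an illustration, not a description of any vendor's tool: the Click record, the sessionize function, and the 30-minute inactivity timeout are all assumptions made up for this example. The idea is to sort click records by user and time, then start a new session whenever the gap between consecutive clicks exceeds the timeout.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby
from operator import attrgetter

# Assumed inactivity timeout; real systems choose their own value.
SESSION_GAP = timedelta(minutes=30)


@dataclass(frozen=True)
class Click:
    """Hypothetical click record as it might be parsed from a Web log."""
    user_id: str
    timestamp: datetime
    url: str


def sessionize(clicks, gap=SESSION_GAP):
    """Return (user_id, session_number, click) rows, one per click.

    A new session starts at a user's first click and whenever the time
    since that user's previous click is longer than `gap`.
    """
    rows = []
    ordered = sorted(clicks, key=attrgetter("user_id", "timestamp"))
    for user_id, user_clicks in groupby(ordered, key=attrgetter("user_id")):
        session_number = 0
        previous = None
        for click in user_clicks:
            if previous is None or click.timestamp - previous.timestamp > gap:
                session_number += 1
            rows.append((user_id, session_number, click))
            previous = click
    return rows
```

Each returned row could then be loaded into a sessionization table keyed on user and session number. The complex end of the spectrum, such as deducing facts from human language text, would need far more than this.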
What "traditional" approaches are being updated with new capabilities and connectors?
The most common data platforms used for capturing, storing, and managing big data today are relational databases, whether based on MPP, SMP, appliance, or columnar architectures. See Figure 16 in the Managing Big Data report. This makes sense, given that in a quarter of organizations big data is mostly or exclusively structured data. Even in organizations that have diverse big data types, structured and relational types are still the most common. See Figure 1.
IMHO, we’re fortunate that vendors’ relational database management systems (RDBMSs) (from the old brands to the new columnar and appliance-based ones) have evolved to scale up to tens and hundreds of terabytes of relational and otherwise structured data. Data integration tools have likewise evolved. Hence, scalability is NOT a primary barrier to managing big data.
If we consider how promising Hadoop technologies are for managing big data, it’s no surprise that vendors have already built interfaces, semantic layers, and tool functionality for accessing a broad range of big data managed in the Hadoop Distributed File System (HDFS). This includes tools for data integration, reporting, analysis, and visualization, plus some RDBMSs.
What are the enterprise "deliverables" coming from users’ efforts with big data (e.g., analytics, business intelligence)?
Analytics is the top priority and hence a common deliverable from big data initiatives. Some reports also benefit from big data. A few organizations are rethinking their archiving and content management infrastructures, based on big data and the potential use of Hadoop in these areas.
How is the role of data warehousing evolving to meet the emergence of Big Data?
Big data is a huge business opportunity, with few technical challenges or downsides. See Figures 2 through 4 in the report Managing Big Data. Conventional wisdom says that the opportunity for business value is best seized via analytics. So the collection, integration, and management of big data is not an academic exercise in a vacuum. It is foundational to enabling the analytics that give an organization new and broader insights. Any calculus for the business return on managing big data should be based largely on the benefits of new analytics applied to big data.
On April 1, 2014, TDWI will publish my next big report on Evolving Data Warehouse Architectures in the Age of Big Data. At that time, anyone will be able to download the report for free from www.tdwi.org.
How are the new platforms (such as Hadoop) getting along with traditional platforms such as data warehouses?
We say “data warehouse” as if it’s a single monolith. That’s convenient, but not very accurate. From the beginning, data warehouses have been environments of multiple platforms. It’s common that the core warehouse, data marts, operational data stores, and data staging areas are each on their own standalone platforms. The number of platforms increased early this century, as data warehouse appliances and columnar RDBMSs arrived. It’s now increasing again, as data warehouse environments now fold in new data platforms in the form of the Hadoop Distributed File System (HDFS) and NoSQL databases. The warehouse has always evolved to address new technology requirements and business opportunities; it’s now evolving again to assure that big data is managed appropriately for the new high-value analytic applications that many businesses need.
For an exhaustive discussion of this, see my 2013 TDWI report Integrating Hadoop into Business Intelligence and Data Warehousing.
Posted by Philip Russom, Ph.D. on January 22, 2014