TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

Data Integration and Analytic Heterogeneity

If you want to understand data integration in an age of analytic heterogeneity, you must follow the process: process movement, not data or workload movement, is where it's at.

By Stephen Swoyer
April 8, 2014

When it comes to data integration (DI), the industry's fixation on data movement misses the point, says Rick Glick, vice president of technology and architecture with Actian Inc.

"I think that we need to get beyond looking at integration as [a question of] just data flowing between systems and start talking about process movement, [which means] driving processes to the right place in your environment," says Glick, who was CTO for analytic database specialist ParAccel Inc. "There's lots of interesting engines with lots of interesting capabilities, and in most cases, you're going to want to use best-of-breed -- the best [engine] for the purpose. This may or may not be what [the] vendor [with which] you've spent several million dollars wants it to be."

As Glick sees it, this has to do with the heterogeneity of advanced analytics, which requires more of DI than did traditional business intelligence (BI) and data warehousing (DW).

Data integration is an enabling technology for both BI and analytics. DI for traditional BI is a relatively straightforward proposition: its focus is the DW, which in most cases is also its terminus. The BI or analytic discovery use cases change this, but they still work almost exclusively with SQL or semi-structured data. DI for advanced analytics is a very different proposition, however: advanced analytic processes tend to consist of multiple analytical workloads and mix traditional structured (SQL) data with multi-structured data from semi-structured (machine logs, event messages), semantic (texts, e-mail messages, documents, blog postings), and file-based (audio and video files, etc.) sources. All of this data must somehow be staged, transformed, and prepared for initial analysis, which -- by definition -- is itself a mere prelude to additional analysis.

Hence the emphasis on process: from a DI perspective, the data in an analytic process will be staged, transformed, and moved multiple times, for multiple kinds of analysis, with (usually, but not always) a goal of producing smaller and more refined data sets. Movement is a part of this, but not the most important part. In traditional BI, by contrast, DI is its own discrete process.

"You see people rushing to put SQL interfaces on other databases, to make it easier to get at [access, manipulate, move] data, but this kind of misses the point: it's not really about the language, although I do think it's better to not have to have hordes of [Java] programmers and [be able] to get a little leverage from the SQL ecosystem, but it's not really about language, it's not really about making it easier to get at the data -- it's about having a diverse set of capabilities [on different systems] working together to make it easier for data to flow [between systems] as part of process."

Automating the Process

When Actian acquired ParAccel last April, Glick got a chance to make this vision a reality. He points to Actian's acquisition of the former Pervasive Software Corp. in early 2013, which gave Actian best-of-breed DI technology. In DataRush, Pervasive had developed a DI and analytic technology that could run natively (as a parallel processing engine) across the Hadoop Distributed File System (HDFS), as well as on its own -- i.e., as a traditional ETL platform, albeit one that's able to scale linearly across dozens of processor cores in large SMP system configurations.

DataRush, which Actian has rechristened "DataFlow," takes this one step further by embedding in-process analytics in its DI routines. This was a step in the right direction, argues Glick.

"DataFlow is kind of an interesting piece because it has a bunch of data mining algorithms, a bunch of transformational algorithms, and a bunch of connectivity to a variety of data sources. However, the most important part of this is that [DataFlow] can reside on the same node where that data is sourced," he points out. To illustrate what he means, Glick uses a "bogus" -- i.e., oversimplified -- example involving a Cassandra data store.

"Let's say I'm doing something inside of Cassandra that's a bit OLTP-ish in nature, because Cassandra is really good at that kind of stuff, but I want to take that and do a regression [analysis]. DataFlow allows us to read from Cassandra and do a regression in the same physical platform, then we can take that and join it with some things going on in Matrix."

Ideally, Glick explains, all of this would happen automatically: the process itself is automated such that workloads get scheduled and kicked off (in ordered sequence) on separate systems, data gets moved from one platform to another at the right time, and so on. Eventually, a subset of the data in Cassandra gets moved to -- or persisted in -- Matrix, which is what Actian now dubs the former ParAccel massively parallel processing (MPP) database. From Actian's perspective, Glick says, it could just as easily be moved to (or "flow to") an Oracle, IBM DB2, or Microsoft SQL Server DW, too.

"In this example, I'm driving dataflows in Cassandra, DataFlow, and Matrix and I'm able to move around a minimum set of data to give me an answer. The process is automated and has fewer moving parts," he explains. "This is a far different story from, 'I do some ETL work in Cassandra and then pipe [that data] into R, where I do a regression, then I take the regression and ETL that into another platform.' A lot of work today is focused on doing these sorts of spoke integrations, but not really pushing process along. Process is subordinated to architecture."
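
To make the "minimum set of data" idea concrete, here is a minimal Python sketch of Glick's oversimplified example. It is not Actian's DataFlow API or a Cassandra client: `Node`, `regress_at_source`, and `join_with_warehouse` are hypothetical names, an in-memory list stands in for the operational store, and a small dictionary stands in for a table in the analytic database. The point is only that the regression runs where the rows live, and just the fitted coefficients travel onward to be joined.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a platform that stores rows and can run work locally."""
    name: str
    rows: list  # each row is (key, x, y)


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def regress_at_source(node):
    """Run the regression on the node that holds the rows; return one small result per key."""
    groups = {}
    for key, x, y in node.rows:
        groups.setdefault(key, []).append((x, y))
    return {key: fit_line(points) for key, points in groups.items()}


def join_with_warehouse(results, dimension):
    """Join the shipped coefficients with a (hypothetical) warehouse dimension table."""
    return [
        {"key": key, "region": dimension[key], "slope": slope, "intercept": intercept}
        for key, (slope, intercept) in sorted(results.items())
        if key in dimension
    ]


# Made-up data: six raw rows at the source, two keys.
source = Node("operational-store", [
    ("a", 1.0, 3.0), ("a", 2.0, 5.0), ("a", 3.0, 7.0),
    ("b", 1.0, 1.0), ("b", 2.0, 1.5), ("b", 3.0, 2.0),
])
dimension = {"a": "west", "b": "east"}

results = regress_at_source(source)          # work happens next to the data
joined = join_with_warehouse(results, dimension)

print(f"rows at source: {len(source.rows)}, results moved: {len(results)}")
for row in joined:
    print(row)
```

In the "spoke integration" alternative Glick describes, all six rows would be extracted to a separate statistics tool first. Here only two coefficient pairs cross the platform boundary.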

This is still just an ideal vision, Glick concedes: ParAccel's been a part of Actian for just under a year, Pervasive for slightly more. Actian is still fitting pieces together, engineering and coding, and so on, says Glick -- but intra-process data flow of this kind is the ultimate goal. It even has an irresistible logic: run the constitutive parts of an analytic process where it's most cost-effective to do so -- with "cost" understood as a function of processing and storage requirements, data movement, and, of course, time. "Today, it's a hybrid. Software will figure out some of this for you, tools will figure out some of this for you, but the person using the tools will have to figure out most of it," he comments.

"The plan is for the software to ultimately be smart enough to do this for you. You'd say, here are my interfaces, fire off a job, and the system will figure out the best way to make that work, based on the typical cost and actually based even on the platform costs of the [constitutive] workloads."
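
One way to picture what "the system will figure out the best way" might involve is a small cost-based planner. The Python sketch below is an illustration under stated assumptions, not a description of Actian's software: the function `plan_placement`, the engine labels, and every cost figure are invented, and "cost" is collapsed into just two terms, a per-step run cost on each engine and a per-gigabyte charge for moving a step's input between engines. It walks the ordered steps of a process and picks the cheapest engine for each one.

```python
def plan_placement(steps, engines, run_cost, move_cost_per_gb, start_engine):
    """Choose an engine for each ordered step so that total cost is minimal.

    steps: list of (step_name, input_gb), in execution order
    run_cost: dict mapping (step_name, engine) to the cost of running that step
              there; a missing entry means the engine cannot run the step
    move_cost_per_gb: charge applied to a step's input when it changes engines
    start_engine: where the data lives before the first step
    Returns (total_cost, [engine chosen for each step]).
    """
    # best[engine] = cheapest (cost, path) that ends with the data on that engine
    best = {start_engine: (0.0, [])}
    for step, input_gb in steps:
        candidates = {}
        for engine in engines:
            if (step, engine) not in run_cost:
                continue  # this engine does not support the step
            for previous, (cost, path) in best.items():
                move = 0.0 if previous == engine else input_gb * move_cost_per_gb
                total = cost + move + run_cost[(step, engine)]
                if engine not in candidates or total < candidates[engine][0]:
                    candidates[engine] = (total, path + [engine])
        if not candidates:
            raise ValueError(f"no engine can run step {step!r}")
        best = candidates
    return min(best.values(), key=lambda item: item[0])


# All names and numbers below are hypothetical.
engines = ["store", "stats", "warehouse"]
steps = [("filter", 100.0), ("regress", 10.0), ("join", 0.01)]
run_cost = {
    ("filter", "store"): 5.0,
    ("filter", "stats"): 4.0,
    ("regress", "store"): 8.0,
    ("regress", "stats"): 3.0,   # faster here, but the input would have to move
    ("join", "store"): 20.0,
    ("join", "warehouse"): 2.0,
}

total_cost, placement = plan_placement(
    steps, engines, run_cost, move_cost_per_gb=1.0, start_engine="store"
)
print(placement, round(total_cost, 2))
```

With these made-up numbers the planner keeps the regression on the source engine even though the statistics engine runs it faster, because shipping the 10 GB input would cost more than the time saved. Only the tiny regression output moves to the warehouse for the join, which matches the "minimum set of data" behavior Glick describes.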
