Big Data Management Comes of Age

Several start-ups market products designed to address the lack of data management features for big data platforms. For example, Diyotta provides a scalable architecture for big data integration including old-world enterprise amenities such as metadata management.

The challenge with big data integration is getting data -- the right data -- into and out of big data platforms. The difficulty stems from the immaturity of critical data management services for big data systems, such as robust metadata management and resilient automation features.

This is true of the big data ingest cycle, which lacks anything comparable to the automated, repeatable ETL and/or ELT processes that undergird data ingest in the data warehousing world.
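
To make the contrast concrete, here is a minimal sketch of the kind of parameterized, idempotent ingest job the warehousing world takes for granted. It's written in PySpark; the paths and the run_date parameter are hypothetical, and it represents the general pattern rather than any particular vendor's implementation.

    import sys
    from pyspark.sql import SparkSession

    def ingest(run_date: str) -> None:
        """Load one day's slice of landing-zone JSON into the warehouse."""
        spark = SparkSession.builder.appName("daily_ingest").getOrCreate()
        # Hypothetical landing-zone layout: one directory per day.
        src = spark.read.json(f"s3://landing/events/{run_date}/")
        # Overwriting the date-scoped target path makes the job idempotent:
        # re-running the same date replaces that slice rather than duplicating it.
        src.write.mode("overwrite").parquet(
            f"hdfs:///warehouse/events/dt={run_date}/")
        spark.stop()

    if __name__ == "____main__".replace("____", "__"):  # i.e., __main__
        ingest(sys.argv[1])  # e.g., spark-submit ingest.py 2016-04-01

Because the job is driven entirely by its date parameter, a scheduler can run -- or re-run -- any slice without manual cleanup, which is precisely the automated, repeatable property that's at issue here.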

It's even more true of the big data extraction process, said Krish Krishnan, president and founder of information management consultancy Six Sense Advisors. There's a big difference between querying against or manually extracting data from Hadoop, MongoDB, Cassandra, Spark, and other big data environments and doing so in the context of a consistent, resilient, repeatable process.

"The problem ... that I am seeing [with respect to big data] is that ... people don't know either how to push data in effectively or pull data out effectively," Krishnan said during a recent analyst call.

Krishnan noted that one of his clients ingested approximately 600 TB of data into its Hadoop system -- only to discover that it's far easier to load data into Hadoop than to get it back out again.

"The problem is that they don't know how to make it a repeat[able] process and they don't know how to pull data out. In the world of data integration, there's a blockage that I'm seeing in the world of Hadoop. They have no clue on how to do repeat[able] ETL/ELT work, they have no clue how to do CDC [change data capture]," Krishnan said.

"More important, once the data is in Hadoop, people are struggling about how to operationalize on top of it because they don't know how to get the data back out successfully. There are a lot of tools ... but the answers that people want versus what they're getting, there's a huge gap. I can talk about Spark all day long. I can talk about SQL-on-Hadoop all day long. The question is how do you get this across [to prospective adopters]."

Upstart Solutions

This isn't a new problem. It's arguably the primary reason Teradata Corp. acquired the former Revelytix almost two years ago; Teradata now markets Revelytix's metadata management and big data preparation assets as Teradata Loom. It's also the reason data integration (DI) powerhouse Informatica Corp. announced a new Informatica Big Data Management offering late last year, complete with metadata management, security and governance, and data lineage-tracking services for big data platforms.

It's the backstory behind the emergence of start-ups such as Alation, which markets metadata management technology for Hadoop and other big data platforms, and Tamr, which markets technology for profiling, cataloging, and cleansing data that's siloed in traditional and big data sources. It's likewise the logic behind several open source projects, including Apache Atlas. (Alation recently announced a strategic partnership with Teradata whereby "Alation will be the primary solution for Teradata customers seeking a product to help increase the productivity of their data consumers as well as the effectiveness of their data stewards," according to a blog post by Alation CEO Satyen Sangani. Interestingly, Alation's software overlaps, in whole or in part, with Teradata's own Loom technology.)

Finally, it's one of the reasons Diyotta exists. The start-up claims to offer a scalable, resilient architecture that's fit for any data integration scenario. "Our core architectural principle is to use the platform for what it is best for," said CEO Sanjay Vyas, referring to his company's flagship offering, the Diyotta Modern Data Integration Suite.

"It could be just [used for] sending the data. Let's say mobile [data sources are] sending JSON data: we collect it, we process it, and we send it to Hadoop -- or it could be cloud. You could have a data lake in the cloud, you could have [an] MPP [database] sitting on prem[ises]. You could easily use our agent-based architecture [in this scenario]," Vyas said.

Data Integration Challenges

Diyotta's Modern Data Integration Suite addresses three thoroughly modern DI challenges, Vyas pointed out. The first is that of cloud to on-premises data integration -- and vice versa. Apps in the cloud need data from on-premises systems; on-premises apps -- particularly business intelligence (BI) and analytic apps -- need data from cloud apps.

The second DI challenge is what Vyas calls the "digital revolution": integrating and managing polystructured data, such as data from social media sources.

The third challenge has to do with the collapsing distinction between on- and off-premises modes of data integration. Now more than ever, organizations are keen to integrate data from multiple on-premises locations -- some Diyotta customers have presences in every U.S. state, in addition to regions around the world -- as well as from a mix of cloud and social media sources. The challenge is to integrate, as seamlessly as possible, across all of these contexts. It is compounded by the fact that many vendors in this space -- Vyas mentions a prominent DI vendor by name -- require customers to license several different versions of their products to cover traditional, cloud, and big data integration scenarios.

So much for the explicit DI challenges. Diyotta's Modern Data Integration Suite also brings critical data management services to big data integration, Vyas maintained. "Some of the big-data value creation ... is not only about the new world of data, it's also about how you bring the old enterprise standards ... [such as] automation, orchestration, enterprise data governance, data glossary, all of those things [to big data]," he explained. "It is also about speed and agility. How fast you can get those things done, how fast can you deliver these new projects, along with the automation aspects."

The most important challenge of all, Vyas argued, is future-proofing. Big data is a protean beast, and it's likely to remain one for some time to come. Three years ago, the Hadoop platform was ascendant and MongoDB had just eclipsed a $1 billion venture-capital valuation. In just the last 12 months, Apache Spark became The New Thing, Apache Cassandra came on strong, and MongoDB's valuation surged to $1.6 billion. A truly resilient, big data-ready DI platform must tolerate platform migration or even outright architectural transformation, he argued.

"We don't need developers to know Pig [Latin] or Spark SQL or anything related to the new [disruptive platforms]. Even though it matters, what we believe is that we bring the best practices to the table so you leverage that rather than creating some siloed framework which would not scale to the next level in the near future. Today we are talking Spark, but tomorrow it could be something else," he concluded. "[T]oday when you want to, you can port from MapReduce to Spark. For us, it's just a matter of porting your existing code into Spark, or from Spark to a future engine."

About the Author

Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about. You can contact him at [email protected].

