Q&A: What's Ahead for the Data Landscape
We spoke with Ravi Shankar, CMO at Denodo, to learn what's ahead for the role of data integration, and how you can prepare for changes in the data landscape.
Upside: What broad trends will drive the data landscape in 2018?
Ravi Shankar: I expect machine data, edge computing, citizen data integrators, and enterprise metadata catalogs to be the most important trends this year -- in addition to the trends already in progress: cloud, big data, and mobile computing.
Early last year, an article in The Economist ("The world's most valuable resource is no longer oil, but data") explained how, given the evolution of self-driving cars, smart homes, robot-automated production, and drone delivery, data is becoming more pervasive every day. Devices enabled by the Internet of Things (IoT) are adding to this deluge and spewing out tremendous volumes of data every moment.
For example, an aircraft engine has 5,000 sensors and generates up to 10 GB of data per second. On a 12-hour flight, a twin-engine aircraft can therefore generate up to 844 TB of data. All of this machine data needs to be collected, stored, and harnessed for intelligent use. Edge computing promises to process this data at the source (aka the edge) and bring only the results to a central location to be pooled with similar data sets, providing a holistic view of operations.
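The per-flight figure can be checked with a short back-of-the-envelope sketch. It assumes the 10 GB per second rate applies to each engine, that the aircraft has two engines, and that 1 TB is counted as 1,024 GB; the engine count and unit convention are assumptions chosen here because they reproduce the quoted number, and the function name is purely illustrative.

```python
# Back-of-the-envelope check of the per-flight data volume quoted above.
# Assumptions made for this sketch: two engines, 1 TB = 1,024 GB.

GB_PER_SECOND_PER_ENGINE = 10   # from the text
FLIGHT_HOURS = 12               # from the text
ENGINES = 2                     # assumption
GB_PER_TB = 1024                # assumption (binary convention)


def flight_data_tb(gb_per_second, hours, engines, gb_per_tb=GB_PER_TB):
    """Return the total data volume for one flight, in terabytes."""
    seconds = hours * 3600
    total_gb = gb_per_second * seconds * engines
    return total_gb / gb_per_tb


per_flight_tb = flight_data_tb(GB_PER_SECOND_PER_ENGINE, FLIGHT_HOURS, ENGINES)
print(f"{per_flight_tb:.2f} TB per flight")  # 843.75 TB, roughly 844 TB
```

Under the same rate, a single engine counted at 1,000 GB per TB would come to 432 TB, so the result is sensitive to both assumptions.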
I should mention another trend: business users will no longer have to depend on IT for data collection and movement. This is important because nearly every business user, regardless of industry or position, requires data from different systems to perform day-to-day operations. Instead of relying on tech experts, these individuals can now collect the data themselves with simple-to-use data integration tools. These "citizen integrators" can now easily understand what data resides where using business definitions stored in enterprise metadata catalogs.
The rise of citizen integrators and business-oriented tools to aid data movement will help liberate business users from the clutches of IT. Think of this as self-service data integration. As you can see, the world is moving to a decentralized model -- whether it be machine data or edge computing -- and empowering a class of business users to be self-sufficient is part of that movement.
What challenges will organizations face to keep up with the changing data landscape?
The exorbitant data volume and number of systems will continue to present problems for organizations struggling to get their arms around all of this data and deliver it rapidly to business users.
These days, data originates across a multitude of locations -- in data centers, the cloud, third-party data providers, business partners, social media, and now machines. As you can imagine, this data comes in various formats and from different systems. Some of the data is structured and stored in tables in relational databases, star schema data warehouses, and OLAP cubes. Other data is unstructured and stored in Hadoop, NoSQL, and graph databases. Yet other data resides locally in the form of Word or Excel documents. The real challenge is how to bring these different types of data together and normalize them under a single format.
However, data collection is only half of the challenge. The other half involves data delivery -- business users need the data quickly and in a format that they can understand. Compounding the problem is that everyone's needs are different. This creates bottlenecks and a lack of efficiency because it is not easy to convert the data into the various formats business users need and deliver it rapidly. This will only get worse with the amount of machine and sensor data pouring in from these IoT-enabled devices.
How are enterprises addressing these challenges? Why are these efforts not sufficient?
Enterprises are applying the same traditional methods they have used for decades to these new trends. These traditional approaches involve moving the data to a central location and then standardizing it for business consumption. Since the 1990s, companies have relied on data warehouses for storing analytics data and operational data stores for operational data. However, these data repositories handle only structured data.
These new trends generate an enormous amount of unstructured data that cannot be stored in these structured data repositories. Trying to do so is analogous to driving a square peg into a round hole. Even if enterprises try to transform the data into structured components for storage, doing so is prohibitively expensive because of the volume of data. Data warehouses and operational data stores are not cheap.
To overcome these challenges, companies have been adopting Hadoop and NoSQL systems for storing unstructured data. The principle behind these big data systems is to store the data in raw format and then convert it to the desired destination format at the time of consumption by the business users.
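The store-raw, convert-on-consumption principle (often called schema-on-read) can be illustrated with a minimal sketch. The record layout, field names, and the `read_as_celsius` helper are hypothetical and stand in for whatever a real big data system would hold; this is not how Hadoop or any specific NoSQL product implements it.

```python
import json

# Hypothetical raw store: records are kept exactly as they arrived,
# with inconsistent types and fields.
raw_store = [
    '{"sensor": "temp-1", "value": "71.5", "unit": "F"}',
    '{"sensor": "temp-2", "value": 22.0, "unit": "C", "site": "plant-a"}',
]


def read_as_celsius(raw_records):
    """Apply a schema at read time: sensor name plus temperature in Celsius."""
    rows = []
    for line in raw_records:
        record = json.loads(line)
        value = float(record["value"])  # raw values may be strings or numbers
        if record.get("unit") == "F":
            value = (value - 32) * 5 / 9  # convert Fahrenheit to Celsius
        rows.append({"sensor": record["sensor"], "celsius": round(value, 1)})
    return rows


celsius_rows = read_as_celsius(raw_store)
print(celsius_rows)
```

Nothing is normalized when the data is written; each consumer that needs a different shape has to supply its own conversion at read time.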
Even though these systems have solved the format and cost challenges associated with data warehouses and operational data stores, they have not solved the problem of making it easy to convert the data into the formats business users require. Nor can they deliver it quickly, and as a result these data lakes often turn into data swamps.
What role does data integration play in meeting these challenges? Why are current methods insufficient?
Data integration involves moving the data from source systems to destination systems. In the process, current data integration methods such as ETL rely on the notion of "collecting the data" in yet another repository to transform it to the desired format before delivering it to the data consumer.
There are several challenges with this approach. First, data needs to be moved to a central place. We just discussed how data comes in various formats and is stored in a multitude of systems. Although ETL technology has worked well for moving structured data from databases to data warehouses, it cannot scale to meet the volume of machine data or handle those various formats.
Second, ETL requires specialized IT professionals to use specific developer-oriented tools to code the transportation of data, so business users are at the mercy of their tech teams to deliver the data to them, delaying data availability by a few days or weeks.
Finally, ETL takes time to transform and upload the data to the target systems, so data is not available in real time.
How will data virtualization evolve to support these trends?
The best way to make all data available to business users in real time is to simply "connect to the data" using tools such as data virtualization, which allows data to stay where it is -- on premises, in the cloud, or in local systems. Data virtualization performs the required computations at the source systems and exposes only the minimal result sets. It easily scales to new types of machine data and it handles edge computing.
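The idea of computing at the source and returning only a small result set can be sketched as follows. Two in-memory SQLite databases stand in for an on-premises system and a cloud system; the table, column names, and helper functions are hypothetical, and this illustrates the general pushdown pattern rather than the API of Denodo or any other data virtualization product.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory SQLite database standing in for one source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (engine_id TEXT, temperature REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return conn


def pushed_down_summary(conn):
    """Run the aggregation inside the source and return a single small row."""
    return conn.execute(
        "SELECT COUNT(*), SUM(temperature), MAX(temperature) FROM readings"
    ).fetchone()


def virtual_view(sources):
    """Combine per-source summaries without copying the raw rows anywhere."""
    summaries = [pushed_down_summary(conn) for conn in sources]
    count = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    return {
        "readings": count,
        "avg_temperature": total / count,
        "max_temperature": max(s[2] for s in summaries),
    }


on_premises = make_source([("E1", 610.0), ("E1", 630.0)])
cloud = make_source([("E2", 590.0), ("E2", 650.0), ("E2", 620.0)])
summary = virtual_view([on_premises, cloud])
print(summary)
```

Only one summary row per source crosses the boundary, and each query runs against the live source, so the combined view reflects whatever the sources hold at query time. The sketch assumes every source has at least one row.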
Data virtualization also stores the business definitions of the data in an enterprise metadata catalog. Using this catalog, citizen integrators can apply these definitions to create views of the resulting data they want to use. Because data virtualization acts as a pass-through system without storing the data, the data is delivered instantaneously from the source systems, so data is as fresh as the source.
What advice would you give to companies that want to modernize their data architecture?
I would strongly recommend CIOs, CTOs, CDOs, and enterprise and data architects consider making data virtualization the core piece of their modern data architecture.
As in many other parts of the business, change is the only constant in the technology landscape. Today, it could be machine data, edge computing, citizen integrators, and enterprise metadata catalogs. Tomorrow it could be something else. Technologists need to build flexibility into their architecture to meet the demands of the future. They need an enterprise data layer that abstracts the data sources from the data consumers.
Data virtualization is an enterprise data fabric that seamlessly stitches together the data from various sources and delivers it in real time while abstracting the underlying data sources. This liberates business users from the underlying technology changes and enables the organization to sustain its growth without worrying about those technology changes.
What is your vision for Denodo in 2018?
In 2017, we saw data virtualization turn a corner. Even though the technology had been around for more than a decade, many people were still not aware of its powerful capabilities. Then leading analysts such as Gartner started prophesying the "connect vs. collect" concept. Denodo Platform 6.0 included a dynamic query optimizer, a feature that increased the performance of the solution more than tenfold, and our data virtualization platform is uniquely provisioned in the cloud for both AWS and Azure. We also added data management capabilities such as self-service data search and query.
In 2018, we are adding new capabilities that will support the major industry trends we discussed, including in-memory computing to handle large volumes of data from machine and IoT devices, an enterprise metadata catalog to support business glossary functionality, and an intuitive user interface to empower citizen integrators to incorporate and leverage the data by themselves. We will also support additional cloud platforms such as Google Cloud and containers such as Docker.
James E. Powell is the editorial director of TDWI, including the Business Intelligence Journal and Upside newsletter.