Choosing Data Virtualization/Federation Tools
A pilot evaluation project can be an effective way to choose the best data virtualization and federation tools to integrate your EDW and Hadoop systems.
- By David Loshin
- September 6, 2016
Hadoop platforms are increasingly integrated into the enterprise, providing an environment for high-volume data collection, processing, and analytics. However, Hadoop is not yet positioned to replace the enterprise data warehouse (EDW), nor would it make sense for organizations with significant investments in data warehouse technologies to suddenly abandon the business applications running on their EDWs in favor of what is still perceived to be an evolving technology.
For at least the near future, many organizations will use a hybrid computing environment for business intelligence and analytics. That hybrid environment will require methods for sharing data between the core platforms. For example, you must ensure that the data sets needed for MapReduce and Spark applications are moved to the Hadoop/HDFS environment, that the results can be conveyed back to the EDW, and that both can be accessed using visualization and analysis tools.
In other words, ensuring data accessibility and availability across the hybrid environment is a paramount concern for many enterprises.
Benefits of Virtualization and Federation Tools
One of the key challenges is providing transparency to the data consumer, be it a real individual or an automated process. Data virtualization and federation tools enable your enterprise to use data across multiple platforms.
These tools allow a data modeler to layer a semantic data model on top of interfaces that access data from specific sources. The virtualization tool presents the canonical model, and the federation tool translates each request into a query appropriate to each source, then accumulates and packages the results for presentation back to the data consumer.
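The request-translate-accumulate pattern described above can be sketched in a few lines. This is a hypothetical illustration, not any vendor's implementation: two in-memory SQLite databases stand in for the EDW and Hadoop sources, and a simple federator sends the same logical request to each source and merges the results behind one canonical view.

```python
import sqlite3

def make_source(rows):
    """Create an in-memory SQLite database standing in for one data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

# Stand-ins for the two platforms in the hybrid environment
edw = make_source([("east", 100.0), ("west", 250.0)])
hadoop = make_source([("east", 40.0), ("south", 75.0)])

def federated_total_by_region(sources):
    """Send the same logical request to each source, then accumulate
    and package the results for the data consumer."""
    totals = {}
    for conn in sources:
        for region, amount in conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        ):
            totals[region] = totals.get(region, 0.0) + amount
    return totals

print(federated_total_by_region([edw, hadoop]))
# {'east': 140.0, 'west': 250.0, 'south': 75.0}
```

The data consumer sees a single answer per region even though the rows live in two different systems; that transparency is exactly what the virtualization layer is meant to provide.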
Running an Evaluation Project
A pilot evaluation project can be an effective way to choose the best data virtualization and federation tools to integrate your EDW and Hadoop systems. Such a project involves reaching out to selected data virtualization vendors to assess their products' suitability for your hybrid environment.
Your assessment steps should include:
- Determining if the vendor products can be configured to work with your company's Hadoop environment
- Devising an evaluation plan built around one or more typical data access use cases common to your application landscape
- Specifying performance variables for evaluation
- Installing the tools, testing the use cases, and comparing the performance according to the criteria you defined
Configuring an evaluation using a practical pilot project can help you accomplish some specific objectives in comparing and contrasting various tools.
For each tool you evaluate, ensure that you:
- Evaluate the installation and configuration complexity
- Understand the pushdown capabilities of the tools, in which the data requests are transformed into queries that can be "pushed down" to execute using the data source's native environment to take advantage of high-performance platform capabilities
- Understand the optimizations the tools provide to reduce data latency
- Understand the barriers to high performance and what can be done to mitigate those situations
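Pushdown is worth seeing concretely. In this hypothetical sketch (again using SQLite as a stand-in source), the naive federation path pulls every row across and filters locally, while the pushed-down version rewrites the predicate into the source's native query so the filtering executes where the data lives:

```python
import sqlite3

# Stand-in data source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 5), ("b", 12), ("c", 30)])

def without_pushdown(conn, min_clicks):
    """Naive federation: fetch everything, then filter in the
    federation layer -- every row crosses the wire."""
    rows = conn.execute("SELECT user, clicks FROM events").fetchall()
    return [r for r in rows if r[1] >= min_clicks]

def with_pushdown(conn, min_clicks):
    """Pushed-down request: the predicate is rewritten into the
    source's native query, so only matching rows come back."""
    return conn.execute(
        "SELECT user, clicks FROM events WHERE clicks >= ?",
        (min_clicks,),
    ).fetchall()

# Both paths return the same answer; only where the work happens differs.
print(with_pushdown(conn, 10))  # [('b', 12), ('c', 30)]
```

On a Hadoop-scale source, the difference between these two paths is the difference between moving a handful of rows and moving the whole data set, which is why pushdown support belongs on your evaluation checklist.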
Most important, repeating this process with several candidate vendors will help you identify any other potential issues associated with using data virtualization and federation with Hadoop/HDFS.
Because we can expect the conventional EDW to coexist with Hadoop platforms for the foreseeable future, make sure you choose the data virtualization and federation tool that best suits your environment.
About the Author
David Loshin is a recognized thought leader in the areas of data quality and governance, master data management, and business intelligence. David is a prolific author regarding BI best practices via the expert channel at BeyeNETWORK and numerous books on BI and data quality. His valuable MDM insights can be found in his book, Master Data Management, which has been endorsed by data management industry leaders.