Don't Let Data Integration Be the Downfall of Your Cloud Data Lake
How to address the common challenges you'll face when moving data lakes to the cloud.
- By Ravindra Punuru
- October 25, 2019
As organizations deploy data lakes in the cloud, expecting to improve data performance and optimize analytics, they often encounter a variety of unexpected integration challenges.
Cloud data lakes must connect to complex, hybrid environments that include SaaS applications and cloud data warehouses as well as on-premises data sources and data lakes. They're increasingly likely to be part of a multicloud approach that leverages cloud infrastructures from multiple vendors, including Google, Amazon, and Microsoft.
With data lakes in the cloud, it's important to look at the technologies used to connect all the components in the framework. Legacy and custom data integration solutions often stand between organizations and the growing number of data sources they want to explore.
In a previous article, I discussed reasons data lakes are moving to the cloud. Here, I'll review common challenges that arise when making that shift.
For a cloud data lake, you'll need to establish connectivity from various SaaS applications and on-premises source systems to the many components of the cloud platform: storage, compute, database, and data warehouse. Additional connections need to be made as you bring on new data sources.
Legacy frameworks and existing data integration tools typically can't connect on-premises and cloud systems, which leaves you either building each connector by hand (time-consuming, costly, and error-prone) or finding a technology stack that can help.
Cloud-native data integration solutions offer built-in connectivity to the major cloud infrastructure providers as well as hundreds of source systems, including social media channels, SaaS applications, and enterprise data stores. This saves significant time and investment.
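The connectivity pattern these solutions share can be illustrated with a minimal sketch: a uniform connector interface plus a registry, so each new source is added by registering a class rather than hand-wiring a point-to-point pipeline. The class and source names here (`SourceConnector`, `JdbcConnector`) are hypothetical, not from any particular product.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Type


class SourceConnector(ABC):
    """Uniform interface every source (SaaS app, database, file store) implements."""

    @abstractmethod
    def read_records(self) -> Iterable[dict]:
        """Yield records from the underlying source."""


class ConnectorRegistry:
    """Maps a source type name to its connector class, so onboarding a new
    source means registering one class instead of building a custom bridge."""

    def __init__(self) -> None:
        self._connectors: Dict[str, Type[SourceConnector]] = {}

    def register(self, source_type: str, cls: Type[SourceConnector]) -> None:
        self._connectors[source_type] = cls

    def create(self, source_type: str, **config) -> SourceConnector:
        return self._connectors[source_type](**config)


# Hypothetical connector for an on-premises relational source.
class JdbcConnector(SourceConnector):
    def __init__(self, url: str, table: str) -> None:
        self.url, self.table = url, table

    def read_records(self) -> Iterable[dict]:
        # A real connector would query the database; this stub returns a sample row.
        return [{"source": self.url, "table": self.table, "id": 1}]


registry = ConnectorRegistry()
registry.register("jdbc", JdbcConnector)
conn = registry.create("jdbc", url="jdbc:postgresql://prod-db", table="orders")
```

The same registry would hold SaaS and cloud-storage connectors side by side, which is what lets one tool span a hybrid environment.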
Another common challenge when building data lakes in the cloud is the lack of native integration within the cloud frameworks, including security access controls, storage formats, and transformation capabilities.
Data integration architectures that aren't designed with the cloud platform architecture, functionality, and language in mind usually result in data latency, inconsistency, and risk. These elements need to interact seamlessly to keep your data moving quickly:
- Convert data for storage. Every cloud has a different object storage mechanism (for example, Microsoft Azure Blob or Amazon S3), and data must be converted to the optimized format recommended by the cloud vendor (e.g., Apache Avro or Parquet) before it can be uploaded.
- Orchestrate communications between components. As data is moved within the cloud platform's network, from storage to database/data warehouse components, the data integration solution has to be able to orchestrate the communications telling the systems what data to move, where to move it, and what to do with it.
- Leverage cloud compute for transformations. Whether the target destination is a cloud data warehouse, data lake, or processing platform such as Hadoop or Spark, your data integration solution should push down the data processing into the platform with native-language instructions. This is faster, more reliable, and more efficient than moving data to an external ETL solution before loading it into the target system.
- Manage security access controls. When interacting with different frameworks in the cloud, data integration solutions must honor the platform's security features, including access controls (e.g., understanding and honoring which users can access, read, and change the data).
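The pushdown idea in the list above can be sketched as follows: rather than pulling rows through an external ETL server, the integration layer generates native statements that the warehouse executes itself. The Snowflake-style `COPY INTO` syntax and all table and column names here are illustrative assumptions.

```python
def build_pushdown_sql(stage_path: str, staging_table: str,
                       target_table: str, columns: list) -> list:
    """Generate warehouse-native statements (Snowflake-style syntax assumed)
    so the load and transformation run inside the cloud platform's compute
    rather than routing data through an external ETL server."""
    col_list = ", ".join(columns)
    return [
        # 1. Load staged Parquet files into a staging table.
        f"COPY INTO {staging_table} FROM '{stage_path}' "
        f"FILE_FORMAT = (TYPE = PARQUET)",
        # 2. Transform and merge entirely inside the warehouse engine.
        f"INSERT INTO {target_table} ({col_list}) "
        f"SELECT {col_list} FROM {staging_table}",
    ]


statements = build_pushdown_sql(
    stage_path="s3://lake/raw/orders/",
    staging_table="stg_orders",
    target_table="dw.orders",
    columns=["order_id", "customer_id", "amount"],
)
```

Because both statements run in the warehouse's own engine, the data never leaves the platform's network mid-transformation, which is where the speed and reliability gains come from.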
Security is a major concern when moving data from on-premises sources to the cloud. With legacy data movement solutions, data never left the firewall, so many of those tools simply aren't equipped with the security features that cloud ingestion demands.
Specifically, when moving data to the cloud, it should be encrypted using Advanced Encryption Standard (AES) algorithms at the on-premises source before it heads into the cloud and be decrypted only when it arrives at its target destination. Additionally, the frameworks should leverage secure key management components offered by cloud vendors.
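A minimal sketch of that encrypt-at-source, decrypt-at-target flow, using AES-256-GCM from the third-party `cryptography` package (an assumption; any vetted AES implementation would do). In production the key would live in the cloud vendor's key management service rather than being generated locally.

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def encrypt_at_source(plaintext: bytes, key: bytes):
    """Encrypt on premises with AES-256-GCM before data leaves the firewall.
    Returns (nonce, ciphertext); GCM also authenticates the data."""
    nonce = os.urandom(12)  # 96-bit nonce, unique per message
    return nonce, AESGCM(key).encrypt(nonce, plaintext, None)


def decrypt_at_target(nonce: bytes, ciphertext: bytes, key: bytes) -> bytes:
    """Decrypt only after the data arrives at its cloud target."""
    return AESGCM(key).decrypt(nonce, ciphertext, None)


# In practice, create and store this key in the cloud vendor's key management
# service (e.g., AWS KMS or Azure Key Vault) instead of generating it here.
key = AESGCM.generate_key(bit_length=256)
nonce, ct = encrypt_at_source(b"customer records", key)
restored = decrypt_at_target(nonce, ct, key)
```

GCM is worth the extra nonce bookkeeping here because it authenticates as well as encrypts: tampering in transit makes decryption fail loudly.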
If your data integration solution requires custom coding and additional servers to expand data sets or tap into new data sources, you won't soon see the value of the cloud data lake in terms of scalability, speed, and cost.
- Modular architecture. Existing ETL solutions often require additional servers to take on new data, effectively eliminating the cost benefits of the cloud. A more modular architecture would allow you to scale up the solution automatically or with minimum configuration, allowing you to distribute data processes on premises and in the cloud. For example, data could be collected and encrypted locally from on-premises systems and moved to the target systems in the cloud.
- Pipeline development. Consider how easily you can build data pipelines. Do you need a highly specialized staff to build them? Will it require custom coding? Look for solutions that offer fast pipeline development (e.g., a drag-and-drop user interface) and the ability to do it in bulk.
- Metadata. A metadata-driven data integration approach allows enterprises to preserve the location of data and transformation rules; these details are required for auditing in some industries. Find a solution that provides a metadata-as-a-service layer to support data lineage, enable impact assessment, and validate design and operations.
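The metadata-driven approach in the last bullet can be sketched in a few lines: if pipelines are described as metadata records rather than hard-coded logic, lineage and impact analysis fall out of the definitions for free. The source, target, and rule names below are purely illustrative.

```python
from typing import Dict, List

# Pipelines described as metadata rather than code: each entry records where
# a target's data comes from and which transformation rule produced it.
PIPELINE_METADATA: List[Dict[str, str]] = [
    {"source": "crm.contacts", "target": "lake.customers", "rule": "dedupe on email"},
    {"source": "lake.customers", "target": "dw.dim_customer", "rule": "SCD type 2"},
]


def lineage(target: str, metadata: List[Dict[str, str]]) -> List[str]:
    """Walk the metadata backward to list every upstream source of a target
    (the data-lineage question auditors ask)."""
    upstream = []
    for entry in metadata:
        if entry["target"] == target:
            upstream.append(entry["source"])
            upstream.extend(lineage(entry["source"], metadata))
    return upstream


def impact(source: str, metadata: List[Dict[str, str]]) -> List[str]:
    """Walk the metadata forward to list every downstream target a change
    to this source would affect (impact assessment)."""
    downstream = []
    for entry in metadata:
        if entry["source"] == source:
            downstream.append(entry["target"])
            downstream.extend(impact(entry["target"], metadata))
    return downstream
```

For example, `lineage("dw.dim_customer", PIPELINE_METADATA)` traces back through `lake.customers` to `crm.contacts`, and `impact("crm.contacts", PIPELINE_METADATA)` lists everything downstream that a schema change would touch.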
When moving data in multiple directions between cloud and on-premises systems and orchestrating processes within cloud systems, it is critical to have the proper process controls and notifications to protect the integrity of both the process and the data.
Enterprise features such as high availability and process load balancing in data integration solutions are crucial to ensuring data availability for end users. The best solution will have a monitoring approach that ensures loads run properly and that failures and issues are logged so they can be recognized and fixed.
Again, this functionality requires integration with the cloud vendor APIs, which can be complicated within traditional ETL and homegrown data integration solutions.
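A minimal sketch of the monitoring behavior described above: a load wrapper that logs every failure and retries transient errors (such as a throttled cloud API) instead of silently dropping a run. The function and job names are illustrative, not from any particular tool.

```python
import logging
import time
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("load-monitor")


def run_load(load_fn: Callable[[], int], name: str,
             retries: int = 3, backoff_seconds: float = 0.0) -> int:
    """Run a load job, logging each failure and retrying transient errors
    so issues can be recognized and fixed rather than lost."""
    for attempt in range(1, retries + 1):
        try:
            rows = load_fn()
            log.info("load %s succeeded on attempt %d (%d rows)", name, attempt, rows)
            return rows
        except Exception:
            log.exception("load %s failed on attempt %d", name, attempt)
            if attempt == retries:
                raise  # exhausted retries: surface the failure to the scheduler
            time.sleep(backoff_seconds)


# Simulate a load that fails once (e.g., throttled by a cloud API) then succeeds.
attempts = {"n": 0}

def flaky_load() -> int:
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise ConnectionError("throttled by cloud API")
    return 1000

rows = run_load(flaky_load, "orders", backoff_seconds=0)
```

A production version would publish these log events to the cloud vendor's monitoring APIs, which is exactly the integration point where traditional ETL tools tend to fall short.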
Plan for Challenges, Accelerate Time to Value
Cloud data lakes are excellent solutions for organizations that need to manage massive volumes of structured and unstructured data, but it's critical to know what you're getting into. It's not just about the data lake; it's about the whole technology stack that supports data movement, data storage, business intelligence, security, and much more. Understanding common challenges in advance will shorten the time to value for your data lake.
Ravindra Punuru is cofounder and CTO of Diyotta, Inc., where he is responsible for modern data integration technology strategy, product innovation, and direction. With more than 20 years of experience in data management and consulting, Ravindra has broad knowledge of corporate management and the strategic and tactical use of cloud-based, data-driven technologies to improve innovation, productivity, and efficiency. Ravindra's past roles have included architecting and delivering enterprise data warehouse programs with large corporations including AT&T, Time Warner Cable, and Bank of America. You can contact the author via email or LinkedIn.