Executive Q & A: Bursting Data to the Cloud
With so much data to store and analyze, it's no wonder hybrid cloud solutions are gaining favor. Haoyuan (H.Y.) Li, the founder and CEO of Alluxio, offers his perspective and best practices.
- By James E. Powell
- July 31, 2020
Upside: What trends are you seeing regarding the cloud and data storage?
Haoyuan (H.Y.) Li: We see an increased adoption of hybrid cloud solutions, where users run analytics and AI workloads in public cloud environments and keep their data lakes on premises. With the separation of storage and computing resources, hybrid data lakes are being built in a couple of ways.
The first case is when the computing infrastructure on premises is running out of CPU resources to meet user service-level agreements. A public cloud is attractive for its ease of provisioning and its elasticity, which keep costs under control for ephemeral or bursty workloads. The data itself may not need to be moved to public cloud storage in this case.
The other case we see is when the on-premises storage cluster is overloaded because of scaling limitations. We see partial migration to on-premises object storage as well as cloud storage for nonsensitive data.
Doesn't having some of your data on premises and doing analytics processing in the cloud pose problems?
Yes, it is challenging to natively run processing in the cloud while the data resides on premises. Data and metadata locality for computation-intensive applications must be achieved to maintain the performance of analytics jobs as if the entire workload were running on premises. There are a couple of techniques to address this issue.
The first is pre-fetching data and metadata intelligently based on the application's access pattern. The storage access pattern of traditional analytics frameworks is not suited to storage accessed over a high-latency link.
The second technique is to translate this access by the compute layer into a more optimal storage access pattern, using storage APIs better suited to this kind of network link.
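To make these two techniques concrete, here is a minimal, illustrative Java sketch; it is not Alluxio's actual implementation, and all class names and buffer sizes are assumptions. A background thread prefetches the next chunk while the caller processes the current one, and each fetch is a single large sequential read, an access pattern far better suited to a high-latency link than many small random reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative readahead wrapper: fetch chunk N+1 while the caller consumes chunk N. */
public class PrefetchingReader implements AutoCloseable {
    private static final int CHUNK_SIZE = 4 * 1024 * 1024; // large chunks amortize latency

    private final InputStream source;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private Future<byte[]> nextChunk;

    public PrefetchingReader(InputStream source) {
        this.source = source;
        this.nextChunk = executor.submit(this::fetchChunk); // start prefetching immediately
    }

    /** Returns the next chunk (empty at end of stream) and schedules the one after it. */
    public byte[] read() throws Exception {
        byte[] current = nextChunk.get();              // wait for the prefetched chunk
        nextChunk = executor.submit(this::fetchChunk); // overlap the next fetch with processing
        return current;
    }

    private byte[] fetchChunk() throws IOException {
        byte[] buf = new byte[CHUNK_SIZE];
        int n = source.read(buf, 0, CHUNK_SIZE);       // one large, sequential remote read
        if (n < 0) return new byte[0];
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    @Override
    public void close() throws IOException {
        executor.shutdownNow();
        source.close();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[10 * 1024 * 1024];      // stand-in for a remote object
        try (PrefetchingReader r = new PrefetchingReader(new ByteArrayInputStream(data))) {
            byte[] chunk;
            int total = 0;
            while ((chunk = r.read()).length > 0) total += chunk.length;
            System.out.println("read " + total + " bytes with readahead");
        }
    }
}
```

The key idea is overlap: the round trip for chunk N+1 is hidden behind the processing of chunk N.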
How can companies leverage hybrid cloud data analytics and avoid analytics latency?
A hybrid cloud for data analytics can have data spread across storage silos, accessed by multiple compute clusters. Common considerations in tackling this challenging problem include whether applications must be rewritten, how the movement and replication of data across silos will be managed, and how to future-proof the architecture by scaling the on-premises component as well as the cloud components, even across multiple vendors.
A data orchestration layer helps alleviate these issues by providing a single access layer for data spread across silos. In addition, the movement of data across silos can be managed by this layer so that analytics applications remain unchanged even if the data location changes. This last point is key: with the new access layer sitting between compute and storage, applications are decoupled from the physical location of the data.
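As a concrete illustration of this decoupling, the sketch below uses Alluxio's open source Java client API, one example of such an access layer. The application opens a logical path in a single namespace; whether the bytes are ultimately served from on-premises HDFS or cloud object storage is resolved by the orchestration layer's mount configuration, not by the application. The path used here is hypothetical, and running this requires a deployed Alluxio cluster.

```java
import alluxio.AlluxioURI;
import alluxio.client.file.FileInStream;
import alluxio.client.file.FileSystem;

public class ReadThroughOrchestrationLayer {
    public static void main(String[] args) throws Exception {
        // The application sees one logical namespace; the orchestration layer
        // maps it to physical stores (e.g., hdfs://... or s3://... mounts).
        FileSystem fs = FileSystem.Factory.get();

        // Hypothetical logical path; if the underlying data is later migrated
        // from HDFS to S3, this code does not change.
        AlluxioURI path = new AlluxioURI("/datasets/sales/2020/part-0.parquet");

        try (FileInStream in = fs.openFile(path)) {
            byte[] buf = new byte[8192];
            int n = in.read(buf);
            System.out.println("read " + n + " bytes via the logical namespace");
        }
    }
}
```

If the dataset moves to a different physical store, only the layer's mount configuration changes; the application code above stays the same.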
Speaking of where data is stored, what best practices concerning the architecture can you suggest?
We recommend not copying data manually to create new silos, which are hard to maintain and add cost. Instead, copying data on demand using a data orchestration platform can bring locality to a separated compute and storage environment. Rather than communicating with data silos directly, analytics applications should interface with the data orchestration layer. Its highly distributed caching and management capabilities seamlessly move hot data to where it can be accessed and analyzed while moving cold data to cheap storage.
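A minimal sketch of the copy-on-demand idea follows, assuming a hypothetical remote silo behind a fetch function: the first read of a key pulls the data across the network and caches it near compute; subsequent reads of that now-hot data are served locally. Real orchestration layers add eviction, consistency checks, and distribution, all omitted here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

/** Illustrative read-through cache: data is copied on demand, not pre-copied into a new silo. */
public class OnDemandCache {
    private final Path cacheDir;
    private final Function<String, byte[]> remoteFetch; // stand-in for the remote silo

    public OnDemandCache(Path cacheDir, Function<String, byte[]> remoteFetch) {
        this.cacheDir = cacheDir;
        this.remoteFetch = remoteFetch;
    }

    public byte[] get(String key) throws Exception {
        Path local = cacheDir.resolve(key);
        if (Files.exists(local)) {
            return Files.readAllBytes(local); // hot data: served locally
        }
        byte[] data = remoteFetch.apply(key); // cold data: fetched from the silo once
        Files.write(local, data);             // cached close to compute for next time
        return data;
    }

    public static void main(String[] args) throws Exception {
        OnDemandCache cache = new OnDemandCache(
                Files.createTempDirectory("hot-cache"),
                key -> ("remote bytes for " + key).getBytes()); // hypothetical remote store
        cache.get("table1.part0"); // miss: fetched over the network and cached
        cache.get("table1.part0"); // hit: served from the local cache
        System.out.println("second read served locally");
    }
}
```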
Is security still an issue when shuttling on-premises data to the cloud?
Yes, very much so. The common pillars of security are authentication, authorization, and encryption for data at rest and in motion. For authentication and authorization, it is important to plug into the on-premises infrastructure. With a hybrid cloud, the access control model of the public cloud component may differ from that of the on-premises infrastructure, and there is a need to bridge the gap. Encryption is equally important, if not more so, when data is shipped over a network outside the private data center.
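As a purely illustrative example of the encryption pillar, the sketch below uses the JDK's standard javax.crypto API to encrypt a record with AES-GCM before it would leave the private data center. In practice the key would be provisioned by the on-premises key management infrastructure rather than generated locally, and the payload here is made up.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class EncryptBeforeShipping {
    public static void main(String[] args) throws Exception {
        // In production the key comes from the on-premises KMS; generated here for illustration.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey key = kg.generateKey();

        byte[] iv = new byte[12]; // 96-bit nonce, standard for GCM
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));

        byte[] plaintext = "sensitive on-premises record".getBytes(StandardCharsets.UTF_8);
        byte[] ciphertext = cipher.doFinal(plaintext); // authenticated encryption

        // Only iv + ciphertext cross the network; the key never leaves the data center.
        System.out.printf("shipping %d ciphertext bytes (plus %d-byte IV)%n",
                ciphertext.length, iv.length);
    }
}
```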
What other best practices do you recommend?
Careful planning is needed to address today's needs while allowing for future growth. The options for data storage systems as well as analytics compute frameworks continue to proliferate. With this evolution, it is inevitable that data will be spread across multiple locations, not just on premises and in a single cloud but across multiple cloud vendors.
Containerization allows the same computing environment to run in the same manner across all these locations. Kubernetes, as the de facto container orchestration technology, allows the same toolset to be used regardless of the compute environment. Similarly, we recommend a data orchestration layer to bridge the gap on the storage side by providing the same data access technology across silos. We also recommend policy-based migration across data silos as the architecture evolves; this, too, fits well in the orchestration layer.
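Policy-based migration can be expressed quite simply. Below is a hedged sketch of an access-time placement policy; the tier names and thresholds are assumptions for illustration, not Alluxio's actual configuration.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative access-time policy: data untouched for long enough moves to cheaper tiers. */
public class MigrationPolicy {
    enum Tier { HOT_CACHE, ON_PREM_OBJECT_STORE, CLOUD_ARCHIVE }

    private final Duration warmThreshold; // e.g., 7 days
    private final Duration coldThreshold; // e.g., 90 days

    public MigrationPolicy(Duration warmThreshold, Duration coldThreshold) {
        this.warmThreshold = warmThreshold;
        this.coldThreshold = coldThreshold;
    }

    public Tier placement(Instant lastAccess, Instant now) {
        Duration idle = Duration.between(lastAccess, now);
        if (idle.compareTo(warmThreshold) < 0) return Tier.HOT_CACHE;
        if (idle.compareTo(coldThreshold) < 0) return Tier.ON_PREM_OBJECT_STORE;
        return Tier.CLOUD_ARCHIVE; // nonsensitive cold data goes to cheap storage
    }

    public static void main(String[] args) {
        MigrationPolicy policy = new MigrationPolicy(Duration.ofDays(7), Duration.ofDays(90));
        Instant now = Instant.now();
        System.out.println(policy.placement(now.minus(Duration.ofDays(2)), now));   // HOT_CACHE
        System.out.println(policy.placement(now.minus(Duration.ofDays(200)), now)); // CLOUD_ARCHIVE
    }
}
```

Because the orchestration layer sits between compute and storage, a policy like this can run where the migration it triggers stays invisible to applications.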
What products or services does Alluxio offer in this space, and what distinguishes them?
Alluxio offers a mature, open source data orchestration platform with a large community. Three features distinguish the product. The first is a scalable caching capability that moves hot data close to the computing resource, auto-detecting changes in the data silos to ensure cached data is never stale. The second is policy-based data management: for example, you can migrate data to cheap storage based on access time. The third, which makes this migration seamless, is the ability of the orchestration layer to serve logical data from wherever (in whichever physical store) the data resides. Even after data is migrated across silos, applications continue to access it in exactly the same way.
About the Author
James E. Powell is the editorial director of TDWI, overseeing research reports, the Business Intelligence Journal, and the Upside newsletter. You can contact him via email.