Alluxio Boosts AI/ML Support for Its Hybrid and Multi-Cloud Data Orchestration Platform
New features improve I/O efficiency in the data loading and preprocessing stages of an AI/ML training pipeline, reducing end-to-end training time and cost.
Note: TDWI’s editors carefully choose vendor-issued press releases about new or upgraded products and services. We have edited and/or condensed this release to highlight key features but make no claims as to the accuracy of the vendor's statements.
Alluxio, the developer of open source data orchestration software for large-scale workloads, has released version 2.7 of its Data Orchestration Platform. The release improves I/O efficiency for machine learning (ML) training at lower cost by parallelizing the data loading, data preprocessing, and training stages of the pipeline. Alluxio 2.7 also provides enhanced performance insights and support for open table formats such as Apache Hudi and Iceberg, making it easier to scale access to data lakes for faster Presto- and Spark-based analytics.
The Alluxio 2.7 Community and Enterprise Editions feature new capabilities, including:
- Alluxio and NVIDIA’s DALI for ML: NVIDIA’s Data Loading Library (DALI) is a commonly used Python library that supports CPU and GPU execution of data loading and preprocessing to accelerate deep learning. With release 2.7, the Alluxio platform has been optimized to work with DALI for Python-based ML applications that include a data loading and preprocessing step as a precursor to model training and inference. By accelerating I/O-heavy stages and overlapping them with the compute-intensive training that follows, end-to-end training on the Alluxio data platform achieves performance gains over traditional solutions, and the approach scales out rather than being limited to smaller data set sizes. (A sketch of such a pipeline appears after this list.)
- Data loading at scale: At the heart of Alluxio’s value proposition are data management capabilities that complement the caching and unification of disparate data sources. As Alluxio deployments have grown to span compute resources and storage across multiple geographic locations, the software continues to scale by adopting a new technique for batching data management jobs. Batching jobs such as data loading, executed by an embedded execution engine, reduces the resource requirements of the management controller and lowers the cost of provisioned infrastructure.
- Ease of use on Kubernetes: Alluxio now supports a native container storage interface (CSI) driver for Kubernetes, as well as a Kubernetes operator for ML, making it easier to operate ML pipelines on the Alluxio platform in containerized environments. The Alluxio volume type is now natively available in Kubernetes environments. Agility and ease of use are a constant focus of this release.
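The release does not include sample code; the following is a minimal sketch of the kind of pipeline the DALI integration targets, assuming the training data is exposed to Python through an Alluxio FUSE mount at the hypothetical path /mnt/alluxio-fuse/imagenet. The directory layout, image size, and normalization constants are illustrative, not part of the release.

```python
# Minimal sketch: a DALI pipeline reading training images through an Alluxio FUSE mount.
# The mount path, label layout, and preprocessing parameters below are illustrative.
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

ALLUXIO_DATA_DIR = "/mnt/alluxio-fuse/imagenet"  # hypothetical Alluxio FUSE mount

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def training_pipeline(data_dir):
    # Read encoded JPEGs and labels from the mounted directory tree (the I/O-heavy stage).
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    # Decode and preprocess on the GPU ("mixed") so this work overlaps with training compute.
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels

pipe = training_pipeline(data_dir=ALLUXIO_DATA_DIR)
pipe.build()
loader = DALIGenericIterator(pipe, ["images", "labels"], reader_name="Reader")

for batch in loader:
    images, labels = batch[0]["images"], batch[0]["labels"]
    # ... run a training step on this batch ...
```

Because decoding and augmentation run inside DALI while the framework consumes already-prepared GPU tensors, the data loading and preprocessing work can proceed in parallel with the training computation, which is the overlap the release describes.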
Insight-Driven Dynamic Cache Sizing for Presto
An intelligent new capability called Shadow Cache makes it easy to strike a balance between high performance and cost by dynamically delivering insights into how cache size affects response times. For multitenant Presto environments at scale, this new feature reduces management overhead with self-managing capabilities.
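The release does not describe how Shadow Cache is implemented; the toy sketch below only illustrates the general idea behind a shadow cache: track the working set alongside the real, bounded cache so the hit ratio a larger cache would achieve can be estimated and compared with the observed one. A production implementation would use compact structures (for example, bloom filters) rather than the plain sets used here, and every name in the sketch is hypothetical.

```python
import random
from collections import OrderedDict

class ShadowCacheSketch:
    """Toy model: a bounded LRU cache plus a 'shadow' record of the working set,
    used to estimate how much a larger cache would raise the hit ratio."""

    def __init__(self, real_capacity):
        self.real = OrderedDict()      # the actual, size-limited cache (LRU order)
        self.real_capacity = real_capacity
        self.working_set = set()       # every key seen so far (an unbounded "shadow" cache)
        self.real_hits = self.shadow_hits = self.requests = 0

    def access(self, key):
        self.requests += 1
        # Shadow cache: would this request hit if the cache were unbounded?
        if key in self.working_set:
            self.shadow_hits += 1
        self.working_set.add(key)
        # Real cache: standard bounded LRU behavior.
        if key in self.real:
            self.real_hits += 1
            self.real.move_to_end(key)
        else:
            self.real[key] = True
            if len(self.real) > self.real_capacity:
                self.real.popitem(last=False)  # evict the least recently used key

    def report(self):
        return {
            "real_hit_ratio": self.real_hits / self.requests,
            "shadow_hit_ratio": self.shadow_hits / self.requests,  # upper bound with more cache
            "working_set_size": len(self.working_set),
        }

# Synthetic workload: 10,000 requests over 500 distinct objects.
cache = ShadowCacheSketch(real_capacity=100)
for _ in range(10_000):
    cache.access(random.randint(0, 499))
print(cache.report())
```

If the shadow hit ratio is much higher than the observed one, provisioning more cache is likely to pay off; if the two are close, the current cache size is already adequate. That comparison is the kind of sizing insight the feature is meant to surface.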
Free downloads of the open source Alluxio 2.7 Community Edition and of Alluxio Enterprise Edition are generally available at https://www.alluxio.io/download/.