Business never holds still. Data pipelines must be agile enough to adapt to changing requirements and circumstances.
Reusability is the key to a truly agile data pipeline. This is the ability to rapidly compose new data engineering pipeline processes from pre-existing logic. It enables new data ingest, validation, transformation, cleansing, enrichment, masking, logging, monitoring, and other pipeline processes to be provisioned and orchestrated wherever, however, and whenever they’re required. It allows new pipeline flows to be automatically provisioned on the fly in order to operationalize data preparation and thereby accelerate data-driven insights.
In this keynote presentation, TDWI senior research director James Kobielus will discuss key steps for enterprises to take in designing agile, extensible, modular, and reusable data pipelines, especially:
- Avoid building large monolithic pipelines that cannot be easily subdivided, recombined, or repurposed for new DataOps workflows
- Break the pipeline into stages with associated metadata that can be used to drive runtime behavior, rather than relying on hard-coded pipeline logic
- Package pipeline artifacts and workflow logic (such as ETL code, metadata, and APIs) as reusable patterns, templates, fragments, components, functions, and microservices
- Use visual no-code flowcharting to design and deploy pipelines from reusable services
- Deploy a central cloud catalog to enable collaborative reuse, development, publishing, and management of transformation logic, machine-learning models, metadata, service APIs, and other reusable pipeline artifacts
- Provide a shared team environment that enables collaborative development, integration, deployment, management, logging, monitoring, and checkpointing of data pipelines
- Deploy reusable data pipelines on cloud infrastructure that enables on-demand provisioning of new source connectors, data processing engines, data stores, file systems, and other platforms
- Incorporate pipeline automation to enable rapid migration, conversion, configuration, optimization, and deployment of reusable functions
- Use embedded machine-learning-driven processes to present contextual recommendations and otherwise augment the productivity of data engineers, data scientists, and others involved in the building and operationalization of reusable pipeline functions
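
To make the ideas of metadata-driven stages and reusable components more concrete, here is a minimal Python sketch. It is an illustration only, not something taken from the presentation: the registry, the stage names (`validate`, `enrich`, `mask`), the `run_pipeline` helper, and the `CUSTOMER_PIPELINE` spec are all hypothetical. The point is that each stage is a small reusable function, and a pipeline is just metadata that names stages and supplies their parameters.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]
Stage = Callable[..., List[Record]]

# Hypothetical registry of reusable stages, keyed by name (an assumption
# for this sketch, not an API from any particular product).
STAGE_REGISTRY: Dict[str, Stage] = {}


def stage(name: str) -> Callable[[Stage], Stage]:
    """Decorator that publishes a function into the stage registry."""
    def register(fn: Stage) -> Stage:
        STAGE_REGISTRY[name] = fn
        return fn
    return register


@stage("validate")
def validate(records: List[Record], required: List[str]) -> List[Record]:
    # Keep only records where every required field is present and not None.
    return [r for r in records if all(r.get(f) is not None for f in required)]


@stage("enrich")
def enrich(records: List[Record], defaults: Record) -> List[Record]:
    # Fill in default values without overwriting fields the record already has.
    return [{**defaults, **r} for r in records]


@stage("mask")
def mask(records: List[Record], fields: List[str]) -> List[Record]:
    # Replace the values of sensitive fields with a fixed placeholder.
    return [{k: ("***" if k in fields else v) for k, v in r.items()}
            for r in records]


def run_pipeline(spec: List[Dict[str, Any]],
                 records: Iterable[Record]) -> List[Record]:
    """Run the stages named in the metadata spec, in order."""
    data = list(records)
    for step in spec:
        fn = STAGE_REGISTRY[step["stage"]]
        data = fn(data, **step.get("params", {}))
    return data


# The pipeline itself is metadata: changing behavior means editing this
# spec, not the stage code. The spec contents are illustrative.
CUSTOMER_PIPELINE = [
    {"stage": "validate", "params": {"required": ["id", "email"]}},
    {"stage": "enrich", "params": {"defaults": {"region": "unknown"}}},
    {"stage": "mask", "params": {"fields": ["email"]}},
]

if __name__ == "__main__":
    rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
    print(run_pipeline(CUSTOMER_PIPELINE, rows))
```

Under these assumptions, a new workflow is composed by writing another spec that reuses the same registered stages, possibly in a different order or with different parameters. In practice, the spec would more likely live in a shared catalog or configuration store than in the code itself.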