Data Engineering with Apache Hadoop: Highlights from the Cloudera Engineering Blog
June 21, 2016
Data engineering is the process of building analytics data infrastructure or internal data products that support the collection, cleansing, storage, and processing of data, in batch or in real time, to answer business questions (usually posed by a data scientist, a statistician, or someone in a similar role, although in some cases these functions overlap). Examples include:
- The construction of data pipelines that aggregate data from multiple sources (see the sketch following this list)
- The productionization, at scale, of machine-learning models designed by data scientists
- The creation of pre-built tools that assist data scientists in the query process (e.g., UDFs or entire applications)
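To make the first example concrete, below is a minimal sketch of such a pipeline using Spark's DataFrame API. The HDFS paths, column names (user_id, order_date, order_total), and the choice of JSON and Parquet sources are hypothetical, not drawn from any specific post in this collection:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object DailyRevenuePipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("daily-revenue-pipeline"))
    val sqlContext = new SQLContext(sc)

    // Two hypothetical raw sources landed in HDFS: clickstream events (JSON)
    // and order records (Parquet).
    val clicks = sqlContext.read.json("hdfs:///data/raw/clicks/")
    val orders = sqlContext.read.parquet("hdfs:///data/raw/orders/")

    // Join the sources on a shared key, then aggregate revenue per day.
    val dailyRevenue = clicks
      .join(orders, Seq("user_id"))
      .groupBy("order_date")
      .agg(sum("order_total").as("revenue"))

    // Persist the result where downstream tools (e.g., Hive or Impala) can query it.
    dailyRevenue.write.mode("overwrite").parquet("hdfs:///data/marts/daily_revenue/")

    sc.stop()
  }
}
```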
Data engineers rely on Apache Hadoop ecosystem components, such as Apache Spark, Apache Kafka, and Apache Flume, as a foundation for this infrastructure. Regardless of use case or components involved, this infrastructure should be compliance-ready with respect to security, data lineage, and metadata management.
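As one illustration of how these components fit together, ingestion into such a pipeline often begins with a Kafka producer. The sketch below uses the standard Kafka producer client; the broker address, topic name, and event payload are hypothetical placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventProducer {
  def main(args: Array[String]): Unit = {
    // Minimal producer configuration; "broker1:9092" is a placeholder address.
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Publish one JSON event to a hypothetical "web-logs" topic; a real
      // pipeline would stream many such records from applications or Flume agents.
      producer.send(new ProducerRecord[String, String]("web-logs", "user_42", """{"action":"click"}"""))
    } finally {
      producer.close()
    }
  }
}
```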
This white paper collects posts from the Cloudera Engineering Blog that cover key concepts for building and maintaining analytics data infrastructure on a Hadoop-powered enterprise data hub.