Data Engineering with Apache Hadoop: Highlights from the Cloudera Engineering Blog

June 21, 2016

Data engineering is the process of building the analytics data infrastructure or internal data products that support the collection, cleansing, storage, and processing (in batch or real time) of data used to answer business questions. Those questions are usually answered by a data scientist, a statistician, or someone in a similar role, although in some cases these functions overlap. Examples include:

  • The construction of data pipelines that aggregate data from multiple sources
  • The productionization, at scale, of machine-learning models designed by data scientists
  • The creation of pre-built tools that assist data scientists in the query process (e.g., UDFs or entire applications)
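
As a rough illustration of the first example above, the sketch below aggregates records from two sources after a simple cleansing step. It is plain Python rather than a Hadoop ecosystem component, and the function names (cleanse, aggregate), the record fields (user, amount), and the sample inputs are all hypothetical, chosen only to show the shape of such a pipeline.

```python
from collections import defaultdict


def cleanse(record):
    """Normalize one raw record; return None if it is unusable."""
    user = str(record.get("user", "")).strip().lower()
    amount = record.get("amount")
    if not user or amount is None:
        return None
    return {"user": user, "amount": float(amount)}


def aggregate(*sources):
    """Merge records from several sources and total the amount per user."""
    totals = defaultdict(float)
    for source in sources:
        for raw in source:
            record = cleanse(raw)
            if record is not None:
                totals[record["user"]] += record["amount"]
    return dict(totals)


# Hypothetical inputs standing in for two upstream systems.
web_orders = [{"user": " Alice ", "amount": "10.5"}, {"user": "bob", "amount": 3}]
store_orders = [{"user": "alice", "amount": 4.5}, {"user": "", "amount": 1}]

totals = aggregate(web_orders, store_orders)
print(totals)
```

In a production setting the same collect, cleanse, and aggregate stages would typically be expressed in a distributed engine and fed by an ingestion layer; the in-memory lists here only stand in for those sources.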

Data engineers rely on Apache Hadoop ecosystem components, such as Apache Spark, Apache Kafka, and Apache Flume, as a foundation for this infrastructure. Regardless of use case or components involved, this infrastructure should be compliance-ready with respect to security, data lineage, and metadata management.

This white paper contains selected posts from the Cloudera Engineering Blog about some key concepts pertaining to building and maintaining analytics data infrastructure on a Hadoop-powered enterprise data hub.
