How to Avoid Turning Your Enterprise Data Lake into a Data Swamp

Take these three steps to achieve mature enterprise-strength Hadoop.

By Tripp Smith, CTO at Clarity Solution Group

As the Hadoop ecosystem has matured and gained increasing traction in mature industries, it is clear that the potential benefits for data management and analytics are staggering. For example, some estimates claim Hadoop is 25 times less expensive per terabyte than leading proprietary relational database management systems (RDBMSs). The industry hype around Hadoop and the concept of the enterprise data lake -- a large object-based storage repository that holds data in its native format until needed -- has generated enormous expectations reaching all the way to the executive suite. At the same time, many Hadoop programs have stalled or failed to deliver on their aspirational value proposition, resulting in a substantial gap between expectations of analytics consumers and the ability of analytics programs to deliver.

Enterprises introducing Hadoop are seeking common goals:

-- Information agility through the centralization of data and decentralization of analysis with a single point of access to high-value enterprise master data, transactional assets and dark data that is made available to analytics consumers across the enterprise

-- Increased capacity with near-linear scalability to data volumes or calculation complexity and ease of adding hardware resources

-- Expanded capability with increased depth of conventional analysis combined with real-time and full-volume analytics

-- Reduced expense with cheaper storage than an enterprise storage area network (SAN) or RDBMS, reduced licensing costs, readily available open-source tools, and scalability with hardware, not expensive application-performance tuning

The conventional industry hype, backed by numerous high-profile, high-tech success stories, suggests that setting up a cluster, streaming data into the Hadoop Distributed File System (HDFS), and unplugging the enterprise data warehouse (EDW) will immediately achieve each of these goals. Although the Hadoop ecosystem is advancing at an extraordinary pace that shows little sign of slackening, the capabilities that enable security, business continuity, data governance, and accessibility are still maturing. Bridging the capability gap for mature industries leveraging Hadoop requires a considered strategy that delivers on enterprise needs while enabling the value of a Hadoop ecosystem.

The Need for Enterprise Strength Alternatives

An effective Hadoop implementation requires a balanced approach that addresses the same considerations that conventional analytics programs have struggled with for years, such as establishing security and governance, controlling costs, and supporting numerous use cases.

Failure to address these concerns can be disastrous, resulting in increased operational maintenance costs. IT teams would struggle to make sense of the chaos; information agility would be reduced as analysts spend more time on data forensics than analysis; the business impact of analytics would decline as information is questioned for accuracy; and consumer trust would diminish if their personal information is not managed securely. The successes of companies in high-growth industries with seemingly limitless pools for investment in IT do not necessarily translate to cost-conscious, mature industries -- and the aspirational promise of Hadoop seems less enticing when the stakes for getting it wrong could include jail time for the CFO.

A considered approach to achieving information agility with Hadoop in an enterprise context addresses architecture, governance, and enablement and provides a clear framework for achieving analytic maturity while enabling agility.

The MESH Framework for Mature Enterprise Strength Hadoop

A Mature Enterprise Strength Hadoop (MESH) implementation is achievable, now more than ever. Core platform capabilities are enterprise strong and hardened, but as years of EDW implementations have taught us, platform capabilities alone do not necessarily result in success. A MESH framework answers the need for governance processes across the enterprise stack by providing an interoperable matrix of architecture, governance, and enablement, accelerating the real lifetime value of Hadoop.

An architectural approach that naturally and organically enables data governance is more effective at achieving information agility than one that requires forceful policing and limits enablement and access to highly valuable data. A core assumption of maturity is that processes that are integral to the operation of the business, like data security and data governance, should be repeatable as well as simple to implement and enhance.

By providing an organic approach to enablement, a MESH framework ensures that agility is achieved across the breadth of Hadoop use cases: acquisition and ingestion, archival data management, real-time event processing, master data integration, data transformation, information delivery, discovery analytics, and machine learning.

MESH Layers

MESH enables data maturity through an organic refinement approach. Analytics and data transformation tend to follow natural usage patterns and MESH enables these patterns inherently within the architecture. As data passes through the MESH framework, it is refined through stages:

  • Managed: Raw data is brought under management within the analytic ecosystem. Managed data is the raw material from which information is generated, but may be highly flawed, programmatically corrupt, or defined by nonsensical source semantics. Nonetheless, managed data must be secured and governed by service-level agreements (SLAs).

  • Structured: Cleansed and semantically defined data is processed for discovery analytics and formal business transformations. Structured data is refined to a point that analysts, engineers, and data scientists with specialized skills can ingest and process this data into consumable information.

  • Governed: Cleansed, semantically defined data is presented through business views to business consumers and other end users for consumption. Presentation carries its own security and SLA constraints, along with user experience considerations such as accessibility and performance.

Clearly defining naturally occurring layers is a first step to enabling a mature enterprise Hadoop ecosystem ready to scale to enterprise needs.
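The three stages above can be sketched as an ordered type with a rule that data advances one stage at a time. This is a minimal illustration rather than a definition of MESH: the names `Stage`, `Dataset`, and `promote`, and the checks required for each promotion, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The three refinement stages, in order (stage names from the article)."""
    MANAGED = 1      # raw data under management; may be flawed
    STRUCTURED = 2   # cleansed and semantically defined, for specialists
    GOVERNED = 3     # presented through business views to end users


# Hypothetical checks a dataset must have passed before entering each stage.
REQUIRED_CHECKS = {
    Stage.STRUCTURED: {"cleansed", "semantics_defined"},
    Stage.GOVERNED: {"cleansed", "semantics_defined", "business_view_approved"},
}


@dataclass
class Dataset:
    name: str
    stage: Stage = Stage.MANAGED
    checks_passed: set = field(default_factory=set)


def promote(dataset: Dataset) -> Dataset:
    """Move a dataset up exactly one stage if its required checks are complete."""
    if dataset.stage is Stage.GOVERNED:
        raise ValueError(f"{dataset.name} is already governed")
    target = Stage(dataset.stage + 1)
    missing = REQUIRED_CHECKS[target] - dataset.checks_passed
    if missing:
        raise ValueError(
            f"{dataset.name} cannot enter {target.name}: missing {sorted(missing)}"
        )
    dataset.stage = target
    return dataset


orders = Dataset("orders", checks_passed={"cleansed", "semantics_defined"})
promote(orders)
print(orders.stage.name)  # STRUCTURED
```

A second call to `promote(orders)` would raise `ValueError` until the hypothetical `business_view_approved` check is recorded, which is the point of the sketch: promotion between layers is explicit and repeatable rather than ad hoc.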

Three Steps to Hadoop Success

As the Hadoop ecosystem has matured, organizations implementing Hadoop-based solutions have established best practices that can dramatically improve the quality and success of an enterprise Hadoop implementation. Before embarking on an enterprisewide Hadoop journey, consider these simple tips to get off to the right start:

  • Take the security question off the table: A rigorous approach to access, authentication, and authorization proves to stakeholders that you will protect their data as an enterprise asset. High-profile data breaches prove that threats may exist behind the firewall. Define a plan for sustainably managing network access, user authentication, user roles, and data security and retention requirements. Establishing a process-driven approach to address the security aspects of data management ensures that onboarding new data doesn't hit roadblocks or compromise your commitment to protecting the data of your business, partners, and consumers.

  • Establish a platform "zoning map": The MESH approach organically refines data from managed to structured to governed so users know what to expect when they interact with data on your platform. Just like city planners may zone property as industrial, commercial, or residential, combining data governance SLAs with a modular platform architecture creates a visible "data zoning map" of your platform. This approach clarifies the types of applications that should be constructed with data of each type and informs users about the level of effort or confidence they should have in the data, e.g., suitable for directional analysis or highly accurate reporting. Similarly, the zoning map provides guidance for workload management and steers application developers and end users to the correct layer within the analytics platform for their specific use case and needs.

  • Integrate pattern recognition into your data management and analysis processes: Providing reliable pattern-based development approaches ensures that governance standards are followed and enhances the ability to provide reliable estimates for development between layers. The MESH approach facilitates agile data integration and analytics by making delivery repeatable and predictable, enabling development of analytic capabilities to follow a business service approach with frequent releases of incremental improvements and parallel execution on analytic capabilities. Pattern-based development and predictable capacity extend platform stability to program stability, resulting in achievable value in excess of business expectations.
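One way to make the "zoning map" from the second step concrete, together with the role-based access from the first, is a small lookup table that pairs each zone with the kind of analysis it supports and the roles allowed to read it. The sketch below is illustrative only: the role names, the description of the managed zone, and the `can_read` and `zones_for` helpers are hypothetical and would be replaced by an organization's own SLAs and security model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str
    suitable_for: str        # kind of work the zone supports
    reader_roles: frozenset  # hypothetical roles allowed to read the zone


# Hypothetical zoning map; zone names follow the managed/structured/governed stages.
ZONING_MAP = {
    "managed": Zone(
        "managed",
        "ingestion and archival only",  # assumption, not from the article
        frozenset({"data_engineer"}),
    ),
    "structured": Zone(
        "structured",
        "directional analysis",
        frozenset({"data_engineer", "data_scientist", "analyst"}),
    ),
    "governed": Zone(
        "governed",
        "highly accurate reporting",
        frozenset({"data_engineer", "data_scientist", "analyst", "business_user"}),
    ),
}


def can_read(role: str, zone_name: str) -> bool:
    """Return True if the role is allowed to read data in the given zone."""
    return role in ZONING_MAP[zone_name].reader_roles


def zones_for(role: str) -> list:
    """List the zones a role may read, in refinement order."""
    return [zone.name for zone in ZONING_MAP.values() if can_read(role, zone.name)]


print(zones_for("business_user"))  # ['governed']
```

Keeping the map in one declarative structure means the same table can drive access checks, documentation for users, and workload routing, so the three do not drift apart.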

A mature enterprise-strength Hadoop is readily achievable, but many organizations fail to plan for success because of uncertainty around the platform, wait-and-see approaches, or, in other cases, acceptance that dysfunction is a natural byproduct of the Hadoop ecosystem. Experience shows that diving into a data lake without a strategy for maturing at scale is more likely to create a data swamp that fails to deliver on the value proposition of Hadoop and generates countless processes that become swamp monsters requiring expensive ongoing maintenance. Starting with a rigorous framework that readily encourages information agility and enablement rather than suppressing opportunity helps your enterprise realize the aspirational value proposition of Hadoop while maintaining governance, security, and accessibility.

Tripp Smith is CTO at Clarity Solution Group, a recognized data and analytics consulting firm.