January 12, 2015
Big data technologies have rapidly evolved to address the
absorption, organization, and analysis of growing volumes of
different types of structured and unstructured data. Volume growth
continues to accelerate, spurring C-level executives to drive the
exploration and rapid adoption of big data technologies.
This trend is borne out in a recent TDWI Best Practices Report
on managing big data. According to the report, 40 percent of the
surveyed respondents noted that big data activities are underway,
with half of those being committed or already deployed projects, some
relatively mature. Although an additional 37 percent responded
that big data management is “under discussion,” 60 percent are
confident that big data management solutions will be in production
within the next two years.¹
One aspect of a big data strategy involves acquiring hardware and
software to support big data initiatives, but challenges remain in adoption,
application migration, and staffing (the dearth of employees with
the requisite big data design, development, and analytics skills).
Although a variety of technologies can be categorized as big data,
much of the investigation (as well as speculation regarding the
business benefits) focuses on the open source Hadoop ecosystem
and its various enterprise-class variants.
The open source software stack for big data is evolving rapidly, and
many business and technology leaders are considering Hadoop’s
value proposition. Many are still experimenting to assess its usability
and suitability as part of the enterprise environment, and adoption
has been solid among those using the ecosystem for fundamental
storage augmentation such as capturing log data, extending a
data warehouse, and possibly serving as a platform for queries and
reporting.
However, a growing community is broadening its use to sophisticated
analytics for monitoring, prediction, and prescription. These users
seek to take the system beyond the prototypical adoption patterns,
and the developer community is responding. The maturation of the
componentry within Hadoop (such as improvements to the execution
model integral to YARN, evolving tools such as Spark that enable
in-memory cluster computing, and SQL engines with scalable
performance such as Impala, Stinger, and Spark SQL) reflects a
better understanding of true business application performance
requirements.
This imminent migration among early adopters away from the
concept of Hadoop as a platform solely for storage extension (“data
lake”) and toward a more effective platform for real-time analytics
implies the need for a mature big data environment that flexibly
balances performance with oversight and governance. Ultimately,
however, performance will become the most critical need motivating
innovation in this space, leading serious adopters to seek systems
that can simultaneously serve multiple batch and low-latency analytics
workloads suited to in-memory, supercomputer-class computation.
By reflecting on the trajectory of big data adoption, this TDWI
Checklist Report examines how selecting the right components will
guide your transition and integration strategy to realize a mature big
data platform that can be integrated within a production information
technology environment.