Ten Mistakes to Avoid in Hadoop Implementations
TDWI Member Exclusive
February 18, 2015
Data management and analytics are foundational requirements for building, managing, and running a successful business. From an infrastructure perspective, however, it is a struggle to build an integrated data platform that can support the information architecture required by an enterprise data repository and analytics hub.
In the past decade, we have seen a set of successful distributed processing architectures, most notably Google's MapReduce and Google File System and the Apache Nutch project they influenced, that inspired the distributed data processing architecture of Hadoop and its ecosystem of projects. Enterprises have explored Hadoop since 2009, and many start-ups now focus on that ecosystem.
Today, Hadoop distributions are being implemented as the enterprise hub for all data; some implementations are successful, but many others are abysmal failures. Why do so many fail? Where do they go wrong? How do we identify and avoid the mistakes?
When we inspect failures and listen to companies and teams, we see that fundamental steps have been missed or ignored, including end-user management, data security, performance tuning, infrastructure configuration, and sizing. From the Hadoop infrastructure perspective, simply applying workarounds to a flawed implementation doesn't work; these fundamentals must be addressed from the start.
In this Ten Mistakes to Avoid, we identify the mistakes with the most negative impact on Hadoop implementations and recommend solutions you can apply to your own environment.