In Praise of Modular BI
At TDWI's recent Executive Summit in San Diego, Justin Ward, Facebook's manager for data architecture, made a case for what he called "building block" business intelligence. Far from being outmoded or irrelevant, BI is, in fact, more vital than ever, Ward argued.
The Warehouse Is the Foundation
Now as ever, Ward said, the data warehouse is the foundation for most scalable BI implementations. To explain why, he compared building a data warehouse architecture to the situation of a coffee shop owner who decides to open several new locations.
"As you get tons of new users, or coffee drinkers, you're probably going to want to build some sort of physical infrastructure so that these people can come to each of your shops without having to wait in a huge line and lose that personal experience," he said.
"This is like a data warehouse. You need to build infrastructure that can support [query processing] in parallel and ... at scale ... and to do that you're going to need to do things the right way."
Ward's "right way" is what he calls building block BI. It entails using "simple, efficient building blocks throughout your processes" and hiring people who commit to this strategy, he said.
When an organization takes a modular approach to building BI, "you can scale and make your coffee shop customers happy and hopefully [make] your data warehouse customers happy. You can scale up to big data size and as long as you have the right fundamentals in place you're ... not [going to] have to go back and revisit things or rearchitect your warehouse."
Modular Building Blocks Versus Architectural Monoliths
Ward alluded to Facebook's own (somewhat tortured) history with BI and data warehousing, albeit without going into details.
At first, Facebook tried various ways to develop a big-data equivalent of a data warehouse architecture. For example, Hive, one of the best-known SQL interpreters for Hadoop, was initially developed at Facebook. In 2013, however, Ken Rudin, then head of analytics at Facebook, announced that the social media giant was implementing a traditional data warehouse.
"We started in the Hadoop world. We are now bringing in relational to enhance that. We're kind of going [in] the other direction," Rudin told attendees at a 2013 TDWI conference in Chicago. "We've been there, and [we] realized that using the wrong technology for certain kinds of problems can be difficult. We started at the end and we're working our way backwards, bringing in both."
Ward emphasized that perceived shortcuts almost invariably create long-term costs -- i.e., "technical debt" that you have to repay with later work.
For example, Facebook famously developed Hive to address a set of point-in-time requirements. It based Hive on a general-purpose parallel processing technology (Hadoop) that it believed would scale. Hive did scale -- for a limited set of reporting requirements. When Facebook tried to extend Hive to its broader decision support needs, however, the technology proved unsuitable for ad hoc and interactive queries. Strictly speaking, Hive wasn't a shortcut, but it wasn't a modular building block, either.
The best practice is for engineers and architects to build modular, scalable architectures that minimize technical debt, Ward told attendees. This is as simple as choosing the right tool for the right job.
Facebook now uses a massively parallel processing (MPP) data warehouse. An MPP database is extremely fast for certain kinds of data-processing workloads but unsuitable for others, Ward said: "MPP databases are fantastic when used in the right context but they can be really and truly awful if they're used like a traditional data warehouse or if they're used for the wrong domain."
The lesson: an MPP warehouse is one piece of a modular BI architecture. It might be -- and in Facebook's case is -- complemented by other warehouses, data stores, and tools.
From the perspective of designers, however, the temptation to take shortcuts -- to simplify up-front design and reduce up-front costs, to accelerate performance -- is often overwhelming.
Ward used the example of an organization that builds and optimizes its decision support infrastructure to support a limited set of reporting requirements. Optimization of this kind gives you short-term convenience that almost always comes at the expense of long-term flexibility, Ward said.
"That shortcut of putting everything in the one table because it was easy for [a certain] report comes back and hurts us. There should be simple building blocks that are modular throughout this process," he said. "If you have [to add] another source, build another target for it and make sure you have a data lineage that reflects logically what's happening in your data warehouse."
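Ward did not describe an implementation, so what follows is only a minimal Python sketch of the "another source, another target" idea, with lineage recorded as each target is created. The `Warehouse` class, the `add_source` method, and the `stg_` naming prefix are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Warehouse:
    """Toy catalog (hypothetical): maps each target table to its one source."""

    lineage: Dict[str, str] = field(default_factory=dict)  # target -> source

    def add_source(self, source: str) -> str:
        """Register a new source and give it its own dedicated target table.

        Refusing to reuse an existing target is what keeps the blocks modular:
        a new source never gets folded into a table built for something else.
        """
        target = f"stg_{source}"  # assumed naming convention, not from the talk
        if target in self.lineage:
            raise ValueError(f"target {target!r} already exists")
        self.lineage[target] = source
        return target


wh = Warehouse()
orders_target = wh.add_source("orders")
clicks_target = wh.add_source("clicks")
```

In this sketch the lineage record is a by-product of adding a source, so the catalog cannot drift out of step with the tables it describes.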
Taking shortcuts adds complexity in the long term and squanders productivity, Ward argued. "This is also potentially extremely wasteful. At some point, you're going to hit a situation where you need to revisit something. If you agree with me that it's right to make it modular you should do that the right way the first time or you'll probably never get around to it," he told attendees.
He noted that human hubris also contributes to this problem. "As I talk to engineers, a reaction I always get is 'I'm really good at my job. I can put this stuff in later if I need it.'"
In the long term, the up-front cost and inconvenience of modular design are worth it, Ward argued. You not only have much more flexibility, you also have better insight into your data infrastructure. "You ... get to the point where the business is giving you a break and they say, 'Hey, we don't need this report anymore.' If you're in this suboptimal architecture, at that point what do you do? You don't know if you can deprecate that target table or that source. You don't know what pieces you can take out."
He concluded, "If you have [a more modular architecture] you have a very linear flow of what you can take out. This becomes very important when you're trying to move off of an architecture."
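The deprecation question Ward raises can be illustrated with a small Python sketch. It assumes lineage is available as a mapping from each table or report to the objects it reads from. The table names, the `upstream` helper, and the `removable_when_retired` function are hypothetical and were not part of the talk.

```python
from typing import Dict, List, Set


def upstream(lineage: Dict[str, List[str]], node: str) -> Set[str]:
    """Return `node` plus everything it depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(lineage.get(current, []))
    return seen


def removable_when_retired(
    lineage: Dict[str, List[str]], reports: Set[str], retired: str
) -> Set[str]:
    """Objects that only the retired report needed, so they can be taken out."""
    still_needed: Set[str] = set()
    for report in reports - {retired}:
        still_needed |= upstream(lineage, report)
    return upstream(lineage, retired) - still_needed


# Hypothetical lineage: each key reads from the objects in its list.
lineage = {
    "sales_report": ["fact_sales"],
    "churn_report": ["fact_churn"],
    "fact_sales": ["stg_orders", "stg_customers"],
    "fact_churn": ["stg_customers", "stg_logins"],
}
reports = {"sales_report", "churn_report"}

to_remove = removable_when_retired(lineage, reports, "churn_report")
```

Here `to_remove` contains `churn_report`, `fact_churn`, and `stg_logins`, while `stg_customers` is kept because `sales_report` still depends on it. Without recorded lineage, that distinction would have to be worked out by hand.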