Inside Facebook’s Relational Platform
At TDWI's World Conference in Chicago, Ken Rudin, director of analytics for Facebook, surprised many attendees when he revealed that Facebook has built itself a conventional data warehouse.
By Stephen Swoyer
May 6, 2013
During his keynote address at today's TDWI World Conference in Chicago, Ken Rudin, director of analytics for Facebook, surprised many attendees when he revealed that Facebook has built itself a conventional data warehouse.
The Facebook model has always been held up as an exemplar of The New. The company helped to develop Hive, the SQL-like semantic layer for Hadoop, which it used to power its Hadoop-based “data warehouse” environment.
In his keynote, however, Rudin staked out a pragmatic position that many TDWI attendees -- and most data management (DM) practitioners -- could easily endorse. [Editor's note: You can view the entire keynote address here.]
“[Facebook] started in the Hadoop world. We are now bringing in relational to enhance that. We're kind of going [in] the other direction,” Rudin told attendees. “We've been there, and [we] realized that using the wrong technology for certain kinds of problems can be difficult. We started at the end and we're working our way backwards, bringing in both.”
Rudin invoked an aphorism by author James Collins, with whom he studied at Stanford University's Graduate School of Business. Collins, a critic of zero-sum decision making, famously championed “the genius of 'and'” as an inclusive alternative to “the tyranny of 'or.'”
Big data is a great example of an inclusive “and” scenario, Rudin argued.
“What traditional systems like relational are really good at are [answering] the traditional business questions that we all still ask and will ask, and that's not going away just because the new technologies are there,” he explained.
Rudin suggested “not only SQL” as a backronym for the term “NoSQL,” which has been used to describe schema-less technologies such as Hadoop. Queries or workloads that execute in fractions of a second on a relational platform will run orders of magnitude slower on Hadoop, he observed. The reverse is true of analytic algorithms running against large sets of multi-structured data: they'll run orders of magnitude faster on Hadoop. “Trying to do ... [a] complicated algorithm in a relational system is [going to be] very, very painful,” said Rudin, using the example of an analysis of user-submitted Facebook photos.
“You want to use the right kind of technology for the right kind of question.”
Throughout his keynote, Rudin distinguished between analytic pragmatism and analytic perfection. In acknowledging the role of statistical rigor in product testing, for example, Rudin stressed that what's most important is the spirit and not the ideal of experimentation: don't forestall or eschew testing simply because you can't come up with a statistically perfect test.
“The most modern incarnation of [experimentation] is A/B testing. This is figuring out ... whether my great idea is actually great,” he said. “Maybe you can't do a perfectly statistically controlled A/B test, but ... there's always some way ... to figure out how we've improved versus historical trends. It's the spirit of the experimentation versus the actual statistical significance of it that actually makes all of the difference.”
Earlier, Rudin had discussed the problem of managing and governing data in the context of an analytic-driven organization. There's a long-standing tension between the DM-oriented need to tightly profile, control, and manage data on the one hand, and the countervailing analytic desire to access and consume data on terms that are determined by the analyst herself.
Under its old model, Facebook would have used non-relational platforms such as Hive and Hadoop to address both requirements. Under its new, inclusive model, Facebook uses non-relational platforms to empower analytic discovery and experimentation, while its relational data warehouse functions as a consistent, or reference, platform for core business data.
“Think about the core elements of the data that you must manage and then don't worry about everything else. That's uncomfortable for a lot of us ... particularly in a ... relational environment where you want to have nice structured schemas,” he said.
Facebook, he said, has hundreds of thousands of database tables in its analytic environment. “Only on the order of several dozen” of these tables are core to its business, however. These are tables “that we must manage very, very carefully. ... For the things that we need to have consistency on, ... that's a core table, that's managed relationally.”