The BI Year That Was: 2014
The year that was 2014 was a transitional year in business intelligence and data management, with vendors emphasizing data visualization, storytelling, databases, Spark, and intriguing inventions.
- By Stephen Swoyer
- December 16, 2014
The year that was 2014 was a transitional year in business intelligence (BI) and data management (DM). It was a year in which a once-isolated BI industry continued, sometimes forcibly, to diversify itself. It was a year, too, in which diverse forces from without – from Silicon Valley upstarts to open source app dev specialists – continued to crowd and press their way into domains that once seemed vouchsafed to DM and BI. The upshot is that data management is becoming more worldly.
It was a year in which marketers began touting their own craft – viz., storytelling – and in which yet another new-and-game-changing NoSQL technology, Spark, exploded in popularity. It was a year in which cloud players unveiled BI-in-the-cloud architectures that look an awful lot like the BI-in-the-enterprise architectures they're destined to replace. Above all, it was a year in which the information management industry, that irrepressible engine of innovation, once again reliably delivered the goods.
The Revolution Will Be Visualized
The self-service-ification of BI and analytics continued unimpeded in 2014, with updated versions of visual discovery tools from IBM Cognos, Microsoft, SAS Institute Inc., Tableau, and others.
Back in June, for example, Tableau -- a pioneer in the category of rich visual discovery -- introduced a new version of its Tableau Desktop software. Although it was packed with new features, perhaps the biggest thing in Tableau 8.2 was a native client for Apple Inc.'s OS X. Other new Tableau 8.2 highlights included support for "Story Points" -- part of Tableau's new "storytelling" marketing push -- and new visual tools designed to simplify the work of data integration.
Also this year, Tableau competitor Qlik Inc. unveiled its own long-awaited visual discovery product, Qlik Sense. Even though they're often compared with each other, Tableau and Qlik are actually nothing alike. Traditional QlikView combines a user-focused, rapid application delivery model with not-so-sophisticated data visualization capabilities; Tableau is designed as a user-focused, best-of-breed data visualization tool. With Qlik Sense, Qlik touts the combination of best-of-breed-like data visualization with its traditional developer-friendly model -- on top of its metadata management capabilities. (Traditionally, Tableau gave short shrift to metadata -- although this, too, is changing.)
There were other firsts this year. SAS introduced Visual Statistics, an intriguing attempt to extend the visual discovery experience to statistics. Big Blue unveiled its long-awaited Watson Analytics, a self-service visual cloud offering. Traditional analytic solutions are happiest when consuming or working against data from relational or strictly-structured sources. Watson Analytics bundles text analytic technologies, algorithmic functions, and natural language processing capabilities that IBM says permit it to meaningfully discover and synthesize information from plain text, XML, and other non-traditional sources.
In 2014, the industry coalesced around self-service visual discovery as the front-end user experience (UX) of choice, but some aren't yet convinced. Take Glen Rabie, CEO of Yellowfin, an on-premises and cloud BI offering. "Basically, I don't think self-service BI is in any way a real or meaningful concept. When we started Yellowfin, I actually did believe that everyone wanted to sit down and write their own reports and do their own analysis, but that's completely not the case. Most people are too busy, they don't have the skill set, and it's not their job. What they rely on are analysts who do that for them and who can do that for them," Rabie told BI This Week during a briefing about Yellowfin's new version 7.1 release, which also shipped this year -- and which nonetheless includes a self-service feature set.
The Story's the Thing
History begins with storytelling; so, in another sense, does marketing. Think of 2014, then, as the year in which BI marketers aggressively promoted their core competency: storytelling. Tableau Software led the charge, casting "data-driven storytelling" as a critical interpretive analytic tool. Tableau wasn't the only one, however: nearly all purveyors of front-end analytical tools had a storytelling story of some kind in 2014.
The pell-mell rush to exploit storytelling as a marketing tool misses important points, however. There's the problem of the structure of the story, for starters: its narrative story arc, with a beginning, middle, and end and with an emphasis on dramatic developments. What about the use of common storytelling devices, such as simile, metaphor, and analogy? Don't certain framing metaphors -- e.g., war or combat metaphors -- predispose one to look at the story that's being told in a certain way? In many cases, an honest or respectful interpretation of the available data might not lend itself to the requirements of traditional storytelling structure. We don't intentionally tell boring stories. We want our stories to be interesting. What happens when we uncritically bring those same expectations to bear in our storytelling efforts, "data-driven" or otherwise?
Thankfully, analysts and data management professionals are pushing back. In December, for example, TDWI hosted its first-ever "Storytelling" course at its Orlando World Conference. The new TDWI course looks at how storytelling can be adapted to the interpretation and analysis of data -- and vice-versa. As course co-creator Ted Cuzzillo, a respected industry analyst and a principal with Datadoodle.com, notes, every storytelling medium -- from radio to newspapers to stone engraving -- innovates (and compromises) within the context of this storytelling structure. What's important with data storytelling is to identify and control for those innovations and compromises. Adds Cuzzillo, "I suspect that the data story's closest cousin in any traditional genre is the detective story. A crime occurs and the rest of the story lays out the clues and ends up telling which ones were relevant, how they fit together, and who did it."
Count Donald Farmer, vice president of innovation and design with Qlik, as a storytelling pragmatist.
"Storytelling suggests the sort of thing ... where I tell you a story about what's going on in our account, or I tell you a story about some kind of phenomenon. What I think we should focus on is more like what I call 'story-sharing:' the kind of storytelling you do around a camp fire, for example," said Farmer. "What we should focus on creating with a story … is a community of discourse within the company, within the enterprise, within the department. Communities of discourse aren't about agreement. You want to hear different views, different interpretations of 'facts.' With storytelling, it's critically important that you encourage these other, alternative views to surface."
More precisely, the data warehouse (DW) regained its importance, thanks in no small part to the primacy of SQL -- or "NewSQL," as it's now called. NoSQL crept onto the scene almost 15 years ago, exploded -- outside of data management (DM), at least -- in the mid-2000s, and pressed its way into DM in the last few years. At times, it seemed as if nothing could contain the NoSQL onslaught.
Then the unexpected happened: SQL came back -- and with it, the data warehouse. We had a taste of this in 2013, when Facebook analytics chief Ken Rudin, himself no stranger to BI and DW, publicly touted his company's come-(back)-to-the-DW moment and Google Inc. announced F1, its Spanner-based "NewSQL" database.
In 2014, we've seen still more on the NewSQL front: from the emergence of Spark SQL, a native SQL query technology for the Spark parallel computing framework -- about which we'll have more to say below -- to the expansion of Hortonworks Inc.'s "Stinger" project (an effort to beef up Hive, a SQL interpreter for Hadoop), to start-up vendors such as Metanautix Inc. and Bright Vine Inc., which emphasize SQL in their marketing and which depend on SQL to make their technologies work.
Elsewhere, 2014 also saw new ANSI-SQL-on-Hadoop query technology offerings from Actian, SAS, and Teradata, in addition to new offerings from upstart vendors. Speaking of upstart vendors, cloud platforms such as GoodData Inc. and Snowflake Computing implement the equivalent of data-warehouse-architecture-in-the-cloud -- with SQL as their query-language backbone.
As Hadoop champion Cloudera Inc. put it in a promotional mailer sent out late this year: "The data warehouse isn't going away and neither is big or small data. By definition, big data storage and processing is not scalable via the data warehouse alone. Therefore, organizations are turning to an enterprise data hub."
Cloudera's Enterprise Data Hub is its vision of a Hadoop-centered information management architecture, with Hadoop used as an elastic means of storing, managing, and preparing data. Two years ago, Cloudera famously delivered Impala -- an in-memory, interactive SQL query engine for Hadoop -- positioning Impala as a putative data warehouse replacement. Two years on, Cloudera's now affirming the DW. As one prominent research analyst told BI This Week, "Here's a big change: from 'To hell with the data warehouse' to 'Hell, yeah, the data warehouse!'"
O Wondrous Spark Divine
That's a line from Schiller's "Ode to Joy," which Ludwig van Beethoven famously appropriated -- and transfigured for all time -- in his Symphony No. 9. This year, it could just as easily have been said about Spark, a parallel cluster computing framework that can run in the context of Hadoop -- as well as by itself. Spark went supernova in 2014, such that by the end of the year nearly every prominent vendor in BI and DI had articulated a Spark story of some kind.
What's intriguing about Spark is that it can both run in-memory and persist data to the Hadoop Distributed File System (HDFS), the Cassandra File System (CFS), or to other distributed file systems and data stores. (Spark's ability to persist to disk is a distinct advantage over Cloudera's Impala engine, which has no provision for spilling over to disk in the event that it exhausts its physical memory resources.) For these and other reasons -- such as its support for interactive workloads -- Spark this year emerged as a compelling platform for analytics and for data integration (DI).
In just the last 18 months, basically every DI vendor -- from Actian (with its Pervasive technology) to IBM Corp., Informatica Corp., SAP AG, SAS Institute Inc., and Syncsort Inc. -- has announced support for Spark, with such announcements coming especially fast and furious in the second half of 2014. From a DM point of view, Spark is very much a work in progress; DM-wise, in fact, it arguably took a step backward, at least for a time, with its shift from "Shark" -- an Apache Software Foundation project that supports running Hive in conjunction with Apache Spark (as an alternative to Hive's dependence on MapReduce) -- to Spark SQL, a new, comparatively immature SQL interpreter.
The consensus is that Spark SQL is a better (more efficient, elegant, and scalable) framework, although some grouse that, at this point, Spark SQL is inferior to Shark. Arsalan Tavakoli, director of customer engagement with Spark parent company Databricks Inc., vigorously disputes this. "Unequivocally, I would disagree with that. Shark, when it was created, was way back when. We had Hive, everybody's doing all of this work in Hive, can we kind of contort it [such that] instead of spitting out MapReduce jobs, [it can] spit out Spark jobs. Hive wasn't leveraging a ton of what Spark could offer," Tavakoli told BI This Week at this year's Strata + Hadoop conference.
"One of the other reasons we moved away from Shark is that Spark SQL can point to almost any data store -- whether it's in Cassandra, HBase, Parquet [a column storage layer for Hadoop] or whatever. If the structure's there, it can write SQL [to it]."
Neat, Neat Stuff
There have always been plenty of innovative, dominant-paradigm-contesting happenings in BI and data management. This year was no different, with entrants such as Interana, Looker, Metanautix, Snowflake Computing, and Trifacta, to name just a few, spicing things up. Even though not all of these companies are "new" -- Looker Data Sciences and Trifacta have been around for a few years -- they are new to mainstream data management, and they're emphatically shaking things up.
For example, you might not know what to make of the term "data wrangling," which has a good bit of currency in circles outside of traditional BI and DM. If not, ask Trifacta, which effectively invented it. For the record, data wrangling is what people in data management call "data integration," and what the IEEE calls "data engineering." Regardless of what you call it, the existence of discourse-specific terms such as "data wrangling" and "data engineering" attests to the criticality of data preparation in all areas of IT. Data isn't just a problem for DM anymore -- and this isn't at all a bad thing. Trifacta, for example, nominally caters to business analysts and data analysts. However, it uses a very different toolset -- namely, Hadoop, in its capacity as an elastic platform for storage and data processing -- and an altogether different set of concepts and terms.
Right now, Trifacta automates the extraction, loading, and preparation (in Hadoop) of relevant data from dispersed data sources; the long-term goal, officials promise, is that Trifacta will be able to push down the data processing workload to those dispersed data sources. Over time, Hadoop (and Spark) will only become more important as a data preparation solution, Trifacta officials predict, thanks to its capacity to generate schema-on-read, as distinct from the rigid schema of the old-school DM model.
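For readers unfamiliar with the term, schema-on-read can be illustrated with a minimal Python sketch. This is not Trifacta's or Hadoop's implementation; the record layout, the `ORDER_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical names invented for illustration. Raw records are stored untouched, and a schema is applied only when the data is read.

```python
import json

# Raw records are kept exactly as they arrived; nothing was enforced on write.
# (Hypothetical sample data, for illustration only.)
RAW_LINES = [
    '{"user": "ana", "amount": "19.99", "ts": "2014-12-01"}',
    '{"user": "raj", "amount": 5, "channel": "web"}',
    '{"user": "li"}',
]

# A hypothetical read-time schema: field name -> (converter, default value).
ORDER_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
    "channel": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each one onto the schema at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            value = record.get(field)
            # Missing fields fall back to the default; extra fields are ignored.
            row[field] = convert(value) if value is not None else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, ORDER_SCHEMA)
total = sum(row["amount"] for row in rows)
```

A different analyst could read the same raw lines with a different schema (one that keeps `ts`, say) without reloading anything, which is the contrast with a schema fixed at load time.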
"One of the biggest changes that we see is a transition in thinking from the idea that schema should be governed top-down to [the idea that] schema is something that develops grassroots from the bottom up and should be reusable. I think that's the triggering change of thinking that happens as organizations try to figure out how to enable a business analyst, a data scientist, [and] a business engineer to do the work that they need to be able to do," Stephanie Langenfeld McReynolds, vice president of marketing with Trifacta, told "BI This Week."
Looker is another contender. It looks like a data federation engine, albeit without the federated query part. Alternatively, it's something like a federation engine with an HTML5 self-service visual analytic front-end UX and a host of developer-oriented amenities. Like federation, Looker presents a single, logical view of relevant data; like federation, Looker pushes the computation (query, data preparation) workload down to the source platforms. Unlike federation, it offers a more robust self-service experience such that information consumers can create their own business views. (View elements must first be designed and provisioned by an IT-someone, however.) Also unlike federation, Looker exposes its own SQL abstraction language -- LookML -- and specifies a bigger, more computationally intensive role for its engine, at least in its capacity as an analytic platform. (It shifts analysis that mixes or blends data from dispersed sources into the Looker engine.)
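The general idea of a modeling layer that compiles to SQL and leaves the computation to the source database can be sketched in a few lines of Python. This is emphatically not LookML syntax or Looker's engine; the `MODEL` dictionary, the `build_query` helper, and the `orders` table are hypothetical, and SQLite stands in for the source platform.

```python
import sqlite3

# Hypothetical model: named dimensions and measures mapped to SQL expressions.
MODEL = {
    "table": "orders",
    "dimensions": {"region": "region"},
    "measures": {"revenue": "SUM(amount)", "order_count": "COUNT(*)"},
}


def build_query(model, dimension, measure):
    """Compile a (dimension, measure) request into SQL for the source database."""
    dim_sql = model["dimensions"][dimension]
    measure_sql = model["measures"][measure]
    return (
        f"SELECT {dim_sql} AS {dimension}, {measure_sql} AS {measure} "
        f"FROM {model['table']} GROUP BY {dim_sql} ORDER BY {dim_sql}"
    )


# An in-memory SQLite database plays the role of the source platform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("west", 4.0), ("east", 6.0)],
)

# The aggregation runs inside the database, not in the client.
revenue_by_region = conn.execute(build_query(MODEL, "region", "revenue")).fetchall()
orders_by_region = conn.execute(build_query(MODEL, "region", "order_count")).fetchall()
```

The business user picks names such as `region` and `revenue`; whoever maintains the model decides what SQL those names expand to.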
"[Looker is] completely Web-based, [it's] based on a Web server we wrote to interact really well with interactive query and SQL databases," says Keenan Rice, vice president of strategic alliances with Looker. "This whole Web architecture really opens up an interesting ecosystem around the data -- the kind of thing that just got lost in the self-service BI world. We wanted a data model layer where you just had raw data but you could do very complex things to transform the data. This is [possible by virtue of] the abstraction of SQL that we do with LookML."
The Past Lives On
On the other hand, some things don't -- or perhaps can't -- change. Take GoodData's new Open Analytic Platform, an ambitious attempt to build and scale a Platform-as-a-Service analytic platform.
GoodData's Open Analytic Platform addresses both traditional (strictly-structured) and advanced analytic use cases, but -- to the extent that it addresses strictly-structured analytics -- it does so by replicating something like data warehouse architecture in a PaaS context. (Take, for instance, this quote -- from a support document on the GoodData website: "Effective data modeling requires a distinct set of skills that may not be part of a general software engineering background. If you are unsure if you or your team has the appropriate skills, please contact GoodData Customer Support.")
Similarly, Snowflake Computing, a PaaS data-warehouse-as-a-service start-up, is explicit about what it does and why it does what it does. "One thing that's enabled by the cloud that we haven't seen anybody take advantage of so far is that you can deliver a full service that goes beyond eliminating the hardware and software install, which can automatically scale up and scale down [as needed]," says Jon Bock, vice president of product and marketing with Snowflake Computing. "The data modeling part is still something that a customer will need to do. It requires them to bring their intelligence about what they're doing. SQL is still the core language we're using, so they will need to build a data model, although we give them the flexibility with JSON and other [poly-structured] data of not having to define a data model. This is a necessary limitation of BI [architecture]: a BI tool doesn't understand that Snowflake has this special way of accessing data without a schema."
Big Data Starts to Grow Up?
Finally, 2014 was a year of redux and irony. Take Cindi Howson's Successful Business Intelligence, a landmark book that was first published back in 2007. In 2013, Howson, a consultant and BI practitioner, as well as a recognized BI tools expert, published an updated edition of her book, complete with a new subtitle: "Unlock the Value of BI and Big Data."
For years, Howson has been fighting the good fight, championing BI as a means to empower business decision-makers and to promote business transparency, accountability, and governance. IT needs to be more responsive to business people, Howson has long said, and BI tools need to do a better job of promoting this responsiveness. At the same time, an important function of IT is to protect the business against itself. For this reason, data lineage, master data management, and governance are critical business requirements that can't always be reconciled with the demands of business decision-makers for faster! better! now! results. The challenge for BI is to pragmatically address both prerogatives.
This, too, is the challenge for big data. In 2014, after several years of unabated hype, it seemed as if consumers, vendors, and hype-mongers were starting to come to grips with What Big Data Hath Wrought. As research analyst Marc Demarest framed the issue in a keynote address at TDWI's 2014 World Conference in Orlando: Where is all of the stuff we're putting into Hadoop coming from -- and what do we expect to do with it?
Demarest, a principal with Noumenal Inc., compared the unmanaged Hadoop “data lake” use case (in which Hadoop's HDFS is used as a vast reservoir for all enterprise information) to a polluted river. He invoked the infamous Cuyahoga River fire of 1952, which caused more than $1 million in property damage, to drive home this point. What caused the Cuyahoga River to spontaneously combust? Pollution. The challenge with using Hadoop as a long-term repository is that its data management feature set is shockingly primitive, which leads to a similar kind of pollution or contamination of data.
Instead of the Hadoop “data lake,” then, you wind up with the Hadoop “Superfund Clean-up Site,” Demarest told BI This Week in a follow-up interview. In Hadoop, it just isn't possible to answer a number of critical questions -- e.g., Where did this file come from? Who put it there? What is it used for? How accurate is the data stored in this file compared to this other data over here? How many other files of the same sort are in here? -- with the same degree of accuracy as with a data warehouse or (R)DBMS system.
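To make those questions concrete, here is a minimal Python sketch of the kind of provenance record a managed platform keeps for every object it stores. It illustrates the metadata in question and is not a fix for Hadoop; the `LineageRecord` and `Catalog` classes, the file paths, and the loader names are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    path: str           # where the file lives
    source_system: str  # where it came from
    loaded_by: str      # who put it there
    purpose: str        # what it is used for
    sha256: str         # content fingerprint, used to spot copies


class Catalog:
    """A toy in-memory catalog; a real one would be persistent and enforced on write."""

    def __init__(self):
        self._records = {}

    def register(self, path, content, source_system, loaded_by, purpose):
        digest = hashlib.sha256(content).hexdigest()
        record = LineageRecord(path, source_system, loaded_by, purpose, digest)
        self._records[path] = record
        return record

    def describe(self, path):
        # Returns None for files nobody registered: the "polluted lake" case.
        return self._records.get(path)

    def duplicates_of(self, path):
        target = self._records[path]
        return sorted(
            p for p, r in self._records.items()
            if r.sha256 == target.sha256 and p != path
        )


catalog = Catalog()
catalog.register("/lake/crm/accounts.csv", b"id,name\n1,Acme\n",
                 "crm", "etl_nightly", "account reporting")
catalog.register("/lake/tmp/accounts_copy.csv", b"id,name\n1,Acme\n",
                 "unknown", "jsmith", "ad hoc analysis")
catalog.register("/lake/web/clicks.log", b"GET /index\n",
                 "web", "log_shipper", "clickstream analysis")
```

The sketch only works if every write goes through `register`; files dropped into the store without such a step are exactly the ones for which the questions above cannot be answered.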
There aren't any -- any -- good answers to this problem today, Demarest argued.
The hope, thanks to pushback from frustrated customers and thought-leadership from voices such as Demarest's and Howson's, is that such answers will soon emerge. Perhaps in 2015, then, we'll see the first such green shoots of a newer, more manageable, more governable take on big data.
Check back here next year and we'll see.