Using SQL/BI Tools with Semi-Structured Data

Analytics tools and platforms must evolve to enable the promise of big data, including mining the most from semi-structured data.

By Poulomi Damany, Vice President of Products, BitYota

Without question, big data is the next frontier of the information revolution. It holds enormous potential to remake business, science, and culture, driving unprecedented advancements in innovation and productivity. As it expands, big data is also taking on a new dimension -- one that confounds conventional BI and data warehouse strategies.

In recent years, the advent of portable smart devices has fundamentally changed the nature of data. With billions of people using multiple devices and apps to access the Web, more kinds of real-time data are being collected instantaneously than ever before. Yet smartphones and tablets aren't the only points of origin. Millions of sensored and networked devices are creating an explosion of low-latency information, adding to the volume, the promise, and the headache of collecting and analyzing this data.

Many of these devices use unique storage and format protocols. No longer is data static, structured, and predictable; it is variable, changing over time. With every new app, product feature, or software iteration, the problem compounds itself, aAnd data professionals must deal with the consequences.

Indeed, the rise of semi-structured data is perhaps the most significant trend to hit BI and database analytics in the past twenty years. It carries with it new problems and new opportunities. For one, the data management stack has evolved into multiples, with individualized tools attempting to solve each stack challenge. JSON (JavaScript Object Notation) has become the preferred transmission solution for semi-structured data, enabling structures to evolve alongside apps and platforms without the need for complicated modeling or schema changes.

Collection and storage technologies have also had to adjust to accommodate this new kind of data. No longer are relational database management systems (RDBMSs) the de facto option for storing operational data. NoSQL technologies such as MongoDB are gaining in popularity precisely because they employ a schema-free, document-oriented design that allows a developer to store and retrieve large collections of JSON documents quickly.

Moreover, the data life cycle goes beyond use in just transactional systems. Semi-structured data offers new opportunities for interrogations of customer and transaction data. "What if" questions are taking on a whole new dimension because data can be cross-referenced from literally dozens of sources to provide insights never before possible.

Although MongoDB and other NoSQL databases have evolved quickly to handle big data's variety and volume, they are still immature in enabling broad access for complex analytics (i.e., the ability to explore the data in different ways, using multiple values, conditions, and ranges). Such analytics allows an analyst to interactively explore data to correlate user visits with purchases, identify the most popular products and the most important users, uncover seasonality trends, and run a host of other statistical functions such as aggregations, comparisons, ranking, and other complex criteria. Yet the change to storing operational data in semi-structured form means that the existing treasure trove of SQL-based BI tools, modeling techniques, and the accompanying domain expertise that used to work for rich exploration of structured data are now rendered useless.

Hadoop, one of the best-known open source data storage and processing frameworks, illustrates this problem. Supported by a global base of users, Hadoop has the power to collect and store vast amounts of semi-structured data. Its strength, however, is batch processing, not interactive analytics. Every "question" requires a MapReduce job to be created, submitted, and run -- this can take anywhere from minutes to hours to complete. That's a non-starter when you want to iterate quickly on ad hoc, investigative analytics. The MapReduce framework itself is in another procedural (JAVA) language, requiring programming resources for every question.

Hadoop vendors are trying to remedy this with solutions such as Hive, Impala, and Stinger, which bring SQL querying capability on top of Hadoop. However, they are solving the problem in the same way as traditional structured databases/warehouses -- by imposing an upfront relational schema on this data through ETL and custom code to be able to access it using a SQL-like language.

This is a circular argument: in order to explore and understand your data, you first need to know it well enough to define its attributes and schema upfront. This won't work if you have a rapidly iterating application where your JSON documents are changing frequently. Imagine a Web app conducting a month-long A/B test of a new feature -- for that month, a subset of the users' interactions with this new feature will come as extra attributes in the JSON document. This kind of data has a very short shelf life and a high rate of change, so it is hard to justify a time-consuming upfront development and modeling effort for that one month.

Like Hadoop, MongoDB provides many built-in analytics capabilities, but it burdens an operational database that must meet SLAs with the overhead of data access for longer running analytical queries. These goals will conflict as the data and usage grow. Moreover, Mongo's query language is proprietary, requiring programming skills to query the data. Thus, for every ad hoc question a business user asks, a developer will have to write custom code. Finally, Mongo's query framework requires an understanding of how the data is physically laid out to avoid running into syntax, memory, and performance limitations on large data sets. In MongoDB, writing an optimally performing query can turn into a black art.

A true analytics solution for semi-structured data, where ad hoc queries require the scanning and joining of data across billions of records, requires a more sophisticated approach. Users should be able to cost-effectively store and manipulate the JSON data in its native form without spending effort upfront modeling, transforming, normalizing it, preserving agility and flexibility. SQL as the query language is a must, to enable broad use and access and bring insights to the decision-makers quickly. Finally, the system needs to allow truly interactive data exploration and performance over large data volumes.

A new class of data warehouse service (DWS) is providing just such a set of capabilities. These services, built from the ground up for fast, low-latency analytics on semi-structured data, do not impact operational databases. Data is stored in the DWS in its original format with support for JSON, XML, and key values as native data types. This preserves the richness of the data and circumvents the need for complex ETL or any up-front modeling before analysis.

In these next-generation DWS offerings, the analytics engine speaks ANSI SQL, and SQL-2011 OLAP operators work directly over JSON. SQL extensions also allow a JSON attribute to be treated like a virtual column, even if attributes are nested inside arrays. An analyst can now work with this data directly using favorite BI tools and script-based user defined functions. This makes the DWS MPP-style, enabling users to pay as they grow and make affordable price/performance tradeoffs.

The efficient and timely use of big data is becoming vital for today's companies to compete and grow. Analytics tools and platforms must evolve to enable this promise -- but it will be the job of forward-thinking enterprises to evaluate and adopt the tools that best leverage their organizations' existing domain knowledge investments while enabling the new use cases of big data analytics.

Poulomi Damany is the vice president of products at BitYota and has over 15 years of product management experience in enterprise data. Prior to BitYota, Damany led global product and engineering teams delivering Yahoo's custom ad targeting, listings recommendations, and branded experiences, and has run product management teams for the high availability and disaster recovery product lines at Symantec/Veritas. You can contact the author at

TDWI Membership

Get immediate access to training discounts, video library, BI Teams, Skills, Budget Report, and more

Individual, Student, & Team memberships available.