Using SQL/BI Tools with Semi-Structured Data
Analytics tools and platforms must evolve to enable the promise of big data, including mining the most from semi-structured data.
By Poulomi Damany, Vice President of Products, BitYota
Without question, big data is the next frontier of the information revolution. It holds enormous potential to remake business, science, and culture, driving unprecedented advancements in innovation and productivity. As it expands, big data is also taking on a new dimension -- one that confounds conventional BI and data warehouse strategies.
In recent years, the advent of portable smart devices has fundamentally changed the nature of data. With billions of people using multiple devices and apps to access the Web, more kinds of real-time data are being collected instantaneously than ever before. Yet smartphones and tablets aren't the only points of origin. Millions of sensor-equipped, networked devices are creating an explosion of low-latency information, adding to the volume, the promise, and the headache of collecting and analyzing this data.
Many of these devices use unique storage and format protocols. No longer is data static, structured, and predictable; it is variable, changing over time. With every new app, product feature, or software iteration, the problem compounds itself, and data professionals must deal with the consequences.
Collection and storage technologies have also had to adjust to accommodate this new kind of data. No longer are relational database management systems (RDBMSs) the de facto option for storing operational data. NoSQL technologies such as MongoDB are gaining in popularity precisely because they employ a schema-free, document-oriented design that allows a developer to store and retrieve large collections of JSON documents quickly.
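A quick sketch of what "schema-free, document-oriented" means in practice. The documents below are hypothetical, and plain Python dicts stand in for a document store; the point is that two records in the same collection can legitimately have different shapes, and readers must tolerate that.

```python
import json

# Two "documents" from the same collection -- note the second adds a
# nested attribute the first lacks; a document store accepts both
# without any schema migration.
doc_v1 = json.loads('{"user_id": 101, "event": "purchase", "amount": 25.0}')
doc_v2 = json.loads(
    '{"user_id": 102, "event": "purchase", "amount": 9.5,'
    ' "device": {"os": "iOS", "app_version": "2.1"}}'
)

collection = [doc_v1, doc_v2]

# Readers must tolerate missing attributes, e.g. with dict.get():
os_names = [d.get("device", {}).get("os", "unknown") for d in collection]
```

That flexibility is exactly what speeds up developers, and, as discussed below, exactly what trips up schema-first analytics tooling.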
Moreover, the data life cycle goes beyond use in just transactional systems. Semi-structured data offers new opportunities for interrogations of customer and transaction data. "What if" questions are taking on a whole new dimension because data can be cross-referenced from literally dozens of sources to provide insights never before possible.
Although MongoDB and other NoSQL databases have evolved quickly to handle big data's variety and volume, they are still immature in enabling broad access for complex analytics (i.e., the ability to explore the data in different ways, using multiple values, conditions, and ranges). Such analytics allows an analyst to interactively explore data to correlate user visits with purchases, identify the most popular products and the most important users, uncover seasonality trends, and run a host of other statistical functions such as aggregations, comparisons, ranking, and other complex criteria. Yet the shift to storing operational data in semi-structured form means that the existing treasure trove of SQL-based BI tools, modeling techniques, and accompanying domain expertise, which enabled rich exploration of structured data, is now rendered useless.
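To make "complex analytics" concrete, here is the kind of ad hoc question an analyst expects to answer with a few lines of SQL: aggregate, then rank. The table and data are hypothetical; an in-memory SQLite database stands in for a warehouse.

```python
import sqlite3

# In-memory table of hypothetical purchase events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "widget", 10.0), (2, "widget", 12.0), (2, "gadget", 30.0),
     (3, "gadget", 25.0), (3, "gadget", 40.0)],
)

# Most popular products: purchase count plus revenue, highest first.
top_products = conn.execute(
    """SELECT product, COUNT(*) AS purchases, SUM(amount) AS revenue
       FROM events
       GROUP BY product
       ORDER BY purchases DESC, revenue DESC"""
).fetchall()
```

When the underlying data sits in structured tables, this is a one-minute task with any SQL-speaking BI tool; the problem the article describes is that semi-structured operational data breaks this path.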
Hadoop, one of the best-known open source data storage and processing frameworks, illustrates this problem. Supported by a global base of users, Hadoop has the power to collect and store vast amounts of semi-structured data. Its strength, however, is batch processing, not interactive analytics. Every "question" requires a MapReduce job to be created, submitted, and run -- this can take anywhere from minutes to hours to complete. That's a non-starter when you want to iterate quickly on ad hoc, investigative analytics. The MapReduce framework itself requires writing procedural code (typically in Java), so every new question consumes programming resources.
Hadoop vendors are trying to remedy this with solutions such as Hive, Impala, and Stinger, which bring SQL querying capability on top of Hadoop. However, they are solving the problem in the same way as traditional structured databases/warehouses -- by imposing an upfront relational schema on this data through ETL and custom code to be able to access it using a SQL-like language.
This is a circular argument: in order to explore and understand your data, you first need to know it well enough to define its attributes and schema upfront. This won't work if you have a rapidly iterating application where your JSON documents are changing frequently. Imagine a Web app conducting a month-long A/B test of a new feature -- for that month, a subset of the users' interactions with this new feature will come as extra attributes in the JSON document. This kind of data has a very short shelf life and a high rate of change, so it is hard to justify a time-consuming upfront development and modeling effort for that one month.
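The cost of that circularity can be shown in a few lines. Below, a minimal sketch (hypothetical column names and documents) of the upfront-schema approach: a flattening step projects each JSON document onto columns that were defined before the A/B test began, so the new attribute silently vanishes.

```python
# A fixed relational schema, defined before the A/B test started:
COLUMNS = ("user_id", "event", "amount")

def flatten(doc):
    """Project a JSON document onto the fixed columns; anything not
    in the upfront schema is silently dropped."""
    return tuple(doc.get(c) for c in COLUMNS)

# During the test, a subset of documents carries an extra attribute.
doc = {"user_id": 7, "event": "click", "amount": None,
       "ab_variant": "new_checkout"}  # hypothetical test attribute

row = flatten(doc)
# The "ab_variant" signal never reaches the warehouse; analyzing the
# experiment now requires a schema change and an ETL redeploy.
```

For a short-lived, fast-changing attribute, that schema change and redeploy cycle can easily outlast the experiment itself.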
Like Hadoop, MongoDB provides many built-in analytics capabilities, but it burdens an operational database that must meet SLAs with the overhead of data access for longer-running analytical queries. These goals will conflict as the data and usage grow. Moreover, Mongo's query language is proprietary, requiring programming skills to query the data. Thus, for every ad hoc question a business user asks, a developer will have to write custom code. Finally, Mongo's query framework requires an understanding of how the data is physically laid out to avoid running into syntax, memory, and performance limitations on large data sets. In MongoDB, writing an optimally performing query can turn into a black art.
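To illustrate the "custom code per question" point: below is what a simple business question becomes in MongoDB's aggregation framework. The stages use real aggregation operators, but the collection and field names are hypothetical and no live server is assumed; the pipeline is shown as Python data.

```python
# "Purchases per product, most popular first" as a MongoDB aggregation
# pipeline -- something a developer writes (and tunes) per question.
purchases_per_product = [
    {"$match": {"event": "purchase"}},              # filter
    {"$group": {"_id": "$product",                  # aggregate
                "purchases": {"$sum": 1}}},
    {"$sort": {"purchases": -1}},                   # rank
]

# The SQL equivalent an analyst could write directly in a BI tool:
#   SELECT product, COUNT(*) AS purchases
#   FROM events
#   WHERE event = 'purchase'
#   GROUP BY product
#   ORDER BY purchases DESC;
```

The pipeline is not hard for a developer, but it is a developer task; the SQL version is something a business analyst can write and iterate on alone.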
A true analytics solution for semi-structured data, where ad hoc queries require the scanning and joining of data across billions of records, requires a more sophisticated approach. Users should be able to cost-effectively store and manipulate JSON data in its native form, without upfront effort spent modeling, transforming, or normalizing it, thereby preserving agility and flexibility. SQL as the query language is a must, to enable broad use and access and bring insights to decision-makers quickly. Finally, the system needs to allow truly interactive data exploration and performance over large data volumes.
A new class of data warehouse service (DWS) is providing just such a set of capabilities. These services, built from the ground up for fast, low-latency analytics on semi-structured data, do not impact operational databases. Data is stored in the DWS in its original format with support for JSON, XML, and key values as native data types. This preserves the richness of the data and circumvents the need for complex ETL or any up-front modeling before analysis.
In these next-generation DWS offerings, the analytics engine speaks ANSI SQL, and SQL-2011 OLAP operators work directly over JSON. SQL extensions also allow a JSON attribute to be treated like a virtual column, even if attributes are nested inside arrays. An analyst can now work with this data directly using favorite BI tools and script-based user-defined functions. Built on MPP-style, scale-out architectures, these services also let users pay as they grow and make affordable price/performance tradeoffs.
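A small sketch of what "a JSON attribute treated like a virtual column" feels like in practice. SQLite's built-in JSON functions stand in here for the vendor-specific SQL-over-JSON extensions the article describes (the exact syntax varies by warehouse), and the documents are hypothetical; note the raw JSON is stored as-is, with no upfront flattening.

```python
import sqlite3

# Documents stored in their original JSON form -- no ETL, no schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (doc TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?)", [
    ('{"user_id": 1, "device": {"os": "iOS"}, "amount": 12.0}',),
    ('{"user_id": 2, "device": {"os": "Android"}, "amount": 7.5}',),
    ('{"user_id": 3, "device": {"os": "iOS"}, "amount": 20.0}',),
])

# A nested attribute queried as if it were an ordinary column:
by_os = conn.execute(
    """SELECT json_extract(doc, '$.device.os') AS os,
              SUM(json_extract(doc, '$.amount')) AS revenue
       FROM raw_events
       GROUP BY os
       ORDER BY revenue DESC"""
).fetchall()
```

The same GROUP BY / ORDER BY idioms an analyst already knows apply unchanged, which is the whole point: existing SQL skills and BI tools carry over to semi-structured data.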
The efficient and timely use of big data is becoming vital for today's companies to compete and grow. Analytics tools and platforms must evolve to enable this promise -- but it will be the job of forward-thinking enterprises to evaluate and adopt the tools that best leverage their organizations' existing domain knowledge investments while enabling the new use cases of big data analytics.
Poulomi Damany is the vice president of products at BitYota and has over 15 years of product management experience in enterprise data. Prior to BitYota, Damany led global product and engineering teams delivering Yahoo's custom ad targeting, listings recommendations, and branded experiences, and has run product management teams for the high availability and disaster recovery product lines at Symantec/Veritas. You can contact the author at email@example.com.