
What's Essential -- And What's Not -- In Big Data Analytics

When it comes to analytics, the columnar database debate misses the point.

Big Data is hot. Advanced Analytics is hot. The combination (which we’ll call Big Analytics) is blazing. For that very reason, no one seems to agree on the right way to do Big Analytics.

Some vendors -- whom we’ll call the Columnar-Haves, including vendors marketing columnar analytic databases -- claim that columnar is the structure for Big Analytics. Others, whom we’ll call the Columnar-Have-Nots, claim that traditional row-based data stores (coupled with massively parallel processing, or MPP capability) give shops considerably more flexibility than their column-based counterparts.

What everyone does agree on, however, is that the traditional data warehouse -- powered, in most cases, by a commercial, off-the-shelf (COTS) database package -- is no longer relevant. "The analytical techniques and data management structures of the past no longer work in this new era of big data," writes Wayne Eckerson, director of education and research with TDWI, in a new Big Data Analytics Checklist Report.

The report provides an essential primer for doing Big Analytics. Although Eckerson and TDWI stop short of a technology prescription, they do explore the implications of the columnar- versus row-based debate -- as well as the essential concerns that drive any Big Analytics effort.

"[M]ost data warehouses … have reached maximum storage capacity without an expensive upgrade and can't support complex ad hoc queries without wreaking havoc on performance," Eckerson writes. "In addition, the underlying data warehousing platform … isn't scalable enough to support new sources of data [e.g., either internal or external] and maintain adequate query performance."

"To avoid these limitations, companies need to create a scalable architecture that supports big data analytics from the outset and utilizes existing skills and infrastructure where possible," he continues. "To do this, many companies are implementing new, specialized analytical platforms designed to accelerate query performance when running complex functions against large volumes of data. Compared to traditional query processing systems, they are easier to install and manage, offering a better total cost of ownership."

Doctrinal Disagreement?

Columnar-haves like to point to the high number of columnar entrants (vendors such as Aster Data Systems Inc., ParAccel Inc., and Vertica Corp., among others) and the columnar-come-lately strategies of vendors such as Netezza Inc. and Oracle Corp., as well as research from prominent market watchers such as International Data Corp., which earlier this year wrote enthusiastically about the benefits and future popularity of column-based data stores.

Not surprisingly, Columnar-Have-Nots tend to take issue with this vision.

They tout their massively parallel processing (MPP) underpinnings -- a topology which they share with most columnar players -- and say that a conventional row-based architecture, coupled with MPP brawn, is more flexible than a columnar MPP topology.

Take John Thompson, U.S. CEO of Kognitio, one of the longest-lived analytic database players, who claimed that the Neo-Columnar Wave appeared specifically in response to analytic workloads that were overwhelming COTS DBMSs. "My view is that columnar is a really interesting and good technology for certain applications, but … I believe that those applications are receding and becoming more and more of a minority in the trend that we see going toward Big Data [and] Always-On Data," Thompson explained in an April interview.

Columnar skeptics such as Thompson raise questions about the flexibility of a column-oriented design, especially from a data management perspective. It's a familiar tactic: Columnar-Have-Nots (such as Kognitio and Dataupia Inc.) tend to concede advantages in some (very specific) cases, but inevitably raise questions about column-orientation's suitability in "broader" or "general" DW scenarios. Vendors such as Netezza and Greenplum -- which recently introduced hybrid row/column facilities for their DBMSs -- tend to be more pragmatic on both questions.

Columnar Not (Exactly) the Right Question

Eckerson and TDWI have a more pragmatic take on what might be called the "Columnar Imperative."

Far from arguing over the benefits (or drawbacks) of a column-based architecture, shops would be better advised to focus on other, potentially more important issues. Row- or column-based engines marketed by Aster Data, Dataupia, Greenplum Software Inc. (now an EMC Corp. property), Hewlett-Packard Co. (HP), InfoBright, Kognitio, Netezza, ParAccel, Sybase Inc. (now an SAP AG property), Teradata, Vertica, and other vendors (to say nothing of the specialty warehouse configurations marketed by IBM, Microsoft, and Oracle) are by definition architected for Big Analytics.

Yes, some will scale better than others -- in specific configurations or for specific applications -- but scalability, at least in this context, is a particular rather than a universal consideration.

A more important consideration, according to Eckerson, concerns the "available options" that are unique to the different analytic database engines. Analytic database players compete on precisely these options. Just as in the automotive world -- where features such as anti-lock brakes, automatic transmissions, and airbags have morphed from nice-to-have options into need-to-have requirements -- some features come standard with all analytic databases and some are still optional.

Analytic database vendors today compete on the basis of several options -- capabilities such as in-database analytics, support for non-traditional (typically non-SQL) query types, sophisticated workload management, and connectivity flexibility.

Every vendor has an option-laden sales pitch, of course -- but few (if any) stories are exactly the same. In-database analytics is particularly hot, according to Eckerson. All analytic database vendors say they support it (to a degree), but some -- such as Aster Data, Greenplum, and (more recently) Netezza, Teradata, and Vertica -- seem to support it "more" flexibly than others.

"[S]o-called 'in-database analytics' minimizes or eliminates data movement, improves query performance, and optimizes model accuracy by enabling analytics to run against all data at a detailed level instead of against samples or summaries," writes Eckerson, who notes that the in-database approach "is particularly useful in the 'explore' phase, when business analysts investigate data sets and prepare them for analytical processing, because now they can address all the data instead of a subset and leverage the processing power of a data center database to execute the transformations." (In-database analytics is likewise important in what Eckerson calls the "scoring" stage -- i.e., when an analyst applies a model or function to incoming records.)

"With in-database analytics, scoring can execute automatically as new records enter the database rather than in a clumsy two-step process that involves exporting new records to another server and importing and inserting the scores into the appropriate records," he explains.

The twist comes by virtue of (growing) support for non-SQL analytic queries, chiefly in the form of the (increasingly ubiquitous) MapReduce algorithm. Aster Data and Greenplum have supported in-database MapReduce for two years; more recently, both Netezza and Teradata, along with IBM, have announced MapReduce moves. Last month, open source software (OSS) data integration (DI) player Talend announced support for Hadoop (an OSS implementation of MapReduce) in its enterprise DI product. Talend's MapReduce implementation can theoretically support in-database crunching in conjunction with Hadoop-compliant databases.
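
For readers who have not worked with it, the MapReduce pattern itself is simple. The following self-contained Python sketch runs a toy word count in-process, purely to show the map/shuffle/reduce shape; it stands in for, and is far simpler than, what Hadoop or an in-database implementation actually does.

from collections import defaultdict

def map_phase(record):
    # Emit one (word, 1) pair per word in the input record.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Aggregate every value that shares a key.
    return key, sum(values)

records = ["Big Data is hot", "Advanced Analytics is hot"]

# Shuffle: group the intermediate pairs by key.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

print([reduce_phase(k, v) for k, v in sorted(groups.items())])
# [('advanced', 1), ('analytics', 1), ('big', 1), ('data', 1),
#  ('hot', 2), ('is', 2)]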

Although support for non-SQL analytics is today a nice-to-have option, it could soon become a need-to-have standard feature, according to Eckerson.

"Many analytic computations are recursive in nature, which requires multiple passes through the database. Such computations are difficult to write in SQL and expensive to run in a database management system," he points out. "[T]oday most analysts first run SQL queries to create a data set, which they download to another platform, and then run a procedural program written in Java, C, or some other language against the data set. Next, they often load the results of their analysis back into the original database."

Exporting, crunching, and reloading makes no more sense for non-SQL analytics than it does (to recap) for SQL-based processing. The solution, once again, is in-database support for non-SQL analytics, Eckerson explains.

"[T]echniques like MapReduce make it possible for business analysts, rather than IT professionals, to custom-code database functions that run in a parallel environment," he writes. As implemented by Aster Data and Greenplum, for example, in-database MapReduce permits analysts or developers to write reusable functions in many languages (including the Big Five of Python, Java, C, C++, and Perl) and invoke them by means of SQL calls.

Such flexibility is a harbinger of things to come, according to Eckerson. "[A]s analytical tasks increase in complexity, developers will need to apply the appropriate tool for each task," he notes. "No longer will SQL be the only hammer in a developer's arsenal. With embedded functions, new analytical databases will accelerate the development and deployment of complex analytics against big data."

Integration and Interoperability Still Matter

Lastly, Eckerson urges, don't forget mission-critical amenities -- issues such as integration, interoperability, reliability, availability, and security. Although analytic databases were first pitched as "appliances" -- as plug-in or turnkey complements (or, in some contexts, as rip-and-replace alternatives) to an existing DM infrastructure -- in practice, such offerings are rarely, if ever, non-disruptive. This is one reason some vendors (Teradata and, more recently, HP) emphasize what they claim are best-in-class workload management features.

Their analytic databases better integrate with a shop's existing DM infrastructure, both vendors like to claim, and they boast both the scalability and the flexibility to support a wide range of users, applications, and queries. More recently, other players -- such as Aster Data, Kognitio, Netezza, and Vertica -- have hyped their own workload management efforts.

Moreover, most players like to tout the resiliency and built-in fault tolerance of the MPP architecture -- although (in a familiar move) some claim to be more fault tolerant (or more resilient) than others.

These and other issues are assuming greater salience, according to Eckerson.

"[I]nvestigate whether the analytic database integrates with existing tools in your environment, such as ETL, scheduling, and BI tools. If you plan to use it as an enterprise data warehouse replacement, find out how well it supports mixed workloads, including tactical queries, strategic queries, and inserts, updates, and deletes," Eckerson concludes. "Also, find out whether the system meets your data center standards for encryption, security, monitoring, backup/restore, and disaster recovery. Most important, you want to know whether or to what degree you will need to rewrite any existing applications to run on the new system."
