Machine-Generated Big Data Poses Special Challenges
Machine-generated data poses challenges that can make it difficult for any relational competitor to be dominant.
- By Stephen Swoyer
- October 23, 2012
Analytic database specialist Infobright is trying to corner the market on machine-generated data. It isn't alone.
Machine-generated data is expected to comprise one of the biggest categories of big data, so challengers abound -- including many of Infobright's established analytic database rivals.
According to some challengers, however, the problems posed by machine-generated data make it difficult for any relational competitor to become dominant.
What's needed, these challengers claim, is an entirely new class of platform.
Infobright Changes Tack
Infobright says it's serious about tackling machine-generated big data. It plans to release its new Infopliance later this year. The analytic appliance scales from 12 TB to 144 TB. With load speeds of up to 10 TB per hour (per node) and an architecture that officials claim is highly suited to storing and analyzing machine-generated data, Infobright says it likes its chances in this fast-growing market.
At the same time, concedes president and CEO Don DeLoach, it still has to get the word out.
After all, he acknowledges, prospective customers tend to default to well-known brands.
"We've concluded that there is a market for an appliance for machine-generated data," DeLoach says. "We've seen customers who select something like a Netezza or Teradata, or even an [Oracle] Exadata to store their machine-generated data."
By "machine-generated" data, Infobright means a potentially enormous category: Web logs, network events, call data records, RFID information, and telemetry data generated by IP-enabled end-points or devices. It's the very stuff of big data.
"[The Infopliance is] a general-purpose, real-world technology for doing something that's very narrow-scope," says DeLoach. "The market had evolved to the point where there is a very definite need for something like this that was fundamentally not being fulfilled [or which was being] filled at much higher cost to [prospective customers]."
DeLoach says Infobright's architecture -- which, although columnar, is distinctly different from columnar competitors such as ParAccel or the former Vertica -- is well-suited for storing and analyzing machine-generated data. "The mathematics behind our offering has some unique advantages for machine-generated data that actually become a limitation when you get into a general-purpose data warehouse.
"The [intellectual property] associated with how we load the data, how we establish the metadata layer; the lack of any kind of administrative overhead, the aggressive hardware compression -- these are all advantages [that Infobright has relative to its] competitors."
DeLoach points to a trio of immediate drivers -- call data records, Web logs, and network events -- the volumes of which he claims are expanding rapidly.
"There's going to be a proliferation of sensor data. As the market matures from a machine-to-machine standpoint, and as we see things like the smart grid market evolve, I think that you will expect to see more and more of Infobright used in these solutions."
Mark Madsen, a principal with consultancy Third Nature Inc., says Infobright's architecture does confer a few advantages when it comes to processing, storing, or analyzing machine-generated information.
"[Infobright does] some very interesting stuff for data placement, which optimizes for storage and I/O [performance]," he comments. "Infobright is designed for write-once data in [a] simple schema, like big flat log records in big flat tables," he continues, adding that -- in many cases -- data of this kind "essentially" comprises an event stream.
"Those kinds of sensor streams you can do fine with getting, organizing, and querying them, but you can't [easily or efficiently] do much math on them," Madsen observes.
A Post-Relational World?
This is true of any relational database system, says Madsen: unless it embeds analytic routines inside the database engine itself -- much as IBM Corp., ParAccel Inc., Oracle Corp., SAP AG, and Teradata Inc. are doing with their data warehouse platforms -- it's at a computational disadvantage relative to other kinds of data stores, such as vector- or matrix-based engines.
That's the rub. The data or event streams generated by sensors, embedded devices, machines, and other types of intelligent "things" tend to be multidimensional. Analyzing this data calls for both spatial and time-series operators. For this reason, some experts argue, it lends itself to a more demanding kind of analytics. Call it computational analytics.
That's one reason Infobright and its relational database rivals aren't the only challengers in this segment.
There's another category of database engine: what might be called the "baggage-free" computational analytic platform. These databases reject traditional relational architectures. Entrants in this class include StreamBase Systems, VoltDB, and SciDB -- all three of which were developed (or co-developed) by industry luminary Michael Stonebraker -- as well as Paradigm4, a big analytics platform based on SciDB.
Stonebraker likes to describe SciDB as a data management and analytics software system (DMAS), an acronym that he uses to distinguish it from the familiar DBMSes that have long anchored BI and DW efforts. SciDB was designed to power the Large Synoptic Survey Telescope (LSST), an ambitious effort to map the Milky Way, among other goals. When it goes live in 2021, the LSST is expected to generate up to 30 TB of data every night.
As a DMAS, SciDB isn't architected like a conventional (relational) repository. It structures information in terms of arrays and vectors, which proponents say makes it ideal for expressing both spatial and time-series operators.
Paradigm4 extends SciDB with optimizations for or enhancements to the R statistical/programming language, MATLAB, and IDL, along with improved support for procedural languages such as C++ and Python. In addition, Paradigm4 offers management tools and other proprietary add-ons, along with maintenance and support.
"The basic idea here is that relational databases have been around for years [and] have a table data model that was designed for business facts, [which means that] you can't take this [machine-generated] data that's inherently ordered and shoehorn it into a relational database without sacrificing performance," says Marilyn Matz, CEO of Paradigm4. "So [Paradigm4's] basic data model is a multi-dimensional array ... [and] the value of that array preserves the inherent ordering [of data]; if you have spatial data, the [data] to the left of you is [logically understood to be] in the position to the left of you, [both] in storage and in the real world."
Paradigm4, Matz says, benefits from SciDB's architecture, which was designed to accelerate certain kinds of complex operations. "A lot of these math operations ... are actually matrix operations ... and a lot of these problems have high dimensionality, so when you're doing discovery analytics or ad hoc querying, you want to be able to slice, dice, drill down ... without having to set up any indices or doing any tuning: you just want access."
It's in this respect, she claims, that a relational architecture is most limiting. "In a relational database, it doesn't matter if you choose 'row-major [order] or column-major [order], you're still not having this [i.e., Paradigm4's underlying] dimensional model," Matz contends. "A relational database has to store indices; we don't. It's like [it is in] programming: you declare the multidimensional array and you compute where the data is."
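Matz's analogy to programming can be sketched in a few lines. This is an illustrative plain-Python example, not SciDB or Paradigm4 code: in a dense array stored in row-major order, a cell's location in storage is computed directly from its coordinates, so no index structure is needed, and spatial neighbors in the data are also neighbors in storage.

```python
# Illustrative sketch (plain Python, not SciDB): a 4x5 grid of sensor
# readings stored flat in row-major order. A cell's storage offset is
# computed from its coordinates -- no index lookup required.
rows, cols = 4, 5
flat = [float(i) for i in range(rows * cols)]

def cell(r, c):
    """Address cell (r, c) by computing its offset into flat storage."""
    return flat[r * cols + c]

# The cell "to the left" in the real world is also the immediately
# preceding cell in storage, preserving the data's inherent ordering.
assert cell(2, 3) == flat[2 * cols + 3]
assert cell(2, 2) == flat[2 * cols + 3 - 1]
```

A relational engine, by contrast, must consult an index (or scan) to find the same cell, because row order in a table carries no spatial meaning.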
Paradigm4 is not a SQL database. It substitutes two languages -- AFL and AQL -- for the RDBMS's traditional dependence on SQL. AFL, Matz says, "looks just like APL," an array-oriented programming language. (APL is an acronym for "A Programming Language;" AFL, on the other hand, stands for Array Functional Language.)
In lieu of SQL, Paradigm4 prescribes AQL, or Array Query Language, which Matz says "people coming from the SQL analytics world are more comfortable with."
Another way in which Paradigm4 (and SciDB) differ from relational databases is in how they handle missing data values, or NULLs. According to Madsen, data management (DM) practitioners will use a number of methods -- e.g., they'll compute a moving average -- to handle NULLs.
For many applications, however, this approach has problems, Madsen says. "I can substitute context easily using just a moving average and [I won't have any] problems in merchandising, but you can't do that in a risk calculation; you have to have a mathematically valid way [of computing a missing value], so you would have to sort of custom-write an algorithm," he comments. "A lot of people can't even use a data warehouse today for machine learning stuff, or basic statistics. In fact, when people are doing hardcore stuff, many times they end up bypassing [the data warehouse] and going to the raw data anyway."
SciDB and Paradigm4 offer users more flexibility in this regard, Matz maintains.
"SQL semantics has one notion of NULL, and that doesn't cut it, so what we have is ... an unlimited number of codes [that you can use] instead of NULL, [because] what people really want to do is context-specific substitution," she explains.
For the risk calculation that Madsen invokes, for example, a user could assign a custom-written algorithm to a specific code, which -- depending on the context -- would be substituted when or where appropriate. "I might be in one kind of use case or query ... where ... if there's a missing value I might want to fill in the spatial average," Matz concludes. "We're able to support multiple flavors of NULLs" by means of substitution codes.
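The technique Matz describes -- multiple missing-value codes, each mapped to its own substitution strategy -- can be sketched in plain Python. This is a hypothetical illustration, not Paradigm4's actual API: the codes and fill functions here are invented, but they show how one gap can be filled with a moving average in one context and a fixed value in another.

```python
# Hypothetical sketch of context-specific missing-value substitution,
# in the spirit of the multiple "flavors of NULL" Matz describes.
# The sentinel codes and fill strategies are invented for illustration.
MISSING_MOVING_AVG = -1.0   # code: fill with an average of neighbors
MISSING_ZERO = -2.0         # code: fill with zero

def moving_average_fill(series, i, window=2):
    """Average the nearest non-missing values around position i."""
    neighbors = [v for v in series[max(0, i - window):i + window + 1]
                 if v >= 0]
    return sum(neighbors) / len(neighbors) if neighbors else 0.0

FILLS = {
    MISSING_MOVING_AVG: moving_average_fill,
    MISSING_ZERO: lambda series, i: 0.0,
}

def resolve(series):
    """Substitute each missing-value code using its assigned strategy."""
    return [FILLS[v](series, i) if v in FILLS else v
            for i, v in enumerate(series)]

readings = [10.0, 12.0, MISSING_MOVING_AVG, 14.0, MISSING_ZERO, 16.0]
filled = resolve(readings)
assert filled == [10.0, 12.0, 12.0, 14.0, 0.0, 16.0]
```

A risk calculation would simply register a different, mathematically valid fill function under its own code, rather than accepting the one-size-fits-all NULL of SQL semantics.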