Brighthouse Data Warehouse Works Smarter to Accelerate Queries
Infobright says its Brighthouse data warehouse works "smarter, not harder" to accelerate query performance
- By Stephen Swoyer
- April 23, 2008
Columnar databases are an overnight sensation more than 20 years in the making.
Where columnar databases were once the odd stepchildren of the data warehousing world, they've grown in popularity largely thanks to their fast analytic query performance.
Last year, no fewer than three columnar database releases arrived, from ParAccel Inc., Vertica Systems Inc., and Infobright Inc. These players join Sybase Inc. and its venerable Sybase IQ in offering a columnar (or mostly-columnar) take on data warehousing.
Where Sybase IQ was once a tough sell, columnar vendors are finding that -- thanks to the technology's extremely fast query performance and extremely small data footprint -- it's a technology that largely sells itself.
Each vendor puts a different spin on the technology. ParAccel claims it was first to market (among the columnar upstarts) and points to an impressive list of benchmarks and proofs-of-concept. Vertica trumpets the extensive R&D it put into its DW technology, which it says amounts to a 21st-century retrofitting of a 20-year-old technology. Infobright is proud of the native intelligence of its Brighthouse data warehouse, which it says works "smarter, not harder" to quickly process queries.
"We're very, very different from the other technologies in the marketplace," argues Miriam Tuerk, Infobright's president and CEO. "With most [competing solutions], the output of what you're getting is quite rigid, so if you want to look at data in the way that it's been structured, and run a query against a particular case, that works. But if you want something more flexible and changing over time, that's where you run into problems."
That's in part because mainstream data warehouses (DWs) are basically just gussied-up relational database management system (RDBMS) platforms, Tuerk argues. "They're using [their] all-purpose relational database systems, and in order to turn them into data warehouse systems, they [become] very resource-intensive -- they require a lot of data center operations time, a lot of server time, etc."
Some vendors address this problem by throwing lots of hardware at it, Tuerk contends; others rely on partitions, indexes, or materialized views.
These are smart approaches, Tuerk and Infobright concede, but they aren't smart enough: while they may accelerate the performance of traditional RDBMSes (used as data warehouses), they don't address the shortcomings of the RDBMS-as-a-DW paradigm.
That's where Infobright's version of "smarter" comes into the picture. "We're all about working smarter instead of harder. We actually use the intelligence of the data to help accelerate our performance," she comments. "There's a lot of information in the data that can give you roadmaps and visibility in terms of how to do things."
Infobright uses this information -- call it naturally-occurring metadata -- to accelerate queries, reduce data volumes (and, consequently, data loading times), and perform other tricks. Its Brighthouse DW can be implemented on either a standalone basis or alongside a traditional data warehouse. In the latter scenario, data can be offloaded from a data warehouse into Brighthouse, which can be used to accelerate particularly intensive analytic queries.
One upshot of this is better performance all around, Tuerk argues: vanilla DW users benefit because the original data warehouse -- which isn't optimized for analytic queries -- handles the workloads it does excel at (e.g., transaction-based tasks), while intensive analytic workloads run faster because they're offloaded to Brighthouse.
It works almost automatically, Tuerk claims: as data's fed into it, Brighthouse creates what Infobright calls a "Knowledge Grid" -- i.e., a kind of index or metadata layer -- across every column in the data warehouse.
"We just load the data in as it comes. We break the data into individual columns, then break [those] into data packs [of 65,000 elements]. Instead of just using a standard compression algorithm against the data, we look at it [the data] as we're loading it into the data packs and we figure out how to store it most efficiently," Tuerk explains. "We can optimize for individual data packs. Every single data pack in a column will have a different compression value. That's how we get an average of at least 10 to 1 compression in Brighthouse."
Brighthouse's Knowledge Grid isn't just a one-trick pony: it actually uses three different metadata layers. "It keeps information about every single data pack and the min/max [the range of values] in every single data pack," Tuerk says. "The next layer is metadata about data that's inside the data pack, so we would have a histogram of the data inside the data pack. The third level is relationships about the data. When you're looking at data warehousing, relationships [among] data [are] very important."
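As another hedged sketch (the structures and bucket counts below are invented for illustration; Infobright hasn't published its formats), the first two layers of such a grid -- per-pack min/max nodes and per-pack histograms -- could be computed at load time along these lines. The third, relationship layer is omitted because the article gives no detail on how it's built.

```python
from collections import Counter

def pack_node(values):
    """Layer 1: summary statistics kept for every data pack."""
    return {"min": min(values), "max": max(values), "count": len(values)}

def pack_histogram(values, buckets=16):
    """Layer 2: a coarse histogram of the values inside one pack.
    Numeric columns only, for simplicity."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets or 1  # avoid zero width for constant packs
    hist = Counter(min(int((v - lo) / width), buckets - 1) for v in values)
    return {"lo": lo, "width": width, "buckets": hist}

def build_knowledge_grid(packs):
    """Compute metadata for every (decompressed) pack of a column."""
    return [{"node": pack_node(p), "hist": pack_histogram(p)} for p in packs]
```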
How much of a bite does its metadata-laden Knowledge Grid take out of the 10:1 compression? A tiny sliver, says Tuerk, who claims that "the metadata layer is the size of one percent of the compressed data."
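By that arithmetic, a terabyte of raw data would shrink to roughly 100 GB of compressed data packs in Brighthouse, topped by a Knowledge Grid of only about 1 GB.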
This one percent sliver delivers a big bonus, she argues. "Most of the time, the Knowledge Grid has sufficient information between its relationship level, its detail data level and its parameters around the data pack to actually give you [the answer to a query]," she comments. "The Knowledge Grid sits in memory, so if you take into account bad data packs -- [i.e., data packs which are] completely irrelevant to a query -- you can prioritize the order of operations [in memory], and, depending on how you do things, you can oftentimes eliminate even the ugly ones.
"The [query] optimizer doesn't calculate the query path just once. It calculates the best query path, runs it against the Knowledge Grid, then it looks to see how it can reoptimize [for performance]. So the way that I resolve the query isn't to access the data by using more MIPS processing or more memory or more storage, [it's] to figure out how to do the query without using actually doing the calculation at all."
Brighthouse is based on a hot-rodded version of MySQL, Tuerk says, complete with substantial under-the-hood modifications. For this reason, it's able to take advantage of MySQL's interoperability with most business intelligence (BI) and performance management (PM) tools.
"We wrapped MySQL around the software; everything is beneath the covers, so out of the gate we work with all of the business intelligence tools: Business Objects, SAS, Cognos, MicroStrategy." Brighthouse also helps optimize one of the most onerous aspects of building and deploying an enterprise data warehouse: designing a data model.
"When you design a data model, there are two parts: a logical data model and a physical data model. The second part [of a data model] is where I have to create indexes, partitions, materialized views, [and] it's that second component that is totally eliminated," she explains.