Getting Started with Data Warehouse Appliances
Need more BI horsepower? A data warehouse appliance may be just what you need.
By Lou Agosta, Ph.D.
If your BI reports and queries are running slowly or data volumes are growing rapidly, you may be looking to BI appliances for relief. Just what is a data warehouse appliance, and what benefits do they promise (and actually deliver)? We'll answer that question and examine what distinguishes appliance products and best practices for selecting the right appliance.
An appliance is a complete technology configuration in a box that addresses the requirements of a particular end user of computing resources in a prepackaged format that encapsulates the entire technology stack in a usable format. Successful consumer appliances have included the refrigerator, the boom box entertainment system, and "gadgets" such as the iPhone and Blackberry.
In the BI world, the data warehousing appliance extends this metaphor to the enterprise data center with the vision of a high-performance database system that satisfies business intelligence (decision support) requirements and includes the server hardware, network interconnect, database software, and selected load, workload, scheduling, and administration tools needed for quick installation, loading, and ongoing monitoring.
Enterprise data warehousing appliances are popular because they get the job done in many data scenarios. However, in spite of their significant success, data warehousing appliances are not a one-size-fits all proposition, nor, as any vendor will tell you, are they appropriate for every workload profile or data warehousing challenge. A diversity of appliance vendors have emerged, including appliance offerings from the large, established information technology (IT) stalwarts such as HP, IBM, Oracle, and Microsoft. Teradata objects to be called an "appliance," though it also objects to not being named as an IT stalwart that is relevant to the appliance market.
Best-of-breed innovators continue to contribute to market dynamics. Key differentiators -- about which, as a prospective buyer of a data warehousing appliance, you should examine --include the number of successful installed customers in production willing to speak about their experiences (both positive and negative); the details of the technology itself (whether the database is open source and how it is customized, whether the server, disk, and networks are a commodity components and how they can be customized; the breadth and maturity of complementary tools such as inquiry and reporting, ETL, data quality solution); and the price of acquisition and cost of operation. Published results from public benchmarks (such as tpc.org) are also useful for starting a conversation about performance and price, though don't rely exclusively on the benchmark "winner" since results are frequently updated.
As with all information technology, imagine a spectrum of options from completely proprietary to totally open. Often when an innovative technology is released, it enjoys a performance or price advantage by using special-purpose (proprietary) hooks and implementations to provide improvement. The downside to proprietary solutions includes limited options for ongoing operations, support, and trouble-shooting in the event (even if unlikely) that something breaks. Technology lock-in with a proprietary approach that is even more "locked in" than the average implementation using de facto standards can also be an issue.
For example, Netezza is the vendor credited with the initial success of data warehousing appliances, and its approach has taken advantage of relatively proprietary performance boosters such as field programmable gate arrays (FPGA), pushing I/O downward closer to the data streaming off the disk, and an approach (on the part of Netezza) that until recently has not aimed to exploit improved performance through standard database indexes, scheduling of mixed workloads, and commodity hardware and disk.
Netezza has frequently surpassed the performance of legacy Oracle database installations running on legacy HP or Sun servers. Never one to shy away from a good competitive battle, Oracle has responded with several appliance initiatives culminating in its own branded appliance, running HP hardware with the flagship Oracle (RAC) database behind the HP Oracle Exadata Storage Server. Oracle's is a relatively new offering, and many analysts and users would like to learn more about the seamlessness of the service level agreement (SLA) and hardware support from a software company.
What might be called second-wave appliance vendors have answered with the message of "more openness." DATAllegro is using the Open Ingress database, an open source relational database with de facto open components such as Dell servers, EMC disk, and a Cisco network backplane. DATAllegro is now a part of Microsoft, so if the company introduces a version for MS SQL Server, it will likely be a totally rearchitected SQL Server that has morphed into a parallel product.
Greenplum (which markets a software only "appliance") offers a massively parallel distribution of the initially open source Postgres database and partners closely with Sun Microsystems on hardware. Meanwhile, additional candidates in the data warehousing appliance space include NeoView from HP, an updated version of the nonstop Tandem system, the Balanced Warehouse from IBM, Teradata, and (as already noted) Oracle's Exadata. Teradata denies being an appliance, though industry analysts (myself included, and TDWI's Philip Russom) have designated it as the "mother of all appliances."
What continues to be valid about the data warehousing appliance trend in all its many themes and variations is that buyers want simpler solutions at a superior price/performance -- simpler to buy, install, load operate, maintain, and upgrade.
Benefits of an Appliance
It is useful to distinguish between query-intensive and transaction-intensive workloads. Data warehousing appliances address the former. One of the key activities of the data warehouse is to determine what customers are buying and using what product or service and when and where they are doing so. The data warehousing appliance addresses fundamentally the same issues as the data warehouse itself with one exception: The track record of appliances is shorter and appliances have often operated as functional data marts, addressing specific subject area issues such as product, customer, marketing, and finance.
Data warehousing appliances have been used as a data repository for use by resource-intensive queries with relatively little regard for optimizing resource use. They've been a "dumping ground" for data and queries. In many ways -- and like any appliance -- the data warehouse appliance has been something of a black box. This has worked well in the short term. However, in spite of the performance boost and standard interfaces (such as ODBC, among others) for loading and distributing data, bottlenecks have developed when loading data, representing it the way the business wants to see it, and getting the data out. Data warehouse appliances aren't immune from data quality issues, either, which should serve as a reminder that no solution is any better than the overall maturity of the technology and business environment into which it is inserted.
The critical path to the integration of a data warehousing appliance into any given computing infrastructure is how the business represents the data as well as what options are available for partitioning the information in a divide-and-conquer way to enable optimization. If a subset of work can be readily identified from a central enterprise data warehouse, then an appliance is a good solution for "offloading work." If a standalone application is required -- for example, marketing -- then an appliance is a good candidate for a specific dedicated appliance application. However, if a server environment must perform extract, transform, and load (ETL) tasks, integrate data, or check data quality, or must perform heavy data updates and transformations, an appliance is not a good choice. Instead, invest in a standard server and configure it to your requirements.
Selecting an Appliance
How do I know if my company is a good candidate for selecting a data warehousing appliance? Data warehousing appliances are moving from the "early adopter" stage to being "mainstream." This limits the downside risk from a technology perspective; but issues of economics, relationships, and implementation still loom large.
Having a business executive who reads about the DW appliance trend and says "Get me one of those!" can be a mixed blessing. It does solve the problem of executive sponsorship -- not a trivial matter -- but it does not make for an informed choice. The selection process should have various "commit points" that lead up to a rational decision such that the purchase supports the business goals of the organization. It should be possible to "hold" or "abort" the decision, but only for the right reasons -- the organizational readiness assessment shows something is missing (and what must be done to remediate the matter), reasonable requirements are not satisfied, or other "due diligence" step surfaces a red flag.
The short version of the selection process comes down to technology, relationships, and economics.
Technology: Does the vendor have the technology you require now? Don't just meet with marketers -- talk to whomever handles production support. Is there a single point of contact and what is your assessment of this individual's "clout" within the organization? If he or she has been in this position for just six months, for example, be skeptical. Although it is important that the vendor have a robust road map (e.g., open source), focusing on current product capabilities will significantly shorten the marketing presentation. If it seems like everything you really need will be in the next release, then it is best to talk to that vendor later when that release is in production. In the meantime, continue your search for a product that meets your requirements today.
Relationships: At the same time, understand data warehousing appliances are relatively new, either stand alone or within a larger organization. Can the vendor help you build the business case with client references that have realized enough economic benefits and cost savings to render the project self-funding? What is the delivery schedule?
Economics: Put everything in writing. Unless it is in writing, who will remember it two weeks from now? Unless you are operating a software development lab, be sure to specify that you require the appliance to have the features and functions relating to price, performance, interoperability, and tools as specified in your requirements list today. These should be part of the current production release.
Candidate Criteria
The longer version of the selection process includes a "shoot out" among candidates on a short list. In performing a "shoot out" between multiple vendors, including data warehousing appliances, proceed with spreadsheet and scoring criteria in hand. Key criteria include:
- The ability of the vendor to deliver the product, up and running, on time on budget
- The cost model for system operation -- how many support personnel (or fraction thereof) and hours per period
- Feedback from conversations (possibly including an on-site visit) with existing customers whose workload profile is similar to that of your own is some meaningful way(s)
- The future road map of the vendor product to exploit the cost efficiencies of open source and other technology innovations (what happens four years into the future when the system is fully used and maxed out?) and the upgrade path if any
- Performance of the appliance in a real-world setting with data and workloads that resemble your own enterprise
- How well the appliance will be integrated with the existing IT configuration
The last two criteria are particularly important. If the vendor takes your large, resource-intensive queries and runs them on a brand new configuration with no other work running, then performance will be significantly better than in a mixed workload environment. What happens to performance with existing data and mix of queries and updates? A simulation (test) using your existing data, data model, and workload is one way to find out.
Your due diligence requires the involvement of a database administrator (DBA) and responsible management oversight of any "shoot out." That means time and effort – in other words, work -- to avoid solving the wrong problem. Work for staff and managers who may already have no extra bandwidth to spare. One possibility is to hire an outside consultant from an independent firm to represent your company's interests and to oversee the acquisition of the new, complementary system. In either case, teamwork, cooperation, business acumen, and technology smarts are on the critical path to success.
Integrating an Appliance in Your Environment
One of the reasons the DW appliance was invented was to reduce complexity. That includes the complexity of implementation. Nevertheless, a readiness assessment should be undertaken to assure that the required support is in place. This extends from the physical environment with dual backup power sources and cooling to the organizational dynamics around accessing the DW appliance.
In the DW appliance, a few simple choices -- such as small, medium, and large -- has replaced large numbers of tuning variables, sort heap sizes, SQL analyses, and buffer pool management. Massive parallel processing has replaced performance exercises in designing and building database indexes. A simple hash algorithm for distributing data uniformly has replaced attempts to place data sets on disk by figuring out a predetermined, anticipated pattern of business use. This simplification puts a premium on making the right choice upfront.
If the installation wants to move to the DW appliance with a proprietary version of structured query language (SQL) such as Oracle's PL/SQL, then a conversion of the code will be required. Perform an inventory of the source code at least into easy, medium, and hard (complex). Get training in accessing, using, and coding the language and syntax of the target environment. It may be standard SQL, but it seems like there are always some little variations that can become a surprise "gotcha." Smart management always appreciates knowing the issues or problems in advance rather than being surprised.
If the front-end to the DW appliance is an ad hoc query tool such Business Objects (SAP) or Cognos (IBM), etc., then make sure the users are properly trained in using the tool. Just because massive parallel processing is now available does not mean one can be causal and generate SQL that omits the WHERE clause in one's SQL statement. With front-end tools, sometimes one cannot tell the difference without looking "under the hood," though that is not so much a DW issue as a standard best practice in posing queries and questions.
Once you've purchased your appliance, the challenge becomes how to integrate it into an existing BI environment. The potential "gotcha" is that the value of the appliance will be negated if it not functionally integrated into the environment and becomes a fifth wheel. Loading data in an efficient operation, extracting the data and distributing it (if appropriate), and synchronizing (coordinating) the resulting reports with those in the existing data warehousing environment can be tricky. The DW appliance can function as designed perfectly. Yet if it is not properly integrated, it can become just another information island, a silo of data. Even worse, even though processing on the appliance island can be blazingly fast, if the access and exit become bottlenecks, the advantage is at risk.
In addition, one of the key cost drivers of any system is the number of interfaces used. The more interfaces, the greater the maintenance and update costs. Multiplying the number of spokes in a hub and spoke architecture will increase costs proportionately. Therefore, look for an opportunity to consolidate multiple data marts ("spokes") into a super data mart (the appliance) and realize the corresponding cost savings (from reduced maintenance of multiple system interfaces).
If the installation already has a service level agreement (SLA) in place for computing resources, the DW appliance should be integrated into the mix. Often with an SLA, the end user of information services does not necessarily know what provides the underlying computing resources. Depending on the size of the installation, the DW appliance has the potential to improve the metrics across the board if the response time, amount of work, and throughput are aggregated and calculated for the DW function as a whole. If the installation does not have an SLA, then the arrival of the DW appliance is a good opportunity to perform some planning and put one in place.
If part of the data warehousing work can be consolidated in a standalone data mart, whose interface(s) with the enterprise data warehouse is well defined, self-contained, and (ideally) unidirectional (such as marketing or finance), then it is a good candidate for a new appliance implementation. If carefully managed and addressed, the data warehousing appliance will become another bullet, even if not always a silver one, in the arsenal addressing key business questions. Of course, all the usual disclaimers apply and the appliance must be able to handle the mixed workload, something that has been a work in progress for many appliances until now. Answers to questions about mixed workloads and scheduling tools are on the critical path in the product evaluation and a successful implementation.
Lou Agosta, Ph.D., is an independent industry analyst specializing in data warehousing, data mining, data quality, and related data management issues. He is the author of The Essential Guide to Data Warehousing (Prentice Hall PTR, 2000). You can contact the author at [email protected].