LESSON - Petabyte Data Warehouses
By Bill Nee, Senior Director, Oracle Corporation
The world’s awash in information. It’s been estimated that the amount of global data is increasing by about 30 percent per year, with 2006 expected to see a staggering 11 exabytes of total information produced.
Organized information held in individual databases is growing just as fast, and is closing in on the previously mythical one-petabyte (1,125,899,906,842,624 bytes) threshold.
The good news is that vendor databases have grown steadily in sophistication as they’ve had to respond to the challenges of supporting multi-terabyte databases. But even as technology evolves to support petabyte data warehouses, IT managers will have some special considerations as they meet this challenge.
Foremost among these is scalability. IT managers must give careful thought to the database being deployed, since even metadata files will be huge. Can the vendor provide proof that it can support very large databases? Does it offer tools that will make management easier? Are data mining and business intelligence tools available that will allow users to easily analyze massive amounts of information? Does the vendor offer parallelism features that will improve performance?
Deploying large databases on SMP (symmetric multiprocessing) or proprietary MPP (massively parallel processing) machines has been, and will continue to be, a popular way of supporting large databases. But the relatively recent option of deploying clusters of commodity servers offers another alternative. A cluster comprises multiple interconnected servers that appear to users and applications as one. The combined processing power of the multiple servers provides greater throughput and scalability than would be available from a single server.
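To make the throughput idea concrete, here is a minimal Python sketch of how a query might be fanned out across the servers in a cluster and the partial results combined. The node names and the scatter/gather helpers are purely illustrative assumptions, not any particular vendor's clustering API; in a real cluster each worker would be a separate machine scanning its own disks.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster of commodity servers; in a real deployment each entry
# would be a connection to a separate machine, not an in-process worker.
NODES = ["node1", "node2", "node3", "node4"]

def scan_slice(node, predicate, rows):
    """Each node scans only its local slice of the data (illustrative stand-in
    for the work a real cluster node would do against its own storage)."""
    return [r for r in rows if predicate(r)]

def parallel_query(data_slices, predicate):
    """Fan the scan out to every node, then gather and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        futures = [pool.submit(scan_slice, node, predicate, rows)
                   for node, rows in zip(NODES, data_slices)]
        results = []
        for f in futures:
            results.extend(f.result())
    return results

# Example: four slices of order rows, one per node; find the large orders.
slices = [[("order", i, i * 100) for i in range(n * 10, n * 10 + 10)] for n in range(4)]
print(parallel_query(slices, lambda row: row[2] > 3000))

Because every node works on its own slice at the same time, adding servers adds scan throughput, which is the scalability property the article describes.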
Clusters are inherently high-availability systems, since they provide server redundancy by definition. These systems are also typically less expensive to own and operate, since they consist of an array of cheaper commodity servers. And they are effective: companies have deployed clusters of 64 and even 128 CPUs that have proved capable of handling any current commercial data warehousing workload.
Even with clusters, the cost of amassing the storage and hardware necessary to support a petabyte database will remain high. Though the price of disk storage is decreasing, it’s still advisable to use different storage media based on usage. High-speed disk could be used for data that is required on a regular basis; slower, less expensive storage for data that is only occasionally needed; and tape for data that is rarely accessed. Customers should choose a database that can map to both disk and tape and view all media as part of a single database.
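As a rough illustration of that kind of usage-based placement, a simple policy might route each table or file to a medium based on how recently it was accessed. The tier names and age thresholds in this Python sketch are assumptions chosen for the example, not a product feature.

from datetime import datetime, timedelta

# Assumed tiers and age thresholds -- purely illustrative values.
TIERS = [
    (timedelta(days=30),  "high-speed disk"),   # data needed on a regular basis
    (timedelta(days=365), "low-cost disk"),     # data only occasionally needed
]

def choose_tier(last_accessed, now=None):
    """Pick the cheapest medium whose age threshold the data still falls under;
    anything older than every threshold goes to tape."""
    now = now or datetime.now()
    age = now - last_accessed
    for threshold, medium in TIERS:
        if age <= threshold:
            return medium
    return "tape"

# Example: a table last touched two years ago lands on tape.
print(choose_tier(datetime.now() - timedelta(days=730)))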
High storage costs also underscore the critical need for database partitioning, a powerful feature that allows customers to organize their database into smaller, independent “partitions.” Partitioning can keep costs down by allowing a customer to implement a protocol that continually pushes older, unused data onto cheaper storage. It can also ensure data availability and dramatically improve query performance, which are both extremely important when dealing with petabyte-sized data warehouses.
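To show why partitioning helps both cost and query speed, here is a small Python sketch of a date-partitioned fact table. The table layout, partition names, and cutoff date are hypothetical: queries touch only the partitions their date range covers (often called partition pruning), and partitions older than the cutoff can be rolled off to cheaper storage.

from datetime import date

# Hypothetical monthly partitions of a sales fact table: {first_day: rows}.
partitions = {
    date(2005, 11, 1): ["nov sales rows..."],
    date(2005, 12, 1): ["dec sales rows..."],
    date(2006, 1, 1):  ["jan sales rows..."],
}

def query(start, end):
    """Partition pruning: scan only the partitions that fall in the requested range."""
    return [row for first_day in sorted(partitions)
            if start <= first_day <= end
            for row in partitions[first_day]]

def roll_off(cutoff, cheap_storage):
    """Move partitions older than the cutoff onto cheaper storage."""
    for first_day in [d for d in partitions if d < cutoff]:
        cheap_storage[first_day] = partitions.pop(first_day)

tape = {}
roll_off(date(2006, 1, 1), tape)                    # Nov and Dec move to cheap storage
print(query(date(2006, 1, 1), date(2006, 1, 31)))   # only the Jan partition is scanned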
An additional way for customers to lower the cost of their deployment is to utilize open source technology. Unfortunately, open source databases have not yet evolved to the point where they are viable options for most companies. On the other hand, many organizations are using Linux, and a large infrastructure of hardware, software, and services vendors now support it.
The performance of these systems has been very good. As of this writing, a Linux database cluster holds the world record for the TPC-H (data warehousing) 300 gigabyte benchmark, and as one would expect, these systems hold a number of TPC-H price/performance records.
Commercial petabyte databases are now on the horizon, and they’ll certainly offer their own unique challenges to customers. But the great thing about our industry has been its ability to meet technical challenges through innovation. With that in mind, it’s assured that customers will be able to cross the petabyte boundary just as they did the terabyte mark.