Database Technology and Open Source: Keys to Greener Data Warehouses
Three key areas where new solutions for data warehousing are making a big difference.
by Miriam Tuerk and Rick Abbott
As a growing number of businesses seek ways to make their operations greener, IT infrastructure -- particularly infrastructure designed to store and analyze data -- is coming under greater scrutiny.
The data warehouse is a textbook example. What originated from a need to concentrate critical corporate data into one easy-to-find and easy-to-analyze location is today fueling IT's version of suburban sprawl -- with more users seeking access to ever-increasing volumes of data, more servers and more power are required, along with more storage space and more IT resources. In fact, a recent article on data center overload in The New York Times Magazine stated that "the number of servers in the United States nearly quintupled from 1997 to 2007" (see Note 1).
However, "more" comes with a cost to both businesses and the environment.
61 Billion Kilowatt-hours of Data
Data warehouses, and the data centers that house them, require an enormous amount of power, both to run legions of servers and to cool them. According to U.S. News & World Report, "nationally, data center electricity use has more than doubled between 2000 and 2006, and is expected to double again by 2011" (see Note 2). In 2007, the Environmental Protection Agency revealed that data centers account for 61 billion kilowatt-hours of electricity annually and cost $4.5 billion a year, which, also according to the report, "is about 1.5 percent of the country's total electricity consumption, or enough to power 5.8 million households." If the growth predictions hold, that figure would reach the equivalent of nearly 12 million households by 2011.
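The household projection is simple arithmetic on the quoted figures, and it can be checked with a short sketch. The constants below come from the text; the growth factor of exactly 2 is an assumption based on the phrase "double again," and the function names are illustrative only.

```python
# Figures quoted in the article from the 2007 EPA report.
ANNUAL_KWH = 61e9               # kilowatt-hours used by U.S. data centers per year
ANNUAL_COST_USD = 4.5e9         # annual electricity cost in dollars
HOUSEHOLD_EQUIVALENT = 5.8e6    # households that much electricity could power

# Assumption: "expected to double again by 2011" is read as a factor of exactly 2.
GROWTH_FACTOR = 2.0


def projected_households(households: float, growth_factor: float) -> float:
    """Scale the household equivalent by the assumed growth factor."""
    return households * growth_factor


def kwh_per_household(total_kwh: float, households: float) -> float:
    """Annual consumption per household implied by the two quoted figures."""
    return total_kwh / households


def cost_per_kwh(total_cost: float, total_kwh: float) -> float:
    """Average price per kilowatt-hour implied by the two quoted figures."""
    return total_cost / total_kwh


if __name__ == "__main__":
    households = projected_households(HOUSEHOLD_EQUIVALENT, GROWTH_FACTOR)
    print(f"{households / 1e6:.1f} million households")
    print(f"{kwh_per_household(ANNUAL_KWH, HOUSEHOLD_EQUIVALENT):,.0f} kWh per household per year")
    print(f"${cost_per_kwh(ANNUAL_COST_USD, ANNUAL_KWH):.3f} per kWh")
```

Doubling 5.8 million gives 11.6 million households, which is where "nearly 12 million" comes from. The same figures imply roughly 10,500 kilowatt-hours per household per year and an average price of a little over 7 cents per kilowatt-hour.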
For today's businesses, finding greener ways to store and analyze data will benefit the environment and the bottom line. To reap these benefits, organizations must look beyond traditional data warehousing approaches and embrace a new generation of technologies to create leaner, more efficient, and more cost-effective solutions.
Traditional Data Warehousing: Costly, Inefficient, and Massive
For several years, traditional data warehousing approaches have changed little. The market is dominated by proprietary, appliance, and hardware-centric solutions that need to be custom configured to address specific data management tasks, such as the analysis of financial transactions or call center data. Because of the significant IT resources involved in setting up and maintaining these data warehouses, the sales cycles are long, the costs are high, and the resulting systems are inflexible.
For example, it's not uncommon for a large organization (such as a telecommunications provider) to have 20-30 database administrators on staff just to handle requests from business users who want to run analyses different from the canned reports they are permitted to see. Because changes to traditional data warehouse configurations require a fair amount of effort, obtaining real-time analysis is challenging even when information requirements are relatively static. In highly dynamic business environments, it's nearly impossible.
Then there's the hardware issue. Due to a combination of drivers -- including the proliferation of online business transactions, regulatory requirements governing information storage, and ever-growing demands for in-depth analytics -- the data management needs of most organizations have exploded in recent years. Data volumes in the average data warehouse are increasing by gigabytes every day, and the traditional response has been to throw hardware at the problem, including more powerful servers and larger storage arrays. The result is a massive infrastructure footprint, with massive space, resource, maintenance, and energy requirements.
The IT industry has begun to address the problem of energy consumption in the data center with a variety of approaches, including using more efficient cooling systems, blade servers, storage area networks, and virtualization. Organizations also need to start looking at how to minimize the amount of space and resources that their data take up in the first place -- a challenge that is beginning to be tackled with the help of technology advancements.
Data Warehousing Taps into New Technology and Open Source
How does data warehousing address the challenges involved in "going green"? The answer is a combination of new database technology designed for analysis of massive quantities of data with open source software that leverages low-cost, energy-efficient commodity software and hardware. Together they reduce the need for expensive hardware infrastructure and the energy required to power it.
Let's take a closer look at the three key areas where new solutions for data warehousing are making a big difference.
Key #1: Reduce Hardware Footprint
In recent years, column-oriented databases have been noted by many analysts as the preferred architecture for high-volume analytics. A column-oriented database stores data column by column instead of row by row. There are many advantages to column orientation for analytics. Most analytic queries involve only a subset of a table's columns, so a column-oriented database retrieves only the data that is required, speeding queries and reducing disk I/O and computing resources.
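The difference between the two layouts can be sketched in a few lines. The table, its column names, and its values below are hypothetical, and counting "values touched" is only a rough stand-in for disk I/O, not a model of any particular product.

```python
from typing import Dict, List, Tuple

# Hypothetical call-record table; names and values are illustrative only.
COLUMNS = ("call_id", "region", "duration_sec", "charge_usd")
ROWS: List[Tuple] = [
    (1, "east", 120, 1.20),
    (2, "west", 45, 0.45),
    (3, "east", 300, 3.00),
    (4, "north", 60, 0.60),
]


def to_column_store(columns: Tuple[str, ...], rows: List[Tuple]) -> Dict[str, list]:
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def total_duration_row_store(rows: List[Tuple]) -> Tuple[int, int]:
    """Sum duration_sec by scanning whole rows; return (total, values touched)."""
    idx = COLUMNS.index("duration_sec")
    total = 0
    touched = 0
    for row in rows:
        touched += len(row)   # a row layout reads every field in the row
        total += row[idx]
    return total, touched


def total_duration_column_store(store: Dict[str, list]) -> Tuple[int, int]:
    """Sum duration_sec by reading only that column; return (total, values touched)."""
    column = store["duration_sec"]
    return sum(column), len(column)


if __name__ == "__main__":
    print(total_duration_row_store(ROWS))
    print(total_duration_column_store(to_column_store(COLUMNS, ROWS)))
```

Both functions return the same total of 525 seconds, but the row scan touches 16 values while the column scan touches only 4. The gap widens as tables gain more columns that a given query never uses.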
Furthermore, these databases enable efficient data compression because each column stores a single data type (as opposed to rows that typically contain several data types). This allows compression to be optimized for each data type, significantly reducing the amount of storage needed for the database. Column orientation also greatly accelerates query processing, which significantly increases the number of data warehouse transactions a server can process.
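To see why storing one data type per column helps compression, consider a minimal sketch using run-length encoding. The article does not name a specific compression scheme, so run-length encoding is chosen here purely for illustration, and the sample rows are hypothetical.

```python
from itertools import groupby
from typing import Any, List, Tuple

# Hypothetical rows of (call_id, region, duration_sec); values are illustrative.
ROWS = [
    (1, "east", 120),
    (2, "east", 45),
    (3, "east", 300),
    (4, "west", 60),
    (5, "west", 90),
]


def rle_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(pairs: List[Tuple[Any, int]]) -> List[Any]:
    """Expand (value, run_length) pairs back into the original sequence."""
    decoded: List[Any] = []
    for value, count in pairs:
        decoded.extend([value] * count)
    return decoded


# Column layout: all values of one column sit next to each other.
region_column = [row[1] for row in ROWS]
# Row layout: values from different columns are interleaved.
row_layout = [value for row in ROWS for value in row]

encoded_column = rle_encode(region_column)
encoded_rows = rle_encode(row_layout)

if __name__ == "__main__":
    print(len(region_column), "values ->", len(encoded_column), "runs")
    print(len(row_layout), "values ->", len(encoded_rows), "runs")
```

The five region values collapse into two runs, while the 15 interleaved row values produce 15 runs and gain nothing. Real products apply more sophisticated, type-specific encodings, but the principle is the same: similar values stored together compress better.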
There are a variety of column-oriented solutions on the market. Some explode and duplicate the data and require as large a hardware footprint as traditional row-based systems. Others, however, have combined the column basis with other technologies, eliminating the need for data duplication and massive hardware footprints. What this ultimately means is that users don't need as many servers or as much storage to analyze the same volume of data. In fact, these column-oriented databases can achieve compression ranging from 10:1 (a 10 TB database becomes a 1 TB database) to more than 40:1 depending on the data. With this level of compression, a distributed server environment can be reduced by a factor of 20 to 50 and brought down to a single box, significantly cutting heat, power consumption, and carbon emissions.
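The storage arithmetic behind those ratios is straightforward. In the sketch below, the function name is illustrative, and 40 is used as a stand-in for the "more than 40:1" end of the range.

```python
def compressed_size_tb(raw_tb: float, ratio: float) -> float:
    """Size after compression at ratio:1 (for example, ratio=10 means 10:1)."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return raw_tb / ratio


if __name__ == "__main__":
    raw_tb = 10.0  # the 10 TB database from the article's example
    for ratio in (10, 40):  # the two ratios quoted in the article
        print(f"{ratio}:1 -> {compressed_size_tb(raw_tb, ratio):.2f} TB")
```

At 10:1 the 10 TB database occupies 1 TB, as stated in the text. At 40:1 it occupies 0.25 TB.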
Open source products, specifically designed to serve a broad community of users, take this a step further as they do not require proprietary hardware or specialized appliances. This offers open source users the ability to leverage simple, lower-cost commodity servers and reduce their hardware footprint. Open source software such as Linux can also extend the life of hardware components by allowing older servers to be seamlessly integrated into a single virtual machine. This keeps older servers out of landfills and reduces the demand for new machines to be built.
Key #2: Reduce Deployment Resources
New database technology and open source applications also enable simpler, "do-it-yourself" testing and deployment models, greatly reducing the resources involved in getting a data warehousing solution up and running. Consider the resource requirements potentially involved in the global acquisition and deployment of a traditional, proprietary solution: a lengthy product evaluation process will likely be followed by on-site visits from the vendor to set up and configure hardware and equipment.
The costs -- from both an environmental and bottom-line perspective -- include travel (plane trips, car rentals, hotel accommodations, etc.), hardware (multiple servers, cooling equipment, connectors, etc.) as well as personnel (a full team of experts may be required to customize the solution).
With new open source technologies, software can be downloaded online, and it's designed to be easy to install, so that one person can handle set-up and deployment. Support needs are also less involved, so that they can be handled via conference call rather than through more costly (and carbon-consuming) in-person travel.
Key #3: Reduce Ongoing Management and Maintenance
Because traditional data warehousing solutions are generally built to handle specific types of queries, they are not particularly well suited for environments in which data management needs are constantly changing and real-time analysis is critical. Retrofitting these solutions to handle ad hoc queries requires an enormous amount of manual fine-tuning and results in a huge drain on IT resources. (Recall the example of the telecommunications provider with 20+ database administrators.)
For instance, trying to run a set of complex analytic queries alongside constantly changing schemas on a warehouse designed to store a large volume of IT-oriented log data is like trying to use a dictionary to find a map. It involves a complete reconfiguration of the underlying data structures, requiring database designers to create indexes and data partitions. (Indexing and partitioning also increase data size, in some cases by a factor of two or more.)
In contrast, some new analytic database products eliminate these manual and ongoing efforts to provide a more "Google-like" experience, so that users can easily leverage them to answer many types of questions. This level of flexibility presents the potential to reduce ongoing maintenance and operational support by as much as 90 percent. A business that needs to build only one solution that many people can use makes better use of staff, time, and financial investments. In addition, greater data analysis productivity means less hardware can be used without sacrificing performance.
Data Warehousing for the Green Generation
Cost, image, and regulatory concerns are compelling more businesses to explore how they can make their operations and their IT infrastructure greener. At the same time, many of these same organizations are struggling to keep up with overwhelming data access and management requirements that are burning through energy and resources.
The combination of new database technology with open source applications solves both of these challenges. Rather than expanding and duplicating data, the combination compresses and reduces data. Rather than requiring specialized hardware and complex, multi-machine configurations, it runs on standard, off-the-shelf, lower-cost systems. Rather than involving proprietary software and requiring experienced staff, it offers an easier-to-implement, easier-to-use approach that reduces operational effort and cost.
The end result is a leaner, greener way to store and analyze information.
1: "Data Center Overload," Tom Vanderbilt, The New York Times Magazine, June 14, 2009
2: "The Internet's Hidden Energy Hogs: Data Servers," Kent Garber, U.S. News & World Report, March 24, 2009
Miriam Tuerk is the president and CEO of Infobright. You can reach Ms. Tuerk at email@example.com. Rick Abbott is president of 360DegreeView; you can contact him at firstname.lastname@example.org.