Data Warehousing on a Shoestring Budget (Part 1 of 3)
You can implement data warehouse solutions on a small budget by focusing on system, database, ETL, and reporting technologies that work in concert with requirements gathering, development, testing, and training. In the first part of our series, we look at how to keep technology costs low.
by Nathan Rawling
Do you believe that data warehouse (DW) projects must be multi-million dollar undertakings? Don't. As this series will demonstrate, any organization, regardless of size, can benefit from data warehousing technologies, even when working with a limited budget.
Although seemingly difficult, you can make choices, which allow for the beneficial realization of data warehousing while also minimizing costs. By balancing technology and carefully positioning your business, your organization can quickly create cost-effective solutions using data warehousing technologies.
There are a few simple rules to help you develop a data warehouse on a shoestring budget:
- Use what you have
- Use what you know
- Use what is free
- Buy only what you have to
- Think small and build in phases
- Use each phase to finance or justify the remainder of the projects
In this series we explore the traditional sources of data warehousing costs as well as how your company can best position itself to properly execute data warehousing projects, regardless of budget. By fully understanding these issues, your organization will be better equipped to achieve your profitability goals.
Traditional Data Warehouse Costs
Typical data warehouse project costs derive from two sources: technology and staffing. The balance between these costs varies by a project's environment, but every project includes both. Technology costs often include system hardware, database-related software, and reporting software. Staffing costs address the individuals who gather requirements as well as those who model, develop, and maintain the data warehouse.
In reviewing the marketplace for data warehousing hardware and software, organizations can easily "break the bank" when procuring these technologies. As tempting as it may be to seek out a large SAN with clustered high-availability database servers and farms of Web servers to support a data warehouse, it is not always appropriate or necessary. It is quite possible, and sometimes far more effective, to build a data warehouse solution on relatively conservative hardware.
Further, although there are a number of outstanding software solutions that address each step in the data flow, it is critical to first find the right toolset before selecting an expensive or feature-rich product. A common organizational mistake in data warehousing is to ignore the currently owned technology and software capabilities and move ahead quickly, if imprudently, to purchase a different product. An organization often has the toolset in place to effectively meet its data warehousing demands but simply needs the right partner to fully implement its current technology.
It is not uncommon for a large data warehouse installation to employ a SAN with high-performance, fibre-channel communication for primary storage. If your organization has a SAN in place, utilizing it for data warehouse storage may, in fact, be a logical choice. However, it is not necessary nor is it often most beneficial for an organization to build an SAN solely to support a data warehouse. The benefits of SAN architecture for an organization are often more recognizable when a single storage system can be used to support a significant number of diverse systems rather than a single data warehouse or project. Networked-attached storage (NAS) creates a similar architecture to a SAN. Yet, instead of employing fibre-channel connections, which appear to the server as local disk, the server communicates to the disk via network communications protocols, such as Server Message Block Protocol (SMB) or Network File System (NFS). This architecture can pose challenges in data warehousing applications, notably the database systems and extract, transform, and load (ETL) software may cause performance bottlenecks. Ethernet communications to the storage device can cause considerable slowdowns that adversely affect overall system performance. If an organization has NAS storage available -- a relatively inexpensive option -- it can still build data warehouses successfully through careful network management and database tuning.
Locally attached storage, with or without hardware RAID (redundant array of independent disks), is among the most inexpensive and attractive storage options because of its outstanding performance. In recent years, as disk storage prices have declined, disk storage availability in each server has increased. Although generally advantageous to separate operating system files from database and application-related files, an entire data warehouse system can be run adequately and efficiently on a single drive. Data warehouses tend to be disk intensive, therefore the addition of an extra drive can provide supplemental storage for database files, application files, and temporary storage while simultaneously improving overall performance.
Currently, the database and reporting server architecture of choice is fault-tolerant clustering servers with automatic load-balancing and failover. Unfortunately, while such an environment is highly desirable for maximum performance and uptime, it is also extremely expensive and can be difficult to maintain. Although it can be rewarding to design and work within such an environment, this type of implementation is an unlikely choice for the majority of organizations, but especially for those seeking to successfully complete a data warehousing project on a tight budget.
With fault-tolerant clustering servers an unrealistic option for most, a strong middle-ground option is a server-per-service architecture. This environment performs very well, allowing organizations to independently scale each application while simultaneously keeping costs down. A typical data warehouse and business intelligence environment might require two or three servers in this type of architecture: a database server, a reporting application server, and perhaps an additional Web server. This approach reduces the cost and still provides many of the benefits available in the higher-availability solutions. For example, in the case of a failure, it is possible to restore services on the remaining systems. Further, troubleshooting is typically easier than in a single system design and fits better into security models, which require separation of Web and database services.
In a pinch, however, it is possible to build an entire data warehouse and business intelligence solution on a single server, with or without virtualization. At the outset, this option may appear to save your organization valuable funds because only one server must be purchased. Yet, be mindful -- this singular purchase demands that the server selected must be requisitely powerful. It must handle the load of all processes: database, ETL, and reporting. This architecture can yield acceptable performance, but it can also be quite difficult to troubleshoot. Further, the environment is also difficult to integrate into organizations with well-developed security policies, which often require strict separation of database and Web servers.
Data appliances are receiving tremendous attention and are gaining market share because they can provide solutions for midsize-to-large volume data warehouse markets by offering low-cost performance -- most commonly on data volumes in the terabyte to petabyte range. ISA Consulting has recognized a significant increase in the number of clients -- traditionally, a conservative crowd -- utilizing DW appliances.
One of the benefits of DW appliances is the lower total cost of ownership (TCO) compared to traditional methods. The total cost of ownership of a data warehouse consists of initial entry costs, maintenance costs, and the cost of increasing capacity as the data warehouse grows. DW appliances offer low entry and low maintenance costs with initial costs ranging from $10,000 to $150,000 per terabyte, depending upon the size of the DW appliance installed. The resource cost for monitoring and tuning the data warehouse is a significant aspect of the TCO, often more than 50 percent.
DW appliances reduce administration for operations, setup and integration, and many also offer lower costs for expanding processing power and capacity. With the increased focus on controlling costs, combined with tightening IT budgets, data warehouse managers must reduce and manage expenses while leveraging their technology for full maximization. These demands make DW appliances, notably Oracle, Netezza, and Greenplum, a natural solution.
An interesting trade-off with hardware is that often one of the least burdensome components of a data warehouse is upgrades or replacements. Upgrading disk storage, adding memory or processors to a server, or simply installing additional servers to an environment rarely cause as much work as fully changing a database or reporting platform. This is not an excuse to intentionally under-power the hardware environment -- a situation likely to cause complications. Rather, this should temper the tendency to overbuild system environments for data warehouses. If cost is a driving force, it is important to recognize that real and significant savings can be achieved simply by purchasing the hardware needed today rather than the hardware that might be needed next year.
Although selecting an operating system (OS) and selecting hardware should be considered concurrently, the OS is nonetheless true that it is a software component. Many systems integration professionals have a personal preference regarding the ideal operating system for data warehousing and business intelligence. The reality, however, is that almost any common platform can be successful for all but the most extensive data warehouses.
The best way to maximize a limited budget for operating systems is to use the system your organization is most familiar with and select the minimal feature-set necessary to support the additional applications within your environment. Branching out into an entirely new platform for your data warehouse implementation will increase expenses for migration tasks such as staff training.
The database is the heart of the data warehouse. Similar to the choice of an operating system, professionals tend to advocate a particular vendor when an organization must choose a database. Not every data warehouse can run on any/every database platform, yet all major database platforms are sufficient for the majority of data warehouses. If your organization has a well-developed relationship with a particular vendor, maximize the opportunities afforded by that relationship can help to minimize technology, training, and implementation costs. In fact, many of the major database vendors offer different levels of software with additional features at various price points.
Carefully examine and review exactly which features you need prior to ordering the most expensive version; in some cases, it is quite possible to simply purchase the additional features you need without buying a product bundle.
All major database vendors offer ETL and reporting products and your organization can and should take advantage of product bundles whenever possible. If a database product you own comes with integrated ETL software, it makes little sense to buy an additional ETL product until your organization is confident that its current software will not successfully meet demands. The same rule can be applied to purchasing reporting tools, especially since the market for these products continues to be very competitive.
Another typical DW expense is for database modeling software; many include sophisticated features, including cross-platform support, version control, and reverse/ forward engineering of database objects. In the modeling of a large transactional system or a complex data warehouse, the advantages of such tools are evident, because they can easily manage large amounts of metadata. Unfortunately, such features come at a dear price. It is critical that the software products selected do meet your organizational needs: unintelligible database model diagrams can slow development. There are free and inexpensive commercial modeling software products which save valuable resources, such as Computer Associates' ERWin and Embarcadero's ER/Studio. Both are outstanding tools. However, Microsoft Visio Professional is quite capable of producing effective star and snowflake schemas at a quarter of the cost.
To process data from source systems into the data warehouse, organizations must use ETL. There are specialized products to simplify development, ranging from simple automation tools to full-fledged suites of applications, with extraordinary capabilities. These specialized tools, however, have expensive price tags that vary from tens of thousands to millions of dollars, depending on the number and type of servers. There are four ways in which an organization can potentially avoid paying costly licensing fees:
- Do not specifically use a tool. Instead, use the procedural code built in to your database for ETL development. Keep in mind, though, that working to manually code ETL logic in a procedural language may take far longer than developing ETL by simply using one tool within the bundle.
- If you are able, use a tool already bundled with your database. Microsoft SQL Server 2005 comes with SSIS and SSAS; Oracle Database versions 10g and later come with Oracle Warehouse Builder, There are other bundles available today; it is simply a matter of fully investigating all of the options.
- Check to see if a department in your organization has previously licensed an ETL product, and if your project could use those licenses or buy additional licenses at a reduced cost.
- Some organizations are building internal "centers of excellence" that offer free or low-cost infrastructure, licenses, and data warehousing expertise to the organization. Many times these groups are not well-known throughout the organization, but can be extremely helpful. Before making a substantial new investment in DW technologies and staff, make sure you check inside your organization for such a group.
If circumstances are such that you need to purchase an ETL tool, make sure to look at ETL tools such as Oracle Warehouse Builder (OWB), Cognos 8 Data Manager, and Microsoft Integration Services, all of which seek to challenge the market leaders, including Ab Initio, IBM DataStage, and Informatica PowerCenter. The latter tools are excellent "workhorses" for enterprise data integration, but your organization may not need to begin with one of these tools for a small project.
Presumably, once data is incorporated into the data warehouse, users in your organization will want to access and utilize it. Such a goal implies that a reporting software package might be required. Similar to the specialized ETL tools, it is possible to spend a significant portion of your organization's budget on the most highly functioning reporting solutions. Still, the options for reporting software on a limited budget remain much the same:
- Buy a commercial product
- Use a reporting tool bundled with the database
- Build the reporting tool from scratch
If complex, formatted reports are needed, however, it may be necessary to purchase an advanced commercial package. If the reporting needs are more modest, a less expensive solution may suffice.
When considering software and vital tools such as ETL or a reporting system, organizations must consider the timing and duration of the data warehousing project. The complexity of the project should dictate and justify the purchase price of a given tool. The most important points for a data warehousing project on a limited budget are assessing an organization's technological needs, understanding the intricacy and density of the project, and making informed, cost-effective decisions regarding servers and software.
Organizations must be prepared to balance the cost savings of purchasing less-expensive tools by accepting the increased development effort and staffing requirements associated with a data warehousing project. In the next part of this series, the human costs derived from a data warehousing project will be examined. We will offer advice for how employees and a budget can be maximized to successfully complete this initiative.
- - -
Nathan Rawling is a data integration/business intelligence consultant with ISA Consulting, a systems integration firm. You can contact him at firstname.lastname@example.org.