Top 10 Priorities for High-Performance Data Warehousing
Data management must achieve speed and scale, to support new data types and business requirements.
- By Philip Russom, Ph.D.
- April 16, 2013
Big data, real-time data, unstructured data, analytic data, machine data, new business expectations for data -- all these and more are making it increasingly difficult to achieve high performance with technologies for enterprise data warehousing (EDW) and the business intelligence (BI) components integrated with it, such as tools for analytics, reporting, and data integration.
The top 10 priorities for high-performance data warehousing (HiPer DW) are summarized below. Think of these priorities as recommendations, requirements, goals, tips, or rules that can guide user organizations into successful solutions for HiPer DW.
Priority #1: Enable new business practices based on high-performance BI/DW/DI and analytics
This is what HiPer DW is really about, and you're already doing it, if you have programs for operational BI (real-time data) and enterprise BI (with thousands of users and reports). Expect to apply HiPer DW options to more business practices as your organization moves deeper into business analytics (demanding workloads for queries, mining, statistics) and big data (scaling to massive, diverse datasets to discover new facts about the business).
Priority #2: Make real-time operation your first technology priority for HiPer DW
After all, collecting, processing, and delivering time-sensitive data is the key enabler of most of the new applications that businesses are clamoring for at the moment, such as operational BI, operational analytics, just-in-time inventory, facility monitoring, price optimization, workforce management, fraud detection, and mobile asset management -- just to name a few.
Priority #3: Make scalability your second technology priority
On the one hand, you have no choice but to keep pace with growth in data volumes, BI user communities, and burgeoning bodies of reports. On the other hand, tapping new data sources -- whether Web data, social media, or traditional enterprise applications -- can provide richer information for business programs, such as 360-degree views, sentiment analysis, operational efficiency, Web site visitor sessionization, and customer relationships beyond the usual channels.
Priority #4: Use hardware, but don't abuse it
There's no doubt that servers, networks, and storage are key components of any performance strategy, but excessively hurling hardware at performance problems raises the cost of high performance. It also dulls a team's optimization expertise for software and data. Balance your reliance on hardware with software optimization skills and well-performing designs for queries, reports, data models, ETL jobs, etc.
Priority #5: Select database platforms and analytic tools that are designed for high performance
There are many types to consider, including analytic DBMSs, columnar databases, appliances and other engineered systems, and Hadoop with MapReduce and other NoSQL databases. These will deliver performance gains out of the box with data structures and workloads that the tool or platform is designed for. Even so, you should still expect some development work in remodeling data and tweaking data processing to attain further gains.
Priority #6: Rely on specialized platform and tool functionality for certain performance gains
For example, in-database analytics assists the overall performance of non-query-based analyses by alleviating the need to move analytic data before analysis. In-memory processing decreases query response time and analytic rescore time by reducing disk I/O. Columnar data stores accelerate column-oriented queries by storing the values of each table column together in physical storage.
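To make the columnar idea concrete, here is a minimal Python sketch that contrasts a row-oriented layout with a column-oriented one. The sample orders, column names, and function names are all hypothetical, and the in-memory arrays only stand in for what a real columnar DBMS does on disk.

```python
from array import array

# Hypothetical sample data: (order_id, region, amount)
rows = [
    (1, "east", 120.0),
    (2, "west", 80.0),
    (3, "east", 50.0),
    (4, "north", 200.0),
]


def sum_amount_row_store(records):
    """Row layout: every whole record is visited to read one field."""
    return sum(record[2] for record in records)


# Column layout: the values of each column are kept together.
columns = {
    "order_id": array("q", (r[0] for r in rows)),
    "region": [r[1] for r in rows],
    "amount": array("d", (r[2] for r in rows)),
}


def sum_amount_column_store(cols):
    """Column layout: only the 'amount' column is read."""
    return sum(cols["amount"])


row_total = sum_amount_row_store(rows)
col_total = sum_amount_column_store(columns)
print(row_total, col_total)
```

Both functions return the same total. The difference is how much data must be touched: the column version reads one contiguous array and never looks at order IDs or regions, which is the effect that benefits queries aggregating a few columns over many rows.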
Priority #7: Consider the many new architectures that boost performance
If your EDW is still on an SMP platform, make migration to MPP a priority. Consider distributing your data warehouse architecture, largely to offload a workload to a standalone platform that performs well with that workload. When possible, take analytic algorithms to the data instead of data to the algorithm (as is the DW tradition); this new paradigm is seen with in-database analytics, Hadoop with MapReduce layered over it, and gate-array processing in some storage platforms and appliances. Hadoop and MapReduce are an alternate MPP architecture that fits some analytic workloads quite well.
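The idea of taking the algorithm to the data can be illustrated with a small scatter-gather sketch in Python. It assumes three hypothetical data partitions and uses threads to stand in for MPP nodes; the partition contents and the function names local_aggregate and merge are illustrative only, not the API of any particular platform.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions, each standing in for the data held by one node.
partitions = [
    [("east", 120.0), ("west", 80.0)],
    [("east", 50.0), ("north", 200.0)],
    [("west", 30.0)],
]


def local_aggregate(partition):
    """Runs where the data lives and returns only a small summary."""
    totals = {}
    for region, amount in partition:
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def merge(partials):
    """Combines the per-partition summaries into the final result."""
    combined = {}
    for partial in partials:
        for region, subtotal in partial.items():
            combined[region] = combined.get(region, 0.0) + subtotal
    return combined


# Threads simulate nodes here; a real MPP system would use separate servers.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partials = list(pool.map(local_aggregate, partitions))

totals = merge(partials)
print(totals)
```

Only the small per-partition summaries travel to the coordinator, not the raw rows. That is the same reason in-database analytics and MapReduce-style processing can reduce data movement for suitable workloads.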
Priority #8: Keep your performance optimization skills sharp and current
Ample hardware and fast, scalable software can automatically provide good performance for many situations. However, it's inevitable that some queries, reports, data models, ETL logic, and analytic algorithms will need tweaking and tuning before achieving the desired performance level. Maintain your optimization skills, especially for SQL tuning and data model tweaking. Get optimization training, if needed.
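As one small illustration of SQL tuning, the sketch below uses Python's built-in sqlite3 module to compare a query plan before and after adding an index. SQLite is only a convenient stand-in for a warehouse DBMS, and the sales table, its columns, and the index name are hypothetical; plan output and tuning options differ across platforms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 50.0), ("north", 200.0)],
)

query = "SELECT SUM(amount) FROM sales WHERE region = ?"


def plan(connection, sql, params):
    """Returns the human-readable detail of each query plan step."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(conn, query, ("east",))

# Tuning step: index the filter column, including the aggregated column.
conn.execute("CREATE INDEX idx_sales_region ON sales (region, amount)")

plan_after = plan(conn, query, ("east",))
east_total = conn.execute(query, ("east",)).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("east total:", east_total)
```

The query text and its result are unchanged; only the access path differs, moving from a full table scan to an index search. Checking the plan before and after a change is a habit that carries over to most SQL platforms, even though the commands and output vary.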
Priority #9: Design and develop with high performance in mind
Most teams have standards for the look and feel of reports, approaches to modeling warehouse data, preferred interfaces for specific data sources or targets, the style of hand-written code, and so on. Whoever determines and enforces these standards should ensure that the standards also foster high performance. Of course, the performance of any new development work should be tested during peer review and quality assurance processes.
Priority #10: Develop and apply a technology strategy for HiPer DW
Evident in this Top 10 Priorities list is the fact that no single approach to attaining HiPer DW is adequate for all situations. Therefore, a successful technology strategy for the high performance of BI/DW/DI and analytics should tap into and balance four approaches:
- Up-to-date hardware platform components, especially CPUs, memory, and storage
- Enterprise software platforms and tools designed specifically for demanding applications in data warehousing and analytics
- Technical users' global architectures for data and team standards for BI development, especially when governing data models, SQL coding, ETL logic, and analytic algorithms
- Tactical tweaking and tuning on the local level, as required by reports, data structures, analytic algorithms, or deficient tools and platforms
For a much more detailed discussion of these and related issues, read the 2012 TDWI Best Practices Report, High-Performance Data Warehousing, from which the above list is excerpted. The report is available via a free download at http://tdwi.org/research/2012/10/tdwi-best-practices-report-high-performance-data-warehousing.aspx.