Q&A with the Experts
A business intelligence or data warehouse implementation can be a formidable undertaking. In these pages, leading business intelligence and data warehousing solution providers share their answers to the questions they hear often from industry professionals. Tim Feetham, an independent consultant, provides his analyst viewpoint to each Q&A.
Introducing performance management reporting to a diversely skilled workforce is incredibly expensive and time-consuming. What is the best strategy for driving adoption of new reporting applications across the enterprise?
Large user communities, as a whole, do not want to learn new technologies. They prefer to use the productivity applications to which they are accustomed. Therefore, in order to ensure that operational performance management solutions are widely used, we recommend deploying them through familiar means such as personalized, interactive Web applications and fully functional spreadsheets. This way, we immediately overcome the first objection to new technology. Instead, users say: “I knew how to use it when I opened it.”
In performance management, getting relevant feedback to the folks who can make a difference is essential. Organizations that undertake this task must also understand that what is relevant today will change tomorrow. These organizations face two issues: how to generate broad adoption and how to stay flexible. These needs point to Web-based reporting technologies that deliver reporting, visualization, and analysis tools, plus seamless spreadsheet integration under a unified but customizable interface. Users will be quick to adopt this technology, and support organizations will be able to focus more on tailoring the product to the pressing business issues at hand.
What is the secret to implementing a high-quality ETL solution?
One major reason data warehouse implementations fail to meet deadlines and expectations is inadequate design of the ETL processes. Today, the design process is a low-tech, high-cost exercise. To deliver a high-quality ETL process, organizations must obtain a thorough understanding of their sources, improve collaboration among team members to develop accurate source-to-target mapping documents, and gain a good understanding of their design before moving on to implementation. By addressing these areas, organizations can design an ETL process that will meet the needs of the business and deliver a trusted data foundation for business intelligence.
Ward Stroud sings a blues song called “Good Enough for Now.” That phrase sums up too many ETL efforts. Teams cut corners with hand coding and/or inexpensive tools along with lax upfront analysis and minimal structure. These organizations find themselves with unmanageable ETL processes after the first couple of iterations. Just when development cycles should be getting shorter, they get longer because developers are busy trying to maintain what they have already built. What was “good enough for now” now threatens the continued success of the data warehouse. The answer? Comprehensive upfront analysis, quality ETL technology, and sound production support procedures.
When is it best to use a dashboard versus a scorecard?
Scorecard applications are ideal for measuring and managing performance. Business users can easily manage performance through scorecards that have metrics with targets. These scorecards also ensure accountability by defining ownership of key performance indicators. The end result is that scorecards provide alignment throughout the organization around corporate objectives. Dashboards provide their own unique value proposition. Dashboards collect large volumes of data from various sources and present them in a single graphical view. Business users typically monitor a series of charts and graphs that deliver interactive exploration of the data. Scorecards and dashboards are often leveraged together to deliver a full spectrum of performance information.
Scorecarding is a management system of key metrics delivered to those persons whose actions affect them. These folks might work at any level, and these metrics might change over time. Therefore, the supporting applications need to be easily understood and easily modified. Some organizations rely on basic reporting tools. This may work for line workers, but managers need more. Effective scorecarding usually involves a summary set of metrics at each level, but when a metric indicates a problem, the scorecarding application should give the user the ability to drill down to supporting metrics. An interactive dashboard serves this purpose well.
What is the difference between Ingres and Postgres?
Ingres emerged circa 1974 under Michael Stonebraker and Eugene Wong at UC Berkeley. In 1980, Ingres and Oracle entered the commercial world as the two leading relational database management system (RDBMS) products. Ingres is the progenitor of other RDBMS products such as Informix, SQL Server, and Tandem’s NonStop, among others. Stonebraker later developed Postgres, a derivative of Ingres, as an object-relational DBMS (for unstructured datatypes). Postgres remained open source, while Ingres has had commercial engineering oversight since its commercial availability. Computer Associates bought Ingres in 1990, and it is the database within CA products. Ingres is ISO 9001 certified.
Michael Stonebraker released Postgres, an object-relational DBMS, to the open source community in 1986. It has since gained wide acceptance in that group. Ingres, which he helped create in 1974, made its reputation as a leading commercial RDBMS. Computer Associates purchased Ingres in 1990. CA made it its key database product and encouraged its customers to move to Ingres for Y2K. Last year, CA released Ingres into the open source community. Although both products are now in the open source community, Ingres has benefited from years of market discipline, making it an ideal engine for low-cost database appliance technology.
How do you sell an executive on a data quality initiative?
You might be surprised at how easily you can capture information to approach your executive about a data quality initiative. There are three pillars of justification:
- Identify how big the problem is (values out of range, incomplete records, bad formatting, incorrect product codes, etc.). A profiling tool can assist with gathering this data.
- The second pillar is represented by example records. People love to see their own data and are amazed at how corrupt it actually is.
- The third pillar encompasses anecdotes—stories from your customers or data consumers about negative impact on your operations. Testimonials are powerful.
Once you have statistics, example records, and anecdotes, you can then plan to approach your executive.
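The first pillar above can be quantified with even a small script. The sketch below is a minimal illustration of what a profiling tool automates, written in plain Python; the field names and validity rules (product-code format, quantity range) are hypothetical, chosen only to mirror the problem categories listed above:

```python
import re

# Hypothetical rules for illustration: valid product codes look like "AB-1234",
# and quantities must fall between 1 and 1000.
PRODUCT_CODE = re.compile(r"^[A-Z]{2}-\d{4}$")

def profile(records):
    """Tally the data quality problems described above for a list of dicts."""
    stats = {"incomplete": 0, "bad_code": 0, "out_of_range": 0, "total": len(records)}
    for rec in records:
        # Incomplete record: any field is missing or empty.
        if any(v in (None, "") for v in rec.values()):
            stats["incomplete"] += 1
        # Bad formatting / incorrect product code.
        if not PRODUCT_CODE.match(rec.get("product_code") or ""):
            stats["bad_code"] += 1
        # Value out of range.
        qty = rec.get("quantity")
        if qty is None or not (1 <= qty <= 1000):
            stats["out_of_range"] += 1
    return stats

sample = [
    {"product_code": "AB-1234", "quantity": 5},
    {"product_code": "bad!",    "quantity": 0},
    {"product_code": "CD-9999", "quantity": None},
]
print(profile(sample))
```

The counts it produces are exactly the statistics, and the offending rows the example records, that the three pillars call for.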
The team leaders responsible for data warehousing development and/or enterprise reporting are the natural champions of a data quality initiative. Their work will be judged on the reliability of the information they produce. These team leaders can usually find support throughout the organization. Rare is the experienced line manager who doesn’t have a tale or two related to poor data quality. However, most write it off as something they can’t do anything about. The team needs to document these stories, analyze their financial impacts, assure these managers that the situation can be changed, and develop scenarios for improving data quality.
What is business performance management?
Business performance management (BPM) is the mapping of the business agenda with sophisticated BI technologies to provide business users across the enterprise with the appropriate insight to make faster, more accurate decisions.
BPM is about understanding and aligning the goals that the enterprise (entire company) is trying to achieve, modeling different alternatives on how to go about achieving those goals, putting plans in place to go about executing the goals, monitoring the execution of those plans, analyzing the progress on those goals, reporting on that progress, and then starting all over again. This management cycle can only be supported with a BPM system that provides transparency and visibility (requirements and methods) securely across financial and operational data, regardless of source systems.
Business performance management is a practice that identifies, propagates, and reports on measurable goals that, when achieved, will have a positive outcome on the organization. The effective organization will identify goals at all levels. Measures should be identified based on leading indicators of business performance, again at all levels. Once identified, they need to be tracked and made readily available to the people who can affect them. Those people need to be accountable for their goals. Finally, these goals need to be analyzed over time to see if they really are effective, and if they are not, new goals should be selected.
IBM Information Integration Solutions
How do IBM® WebSphere® Information Integrator and the IBM® WebSphere® Data Integration Suite complement each other?
These products deliver complementary capabilities for the information integration market. WebSphere Information Integrator provides real-time, access-in-place technology, event-driven data movement for any data, and unstructured content, while WebSphere Data Integration Suite delivers high-speed, high-volume, parallel data movement, transformation, and data quality.
The two product sets already integrate today using service-oriented interfaces, providing complex data movement and transformation. Now, IBM is better able to leverage the strengths of the WebSphere Data Integration Suite and WebSphere Information Integrator to help customers achieve greater value, particularly in the areas of real-time integration requirements for master data management and business performance management.
When IBM bought Ascential, a market-leading ETL company, it faced several challenges. The first was to preserve Ascential’s customer base. The second was to integrate it with IBM’s other products. IBM wisely decided to brand it as the IBM WebSphere Data Integration Suite, making it database-neutral. IBM also has a history stretching back into the 1980s of technology centered on dynamic access to heterogeneous data sources. This technology is now marketed under the WebSphere Information Integrator name, and it fits well with the Data Integration Suite, especially for customers looking to implement enterprise information integration strategies.
How important is the data visualization component in the success of a BI dashboard initiative?
There are two key elements to a BI dashboard initiative:
- Consolidation of information from disparate data sources
- Providing insight (and business intelligence) to users presented with a large amount of data
Therefore, cutting-edge data visualization capabilities are at the heart of user adoption and the overall success of a dashboard initiative. The data visualization component must make the information intuitive, providing user interaction and making it easy to spot trends, exceptions, and interconnections between a multitude of variables. Visualization components must provide interactive visual intelligence within the dashboard.
Data visualization can serve two functions for BI dashboards. The first centers on presenting key metrics in a way that stands out in the sea of information clutter that we all deal with in our daily lives. A good user interface that presents these metrics needs to resonate with the user and quickly tip him or her off to any variances. The second function is a bit more of a challenge, but has a relatively high payoff. A dashboard that supports visualization for analyzing the underlying data will help produce a quicker understanding of the issues for more users than numbers alone would accomplish.
What are the differences between 32-bit and 64-bit PowerCenter for a given platform? What should I consider when choosing a performance-enhancing solution?
The primary advantage of 64-bit processing is the application’s ability to use more memory—virtually unlimited—as compared with the 32-bit environment. The basic principles encompassed within the general term “performance” include throughput, scalability, availability, and manageability. All four of these principles must be considered to ensure that you are selecting a platform that will grow with your business. The benefits of 64-bit processing include improved throughput and scalability, as well as a reduction in I/O bottlenecks due to large memory-addressing capability. Several platforms support 64-bit PowerCenter, including IBM AIX v5.3, HP-UX 11i on Itanium, and Solaris 10, with additional 64-bit platform support coming in the next major release of PowerCenter.
When ETL developers started tapping the potential of 32-bit architectures, they found they could improve data transformation performance dramatically by moving reference data into memory. These applications could address up to four gigabytes of memory, which meant that they could accommodate fairly large reference data sources. However, this only whetted these developers’ appetites for accommodating more types of data in memory. The 64-bit architectures now give these developers roughly four billion times the addressable memory that 32-bit architectures do. This means that products such as Informatica’s 64-bit PowerCenter now hold promise for eliminating many ETL performance problems associated with big data.
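The in-memory reference data technique described above can be sketched in a few lines. This is a plain-Python illustration of the general idea, not Informatica’s implementation; the table and field names are hypothetical:

```python
# A hypothetical reference table (e.g., a product dimension) cached in memory.
# A 32-bit process caps such a cache at roughly 4 GB of address space;
# a 64-bit process can hold vastly larger reference sets.
reference = {
    "P-1": {"category": "hardware", "margin": 0.30},
    "P-2": {"category": "software", "margin": 0.55},
}

def transform(rows):
    """Enrich fact rows with reference attributes via O(1) in-memory lookups,
    avoiding a database round trip for every incoming row."""
    out = []
    for row in rows:
        ref = reference.get(row["product_id"], {})
        out.append({**row, **ref})
    return out

facts = [{"product_id": "P-1", "amount": 100.0}]
print(transform(facts))
# → [{'product_id': 'P-1', 'amount': 100.0, 'category': 'hardware', 'margin': 0.3}]
```

The per-row cost drops from a query against the source system to a hash lookup, which is why larger addressable memory translates so directly into transformation throughput.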
More and more people are demanding real-time business information. How can I solve my data latency problems?
You can solve them with your business intelligence technology in two ways. First, your BI solution needs more than batch-loaded ETL. Products like Information Builders’ WebFOCUS offer trickle-fed ETL that provides real-time access. Second, you need direct access to transactional types of data sources. That implies the integration of robust middleware capabilities into your business intelligence solution. WebFOCUS’s built-in iWay middleware is designed for service-oriented architectures and feeds directly accessed, real-time information to any person or system in the operational environment.
The demand for real-time access to business information presents some real opportunities. Middleware has matured to the point where direct access to transaction data is quite manageable. The ability to analyze operational data in real time is a boon to applications such as customer relationship management. However, this doesn’t mean that organizations should substitute this capability for real-time data warehousing, where updates in an operational system result in adds to the data warehouse. The real-time data warehouse, which is designed for tactical and strategic analysis, when combined with real-time access to transaction data, maximizes the value of an organization’s data.
How can an IT department streamline administration and BI report creation?
By providing managed self-service to their user communities, IT departments can more effectively manage the reporting backlog and reduce the need for IT administration. Companies are selecting BI technology that provides managed self-service to users who need report development, reducing the IT department’s involvement when report creation requests overflow the queue.
Self-service BI reduces the time involved in report deployment if it does three things: intimately involves the users in report design; increases the amount of reusable metrics and report components; and allows the creation of one report with prompting and “drill anywhere” capability. MicroStrategy provides that balance between self-service and performance.
IT needs to understand that its customers will likely be quite varied. Some will want IT to do everything for them, and others will want to retain complete control. An effective IT department will stay flexible and: 1) promote training in the BI tools of choice for casual and power users; 2) encourage customers to identify power users who will write the bulk of their reports; 3) provide support for users in dealing with difficult requests plus any excess demand; and 4) establish a production environment in which power users (with the aid of support) can publish their reports.
Are real application clusters suitable for data warehousing?
Yes. Deploying a data warehouse on a “grid” of inexpensive, industry-standard commodity servers instead of a large symmetric multiprocessing (SMP) system has become increasingly popular and viable. Software and hardware vendors have devoted years of development into making these systems practical and reliable, and the technology has matured greatly. A data warehousing infrastructure built on RAC has significant advantages in that it provides enterprise-level scalability, performance, and availability at a reasonable cost. Scalability is achieved through incrementally adding additional low-cost servers to a cluster as demand grows. Data warehouse performance benchmarks have shown that RAC-based systems provide query performance equal to or better than that of SMP systems. High availability is provided by the innate failover capabilities of a cluster, where remaining servers automatically pick up the workload of a failed server.
In terms of basic capabilities regarding performance and cost/performance numbers, Oracle’s RAC implementations have done quite well on decision support benchmarks (tpc.org, TPC-H), but these numbers tell only part of the story. Oracle developed real application clusters (RAC) to let its customers accommodate growth in general application usage and data much more easily. In addition, Oracle’s architecture supports strong failover capabilities. As such, RAC does not appear to be targeted specifically for the data warehousing market. However, data warehouses are especially susceptible to rapid if not erratic growth, and as they grow in importance, failover capabilities become increasingly important.
PolyVista includes discovery mining as part of its analytical client application. What’s the difference between discovery mining and data mining?
The primary deliverable for data mining is the construction of a predictive model, whereas the primary deliverable in discovery mining is new business insight. While predictive models can certainly be useful, few in-house analysts will have the statistical wherewithal to complete this task. Discovery mining, on the other hand, helps guide business users to new and interesting data relationships or anomalies, allowing the business analyst to review and assess the value of the findings.
Our industry likes categories. We have OLAP, the analysis of data with known relationships, and data mining, the discovery of relationships between data. Sometimes we supplement OLAP with visualization software, but there has been little in the way of data mining visualization—mainly because there have been few tools to support it. PolyVista has addressed this with their discovery mining technology. Some firms will use PolyVista in place of building data mining models. However, there is another payoff potential for those who do build those models: Now they can visually check their intuition by discovering new patterns in data beforehand.
What role should analytics play in a modern BI strategy?
Despite recent advancements in both the power and simplicity of analytic technology, there remains a huge discrepancy in how organizations adopt and exploit that technology. Some organizations subscribe to an older philosophy where decision makers get reports and a handful of analysts use complex analytic tools and/or manual spreadsheets. Meanwhile, other organizations are deploying simple but powerful analytics deeper into their organizations—specifically into the decision-making ranks—to drive faster and higher-quality decisions than ever before. These progressive organizations employ a different approach, starting with, and staying focused on, simplicity and decision making rather than data and its countless complexities.
An organization that empowers its employees, suppliers, and customers to make decisions when questions first arise reduces costs and improves customer satisfaction dramatically. Analytics are applications that support such decision making at all organizational levels. These applications often combine operational data with information from the data warehouse. Organizations with large data warehousing programs, but with information systems that are limited to standard reporting and complex data analysis, are constraining the true potential of the data warehouse. Indeed, TDWI’s Business Intelligence Maturity Model (www.tdwi.org/poster) shows that the greatest ROI comes when analytical services are deployed in tandem with the enterprise data warehouse.
Data integration seems to have evolved a lot from the early ETL days. What is the impact on my BI projects, and more broadly on my entire information system?
It used to be a “simple” matter of running daily batch extractions to load the data warehouse. With the advent of real time, new and updated data now needs to be propagated almost instantly. Data integration has evolved significantly, and should now offer capabilities such as changed data capture for event detection, integration of data from asynchronous messaging systems (MOM), and incremental updates on targets.
Furthermore, the BI system is no longer the only consumer of data in the information system, and a data integration strategy should encompass not only the BI needs, but also the batch, asynchronous, and synchronous integration requirements of all systems and applications. Avoid the proliferation of tools and technologies to cover needs that are diverse but very complementary.
Ten years ago, BI developers counted themselves fortunate to get good-quality monthly snapshots of production data for their data warehouses. Today, however, business users are demanding real-time or low-latency information from the data warehouse. BI teams faced with this demand will likely see the need for new technologies that handle messaging and changed data capture. Source system technologies will dictate what those technologies will be, but most likely an organization will end up with several data integration products in addition to their ETL tool. Wise teams will look for vendors who can support these capabilities through one integrated product line.
How can I economically manage data explosion and accelerate end-user reporting requirements?
Most organizations are struggling to deal with data management and data growth. As data volumes explode, so does the cost of managing that data. As data warehouse, reporting, and business analytical application databases increase in size, most organizations’ only line of defense is to apply more processing power to deal with very large databases and increasing numbers of users. Analyzing large amounts of data for a high number of users has always been a formidable and expensive task. Although the “hard costs” of storage are decreasing, the “soft costs” associated with maintaining data are substantial. Using an analytical database engine that is designed specifically for reporting and analytics for many users, complex queries, and large data sets solves this problem—and the business pains that accompany it.
Disk storage is cheap nowadays, but given large numbers of users analyzing large amounts of data, BI teams will need to support expensive processing power. Even test data stored inefficiently may elongate development time. Although there are a number of ways of getting through big data, such as specialized indexing, parallel processing, and so on, each of these methods has shortcomings when supporting some types of queries. One of the more elegant approaches to this problem is to reduce the size of the data store itself. Sybase has done this through data management technology that is specifically designed for analysis and reporting.
Compliance with mandates like Sarbanes-Oxley and HIPAA is driving stored enterprise data growth to rates of up to 125 percent a year. Is there any way to reduce the volume of data in data warehouses?
One method of reducing the data loaded into data warehouses from transactional databases is to store only the data that has changed, a process called changed data capture or delta processing. There are several ways to do this effectively, including using Syncsort’s DMExpress to perform joins, or sort, merge, and summarize processes. In either case, the technique identifies which records are unchanged, which are new or updated, and which are deleted or obsolete, so that only new or changed records are loaded. Unchanged or obsolete records are dropped rather than being loaded as duplicates.
A surprising number of data warehousing teams are still relying on periodic snapshot data collection. Often, 95 percent of this data has not changed from month to month. However, snapshot processing does miss changes that occur between snapshots, and these changes are becoming critical to compliance reporting. So now we have the prospect of storing more snapshots, with even more redundancy, in order to provide a view into critical changes in the data. At this point, many teams will reconsider their basic strategies and redesign their databases to handle critical changed data while avoiding loading redundant data time and time again.
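Delta processing in general, leaving aside any specific tool such as DMExpress, can be sketched as a keyed comparison of two snapshots. The record layout and key field below are hypothetical, chosen only to show how a load can be trimmed to changed rows:

```python
def delta(previous, current, key="id"):
    """Classify records by comparing two snapshots keyed on `key`,
    so that only new or changed rows need to be loaded."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    added   = [r for k, r in curr.items() if k not in prev]
    deleted = [r for k, r in prev.items() if k not in curr]
    updated = [r for k, r in curr.items() if k in prev and r != prev[k]]
    return {"added": added, "updated": updated, "deleted": deleted}

old = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
new = [{"id": 1, "qty": 5}, {"id": 2, "qty": 9}, {"id": 3, "qty": 1}]
print(delta(old, new))
```

Here record 1 is unchanged and is dropped from the load entirely; only the updated record 2 and the new record 3 move forward, which is the redundancy reduction the question is after.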