Three Tiers of Analytic Sandboxes: New Techniques to Empower Business Analysts
Analytic sandboxes are proving to be a key tactic in liberating business analysts to explore data while preventing the proliferation of spreadmarts and renegade data marts. Many BI teams already provide sandboxes of some sort, but few recognize that there are three tiers of sandboxes that can be deployed individually or in concert to meet the unique needs of every organization
Analytic sandboxes adhere to the maxim, “If you can’t beat them, join them.” They provide a “safe haven” for business analysts to explore enterprise data, combine it with local and external data, and then massage and package the resulting data sets without jeopardizing an organization’s proverbial “single version of truth” or adversely affecting performance for general DW users.
By definition, analytic sandboxes are designed for exploratory analysis, not production reporting or generalized distribution. Ideally, sandboxes come with an expiration date (e.g. 90 days), reinforcing the notion that they are designed for ad hoc analyses, not application development. If analysts want to convert what they’ve created into a scheduled report or application, they need to turn it over to the BI team to “productionize” it.
Unfortunately, analytic sandboxes can’t enforce information policies. Analysts can still export data sets to their desktop machines, email results to colleagues, and create unauthorized production applications. Ultimately, organizations that establish sandboxes must establish policies and procedures for managing information in a consistent manner and provide sufficient education about proper ways to produce and distribution information. Nonetheless, many BI teams are employing analytic sandboxes with reasonable success.
Tiers of Sandboxes
1. DW-Centric Sandboxes. The traditional analytic sandbox carves out a partition within the data warehouse database, upwards of 100GB in size, in which business analysts can create their own data sets by combining DW data with data they upload from their desktops or import from external sources. These DW-centric sandboxes preserve a single instance of enterprise data (i.e., they don’t replicate DW data), make it easier for database and DW administrators to observe what analysts are doing, and help analysts become more comfortable working in a corporate data environment. It’s also easier for the BI team to convert analyses into production applications since the analytic output is already housed in the DW.
However, a DW-centric sandbox can be difficult to manage from a systems perspective. Database administrators must create and maintain partitions and access rights and tune workload management utilities to ensure adequate performance for both general DW users and business analysts. An organization that has dozens or hundreds of analysts, each of whom wants to create large data sets and run complex queries, may bog down performance even with workload management rules in place. Inevitably, the BI team may need to upgrade the DW platform at considerable expense to support the additional workload.
2. Replicated Sandboxes. One way to avoid performance problems and systems management complexities is to replicate the DW to a separate platform designed exclusively for analysts. Many companies have begun to physically separate the production DW from ad hoc analytical activity by purchasing specialized DW appliances.
This approach offloads complex, ad hoc queries issued by a handful of people to a separate machine, leaving the production DW to support standardized report delivery, among other things. DW performance improves significantly without a costly upgrade, and analysts get free reign of a box designed exclusively for their use.
Of course, the downside to this is cost and duplication of data. Organizations must purchase, install, and maintain a separate database platform--which may or may not run the same database and server hardware as the DW. Executives may question why they need a separate machine to handle tasks they thought the DW was going to handle.
In addition, the BI team must establish and maintain a utility to replicate the data to the sandbox, which may take considerable expertise to create and maintain. The replication can be done at the source systems, the ETL layer, the DW layer (via mirrored backup), or the DW storage system. Also, with multiple copies of data, it’s easy for the two systems to get out of sync and for analysts to work with outdated information.
3. Managed Excel Sandboxes. The third tier of analytic sandbox runs on the desktop. New Excel-based analytical tools, such as Microsoft’s PowerPivot and Lyzasoft’s Lyza Workstation, contain in-memory columnar databases that run on desktop machines, giving analysts unheralded power to access, massage, and analyze large volumes of data in a manner that conforms to the way they’ve traditionally done such work (i.e., using Excel versus SQL.)
Although these spreadsheets-on-steroids seem like a BI manager’s worst nightmare, there is a silver lining: analysts who want to share their results have to publish through a server managed by corporate IT. This is why I call this type of sandbox a “managed Excel” environment.
For example, with Microsoft PowerPivot, analysts publish their results to Microsoft SharePoint, which makes the results available to other users via Excel Services, which is a browser-based version of Excel. Excel Services prevents users from changing or downloading the report, preventing unauthorized distribution. In the same way, Lyzasoft lets analysts publish data to the Lyza Commons, where others can view and comment on the output via browser-based collaborative tools.
Of course, where there is a will there is a way and business analysts can and will find ways to circumvent the publishing and distribution features built into PowerPivot and Lyza workbooks and other managed Excel environments. But the collaborative features of their server-based environments are so powerful and compelling that I suspect most business analysts will take the path of least resistance and share information in this controlled manner.
Combining Sandboxes. A managed Excel sandbox might work well in conjunction with the other two sandboxes, especially if the corporate sandboxes have performance or size constraints. For example, analysts could download a subset of data from a centralized sandbox to their managed Excel application, combine it with local data on their desktops, and conduct their analyses using Excel. If they liked what they discovered, they could then run the analysis against the entire DW within the confines of a centralized sandbox.
Our industry is in the early stages of learning how to make most effective use of analytic sandboxes to liberate power users without undermining information consistency. With three (and perhaps more) types of analytic sandboxes, BI teams can tailor the sandbox experience to meet the unique needs of their organization.
Posted by Wayne Eckerson on March 22, 2010