What is your e-mail address?

My e-mail address is:

Do you have a password?

Forgot your password? Click here
close

Experts Blog: Philip Russom

Philip RussomPhilip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 500 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at prussom@tdwi.org, @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.


Philip Russom

By Philip Russom


Q&A RE: Data Warehouse Architecture Issues

Attendees of a recent TDWI Webinar asked excellent questions.
By Philip Russom, TDWI Research Director for Data Management

Recently, on Tuesday April 15, 2014, I broadcasted a TDWI Webinar in which I presented some of the findings from my new TDWI report, Evolving Data Warehouse Architectures in the Age of Big Data. You can download a free copy of the report in a PDF file. And you can replay the Webinar.

Attendees of the Webinar posed several very good questions about various issues in data warehouse architecture. Please allow me to share a few of the attendees’ questions and the answers I sent them via e-mail:

Q. As we update our data warehouse from more reporting to more analytics functions, should we design a brand new data warehouse architecture, or improve from the existing one?

If the existing data warehouse and its architecture fulfill business requirements and technical performance requirements (for speed and scale), then you should try to build out the existing architecture. For that to work, your existing vendor platform under the warehouse must perform well with multiple mixed workloads, including analytic workloads; ask your vendor representative for customer references who’ve succeeded with mixed workloads. Also, building up data sets for advanced analytics typically means loading large data volumes into the warehouse, which may cost more money with some licenses; again, ask your vendor if there are such ramifications under your current license. 

If your current core warehouse platform cannot support mixed workloads with high performance (or adding analytic data costs too much money), you may decide to manage and process large data sets for advanced analytics on a separate standalone platform that integrates with your warehouse. But in that case, you still keep your existing data warehouse and most of its data structures intact, just making slight changes for better integration with the new additional platform(s) for advanced analytics.

Q. Given the lack of integration across this multi-platform [data warehouse] environment, how do we avoid the need to replicate DW transactional sources into the big data platforms, as transactions are required in mining?

Good question, and there are number of issues here. First, a well-designed multi-platform environment won’t suffer a “lack of integration.” TDWI’s definition of “logical data warehouse” is that the logical design specifies integration schemes (not just data models) across physically distinct platforms, whether that integration takes a data model approach (as in shared or conformed dimensions, etc.) or a data integration approach (as in jobs for ETL, replication, etc.) or both. Second, I take your point, that replicating data more than needed can lead to a variety of problems, as data gets out of sync and loses integrity. A good architecture can minimize replication, and sometimes alleviate it. Third, for decades, users have faced the same decision you’re looking at: do we store, manage, and analytically process our rich, valuable collection of transactional data in the warehouse proper or on a standalone but integrated platform, such as the usual operational data store (ODS)?

For years, a solution I’ve seen users successfully adopt is to deploy a homegrown ODS that they’ve designed and optimized for transactions. The ODS is on a standalone platform that’s integrated with the core warehouse (plus other ODSs, marts, etc.), running on a relational DBMS atop commodity priced hardware. Note that the upcoming trend is toward ODSs atop Hadoop (but only if the data volumes are massive). The idea is to manage transactional data on a platform that’s much cheaper than the DW, on a standalone platform where the relentless sorting, updating, and processing of that data won’t degrade warehouse performance. Yet, the ODS is easily reached from all tools, plus through data federation and virtualization as well, which minimizes the replication of transactional data.

If you give the ODS the capacity it needs to persist multiple sort orders and data subsets in the ODS, then copying data outside the ODS is further reduced. Also, if you use data mining tools that can work on data “in situ” (i.e., in the ODS’s relational database) without moving data to the tool, then that also reduces copying and moving transactional data.

Q. The need for data warehouses is never going to go away. But isn’t the separation between "operations" and "analytics" starting to blur? In other words, the future isn't DWE; it's a "data environment" that does both.

Operational BI is all about getting operational data into BI faster and more frequently, while also embedding BI functions in operational applications and their processes as well. Operational BI is a very popular practice. It has been for years, and will get even more popular, as organizations adjust their BI efforts to bring them closer to real time (to be more competitive, customer conscious, efficient, etc.). The widespread existence of operational BI corroborates that the line between operations and BI is already quite blurred and will become even more so.

In another trend, many organizations are purposefully evolving toward a more or less loosely unified data environment for most enterprise data. I say “more or less” and “loosely” because early adopters are quick to say that the architecture is not 100 percent of the enterprise and integration is spotty, on an “as needed” basis. As one architect joked, “it’s more archaeology than architecture, because the work usually consists of imposing a logical architecture over mature, preexisting systems.” For early adopters, it makes sense to architect data globally, when customer data and some other data domains are pervasively shared across multiple applications, departments, and processes. It also makes sense in firms where business processes ramble across multiple business units and IT systems. Obviously, there’s an infinitude of resulting enterprise data architectures.

The data warehouse environment (DWE) I’m describing is a local microcosm of such a broad and loosely unified multi-platform data architecture. However, in some organizations today, the data warehouse and similar data platforms are just a few among many other data platforms, integrated on an enterprise scale. But those organizations are as yet the minority, although we at TDWI expect it to be the norm for IT-intense organizations within five years. TDWI’s Vegas conference has been devoted to issues in enterprise-scale data architecture for years, and will continue to be. You might consider attending next February.

Q. Can you point us to white papers on the difference between reporting and analytics [and how that affects DW architecture]?

You can read my blog on the subject. Or you could read the new report on evolving data warehouse architectures, because I adapted material from the blog to become a section in the report, starting on page 24.

Q. What’s the role, or is there a role, for variants like an ODS in the new world [of data warehouse architectures]? Is it part of the real-time world?”

Historically, some of the first standalone systems in a multi-platform data warehouse (going back to the mid-1990s) were ODSs deployed on their own hardware sever with their own DBMS instances. These are still with us, and will continue to be with us, as data warehouse environments evolve into even more platforms used at once. An ODS can be designed and optimized by users for a wide range of data domains and uses (including real-time data), but I’m currently seeing a lot of users deploying ODSs for various types of big data and other data earmarked for advanced analytics.

Q. Saying Inmon vs. Kimball is no longer relevant is like saying Newton is no longer relevant in the world of physics today. It's still important, maybe not as fundamental as 1–2 decades ago.

For decades, Newton practiced alchemy in his copious spare time, because he was convinced that changing lead to gold was possible. Our heroes aren’t always 100 percent right.

Concerning Inmon and Kimball, see the top of page 7 in the report. Also please read the User Story on that same page. “No longer relevant” is your phrase, not mine. In my view, Inmon and Kimball’s innovations are as relevant as ever, and are still being applied daily. And they just keep giving: Inmon has recently extended our understanding of unstructured data and Kimball is currently working new best practices for Hadoop.

It’s the users who’ve changed. Instead of arguing about which to choose, users choose to apply Inmon and Kimball techniques (and others, too) in the same extended warehouse environment. And that’s a wise choice on their part, since hybrids and diversity seem to be winning strategies for a growing number of user organizations and their diversified DW architectures nowadays.

Q. Some organizations consider Hadoop a replacement for their current DW appliance. How is this possible?

As I said in the Webinar, I’ve only found two organizations that took out a data warehouse and put Hadoop in its place. While that corroborates that a replacement is possible, it’s not likely, nor is it a compelling trend.

Instead of replacement, we at TDWI see far more users augmenting their data warehouse environment with the Hadoop Distributed File System (HDFS), plus related Hadoop tools, especially MapReduce, Hive, HBase, and Pig. In short, HDFS handles things that relational warehouses are not designed for, such as unstructured data, algorithmic analytics, millions of files, and petabyte-size data sets. But the relational warehouse is still best for the structured and multidimensional data that goes into standard reports, performance management, and set-based analytics (typically OLAP or SQL-based analytics).

Another possibility is that Hive atop MapReduce and HDFS makes a highly scalable “row store” type of database. Sometimes you don’t need a full-featured (and expensive) relational DBMS, and hence a row store will do just fine. For example, many of the ODSs found today in data warehouse environments are candidates for migration to Hadoop. That includes ODSs that manage large “archives” (I use the word loosely) of transactional data and other operational data that’s persisted and kept long-term for advanced analytics that just need simple tabular structures. Most standalone ODSs of that description today run on mature DBMSs, but could run almost as well (for less money) on Hadoop.

Finally, let’s remember that not all organizations need a data warehouse, as represented by 15 percent of survey respondents.

Q. Can you recommend any sample success stories on how to integrate Hadoop or similar big data into an existing data warehouse [environment]?

Yes, many real-world use cases and user stories are discussed in the 2013 TDWI report Integrating Hadoop into Business Intelligence and Data Warehousing.

Posted on April 30, 20140 comments


Evolving Data Warehouse Architectures: An Overview in 35 Tweets

By Philip Russom
Research Director for Data Management, TDWI

To help you better understand the ongoing evolution of data warehouse architectures and why you should care, I’d like to share with you the series of 35 tweets I recently issued on the topic. I think you’ll find the tweets interesting because they provide an overview of big data management and its best practices in a form that’s compact, yet amazingly comprehensive.

Every tweet I wrote was a short sound bite or stat bite drawn from my recent TDWI report Evolving Data Warehouse Architectures in the Age of Big Data. Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think that all of you already know them or can figure them out. Even so, I deleted a few tiny URLs, hashtags, and repetitive phrases. I issued the tweets in groups, on related topics; so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.

Basic Components of the Average Data Warehouse Architecture

  1. Most DW Arch’s have 4 layers: logical, physical, hardware topology, data standards.
  2. DW logical architecture is mostly about data models, entity models & relationships.
  3. DW logical arch also defines standards for data models, dev practices, interfaces, etc.
  4. DW physical architecture is mostly a plan for data deployment on servers.
  5. DW physical arch also defines topology for hardware & software servers plus interfaces.

Users’ Views of Architectural Components

  1. #TDWI SURVEY SEZ: Data standards & rules are highest priority (71%) of #EDW architecture.
  2. #TDWI SURVEY SEZ: Logical design (66%) is the starting point of an #EDW architecture.
  3. #TDWI SURVEY SEZ: Physical plan (56%) locates logical pieces in an #EDW architecture.
  4. #TDWI SURVEY SEZ: Only 12% have #EDW that’s “collection of data & platforms without a plan.”
  5. #TDWI SURVEY SEZ: Only 12% feel Inmon vs Kimball argument is priority for #EDW architecture.

The Evolution of Data Warehouse Architectures

  1. #TDWI SURVEY SEZ: 79% say their #DataWarehouse has an architecture.
  2. #TDWI SURVEY SEZ: #EDW arch is evolving dramatically (22%), moderately (54%) or slightly (22%)
  3. #TDWI SURVEY SEZ: Driving #EDW arch evolution: #Analytics 57%, #BigData 56%, #RealTime 41%.
  4. #TDWI SURVEY SEZ: Driving #EDW arch evolution: BizPerfMgt 38%, OLAP 30%, UnstrucData 25%.
  5. #TDWI SURVEY SEZ: Driving #EDW arch evolution: competition 45%, compliance 29%, dep’ts 29%.

The Importance of Data Warehouse Architectures

  1. #TDWI SURVEY SEZ: Architecture extremely (79%) or moderately (19%) important to #EDW success.
  2. #TDWI SURVEY SEZ: #EDW Architecture is an opportunity (84%), not a problem (16%).

Benefits and Barriers for Data Warehouse Architecture

  1. #TDWI SURVEY SEZ: Stuff that benefits from #DWarch: #analytics, biz value, data breadth.
  2. #TDWI SURVEY SEZ: Barriers to #DWarch success: skills gap, sponsorship, #DataMgt, funding.

Multi-Platform Data Warehouse Environments

  1. #EDWarch trend: more standalone platforms: #analytics DBMSs, columnar, appliances, #Hadoop, etc.
  2. As #EDW workloads get more diverse, so do types of standalone data platforms in #EDW environment.
  3. As types and numbers of data platforms grow in DW environs, architecture gets ever more distributed. #
  4. Distributed #EDWarch is good&bad: provides workload optimized platforms. But may spawn data silos.
  5. Logical layer of #EDWarch more important than ever to unite big design across multi data platforms.

Single-Platform versus Multi-Platform DW Architectures

  1. #TDWI SURVEY SEZ: Totally pure #EDWarchs are rare. Only 15% have central monolithic #EDW.
  2. #TDWI SURVEY SEZ: Hybrid #EDWarchs are most common today = central #EDW + a few other data platforms (37%).
  3. #TDWI SURVEY SEZ: 2nd most common Hybrid #EDWarch = central #EDW + many other data platforms (16%).
  4. #TDWI SURVEY SEZ: Sometimes #EDW plays small role in #EDWarch compared to workload platforms (15%).
  5. #TDWI SURVEY SEZ: Some organizations (15%) have many workload-specific data platforms, but no true DW.

Big Data’s Influence on Evolving DW Architectures

  1. #TDWI SURVEY SEZ: 41% will extend existing core #EDW to handle #BigData.
  2. #TDWI SURVEY SEZ: 25% will deploy new data platforms to handle #BigData.
  3. #TDWI SURVEY SEZ: 23% have no strategy for their #EDW’s architecture, though they need one.
  4. #TDWI SURVEY SEZ: Only 6% feel they don’t need a strategy for their #EDW’s architecture.

Reports and Analytics have Different DW Architecture Needs

  1. Many users preserve #EDW for reporting, BizPerfMgt & OLAP, but take #analytics data elsewhere.
  2. Data prep for reports differs from same for #analytics. So, many users prep data on separate platforms.

Want to learn more about evolving data warehouse architectures?

For a more detailed discussion—in a traditional publication!—get the TDWI Best Practices Report, titled Evolving Data Warehouse Architectures in the Age of Big Data, which is available in a PDF file via a free download.

You can also register for and replay my TDWI Webinar, where I present the findings of the TDWI report Evolving Data Warehouse Architectures in the Age of Big Data.

Posted on April 15, 20140 comments


Q&A RE: The State of Big Data Integration

It’s still early days, but users are starting to integrate big data with enterprise data, largely for business value via analytics.

By Philip Russom, TDWI Research Director for Data Management

A journalist from the IT press recently sent me an e-mail containing several very good questions about the state of big data relative to integrating it with other enterprise data. Please allow me to share the journalist’s questions and my answers:

How far along are enterprises in their big data integration efforts?

According to my survey data, approximately 38% of organizations don’t even have big data, in any definition, so they’ve no need to do anything. See Figure 1 in my 2013 TDWI report Managing Big Data. Likewise, 23% have no plans for managing big data with a dedicated solution. See Figure 5 in that same report.

Even so, some organizations have big data, and they are already managing it actively. Eleven percent have a solution in production today, with another 61% coming in the next three years. See Figure 6.

Does data integration now tend to be haphazard, or one-off projects, in many enterprises, or are architectural strategies emerging?

I see all the above, whether with big data or the usual enterprise data. Many organizations have consolidated most of their data integration efforts into a centralized competency center, along with a centrally controlled DI architecture, whereas a slight majority tend to staff and fund DI on a per-application or per-department basis, without an enterprise strategy or architecture. Personally, I’d like to see more of the former and less of the latter.

What are the best approaches for big data integration architecture?

Depends on many things, including what kind of big data you have (relational, other structures, human language text, XML docs, etc.) and what you’ll do with it (analytics, reporting, archiving, content management). Multiple big data types demand multiple data platforms for storing big data, whereas multiple applications consuming big data require multiple processing types to prepare big data for those applications. For these reasons, in most cases, managing big data and getting business use from it involves multiple data management platforms (from relational DBMSs to Hadoop to NoSQL databases to clouds) and multiple integration tools (from ETL to replication to federation and virtualization).

Furthermore, capturing and integrating big data can be challenging from a data integration viewpoint. For example, the streaming big data that comes from sensors, devices, vehicles, and other machines requires special event-processing technologies to capture, triage, and route time-sensitive data—all in a matter of milliseconds. As with all data, you must transform big data as you move it from a source to a target, and the transformations may be simple (moving a click record from a Web log to a sessionization database) or complex (deducing a fact from human language text and generating a relational record from it).

What "traditional" approaches are being updated with new capabilities and connectors?

The most common data platform being used for capturing, storing, and managing big data today are relational databases, whether based on MPP, SMP, appliance, or columnar architectures. See Figure 16 in the Managing Big Data report. This makes sense, given that in a quarter of organizations big data is mostly or exclusively structured data. Even in organizations that have diverse big data types, structured and relational types are still the most common. See Figure 1.

IMHO, we’re fortunate that vendors’ relational database management systems (RDBMSs) (from the old brands to the new columnar and appliance-based ones) have evolved to scale up to tens and hundreds of terabytes of relational and otherwise structured data. Data integration tools have likewise evolved. Hence, scalability is NOT a primary barrier to managing big data.

If we consider how promising Hadoop technologies are for managing big data, it’s no surprise that vendors have already built interfaces, semantic layers, and tool functionality for accessing a broad range of big data managed in the Hadoop Distributed File System (HDFS). This includes tools for data integration, reporting, analysis, and visualization, plus some RDBMSs.

What are the enterprise "deliverables" coming from users’ efforts with big data (e.g., analytics, business intelligence)?

Analytics is the top priority and hence a common deliverable from big data initiatives. Some reports also benefit from big data. A few organizations are rethinking their archiving and content management infrastructures, based on big data and the potential use of Hadoop in these areas.

How is the role of data warehousing evolving to meet the emergence of Big Data?

Big data is a huge business opportunity, with few technical challenges or downsides. See figures 2 through 4 in the report Managing Big Data. Conventional wisdom says that the opportunity for business value is best seized via analytics. So the collection, integration, and management of big data is not an academic exercise in a vacuum. It is foundational to enabling the analytics that give an organization new and broader insights via analytics. Any calculus for the business return on managing big data should be based largely on the benefits of new analytics applied to big data.

On April 1, 2014, TDWI will publish my next big report on Evolving Data Warehouse Architectures in the Age of Big Data. At that time, anyone will be able to download the report for free from www.tdwi.org.

How are the new platforms (such as Hadoop) getting along with traditional platforms such as data warehouses?

We say “data warehouse” as if it’s a single monolith. That’s convenient, but not very accurate. From the beginning, data warehouses have been environments of multiple platforms. It’s common that the core warehouse, data marts, operational data stores, and data staging areas are each on their own standalone platforms. The number of platforms increased early this century, as data warehouse appliances and columnar RDBMSs arrived. It’s now increasing again, as data warehouse environments now fold in new data platforms in the form of the Hadoop Distributed File System (HDFS) and NoSQL databases. The warehouse has always evolved to address new technology requirements and business opportunities; it’s now evolving again to assure that big data is managed appropriately for the new high-value analytic applications that many businesses need.

For an exhaustive discussion of this, see my 2013 TDWI report Integrating Hadoop into Business Intelligence and Data Warehousing.

Posted on January 22, 20140 comments


Managing Big Data: An Overview in 30 Tweets

By Philip Russom
Research Director for Data Management, TDWI

To help you better understand new practices for managing big data and why you should care, I’d like to share with you the series of 30 tweets I recently issued on the topic. I think you’ll find the tweets interesting, because they provide an overview of big data management and its best practices in a form that’s compact, yet amazingly comprehensive.

Every tweet I wrote was a short sound bite or stat bite drawn from my recent TDWI report “Managing Big Data.” Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think that all of you already know them or can figure them out. Even so, I deleted a few tiny URLs, hashtags, and repetitive phrases. I issued the tweets in groups, on related topics; so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.

Types of Multi-Structured Data Managed as Big Data
1. #TDWI SURVEY SEZ: 26% of users manage #BigData that’s ONLY structured, usually relational.
2. #TDWI SURVEY SEZ: 31% manage #BigData that’s eclectic mix of struc, unstruc, semi, etc.
3. #TDWI SURVEY SEZ: 38% don’t have #BigData by any definition. Hear more in #TDWI Webinar Oct.8 noonET http://bit.ly/BDMweb
4. Structured (relational) data from traditional apps is most common form of #BigData.
5. #BigData can be industry specific, like unstruc’d text in insurance, healthcare & gov.
6. Machine data is special area of #BigData, with as yet untapped biz value & opportunity.

Reasons for Managing Big Data Well
7. Why manage #BigData? Keep pace w/growth, biz ROI, extend ent data arch, new apps.
8. Want to get biz value from #BigData? Manage #BigData for purposes of advanced #analytics.
9. #BigDataMgt yields larger samples for apps that need it: 360° views, risk, fraud, customer seg.
10. #TDWI SURVEY SEZ: 89% feel #BigDataMgt is opportunity. Mere 11% think it’s a problem.
11. Key benefits of #BigDataMgt are better #analytics, datasets, biz value, sales/marketing.
12. Barriers to #BigDataMgt: low maturity, weak biz support, new design paradigms.
13. #BigDataMgt non-issues: bulk load, query speed, scalability, network bandwidth.

Strategies for Users’ Big Data Management Solutions
14. #TDWI SURVEY SEZ: 10% have #BigDataMgt solution in production; 10% in dev; 20% prototype; 60% nada. #TDWI Webinar Oct.8 http://bit.ly/BDMweb
15. #TDWI SURVEY SEZ: Most common strategy for #BigDataMgt: extend existing DataMgt systems.
16. #TDWI SURVEY SEZ: 2nd most common strategy for #BigDataMgt: deploy new DataMgt systems for #BigData.
17. #TDWI SURVEY SEZ: 30% have no strategy for #BigDataMgt though they need one.
18. #TDWI SURVEY SEZ: 15% have no strategy for #BigDataMgt cuz they don’t need one.

Ownership and Use of Big Data Management Solutions
19. Some depts. & groups have own #BigDataMgt platforms, including #Hadoop. Beware teramart silos!
20. Trend: #BigDataMgt platforms supplied by IT as infrastructure. Imagine shared #Hadoop cluster.
21. Who does #BigDataMgt? analysts 22%; architects 21%; mgrs 21%; tech admin 13%; app dev 11%.

Tech Specs for Big Data Management Solutions
22. #TDWI SURVEY SEZ: 97% of orgs manage structured #BigData, followed by legacy, semi-struc, Web data etc.
23. Most #BigData stored on trad drives, but solid state drives & in-memory functions are gaining.
24. #TDWI SURVEY SEZ: 10-to-99 terabytes is the norm for #BigData today.
25. #TDWI SURVEY SEZ: 10% have broken the 1 petabyte #BigData barrier. Another 13% will within 3 years.

A Few Best Practices for Managing Big Data
26. For open-ended discovery-oriented #analytics, manage #BigData in original form wo/transformation.
27. Reporting and #analytics are different practices; managing #BigData for each is, too.
28. #BigData needs data standards, but different ones compared to other enterprise data.
29. Streaming #BigData is easy to capture & manage offline, but tough to process in #RealTime.
30. Non-SQL, non-relational platforms are coming on strong; BI/DW needs them for diverse #BigData.

Want to learn more about managing big data?

For a much more detailed discussion—in a traditional publication!—get the TDWI Best Practices Report, titled Managing Big Data, available in a PDF file via a free download.

You can also register for and replay my TDWI Webinar, where I present the findings of Managing Big Data.

Please consider taking courses at the TDWI World Conference in Boston, October 20–25, 2013. Enroll online.
============================
Philip Russom is the research director for data management at The Data Warehousing Institute (TDWI). You can reach him at prussom@tdwi.org or follow him as @prussom on Twitter.

Posted on October 11, 20130 comments


Analytics and Reporting Are Two Different Practices

Treat them differently, if you want to get the most out of each.

By Philip Russom, TDWI Research Director for Data Management

I regularly get somewhat off-base questions from users who are in the thick of implementing or growing their analytic programs, and therefore get a bit carried away. Here’s a question I’ve heard a lot recently: “Our analytic applications generate so many insights that I should decommission my enterprise reporting platform, right?” And here’s a related question: “Should we implement Hadoop to replace our data warehouse and/or reporting platform?”

The common misconception I perceive behind these questions (which makes them “off-base” in my perception) is that people seem to be forgetting that analytics and reporting are two different practices. Analytics and reporting serve different user constituencies, produce different deliverables, prepare data differently, and support organizational goals differently. Despite a fair amount of overlap, I see analytics and reporting as complementary, which means you most likely need both and neither will replace the other. Furthermore, due to their differences, each has unique tool and data platform requirements that you need to satisfy, if you’re to get the most out of each.

Allow me to net it out with a few sweeping generalizations.

Reporting is mostly about entities and facts you know well, represented by highly polished data that you know well. And that data usually takes the form of carefully modeled and cleansed data with rich metadata and master data that’s managed in a data warehouse. In fact, it’s difficult to separate reporting and data warehouses, because most users designed their DWs first and foremost as a repository for reporting and similar practices such as OLAP, performance management, dashboards, and operational BI.

I regularly hear claims that Hadoop can replace a true DW. But I doubt this, because the current state of Hadoop cannot satisfy the data requirements of enterprise reporting near as well as the average DW can. Ultimately, it’s not about the warehouse per se; it’s about practices a DW supports well, such as reporting. I reserve the right to change my mind in the future, because Hadoop gets more sophisticated almost daily. My real point: most enterprise reporting depends on a DW for success, so keep and protect the DW.

Advanced analytics enables the discovery of new facts you didn’t know, based on the exploration and analysis of data that’s probably new to you. New data sources generally tell you new things, which is one reason organizations are analyzing big data more than ever before. Unlike the pristine data that reports operate on, advanced analytics works best with detailed source data in its original (even messy) form, using discovery oriented technologies, such as mining, statistics, predictive algorithms, and natural language processing. Sure, DWs can be expanded to support some forms of big data and advanced analytics. But the extreme volumes and diversity of big data are driving more and more users to locate big data on a platform besides a DW, such as Hadoop, DW appliances, or columnar databases.

I personally think that providing separate data platforms for reporting and analytics is a win-win data strategy. It frees up capacity on the DW, so it can continue growing and supporting enterprise reporting plus related practices. And it gives advanced analytics a data platform that’s more conducive to exploration and discovery than the average DW is.

Reporting is like a “high-volume business,” whereas analytics is like a “high-value business.” For example, with so-called enterprise business intelligence, thousands of concurrent report consumers access tens of thousands of reports that are refreshed nightly. By comparison, a small team of data analysts can transform an organization with a few high-value insights, such as new customer segments, visibility into costs, correlations between supplies and product quality, fraud detection, risk calculations, and so on. For completely different reasons, you need both reporting and analytics to serve the full range of user constituencies and provide many different levels of information and insight.

Most reports demand numeric precision, whereas most analyses don’t. Think financial reports (accurate to the penny) versus website page view reports (where guesstimates are fine).

Most enterprise reports require an audit trail, whereas few analyses do. Think regulatory reports versus the scores of an analytic model for customer churn.

Data management techniques differ. Squeaky clean report data demands elaborate data processing (for ETL, quality, metadata, master data, and so on), whereas preparing raw source data for analytics is simpler, though at higher levels of scale.

CONCLUSIONS: Despite some overlap, enterprise reporting and advanced analytics are so different as to be complementary. Hence, neither will replace the other. Both do important things for an information-driven organization, so you must give each what it needs for success, both at the tool level and at the data management level. Taking seriously the data requirements of big data analytics may lead you to implement Hadoop; but that doesn’t mean that Hadoop will replace a DW, which is still required to satisfy the data requirements of reporting and related practices, such as OLAP, performance management, dashboards, and operational BI.

Posted on September 26, 20130 comments


Evolving Data Warehouse Architectures: Integrating HDFS with an RDBMS Alleviates the Limitations of Both

Hadoop has limitations. But the relational database management systems used for data warehousing do, too. Luckily, their strengths are complementary.

By Philip Russom, TDWI Research Director for Data Management

In a recent blog in this series, I discussed “The Roles of Hadoop” in evolving data warehouse architectures. (There’s a link to that blog at the end of this blog.) In response, a few people asked me (I’m paraphrasing): “Since the Hadoop Distributed File System (HDFS) is so useful, can it replace the relational database management system (RDBMS) that’s at the base of my current data warehouse and its architecture?”

The short answer is: “No.” The long answer is: “Not today, and probably not in the future.” The main reason is that Hadoop—in its current form—lacks (or is weak with) many of the functions that we depend on in our RDBMSs. As you’ll see in the list below, most of the RDBMS functions I have in mind enable feature-rich and high-performance access to stored data via SQL. Other functions concerns tools for data security and administration.


Just so you know where this blog is going: Hadoop has limitations, but the average data warehouse does, too. Luckily, the strengths and weaknesses of the two are complementary (for the most part). When you integrate Hadoop and an RDBMS, they fill in each other’s holes and provide a more broadly capable data warehouse architecture than has been possible until now.

Hadoop’s Limitations Relative to RDBMSs Used for Data Warehousing
Despite all the goodness of Hadoop I described in a previous blog, there are areas within data warehouse architectures where HDFS isn’t such a good fit:

RDBMS functionality. HDFS is a distributed file system and therefore lacks capabilities we expect from relational database management systems (RDBMSs), such as indexing, random access to data, support for standard SQL, and query optimization. But that’s okay, because HDFS does things RDBMSs do not do as well, such as managing and processing massive volumes of file-based, unstructured data. For minimal DBMS functionality (though not fully relational), users can layer HBase over HDFS, as well as the query framework called Hive.

Low-latency data access and queries. HDFS’s batch-oriented, serial-execution engine means that it’s not the best platform for real-time or speedy data access or queries. Furthermore, Hadoop lacks mature query optimization. Hence, the selective random access to data and iterative ad hoc queries that we take for granted with RDBMSs are alien to Hadoop.

An RDBMS integrated with Hadoop can provide needed query support. HBase is a possible solution, if all you need is a record store, not a full-blown DBMS. And upcoming improvements to Hadoop Hive and the new Impala query engine will address some of the latency issues.


Streaming data. HDFS and other Hadoop products can capture data from streaming sources (Web servers, sensors, machinery) and append it to files. But, being inherently batch, they are ill-equipped to process that data in real time. In my opinion at this date, such extremes of real-time analytics are best done with specialized tools for complex event processing (CEP) and/or operational intelligence (OI) from third-party vendors.

Granular security. Hadoop today includes a few security features, such as file-permission checks, access control for job queues, and service-level authorization. Add-on products that provide encryption and LDAP integration are available for Hadoop from a few third-party vendors. Since HDFS is not a DBMS (and Hadoop data doesn’t necessarily come in relational structures), don’t expect granular security at the row or field level, as in an RDBMS.

Administrative tools. According to a TDWI survey, security is Hadoop users’ most pressing need, followed by a need for better administrative tools, especially for cluster deployment and maintenance. The good news is that a few vendors offer tools for Hadoop administration, and an upgrade of open-source Ambari is coming.

SQL-based analytics. With the above latency limitations in mind, HDFS is a problematic choice for workloads that are iterative and query based, as with SQL-based analytics. Furthermore, Hadoop products today have limited support for standard SQL. A number of vendor products (from RDBMSs to data integration and reporting tools) can provide SQL support today, and open-source Hadoop has new incubator projects that will eventually provide adequate support for SQL. These are critical if Hadoop is to become a productive part of a SQL-driven data warehouse architecture.

I’m not making up these limitations for Hadoop. The list is based on survey results and user interviews, as reported in my 2013 TDWI Best Practices Report: Integrating Hadoop into BI and Data Warehousing.

In Defense of Hadoop
My list of limitations might seem like “Hadoop bashing” to some readers, but that is not what I intend. So let me restate what I stated positively in the last blog: “HDFS and other Hadoop tools promise to extend and improve some areas within data warehouse architectures.”

Sure, Hadoop’s help is limited to “some areas.” But the fantastically fortuitous fact is that most of Hadoop’s strengths are in areas where most warehouses and BI technology stacks are weak, such as unstructured data, outrageously large data sets, non-SQL algorithmic analytics, and the flood of files that’s drowning many of us. Conversely, Hadoop’s limitations (as discussed above) are mostly met by mature functionality available today from a wide range of RDBMS types (OLTP databases, columnar databases, DW appliances, etc.), plus admin tools. In that light, I hope it’s clear that Hadoop and the average data warehouse are complementary (despite a bit of overlap), so it’s unlikely that one could replace the other, as I am often asked.


Integrating Hadoop with an RDBMS Alleviates the Limitations of Both
The trick, of course, is making HDFS and an RDBMS work together optimally. To that end, one of the critical success factors for assimilating Hadoop into evolving data warehouse architectures is the improvement of interfaces and interoperability between HDFS and RDBMSs. Luckily, this is well under way, due to efforts from software vendors and the open source community. And technical users are starting to leverage HDFS/RDBMS integration.

For example, an emerging best practice among DW professionals with Hadoop experience is to manage diverse big data in HDFS, but process it and move the results (via ETL or other data integration media) to RDBMSs (elsewhere in the DW architecture) that are more conducive to SQL-based analytics. HDFS serves as a massive data staging area. A similar best practice is to use an RDBMS as a front-end to HDFS data; this way, data is moved via queries (whether ad hoc or standardized), not via ETL jobs. HDFS serves as a large, diverse operational data store. For more information about these practices, replay my recent TDWI Webinar: "Ad Hoc Query Speed for Hadoop."

Other Blogs in the Evolving Data Warehouse Architectures series:
From EDW to DWE
The Role(s) of Hadoop

Posted on September 2, 20130 comments


Back to Top

Channels by Topic

  • Agile BI »
    Includes:
    • Agile
    • Scoping
    • Principles
    • Iterations
    • Scrum
    • Testing
  • Big Data Analytics »
    Includes:
    • Advanced Analytics
    • Diverse Data Types
    • Massive Volumes
    • Real-time/Streaming
    • Hadoop
    • MapReduce
  • Business Analytics »
    Includes:
    • Advanced Analytics
    • Predictive
    • Customer
    • Spatial
    • Text Mining
    • Big Data
  • Business Intelligence »
    Includes:
    • Agile
    • In-memory
    • Search
    • Real-time
    • SaaS
    • Open source
  • BI Leadership »
    Includes:
    • Latest Trends
    • Technologies
    • Thought Leadership
  • Data Analysis and Design »
    Includes:
    • Business Requirements
    • Metrics
    • KPIs
    • Rules
    • Models
    • Dimensions
    • Testing
  • Data Management »
    Includes:
    • Data Quality
    • Integration
    • Governance
    • Profiling
    • Monitoring
    • ETL
    • CDI
    • Master Data Management
    • Analytic/Operational
  • Data Warehousing »
    Includes:
    • Platforms
    • Architectures
    • Appliances
    • Spreadmarts
    • Databases
    • Services
  • Performance Management »
    Includes:
    • Dashboards, Scorecards
    • Measures
    • Objectives
    • Compliance
    • Profitability
    • Cost Management
  • Program Management »
    Includes:
    • Leadership
    • Planning
    • Team-Building
    • Staffing
    • Scoping
    • Road Maps
    • BPM, CRM, SCM
  • Master Data Management »
    Includes:
    • Business Definitions
    • Sharing
    • Integration
    • ETL, EAI, EII
    • Replication
    • Data Governance

Sponsored Links