
TDWI Blog

Philip Russom, Ph.D., is senior director of TDWI Research for data management and is a well-known figure in data warehousing, integration, and quality, having published over 550 research reports, magazine articles, opinion columns, and speeches over a 20-year period. Before joining TDWI in 2005, Russom was an industry analyst covering data management at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and consultant, was a contributing editor with leading IT magazines, and was a product manager at database vendors. His Ph.D. is from Yale. You can reach him by email ([email protected]), on Twitter (twitter.com/prussom), and on LinkedIn (linkedin.com/in/philiprussom).


Q&A RE: The State of Big Data Integration

It’s still early days, but users are starting to integrate big data with enterprise data, largely for business value via analytics.

By Philip Russom, TDWI Research Director for Data Management

A journalist from the IT press recently sent me an e-mail with several very good questions about the state of big data, especially the state of integrating it with other enterprise data. Please allow me to share the journalist’s questions and my answers:

How far along are enterprises in their big data integration efforts?

According to my survey data, approximately 38% of organizations don’t even have big data, in any definition, so they’ve no need to do anything. See Figure 1 in my 2013 TDWI report Managing Big Data. Likewise, 23% have no plans for managing big data with a dedicated solution. See Figure 5 in that same report.

Even so, some organizations have big data, and they are already managing it actively. Eleven percent have a solution in production today, with another 61% coming in the next three years. See Figure 6.



Does data integration now tend to be haphazard, or one-off projects, in many enterprises, or are architectural strategies emerging?

I see all of the above, whether with big data or the usual enterprise data. A good number of organizations have consolidated most of their data integration efforts into a centralized competency center with a centrally controlled DI architecture, whereas a slight majority still staff and fund DI on a per-application or per-department basis, without an enterprise strategy or architecture. Personally, I’d like to see more of the former and less of the latter.

What are the best approaches for big data integration architecture?

That depends on many things, including what kind of big data you have (relational, other structures, human language text, XML docs, etc.) and what you’ll do with it (analytics, reporting, archiving, content management). Multiple big data types demand multiple data platforms for storing big data, and multiple applications consuming big data require multiple processing types to prepare big data for those applications. For these reasons, managing big data and getting business use from it usually involves multiple data management platforms (from relational DBMSs to Hadoop to NoSQL databases to clouds) and multiple integration tools (from ETL to replication to federation and virtualization).

Furthermore, capturing and integrating big data can be challenging from a data integration viewpoint. For example, the streaming big data that comes from sensors, devices, vehicles, and other machines requires special event-processing technologies to capture, triage, and route time-sensitive data—all in a matter of milliseconds. As with all data, you must transform big data as you move it from a source to a target, and the transformations may be simple (moving a click record from a Web log to a sessionization database) or complex (deducing a fact from human language text and generating a relational record from it).
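
To make the simple end of that spectrum concrete, here’s a minimal Python sketch of sessionization. It assumes clickstream records of (user ID, timestamp, URL) already sorted per user, and it uses the common (but by no means universal) convention that a gap of more than 30 minutes starts a new session; the record layout and threshold are illustrative assumptions, not anything prescribed by a standard.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # common, but arbitrary, threshold

def sessionize(clicks):
    """Assign session IDs to (user_id, timestamp, url) click records.

    Input must be sorted by (user_id, timestamp); a gap longer than
    SESSION_GAP between consecutive clicks starts a new session."""
    last_seen = {}  # user_id -> (last timestamp, current session number)
    for user_id, ts, url in clicks:
        prev = last_seen.get(user_id)
        if prev is None or ts - prev[0] > SESSION_GAP:
            session = (prev[1] + 1) if prev else 1  # new session
        else:
            session = prev[1]                       # same session
        last_seen[user_id] = (ts, session)
        yield user_id, "%s-%d" % (user_id, session), ts, url

clicks = [
    ("u1", datetime(2014, 1, 22, 9, 0), "/home"),
    ("u1", datetime(2014, 1, 22, 9, 5), "/pricing"),
    ("u1", datetime(2014, 1, 22, 11, 0), "/home"),  # >30 min gap: new session
]
for row in sessionize(clicks):
    print(row)
```

In production, logic like this typically runs inside an ETL tool or a MapReduce job rather than a single Python process, but the transformation itself is the same.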

What "traditional" approaches are being updated with new capabilities and connectors?

The most common data platforms used for capturing, storing, and managing big data today are relational databases, whether based on MPP, SMP, appliance, or columnar architectures. See Figure 16 in the Managing Big Data report. This makes sense, given that in a quarter of organizations big data is mostly or exclusively structured data. Even in organizations that have diverse big data types, structured and relational types are still the most common. See Figure 1.

IMHO, we’re fortunate that vendors’ relational database management systems (RDBMSs), from the old brands to the new columnar and appliance-based ones, have evolved to scale up to tens and hundreds of terabytes of relational and otherwise structured data. Data integration tools have likewise evolved. Hence, scalability is NOT a primary barrier to managing big data.

If we consider how promising Hadoop technologies are for managing big data, it’s no surprise that vendors have already built interfaces, semantic layers, and tool functionality for accessing a broad range of big data managed in the Hadoop Distributed File System (HDFS). This includes tools for data integration, reporting, analysis, and visualization, plus some RDBMSs.

What are the enterprise "deliverables" coming from users’ efforts with big data (e.g., analytics, business intelligence)?

Analytics is the top priority and hence a common deliverable from big data initiatives. Some reports also benefit from big data. A few organizations are rethinking their archiving and content management infrastructures, based on big data and the potential use of Hadoop in these areas.

How is the role of data warehousing evolving to meet the emergence of Big Data?

Big data is a huge business opportunity, with few technical challenges or downsides. See Figures 2 through 4 in the report Managing Big Data. Conventional wisdom says that the opportunity for business value is best seized via analytics. So the collection, integration, and management of big data is not an academic exercise in a vacuum: it is foundational to the analytics that give an organization new and broader insights. Any calculus for the business return on managing big data should be based largely on the benefits of new analytics applied to big data.

On April 1, 2014, TDWI will publish my next big report on Evolving Data Warehouse Architectures in the Age of Big Data. At that time, anyone will be able to download the report for free from www.tdwi.org.

How are the new platforms (such as Hadoop) getting along with traditional platforms such as data warehouses?

We say “data warehouse” as if it’s a single monolith. That’s convenient, but not very accurate. From the beginning, data warehouses have been environments of multiple platforms. It’s common that the core warehouse, data marts, operational data stores, and data staging areas are each on their own standalone platforms. The number of platforms increased early this century, as data warehouse appliances and columnar RDBMSs arrived. It’s increasing again, as data warehouse environments fold in new data platforms in the form of the Hadoop Distributed File System (HDFS) and NoSQL databases. The warehouse has always evolved to address new technology requirements and business opportunities; it’s now evolving again to ensure that big data is managed appropriately for the new high-value analytic applications that many businesses need.

For an exhaustive discussion of this, see my 2013 TDWI report Integrating Hadoop into Business Intelligence and Data Warehousing.

Posted by Philip Russom, Ph.D. on January 22, 2014


Managing Big Data: An Overview in 30 Tweets

By Philip Russom
Research Director for Data Management, TDWI

To help you better understand new practices for managing big data and why you should care, I’d like to share with you the series of 30 tweets I recently issued on the topic. I think you’ll find the tweets interesting, because they provide an overview of big data management and its best practices in a form that’s compact, yet amazingly comprehensive.

Every tweet I wrote was a short sound bite or stat bite drawn from my recent TDWI report “Managing Big Data.” Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think all of you already know them or can figure them out. Even so, I deleted a few tiny URLs, hashtags, and repetitive phrases. I issued the tweets in groups on related topics, so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.



Types of Multi-Structured Data Managed as Big Data
1. #TDWI SURVEY SEZ: 26% of users manage #BigData that’s ONLY structured, usually relational.
2. #TDWI SURVEY SEZ: 31% manage #BigData that’s eclectic mix of struc, unstruc, semi, etc.
3. #TDWI SURVEY SEZ: 38% don’t have #BigData by any definition. Hear more in #TDWI Webinar Oct.8 noonET http://bit.ly/BDMweb
4. Structured (relational) data from traditional apps is most common form of #BigData.
5. #BigData can be industry specific, like unstruc’d text in insurance, healthcare & gov.
6. Machine data is special area of #BigData, with as yet untapped biz value & opportunity.

Reasons for Managing Big Data Well
7. Why manage #BigData? Keep pace w/growth, biz ROI, extend ent data arch, new apps.
8. Want to get biz value from #BigData? Manage #BigData for purposes of advanced #analytics.
9. #BigDataMgt yields larger samples for apps that need it: 360° views, risk, fraud, customer seg.
10. #TDWI SURVEY SEZ: 89% feel #BigDataMgt is opportunity. Mere 11% think it’s a problem.
11. Key benefits of #BigDataMgt are better #analytics, datasets, biz value, sales/marketing.
12. Barriers to #BigDataMgt: low maturity, weak biz support, new design paradigms.
13. #BigDataMgt non-issues: bulk load, query speed, scalability, network bandwidth.

Strategies for Users’ Big Data Management Solutions
14. #TDWI SURVEY SEZ: 10% have #BigDataMgt solution in production; 10% in dev; 20% prototype; 60% nada. #TDWI Webinar Oct.8 http://bit.ly/BDMweb
15. #TDWI SURVEY SEZ: Most common strategy for #BigDataMgt: extend existing DataMgt systems.
16. #TDWI SURVEY SEZ: 2nd most common strategy for #BigDataMgt: deploy new DataMgt systems for #BigData.
17. #TDWI SURVEY SEZ: 30% have no strategy for #BigDataMgt though they need one.
18. #TDWI SURVEY SEZ: 15% have no strategy for #BigDataMgt cuz they don’t need one.

Ownership and Use of Big Data Management Solutions
19. Some depts. & groups have own #BigDataMgt platforms, including #Hadoop. Beware teramart silos!
20. Trend: #BigDataMgt platforms supplied by IT as infrastructure. Imagine shared #Hadoop cluster.
21. Who does #BigDataMgt? analysts 22%; architects 21%; mgrs 21%; tech admin 13%; app dev 11%.

Tech Specs for Big Data Management Solutions
22. #TDWI SURVEY SEZ: 97% of orgs manage structured #BigData, followed by legacy, semi-struc, Web data etc.
23. Most #BigData stored on trad drives, but solid state drives & in-memory functions are gaining.
24. #TDWI SURVEY SEZ: 10-to-99 terabytes is the norm for #BigData today.
25. #TDWI SURVEY SEZ: 10% have broken the 1 petabyte #BigData barrier. Another 13% will within 3 years.

A Few Best Practices for Managing Big Data
26. For open-ended discovery-oriented #analytics, manage #BigData in original form wo/transformation.
27. Reporting and #analytics are different practices; managing #BigData for each is, too.
28. #BigData needs data standards, but different ones compared to other enterprise data.
29. Streaming #BigData is easy to capture & manage offline, but tough to process in #RealTime.
30. Non-SQL, non-relational platforms are coming on strong; BI/DW needs them for diverse #BigData.

Want to learn more about managing big data?

For a much more detailed discussion—in a traditional publication!—get the TDWI Best Practices Report, titled Managing Big Data, available in a PDF file via a free download.

You can also register for and replay my TDWI Webinar, where I present the findings of Managing Big Data.

Please consider taking courses at the TDWI World Conference in Boston, October 20–25, 2013. Enroll online.
============================
Philip Russom is the research director for data management at TDWI. You can reach him at [email protected] or follow him as @prussom on Twitter.

Posted by Philip Russom, Ph.D. on October 11, 2013


Analytics and Reporting Are Two Different Practices

Treat them differently, if you want to get the most out of each.

By Philip Russom, TDWI Research Director for Data Management

I regularly get somewhat off-base questions from users who are in the thick of implementing or growing their analytic programs, and therefore get a bit carried away. Here’s a question I’ve heard a lot recently: “Our analytic applications generate so many insights that I should decommission my enterprise reporting platform, right?” And here’s a related question: “Should we implement Hadoop to replace our data warehouse and/or reporting platform?”

The common misconception I perceive behind these questions (and what makes them “off-base”) is that people seem to be forgetting that analytics and reporting are two different practices. Analytics and reporting serve different user constituencies, produce different deliverables, prepare data differently, and support organizational goals differently. Despite a fair amount of overlap, I see analytics and reporting as complementary, which means you most likely need both and neither will replace the other. Furthermore, due to their differences, each has unique tool and data platform requirements that you need to satisfy if you’re to get the most out of each.

Allow me to net it out with a few sweeping generalizations.

Reporting is mostly about entities and facts you know well, represented by highly polished data that you know well. And that data usually takes the form of carefully modeled and cleansed data with rich metadata and master data that’s managed in a data warehouse. In fact, it’s difficult to separate reporting and data warehouses, because most users designed their DWs first and foremost as a repository for reporting and similar practices such as OLAP, performance management, dashboards, and operational BI.

I regularly hear claims that Hadoop can replace a true DW. But I doubt this, because the current state of Hadoop cannot satisfy the data requirements of enterprise reporting nearly as well as the average DW can. Ultimately, it’s not about the warehouse per se; it’s about the practices a DW supports well, such as reporting. I reserve the right to change my mind in the future, because Hadoop gets more sophisticated almost daily. My real point: most enterprise reporting depends on a DW for success, so keep and protect the DW.

Advanced analytics enables the discovery of new facts you didn’t know, based on the exploration and analysis of data that’s probably new to you. New data sources generally tell you new things, which is one reason organizations are analyzing big data more than ever before. Unlike the pristine data that reports operate on, advanced analytics works best with detailed source data in its original (even messy) form, using discovery-oriented technologies such as mining, statistics, predictive algorithms, and natural language processing. Sure, DWs can be expanded to support some forms of big data and advanced analytics. But the extreme volumes and diversity of big data are driving more and more users to locate big data on a platform besides a DW, such as Hadoop, DW appliances, or columnar databases.

I personally think that providing separate data platforms for reporting and analytics is a win-win data strategy. It frees up capacity on the DW, so it can continue growing and supporting enterprise reporting plus related practices. And it gives advanced analytics a data platform that’s more conducive to exploration and discovery than the average DW is.

Reporting is like a “high-volume business,” whereas analytics is like a “high-value business.” For example, with so-called enterprise business intelligence, thousands of concurrent report consumers access tens of thousands of reports that are refreshed nightly. By comparison, a small team of data analysts can transform an organization with a few high-value insights, such as new customer segments, visibility into costs, correlations between supplies and product quality, fraud detection, risk calculations, and so on. For completely different reasons, you need both reporting and analytics to serve the full range of user constituencies and provide many different levels of information and insight.

Most reports demand numeric precision, whereas most analyses don’t. Think financial reports (accurate to the penny) versus website page view reports (where guesstimates are fine).

Most enterprise reports require an audit trail, whereas few analyses do. Think regulatory reports versus the scores of an analytic model for customer churn.

Data management techniques differ. Squeaky clean report data demands elaborate data processing (for ETL, quality, metadata, master data, and so on), whereas preparing raw source data for analytics is simpler, though at higher levels of scale.

CONCLUSIONS: Despite some overlap, enterprise reporting and advanced analytics are so different as to be complementary. Hence, neither will replace the other. Both do important things for an information-driven organization, so you must give each what it needs for success, both at the tool level and at the data management level. Taking seriously the data requirements of big data analytics may lead you to implement Hadoop; but that doesn’t mean that Hadoop will replace a DW, which is still required to satisfy the data requirements of reporting and related practices, such as OLAP, performance management, dashboards, and operational BI.

Posted by Philip Russom, Ph.D. on September 26, 2013


Evolving Data Warehouse Architectures: Integrating HDFS with an RDBMS Alleviates the Limitations of Both

Hadoop has limitations. But the relational database management systems used for data warehousing do, too. Luckily, their strengths are complementary.

By Philip Russom, TDWI Research Director for Data Management

In a recent blog in this series, I discussed “The Roles of Hadoop” in evolving data warehouse architectures. (There’s a link to that blog at the end of this blog.) In response, a few people asked me (I’m paraphrasing): “Since the Hadoop Distributed File System (HDFS) is so useful, can it replace the relational database management system (RDBMS) that’s at the base of my current data warehouse and its architecture?”

The short answer is: “No.” The long answer is: “Not today, and probably not in the future.” The main reason is that Hadoop—in its current form—lacks (or is weak with) many of the functions that we depend on in our RDBMSs. As you’ll see in the list below, most of the RDBMS functions I have in mind enable feature-rich, high-performance access to stored data via SQL. Other functions concern tools for data security and administration.


Just so you know where this blog is going: Hadoop has limitations, but the average data warehouse does, too. Luckily, the strengths and weaknesses of the two are complementary (for the most part). When you integrate Hadoop and an RDBMS, they fill in each other’s holes and provide a more broadly capable data warehouse architecture than has been possible until now.

Hadoop’s Limitations Relative to RDBMSs Used for Data Warehousing
Despite all the goodness of Hadoop I described in a previous blog, there are areas within data warehouse architectures where HDFS isn’t such a good fit:

RDBMS functionality. HDFS is a distributed file system and therefore lacks capabilities we expect from relational database management systems (RDBMSs), such as indexing, random access to data, support for standard SQL, and query optimization. But that’s okay, because HDFS does things RDBMSs do not do as well, such as managing and processing massive volumes of file-based, unstructured data. For minimal DBMS functionality (though not fully relational), users can layer HBase over HDFS, as well as the query framework called Hive.
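
To show what minimal record-store functionality over HDFS looks like in practice, here’s a hedged sketch using happybase, a third-party Python client for HBase’s Thrift gateway. It assumes a running HBase cluster with its Thrift server on localhost and a ‘clicks’ table with column family ‘d’ created beforehand (e.g., in the HBase shell); those names are mine, for illustration only.

```python
import happybase  # third-party Python client for HBase's Thrift gateway

# Assumes an HBase Thrift server on localhost and a 'clicks' table with
# column family 'd', created beforehand (e.g., in the HBase shell).
conn = happybase.Connection('localhost')
table = conn.table('clicks')

# Random-access writes and reads by row key -- the record-store behavior
# that flat HDFS files alone do not provide.
table.put(b'user1|2014-01-22T09:00', {b'd:url': b'/home', b'd:status': b'200'})
print(table.row(b'user1|2014-01-22T09:00'))

# A range scan over a row-key prefix, also something raw HDFS files lack.
for key, data in table.scan(row_prefix=b'user1|'):
    print(key, data)
```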

Low-latency data access and queries. HDFS’s batch-oriented, serial-execution engine means that it’s not the best platform for real-time or speedy data access or queries. Furthermore, Hadoop lacks mature query optimization. Hence, the selective random access to data and iterative ad hoc queries that we take for granted with RDBMSs are alien to Hadoop.

An RDBMS integrated with Hadoop can provide needed query support. HBase is a possible solution, if all you need is a record store, not a full-blown DBMS. And upcoming improvements to Hadoop Hive and the new Impala query engine will address some of the latency issues.


Streaming data. HDFS and other Hadoop products can capture data from streaming sources (Web servers, sensors, machinery) and append it to files. But, being inherently batch, they are ill-equipped to process that data in real time. In my opinion at this date, such extremes of real-time analytics are best done with specialized tools for complex event processing (CEP) and/or operational intelligence (OI) from third-party vendors.
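
As a toy illustration of capture, triage, and route (real CEP and OI products do this at millisecond scale, with far richer rule engines and guaranteed delivery), consider this Python sketch; the sensor fields, threshold, and file names are all hypothetical.

```python
import time

ALERT_THRESHOLD = 90.0  # hypothetical limit for a temperature sensor

def triage(event, batch_file, alert_queue):
    """Route one reading: time-sensitive events go to an alert queue
    immediately; everything else is appended to a file for batch
    loading into the big data store later."""
    if event["temp_c"] >= ALERT_THRESHOLD:
        alert_queue.append(event)  # low-latency path: act on it now
    else:
        batch_file.write("%(sensor)s,%(temp_c).1f,%(ts)f\n" % event)

alerts = []
with open("staged_readings.csv", "a") as staged:
    for i in range(3):
        reading = {"sensor": "s42", "temp_c": 86.0 + i * 3, "ts": time.time()}
        triage(reading, staged, alerts)
print("alerts: %s" % alerts)
```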

Granular security. Hadoop today includes a few security features, such as file-permission checks, access control for job queues, and service-level authorization. Add-on products that provide encryption and LDAP integration are available for Hadoop from a few third-party vendors. Since HDFS is not a DBMS (and Hadoop data doesn’t necessarily come in relational structures), don’t expect granular security at the row or field level, as in an RDBMS.

Administrative tools. According to a TDWI survey, security is Hadoop users’ most pressing need, followed by a need for better administrative tools, especially for cluster deployment and maintenance. The good news is that a few vendors offer tools for Hadoop administration, and an upgrade of open-source Ambari is coming.

SQL-based analytics. With the above latency limitations in mind, HDFS is a problematic choice for workloads that are iterative and query based, as with SQL-based analytics. Furthermore, Hadoop products today have limited support for standard SQL. A number of vendor products (from RDBMSs to data integration and reporting tools) can provide SQL support today, and open-source Hadoop has new incubator projects that will eventually provide adequate support for SQL. These are critical if Hadoop is to become a productive part of a SQL-driven data warehouse architecture.

I’m not making up these limitations for Hadoop. The list is based on survey results and user interviews, as reported in my 2013 TDWI Best Practices Report: Integrating Hadoop into BI and Data Warehousing.

In Defense of Hadoop
My list of limitations might seem like “Hadoop bashing” to some readers, but that is not what I intend. So let me restate what I stated positively in the last blog: “HDFS and other Hadoop tools promise to extend and improve some areas within data warehouse architectures.”

Sure, Hadoop’s help is limited to “some areas.” But the fantastically fortuitous fact is that most of Hadoop’s strengths are in areas where most warehouses and BI technology stacks are weak, such as unstructured data, outrageously large data sets, non-SQL algorithmic analytics, and the flood of files that’s drowning many of us. Conversely, Hadoop’s limitations (as discussed above) are mostly met by mature functionality available today from a wide range of RDBMS types (OLTP databases, columnar databases, DW appliances, etc.), plus admin tools. In that light, I hope it’s clear that Hadoop and the average data warehouse are complementary (despite a bit of overlap), so it’s unlikely that one could replace the other, as I am often asked.


Integrating Hadoop with an RDBMS Alleviates the Limitations of Both
The trick, of course, is making HDFS and an RDBMS work together optimally. To that end, one of the critical success factors for assimilating Hadoop into evolving data warehouse architectures is the improvement of interfaces and interoperability between HDFS and RDBMSs. Luckily, this is well under way, due to efforts from software vendors and the open source community. And technical users are starting to leverage HDFS/RDBMS integration.

For example, an emerging best practice among DW professionals with Hadoop experience is to manage diverse big data in HDFS, but process it and move the results (via ETL or other data integration media) to RDBMSs (elsewhere in the DW architecture) that are more conducive to SQL-based analytics. HDFS serves as a massive data staging area. A similar best practice is to use an RDBMS as a front-end to HDFS data; this way, data is moved via queries (whether ad hoc or standardized), not via ETL jobs. HDFS serves as a large, diverse operational data store. For more information about these practices, replay my recent TDWI Webinar: "Ad Hoc Query Speed for Hadoop."
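
Here’s a minimal sketch of the first pattern, with Python’s bundled sqlite3 module standing in for the warehouse RDBMS and a local file standing in for results that a Hadoop job left in HDFS. In real deployments the hand-off is typically done by a data integration tool or a utility such as Sqoop; every path and table name below is illustrative.

```python
import csv
import sqlite3

# Stand-in for results a MapReduce or Hive job left in HDFS and that were
# pulled to the local file system (e.g., with 'hdfs dfs -get'). Here we
# fake the upstream job's output so the sketch is self-contained.
RESULTS_FILE = "daily_page_counts.csv"
with open(RESULTS_FILE, "w") as f:
    f.write("2014-01-21,/home,10432\n2014-01-21,/pricing,2211\n")

# Load the processed results into the warehouse-side RDBMS, where they are
# conducive to SQL-based analytics; sqlite3 stands in for the DW platform.
dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS page_counts (day TEXT, url TEXT, hits INT)")
with open(RESULTS_FILE) as f:
    dw.executemany("INSERT INTO page_counts VALUES (?, ?, ?)",
                   ((day, url, int(hits)) for day, url, hits in csv.reader(f)))
dw.commit()
print(dw.execute("SELECT url, SUM(hits) FROM page_counts GROUP BY url").fetchall())
```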

Other Blogs in the Evolving Data Warehouse Architectures series:
From EDW to DWE
The Role(s) of Hadoop

Posted by Philip Russom, Ph.D. on September 2, 2013


Evolving Data Warehouse Architectures: The Roles of Hadoop

HDFS and other Hadoop tools promise to extend and improve some areas within data warehouse architectures
By Philip Russom, TDWI Research Director for Data Management

In a TDWI survey I designed and ran in 2012, 88% of the users surveyed reported that the Hadoop ecosystem of products is a business opportunity (not a technology problem) because it enables new types of applications. When asked which types of applications benefit most from Hadoop, survey respondents chose (in priority order) big data analytics, advanced analytics (i.e., data mining, statistical analysis, and complex SQL), and discovery analytics. After these three analytic application types, respondents then chose two data management use cases for Hadoop, namely information exploration and complementing a data warehouse. Other data management uses seen in the survey include data archiving, transforming big data for analytics, and data staging.

If you pull together all the things I just mentioned, it’s quite a list of use cases in data warehousing for the Hadoop Distributed File System (HDFS) and other Hadoop tools (MapReduce, Hive, HBase, HCatalog, Impala, etc.). And many of these—if implemented in a multi-platform data warehouse environment (DWE)—would have a strong influence on the architecture of that data environment.

Promising Uses of Hadoop that Impact DW Architectures
I see a handful of areas in data warehouse architectures where HDFS and other Hadoop products have the potential to play positive roles:

Data staging. A lot of data processing occurs in a DW’s staging area, to prepare source data for specific uses (reporting, analytics, OLAP) and for loading into specific databases (DWs, marts, appliances). Much of this processing is done by homegrown or tool-based solutions for extract, transform, and load (ETL). Imagine staging and processing a wide variety of data on HDFS.

Users who prefer to hand-code most of their ETL solutions will most likely feel at home in code-intense environments like Apache MapReduce, and they may be able to refactor existing code to run there. Users who prefer to build their ETL solutions atop a vendor tool will find that the community of vendors for ETL and other data management tools is rolling out new interfaces and functions for the entire Hadoop product family.
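
To give a feel for the hand-coded route, a classic Hadoop Streaming job is just a pair of scripts that read stdin and write tab-delimited key/value pairs to stdout. This hypothetical Python pair counts page requests in an Apache-style Web log; you can dry-run it locally with: cat access.log | python mapper.py | sort | python reducer.py

```python
#!/usr/bin/env python
# mapper.py -- emit one (url, 1) pair per Web-log line.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:                # crude parse of Apache common log format
        print("%s\t1" % fields[6])     # fields[6] is the requested URL
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts per URL; Hadoop delivers input sorted by key.
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))
```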

Note that I’m assuming that, whether you use Hadoop or not, you should physically locate your data staging area(s) on standalone systems outside the core data warehouse, if you haven’t already. That way, you preserve the core DW’s capacity for what it does best: squeaky clean, well-modeled data (with an audit trail via metadata and master data) for standard reports, dashboards, performance management, and OLAP. In this scenario, the standalone data staging area(s) take over most of the management of big data, the archiving of source data, and much of the data processing for ETL, data quality, and so on.

Data archiving. When organizations embrace forms of advanced analytics that require detail source data, they amass large volumes of source data, which taxes areas of the DW architecture where source data is stored. Imagine managing detailed source data as an archive on HDFS.

You probably already do archiving within your data staging area, though you probably don’t call it archiving. If you think of it as an archive, maybe you’ll adopt the best practices of archiving, especially information life cycle management (ILM), which I feel is valuable but woefully absent from most DWs today. Archiving is yet another thing the staging area in a modern DW architecture must do, and thus another reason to offload the staging area from the core DW platform.

Traditionally, enterprises had three options when it came to archiving data: leave it within a relational database, move it to tape or optical disk, or delete it. Hadoop’s scalability and low cost enable organizations to keep far more data in a readily accessible online environment. An online archive can greatly expand applications in business intelligence, advanced analytics, data exploration, auditing, security, and risk management.

Multi-structured data. Relatively few organizations are getting BI value from semi-structured and unstructured data, despite years of wishing for it. Imagine HDFS as a special place within your DW environment for managing and processing semi-structured and unstructured data. Another way to put it is: imagine not stretching your RDBMS-based DW platform to handle data types that it’s not all that good with.

One of Hadoop’s strongest complements to a DW is its handling of semi-structured and unstructured data. But don’t go thinking that Hadoop is only for unstructured data; HDFS handles the full range of data, including structured forms, too. In fact, Hadoop can manage just about any data you can store in a file and copy into HDFS.
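
One practical consequence: loading Hadoop can be as simple as copying files. Here’s a sketch, assuming a configured Hadoop client on the PATH; the file and directory names are illustrative.

```python
import subprocess

# 'hdfs dfs -put' copies any local file into HDFS byte for byte, regardless
# of structure: logs, JSON feeds, scanned documents, relational extracts.
for local_file in ["clicks.log", "sensor_feed.json", "scanned_claims.pdf"]:
    subprocess.check_call(["hdfs", "dfs", "-put", local_file, "/data/landing/"])
```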

Processing flexibility. Given its ability to manage diverse multi-structured data, as I just described, Hadoop’s NoSQL approach is a natural framework for manipulating non-traditional data types. Note that these data types are often free of schema or metadata, which makes them challenging for SQL-based relational DBMSs. Hadoop supports a variety of programming languages (Java, R, C), thus providing more capabilities than SQL alone can offer.

In addition, Hadoop enables the growing practice of “late binding.” Instead of transforming data as it’s ingested (the way ETL for data warehousing imposes an a priori model on the data), structure is applied at runtime, when the data is read. This, in turn, enables the open-ended data exploration and discovery analytics that many users are looking for today.
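
Here’s a small Python sketch of the late-binding idea: records are stored exactly as ingested, and each analysis binds only the fields it needs at read time; the event layout is hypothetical.

```python
import json

# Raw events kept verbatim -- no a priori model imposed at ingest time.
raw_events = [
    '{"user": "u1", "action": "view", "page": "/home"}',
    '{"user": "u2", "action": "buy", "sku": "A-100", "amount": 19.99}',
]

def read_with_schema(lines, fields):
    """Late binding: project whichever fields this analysis needs at read
    time; a different exploration can bind a different schema to the same
    raw data, with no re-ingestion."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(f) for f in fields)

# One analysis binds (user, action); another binds (sku, amount).
print(list(read_with_schema(raw_events, ["user", "action"])))
print(list(read_with_schema(raw_events, ["sku", "amount"])))
```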

Advanced analytics. Imagine HDFS as a data stage, archive, or twenty-first-century operational data store that manages and processes big data for advanced forms of analytics, especially those based on MapReduce, data mining, statistical analysis, and natural language processing (NLP). There’s much to say about this; in a future blog I’ll drill into how advanced analytics is one of the strongest influences on data warehouse architectures today, whether Hadoop is in use or not.

Stay tuned, because I’ll soon post more blogs about evolving data warehouse architectures. In the meantime, please read the new TDWI Checklist “Where Hadoop Fits in Your Data Warehouse Architecture.”

Other blogs in the Evolving Data Warehouse Architectures series:
From EDW to DWE

Posted by Philip Russom, Ph.D. on August 4, 2013


Evolving Data Warehouse Architectures: From EDW to DWE

Many Enterprise Data Warehouses (EDWs) are evolving into multi-platform Data Warehouse Environments (DWEs)

By Philip Russom, TDWI Research Director for Data Management

Analytics, big data, real time, and unstructured data present new data warehouse (DW) workloads.

Workload-centric DW architecture. One way to measure a data warehouse’s architecture is to count the number of workloads it supports. According to the 2012 TDWI Survey on High-Performance Data Warehousing, a little over half of the user organizations surveyed (55%) support only the most common workloads, namely those for standard reports, performance management, and online analytic processing (OLAP). The rest (45%) also support workloads for advanced analytics, detailed source data, various forms of big data, and real-time data feeds.

The trend is toward the latter. In other words, the number and diversity of DW workloads is increasing, due to organizations embracing big data, multi-structured data, real-time or streaming data, and data processing for advanced analytics. The catch is that some data warehouses (whether defined as a vendor product or a user’s design) can handle multiple, concurrent workloads of various types, whereas others cannot.

The diversification of DW workloads leads to distributed architectures for DWs.

Distributed DW architecture. The issue in a multi-workload environment is whether a single-platform data warehouse can be designed and optimized such that all workloads run optimally, even when concurrent. More and more DW teams are concluding that a single-platform DW is no longer desirable. Instead, they maintain a core DW platform for traditional workloads (reports, performance management, and OLAP), but offload other workloads to other platforms. For these organizations, the DW is not going away; it’s just being complemented by additional data platforms tuned to workloads that can and should be offloaded from the core warehouse.

For example, data and processing for SQL-based analytics are regularly offloaded to DW appliances and columnar DBMSs. And a few teams offload workloads for big data and advanced analytics to HDFS, MapReduce, and other NoSQL platforms. The result is a strong trend toward distributed DW architectures, where many areas of the logical DW architecture are physically deployed on standalone platforms instead of the core DW platform.

A distributed DW architecture is both good and bad. It’s good if your fidelity to business requirements and DW performance lead you to deploy another data platform in your DW environment, and the new platform integrates well with others in the distributed architecture. But it’s bad when disconnected systems proliferate uncontrolled, like the errant data marts we all fear. So far, the newest generation of analytic databases and data management platforms are controlled by users far better than the marts of yore. But you still have to be diligent to avoid abuses.

Also, note that the architectural distinctions made here have always been a matter of degree, and will continue to be so. In other words, no architecture is 100% monolithic or 100% distributed. Many are hybrids, and the mix that’s right for you depends on many matters of business and technology. Many DW architectures have always been distributed to some degree; it’s just that the degree is more pronounced today.

The trend toward distributed DW architectures isn’t new. Not by a long shot. For decades, warehouses have been surrounded by a variety of “edge systems” that are deployed on standalone servers off to the side of the warehouse, but integrated with it. This has been true from the dawn of warehousing (as with data marts and operational data stores (ODSs)), expanded more recently (with DW appliances and columnar DBMSs), and now continues with new types of data platforms (namely NoSQL and Hadoop). Hence, even the new platforms fit comfortably into the well-established tradition of DW edge systems.

Rearrange the acronym from EDW to DWE, standing for “data warehouse environment,” meaning multi-platform DW.

From the single-platform EDW to the multi-platform DWE. A consequence of the workload-centric approach is a trend away from the single-platform monolith of the enterprise data warehouse (EDW) toward a physically distributed data warehouse environment (DWE). A modern DWE consists of multiple platform types, ranging from the traditional warehouse (and its satellite systems for marts and ODSs) to new platforms like DW appliances, columnar DBMSs, NoSQL databases, MapReduce tools, and HDFS. In other words, users’ portfolios of tools for BI/DW and related disciplines are diversifying aggressively.

The multi-platform approach adds more complexity to the DW environment, but BI/DW professionals have always managed complex technology stacks successfully. The upside is that users love the high performance and solid information outcomes that they get from workload-tuned platforms.

Note that a DWE can be a simple bucket of standalone silos, and that’s where many organizations are today. Ideally, the physically distinct systems of the DWE should be integrated with others, so they connect via an overall logical design. Integration within the DWE can take many forms, including shared dimensions, data sync, federation, data flows across DWE platforms, and so on. Unless the platforms of a DWE are integrated at appropriate levels, the DWE is just a bucket of silos, whereas it will be more efficient technically and more effective for business users if it has an architectural design that unifies it.
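
To show the difference between a bucket of silos and an integrated DWE in miniature, here’s a toy Python sketch in which two sqlite3 files stand in for two DWE platforms that share a conformed customer key, and a federation-style query joins across them without copying data; every name here is illustrative.

```python
import sqlite3

# Platform 1: the "core DW" holds the conformed customer dimension and sales facts.
core = sqlite3.connect("core_dw.db")
core.execute("CREATE TABLE IF NOT EXISTS dim_customer (cust_id INT, name TEXT)")
core.execute("CREATE TABLE IF NOT EXISTS fact_sales (cust_id INT, amount REAL)")
core.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "Acme"), (2, "Zenith")])
core.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 500.0), (2, 120.0)])
core.commit()

# Platform 2: an "analytics mart" scores churn, keyed by the same cust_id.
mart = sqlite3.connect("churn_mart.db")
mart.execute("CREATE TABLE IF NOT EXISTS churn_scores (cust_id INT, score REAL)")
mart.executemany("INSERT INTO churn_scores VALUES (?, ?)", [(1, 0.08), (2, 0.61)])
mart.commit()
mart.close()

# Federation-style access: attach the mart and join across "platforms" on
# the conformed key, instead of replicating data between them.
core.execute("ATTACH DATABASE 'churn_mart.db' AS mart")
for row in core.execute("""
    SELECT d.name, s.amount, m.score
    FROM dim_customer d
    JOIN fact_sales s USING (cust_id)
    JOIN mart.churn_scores m USING (cust_id)"""):
    print(row)
```

The shared cust_id is what turns two standalone stores into one logical design; without it, the same two files really would be just a bucket of silos.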

Stay tuned, because I’ll soon post more blogs about evolving data warehouse architectures. In the meantime, please attend an upcoming TDWI Webinar, in which I’ll address many of the issues mentioned here. Register online for the Webinar Big Data and Your Data Warehouse, to be broadcast September 5, 2013 at 9:00am ET.

Posted by Philip Russom, Ph.D. on July 26, 2013