
TDWI Blog

Data Warehousing Blog Posts

See the most recent Data Warehousing related items below.


Faster Analytics Processing with Open Source

By David Stodder, TDWI Director of Research for Business Intelligence

A tsunami of big data is hitting many organizations and the demand for faster, more frequent, and more varied analytics is riding the crest of that wave. Organizations want to apply predictive analytics, stream analytics, machine learning, and other forms of advanced analytics to their key decisions and operations. They are also experiencing the rise of self-service visual analytics, which is whetting the appetite of nontechnical users throughout organizations who want to do more with data than they can using standard business intelligence (BI) reports and spreadsheets.

Fortunately, technology trends are moving in a positive direction for organizations seeking to expand the business impact of analytics and send data exploration in new directions. Many of the most important innovations are occurring in the open source realm. In the decade since Hadoop and MapReduce were first developed, we have seen a flurry of initiatives, the best of which have become ongoing Apache Software Foundation projects. Today, with the Hadoop 2.0 ecosystem and YARN, organizations can more readily plug their choice of interactive SQL programs, advanced analytics, open source-based processing and execution engines, and other best-of-breed tools into something resembling a unified architecture.

TDWI has just published my new Checklist Report, “Seven Steps to Faster Analytics Processing with Open Source.” We also did a Webinar on this topic that featured discussion with representatives of the four sponsors of the checklist: Cloudera, DataTorrent, Platfora, and Talend. I invite you to check out these resources.

One of the key areas that I wrote about in the checklist—and that was also discussed in the Webinar—was open source stream processing and stream analytics. With interest growing in Internet of Things (IoT) data streams from sensors and other machines, many organizations need to develop a technology strategy for stream processing and stream analytics. The Apache Spark Streaming module, Apache Storm, and Apache Apex are aimed at processing streams of real-time data for analytics. These technologies can be integrated with Apache Kafka, a popular publish-and-subscribe messaging system that can serve data to streaming systems. In the coming year, I am sure we will see rapid evolution of open source technologies for gaining value from real-time data streams.
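To make that integration concrete, here is a minimal sketch (in Python, and not taken from the checklist report) of Spark Streaming consuming sensor readings from a Kafka topic; the topic name, broker address, and record layout are hypothetical.

```python
# Hypothetical sketch: count IoT sensor readings per device in 10-second
# micro-batches, reading from a Kafka topic via Spark Streaming.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="SensorStreamSketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Subscribe directly to a hypothetical "sensor-readings" topic on one broker
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["sensor-readings"],
    kafkaParams={"metadata.broker.list": "broker1:9092"})

# Each Kafka record arrives as a (key, value) pair; assume the value is a CSV
# string whose first field is a device ID, and count readings per device.
counts = (stream.map(lambda kv: kv[1].split(",")[0])
                .map(lambda device: (device, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

The same pattern applies to Storm or Apex; only the processing API changes, while Kafka remains the buffer that decouples data producers from the analytics.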

Other important topics that we discussed in the Webinar and I covered in the report are interactive SQL-based querying of Hadoop systems, and data integration and preparation. Good interactivity with Hadoop data, which includes the ability to send ad hoc SQL queries and receive responses in a reasonable time, is critical to analytics. However, until recently interactivity with Hadoop data was slow and difficult. New options involving SQL-on-Hadoop, Hive/Spark integration, and packaged MapReduce-based big data discovery are improving performance and making interactivity easier for users and developers. Data integration is also getting a push from Spark. Programs for data integration and preparation can use its in-memory data processing and generally better performance to quicken the pace of what are often the most time-consuming steps in BI and analytics.
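As an illustration of that last point, here is a minimal sketch of a Spark-based preparation step, written with the Spark 2.x DataFrame API; the file paths and column names are hypothetical.

```python
# Hypothetical sketch: prepare raw clickstream JSON on HDFS into a columnar
# extract for BI and analytics, using Spark's in-memory DataFrame processing.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PrepSketch").getOrCreate()

# Read raw events landed on HDFS as JSON (schema inferred by Spark)
events = spark.read.json("hdfs:///raw/clickstream/*.json")

# Light preparation in memory: standardize, filter, and aggregate
daily = (events
         .withColumn("event_date", F.to_date("event_ts"))
         .filter(F.col("user_id").isNotNull())
         .groupBy("event_date", "user_id")
         .agg(F.count("*").alias("events")))

# Write a Parquet extract that downstream BI and analytics tools can query
daily.write.mode("overwrite").parquet("hdfs:///prepared/clickstream_daily")
```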

I expect an active year ahead in open source-based technologies for BI and analytics and will be observing them closely in my 2016 research and analysis.

 


Hyperlinks embedded in this blog:

Cloudera: http://www.cloudera.com

DataTorrent: https://www.datatorrent.com/

Platfora:  http://www.platfora.com/

Talend: http://www.talend.com/

Apache Spark Streaming: https://spark.apache.org/streaming/

Apache Storm: https://storm.apache.org/

Apache Kafka: http://kafka.apache.org/

Apache Apex: http://apex.incubator.apache.org/

Posted by David Stodder on December 21, 2015


Igniting the Analytic Spark

An Introduction to Apache Spark and its uses in Business Intelligence (BI), Data Warehousing (DW), and Advanced Analytics

Blog by Philip Russom
Research Director for Data Management, TDWI

At TDWI, we’re hearing a lot of interest in Apache Spark, although it’s still new and most users are unfamiliar with it. So, please allow me to define Spark for you, explain its potential benefits, and describe actual use cases.

Apache Spark is a parallel processing engine. It specializes in big data, and works well with Hadoop environments. However, Spark is not just for Hadoop; it provides parallel processing for other environments, too. Spark is known for high speed and low latency, which it achieves by leveraging in-memory computing and cyclic data flows.

Spark is fast. Very fast. Benchmarks show Spark to be up to one hundred times faster than Hadoop MapReduce with in-memory operations and up to ten times faster than MapReduce with disk-bound operations. The point is that Spark has the low latency required of new data-driven practices, like data exploration, discovery, streaming analytics, and SQL-based analytics.

Spark functions apply directly to applications in BI, DW, DI, & analytics. Spark today includes four libraries of functionality, and each is of interest to professionals in BI, DW, and analytics. The libraries support ANSI-standard SQL, streaming data, machine learning, and graph analytics.

A Spark library provides native support for ANSI and ISO standard SQL. In a recent TDWI survey, 69% of users surveyed said that ANSI- and ISO-standard SQL on Hadoop is required for broad enterprise use. That’s because a modern enterprise wants to leverage pre-existing SQL skills and SQL-based tools. Furthermore, users want fast queries on Hadoop, to enable data exploration, analytics, and other interactive, data-driven practices. Spark and its SQL support promise to enable these – in both batch and interactive sessions, for Hadoop and other environments – which in turn will spark big data analytics for users in BI, DW, and analytics.

Spark offers broad compatibility. Spark SQL reuses the Hive front-end and metastore to provide compatibility with existing Hive data, queries, and UDFs. Spark SQL’s server mode extends interoperability via industry-standard ODBC/JDBC. Spark can process data in S3, HDFS, HBase, Hive, Cassandra, and any Hadoop InputFormat.
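A minimal sketch of that compatibility, assuming a Spark 2.x deployment where enableHiveSupport() points Spark SQL at an existing Hive metastore (Spark 1.x used HiveContext for the same purpose); the database, table, and column names are hypothetical.

```python
# Hypothetical sketch: run standard SQL against a pre-existing Hive table
# through Spark SQL, reusing the Hive metastore rather than redefining data.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HiveCompatSketch")
         .enableHiveSupport()   # reuse the Hive metastore and existing tables
         .getOrCreate())

top_customers = spark.sql("""
    SELECT customer_id, SUM(order_total) AS revenue
    FROM sales.orders
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 10
""")
top_customers.show()
```

The same tables remain reachable from ODBC/JDBC tools through Spark SQL’s Thrift server, which is what the server mode mentioned above provides.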

Spark can be deployed many ways. Spark requires some kind of shared file system (NFS compliant), so its deployment options are diverse. Spark runs on its standalone cluster, Hadoop YARN, Apache Mesos, and Amazon EC2, whether on premises or in the cloud. A single job, query, or stream-processing task can be run in either batch or interactive mode via the Scala, Python, and R shells.

Spark has one console for the seamless development of diverse functionality. Apache Spark includes libraries for four high-level applications: SQL, streaming data, machine learning, and graph analytics. These are integrated tightly, so users can create applications that mix SQL queries and stream processing alongside complex analytic algorithms.
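A minimal sketch of that tight integration, mixing the SQL and machine learning libraries in one short application; the table, columns, and parameter values are hypothetical.

```python
# Hypothetical sketch: a SQL query selects behavioral attributes, and MLlib
# (the Spark 2.x ML API) clusters customers into segments from the same data.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = (SparkSession.builder
         .appName("SqlPlusMlSketch")
         .enableHiveSupport()
         .getOrCreate())

# Step 1: SQL selects the attributes to cluster on
profiles = spark.sql(
    "SELECT customer_id, visits, avg_basket, days_since_purchase FROM crm.profiles")

# Step 2: MLlib assembles a feature vector and segments the customer base
features = VectorAssembler(
    inputCols=["visits", "avg_basket", "days_since_purchase"],
    outputCol="features").transform(profiles)
model = KMeans(k=5, seed=42).fit(features)
segments = model.transform(features)   # adds a 'prediction' (segment) column
segments.groupBy("prediction").count().show()
```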

Spark and its libraries enable several application types for BI, DW, and analytics:
  • SQL analytics and related set-based applications – e.g., data exploration and discovery, customer-base segmentation, financial analyses, dimensional modeling and analysis, reporting, ETL pushdown that requires SQL
  • Stream capture and analysis -- monitoring facilities (utilities, factories), tracking social sentiment, predictive machine maintenance, rerouting vehicle traffic, managing mobile assets, any time-sensitive process
  • Graph analytics -- anomaly detection for fraud or risk, behavioral analysis, entity clustering, patient outcome optimization
  • Mixtures of the above – a trend among users is to mix multiple analytic methods in a single application, because each reveals different insights

Want to learn more about Spark? Click here to replay my recent TDWI Webinar, where I go into more detail about Spark and its uses in BI, DW, and analytics.

Posted by Philip Russom, Ph.D. on December 7, 2015


Emerging Technologies and Methods: An Overview in 25 Tweets

Blog by Philip Russom
Research Director for Data Management, TDWI

To help you better understand what today’s emerging technologies and methods (ETMs) are – especially those related to business intelligence, analytics, and data warehousing – I’d like to share with you the series of 25 tweets I recently issued on the topic. I think you’ll find the tweets interesting, because they provide an overview of ETMs in a form that’s compact, yet amazingly comprehensive.

Each tweet below is a short sound bite or stat bite drawn from the recent TDWI report “Emerging Technologies for Business Intelligence, Analytics, and Data Warehousing,” which I researched and wrote with my colleagues David Stodder and Fern Halper. Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think that all of you already know them or can figure them out. Even so, I deleted a few tiny URLs, hashtags, and repetitive phrases. I issued the tweets in groups, on related topics; so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.

Examples of Emerging Technologies and Methods (ETMs)

1. Most Emerging Techs & Methods (#ETMs) fall into 3 layers of BI, #analytics & #EDW tech stack.

2. #ETMs for #BI include #DataViz, #DataExploration, #DataPrep, #Dashboards, #MashUps, #MobileBI.

3. #ETMs for #Analytics operate on data from #SocialMedia, #IoT, streams, #MachineData.

4. #ETMs for #DataMgt include #Hadoop, #ApacheSpark, #NoSQL, in-DBMS #analytics, in-mem DBMS, columnar.

Examples of Emerging Methods and Platforms

5. #EmergingMethods include agile & lean dev methods applied to whole BI/DW/#analytics tech stack.

6. Other #EmergingMethods include #CompetencyCenters, #CollaborativeBI, #StoryTelling, #DataGovernance.

7. Emerging platforms include many types of clouds, #SaaS, #OpenSource, appliances...

The Importance of ETMs

8. #TDWI SURVEY SEZ: Emerging Techs & Methods (#ETMs) are very important (53%) or somewhat (39%).

9. #TDWI SURVEY SEZ: Emerging Techs & Methods (#ETMs) are opportunity to compete, evolve, perform (79%).

10. #TDWI SURVEY SEZ: Two-thirds of respondents (64%) already have #ETMs in production.

Benefits and Barriers for ETMs

11. #TDWI SURVEY SEZ: Top benefits of #ETMs = competitiveness, decision support, biz performance, innovation.

12. #TDWI SURVEY SEZ: Top barriers to #ETMs = LACK of skills, budgets, biz value, innovation.

13. #TDWI SURVEY SEZ: Other barriers to #ETMs = poor state of IT infrastructure & poor #DataGovernance.

User Satisfaction with Current State of ETMs

14. #TDWI SURVEY SEZ: 41% dissatisfied with their enterprise adoption of Emerging Techs & Methods (#ETMs).

15. Adoption of agile development methods is one of strongest trends in BI, #analytics, #EDW today.

16. #TDWI SURVEY SEZ: 55% dissatisfied with time required of development for BI, #analytics, #DataMgt.

User Success with Current State of ETMs

17. #TDWI SURVEY SEZ: Users successful with #ETMs for #SelfServiceBI (54%) & #DataPrep (50%).

18. #Hadoop & #NoSQL #ETMs are challenging for tools & apps built for relational data.

Emerging Data Types for Analytics

19. #TDWI SURVEY SEZ: 84% analyze structured data today. Surprising that 16% are not; maybe text analytics?

20. #TDWI SURVEY SEZ: #IoT data used by <20% of respondents today, but 40% more will use within 3 years.

21. Other data sources poised for growth = Machine data (sensors, devices) & #RealTime #EventStreaming.

22. #TDWI SURVEY SEZ: In clouds, users already do #EDW (35%), #Analytics (31%), sandbox (29%), DataInt (24%).

23. #TDWI SURVEY SEZ: 49% have production #PredictiveAnalytics today; another 39% will in 3 yrs.

ETMs for Data Warehousing & Data Management

24. #TDWI SURVEY SEZ: 3-yr hi growth in #DataMgt #ETMs = #RealTime, streams, #DataPrep, #Hadoop, #CloudDW.

25. Top security #ETMs for #DataMgt = #DataProtection (encrypt, mask, token), not just user name/pswd.

Want to learn more about Emerging Technologies and Methods (ETMs)?

For a more detailed discussion – in a traditional publication! – get the TDWI Best Practices Report, titled “Emerging Technologies for Business Intelligence, Analytics, and Data Warehousing,” which is available in a PDF file via a free download.

You can also register for and replay the TDWI Webinar, where David Stodder, Fern Halper, and I discuss the findings of the TDWI report.

Posted by Philip Russom, Ph.D. on November 9, 2015


Trip Report: What I Learned at Informatica World 2015

Inspirational User Case Studies and Educational Product Demonstrations

By Philip Russom, TDWI Research Director for Data Management

When I attend a user group meeting or a vendor’s conference, my top two priorities are (1) to hear case studies from successful users and (2) to see practical demonstrations of the vendor’s products. I got both of those in spades last week, when I spent three days attending Informatica World 2015 in Las Vegas.

It was a huge conference, with about 2,500 people attending and five or more tracks running simultaneously. I couldn’t attend all these sessions, so I decided to focus on the keynotes and the Data Integration Track. To give you a taste of the conference, allow me to share highlights from what I was able to attend, with a stress on case studies and demos.

User Case Studies

An enterprise architect at MasterCard discussed their implementation of an enterprise data hub. The hub gives data analysts the data they need in a timely fashion, provides self-service data access for a variety of users, and serves as a unified platform for both internal and external data exchange.

Tom Tshontikidis explained why and how Kaiser Permanente migrated its large collection of data integration solutions from a legacy product (heavily extended via hand coding) to PowerCenter and other Informatica tools.

Two representatives from Cleveland Clinic spoke of their journey from quantity-based metrics for performance management (which mostly laid blame on employees for missed targets) to quality-based predictive analytics (which now sets realistic goals for helping their patients).

Dr. John Frenzel is the chief medical information officer at the MD Anderson Cancer Center. At Informatica World, he discussed how big data analytics is accelerating clinical research. Among the many great tips he shared, Frenzel described how data scientists at MD Anderson work like consultants, traveling among multiple teams, to share their expertise.

An IT systems architect at a major telecommunications company told the story of how the company needed to simplify operations so it could transform into a better integrated – and hence more nimble – global organization. In support of those business goals, IT replaced hundreds of systems, mostly with six primary ones. This gargantuan consolidation project was mostly powered by Informatica tools.

Tom Kato of Mototak Consulting spoke in a few sessions. In one, he described how to manage data from cradle to grave, using best practices and leading tools for Information Lifecycle Management (ILM). In another, he explained his use of the Informatica Data Validation Option (DVO) in an early phase of the merger between American Airlines and US Airways.

John Racer from Discount Tire explained why validating data is important to assuring that data arrives where it’s supposed to be and in the condition intended. He discussed practical applications in cross-platform data flows, application migrations, and data migrations, involving tools from Informatica and other providers.

Product Demonstrations

Some of the coolest demos were presented by users. For example, I saw a management dashboard built by folks at a major energy company, using a visualization tool and data from PowerCenter. The dashboard enables business users to do pipeline capacity management and related operational tasks, many with near-real-time data.

The Informatica Data Validation Option (DVO) kept coming up in presentations by both Informatica employees and customers. I was glad to see this, because I’ve long felt that data integration users do not validate data as often as they should. For example, validation should be part of most ETL testing and all data migration projects.

For a variety of reasons, I was glad to see Secure@Source demo’d. The demo clarified that this is not a security tool, per se, although it can guide your security and other efforts. Instead, Secure@Source provides analytics for assessing data-oriented risks relevant to security, privacy, compliance, governance, and so on. Essentially, you create policies and other business rules (typically inspired by your compliance and governance policies), and Secure@Source helps you identify risks and quantify compliance.

Informatica’s Krupa Natarajan spent most of a session demonstrating Informatica Cloud. This product has been in production since 2006, so there’s a lot of robust functionality to look at. Long story short, Informatica Cloud comes across as a full-featured integration tool, not some after-thought hastily ported to a cloud (as too many cloud-based products are). Although Krupa didn’t say it explicitly, the demo brought home to me the point that data integration with a cloud-based tool is pretty much the same as with traditional tools. That good news should help users get more comfortable with clouds in general, as well as the potential use of cloud-based data management tools.

Further Learning

If you go to www.YouTube.com and search for “Informatica World 2015,” you’ll find many useful speeches and sessions that you can replay. Here are a couple of links to get you started:

Keynote by Informatica’s CEO, Sohaib Abbasi. This is a “must see,” if you care about Informatica’s vision for the future, especially in the context of the proposed acquisition of Informatica.

Interviews filmed on site by theCUBE. All the interviews are good. But I especially like the interviews with my analyst friends: John Myers and Mark Smith.

Posted by Philip Russom, Ph.D. on May 18, 2015


5 Analytics Resources You Don't Want to Miss!

Take a look at 5 new resources that can help you evolve your analytics strategies beyond spreadsheets and dashboards. Create more value from your data when you move beyond simple business intelligence (BI) reporting to data discovery and advanced analytics. Use these recently released resources to develop your competitive advantage.

1. Ten Mistakes to Avoid When Democratizing BI and Analytics
Premium member resource—freely available until May 29
Download Now
 
2. Seven Steps for Executing a Successful Data Science Strategy
Download Now
 
3. TDWI Analytics Maturity Model Assessment & Guide
Download Now
 
4. TDWI Infographic: Hadoop for the Enterprise
Download Now
 
5. Upcoming Live Event: Special Offer Below
TDWI Boston 2015 | The Analytics Experience
July 26-31, 2015
Six action-packed days filled with classes, case studies, and hands-on training (WebAction, Tableau, Luminoso, Yellowfin, Archipelago, Data Mining with R, Hadoop, and more) offer an accelerated learning experience for business and technical leaders and implementers.
Learn More
 
REGISTER NOW & SAVE BIG
 

The Analytics Experience | Boston 2015
July 26–31, 2015

Sign up now for the SUPER EARLY registration discount
20% off until May 29—Save up to $855!
Use priority code SEB20
Learn More

Posted by TDWI on May 15, 2015


Hadoop for the Enterprise: An Overview in 25 Tweets

By Philip Russom, Research Director for Data Management, TDWI

To help you better understand Hadoop’s evolution into mainstream enterprise usage—and why you should care—I’d like to share with you the series of 25 tweets I recently issued on the topic. I think you’ll find the tweets interesting, because they provide an overview of enterprise Hadoop and its best practices in a form that’s compact, yet amazingly comprehensive.

Every tweet I wrote was a short sound bite or stat bite drawn from my recent TDWI report Hadoop for the Enterprise. Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think that all of you already know them or can figure them out. Even so, I deleted a few tiny URLs, hashtags, and repetitive phrases. I issued the tweets in groups, on related topics; so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.

Introduction to Hadoop for the Enterprise
1. #Hadoop is expanding into more industries, use cases & enterprise breadth. More in #TDWI Webinar Apr. 14 Noon ET http://bit.ly/1F9d2iy
2. #Hadoop for the Enterprise tech drivers: scalability, low cost, & many data types.
3. #Hadoop for the Enterprise biz drivers: #analytics, data exploration, value from #BigData.

Hadoop Adoption is Up
4. #TDWI SURVEY SEZ: #Hadoop adoption accelerating. Production clusters up 60% in 2 yrs.
5. #TDWI SURVEY SEZ: Half of respondents have #Hadoop clusters in development, coming online in 12 months.
6. #TDWI SURVEY SEZ: 60% of users surveyed will have #Hadoop in production by 2016.

Benefits and Barriers
7. #TDWI SURVEY SEZ: 89% surveyed say #Hadoop is opportunity for biz/tech #innovation.
8. #TDWI SURVEY SEZ: #Hadoop’s benefits: improve #analytics, #EDW, scalability, exotic data.
9. #TDWI SURVEY SEZ: #Hadoop’s barriers: weak skills, biz case, security, open source tools.

Organizational Issues with Enterprise Hadoop
10. As #Hadoop goes enterprise scope, ownership, staffing, dev methods & economics shift.
11. #Hadoop clusters are becoming central, shared IT infrastructure in mainstream firms.
12. #TDWI SURVEY SEZ: Common #Hadoop job titles are: #DataScientist, architect, analyst, developer.
13. #TDWI SURVEY SEZ: Firms train employees in #Hadoop cuz they can’t find or afford folks to hire.

The Many Use Cases for Enterprise Hadoop
14. #TDWI SURVEY SEZ: Leading future #Hadoop uses: ent data hubs, archives, misc BI/DW.
15. #TDWI SURVEY SEZ: Half of respondents will add #DataQuality & #MDM for #Hadoop data.
16. #TDWI SURVEY SEZ: Established #Hadoop practice extends a #DataWarehouse (46%).
17. #TDWI SURVEY SEZ: Data lakes (36%) & enterprise data hubs (28%) are new practices for #Hadoop.
18. #TDWI SURVEY SEZ: Archiving on #Hadoop is upcoming for new (36%) & old (19%) data.
19. #TDWI SURVEY SEZ: #Hadoop for content mgt (17%) & operational ent apps (11%) are new.

Hadoop’s Roles in Enterprise Data Strategies and Architectures
20. #TDWI SURVEY SEZ: 66% feel #Hadoop is important to their enterprise data strategy.
21. #TDWI SURVEY SEZ: #Hadoop is becoming key to multi-platform #DataWarehouse environments (DWEs).
22. #TDWI SURVEY SEZ: a third of #Hadoop clusters are off premises, on cloud, SaaS, managed provider. Surprising!

Hadoop Development Details
23. #Hadoop cluster size scales down to dept use (8 nodes) or up to enterprise (1000 nodes).
24. #TDWI SURVEY SEZ: #Hadoop clusters per enterprise = 10 on average, with median at 4.
25. #TDWI SURVEY SEZ: 58% of #Hadoop dev done w/mix of hand-coding & hi-level tools. 23% coded only.

Want to learn more about Hadoop for the Enterprise?

For a more detailed discussion—in a traditional publication!—get the TDWI Best Practices Report Hadoop for the Enterprise, which is available in a PDF via a free download.

You can also register for and replay my TDWI Webinar, where I present the findings of Hadoop for the Enterprise.

Posted by Philip Russom, Ph.D. on April 27, 2015


Q&A RE: Hadoop for the Enterprise

Attendees of a recent TDWI Webinar asked excellent questions.

By Philip Russom, TDWI Research Director for Data Management

Recently, on April 14, I broadcast a TDWI Webinar in which I presented some of the findings from my new TDWI report on "Hadoop for the Enterprise." You can download a free copy of the report in a PDF, and you can replay the Webinar. With each link, you may need to scroll down to find what you want. If you’re new to Hadoop, you may wish to first read the 2013 TDWI Best Practices Report Integrating Hadoop into Business Intelligence and Data Warehousing.

Attendees of the Webinar posed several very good questions about various issues around Hadoop. Please allow me to share a few attendee questions and the answers I sent them via e-mail:

What is a Hadoop cluster? And why would an organization need more than one?

The Wikipedia article on “Computer Cluster” is a good general description of all clustered server pools. The article doesn’t mention Hadoop, but Hadoop’s clustering strategy is in line with the article, except that Hadoop can run on heterogeneous servers, whereas the article recommends that all servers be identical. The point of any cluster is to get scalable, high-performance computational power at a relatively low cost because of commodity-priced hardware.

An organization may need more than one Hadoop cluster, due to departmental funding and sponsorship (which is common with analytic applications) or other organizational dynamics. As I pointed out in the Webinar, as users decide on a strategy for Hadoop on an enterprise scale, they tend to abandon the departmental focus in favor of central IT providing Hadoop as a shared enterprise asset (as IT often does with corporate networks, racks of servers, and storage subsystems).

You don't need big data to take advantage of Hadoop?

That’s correct. I’ve found many user organizations with a small Hadoop implementation (8 nodes seems common) used as the data layer under a departmental analytic application or analytics sandbox of some sort. Hadoop makes sense when the department has exotic data (perhaps in lots of files), which Hadoop excels with. Use cases include sentiment analytics with schema-free human language text or supplier analytics with multi-structured XML or JSON files.
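One way (among several) to explore such a folder of multi-structured JSON files on a small cluster is to let Spark infer a schema over them; the sketch below is hypothetical in its paths and field names, and a Hive external table over the same files would be another reasonable approach.

```python
# Hypothetical sketch: exploratory analysis of many small supplier JSON files
# sitting in HDFS on a modest (e.g., 8-node) Hadoop cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SupplierFilesSketch").getOrCreate()

# Spark infers one schema across thousands of small JSON files
suppliers = spark.read.json("hdfs:///landing/supplier_feeds/*.json")

# A simple exploratory aggregate over the inferred structure
(suppliers
 .filter(F.col("status") == "late")
 .groupBy("supplier_id", "region")
 .count()
 .orderBy(F.desc("count"))
 .show(20))
```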

Note that, in the examples, the data volumes are modest, but it’s still “big data” in the sense that it’s not the usual structured and relational data. For many users dealing with big data (whether on Hadoop or elsewhere), the value proposition is that big data is new and different, and therefore offers new insights and more complete views of customers. Even when big data is truly big (tens of terabytes or more), users don’t have much trouble managing it; hence, big data is not a scalability crisis, as some people have claimed.

Hadoop has a well-deserved reputation for scaling up linearly. But these examples show that Hadoop also scales down successfully.

Do companies transfer master data into Hadoop to support analytics in a real-time or batch data replication process?

Yes, but that’s still rather rare today. In fact, only 10% of survey respondents who have Hadoop in production today are doing master data management (MDM) on Hadoop. But 45% anticipate doing so within three years. Data quality is in a similar position, with 11% doing it today versus 55% in the future. Personally, I’ve seen it take a while to ramp up all the data management best practices when a new data platform appears. That seems to be the case with Hadoop. But the proliferation of Hadoop into more of the enterprise is driving up requirements for data management best practices, too.

Let’s now focus on your question. Modern MDM architectures typically support a mix of operational and analytic purposes; they do the same on Hadoop. 

Today, Hadoop is strong on volume but weak on real-time operation, so MDM (and other data management operations) on Hadoop are usually batch oriented. Given strong Hadoop-related projects like Storm and Spark, real-time data operations should become more feasible soon.

Can we get a use case for Hadoop and MDM?

As I mentioned in the Webinar, MDM on Hadoop is pretty rare today, but survey results show it will soon be far more common, along with similar practices like data quality.

There are many ways to architect an MDM solution, but many are built atop or around some kind of hub, which includes a database or operational data store (ODS) plus appropriate interfaces in and out of the hub. At TDWI, we’ve seen a number of organizations start migrating subsets of enterprise data to Hadoop, and simply modeled databases and ODSs seem to migrate to Hadoop successfully. The straightforward tabular structures of these (unlike complex warehouse dimensions) usually fit well with Hive tables or HBase in the Hadoop environment. With the so-called enterprise data hub on Hadoop gaining in popularity, we should expect to see more migrations like this in coming years.

A lot of MDM master databases (or systems of record) have very wide records, because they’re also used to compile the “complete view” of customers and other enterprise entities. I’ve heard conflicting opinions from Hadoop users; some think Hive tables are best for wide records, while others swear HBase is best. I hear similar debates involving query mechanisms, including HiveQL, Pig, Drill, and Impala. If you contemplate similar tasks, I recommend you take a known ODS to Hadoop and test on both Hive and HBase, with a variety of query approaches.
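Here is a minimal sketch of the Hive side of such a test, using Spark to land an ODS extract as a Hive table and time a representative lookup; the HBase side of the comparison would be set up separately, and every name and path below is hypothetical.

```python
# Hypothetical sketch: land a wide-record customer master extract as a Hive
# table, then time a complete-view lookup shaped like the MDM workload.
import time
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("OdsOnHiveTest")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS mdm_test")

# Load the exported ODS records into a managed Hive table
ods = spark.read.parquet("hdfs:///staging/customer_master/")
ods.write.mode("overwrite").saveAsTable("mdm_test.customer_master")

# Time a representative query; repeat with HiveQL, Impala, etc., and with the
# same records loaded into HBase, to compare the approaches on your own data.
start = time.time()
spark.sql("""
    SELECT * FROM mdm_test.customer_master
    WHERE customer_id IN ('C100', 'C200', 'C300')
""").collect()
print("elapsed seconds:", time.time() - start)
```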

Can HBase replace a classic data warehouse, and can it compete from a performance side?

If you have a “classic” data warehouse, then I’ll assume it is designed for dimensional models, optimized for complex queries, and supported by a rich metadata layer with auditing capabilities. HBase today is not particularly good with any of those, so it makes an unlikely replacement.

Even so, some pieces of the warehouse environment do well on HBase. For example, many warehouses include a number of operational data stores (ODSs). These may be physically managed in the warehouse’s core database instance, or they may be running on standalone hardware servers and database instances. Either way, I’ve interviewed users who’ve migrated these pieces to HBase—or Hive or both. They say it’s an easy migration, tweaking on the new platform is minimal, and performance is fine, as long as batch processing is all you need. Furthermore, moving these pieces to Hadoop frees up capacity on the warehouse, so it can grow into more data and use cases that truly must reside in the core warehouse platform. Or, if the migrated ODSs were on standalone platforms, then Hadoop seems to work as a consolidation strategy.

There has been less talk [about] making Hadoop transaction oriented, i.e., ACID compliant. Is there any trend or survey outcome?

To be honest, I haven’t looked into transaction processing on Hadoop, although I’ve heard that some people in both open source and vendor communities are working on it.

Why would I be so remiss? Because the leading use cases I see today don’t require transaction processing and hence the four ACID properties. That includes extensions of data warehousing and data integration, plus a wide range of analytics. Upcoming use cases—data archiving and content management—don’t involve transaction processing either. Furthermore, if you want open source software, the other NoSQL database management systems are strong on transaction processing (as are older open source databases), so you may wish to look into those.

I’m sorry to cop out on you with a non-answer. But at least you can see that transaction processing on Hadoop is a low priority for those of us excited about doing data warehouse, data integration, reporting, and analytics on Hadoop.

Posted by Philip Russom, Ph.D. on April 15, 2015


Successful Application and Data Migrations and Consolidations

Minimizing Risk with the Best Practices for Data Management
By Philip Russom, TDWI Research Director for Data Management

I recently broadcast a really interesting Webinar with Rob Myers – a technical delivery manager at Informatica – talking about the many critical success factors in projects that migrate or consolidate applications and data. Long story short, we concluded that the many risks and problems associated with migrations and consolidations can be minimized or avoided by following best practices in data management and other IT disciplines. Please allow me to share some of the points Rob and I discussed:

There are many business and technology reasons for migrating and consolidating applications and data.
  • Mergers and Acquisitions (M&As) – Two firms involved in an M&A don’t just merge companies; they also merge applications and data, since these are required for operating the modern business in a unified manner. For example, cross-selling between the customer bases of the two firms is a common business goal in a merger, and this is best done with merged and consolidated customer data.
  • Reorganizations (reorgs) – Some reorgs restructure departments and business units, which in turn can require the restructuring of applications and data. 
  • Redundant Applications – For example, many firms have multiple applications for customer relationship management (CRM) and sales force automation (SFA), as the result of M&As or departmental IT budgets. These are common targets for migration and consolidation, because they work against valuable business goals, such as the single view of the customer and multi-channel customer marketing. In these cases, it’s best to migrate required data, archive the rest of the data, and retire legacy or redundant applications.
  • Technology Modernization – These range from upgrades of packaged applications and database management systems to replacing old platforms with new ones.
  • All the above, repeatedly – In other words, data or app migrations and consolidations are not one-off projects. New projects pop up regularly, so users are better off in the long run, if they staff, tool, and develop these projects with the future in mind.
Migration and consolidation projects affect more than applications and data:
  • Business Processes – The purpose of enterprise software is to automate business processes, to give the organization greater efficiency, speed, accuracy, customer service, and so on. Hence, migrating software is tantamount to migrating business processes, and a successful project executes without disrupting business processes.
  • Users of applications and data – These vary from individual people to whole departments and sometimes beyond the enterprise to customers and partners. A successful project defines steps for switching over users without disrupting their work.
Application or data migrations and consolidations are inherently risky. This is due to their large size and complexity, numerous processes and people affected, cost of the technology, and (even greater) the cost of failing to serve the business on time and on budget. If you succeed, you’re a hero or heroine. If you fail, the ramifications are dire for you personally and the organization you work for.

Succeed with app/data migrations and consolidations. Success comes from combining the best practices of data management, solution development, and project management. Here are some of the critical success factors Rob and I discussed in the Webinar:
  • Go into the project with your eyes wide open – Realize there’s no simple “forklift” of data, logic, and users from one system to the next, because application logic and data structures often need substantial improvements to be fit for a new purpose on a new platform. Communicate the inherent complexities and risks, in a factual and positive manner, without sounding like a “naysayer.”
  • Create a multi-phased plan for the project – Avoid a risky “big bang” approach by breaking the project into manageable steps. Pre-plan by exploring and profiling data extensively. Follow a develop-test-deploy methodology. Coordinate with multi-phased plans from outside your data team, including those for applications, process, and people migration. Expect that old and new platforms must run concurrently for a while, as data, processes, and users are migrated in orderly groups.
  • Use vendor tools – Programming (or hand coding) is inherently non-productive as the primary development method for either applications or data management solutions. Furthermore, vendor tools enable functions that are key to migrations, such as data profiling, develop-test-deploy methods, full-featured interfaces to all sources and targets, collaboration for multi-functional teams, repeatability across multiple projects, and so on.
  • Template-ize your project and staff for repeatability – In many organizations, migrations and consolidations recur regularly. Put extra work into projects, so their components are easily reused, thereby assuring consistent data standards, better governance, and productivity boosts over time.
  • Staff each migration or consolidation project with diverse people – Be sure that multiple IT disciplines are represented, especially those for apps, data, and hardware. You also need line-of-business staff to coordinate processes and people. Consider staff augmentation via consultants and system integrators.
  • Build a data management competency center or similar team structure – From one center, you can staff data migrations and consolidations, as well as related work for data warehousing, integration, quality, database administration, and so on.
If you’d like to hear more of my discussion with Informatica’s Rob Myers, please replay the Webinar from the Informatica archive.

Posted by Philip Russom, Ph.D. on March 11, 2015


Great Data for Great Analytics

Evolving Best Practices for Data Management

By Philip Russom, TDWI Research Director for Data Management

I recently broadcast a really interesting Webinar with David Lyle, a vice president of product strategy at Informatica Corporation. David and I had a “fireside chat” where we discussed one of the most pressing questions in data management today, namely: How can we prepare great data for great analytics, while still leveraging older best practices in data management? Please allow me to summarize our discussion.

Both old and new requirements are driving organizations toward analytics. David and I started the Webinar by talking about prominent trends:

  • Wringing value from big data: The consensus today says that advanced analytics is the primary path to business value from big data and other types of new data, such as data from sensors, devices, machinery, logs, and social media.
  • Getting more value from traditional enterprise data: Analytics continues to reveal customer segments, sales opportunities, and threats for risk, fraud, and security.
  • Competing on analytics: The modern business is run by the numbers, not just gut feel, to study markets, refine differentiation, and identify competitive advantages.

The rise of analytics is a bit confusing for some data people. As experienced data professionals do more work with advanced forms of analytics (enabled by data mining, clustering, text mining, statistical analysis, etc.), they can’t help but notice that the requirements for preparing analytic data are similar to, but different from, those of their other projects, such as ETL for a data warehouse that feeds standard reports.

Analytics and reporting are two different practices. In the Webinar, David and I talked about how the two involve pretty much the same data management practices, but in different orders and priorities:

  • Reporting is mostly about entities and facts you know well, represented by highly polished data that you know well. Squeaky clean report data demands elaborate data processing (for ETL, quality, metadata, master data, and so on). This is especially true of reports that demand numeric precision (about financials or inventory) or will be published outside the organization (regulatory or partner reports).
  • Advanced analytics, in general, enables the discovery of facts you didn’t know, based on the exploration and analysis of data that’s probably new to you. Preparing raw source data for analytics is relatively simple, though it happens at high levels of scale. With big data and other new data, preparation may be as simple as collocating large data sets on Hadoop or another platform suited to data exploration. When using modern tools, users can further prepare the data as they explore it, by profiling, modeling, aggregating, and standardizing data on the fly (see the sketch after this list).
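A minimal sketch of that prepare-as-you-explore pattern, using Spark as just one example of a modern tool; the paths, columns, and steps are hypothetical stand-ins for what a visual data prep tool would do interactively.

```python
# Hypothetical sketch: profile a raw data set collocated on Hadoop, then
# standardize and aggregate only what the analysis needs, on the fly.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ExploreAndPrep").getOrCreate()

raw = spark.read.parquet("hdfs:///lake/web_sessions/")

# Profile first: null rates and basic stats guide what needs fixing
raw.select([F.mean(F.col(c).isNull().cast("int")).alias(c)
            for c in raw.columns]).show()
raw.describe("duration_sec", "pages_viewed").show()

# Standardize and aggregate in memory as the exploration proceeds
explored = (raw
            .withColumn("channel", F.lower(F.trim("channel")))
            .groupBy("channel")
            .agg(F.avg("duration_sec").alias("avg_duration")))
explored.show()
```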

Operationalizing analytics brings reporting and analysis together in a unified process. For example, once an epiphany is discovered through analytics (e.g., the root cause of a new form of customer churn), that discovery should become a repeatable BI deliverable (e.g., metrics and KPIs that enable managers to track the new form of churn in dashboards). In these situations, the best practices of data management apply to a lesser degree (perhaps on the fly) during the early analytic steps of the process, but then are applied fully during the operationalization steps.

Architectural ramifications ensue from the growing diversity of data and workloads for analytics, reporting, multi-structured data, real time, and so on. For example, modern data warehouse environments (DWEs) include multiple tools and data platforms, from traditional relational databases to appliances and columnar databases to Hadoop and other NoSQL platforms. Some are on premises and others are on clouds. On the downside, this results in high complexity, with data strewn across multiple platforms. On the upside, users get great data for great analytics by moving data to a platform within the DWE that’s optimized for a particular data type, analytic workload, price point, or data management best practice.

For example, a number of data architecture use cases have emerged successfully in recent years, largely to assure great data for great analytics:

  • Leveraging new data warehouse platform types gives analytics the high performance it needs. Toward this end, TDWI has seen many users successfully adopt new platforms based on appliances, columnar data stores, and a variety of in-memory functions.
  • Offloading data and its processing to Hadoop frees up capacity on EDWs. And it also gives unstructured and multi-structured data types a platform that is better suited to their management and processing, all at a favorable cost point.
  • Virtualizing data assets yields greater agility and simpler data management. Multi-platform data architectures too often entail a lot of data movement among the platforms. But this can be mitigated by federated and virtual data management practices, as well as by emerging practices for data lakes and enterprise data hubs.

If you’d like to hear more of my discussion with Informatica’s David Lyle, please replay the Webinar from the Informatica archive.

Posted by Philip Russom, Ph.D. on February 2, 2015


Q&A RE: Data Warehouse Architecture Issues

Attendees of a recent TDWI Webinar asked excellent questions.
By Philip Russom, TDWI Research Director for Data Management

Recently, on Tuesday, April 15, 2014, I broadcast a TDWI Webinar in which I presented some of the findings from my new TDWI report, Evolving Data Warehouse Architectures in the Age of Big Data. You can download a free copy of the report in a PDF file. And you can replay the Webinar.

Attendees of the Webinar posed several very good questions about various issues in data warehouse architecture. Please allow me to share a few of the attendees’ questions and the answers I sent them via e-mail:

Q. As we update our data warehouse from more reporting to more analytics functions, should we design a brand new data warehouse architecture, or improve from the existing one?

If the existing data warehouse and its architecture fulfill business requirements and technical performance requirements (for speed and scale), then you should try to build out the existing architecture. For that to work, your existing vendor platform under the warehouse must perform well with multiple mixed workloads, including analytic workloads; ask your vendor representative for customer references who’ve succeeded with mixed workloads. Also, building up data sets for advanced analytics typically means loading large data volumes into the warehouse, which may cost more money with some licenses; again, ask your vendor if there are such ramifications under your current license. 

If your current core warehouse platform cannot support mixed workloads with high performance (or adding analytic data costs too much money), you may decide to manage and process large data sets for advanced analytics on a separate standalone platform that integrates with your warehouse. But in that case, you still keep your existing data warehouse and most of its data structures intact, just making slight changes for better integration with the new additional platform(s) for advanced analytics.

Q. Given the lack of integration across this multi-platform [data warehouse] environment, how do we avoid the need to replicate DW transactional sources into the big data platforms, as transactions are required in mining?

Good question, and there are a number of issues here. First, a well-designed multi-platform environment won’t suffer a “lack of integration.” TDWI’s definition of “logical data warehouse” is that the logical design specifies integration schemes (not just data models) across physically distinct platforms, whether that integration takes a data model approach (as in shared or conformed dimensions, etc.) or a data integration approach (as in jobs for ETL, replication, etc.) or both. Second, I take your point that replicating data more than needed can lead to a variety of problems, as data gets out of sync and loses integrity. A good architecture can minimize replication, and sometimes alleviate it. Third, for decades, users have faced the same decision you’re looking at: do we store, manage, and analytically process our rich, valuable collection of transactional data in the warehouse proper or on a standalone but integrated platform, such as the usual operational data store (ODS)?

For years, a solution I’ve seen users successfully adopt is to deploy a homegrown ODS that they’ve designed and optimized for transactions. The ODS is on a standalone platform that’s integrated with the core warehouse (plus other ODSs, marts, etc.), running on a relational DBMS atop commodity-priced hardware. Note that the upcoming trend is toward ODSs atop Hadoop (but only if the data volumes are massive). The idea is to manage transactional data on a platform that’s much cheaper than the DW, on a standalone platform where the relentless sorting, updating, and processing of that data won’t degrade warehouse performance. Yet the ODS is easily reached from all tools, plus through data federation and virtualization as well, which minimizes the replication of transactional data.

If you give the ODS the capacity it needs to persist multiple sort orders and data subsets in the ODS, then copying data outside the ODS is further reduced. Also, if you use data mining tools that can work on data “in situ” (i.e., in the ODS’s relational database) without moving data to the tool, then that also reduces copying and moving transactional data.

Q. The need for data warehouses is never going to go away. But isn’t the separation between "operations" and "analytics" starting to blur? In other words, the future isn't DWE; it's a "data environment" that does both.

Operational BI is all about getting operational data into BI faster and more frequently, while also embedding BI functions in operational applications and their processes as well. Operational BI is a very popular practice. It has been for years, and will get even more popular, as organizations adjust their BI efforts to bring them closer to real time (to be more competitive, customer conscious, efficient, etc.). The widespread existence of operational BI corroborates that the line between operations and BI is already quite blurred and will become even more so.

In another trend, many organizations are purposefully evolving toward a more or less loosely unified data environment for most enterprise data. I say “more or less” and “loosely” because early adopters are quick to say that the architecture is not 100 percent of the enterprise and integration is spotty, on an “as needed” basis. As one architect joked, “it’s more archaeology than architecture, because the work usually consists of imposing a logical architecture over mature, preexisting systems.” For early adopters, it makes sense to architect data globally, when customer data and some other data domains are pervasively shared across multiple applications, departments, and processes. It also makes sense in firms where business processes ramble across multiple business units and IT systems. Obviously, there’s an infinitude of resulting enterprise data architectures.

The data warehouse environment (DWE) I’m describing is a local microcosm of such a broad and loosely unified multi-platform data architecture. However, in some organizations today, the data warehouse and similar data platforms are just a few among many other data platforms, integrated on an enterprise scale. But those organizations are as yet the minority, although we at TDWI expect it to be the norm for IT-intense organizations within five years. TDWI’s Vegas conference has been devoted to issues in enterprise-scale data architecture for years, and will continue to be. You might consider attending next February.

Q. Can you point us to white papers on the difference between reporting and analytics [and how that affects DW architecture]?

You can read my blog on the subject. Or you could read the new report on evolving data warehouse architectures, because I adapted material from the blog to become a section in the report, starting on page 24.

Q. What’s the role, or is there a role, for variants like an ODS in the new world [of data warehouse architectures]? Is it part of the real-time world?”

Historically, some of the first standalone systems in a multi-platform data warehouse (going back to the mid-1990s) were ODSs deployed on their own hardware server with their own DBMS instances. These are still with us, and will continue to be with us, as data warehouse environments evolve into even more platforms used at once. An ODS can be designed and optimized by users for a wide range of data domains and uses (including real-time data), but I’m currently seeing a lot of users deploying ODSs for various types of big data and other data earmarked for advanced analytics.

Q. Saying Inmon vs. Kimball is no longer relevant is like saying Newton is no longer relevant in the world of physics today. It's still important, maybe not as fundamental as 1–2 decades ago.

For decades, Newton practiced alchemy in his copious spare time, because he was convinced that changing lead to gold was possible. Our heroes aren’t always 100 percent right.

Concerning Inmon and Kimball, see the top of page 7 in the report. Also please read the User Story on that same page. “No longer relevant” is your phrase, not mine. In my view, Inmon and Kimball’s innovations are as relevant as ever, and are still being applied daily. And they just keep giving: Inmon has recently extended our understanding of unstructured data, and Kimball is currently working on new best practices for Hadoop.

It’s the users who’ve changed. Instead of arguing about which to choose, users choose to apply Inmon and Kimball techniques (and others, too) in the same extended warehouse environment. And that’s a wise choice on their part, since hybrids and diversity seem to be winning strategies for a growing number of user organizations and their diversified DW architectures nowadays.

Q. Some organizations consider Hadoop a replacement for their current DW appliance. How is this possible?

As I said in the Webinar, I’ve only found two organizations that took out a data warehouse and put Hadoop in its place. While that corroborates that a replacement is possible, it’s not likely, nor is it a compelling trend.

Instead of replacement, we at TDWI see far more users augmenting their data warehouse environment with the Hadoop Distributed File System (HDFS), plus related Hadoop tools, especially MapReduce, Hive, HBase, and Pig. In short, HDFS handles things that relational warehouses are not designed for, such as unstructured data, algorithmic analytics, millions of files, and petabyte-size data sets. But the relational warehouse is still best for the structured and multidimensional data that goes into standard reports, performance management, and set-based analytics (typically OLAP or SQL-based analytics).

Another possibility is that Hive atop MapReduce and HDFS makes a highly scalable “row store” type of database. Sometimes you don’t need a full-featured (and expensive) relational DBMS, and hence a row store will do just fine. For example, many of the ODSs found today in data warehouse environments are candidates for migration to Hadoop. That includes ODSs that manage large “archives” (I use the word loosely) of transactional data and other operational data that’s persisted and kept long-term for advanced analytics that just need simple tabular structures. Most standalone ODSs of that description today run on mature DBMSs, but could run almost as well (for less money) on Hadoop.

Finally, let’s remember that not all organizations need a data warehouse, as represented by 15 percent of survey respondents.

Q. Can you recommend any sample success stories on how to integrate Hadoop or similar big data into an existing data warehouse [environment]?

Yes, many real-world use cases and user stories are discussed in the 2013 TDWI report Integrating Hadoop into Business Intelligence and Data Warehousing.

Posted by Philip Russom, Ph.D. on April 30, 2014