By Philip Russom, TDWI Research Director
[NOTE: The following article was published in the TDWI Trip Report of May 2012.]
The Technology Survey that TDWI circulated at the recent World Conference in Chicago asked attendees to answer a few questions about analytic database management systems and how these fit into their overall data warehouse architecture. Here’s some background information about analytic databases, plus a sampling of attendees’ responses to the survey:
A “database management system” (DBMS) is a vendor-built enterprise-class software package designed to manage databases, whereas a “database” is a collection of data managed by a DBMS. Hence, an “analytic DBMS” (ADBMS) is a vendor-built DBMS designed specifically for managing data for analytics. ADBMSs are most often optimized for “Extreme SQL,” which involves complex queries that scan terabytes of data or routines that may include thousands of lines of SQL. SQL aside, some ADBMSs support other in-database analytic processing, such as MapReduce, no-SQL parsing methods, and a variety of user-defined functions for data mining, statistical analysis, natural language processing (NLP), and so on. Some vendors package or market their ADBMSs as data warehouse appliances, columnar DBMSs, analytic accelerators, in-memory DBMSs, and cloud/SaaS-based platforms.
Half of organizations surveyed (52%) have no ADBMS.
There are good reasons why some organizations don’t feel the need for a specialized analytic DBMS. (See Figure 1.) Many organizations stick close to reporting, OLAP, and performance management, for which the average enterprise data warehouse (EDW) is more than capable. Others simply haven’t matured into the use of advanced analytics, for which most ADBMSs are designed. Still others have a powerful EDW platform that can handle all data warehouse workloads, including those for advanced analytics. Among the half of respondents that do have one or more ADBMSs, most have between one and five; multiple ADBMSs can result when multiple analytic methods are in use, due to diverse business requirements for analytics. Also, analytics tends to be departmental by nature, so ADBMSs are commonly funded via departmental budgets; and multiple departments investing in analytics leads to multiple ADBMSs.
FIGURE 1. Based on 75 respondents. Approximately how many standalone ADBMS platforms has your organization deployed?
52% = Zero
37% = One to five
8% = Six to ten
3% = More than ten
Half of organizations surveyed (46%) run analytic workloads on their EDW.
The EDW as a single monolithic architecture is still quite common, despite the increasing diversity of data warehouse workloads for analytics, real-time, unstructured data, and detailed source data. (See Figure 2.) Even so, a third of respondents (34%) offload diverse workloads to standalone DBMSs (often an ADBMS), typically to get workload-specific optimization or to avoid degrading the performance of the EDW. If you compare Figures 1 and 2, you see that half of respondents don’t have an ADBMS (Figure 1) because they run analytic workloads on their EDW (Figure 2).
FIGURE 2. Based on 74 respondents. Which of the following best characterizes how data warehouse workloads are distributed in your organization?
46% = One monolithic EDW that supports all workloads in a single DBMS instance
34% = One EDW, plus multiple, standalone DBMSs for secondary workloads
20% = Other
Most respondents consider an ADBMS to be a useful complement to an EDW.
Even some users who don’t have an ADBMS feel this way. (See Figure 3.) According to survey results, an ADBMS provides analytic and data management capabilities that complement an EDW (56%), enables the “analytic sandboxes” that many users need (57%), and optimizes more analytic workloads than the average EDW (58%).
FIGURE 3. Based on 219 responses from 72 respondents. What are the potential benefits of complementing an EDW with an ADBMS? (Select all that apply.)
58% = Optimized for more analytic workloads than our EDW
57% = Enables the “analytic sandboxes” that many users need
56% = Provides analytic and data mgt capabilities that complement our EDW
46% = Isolates ad hoc analytic work that might degrade EDW performance
33% = Manages multi-Tb raw source data for analytics better than EDW
29% = Handles real-time data feeds for analytics better than EDW
22% = Takes analytic processing to Big Data, instead of reverse
3% = Other
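Because Figure 3 is a "select all that apply" question, its percentages add up to well over 100%. The short Python sketch below is a reader's sanity check, not part of the survey methodology. It sums the percentages listed above and multiplies by the 72 respondents, to see whether the result lands near the 219 responses cited in the caption. The variable and function names are illustrative assumptions, and the published percentages are rounded, so the match is only approximate.

```python
# Percentages exactly as listed in Figure 3 (multi-select question).
FIGURE_3_PERCENTAGES = {
    "Optimized for more analytic workloads than our EDW": 58,
    "Enables the analytic sandboxes that many users need": 57,
    "Provides analytic and data mgt capabilities that complement our EDW": 56,
    "Isolates ad hoc analytic work that might degrade EDW performance": 46,
    "Manages multi-Tb raw source data for analytics better than EDW": 33,
    "Handles real-time data feeds for analytics better than EDW": 29,
    "Takes analytic processing to Big Data, instead of reverse": 22,
    "Other": 3,
}

RESPONDENTS = 72          # from the Figure 3 caption
REPORTED_RESPONSES = 219  # from the Figure 3 caption


def implied_responses(percentages, respondents):
    """Estimate the total number of selections from per-option percentages."""
    return sum(pct / 100 * respondents for pct in percentages.values())


total_pct = sum(FIGURE_3_PERCENTAGES.values())
estimate = implied_responses(FIGURE_3_PERCENTAGES, RESPONDENTS)

print(f"Sum of percentages: {total_pct}%")
print(f"Implied responses: {estimate:.1f} (reported: {REPORTED_RESPONSES})")
print(f"Average selections per respondent: {REPORTED_RESPONSES / RESPONDENTS:.2f}")
```

The percentages sum to 304%, which implies roughly 219 selections from 72 respondents, or about three benefits chosen per respondent. That agrees with the caption.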
Posted by Philip Russom, Ph.D. on June 8, 2012
By Philip Russom, TDWI Research Director
High performance continues to intensify as a critical success factor for user implementations in data warehousing (DW), business intelligence (BI), data integration (DI), and analytics. Users are challenged by big data volumes, new and demanding analytic workloads, growing user communities, and business requirements for real-time operation. Vendor companies have responded with many new and improved products and functions for high performance—so many that it’s hard for users to grasp them all.
In other words, just about everything we do in DW, BI, DI, and analytics has some kind of high-performance requirement. Users want quick responses to their queries, analysts need to rescore analytic models as soon as possible, and some managers want to refresh their dashboards on demand. Then there’s scalability, as in the giant data volumes of big data, growing user communities, and the overnight refresh of thousands of reports and analyses. Other performance challenges come from the increasing adoption of advanced analytics, mixed workloads, streaming data, and real-time practices such as operational BI.
Across all these examples, you can see that high-performance data warehousing (HiPerDW) is all about achieving speed and scale, despite increasing complexity and concurrency. This applies to every layer of the complex BI/DW/DI technology stack, as well as processes that unfold across multiple layers.
Luckily, today’s high-performance challenges are being addressed by numerous technical advancements in vendor tools and platforms. For example, there are now multiple high-performance platform architectures available for your data warehouse, including MPP, grids, clusters, server virtualization, clouds, and SaaS. For real-time data, databases and data integration tools are now much better at handling streaming big data, service buses, SOA, Web services, data federation, virtualization, and event processing. 64-bit computing has fueled an explosion of in-memory databases and in-memory analytic processing in user solutions; flash memory and solid-state drives will soon fuel even more innovative practices. Other performance enhancements have recently come from multi-core CPUs, appliances, columnar storage, high-availability features, MapReduce, Hadoop, and in-database analytics.
My next Best Practices Report from TDWI will help users understand new business and technology requirements for high-performance data warehousing (HiPerDW), as well as the many options and solutions available to them. Obviously, performance doesn’t result solely from the data warehouse platform, so the report will also reach out to related platforms for analytics, BI, visualization, data integration, clouds, grids, appliances, data services, Hadoop, and so on. My upcoming TDWI report (to be published in October 2012) will provide tips and strategies for prioritizing your own adoption of high-performance features.
Please help me with the research for the HiPerDW report by taking its survey, online at: http://svy.mk/HiPerDW
And please forward this email to anyone you feel is appropriate, especially people who have experience implementing or optimizing the high performance of systems for BI/DW/DI and analytics. If you tweet about HiPerDW, please use the Twitter hash tag #HiPerDW. Thank you!
Posted by Philip Russom, Ph.D. on May 18, 2012
Blog by Philip Russom
Research Director for Data Management, TDWI
All kinds of people have recently weighed in with their definitions and descriptions of so-called “big data,” including journalists, industry analysts, consultants, users, and vendor representatives. Frankly, I’m concerned about the direction that most of the definitions are taking, and I’d like to propose a correction here.
Especially when you read the IT press, definitions stress data from Web, sensor, and social media sources, with the insinuation that all of it is collected and processed via streams in real time. Is anyone actually doing this? Yes, they are, but the types of companies out there on the leading edge of big data (and the advanced analytics that often go with it) are what we usually call “Internet companies.” Representatives from older Internet companies (Google, eBay, Amazon) and newer ones (Comshare, LinkedIn, LinkShare) have stood up at recent TDWI conferences and described their experiences with big data analytics; therefore I know it’s real and firmly established.
So, if Internet companies are successfully applying analytics to big data, what’s my beef? It is exactly this: a definition of big data biased toward best practices in Internet companies ignores big data best practices in more mainstream companies.
For example, I recently spoke with people at three different telcos – you know, telephone companies. For decades, they’ve been collecting big data about call detail records (CDRs), at the rate of millions (sometimes billions) of records a day. In some regions, national laws require them to collect this information and keep it in a condition that is easily shared with law enforcement agencies. But CDRs are not just for regulatory compliance. Telcos have a long history of success analyzing these vast datasets to achieve greater performance and reliability from their utility infrastructure, as well as for capacity planning and understanding their customers’ experiences.
Federal government agencies also have a long history of success with big data. For example, representatives from IRS Research recently spoke at a TDWI event, explaining how they were managing billions of records back in the 1990s, and have recently moved up to multiple trillions of records. (Did you catch that? I said trillions, not billions. And that’s just their analytic datasets!) More to the point, IRS data is almost exclusively structured and relational.
I could hold forth about this interminably. Instead, I’ve summarized my points in a table that contrasts a mainstream company’s big-data environment with that of an Internet-based one. My point is that there’s ample room for both traditional big data and for the new generation of big data that’s getting a lot of press at the moment. Eventually, many businesses (whether mainstream, Internet, or whatnot) will be an eclectic mix of the two.
Traditional Big Data | New Generation Big Data
Tens of Terabytes | Hundreds of Terabytes, soon to be measured in Petabytes
Mostly structured and relational data | Mixture of structured, semi-structured, and unstructured data
Data mostly from traditional enterprise applications: ERP, CRM, etc. | Also from Web logs, clickstreams, sensors, e-commerce, mobile devices, social media
Common in mid-to-large companies | Common in Internet-based companies; will eventually go mainstream
Real-time as in Operational BI | Real-time as in Streaming Data
I’m sorry that I’m foisting yet another definition of big data on you. Heaven knows, we have enough of them. But I feel we need a definition that is less Internet-biased and broad enough to encompass big-data best practices in mainstream companies as well. For one thing, let’s give credit where credit is due: a lot of mainstream companies are successful with a more traditional definition of big data. For another, we run the risk of alienating people in mainstream companies, which could impair the mainstream adoption of big-data best practices. That, in turn, would stymie the cause of leveraging big data (no matter how you define it) for greater business advantage. And that would be a pity.
So, what do you think? Let me know!
Some of the material of this blog came from my recent Webinar: “Big Data and Your Data Warehouse.” You can replay it from TDWI’s Webinar Archive.
Want to learn more about Big Data Analytics? Attend the TDWI Forum on Big Data Analytics, coming in Orlando November 12-13, 2012.
Posted by Philip Russom, Ph.D. on May 1, 2012
Blog by Philip Russom
Research Director for Data Management, TDWI
To raise awareness of what the Next Generation of Master Data Management (MDM) is all about, I recently issued a series of 35 tweets via Twitter, over a two-week period. The tweets also helped promote a TDWI Webinar on Next Generation MDM. Most of these tweets triggered responses to me or retweets. So I seem to have reached the business intelligence (BI), data warehouse (DW), and data management (DM) audience I was looking for – or at least touched a nerve!
To help you better understand Next Generation MDM and why you should care about it, I’d like to share these tweets with you. I think you’ll find them interesting because they provide an overview of Next Generation MDM in a form that’s compact, yet amazingly comprehensive.
Every tweet I wrote was a short sound bite or stat bite drawn from TDWI’s recent report on Next Generation MDM, which I researched and wrote. Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.
I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think that all of you already know them or can figure them out. Even so, I deleted a few tiny URLs, hashtags, and repetitive phrases. I issued the tweets in groups, on related topics; so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.
Defining the Generations of MDM
1. #MDM is inherently a multigenerational discipline w/many life cycle stages. Learn its generations in #TDWI Webinar
2. User maturation, new biz reqs, & vendor advances drive #MDM programs into next generation. Learn more in #TDWI Webinar
3. Most #MDM generations incrementally add more data domains, dep’ts, data mgt tools, operational apps.
4. More dramatic #MDM generations consolidate redundant solutions, redesign architecture, replace platform.
Why Care About NG MDM?
5. Why care about NexGen #MDM? Because biz needs consistent data for sharing, BI, compliance, 360views.
6. Why care about NexGen #MDM? Most orgs have 1st-gen, homegrown solutions needing update or replacement.
The State of NG MDM
7. #TDWI SURVEY SEZ: #MDM adoption is good. 61% of surveyed orgs have deployed solution. Another 29% plan to soon.
8. #TDWI SURVEY SEZ: #MDM integration is not so good. 44% of solutions deployed are silos per dept, app, domain.
9. #TDWI SURVEY SEZ: Top, primary reasons for #MDM: 360-degree views (21%) & sharing data across enterprise (19%).
10. #TDWI SURVEY SEZ: Top, secondary reasons for #MDM: Data-based decisions (15%) & customer intelligence (13%).
11. #TDWI SURVEY SEZ: Other reasons for #MDM: operational excellence, reduce cost, audits, compliance, reduce risk.
12. #TDWI SURVEY SEZ: Top, primary #MDM challenges are lack of: exec sponsor, data gov, cross-function collab, biz driver.
13. #TDWI SURVEY SEZ: Other challenges to #MDM: growing reference data, coord w/other data mgt teams, poor data quality.
MDM’s Business Entities and Data Domains
14. #TDWI SURVEY SEZ: “Customer” is biz entity most often defined via #MDM (77%). But I bet you knew that already!
15. #TDWI SURVEY SEZ: Other #MDM entities (in survey order) are product, partner, location, employee, financial.
16. #TDWI SURVEY SEZ: Surveyed organizations have an average of 5 definitions for customer and 5 for product. #MDM
17. #TDWI TAKE: Multi-data-domain support is a key metric for #MDM maturity. Single-data-domain is a myopic silo.
18. #TDWI SURVEY SEZ: 37% practice multi-data-domain #MDM today, proving it can succeed in a wide range of orgs.
19. #TDWI SURVEY SEZ: Multi-data-domain maturity is good. Only 24% rely mostly on single-data-domain #MDM.
20. #TDWI SURVEY SEZ: A third of survey respondents (35%) have a mix of single- and multi-domain #MDM solutions.
Best Practices of Next Generation MDM
21. #TDWI TAKE: Unidirectional #MDM improves reference data but won’t share. Not a hub unless ref data flows in/out
22. #TDWI SURVEY SEZ: #MDM solutions today r totally (26%) or partially (19%) homegrown. Learn more in Webinar http://bit.ly/NG-MDM #GartnerMDM
23. #TDWI SURVEY SEZ: Users would prefer #MDM functions from suite of data mgt tools (32%) or dedicated #MDM app/tool (47%)
24. #TDWI Survey: 46% claim to be using biz process mgt (BPM) now w/#MDM solutions. 32% said integrating MDM w/BPM was challenging.
25. #TDWI SURVEY SEZ: Half of surveyed organizations (46%) have no plans to replace #MDM platform.
26. #TDWI SURVEY SEZ: Other half (50%) is planning a replacement to achieve generational change. Learn more in Webinar http://bit.ly/NG-MDM
27. Why rip/replace #MDM? For more/better tools, functions, arch, gov, domains, enterprise scope.
28. Need #MDM for #Analytics? Depends on #Analytics type. OLAP, complex SQL: Oh, yes. Data/text mining, NoSQL, NLP: No way.
Quantifying the Generational Change of MDM Features
29. #TDWI SURVEY SEZ: Expect hi growth (27% to 36%) in #MDM options for real-time, collab, ref data sync, tool use.
30. #TDWI SURVEY SEZ: Good growth (5% to 22%) coming for #MDM workflow, analytics, federation, repos, event proc.
31. #TDWI SURVEY SEZ: Some #MDM options will be flat due to saturation (gov, quality) or outdated (batch, homegrown).
Top 10 Priorities for Next Generation MDM
32. Top 10 Priorities for NG #MDM (Pt.1) 1-Multi-data-domain. 2-Multi-dept/app. 3-Bidirectional. 4-Real-time. #TDWI
33. Top 10 Priorities for NG #MDM (Pt.2) 5-Consolidate multi MDM solutions. 6-Coord w/other disciplines. #TDWI
34. Top 10 Priorities for NG #MDM (Pt.3) 7-Richer modeling. 8-Beyond enterprise data. 9-Workflow/process mgt. #TDWI
35. Top 10 Priorities for NG #MDM (Pt.4) 10-Retire early gen homegrown & build NexGen on vendor tool/app. #TDWI
FOR FURTHER STUDY:
For a more detailed discussion of Next Generation MDM – in a traditional publication! – see the TDWI Best Practices Report, titled “Next Generation Master Data Management,” which is available in a PDF file via a free download.
You can also register for and replay my TDWI Webinar, where I present the findings of the Next Generation MDM report.
Philip Russom is the research director for data management at The Data Warehousing Institute (TDWI). You can reach him at firstname.lastname@example.org or follow him as @prussom on Twitter.
Posted by Philip Russom, Ph.D. on April 13, 2012
These days have been a whirlwind of projects. One of the biggest for me is the TDWI Best Practices Report I am working on, entitled “Customer Analytics in the Age of Social Media.” This report looks at what organizations are doing and could be doing to analyze information sources to improve their knowledge of and engagement with customers. Social media data is the revolutionary force in this realm; marketing functions are highly focused on how to take advantage of social media both as a new channel and as a critical source of information about customer and market behavior. The heart of this report will be about how customer intelligence and analytics efforts are being reshaped by the influence of social media. This is exciting stuff.
When people talk about “big data,” much of the time they are talking about data generated by human behavior in social networks, blogs, chat rooms, comment fields, and more. Indeed, this can amount to a fast-moving, highly diverse “tsunami” of data that includes both internal (e.g., contact center interactions) and external sources. By discovering insights from this information, organizations can broaden and deepen their understanding of customers and get closer to a 360-degree view.
In addition, organizations can use social media data to gain an early view of the efficacy of marketing campaigns and product introductions. Many organizations are “listening” to such reactions in social media; leading organizations analyze the data rapidly and move quickly to adjust campaigns and engage in the social conversations to improve results.
To be sure, some organizations have serious reservations about social media data. First, not all organizations I have spoken with for the report find social media data to be trustworthy; some take such analysis with a heavy grain of salt. My research found that while “gut feel” is losing out to the power of data analysis in most marketing functions, there’s still healthy debate about the real value of social media data to marketing decisions.
Second, while organizations at the leading edge of social media get a lot of attention, in a broad sense we are still in the early days. In our research, just 26 percent of participants said that their organizations are currently analyzing social media data; 22 percent are planning to do so within one year, while 21 percent have no plans to do so.
Where I found that organizations are gaining huge value is in drawing insights from social media to help them get closer to a 360-degree view of customer activity. Data silos are a problem in marketing; each channel often has its own dedicated applications and data. If organizations can correlate what they are seeing in social media with performance data from Web sites and other channels, they can begin to connect the dots across channels.
“Social media for us is not one isolated channel,” a data analyst at a large advertising services firm told me. “We use social media to gain an integrated view of the impact of our marketing across all of our channels, including billboards.” His organization is comparing social media data with their sources on marketing spending, customer transactions by location, and Web site performance. While not complete by itself, social media activity analysis enables a far more current view of marketing campaign performance than organizations have previously had.
“To see and be seen” is the credo of social media engagement. It isn’t enough to just listen; organizations have to be prepared to act. To do so intelligently, however, organizations must use social media data as not just a single source but as part of their integrated view of customer information.
Posted by David Stodder on April 12, 2012
Blog by Philip Russom
Research Director for Data Management, TDWI
[NOTE -- I recently completed a TDWI Best Practices Report titled Next Generation Master Data Management. The goal is to help user organizations understand MDM lifecycle stages so they can better plan and manage them. TDWI will publish the 36-page report in a PDF file in early April 2012, and anyone will be able to download it from www.tdwi.org. In the meantime, I’ll provide some “sneak peeks” by blogging excerpts from the report. Here’s the fifth excerpt, which is the Executive Summary at the beginning of the report.]
EXECUTIVE SUMMARY
Master data management (MDM) is one of the most widely adopted data management disciplines of recent years. That’s because the consensus-driven definitions of business entities and the consistent application of them across an enterprise are critical success factors for important cross-functional business activities, such as business intelligence (BI), complete views of customers, operational excellence, supply chain optimization, regulatory reporting, compliance, mergers and acquisitions, and treating data as an enterprise asset. Due to these compelling business reasons, many organizations have deployed their first or second generation of MDM solutions. The current challenge is to move on to the next generation.
Basic versus advanced MDM functions and architectures draw generational lines that users must now cross.
For example, some MDM programs focus on the customer data domain, and they need to move on to other domains, like products, financials, partners, employees, and locations. MDM for a single application (such as enterprise resource planning [ERP] or BI) is a safe and effective start, but the point of MDM is to share common definitions and reference data across multiple, diverse applications. Most MDM hubs support basic functions for the offline aggregation and standardization of reference data, whereas they should also support advanced functions for identity resolution, two-way data sync, real-time operation, and approval workflows for newly created master data. In parallel to these generational shifts in users’ practices, vendor products are evolving to support advanced MDM functions, multi-domain MDM applications, and collaborative governance environments.
Users invest in MDM to create complete views of business entities and to share data enterprisewide.
According to survey respondents, the top reasons for implementing an MDM solution are to enable complete views of key business entities (customers, products, employees, etc.) and to share data broadly but consistently across an enterprise. Other reasons concern the enhancement of BI, operational excellence, and compliance. Respondents also report that MDM is unlikely to succeed without strong sponsorship and governance, and MDM solutions need to scale up and to cope with data quality (DQ) issues, if they are to succeed over time.
“Customer” is, by far, the entity most often defined via MDM. This prominence makes sense, because conventional wisdom says that any effort to better understand or serve customers has some kind of business return that makes the effort worthwhile. Other common MDM entities are (in survey priority order) products, partners, locations, employees, and financials.
Users continue to mature their MDM solutions by moving to the next generation.
MDM maturity is good, in that 60% of organizations surveyed have already deployed MDM solutions, and over one-third practice multi-data-domain MDM today. On the downside, most MDM solutions today are totally or partially homegrown and/or hand coded. But on the upside, homegrown approaches will drop from 45% today to 5% within three years, while dedicated MDM application or tool usage will jump from 12% today to 47%. To achieve generational change, half of organizations anticipate replacing their current MDM platform(s) within five years.
The usage of most MDM features and functions will grow in MDM’s next generation.
Over the next three years, we can expect the strongest growth among MDM features and functions for real-time, collaboration, data sync, tool use, and multistructured data. Good growth is also coming with MDM functions for workflow, analytics, federation, repositories, and event processing. Some MDM options will experience limited growth, because they are saturated (services, governance, quality) or outdated (batch processing and homegrown solutions).
This report helps user organizations understand all that MDM now offers, so they can successfully modernize and build up their best practices in master data management. To that end, it catalogs and discusses new user practices and technical functions for MDM, and it uses survey data to predict which MDM functions will grow most versus those that will decline—all to bring readers up to date so they can make informed decisions about the next generation of their MDM solutions.
Please attend the TDWI Webinar where I present the findings of my TDWI report Next Generation MDM, on April 10, 2012, at noon ET. Register online for the Webinar.
Posted by Philip Russom, Ph.D. on March 30, 2012