TDWI Blog: Data 360

Big Data Analytics: An Overview in 20 Tweets

By Philip Russom, TDWI

To raise awareness of the new tool features, user techniques, and team structures of Big Data Analytics, I recently issued a series of twenty tweets via Twitter over a two-week period. The tweets also helped promote a TDWI Webinar on Big Data Analytics. Most of these tweets triggered responses to me or retweets. So I seem to have reached the business intelligence (BI) and data warehouse (DW) audience I was looking for – or at least touched a nerve!

To help you better understand Big Data Analytics and why you should care about it, I’d like to share some of the thoughts from these tweets with you. I think you’ll find them interesting because they provide an overview of Big Data Analytics in a form that’s compact, yet amazingly comprehensive.

Every tweet I wrote was a short sound bite or stat bite drawn from TDWI’s recent report on Big Data Analytics, which I researched and wrote. Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think that all of you already know them or can figure them out. Even so, I deleted a few tiny URLs and repetitive phrases. I issued the tweets in groups, on related topics; so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.

Defining Big Data, Advanced Analytics, and Big Data Analytics
1. #BigData #Analytics = where advanced analytics operate on big data sets. So, it’s about 2 things. Learn more in Webinar http://bit.ly/qp4wp6
2. Advanced #Analytics = data mining, statistics, extreme SQL, data viz, artificial intell, language processing.
3. Advanced #Analytics = database techs like MapReduce, in-database & in-memory analytics, column stores.
4. Advanced #Analytics = discovering unknown biz facts. Instead of advanced, should call it discovery analytics
5. #BigData = not just multi-terabyte datasets. Also about diverse data types & real-time or streaming data.
6. Bleeding edge of #BigData = data streaming from sensors, robotics, monitor devices, Web logs.

Benefits and Barriers for Big Data Analytics
7. #TDWI SURVEY SEZ: #BigData #Analytics benefits customer relations, BI, most pre-existing analytic apps.
8. #TDWI SURVEY SEZ: Bad skills, sponsors, & database software are leading barriers to #BigData #Analytics.

Organizational Issues and Big Data Analytics
9. #TDWI SURVEY SEZ: 30% consider #BigData a data mgt problem. 70% think it a biz opp when analyzed. Attend #TDWI Webinar http://bit.ly/qp4wp6
10. #TDWI SURVEY SEZ: #BigData #Analytics is owned by BI/DW team (41%), dep’ts (21%), IT/CIO (12%).
11. #TDWI SURVEY SEZ: Business analyst is most common job title for designer of #BigData #Analytics.

The State of Big Data Analytics
12. #TDWI SURVEY SEZ: 74% of orgs have some form of analytics today. But only 34% do #BigData #Analytics.
13. #TDWI SURVEY SEZ: 37% of orgs have 10Tb+ of #BigData just for #Analytics. More on #TDWI Webinar http://bit.ly/qp4wp6
14. #TDWI SURVEY SEZ: 20% of orgs expect to have 500Tb+ of #BigData just for #Analytics by 2013.
15. #TDWI SURVEY SEZ: 64% of orgs today manage #BigData for #Analytics in EDW, 38% outside EDW.
16. #TDWI SURVEY SEZ: 24% claim to have Hadoop today. #TDWI suspects most are experimental downloads. But still impressive
17. #TDWI SURVEY SEZ: #BigData is struc 92%, semi-struc 54%, hier 54%, events 45%, unstruc 35%, social 34%, Web 31%...

Future Trends in Big Data Analytics
18. #TDWI SURVEY SEZ: 33% will replace #Analytics platform within 3 yrs. Another 11% after that. 9% already replaced.
19. #TDWI SURVEY SEZ: Why replace #Analytics platform? Poor scale, loading, query speed, real time, SOA, self service, viz.
20. #TDWI SURVEY SEZ: #BigData #Analytics techs set to grow most: advanced analytics, data viz, in-memory DBs, unstruc data

FOR FURTHER STUDY:
Don’t miss my next TDWI Webinar on Hadoop. I’ll lead a panel of vendor representatives in a discussion of Hadoop and its value for BI, DW, and analytics. Register online, so you can join us December 14, 2011 at noon ET.

For a more detailed discussion of Big Data Analytics – in a traditional publication! – see the TDWI Best Practices Report, titled Big Data Analytics, which is available in a PDF file via a free download.

You can also register for and replay my TDWI Webinar, where I present the findings of the Big Data Analytics report.

Philip Russom is the research director for data management at TDWI. You can reach him at [email protected] or follow him as @prussom on Twitter.

Posted by Philip Russom, Ph.D.

Master Data Management: Rules for the Next Generation

Blog by Philip Russom
Research Director for Data Management, TDWI

I’m currently researching a TDWI Best Practices Report that will redefine master data management (MDM) by describing what its next generation should look like. As part of the research, I’ve been interviewing users on the phone about their MDM programs.

The news so far is a mix of good and bad. I hate saying it, but half of the organizations I’ve talked with are mired in early lifecycle stages of their MDM programs, unable to get over certain humps and mature into the next generation. On the flip side, the other half is well into the next generation; so I know it can be done.

Allow me to list desirable capabilities of MDM’s next generation, and briefly say why these need to replace similar early phase capabilities. The following list (with a great deal more detail) will probably appear in my Next Generation MDM report that TDWI will publish April 2, 2012. After all, the list defines MDM’s next generation. And my goal is to establish a set of rules (or requirements) that can guide users into the next generation.

Multi-domain MDM. Many MDM solutions address only the customer data domain, and they need to move on to other domains, like products, financials, and locations. Single-data-domain MDM is a barrier to having common, consensus-based entity definitions and standard reference data that would allow you to correlate information across multiple domains. (See my blog The State of Multi-Data-Domain MDM.)

Multi-department, multi-application MDM. MDM for a single application (typically ERP, CRM or BI) is a safe and effective start. But the point of MDM is to share common definitions across multiple, diverse applications and the departments that depend on them. It’s important to overcome organizational boundaries if MDM is to move from being a local fix to an enterprise infrastructure.

Bidirectional MDM. "Roach Motel MDM," as I call it, is when you extract reference data and store it in a database from which it never emerges (as with many BI/DW systems). One-way MDM is bad whenever you need to improve reference data in a central place, then publish it out to a wide variety of operational applications. (See my article Roach Motel MDM.)

Real-time MDM. The strongest trend in data management today (and BI/DW, too) is toward real-time operation as a complement to batch. Real-time is critical to identity resolution and the immediate application of recent changes to reference data.

Consolidating multiple, competing MDM solutions. How can you have a single view of the customer, if you have multiple customer-domain MDM solutions? How can you correlate reference data across domains, if the domains are treated in separate MDM solutions? For many organizations, next-gen MDM begins with a consolidation of multiple MDM solutions.

Beyond enterprise data. Despite the obsession with customer data that most MDM solutions suffer, almost none of them today incorporate data about customers from Web sites or social media. If you’re truly serious about MDM as an enabler for CRM, next-gen MDM (and CRM, too) must reach into every customer channel.

Richer modeling. Reference data in the customer domain works fine with flat modeling, involving a simple (but very wide) record per customer. However, other domains make little sense without a richer, hierarchical model, as with a chart of accounts in finance or a bill of material in manufacturing. Metrics and KPIs – so common in BI today – rarely have proper master data in multidimensional models. (See my article MDM for Performance Management.)

Coordination with other disciplines. To achieve next-gen goals, many organizations need to stop practicing MDM in a vacuum. Instead of MDM as merely a technical fix, it also needs to be aligned with business goals for data. And MDM should be coordinated with related data management disciplines, especially data integration and data quality. A solid data governance program can be an effective medium for such coordination. (See my blog MDM Can Learn from Data Quality.)

MDM Workflow. Development and collaborative efforts in MDM today are mostly ad hoc actions with little or no process. For an MDM program to scale up and grow, it needs workflow functionality that automates the proposal, review, and approval process for newly created or improved reference and master data. Also, a few MDM programs need the kind of workflow enabled by tools for business process management. Vendor tools and dedicated applications for MDM are starting to support such workflows.
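
To make the "Richer modeling" rule above concrete, here is a minimal sketch of hierarchical reference data, using a chart of accounts as the example. It assumes plain Python and nothing else; the Account class, the account codes, and the balances are all invented for illustration and do not come from any MDM product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Account:
    """One node in a hypothetical chart of accounts."""
    code: str
    name: str
    balance: float = 0.0  # amount posted directly to this node
    children: list[Account] = field(default_factory=list)

    def rollup(self) -> float:
        """Return this node's balance plus the balances of all descendants."""
        return self.balance + sum(child.rollup() for child in self.children)


# Illustrative reference data; codes, names, and amounts are made up.
chart = Account("1000", "Assets", children=[
    Account("1100", "Current Assets", children=[
        Account("1110", "Cash", balance=250.0),
        Account("1120", "Receivables", balance=100.0),
    ]),
    Account("1200", "Fixed Assets", balance=400.0),
])

total_assets = chart.rollup()
current_assets = chart.children[0].rollup()
```

A flat, one-record-per-entity model has no natural place for the parent/child relationships that rollup() depends on, which is why domains like finance and manufacturing tend to need the richer model.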

So, what do you think? Do you know of other generational changes that MDM is facing? Let me know.

================================

ANNOUNCEMENTS
Please take the TDWI MDM Survey for my upcoming report about Next-Generation MDM.

David Loshin and I will moderate the TDWI Solution Summit on Master Data, Quality, and Governance, coming up March 4-6, 2012 in Savannah, Georgia. You should attend!

Posted by Philip Russom, Ph.D.

Big Data Analytics: The News from Informatica

Blog by Philip Russom
Research Director for Data Management, TDWI

Early this morning, Informatica Corporation announced Informatica HParser, a new product for parsing data in Apache Hadoop environments. Instead of repeating the details of the announcement – which you can read on www.informatica.com, etc. – I’d rather use the announcement as a springboard for my own thoughts about the bigger trends and issues in Big Data Analytics and Hadoop that the announcement fits into. The catch is that there are so many myths and misconceptions (i.e., “mythconceptions”) about Hadoop right now, that I can’t bust them all in a short piece like this blog. So I’ll just present the two leading mythconceptions as background, plus a brief rant for color.

First Mythconception. Hadoop is not one, monolithic thing, so we need to stop talking about it that way. It’s actually an open source software library administered by the Apache Software Foundation. (Some Hadoop products are also available via vendor distributions; but that’s another story.) The Apache Hadoop library includes several products and technologies, including (in BI priority order) the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, Pig, ZooKeeper, Flume, Sqoop, Oozie, Hue, and so on. It’s up to you to figure out which combination of Apache Hadoop products to implement for a given application. For applications in business intelligence (BI) and Big Data Analytics, HDFS and MapReduce (perhaps with HBase and Hive) constitute a useful technology stack.

Second Mythconception. Theoretically, HDFS can manage the storage and access of any data type, as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true, and it’s exactly what brings many users to Apache HDFS in the first place. Yet, HDFS’s admirable tolerance for diverse data doesn’t mean that an Apache Hadoop environment operates equally well with all file and data types. According to users I’ve interviewed, if you expect to get speed, scalability, and development simplicity, you need to work with Hadoop’s preference for record-based data. That’s not as limiting as it sounds, because many types of Big Data handled by HDFS are inherently record-based, as in logs from Web servers and sensors or table dumps of call detail records, customer records, transactions, etc. Furthermore, many sources of traditional enterprise data can be converted to records and copied to HDFS for Big Data Analytics and other applications.

And that brings us to Informatica Corporation’s announcement today of the new Informatica HParser. In a Hadoop environment, it’s MapReduce that actually executes the programmatic logic of an application. In the context of Big Data Analytics, the logic is (today) usually hand-coded data transformations or analytic logic. HParser provides an integrated development environment (IDE) for creating data transformation logic, plus ties into MapReduce to ensure that the logic executes in a fully distributed and parallel fashion. Given Apache Hadoop’s preference for record-based data, use cases cited by Informatica focus on how HParser can convert unstructured data into records and tables, plus flatten overly structured or “complex” data (as in the hierarchies of XML and JSON) into records that are more palatable to HDFS and Apache MapReduce. Record structures aside, Informatica HParser also supports a long list of data standards and document types. And Informatica PowerExchange for Hadoop provides additional functionality.
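
To illustrate what "flattening" hierarchical data into records means, here is a minimal sketch in plain Python. It is not HParser and does not use any Informatica API; the function names flatten and to_records, the dotted-key naming scheme, and the sample order document are all assumptions made up for this example.

```python
import json


def flatten(node, prefix=""):
    """Flatten nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def to_records(document, list_key):
    """Emit one flat record per item in document[list_key].

    Fields outside the list are repeated on every record, which is the
    usual price of turning a hierarchy into rows.
    """
    parent = {k: v for k, v in document.items() if k != list_key}
    parent_flat = flatten(parent)
    return [{**parent_flat, **flatten(item, list_key)}
            for item in document[list_key]]


# Invented sample document: one order with two line items.
raw = (
    '{"order": {"id": 7, "customer": {"name": "Acme"}},'
    ' "lines": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}'
)
records = to_records(json.loads(raw), "lines")
```

Each resulting record is a flat set of name/value pairs, the kind of structure that record-oriented processing handles most comfortably.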

A brief rant. If you’ve been reading my writings on data integration for the last ten years, you know that I consider hand-coded data integration to be non-productive. Hand coding is time-consuming, not very re-usable, hard to update, and inherently feature-poor compared to vendor platforms. Now, we’re faced with Apache MapReduce, which – out of the box – demands huge amounts of hand coding, because it’s a processing engine that manages and provides parallelization for hand-coded routines (whether for analytics, DI, or otherwise). Informatica HParser shows promise for reducing the non-productive hand-coding that open-source environments like Hadoop, MapReduce, and Hive assume.
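
For readers who have not seen what those hand-coded routines look like, here is a minimal, single-process sketch of the map, shuffle, and reduce steps in plain Python. It does not use Hadoop or any Hadoop API; the four-field log format and the function names are assumptions invented for this example. A real MapReduce engine would run the map and reduce functions in parallel across many nodes.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit (status_code, 1) for one well-formed log line.

    Assumes a hypothetical format: METHOD PATH STATUS BYTES.
    """
    parts = line.split()
    if len(parts) == 4:
        yield parts[2], 1


def shuffle(pairs):
    """Group mapped values by key, as the framework would between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: total the values collected for one key."""
    return key, sum(values)


# Invented sample input; the last line is malformed and gets skipped.
log_lines = [
    "GET /index.html 200 512",
    "GET /missing 404 0",
    "POST /login 200 128",
    "garbled",
]

pairs = (pair for line in log_lines for pair in map_line(line))
counts = dict(reduce_counts(key, values)
              for key, values in shuffle(pairs).items())
```

Even this trivial count requires the developer to write the parsing, the error handling, and the aggregation by hand, which is the productivity concern raised above.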

Conclusion. I feel that the men and women who’ve contributed to open source Hadoop have made an impressive and innovative contribution. And the Apache Software Foundation does a great job enabling the open source community. Thanks to these contributions, Hadoop is successfully used in production, but mostly in large, Internet-based businesses, like Amazon, Comscore, eBay, Google, and LinkedIn. However, for the Hadoop family – and the Big Data Analytics it enables – to become truly useful in a wide range of mainstream organizations across multiple industries, I think that the Hadoop family needs a number of new extensions, improvements, and options for interoperability.

This is why we’re now seeing software vendor companies coming out with various types of support for Apache Hadoop products and technologies. Informatica’s HParser and Informatica PowerExchange for Hadoop are prime examples, and other DI vendors will soon follow suit with similar interfaces and extensions for Hadoop. Some vendors are building administrative tools, which HDFS sorely lacks. And BI and analytic tool vendors are scrambling to sit atop HDFS and MapReduce. Personally, I hope to see more support for Hadoop and soon, because, without it, mainstream user organizations can’t get full value from Hadoop. Hence, they may not adopt it.

So, what do you think? Let me know!

===============================
Do you suffer mythconceptions about Hadoop? If so, TDWI can help you bust them:
• TDWI will soon publish my new Checklist Report on Hadoop, available as a free download on tdwi.org, starting Dec. 13, 2011.
• On Dec. 14, 2011, I’ll broadcast a TDWI Webinar based on that report. Please register online for the Hadoop Webinar.

Posted by Philip Russom, Ph.D.

Big Data Analytics: The News from Teradata

Blog by Philip Russom
Research Director for Data Management, TDWI

Just moments ago, Teradata Corporation issued three announcements describing new capabilities, products, and releases. Instead of repeating the details of Teradata’s new stuff -- which you can read on www.teradata.com, etc. -- I’d rather be self-indulgent and use each announcement as a springboard for my own thoughts about the bigger trends in Big Data Analytics these relate to.

Announcement Number One: Teradata Columnar

A few years ago, I was at the Teradata Partners Conference. Instead of attending speaking sessions, I was in a series of meetings for industry analysts and industry influencers. When the topic of columnar databases came up -- and it was my turn to pontificate -- I said something like: “Columnar storage engines will soon be available as just another feature of database management systems from larger, more established vendors.” The room fell quiet, and a cricket chirped in the background. Then, two experts mocked me, while Teradata people were noticeably mum. ;)

Does that make me a prescient visionary? No, not at all. I’ve just been paying attention for the last three decades, as one technology after the next is developed and proved by a small startup, then bought or built by one or more of the leading DBMS vendors. We’ve seen this trend played out with features for everything from security to parallel processing to OLAP to federation to in-memory databases. We’re now seeing the same trend with columnar data stores and other technologies for Big Data Analytics.

Newish vendors like ParAccel and Vertica -- and Sybase long before them -- have proved the usefulness and commercial potential of a columnar approach. Open source DBMSs MySQL and Infobright made similar contributions. In full compliance with the trend I’m describing, IBM and Oracle have released columnar storage engines they built, and now it’s Teradata’s turn. Teradata Columnar is a new capability of Teradata Database 14. What’s new here is that Teradata has integrated both columnar AND row-based tables, thereby making hybrid applications more feasible. All the above is goodness, regardless of vendor, because columnar data stores have compelling advantages for query speed, data compression, bla, bla, bla, and the usual miraculous benefits.
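
As a rough illustration of why a columnar layout helps with query speed and compression, here is a minimal sketch in plain Python. It is not how Teradata Columnar or any other product is implemented; the sample table, the variable names, and the choice of run-length encoding as the compression technique are all assumptions for this example.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "east", "sales": 150},
    {"id": 3, "region": "west", "sales": 200},
    {"id": 4, "region": "west", "sales": 50},
]

# Column layout: each field's values are stored together instead.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Summing sales from rows touches every field of every record...
total_from_rows = sum(row["sales"] for row in rows)
# ...while the column layout reads only the one column the query needs.
total_from_columns = sum(columns["sales"])


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


# Values within one column tend to repeat, so they compress well.
region_rle = run_length_encode(columns["region"])
```

The same table can be served either way; a hybrid of row-based and columnar tables lets each workload use the layout that suits it.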

This recurring trend raises the question: What’s the next innovation that’s on the path to DBMS assimilation? It’s obvious to me that Hadoop and MapReduce are already well down that path. And that brings us to the next Teradata announcement.

Announcement Number Two: Teradata Aster MapReduce Platform

On the upside, MapReduce is the secret sauce that brings advanced analytic capability to a big data repository, whether it’s Hadoop’s file system or a relational database management system (RDBMS). On the downside, MapReduce from most sources is mired in hand-coding and devoid of SQL (to which we’re hand-cuffed in BI). Hence, MapReduce shows great promise for the world of BI, but only if it can evolve to suit the technical requirements of BI and DW professionals.

Evolving MapReduce is what the small vendor Aster Data Systems has always been about, and the evolution continues now that Teradata has acquired Aster. First, Aster showed that MapReduce could be effective with an RDBMS – at least, with its own nCluster database, now called Aster Database 5.0. Aster then showed that MapReduce and SQL can be reconciled, and they received a patent for their innovation in this realm.

Let’s shift gears and look at data warehouse appliances. Despite the term “data warehouse” in the name, these are really “big data analytics appliances.” I say this based on the fact that at least 90% of DW appliance owners use them for multi-terabyte analytics, not data warehousing. Aster is now showing that a MapReduce-based RDBMS can be suited to an appliance, as in the new Aster MapReduce Appliance based on Teradata hardware.

I’ll say more about the evolution of MapReduce in a TDWI Webinar on October 27. Please register online and attend.

Announcement Number Three: Teradata Database 14

Most of the new functionality of Teradata Database 14 seems focused on making the system even more manageable and performant, especially in the context of multiple, diverse, concurrent data warehouse workloads.

The multiple workload problem is a thorny one. From the DW professional’s viewpoint, it’s not easy to optimize a data warehouse for several workloads; so most EDWs are optimized for a short list of workloads. Since the primary deliverables of the average DW are reports (whether standard or dashboards) and OLAP, most EDW designers consciously decide to optimize for these. But that makes it difficult to add new workloads to a centralized enterprise data warehouse, so new workloads are often distributed to marts, operational data stores, and data staging areas outside the warehouse proper. Examples of “new workloads” include those for real time, detailed source data, non-structured data, and discovery or exploratory analytics (not OLAP).

How DW professionals and vendors are responding to the challenge of multiple workloads constitutes a trend. That’s because the responses affect data warehouse architecture, logical modeling, optimization, performance, platform selection, tool selection, selection of analytic methods, management strategies for big data, and so on.

Note that the multiple workload challenge is both a user design issue and a vendor platform capability issue. Yet, I think the former can win out over the latter. A good design on a weak platform can succeed, though you’ll probably end up with a heavily distributed DW architecture. Conversely, a bad design on a strong platform can fail, especially if you expect the platform to be the design. Technology and design issues aside, I must also point out that the placement of a DW workload can be influenced by organizational issues, like sponsorship, funding, and compliance.

So, what do you think? Let me know!

===============================
Want to learn more about Big Data Analytics? Attend the TDWI Forum on Big Data Analytics for Business Insight. There's more information online.

Posted by Philip Russom, Ph.D.

It's All in the Memory: New Battleground for BI and Analytics

Where is the biggest battleground today in the business intelligence and analytics software market? On the technology front, one of the main battles is in the addressable memory space of systems that feature 64-bit computing and operating system platforms. The “in-memory” revolution is upon us, and no BI or analytics vendor wants to be left out. Large memory platforms will be critical to users working with tools for big data analytics, data discovery, data visualization, and more.

While the development of large-memory computing is not really new, it took a while for the software industry to adapt to 64-bit hardware processing and operating system platforms. Throw in the difficult learning curve for creating software to work with parallel processing, and it’s easy to see why the move from older systems has taken time. When large memory and parallel processing platforms were exotic, the slow pace of adaptation might have been acceptable. Now, with mainstream systems offering up to a terabyte of addressable memory, organizations can’t wait to try them out for BI and analytics.

Traditionally, designers of these systems have had to adjust to the limits of the I/O bottleneck. The preprocessing and design work for indexing and aggregating data has been necessary because of the performance constraints involved in getting data from disk through the I/O bottleneck. If large memory systems can ease or eliminate that constraint for the majority of users’ analysis needs, then the boundaries for analytics applications can be pushed out.
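
The trade-off can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the tiny detail table, the aggregate function, and the two questions asked of the data are all invented, and a real system would hold far more data than this.

```python
from collections import defaultdict

# Invented detail data: (region, product, sales).
detail = [
    ("east", "widget", 100),
    ("east", "gadget", 150),
    ("west", "widget", 200),
    ("west", "gadget", 50),
]


def aggregate(rows, key_index):
    """Total the sales figure (position 2) grouped by the chosen field."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


# Traditional approach: build an aggregate ahead of time for a question
# the designers anticipated ("sales by region") to avoid rereading disk.
sales_by_region = aggregate(detail, 0)

# With the detail held in memory, an unanticipated question
# ("sales by product") can simply be computed on demand.
sales_by_product = aggregate(detail, 1)
```

The precomputed aggregate answers only the question it was designed for, whereas the in-memory detail can be regrouped for whatever question comes next.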

Users can perform “data discovery,” asking questions that lead to more questions, without as much concern for what this iterative, ad hoc style of investigation might mean to overall performance. Unlike with BI reports that simply update standard views of data, users can engage in exploratory data inquiries without knowing exactly where they will end up. Large-memory systems can offer volumes of detailed data on systems deployed closer to users. With the right tools, line-of-business (LOB) decision makers can dive into the data to test predictive models and perform fine-grained analysis on their own rather than wait for IT’s specialized business analysts and statisticians to do it for them.

Data discovery vendors such as QlikTech, Tableau, and TIBCO Spotfire have prospered by jumping first to seize market opportunities. However, the biggest coming battle may be between SAP and Oracle. Earlier this year, SAP introduced HANA, which competes with Oracle’s Exadata by offering in-memory analytics along with traditional disk-based storage in an appliance. Oracle has been readying a response, which will most likely come at Oracle OpenWorld in early October and be aimed at taking in-memory capabilities for BI and analytics further. In the coming year, Oracle and SAP will battle to show which vendor is better at using analytics to increase the business value of ERP investments. In-memory capabilities will make it easier for these and other vendors to deploy rich analytics for ERP that are tailored to vertical industry and LOB requirements.

Large memory is not the whole story when it comes to the future of BI and analytics. However, it is a technology trend that users will notice firsthand through deeper, more visual, and more timely data analysis.

Posted by David Stodder

Going Mobile with BI and Analytics

On airplanes, at coffee bars, at ballgames, and even while waiting out an oil change, I am, like many of you, encountering people intensely focused on their mobile smartphones and tablets. I can’t say that I’ve been nosy enough to check out whether those I’ve seen are using the devices for business intelligence, but some – at least the fellow at the oil change shop – do seem to be working with spreadsheets and charts, not just enjoying social media or entertainment. As technology and software options evolve, there’s less and less standing in the way of people using the devices for BI. The revolution is coming.

Mobile is on my mind in part because I am working on an upcoming TDWI Best Practices report, “Mobile BI and Analytics: Extending Intelligence to a Mobile Workforce.” If you would still like to participate in the research, we would be glad to have your input. The survey is still open.

Also, I recently had a chance to talk about mobile BI on a CIO Talk Radio program dedicated to this subject. The Internet-based show is aired through Voice America Business Radio and is hosted by Sanjog Aul, vice president of Programs for the Chicago Chapter of the Society for Information Management (SIM). Also appearing on the program was Howard Dresner, chief research officer of Dresner Advisory Services, and well known for his many years as the lead analyst for BI at Gartner. Howard, of course, had a lot of interesting things to say, and I enjoyed our discussion very much. If you would like to hear the program, follow this link.

In my initial analysis of the TDWI survey results, I am seeing that senior executives currently dominate as users of mobile BI. This is expected; senior executives often are the first to try “the new toys” for data access and analysis. However, the survey shows that the #1 benefit organizations seek to achieve from implementing mobile BI and analytics is the improvement of sales, service, and support. This indicates a strong desire to put mobile BI in the hands of frontline managers and other personnel who are in daily touch with customers.

If you have experiences with mobile BI and analytics or thoughts about how you see this technology evolving, please drop me a line at [email protected].

Posted by David Stodder